ADR-003: Background Worker Health Must Reflect Forward Progress
Status
Accepted
Date
2026-04-22
Context
servers_v2 now relies on split API and worker runtimes for several domains.
Simple process liveness is not enough to determine whether a worker is healthy.
Examples of false-green behavior include:
- a worker loop that keeps catching exceptions and sleeping forever
- an outbox poller that keeps finding rows but never successfully publishes them
- a scheduler that is still running while critical jobs repeatedly fail
If a health check verifies only process survival or generic heartbeat file updates, orchestration can report a worker as healthy while real business work is stalled.
Decision
Worker health in servers_v2 must represent forward progress, not just process liveness.
Production rule:
- a worker is healthy only if every critical loop remains within its failure budget and has recent successful progress inside its freshness window
Required semantics:
- successful loop iterations refresh loop-specific success freshness
- repeated loop failures consume a failure budget
- a loop that does not make successful progress inside its freshness TTL becomes unhealthy
- worker heartbeat files must be gated by readiness checks derived from real loop health
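The semantics above could be sketched roughly as follows. This is a minimal illustration, not the servers_v2 implementation: `LoopHealth`, `WorkerHealth`, and `touch_heartbeat` are hypothetical names, the failure budget is assumed to count consecutive failures and reset on success, and worker startup is assumed to open the first freshness window.

```python
import time
from pathlib import Path
from typing import Callable, Dict


class LoopHealth:
    """Tracks forward progress for one critical loop (hypothetical helper)."""

    def __init__(self, freshness_ttl: float, failure_budget: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.freshness_ttl = freshness_ttl      # seconds a success stays fresh
        self.failure_budget = failure_budget    # consecutive failures tolerated
        self._clock = clock
        # Assumption: startup counts as the beginning of the freshness window.
        self._last_success = clock()
        self._consecutive_failures = 0

    def record_success(self) -> None:
        # A successful iteration refreshes freshness and restores the budget.
        self._last_success = self._clock()
        self._consecutive_failures = 0

    def record_failure(self) -> None:
        # A failed iteration consumes budget and does not refresh freshness.
        self._consecutive_failures += 1

    def is_healthy(self) -> bool:
        within_budget = self._consecutive_failures <= self.failure_budget
        fresh = (self._clock() - self._last_success) <= self.freshness_ttl
        return within_budget and fresh


class WorkerHealth:
    """Aggregates loop health; the worker is ready only if every loop is healthy."""

    def __init__(self, loops: Dict[str, LoopHealth]) -> None:
        self.loops = loops

    def is_ready(self) -> bool:
        return all(loop.is_healthy() for loop in self.loops.values())

    def touch_heartbeat(self, path: Path) -> bool:
        # The heartbeat file is gated by readiness: an unhealthy worker stops
        # updating it, so the file goes stale instead of reporting false green.
        if not self.is_ready():
            return False
        path.write_text(str(time.time()))
        return True
```

With this shape, a loop that keeps catching exceptions and sleeping never calls `record_success`, so it eventually fails both the budget check and the freshness check, and the heartbeat file stops being refreshed.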
Outbox-specific rule:
- if an outbox iteration sees publish failures, the iteration must surface failure to supervision
- a cycle that publishes zero rows successfully while failures occur is unhealthy work, not healthy idle
- partial publish success with failures is still a degraded iteration and must not be silently recorded as healthy
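One way to express the outbox rule is to classify each iteration from its publish counts. The sketch below is illustrative only: `IterationOutcome`, `classify`, `OutboxIterationError`, and `run_outbox_iteration` are hypothetical names, and treating an iteration with no rows as healthy idle is an assumption drawn from the wording above.

```python
from enum import Enum
from typing import Callable, Iterable


class IterationOutcome(Enum):
    IDLE = "idle"            # nothing to publish, nothing failed
    SUCCESS = "success"      # at least one row published, no failures
    DEGRADED = "degraded"    # some rows published, some failed
    FAILED = "failed"        # failures occurred and nothing was published


def classify(published: int, failed: int) -> IterationOutcome:
    if failed == 0:
        return IterationOutcome.SUCCESS if published > 0 else IterationOutcome.IDLE
    return IterationOutcome.DEGRADED if published > 0 else IterationOutcome.FAILED


class OutboxIterationError(Exception):
    """Raised so the supervising loop sees the failure (hypothetical)."""

    def __init__(self, outcome: IterationOutcome, published: int, failed: int) -> None:
        super().__init__(f"{outcome.value}: published={published} failed={failed}")
        self.outcome = outcome
        self.published = published
        self.failed = failed


def run_outbox_iteration(rows: Iterable[object],
                         publish: Callable[[object], None]) -> IterationOutcome:
    published = failed = 0
    for row in rows:
        try:
            publish(row)
            published += 1
        except Exception:
            # Keep going so one bad row does not block the rest of the batch,
            # but remember the failure so it is reported after the batch.
            failed += 1
    outcome = classify(published, failed)
    if outcome in (IterationOutcome.DEGRADED, IterationOutcome.FAILED):
        # Surface the failure instead of returning as if the cycle were healthy.
        raise OutboxIterationError(outcome, published, failed)
    return outcome
```

A supervising loop could catch `OutboxIterationError` and count it against the loop's failure budget, while recording freshness only for outcomes that return normally.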
Scheduler-specific rule:
- job registries must treat repeated critical job failures as worker unhealthiness
- a running scheduler is not enough; critical jobs must also remain healthy
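As a rough illustration of the scheduler rule, a job registry might track failures per job and report unhealthy once any critical job exceeds its budget. `JobRegistry` and its methods are hypothetical, and counting consecutive failures (reset on success) is an assumption; the source does not specify how the budget is measured.

```python
from typing import Callable, Dict, Set


class JobRegistry:
    """Registry whose health depends on critical jobs, not on the scheduler running."""

    def __init__(self, failure_budget: int) -> None:
        self.failure_budget = failure_budget
        self._jobs: Dict[str, Callable[[], None]] = {}
        self._critical: Set[str] = set()
        self._failures: Dict[str, int] = {}

    def register(self, name: str, fn: Callable[[], None], critical: bool = False) -> None:
        self._jobs[name] = fn
        self._failures[name] = 0
        if critical:
            self._critical.add(name)

    def run(self, name: str) -> bool:
        """Run one job; return True on success, False on failure."""
        try:
            self._jobs[name]()
        except Exception:
            self._failures[name] += 1
            return False
        # Assumption: a success resets the consecutive-failure count.
        self._failures[name] = 0
        return True

    def is_healthy(self) -> bool:
        # Only critical jobs affect worker health; non-critical failures do not.
        return all(self._failures[name] <= self.failure_budget
                   for name in self._critical)
```

Under this sketch the scheduler can keep ticking indefinitely, yet `is_healthy()` turns false as soon as a critical job fails more times in a row than the budget allows.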
Consequences
Positive:
- worker container health aligns with business progress
- orchestration can detect stalled publication or retry loops sooner
- production issues become observable instead of silently degrading service
Negative:
- requires loop-specific success markers and freshness bookkeeping
- may mark workers unhealthy more aggressively until failure budgets are tuned
Follow-Up
- apply these rules to wallet, rolling, promotion, agent, and future recon workers
- ensure compose and container health checks rely on worker readiness, not only process liveness
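A container health check along these lines might inspect the readiness-gated heartbeat file rather than the process table. The following is a sketch under assumptions: `heartbeat_exit_code` and `main` are hypothetical, and both the heartbeat path and the maximum age are supplied by the caller because the source does not name them.

```python
import os
import time
from typing import List, Optional


def heartbeat_exit_code(path: str, max_age_seconds: float,
                        now: Optional[float] = None) -> int:
    """Return 0 if the heartbeat file exists and is fresh, 1 otherwise."""
    current = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # No heartbeat yet: the worker has never been ready.
        return 1
    # The worker refreshes the file only while ready, so a stale file
    # means readiness was lost even if the process is still alive.
    return 0 if (current - mtime) <= max_age_seconds else 1


def main(argv: List[str]) -> int:
    # Expected arguments: <heartbeat path> <max age in seconds>
    if len(argv) != 2:
        return 2
    return heartbeat_exit_code(argv[0], float(argv[1]))
```

A health check command could then invoke a small script that calls `sys.exit(main(sys.argv[1:]))`, so a non-zero exit marks the container unhealthy once the worker stops refreshing its heartbeat.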