
ADR-003: Background Worker Health Must Reflect Forward Progress

Status

Accepted

Date

2026-04-22

Context

servers_v2 now runs the API and the workers as separate runtimes for several domains. Process liveness alone is not enough to determine whether a worker is healthy.

Examples of false-green behavior include:

  • a worker loop that keeps catching exceptions and sleeping forever
  • an outbox poller that keeps finding rows but never successfully publishes them
  • a scheduler that is still running while critical jobs repeatedly fail

If health checks only verify process survival or generic heartbeat file updates, orchestration can report a worker as healthy while real business work is stalled.

Decision

Worker health in servers_v2 must represent forward progress, not just process liveness.

Production rule:

  • a worker is healthy only if every critical loop remains within its failure budget and has recent successful progress inside its freshness window

Required semantics:

  • successful loop iterations refresh loop-specific success freshness
  • repeated loop failures consume a failure budget
  • a loop that does not make successful progress inside its freshness TTL becomes unhealthy
  • worker heartbeat files must be gated by readiness checks derived from real loop health
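
The required semantics above can be sketched as a small per-loop tracker. This is a minimal illustration, not the servers_v2 implementation: the names `LoopHealth`, `worker_ready`, and `write_heartbeat_if_ready` are hypothetical, the failure budget is assumed to count consecutive failures, and worker startup is assumed to count as fresh so a new worker gets one full TTL to make progress.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class LoopHealth:
    """Forward-progress state for one critical loop (hypothetical helper)."""

    freshness_ttl: float   # seconds a recorded success stays fresh
    failure_budget: int    # consecutive failures tolerated (assumed semantics)
    clock: Callable[[], float] = time.monotonic
    last_success: float = field(default=0.0, init=False)
    consecutive_failures: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        # Assumption: startup counts as fresh, giving the loop one TTL to succeed.
        self.last_success = self.clock()

    def record_success(self) -> None:
        # A successful iteration refreshes freshness and restores the budget.
        self.last_success = self.clock()
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        # A failed iteration consumes budget and does not refresh freshness.
        self.consecutive_failures += 1

    def is_healthy(self) -> bool:
        within_budget = self.consecutive_failures <= self.failure_budget
        fresh = (self.clock() - self.last_success) <= self.freshness_ttl
        return within_budget and fresh


def worker_ready(loops: Dict[str, LoopHealth]) -> bool:
    """The worker is ready only if every critical loop is healthy."""
    return all(loop.is_healthy() for loop in loops.values())


def write_heartbeat_if_ready(path: str, loops: Dict[str, LoopHealth]) -> bool:
    """Touch the heartbeat file only when readiness holds; report whether it was written."""
    if not worker_ready(loops):
        return False
    with open(path, "w") as handle:
        handle.write(str(time.time()))
    return True
```

With this shape, a loop that keeps catching exceptions and sleeping stops refreshing the heartbeat once its budget is spent or its TTL lapses, so an external check that looks at heartbeat age sees the stall.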

Outbox-specific rule:

  • if an outbox iteration sees publish failures, the iteration must surface failure to supervision
  • a cycle that publishes zero rows successfully while failures occur is unhealthy work, not healthy idle
  • partial publish success with failures is still a degraded iteration and must not be silently recorded as healthy
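
One way to read the outbox rule is as a classification of each iteration by its publish counts. The sketch below is illustrative only: `OutboxIterationResult`, `IterationOutcome`, `OutboxPublishError`, and both functions are hypothetical names, and treating an iteration with no rows and no failures as healthy idle is an assumption drawn from the wording above.

```python
from dataclasses import dataclass
from enum import Enum


class IterationOutcome(Enum):
    IDLE = "idle"          # nothing published, nothing failed
    HEALTHY = "healthy"    # rows published, no failures
    DEGRADED = "degraded"  # some rows published, some failed
    FAILED = "failed"      # failures occurred and nothing was published


@dataclass(frozen=True)
class OutboxIterationResult:
    published: int  # rows published successfully in this iteration
    failed: int     # rows whose publish attempt failed


class OutboxPublishError(RuntimeError):
    """Raised so the supervising loop records the iteration as a failure."""


def classify_outbox_iteration(result: OutboxIterationResult) -> IterationOutcome:
    if result.failed > 0:
        # Any publish failure rules out both "healthy" and "idle".
        if result.published == 0:
            return IterationOutcome.FAILED
        return IterationOutcome.DEGRADED
    if result.published > 0:
        return IterationOutcome.HEALTHY
    return IterationOutcome.IDLE


def surface_to_supervision(result: OutboxIterationResult) -> IterationOutcome:
    """Return the outcome for clean iterations; raise for failed or degraded ones."""
    outcome = classify_outbox_iteration(result)
    if outcome in (IterationOutcome.FAILED, IterationOutcome.DEGRADED):
        raise OutboxPublishError(
            f"outbox iteration {outcome.value}: "
            f"published={result.published} failed={result.failed}"
        )
    return outcome
```

Raising on degraded iterations is one design choice among several; a supervisor could instead accept the outcome value and charge the failure budget differently for `DEGRADED` and `FAILED`.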

Scheduler-specific rule:

  • job registries must treat repeated critical job failures as worker unhealthiness
  • a running scheduler is not enough; critical jobs must also remain healthy
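
The scheduler rule could be expressed as a job registry that tracks failures per job and folds critical jobs into worker health. This is a sketch under stated assumptions: `JobRegistry`, `JobState`, the default budget of 3, and the consecutive-failure counting are all hypothetical and not taken from servers_v2.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobState:
    critical: bool
    failure_budget: int            # consecutive failures tolerated (assumed semantics)
    consecutive_failures: int = 0


class JobRegistry:
    """Tracks per-job outcomes so scheduler health reflects critical jobs."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobState] = {}

    def register(self, name: str, *, critical: bool, failure_budget: int = 3) -> None:
        self._jobs[name] = JobState(critical=critical, failure_budget=failure_budget)

    def record_result(self, name: str, ok: bool) -> None:
        job = self._jobs[name]
        # A success resets the streak; a failure extends it.
        job.consecutive_failures = 0 if ok else job.consecutive_failures + 1

    def unhealthy_critical_jobs(self) -> List[str]:
        return [
            name
            for name, job in self._jobs.items()
            if job.critical and job.consecutive_failures > job.failure_budget
        ]

    def is_healthy(self, scheduler_running: bool) -> bool:
        # A running scheduler is necessary but not sufficient.
        return scheduler_running and not self.unhealthy_critical_jobs()
```

Non-critical jobs are still tracked here but never affect `is_healthy`, which keeps a flaky housekeeping job from marking the whole worker unhealthy.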

Consequences

Positive:

  • worker container health aligns with business progress
  • orchestration can detect stalled publication or retry loops sooner
  • production issues become observable instead of degrading silently

Negative:

  • requires loop-specific success markers and freshness bookkeeping
  • may mark workers unhealthy more aggressively until failure budgets are tuned

Follow-Up

  • apply these rules to wallet, rolling, promotion, agent, and future recon workers
  • ensure compose and container health checks rely on worker readiness, not only process liveness
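
A container health check that relies on readiness might probe the age of the gated heartbeat file instead of only checking that the process exists. The following is a hedged sketch: the default path `/tmp/worker.heartbeat`, the 30-second default, and the function names are placeholders, not values from servers_v2.

```python
import os
import sys
import time
from typing import List, Optional


def heartbeat_is_fresh(
    path: str, max_age_seconds: float, now: Optional[float] = None
) -> bool:
    """True if the heartbeat file exists and was modified recently enough."""
    try:
        modified_at = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable heartbeat is treated as unhealthy.
        return False
    current = time.time() if now is None else now
    return (current - modified_at) <= max_age_seconds


def main(argv: List[str]) -> int:
    # Placeholder defaults; a real deployment would pass its own path and max age.
    path = argv[1] if len(argv) > 1 else "/tmp/worker.heartbeat"
    max_age = float(argv[2]) if len(argv) > 2 else 30.0
    # Exit code 0 signals healthy and 1 signals unhealthy.
    return 0 if heartbeat_is_fresh(path, max_age) else 1


# When used as a probe script, the entry point would be:
#     sys.exit(main(sys.argv))
```

This only has value if the worker writes the heartbeat conditionally on loop health; an unconditionally refreshed file would reproduce the false-green behavior described in the Context section.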