Worker Health And Settlement Hardening Spec
Status
Approved
Date
2026-04-22
Goal
Harden servers_v2 background execution so worker health and settlement outcomes reflect
real business progress instead of process liveness or silent partial failure.
Scope
In scope:
- wallet, rolling, and agent outbox publication semantics
- worker health supervision and freshness rules
- promotion settlement result semantics for rebate, cashback, and lossback
- compose and health contract alignment
- logging and operational visibility for these paths
Out of scope:
- broad service-boundary redesign
- new domain extraction work
- full retry-platform redesign beyond what is needed to avoid silent success
Background
Two production-grade risks must be prevented:
- outbox workers can appear healthy while row-level publication keeps failing
- settlement jobs can appear successful while all eligible credits fail
Both failure modes produce false-green operational signals: monitoring reports success while the underlying business work is not completing.
Requirements
1. Outbox Publication Semantics
For wallet, rolling, and agent outbox pollers:
- row-level publish exceptions must be surfaced to worker supervision
- an iteration with failures must not be recorded as healthy
- if all rows in the polled batch fail to publish, the iteration must fail loudly: it ends with an error visible to supervision and does not return normally
- if some rows publish and some fail, the iteration must be treated as degraded and surfaced to supervision
- rows that did publish may still be committed, but committing them must not hide the degraded iteration from supervision
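The publication rules above can be sketched as a single poll iteration. This is a minimal Python sketch under stated assumptions, not the servers_v2 implementation: `publish_batch`, `OutboxIterationError`, and the `publish` and `commit_row` callables are hypothetical names introduced for illustration only.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("outbox")


class OutboxIterationError(Exception):
    """Hypothetical error raised when any row in a polled batch fails to publish."""

    def __init__(self, published: int, failed: int) -> None:
        super().__init__(
            f"outbox iteration not healthy: published={published} failed={failed}"
        )
        self.published = published
        self.failed = failed

    @property
    def all_failed(self) -> bool:
        # True when no row in the batch was published (the "fail loudly" case).
        return self.published == 0


def publish_batch(
    rows: Iterable[object],
    publish: Callable[[object], None],
    commit_row: Callable[[object], None],
) -> int:
    """Publish every row; commit the ones that succeed; raise if any failed."""
    published = 0
    failed = 0
    for row in rows:
        try:
            publish(row)
        except Exception:
            # Surface the row-level failure instead of swallowing it.
            logger.exception("outbox publish failed for row %r", row)
            failed += 1
            continue
        # Rows that published may still be committed.
        commit_row(row)
        published += 1
    if failed:
        # Raising lets the supervising loop record a failed iteration
        # rather than a healthy one, for both partial and total failure.
        raise OutboxIterationError(published, failed)
    return published
```

A supervising loop could catch `OutboxIterationError`, record a failure for that loop, and use `all_failed` to tell a degraded iteration from a total one.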
2. Worker Health Semantics
Worker health must use both:
- consecutive failure budget
- freshness of successful progress
Required behavior:
- success freshness is tracked per critical loop
- a loop that keeps failing beyond the configured budget becomes unhealthy
- a loop that keeps running but has no successful progress beyond its freshness window becomes unhealthy
- container health must depend on these readiness checks, not on process liveness alone
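One way to combine the failure budget and the freshness window is a small per-loop tracker. The sketch below is illustrative only: `LoopHealth`, `worker_ready`, and the parameter names are hypothetical, and it assumes that loop startup counts as the initial freshness baseline, which the spec does not state.

```python
import time
from typing import Callable, Dict


class LoopHealth:
    """Hypothetical health tracker for one critical loop."""

    def __init__(
        self,
        max_consecutive_failures: int,
        freshness_window_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.freshness_window_s = freshness_window_s
        self._clock = clock
        self.consecutive_failures = 0
        # Assumption: a new loop gets one full freshness window from startup.
        self.last_success_at = clock()

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.last_success_at = self._clock()

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    def is_healthy(self) -> bool:
        # Unhealthy once failures exceed the configured budget...
        within_budget = self.consecutive_failures <= self.max_consecutive_failures
        # ...or once no successful progress has happened inside the window.
        age = self._clock() - self.last_success_at
        fresh = age <= self.freshness_window_s
        return within_budget and fresh


def worker_ready(loops: Dict[str, LoopHealth]) -> bool:
    """A worker is ready only if every critical loop is healthy."""
    return all(loop.is_healthy() for loop in loops.values())
```

A readiness endpoint or health command could return a failing status whenever `worker_ready` is false, so that container health follows business progress rather than process liveness.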
3. Settlement Semantics
For rebate, cashback, and lossback settlement:
- “no eligible records” is a valid clean outcome
- “eligible records found, all credits failed” is a hard failure
- “some credits succeeded and some failed” is a degraded outcome and must be visible in logs and monitoring
- the settlement summary must make attempted, credited, and failed counts explicit
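The outcome rules above map onto a small summary type. The following is a sketch under assumptions, not the actual settlement code: `SettlementSummary`, `SettlementOutcome`, `SettlementFailedError`, and `finalize` are hypothetical names, and treating a run where every attempted credit succeeded as clean is an assumption that goes slightly beyond the "no eligible records" case named above.

```python
from dataclasses import dataclass
from enum import Enum


class SettlementOutcome(Enum):
    CLEAN = "clean"
    DEGRADED = "degraded"
    FAILED = "failed"


class SettlementFailedError(RuntimeError):
    """Hypothetical error for 'eligible records found, all credits failed'."""


@dataclass(frozen=True)
class SettlementSummary:
    job: str        # e.g. "rebate", "cashback", "lossback"
    attempted: int  # eligible records for which a credit was attempted
    credited: int   # credits that succeeded
    failed: int     # credits that failed

    def __post_init__(self) -> None:
        if self.attempted != self.credited + self.failed:
            raise ValueError("attempted must equal credited + failed")

    @property
    def outcome(self) -> SettlementOutcome:
        if self.failed == 0:
            # Covers "no eligible records" (attempted == 0); all-credited
            # runs are also treated as clean here (assumption).
            return SettlementOutcome.CLEAN
        if self.credited == 0:
            return SettlementOutcome.FAILED
        return SettlementOutcome.DEGRADED


def finalize(summary: SettlementSummary) -> SettlementSummary:
    """Raise on a hard failure; return the summary otherwise."""
    if summary.outcome is SettlementOutcome.FAILED:
        raise SettlementFailedError(
            f"{summary.job}: all {summary.attempted} eligible credits failed"
        )
    return summary
```

Keeping the three counts on the summary, and deriving the outcome from them, means a caller cannot report success without also exposing how many credits failed.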
4. Observability
Required signals:
- per-worker critical loop freshness
- per-loop consecutive failures
- outbox publish success/failure counts
- settlement attempted/credited/failed counts
- explicit logs for full-batch settlement failure
- explicit logs for degraded partial settlement runs
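As an illustration of the two explicit settlement log requirements, the sketch below picks a log level from the counts and always includes attempted, credited, and failed in the message. The logger name, message wording, level choices, and the `log_settlement_run` function are assumptions made for this example; it uses only the standard `logging` module.

```python
import logging

logger = logging.getLogger("settlement")


def log_settlement_run(job: str, attempted: int, credited: int, failed: int) -> int:
    """Log one settlement run and return the logging level that was used."""
    if failed == 0:
        level, label = logging.INFO, "settlement clean"
    elif credited == 0:
        # Full-batch failure: eligible records existed and every credit failed.
        level, label = logging.ERROR, "settlement full-batch failure"
    else:
        # Degraded: some credits succeeded and some failed.
        level, label = logging.WARNING, "settlement degraded (partial failure)"
    logger.log(
        level,
        label + ": job=%s attempted=%d credited=%d failed=%d",
        job,
        attempted,
        credited,
        failed,
    )
    return level
```

Emitting the counts as stable `key=value` fields keeps the lines easy to filter or alert on, whichever log pipeline is in use.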
5. Test Requirements
Required automated coverage:
- outbox poller failure contract tests
- worker health freshness tests
- scheduler/worker health integration tests
- settlement failure semantic tests
- compose contract tests for worker health dependencies where applicable
Acceptance Criteria
- wallet, rolling, and agent outbox paths do not record failed publication iterations as healthy
- worker readiness can fail when forward progress stalls even if the process is still running
- rebate, cashback, and lossback jobs raise when all eligible credits fail
- settlement summaries distinguish clean, degraded, and failed outcomes
- required tests exist and pass