Worker Health And Settlement Hardening Spec

Status

Approved

Date

2026-04-22

Goal

Harden servers_v2 background execution so worker health and settlement outcomes reflect real business progress instead of process liveness or silent partial failure.

Scope

In scope:

  • wallet, rolling, and agent outbox publication semantics
  • worker health supervision and freshness rules
  • promotion settlement result semantics for rebate, cashback, and lossback
  • compose and health contract alignment
  • logging and operational visibility for these paths

Out of scope:

  • broad service-boundary redesign
  • new domain extraction work
  • full retry-platform redesign beyond what is needed to avoid silent success

Background

Two production-grade risks must be prevented:

  1. outbox workers can appear healthy while row-level publication keeps failing
  2. settlement jobs can appear successful while all eligible credits fail

Both failures create false-green operational signals: dashboards and health checks report success while the underlying business work is not completing.

Requirements

1. Outbox Publication Semantics

For wallet, rolling, and agent outbox pollers:

  • row-level publish exceptions must be surfaced to worker supervision
  • an iteration with failures must not be recorded as healthy
  • if all rows in the polled batch fail to publish, the iteration must fail loudly
  • if some rows publish and some fail, the iteration must still be treated as degraded and surfaced to supervision
  • in a degraded iteration, rows that did publish successfully may still be committed, but supervision must see the iteration as degraded
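
The publication rules above can be sketched as follows. This is a minimal illustration, not the servers_v2 implementation: `publish_batch`, `PublishResult`, `OutboxIterationFailed`, and `OutboxIterationDegraded` are hypothetical names, and raising an exception after the commit is only one possible way to surface a degraded iteration to supervision.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PublishResult:
    published: List[int]
    failed: List[int]


class OutboxIterationError(Exception):
    """Base class; carries the per-row result so supervision can inspect it."""

    def __init__(self, message: str, result: PublishResult) -> None:
        super().__init__(message)
        self.result = result


class OutboxIterationFailed(OutboxIterationError):
    """Every row in the polled batch failed to publish."""


class OutboxIterationDegraded(OutboxIterationError):
    """Some rows published and some failed."""


def publish_batch(
    row_ids: Iterable[int],
    publish: Callable[[int], None],
    commit: Callable[[List[int]], None],
) -> PublishResult:
    """Publish each row, commit the successes, and surface any failure."""
    result = PublishResult(published=[], failed=[])
    for row_id in row_ids:
        try:
            publish(row_id)
            result.published.append(row_id)
        except Exception:
            # Row-level failure is recorded, never swallowed silently.
            result.failed.append(row_id)

    # Rows that did publish may still be committed.
    if result.published:
        commit(result.published)

    if result.failed and not result.published:
        raise OutboxIterationFailed("all rows failed to publish", result)
    if result.failed:
        raise OutboxIterationDegraded("some rows failed to publish", result)
    return result
```

Under this sketch, a supervising loop would record a healthy iteration only when `publish_batch` returns normally; either exception counts against the loop's health.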

2. Worker Health Semantics

Worker health must use both:

  • consecutive failure budget
  • freshness of successful progress

Required behavior:

  • success freshness is tracked per critical loop
  • a loop that keeps failing beyond the configured budget becomes unhealthy
  • a loop that keeps running but has no successful progress beyond its freshness window becomes unhealthy
  • container health must depend on readiness checks that evaluate both the failure budget and the freshness window
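
One way to combine the two health inputs is a small per-loop tracker. The sketch below is illustrative only: `LoopHealth`, `failure_budget`, `freshness_window_s`, and `worker_ready` are hypothetical names, and treating the loop start time as the freshness baseline before the first success is an assumption, not something this spec states.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class LoopHealth:
    failure_budget: int              # consecutive failures tolerated
    freshness_window_s: float        # max seconds without successful progress
    clock: Callable[[], float] = time.monotonic
    consecutive_failures: int = 0
    last_success_at: Optional[float] = None
    started_at: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.last_success_at = self.clock()

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    def is_healthy(self) -> bool:
        # Rule 1: failing beyond the configured budget is unhealthy.
        if self.consecutive_failures > self.failure_budget:
            return False
        # Rule 2: no successful progress within the window is unhealthy,
        # even if the loop itself keeps running.
        baseline = (
            self.last_success_at
            if self.last_success_at is not None
            else self.started_at  # assumption: grace period from loop start
        )
        return (self.clock() - baseline) <= self.freshness_window_s


def worker_ready(loops: Dict[str, LoopHealth]) -> bool:
    """A worker is ready only if every critical loop is healthy."""
    return all(loop.is_healthy() for loop in loops.values())
```

A container health probe could then call `worker_ready` and exit non-zero when it returns `False`, so that process liveness alone is not enough to pass.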

3. Settlement Semantics

For rebate, cashback, and lossback settlement:

  • “no eligible records” is a valid clean outcome
  • “eligible records found, all credits failed” is a hard failure
  • “some credits succeeded and some failed” is a degraded outcome and must be visible in logs and monitoring
  • the settlement summary must make attempted, credited, and failed counts explicit
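
The three settlement outcomes can be derived directly from the explicit counts. The following is a minimal sketch under assumed names (`SettlementSummary`, `SettlementOutcome`, `SettlementFailed`, and `finish_settlement` are hypothetical and not taken from the codebase):

```python
from dataclasses import dataclass
from enum import Enum


class SettlementOutcome(Enum):
    CLEAN = "clean"
    DEGRADED = "degraded"
    FAILED = "failed"


class SettlementFailed(Exception):
    """Eligible records were found but every credit failed."""


@dataclass(frozen=True)
class SettlementSummary:
    attempted: int
    credited: int
    failed: int

    @property
    def outcome(self) -> SettlementOutcome:
        # No failures covers both "no eligible records" and "all credited".
        if self.failed == 0:
            return SettlementOutcome.CLEAN
        # Failures with zero successful credits is a hard failure.
        if self.credited == 0:
            return SettlementOutcome.FAILED
        return SettlementOutcome.DEGRADED


def finish_settlement(summary: SettlementSummary) -> SettlementOutcome:
    """Raise on a full-batch failure; otherwise return the outcome."""
    if summary.outcome is SettlementOutcome.FAILED:
        raise SettlementFailed(
            f"attempted={summary.attempted} credited=0 failed={summary.failed}"
        )
    return summary.outcome
```

For example, `finish_settlement(SettlementSummary(attempted=0, credited=0, failed=0))` returns `SettlementOutcome.CLEAN`, while a summary with `attempted=5, credited=0, failed=5` raises `SettlementFailed`.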

4. Observability

Required signals:

  • per-worker critical loop freshness
  • per-loop consecutive failures
  • outbox publish success/failure counts
  • settlement attempted/credited/failed counts
  • explicit logs for full-batch settlement failure
  • explicit logs for degraded partial settlement runs
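
As an illustration of the explicit-log requirement, settlement runs could be logged at a severity that matches their outcome. This sketch uses only the standard `logging` module; the logger name, message wording, and `log_settlement_run` helper are assumptions made for the example.

```python
import logging

logger = logging.getLogger("settlement")  # hypothetical logger name


def log_settlement_run(job: str, attempted: int, credited: int, failed: int) -> int:
    """Log one settlement run and return the log level that was used."""
    if failed == 0:
        level, label = logging.INFO, "settlement clean"
    elif credited == 0:
        # Full-batch failure: eligible records existed and none were credited.
        level, label = logging.ERROR, "settlement full-batch failure"
    else:
        # Partial failure: visible as a warning rather than a silent success.
        level, label = logging.WARNING, "settlement degraded"

    logger.log(
        level,
        "%s job=%s attempted=%d credited=%d failed=%d",
        label, job, attempted, credited, failed,
    )
    return level
```

Keeping the counts in every message means the same log line can feed both alerting on full-batch failures and dashboards for degraded runs.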

5. Test Requirements

Required automated coverage:

  • outbox poller failure contract tests
  • worker health freshness tests
  • scheduler/worker health integration tests
  • settlement failure semantic tests
  • compose contract tests for worker health dependencies where applicable

Acceptance Criteria

  • wallet, rolling, and agent outbox paths do not record failed publication iterations as healthy
  • worker readiness can fail when forward progress stalls even if the process is still running
  • rebate, cashback, and lossback jobs raise when all eligible credits fail
  • settlement summaries distinguish clean, degraded, and failed outcomes
  • the automated tests listed under Test Requirements exist and pass