Worker Health And Settlement Validation Runbook
Status
Approved
Last Approved: 2026-05-15 (HEAD=3f36e7a8)
Goal
Validate that the health and logging behavior of background workers and settlement jobs in servers_v2 reflects real business progress.
Preconditions
- worker-health and settlement-hardening changes are deployed to the target environment
- automated tests for outbox and settlement semantics are green
- logs and metrics are accessible
Validation Checklist
Worker Health
- Confirm worker containers are healthy at steady state.
- Confirm heartbeat files are updating normally.
- Confirm each API service exposes /metrics and includes rgb_service_up.
- Confirm each worker exposes Prometheus metrics on WORKER_METRICS_PORT and writes a textfile at /tmp/<worker>.prom.
- Confirm rgb_worker_up, rgb_worker_last_heartbeat_timestamp_seconds, and rgb_worker_heartbeats_total move as heartbeat files update.
- Confirm per-loop success freshness is recent for every critical loop.
- Induce or simulate a repeated failure in a critical loop.
- Confirm:
- consecutive failure logs appear
- rgb_worker_failures_total increments for the failed component
- /metrics remains reachable for at least WORKER_METRICS_FAILURE_GRACE_SECONDS after the fatal failure is reported
- /tmp/<worker>.prom contains the final rgb_worker_up=0 state for textfile collection
- readiness eventually fails when the loop exceeds its budget or freshness window
- the worker no longer remains false-green
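The textfile and heartbeat checks above can be scripted. The following is a minimal sketch, not the project's tooling: the helper names parse_samples and worker_is_fresh, the max_age_seconds threshold, and the example label worker="outbox" are assumptions for illustration. Only the metric names come from this runbook. The parser handles simple Prometheus text exposition lines and does not cover label values that contain a closing brace.

```python
import re
import time

# Matches lines such as `rgb_worker_up 1` or `rgb_worker_up{worker="outbox"} 1`.
# The label set is an assumption; this runbook does not specify labels.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_samples(text):
    """Return {metric_name: [float values]} from Prometheus text exposition."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match:
            samples.setdefault(match.group(1), []).append(float(match.group(3)))
    return samples


def worker_is_fresh(text, max_age_seconds, now=None):
    """True if every rgb_worker_up sample is 1 and the oldest heartbeat is recent."""
    now = time.time() if now is None else now
    samples = parse_samples(text)
    up = samples.get("rgb_worker_up", [0.0])  # a missing gauge counts as down
    beats = samples.get("rgb_worker_last_heartbeat_timestamp_seconds", [])
    if not beats or min(up) < 1.0:
        return False
    return now - min(beats) <= max_age_seconds
```

To use it against a live worker, read /tmp/<worker>.prom (or the body returned by /metrics) into a string and pass it to worker_is_fresh with a freshness window that matches the loop budget for that worker. After an induced fatal failure, the same call should return False.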
Outbox Publication
- Create unpublished outbox rows in a safe test environment.
- Simulate downstream publish failures.
- Confirm:
- failed rows are logged
- the iteration is surfaced as failed/degraded
- successful rows, if any, are still marked published correctly
- worker supervision reflects the failure instead of reporting a clean iteration
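The expected outbox semantics can be illustrated with a small sketch. It is not the servers_v2 implementation: publish_batch, the publish and mark_published callables, and the outcome labels "clean", "degraded", and "failed" are hypothetical names chosen for illustration. It shows the property being validated: successful rows are still marked published, failed rows are logged, and the iteration never reports clean when any publish failed.

```python
import logging

logger = logging.getLogger("outbox")


def publish_batch(rows, publish, mark_published):
    """Publish each row; return 'clean', 'degraded', or 'failed' for the iteration."""
    published = 0
    failed = 0
    for row in rows:
        try:
            publish(row)
        except Exception:
            failed += 1
            logger.exception("outbox publish failed for row %r", row)
            continue
        mark_published(row)  # only rows that were actually delivered
        published += 1
    if failed == 0:
        return "clean"  # includes the empty batch
    # Some rows failed: degraded if anything got through, failed otherwise.
    return "degraded" if published else "failed"
```

A supervisor would treat any result other than "clean" as a failed or degraded iteration when updating worker health.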
Settlement Jobs
- Run settlement in a safe environment with no eligible players.
- Confirm the result is a clean no-op, not a failure.
- Run settlement with eligible players and force all wallet credits to fail.
- Confirm the job raises and logs a full-delivery failure.
- Run settlement with mixed downstream success and failure.
- Confirm the result is visibly degraded in logs and summary output.
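The three settlement scenarios above map onto distinct outcomes. A minimal sketch of that mapping follows, assuming hypothetical names (SettlementDeliveryError, SettlementSummary, summarize_settlement) and outcome labels that are not taken from the servers_v2 code.

```python
from dataclasses import dataclass


class SettlementDeliveryError(RuntimeError):
    """Raised when eligible players exist but no wallet credit was delivered."""


@dataclass(frozen=True)
class SettlementSummary:
    eligible: int
    delivered: int
    outcome: str  # "no-op", "ok", or "degraded"


def summarize_settlement(eligible, delivered):
    """Classify a settlement run from its eligible and delivered counts."""
    if eligible == 0:
        return SettlementSummary(0, 0, "no-op")  # nothing to do is not a failure
    if delivered == 0:
        # Full-delivery failure: surface it by raising instead of returning.
        raise SettlementDeliveryError(f"0 of {eligible} wallet credits delivered")
    outcome = "ok" if delivered == eligible else "degraded"
    return SettlementSummary(eligible, delivered, outcome)
```

During validation, the log line and summary output for each run should let an operator tell these outcomes apart without inspecting wallet state directly.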
Release Gate
Do not approve production rollout if any of the following are observed:
- a worker stays healthy while a critical loop is repeatedly failing
- an outbox batch with failed publication is recorded as healthy
- a settlement batch with eligible targets and zero delivered credits reports success
- logs do not distinguish no-op, degraded, and failed settlement outcomes
Rollback Trigger
Roll back or pause the rollout if:
- worker health oscillates unexpectedly after the new checks are enabled
- successful work is incorrectly marked stale or unhealthy
- settlement jobs fail to surface real downstream failures
- new health logic causes orchestration instability without clear business justification
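One way to make "oscillates unexpectedly" measurable is to count health-state transitions over an observation window. The sketch below assumes a hypothetical list of sampled health states (True for healthy) and a max_flaps threshold. Neither the sampling method nor the threshold is defined by this runbook.

```python
def count_flaps(states):
    """Count healthy/unhealthy transitions in an ordered list of booleans."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def is_oscillating(states, max_flaps):
    """True if the worker changed health state more than max_flaps times."""
    return count_flaps(states) > max_flaps
```

A single transition to unhealthy after an induced failure is expected. Repeated transitions with no corresponding failure logs are the signal worth acting on.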