Worker Health And Settlement Validation Runbook

Status

Approved

Last Approved: 2026-05-15 (HEAD=3f36e7a8)

Goal

Validate that the health signals and logs of background workers and settlement jobs in servers_v2 reflect real business progress, not just process liveness.

Preconditions

  • worker-health and settlement-hardening changes are deployed to the target environment
  • automated tests for outbox and settlement semantics are green
  • logs and metrics are accessible

Validation Checklist

Worker Health

  1. Confirm worker containers are healthy at steady state.
  2. Confirm heartbeat files are updating normally.
  3. Confirm each API service exposes /metrics and includes rgb_service_up.
  4. Confirm each worker exposes Prometheus metrics on WORKER_METRICS_PORT and writes a textfile at /tmp/<worker>.prom.
  5. Confirm rgb_worker_up, rgb_worker_last_heartbeat_timestamp_seconds, and rgb_worker_heartbeats_total move as heartbeat files update.
  6. Confirm per-loop success freshness is recent for every critical loop.
  7. Induce or simulate a repeated failure in a critical loop.
  8. Confirm:
    • consecutive failure logs appear
    • rgb_worker_failures_total increments for the failed component
    • /metrics remains reachable for at least WORKER_METRICS_FAILURE_GRACE_SECONDS after the fatal failure is reported
    • /tmp/<worker>.prom contains the final rgb_worker_up=0 state for textfile collection
    • readiness eventually fails when the loop exceeds its budget or freshness window
    • the worker no longer remains false-green (reported healthy while a critical loop is failing)
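
The metric checks in steps 4 to 8 can be scripted. The following is a minimal sketch, assuming the textfile at /tmp/<worker>.prom uses the standard Prometheus text exposition format and carries the metric names listed above. The helper names (parse_prom_text, check_worker) and the 60-second freshness budget are illustrative assumptions, not part of servers_v2.

```python
import re
import time

# Matches a sample line: metric name, optional {labels}, then the value.
# Assumption: label values do not contain a "}" character.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_prom_text(text):
    """Return {metric_name: [float, ...]} from Prometheus text exposition."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match:
            samples.setdefault(match.group(1), []).append(float(match.group(3)))
    return samples


def check_worker(text, now=None, max_heartbeat_age=60.0):
    """Return a list of problems; an empty list means the worker looks healthy."""
    now = time.time() if now is None else now
    samples = parse_prom_text(text)
    problems = []

    up = samples.get("rgb_worker_up", [])
    if not up or min(up) < 1:
        problems.append("rgb_worker_up is missing or 0")

    beats = samples.get("rgb_worker_last_heartbeat_timestamp_seconds", [])
    if not beats:
        problems.append("heartbeat timestamp is missing")
    elif now - min(beats) > max_heartbeat_age:
        problems.append("heartbeat is stale")

    return problems
```

To use it, read the contents of the worker's textfile and pass them to check_worker. After the induced failure in step 7, the returned list is expected to be non-empty once the worker has written its final state.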

Outbox Publication

  1. Create unpublished outbox rows in a safe test environment.
  2. Simulate downstream publish failures.
  3. Confirm:
    • failed rows are logged
    • the iteration is surfaced as failed/degraded
    • successful rows, if any, are still marked published correctly
    • worker supervision reflects the failure instead of reporting a clean iteration
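
The behavior this checklist expects can be sketched as follows. This is a hypothetical illustration, not the servers_v2 outbox code: IterationStatus, IterationResult, and publish_batch are invented names, and the row shape (row_id, payload) is an assumption. The point is that a per-row failure is logged and recorded, successful rows are still marked published, and the iteration status is never reported as clean when any row failed.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

log = logging.getLogger("outbox")


class IterationStatus(Enum):
    CLEAN = "clean"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass
class IterationResult:
    published_ids: list = field(default_factory=list)
    failed_ids: list = field(default_factory=list)

    @property
    def status(self):
        if not self.failed_ids:
            return IterationStatus.CLEAN      # includes the empty batch
        if self.published_ids:
            return IterationStatus.DEGRADED   # some rows went out, some did not
        return IterationStatus.FAILED         # nothing was published


def publish_batch(rows, publish):
    """Publish each (row_id, payload) pair and record failures per row."""
    result = IterationResult()
    for row_id, payload in rows:
        try:
            publish(payload)
        except Exception as exc:
            log.warning("outbox publish failed row_id=%s error=%r", row_id, exc)
            result.failed_ids.append(row_id)
        else:
            result.published_ids.append(row_id)  # only successes count as published
    return result
```

A supervisor that consumes the result would treat DEGRADED and FAILED as unhealthy iterations, which is what the last bullet above asks the validator to confirm.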

Settlement Jobs

  1. Run settlement in a safe environment with no eligible players.
  2. Confirm the result is a clean no-op, not a failure.
  3. Run settlement with eligible players and force all wallet credits to fail.
  4. Confirm the job raises and logs a full-delivery failure.
  5. Run settlement with mixed downstream success and failure.
  6. Confirm the result is visibly degraded in logs and summary output.
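
The three scenarios above map to distinct outcomes. A minimal sketch of that mapping, assuming the job can count eligible players and delivered wallet credits; the names SettlementOutcome, SettlementDeliveryError, classify_settlement, and finish_settlement are hypothetical and the summary format is illustrative only:

```python
from enum import Enum


class SettlementOutcome(Enum):
    NO_OP = "no_op"
    SUCCESS = "success"
    DEGRADED = "degraded"
    FAILED = "failed"


class SettlementDeliveryError(RuntimeError):
    """Raised when eligible players exist but no credit was delivered."""


def classify_settlement(eligible, delivered):
    """Map eligible and delivered counts to one settlement outcome."""
    if delivered < 0 or delivered > eligible:
        raise ValueError("delivered must be between 0 and eligible")
    if eligible == 0:
        return SettlementOutcome.NO_OP       # nothing to do is not a failure
    if delivered == 0:
        return SettlementOutcome.FAILED      # full-delivery failure
    if delivered < eligible:
        return SettlementOutcome.DEGRADED    # mixed success and failure
    return SettlementOutcome.SUCCESS


def finish_settlement(eligible, delivered):
    """Return a summary line, or raise when every credit failed."""
    outcome = classify_settlement(eligible, delivered)
    summary = (
        f"settlement outcome={outcome.value} "
        f"eligible={eligible} delivered={delivered}"
    )
    if outcome is SettlementOutcome.FAILED:
        raise SettlementDeliveryError(summary)
    return summary
```

Keeping the outcome in the summary line makes no-op, degraded, and failed runs distinguishable by a simple log search.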

Release Gate

Do not approve production rollout if any of the following are observed:

  • a worker stays healthy while a critical loop is repeatedly failing
  • an outbox batch with failed publication is recorded as healthy
  • a settlement batch with eligible targets and zero delivered credits reports success
  • logs do not distinguish no-op, degraded, and failed settlement outcomes
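
The first gate condition can be checked from two metric scrapes taken while a critical loop is being forced to fail. A rough sketch, assuming the scrapes are spaced further apart than the loop's failure budget; the Scrape type, the is_false_green helper, and the threshold of three new failures are assumptions made for illustration:

```python
from typing import NamedTuple


class Scrape(NamedTuple):
    up: float               # value of rgb_worker_up
    failures_total: float   # value of rgb_worker_failures_total for one component


def is_false_green(earlier, later, min_new_failures=3):
    """True if the worker still reports up after failures kept accumulating."""
    new_failures = later.failures_total - earlier.failures_total
    if new_failures < 0:
        # The counter went backwards, for example after a restart:
        # count failures from zero.
        new_failures = later.failures_total
    return later.up >= 1 and new_failures >= min_new_failures
```

A True result for any critical component corresponds to the first bullet above and should block the rollout.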

Rollback Trigger

Roll back or pause the rollout if:

  • worker health oscillates unexpectedly after the new checks are enabled
  • successful work is incorrectly marked stale or unhealthy
  • settlement jobs fail to surface real downstream failures
  • new health logic causes orchestration instability without clear business justification