Worker Health And Settlement Hardening Plan

Status

In Progress

  • docs/adr/ADR-003-background-worker-health-must-reflect-forward-progress.md
  • docs/adr/ADR-004-settlement-jobs-must-fail-loudly-on-non-delivery.md
  • docs/specs/worker-health/2026-04-22-worker-health-and-settlement-hardening.md

Goal

Remove false-green behavior from servers_v2 background execution paths, that is, cases where a worker or settlement job reports healthy or successful while failing to do its work.

Phase 1: Outbox Failure Semantics

  • Review wallet, rolling, and agent outbox pollers for row-level exception handling.
  • Ensure failed row publication always surfaces an iteration failure to supervision.
  • Preserve successful published_at updates for rows that were actually published.
  • Add tests for:
    • all rows fail
    • some rows fail
    • all rows succeed
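
The failure semantics above can be sketched as follows. This is a minimal illustration, not the servers_v2 implementation: `OutboxRow`, `OutboxIterationError`, `publish_pending`, and the `publish` callable are hypothetical names, and in-memory rows stand in for the real outbox tables. Each row is published independently, `published_at` is set only for rows that were actually published, and any row failure is raised to the caller after the batch so supervision sees a failed iteration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List, Optional


@dataclass
class OutboxRow:
    # Hypothetical in-memory stand-in for an outbox table row.
    id: int
    payload: str
    published_at: Optional[datetime] = None


class OutboxIterationError(Exception):
    """Raised when at least one row failed to publish in an iteration."""

    def __init__(self, failed_ids: List[int], published_ids: List[int]):
        super().__init__(
            f"outbox iteration failed: {len(failed_ids)} failed, "
            f"{len(published_ids)} published"
        )
        self.failed_ids = failed_ids
        self.published_ids = published_ids


def publish_pending(
    rows: Iterable[OutboxRow],
    publish: Callable[[OutboxRow], None],
) -> List[int]:
    """Publish every row, then fail the iteration if any row failed."""
    published: List[int] = []
    failed: List[int] = []
    for row in rows:
        try:
            publish(row)
        except Exception:
            # Record the failure but keep going so healthy rows still publish.
            failed.append(row.id)
            continue
        # Only rows that were actually published get a timestamp.
        row.published_at = datetime.now(timezone.utc)
        published.append(row.id)
    if failed:
        # Surface the failure to the supervisor instead of swallowing it.
        raise OutboxIterationError(failed, published)
    return published
```

The three test cases listed above map directly onto this shape: all rows failing and some rows failing both raise `OutboxIterationError`, with `published_ids` empty in the first case, while all rows succeeding returns the full list of ids.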

Phase 2: Forward-Progress Health

  • Extend worker health supervision to support freshness-based readiness for critical loops.
  • Add per-loop success freshness to split workers and service-local loops where needed.
  • Make worker heartbeat publication conditional on readiness, not simple loop survival.
  • Add tests proving a loop can become unhealthy while still running if it stops making successful progress.
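
One way to picture freshness-based readiness is the sketch below. It is an assumption-laden illustration rather than the planned design: `LoopFreshness`, `maybe_publish_heartbeat`, the loop name, and the per-loop `max_age_seconds` thresholds are all hypothetical. It also assumes worker start counts as an initial success, which gives each loop a startup grace period equal to its threshold.

```python
import time
from typing import Callable, Dict, List


class LoopFreshness:
    """Tracks the last successful iteration of each critical loop."""

    def __init__(
        self,
        max_age_seconds: Dict[str, float],
        clock: Callable[[], float] = time.monotonic,
    ):
        self._max_age = dict(max_age_seconds)
        self._clock = clock
        start = clock()
        # Assumption: start time counts as a success (startup grace period).
        self._last_success = {name: start for name in self._max_age}

    def record_success(self, loop_name: str) -> None:
        """Call only after an iteration that made real forward progress."""
        if loop_name not in self._max_age:
            raise KeyError(f"unknown loop: {loop_name}")
        self._last_success[loop_name] = self._clock()

    def stale_loops(self) -> List[str]:
        """Names of loops whose last success is older than their threshold."""
        now = self._clock()
        return [
            name
            for name, max_age in self._max_age.items()
            if now - self._last_success[name] > max_age
        ]

    def is_ready(self) -> bool:
        return not self.stale_loops()


def maybe_publish_heartbeat(
    freshness: LoopFreshness,
    publish: Callable[[], None],
) -> bool:
    """Publish a heartbeat only while every critical loop is fresh."""
    if not freshness.is_ready():
        return False
    publish()
    return True
```

Because the clock is injectable, a test can advance time past a threshold without sleeping and assert that the heartbeat stops even though nothing has crashed, which is the "unhealthy while still running" case named above.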

Phase 3: Settlement Failure Semantics

  • Review rebate, cashback, and lossback settlement paths.
  • Ensure a run in which players were eligible but every credit failed raises a hard failure.
  • Ensure partial failures are explicitly visible in logs and result summaries.
  • Add tests for:
    • no eligible players
    • all eligible credits fail
    • mixed success and failure
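
The settlement semantics above could look roughly like this. The names `SettlementResult`, `SettlementNonDeliveryError`, `run_settlement`, and the `credit` callable are hypothetical and do not come from the rebate, cashback, or lossback code; the sketch only shows the three outcomes: an empty run succeeds quietly, full non-delivery raises, and partial failure is logged and carried in the result summary.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable

logger = logging.getLogger("settlement")


class SettlementNonDeliveryError(Exception):
    """Raised when players were eligible but no credit was delivered."""


@dataclass
class SettlementResult:
    eligible: int = 0
    credited: int = 0
    # player_id -> short description of why the credit failed
    failures: Dict[str, str] = field(default_factory=dict)


def run_settlement(
    player_ids: Iterable[str],
    credit: Callable[[str], None],
) -> SettlementResult:
    """Credit each eligible player and fail loudly on full non-delivery."""
    result = SettlementResult()
    for player_id in player_ids:
        result.eligible += 1
        try:
            credit(player_id)
        except Exception as exc:
            result.failures[player_id] = repr(exc)
            logger.error("credit failed for player %s: %r", player_id, exc)
            continue
        result.credited += 1

    if result.eligible > 0 and result.credited == 0:
        # Every eligible credit failed: this must not look like success.
        raise SettlementNonDeliveryError(
            f"all {result.eligible} eligible credits failed"
        )
    if result.failures:
        # Partial failure: the run completes, but the summary says so.
        logger.warning(
            "settlement partially failed: %d credited, %d failed",
            result.credited,
            len(result.failures),
        )
    return result
```

Whether a partial failure should also fail the job, rather than only being logged and summarized, is a policy choice this sketch leaves open.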

Phase 4: Operational Alignment

  • Update worker-related runbooks and local compose expectations if health semantics change.
  • Verify scheduler and worker health checks still behave correctly under intentional failure conditions.
  • Cross-review logs and alerts for clear operator diagnostics.

Done Definition

  • false-green outbox publication is no longer possible
  • false-green settlement success on full non-delivery is no longer possible
  • freshness-based readiness is documented and tested
  • worker and settlement semantics are production-visible and reviewable