ADR-006: Observability Baseline

Status

Accepted

Date

2026-04-27

Owners

  • Platform Backend
  • Platform Operations

Affected Services

  • gateway
  • player_service
  • wallet_service
  • game_service
  • rolling_service
  • promotion_service
  • agent_service
  • admin_service
  • recon_service

Related Documents

  • docs/services/gateway.md
  • docs/runbooks/worker-health/worker-health-and-settlement-validation.md
  • docs/specs/worker-health/2026-04-22-worker-health-and-settlement-hardening.md

Context

Production support requires every request and background loop to leave enough evidence for operators to answer what failed, where it failed, and which caller or job triggered it.

Plain text logs, missing request IDs, and worker-only heartbeat files make cross-service failures slow to diagnose.

Decision

All servers_v2 services use the shared observability baseline:

  • JSON loguru logs emitted to stdout or stderr
  • service and request_id bound into every HTTP request log context
  • X-Request-ID accepted from callers, generated when absent, echoed on responses, and forwarded to downstream internal services
  • /metrics exposed by API services in Prometheus text format
  • background workers expose worker health metrics and write textfile metrics
  • fail-open infrastructure behavior, such as rate-limit Redis failures, must be visible through counters or gauges
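
The request ID rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the shared middleware itself: it uses only the standard library so it stays self-contained, whereas the real services log through loguru. The function names (resolve_request_id, log_json, handle_request) and the log field layout are hypothetical.

```python
import json
import sys
import uuid
from datetime import datetime, timezone
from typing import Dict, Tuple

REQUEST_ID_HEADER = "X-Request-ID"


def resolve_request_id(headers: Dict[str, str]) -> str:
    """Return the caller's request ID, or generate one when absent or blank."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    return uuid.uuid4().hex


def log_json(service: str, request_id: str, message: str, stream=sys.stdout) -> str:
    """Emit one JSON log line with service and request_id bound in."""
    line = json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "request_id": request_id,
        "message": message,
    })
    print(line, file=stream)
    return line


def handle_request(service: str, headers: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (response_headers, downstream_headers) for one inbound request."""
    request_id = resolve_request_id(headers)
    log_json(service, request_id, "request received")
    response_headers = {REQUEST_ID_HEADER: request_id}    # echoed to the caller
    downstream_headers = {REQUEST_ID_HEADER: request_id}  # forwarded to internal services
    return response_headers, downstream_headers
```

Calling handle_request("gateway", {}) generates a fresh ID and returns it in both header sets, so the same value appears in the caller's response, the gateway's log line, and every downstream call made for that request.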

Consequences

Positive:

  • request traces can be correlated across services
  • Prometheus can alert on API and worker health without parsing application logs
  • worker fatal states remain observable even after the heartbeat file is removed
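
The last point depends on the worker recording its state as a metric rather than only touching a heartbeat file. A minimal sketch of that idea follows, assuming the textfile metrics are read by something like the node_exporter textfile collector. The metric names, the worker label, and the helper names are hypothetical and not taken from the shared baseline.

```python
import os
import tempfile
import time
from typing import Optional


def render_worker_metrics(worker: str, healthy: bool, now: Optional[float] = None) -> str:
    """Render worker health in Prometheus text format (metric names are illustrative)."""
    ts = time.time() if now is None else now
    return (
        "# HELP worker_up 1 if the worker loop is healthy, 0 after a fatal state.\n"
        "# TYPE worker_up gauge\n"
        f'worker_up{{worker="{worker}"}} {1 if healthy else 0}\n'
        "# HELP worker_last_heartbeat_timestamp_seconds Unix time of the last loop iteration.\n"
        "# TYPE worker_last_heartbeat_timestamp_seconds gauge\n"
        f'worker_last_heartbeat_timestamp_seconds{{worker="{worker}"}} {ts}\n'
    )


def write_textfile(path: str, body: str) -> None:
    """Write the metrics file atomically so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(body)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600; let the collector read it
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

On a fatal error the worker would write healthy=False once before exiting. The file then keeps reporting worker_up 0 with the last heartbeat timestamp, which gives Prometheus something to alert on after the process is gone.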

Negative:

  • in-process metrics are per process, so a multi-worker deployment needs either one process per service replica or a future multiprocess or sidecar aggregation design
  • every new service must install the shared middleware before production cutover

Constraints

  • Do not add service-local logging formats unless they preserve service and request_id.
  • Do not swallow infrastructure failures without a log and a metric.
  • /metrics must not count its own scrape requests in request totals.
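
The second and third constraints can be illustrated with a small in-process registry. This is a sketch under stated assumptions rather than the shared middleware: the metric names are hypothetical, collections.Counter stands in for a real Prometheus client, and the built-in ConnectionError stands in for whatever exception the Redis client raises.

```python
import logging
from collections import Counter

log = logging.getLogger("gateway")

REQUESTS_TOTAL = Counter()              # keyed by request path
RATE_LIMIT_FAIL_OPEN_TOTAL = Counter()  # keyed by backend name


def record_request(path: str) -> None:
    """Count a request unless it is a scrape of /metrics itself."""
    if path == "/metrics":
        return
    REQUESTS_TOTAL[path] += 1


def allow_request(check_limit, key: str) -> bool:
    """Apply the rate limit; on backend failure allow the request, but log and count it."""
    try:
        return check_limit(key)
    except ConnectionError as exc:  # stand-in for the Redis client's connection error
        RATE_LIMIT_FAIL_OPEN_TOTAL["redis"] += 1
        log.warning("rate limit backend unavailable, failing open: %s", exc)
        return True


def render_metrics() -> str:
    """Render both counters in Prometheus text format."""
    lines = ["# TYPE http_requests_total counter"]
    for path, count in sorted(REQUESTS_TOTAL.items()):
        lines.append(f'http_requests_total{{path="{path}"}} {count}')
    lines.append("# TYPE rate_limit_fail_open_total counter")
    for backend, count in sorted(RATE_LIMIT_FAIL_OPEN_TOTAL.items()):
        lines.append(f'rate_limit_fail_open_total{{backend="{backend}"}} {count}')
    return "\n".join(lines) + "\n"
```

The failure path always increments the counter and writes the log line before returning True, so a fail-open decision cannot pass silently. Skipping /metrics in record_request keeps scrape traffic out of the request totals.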

Follow-Up

  • Add deployment-level Prometheus scrape config before production cutover.
  • Decide whether API services run one process per container or adopt prometheus_client multiprocess mode.