ADR-006: Observability Baseline
Status
Accepted
Date
2026-04-27
Owners
- Platform Backend
- Platform Operations
Affected Services
- gateway
- player_service
- wallet_service
- game_service
- rolling_service
- promotion_service
- agent_service
- admin_service
- recon_service
Related Docs
- docs/services/gateway.md
- docs/runbooks/worker-health/worker-health-and-settlement-validation.md
- docs/specs/worker-health/2026-04-22-worker-health-and-settlement-hardening.md
Context
Production support requires every request and background loop to leave enough evidence for operators to answer what failed, where it failed, and which caller or job triggered it.
Plain-text logs, missing request IDs, and worker-only heartbeat files make cross-service failures slow to diagnose.
Decision
All servers_v2 services use the shared observability baseline:
- JSON loguru logs emitted to stdout or stderr
- service and request_id bound into every HTTP request log context
- X-Request-ID accepted from callers, generated when absent, echoed on responses, and forwarded to downstream internal services
- /metrics exposed by API services in Prometheus text format
- background workers expose worker health metrics and write textfile metrics
- fail-open infrastructure behavior, such as rate-limit Redis failures, must be visible through counters or gauges
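The request ID rules above can be sketched in a few lines. This is a minimal, stdlib-only illustration, not the shared middleware itself: the function names (resolve_request_id, with_request_id, json_log_line) and the uuid4 hex format for generated IDs are assumptions, and the hand-built JSON line stands in for whatever loguru configuration the real services use.

```python
import json
import uuid
from typing import Dict, Mapping, Optional

REQUEST_ID_HEADER = "X-Request-ID"


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Return the caller-supplied request ID, or generate one when absent."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    incoming = lowered.get(REQUEST_ID_HEADER.lower(), "").strip()
    # Assumption: uuid4 hex for generated IDs; the ADR does not fix a format.
    return incoming or uuid.uuid4().hex


def with_request_id(
    request_id: str, headers: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Build headers for a response echo or a downstream internal call."""
    merged = dict(headers or {})
    merged[REQUEST_ID_HEADER] = request_id
    return merged


def json_log_line(service: str, request_id: str, message: str) -> str:
    """Render one JSON log line carrying the two mandatory context fields."""
    return json.dumps(
        {"service": service, "request_id": request_id, "message": message}
    )


if __name__ == "__main__":
    rid = resolve_request_id({"x-request-id": "abc-123"})
    print(json_log_line("gateway", rid, "request received"))
    print(with_request_id(rid, {"Accept": "application/json"}))
```

The same resolved ID is used for the log context, the response echo, and the downstream call, which is what makes cross-service correlation possible.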
Consequences
Positive:
- request traces can be correlated across services
- Prometheus can alert on API and worker health without parsing application logs
- worker fatal states remain observable even after the heartbeat file is removed
Negative:
- in-process metrics are kept per process, so a multi-worker deployment needs either one process per service replica or a future multiprocess/sidecar aggregation design
- every new service must install the shared middleware before production cutover
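To picture how a worker's fatal state can stay observable once its heartbeat file is gone, here is a rough sketch of a textfile metrics writer. It assumes the textfile metrics are read by a collector that picks up .prom files (such as the node_exporter textfile collector). The metric names, the worker label, and both function names are hypothetical and not taken from the worker-health spec.

```python
import os
import tempfile


def render_worker_metrics(worker: str, fatal: bool, last_heartbeat_ts: float) -> str:
    """Render worker health gauges in Prometheus text exposition format."""
    lines = [
        "# HELP worker_fatal 1 if the worker stopped in a fatal state, else 0.",
        "# TYPE worker_fatal gauge",
        f'worker_fatal{{worker="{worker}"}} {1 if fatal else 0}',
        "# HELP worker_last_heartbeat_timestamp_seconds Unix time of the last heartbeat.",
        "# TYPE worker_last_heartbeat_timestamp_seconds gauge",
        f'worker_last_heartbeat_timestamp_seconds{{worker="{worker}"}} {last_heartbeat_ts}',
    ]
    return "\n".join(lines) + "\n"


def write_textfile(path: str, body: str) -> None:
    """Write atomically so a scraper never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so os.replace is a rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(body)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    target = os.path.join(out_dir, "settlement_worker.prom")
    write_textfile(target, render_worker_metrics("settlement", True, 1700000000.0))
    with open(target, encoding="utf-8") as handle:
        print(handle.read())
```

Because the fatal gauge is written as its own metric, it outlives the heartbeat file: the last written value remains scrapeable until something overwrites or deletes the .prom file.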
Constraints
- Do not add service-local logging formats unless they preserve service and request_id.
- Do not swallow infrastructure failures without a log and a metric.
- /metrics must not count its own scrape requests in request totals.
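Two of these constraints are easy to get wrong in practice, so a small sketch may help. It uses plain collections.Counter objects and the stdlib logging module as stand-ins for the real Prometheus counters and loguru logger. The names record_request, allow_request, request_totals, and fail_open_totals are illustrative assumptions, and a real service would likely catch a narrower exception type than Exception.

```python
import logging
from collections import Counter
from typing import Callable

logger = logging.getLogger("observability_sketch")

METRICS_PATH = "/metrics"

# Stand-ins for Prometheus counters; keys play the role of label sets.
request_totals: Counter = Counter()
fail_open_totals: Counter = Counter()


def record_request(path: str, status: int) -> None:
    """Count a finished request, skipping scrapes of the metrics endpoint."""
    if path == METRICS_PATH:
        return
    request_totals[(path, status)] += 1


def allow_request(check_limit: Callable[[], bool], dependency: str = "redis") -> bool:
    """Run a rate-limit check that fails open, leaving a log and a metric."""
    try:
        return check_limit()
    except Exception:
        # Fail open: the request proceeds, but the failure is never silent.
        logger.exception(
            "rate-limit check failed; failing open",
            extra={"dependency": dependency},
        )
        fail_open_totals[dependency] += 1
        return True


if __name__ == "__main__":
    def broken_check() -> bool:
        raise ConnectionError("rate-limit backend unreachable")

    record_request("/metrics", 200)
    record_request("/health", 200)
    print(allow_request(broken_check))
    print(dict(request_totals), dict(fail_open_totals))
```

Skipping /metrics in the counting path keeps scrape frequency from inflating request totals, and pairing the log with a counter lets operators alert on fail-open behavior without parsing logs.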
Follow-Up
- Add deployment-level Prometheus scrape config before production cutover.
- Decide whether API services run one process per container or adopt prometheus_client multiprocess mode.