ADR-006: Observability Baseline

Status

Accepted

Date

2026-04-27

Owners

Platform Backend
Platform Operations

Affected Services

gateway
player_service
wallet_service
game_service
rolling_service
promotion_service
agent_service
admin_service
recon_service

Related Docs

docs/services/gateway.md
docs/runbooks/worker-health/worker-health-and-settlement-validation.md
docs/specs/worker-health/2026-04-22-worker-health-and-settlement-hardening.md

Context

Production support requires every request and background loop to leave enough evidence for operators to answer what failed, where it failed, and which caller or job triggered it.

Plain text logs, missing request IDs, and worker-only heartbeat files make cross-service failures slow to diagnose.

Decision

All servers_v2 services use the shared observability baseline:

JSON loguru logs emitted to stdout or stderr
service and request_id bound into every HTTP request log context
X-Request-ID accepted from callers, generated when absent, echoed on responses, and forwarded to downstream internal services
/metrics exposed by API services in Prometheus text format
background workers expose worker health metrics and write textfile metrics
fail-open infrastructure behavior, such as rate-limit Redis failures, must be visible through counters or gauges
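
The X-Request-ID handling and log context described above can be sketched roughly as follows. This is a stdlib-only illustration, not the shared middleware itself: the ADR specifies loguru for logging, and the function names (`resolve_request_id`, `log_json`, `handle_request`), the `ts` and `message` field names, and the use of a UUID hex string as the generated ID are all assumptions made for this example.

```python
import json
import sys
import uuid
from datetime import datetime, timezone

REQUEST_ID_HEADER = "X-Request-ID"


def resolve_request_id(headers: dict[str, str]) -> str:
    """Return the caller's X-Request-ID, or generate one when absent or blank."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    # Assumption: a UUID4 hex string is an acceptable generated ID.
    return uuid.uuid4().hex


def log_json(service: str, request_id: str, message: str, stream=sys.stdout, **fields) -> str:
    """Emit one JSON log line that always carries service and request_id."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "request_id": request_id,
        "message": message,
        **fields,
    }
    line = json.dumps(record)
    print(line, file=stream)
    return line


def handle_request(
    service: str, incoming_headers: dict[str, str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (response_headers, downstream_headers) for one request."""
    request_id = resolve_request_id(incoming_headers)
    log_json(service, request_id, "request received")
    response_headers = {REQUEST_ID_HEADER: request_id}    # echoed to the caller
    downstream_headers = {REQUEST_ID_HEADER: request_id}  # forwarded to internal services
    return response_headers, downstream_headers
```

Calling `handle_request("gateway", {"x-request-id": "abc-123"})` reuses the caller's ID in both header sets, while an empty header dict produces a freshly generated ID that is shared by the response and the downstream call.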

Consequences

Positive:

request traces can be correlated across services
Prometheus can alert on API and worker health without parsing application logs
worker fatal states remain observable even after the heartbeat file is removed
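
One way a worker could keep a fatal state visible independently of its heartbeat file is to write health gauges to a Prometheus textfile. The sketch below is a minimal illustration under stated assumptions: the metric names (`worker_healthy`, `worker_fatal`, `worker_last_update_timestamp_seconds`), the `.prom` file naming, and the function `write_worker_textfile` are hypothetical and are not taken from the shared baseline.

```python
import os
import tempfile
import time


def write_worker_textfile(directory: str, worker: str, healthy: bool, fatal: bool) -> str:
    """Atomically write worker health gauges as a .prom textfile and return its path."""
    body = (
        "# TYPE worker_healthy gauge\n"
        f'worker_healthy{{worker="{worker}"}} {int(healthy)}\n'
        "# TYPE worker_fatal gauge\n"
        f'worker_fatal{{worker="{worker}"}} {int(fatal)}\n'
        "# TYPE worker_last_update_timestamp_seconds gauge\n"
        f'worker_last_update_timestamp_seconds{{worker="{worker}"}} {time.time():.0f}\n'
    )
    target = os.path.join(directory, f"{worker}.prom")
    # Write to a temp file in the same directory, then rename it into place,
    # so a collector never reads a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.replace(tmp_path, target)
    return target
```

Because the fatal gauge lives in its own file, it stays scrapeable after the heartbeat file is gone; the timestamp gauge lets an alert detect a worker that has stopped updating altogether.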

Negative:

in-process metrics are per process, so a multi-worker deployment needs either one process per service replica or a future multiprocess/sidecar aggregation design
every new service must install the shared middleware before production cutover

Constraints

Do not add service-local logging formats unless they preserve service and request_id.
Do not swallow infrastructure failures without a log and a metric.
/metrics must not count its own scrape requests in request totals.
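
The last two constraints can be illustrated with a small stdlib-only sketch. It is not the shared middleware: a real service would more likely use prometheus_client counters, and the metric names (`http_requests_total`, `fail_open_total`), the `dependency` label, and the helper functions are assumptions for illustration only.

```python
import logging
from collections import Counter

logger = logging.getLogger("observability_sketch")

METRICS_PATH = "/metrics"
request_totals: Counter = Counter()    # keyed by (method, path, status)
fail_open_totals: Counter = Counter()  # keyed by dependency name


def record_request(method: str, path: str, status: int) -> None:
    """Count a finished request, skipping Prometheus scrapes of /metrics."""
    if path == METRICS_PATH:
        return
    request_totals[(method, path, status)] += 1


def allow_request(check_limit, key: str) -> bool:
    """Fail-open rate-limit check: on backend failure, allow but log and count."""
    try:
        return bool(check_limit(key))
    except Exception:
        # A real service would catch the client's specific connection errors.
        logger.exception("rate-limit backend unavailable; failing open")
        fail_open_totals["rate_limit_redis"] += 1
        return True


def render_metrics() -> str:
    """Render both counters in Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for (method, path, status), count in sorted(request_totals.items()):
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",status="{status}"}} {count}'
        )
    lines.append("# TYPE fail_open_total counter")
    for dependency, count in sorted(fail_open_totals.items()):
        lines.append(f'fail_open_total{{dependency="{dependency}"}} {count}')
    return "\n".join(lines) + "\n"
```

Here `check_limit` stands in for whatever callable talks to Redis. When it raises, the request is still allowed, but the failure leaves both a log record and a counter increment, so the fail-open path is never silent.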

Follow-Up

Add deployment-level Prometheus scrape config before production cutover.
Decide whether API services run one process per container or adopt prometheus_client multiprocess mode.
