Multi-Brand Counter Diagnosis Playbook

Status

Ready

Date

2026-04-28

Last Verified Commit

934d47f4 (T16-D doc sync). Counter contracts unchanged since T9-G; the LVC bump tracks the surrounding runbook updates so all multi-brand docs share the same verified commit.

Owners

  • Platform Backend
  • Wallet Domain
  • Agent Domain

Purpose

Operator-facing entry point for the multi-brand observability counters. On-call gets paged at 03:00 with no time to read source. Each section maps one counter to: what it counts, when it normally fires, alarm thresholds, what it means when it fires, diagnosis steps, likely fix, and who to escalate to.

Counters are organized by emitting service. The legacy brand_resolution_failed_total{reason} and wallet_idempotency_brand_split_total counters already have entries in the rollout runbook's Diagnosis Playbook section (docs/runbooks/multi-brand/multi-brand-isolation-rollout.md); this file picks up the brand-signature, brand-check, allow-list, callback-resolution, and admin-audit counters that lacked an entry.

Common conventions

  • "Soak window" means the same window used before the Phase 16 flip (see phase-16-hard-flip-checklist.md). Default 120 min.
  • "Observe mode" never returns end-user errors; the counter records what enforce mode WOULD have rejected.
  • "Enforce mode" hard-rejects with the v1 envelope (HTTP 200 + {"status":3,"msg":"...","data":null}) for HTTP entry paths, or raises CrossBrandRejected for wallet command paths.
  • All thresholds are per-instance unless explicitly summed.
  • Log greps use loguru's extra= fields; in production these surface as JSON keys.

wallet_brand_signature_failed_total

Emitter: servers_v2/wallet_service/app/services/brand_signature_verifier.py (_record_failure, line ~89)

Labels: caller_service, reason, mode. Reasons:

  • missing_headers:<list> -- one or more of X-Brand-Signature, X-Brand-Signature-Timestamp, X-Caller-Service, X-Request-Id was absent.
  • invalid_timestamp -- timestamp header was not parseable as int.
  • signature_mismatch -- HMAC did not match.
  • signature_replay -- a valid signature was presented twice within the 600s replay-cache window (see wallet_brand_signature_replay_blocked_total below for the alias-counter view).

What it counts One increment per inbound brand-scoped wallet write whose X-Brand-Signature envelope failed verification, tagged with the calling service, the failure reason, and the active enforcement mode.

When it normally fires

  • During Stage A/B of the per-caller signing rollout (cfd89b05): missing_headers:* is expected to be non-zero from any caller that has not yet been updated. Use the per-caller wallet_brand_signature_missing_total counter to see which caller is lagging; that is a rollout signal, not an alarm.
  • Once WALLET_BRAND_SIGNATURE_REQUIRE=on: expected steady-state rate is zero.
  • invalid_timestamp / signature_mismatch / signature_replay are always abnormal at any rollout stage.

Alarm thresholds

  • Warning: any signature_mismatch or invalid_timestamp increment in a 5-min window.
  • Page: >=10/hr of any reason other than missing_headers:* in enforce mode, OR ANY rate of signature_mismatch in production.
  • Hard incident: sustained signature_mismatch from a single caller -- treat as an active spoof attempt.

What it means when it fires An internal caller sent a wallet write whose HMAC envelope did not verify. Possible causes:

  • (a) caller was not issued BRAND_SIGNING_KEY, or has a stale value after rotation;
  • (b) clock skew >300s between caller and wallet pods (the verify_brand_assertion helper enforces +/- 300s);
  • (c) a rotated BRAND_SIGNING_KEY was not propagated to every pod before the old key was removed;
  • (d) attempted spoof from outside the trust boundary (rare; the signing key is internal).
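
The checks behind the four reasons can be sketched as follows. This is a minimal illustration, not the verifier's actual code: the canonical string layout, the comma-joined missing_headers list, the function names, and the in-memory replay dict are assumptions. Only the header names, the +/- 300s skew bound, and the 600s replay window come from this playbook. The playbook does not say which reason label a skew violation carries, so the sketch uses a clearly hypothetical one.

```python
import hashlib
import hmac
import time
from typing import Optional

REQUIRED = ("X-Brand-Signature", "X-Brand-Signature-Timestamp",
            "X-Caller-Service", "X-Request-Id")
MAX_SKEW_S = 300     # +/- 300s skew bound from this playbook
REPLAY_TTL_S = 600   # replay-cache window from this playbook
_seen = {}           # (caller, request_id) -> expiry; stands in for the Redis cache


def sign(key: bytes, caller: str, request_id: str, ts: int) -> str:
    # Assumed canonical string; the real layout is not documented here.
    msg = f"{caller}:{request_id}:{ts}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def failure_reason(headers: dict, key: bytes, now: Optional[float] = None) -> Optional[str]:
    """Return a reason string, or None when the envelope verifies."""
    now = time.time() if now is None else now
    missing = [h for h in REQUIRED if not headers.get(h)]
    if missing:
        return "missing_headers:" + ",".join(missing)  # list format is assumed
    try:
        ts = int(headers["X-Brand-Signature-Timestamp"])
    except ValueError:
        return "invalid_timestamp"
    if abs(now - ts) > MAX_SKEW_S:
        return "skew_exceeded"  # hypothetical label, not a documented reason
    caller, rid = headers["X-Caller-Service"], headers["X-Request-Id"]
    expected = sign(key, caller, rid, ts)
    if not hmac.compare_digest(expected.encode(), headers["X-Brand-Signature"].encode()):
        return "signature_mismatch"
    if _seen.get((caller, rid), 0) > now:
        return "signature_replay"
    _seen[(caller, rid)] = now + REPLAY_TTL_S
    return None
```

Note the ordering: a replay is only recorded after the HMAC has verified, which matches the description of signature_replay as a valid signature presented twice.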

Diagnosis steps

  1. kubectl logs deploy/wallet-service | grep "wallet brand-signature failure" | tail -50 -- look at caller_service, reason, mode extras.
  2. If reason=missing_headers:*: check wallet_brand_signature_missing_total{caller_service} for the same caller; if non-zero, that caller's HTTP client is not injecting the headers. Inspect the deploy of that caller.
  3. If reason=signature_mismatch: confirm both pods have the same BRAND_SIGNING_KEY version: kubectl exec deploy/<caller> -- printenv BRAND_SIGNING_KEY | sha256sum and compare to wallet pod.
  4. If reason=invalid_timestamp: check the raw header value in logs (the verifier logs the failed parse). Likely a bug in the caller's timestamp formatter.
  5. If clock skew suspected: kubectl exec into both pods and run date -u. Compare against NTP sync state.

Likely fix

  • Missing headers from a known caller: deploy the caller's HTTP client fix that injects all four headers (see rgb_contracts.infra.brand_signature.sign_brand_assertion).
  • Mismatch after rotation: re-roll the affected caller pods to pick up the new BRAND_SIGNING_KEY. See "BRAND_SIGNING_KEY rotation" in the rollout runbook.
  • Clock skew: re-sync NTP on both pods; consider widening the +/- 300s skew tolerance or the 600s replay-cache window only with security review.
  • Spoof: rotate BRAND_SIGNING_KEY immediately and audit the caller pod for compromise.

Escalation Wallet domain owner if not resolved in 15 min. Security on-call immediately on any signature_mismatch from a caller whose signing key has not been rotated recently.


wallet_brand_signature_replay_blocked_total

Emitter: same file as above. Surfaces as wallet_brand_signature_failed_total{reason="signature_replay"} -- not a separate metric name. The "replay blocked" view is the slice of that counter where reason=signature_replay.

Labels: same (caller_service, reason="signature_replay", mode).

What it counts One increment when a caller presented a previously-seen valid (caller_service, request_id) signature pair within the 600s replay-cache window. Defense-in-depth on top of HMAC + skew.

When it normally fires Steady-state rate is zero. Legitimate clients never replay a valid signature within 600s because request_id is unique per request.

Alarm thresholds

  • Warning: any single increment in observe mode (operator should understand the cause before flipping enforce).
  • Page: >=1/min sustained in enforce mode -- a caller is reusing request IDs OR a captured signature is being replayed.
  • Hard incident: sustained replay from a caller whose request_id generator is supposed to be UUID4 -- suspect signature capture.

What it means when it fires

  • (a) caller's request_id generator is not actually unique within the 600s (10-min) window (e.g. a counter- or timestamp-based id that collides);
  • (b) caller is retrying without re-signing -- the original signed envelope is being replayed by the HTTP client retry logic;
  • (c) an attacker captured a valid signed request and is replaying it (real intrusion).

Diagnosis steps

  1. kubectl logs deploy/wallet-service | grep '"reason":"signature_replay"' | tail -50 -- look at caller_service. The X-Request-Id is in the request header but not in this counter's logged extras; check the caller's logs for the same window.
  2. Inspect the caller's HTTP retry policy. If it retries on 5xx without re-signing, that is the bug -- a retried request needs a fresh timestamp + signature.
  3. Check if the request_id pattern looks like a counter or timestamp instead of UUID4 -- if so, the caller's id generator is collision-prone.
  4. Redis cache state: redis-cli GET brand_sig_seen:<caller>:<request_id>, then redis-cli TTL brand_sig_seen:<caller>:<request_id> -- if the key exists with a remaining TTL of at most 600s, that confirms the previous valid signature is still cached.

Likely fix

  • Caller retry loop: patch the caller to re-sign on each retry attempt (timestamp + signature must be regenerated). The Phase 5B cfd89b05 signing helpers do this correctly when invoked per-attempt; older hand-rolled clients may sign once and reuse.
  • Non-unique request_id: switch the caller to UUID4 generation.
  • Capture/replay attack: rotate BRAND_SIGNING_KEY immediately (replay cache will not help once the key changes), page security.
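
The retry fix can be sketched as below. This is an illustration under stated assumptions, not the Phase 5B helper: sign_attempt, call_with_retries, the send callable, and the canonical string layout are all hypothetical. The one point taken from this playbook is that each attempt regenerates the timestamp and signature instead of reusing the first envelope.

```python
import hashlib
import hmac
import time
import uuid


def sign_attempt(key: bytes, caller: str, request_id: str, ts: int) -> dict:
    # Hypothetical stand-in for the signing helper; canonical string is assumed.
    msg = f"{caller}:{request_id}:{ts}".encode()
    return {
        "X-Caller-Service": caller,
        "X-Request-Id": request_id,
        "X-Brand-Signature-Timestamp": str(ts),
        "X-Brand-Signature": hmac.new(key, msg, hashlib.sha256).hexdigest(),
    }


def call_with_retries(send, key: bytes, caller: str, attempts: int = 3, clock=time.time):
    """send(headers) -> HTTP status code. Retries on 5xx, re-signing every attempt."""
    request_id = str(uuid.uuid4())  # UUID4, generated once per logical request
    status = None
    for _ in range(attempts):
        # Fresh timestamp + signature per attempt; never reuse the first envelope.
        headers = sign_attempt(key, caller, request_id, int(clock()))
        status = send(headers)
        if status < 500:
            break
    return status
```

Whether a retried request should also carry a fresh request_id depends on how the replay cache and any downstream idempotency keys treat it; the sketch keeps it stable and only regenerates what the text names.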

Escalation Wallet domain owner if not resolved in 15 min. Security on-call immediately for sustained replay from a caller with verified-correct signing.


wallet_brand_signature_misconfigured_total (added by P1-4)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_wallet_brand_signature_misconfigured, called from servers_v2/wallet_service/app/services/brand_signature_verifier.py::verify_request_brand_signature when the verifier observes an empty BRAND_SIGNING_KEY at request time.

Labels: mode. One of observe or enforce (the off branch returns True without recording).

What it counts One increment per inbound brand-scoped wallet write that reached the verifier while BRAND_SIGNING_KEY was empty. In observe mode the write proceeds (rollout-safe); in enforce mode the write is rejected (return False -- runtime fail-closed defense-in-depth on top of the boot-time guard assert_brand_signing_key_configured in app/main.py).

When it normally fires Steady state: zero. Boot-time guard refuses to start wallet_service in the bad combination (MULTI_BRAND_ENFORCEMENT=enforce AND WALLET_BRAND_SIGNATURE_REQUIRE=on AND empty BRAND_SIGNING_KEY). Any non-zero rate means either:

  • the boot guard was bypassed (test env, weird container restart race, deploy template stripped the env var after boot);
  • the deploy template provisioned BRAND_SIGNING_KEY= (literal empty string) instead of the secret value.
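
The two layers described above can be sketched as follows. The function names and the dict-based counter are illustrative assumptions; the conditions themselves (the three-way bad combination at boot, and off / observe / enforce behaviour at request time) are the ones stated in this section.

```python
def boot_guard_refuses(enforcement: str, require: str, signing_key: str) -> bool:
    """True when wallet_service should refuse to start (the bad combination)."""
    return enforcement == "enforce" and require == "on" and signing_key == ""


def verify_with_empty_key(mode: str, counter: dict) -> bool:
    """Runtime behaviour when BRAND_SIGNING_KEY is empty at request time."""
    if mode == "off":
        return True  # off branch returns True without recording
    counter[mode] = counter.get(mode, 0) + 1  # stands in for the misconfigured counter
    return mode == "observe"  # observe: write proceeds; enforce: fail closed
```

A non-zero counter in enforce mode therefore implies the first function was never evaluated against the environment the pod is actually running with.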

Alarm thresholds

  • Page: ANY non-zero rate. Empty signing key in enforce mode means every brand-scoped wallet write from that pod is being rejected; in observe mode it means HMAC verification is silently disabled.
  • Hard incident: sustained mode="enforce" -- production wallet writes from that pod are failing; users see write errors.

What it means when it fires

  • (a) mode="enforce": the runtime fail-closed branch is firing. Verification returns False for every brand-scoped wallet write that reaches this pod, so each write is rejected. The boot guard SHOULD have prevented the pod from starting; investigate why it didn't.
  • (b) mode="observe": HMAC verification is disabled on this pod. No writes are blocked but the brand-signature defense is dead -- operator MUST fix before flipping WALLET_BRAND_SIGNATURE_REQUIRE=on (and certainly before flipping MULTI_BRAND_ENFORCEMENT=enforce).

Diagnosis steps

  1. kubectl logs deploy/wallet-service | grep "BRAND_SIGNING_KEY empty" | tail -50 -- look at mode extras. The error message includes the Vault path operators must read from.
  2. kubectl exec deploy/wallet-service -- printenv BRAND_SIGNING_KEY | wc -c -- confirm the env var is set on the pod (>1 byte expected).
  3. Compare to the deploy template / Vault binding for the affected environment.
  4. Check the boot logs for the affected pod: did the pod start cleanly or did the guard raise (in which case the pod is in CrashLoopBackOff and the runtime counter wouldn't be firing)?

Likely fix

  • Re-provision BRAND_SIGNING_KEY from the secrets manager (Vault path secret/multi-brand/brand-signing/<env>); roll the wallet pods.
  • Audit the deploy template -- if it provisioned an empty value, fix the template before the next deploy.

Escalation Wallet domain owner immediately for mode="enforce" rates. Security on-call if the empty value cannot be explained by a deploy-template bug (suspect tampering).


wallet_cross_brand_rejected_total

Emitter: servers_v2/wallet_service/app/services/brand_check.py (_record_mismatch, line ~107). Invoked from wallet_bucket_commands.py::_check_player_brand on every brand-scoped wallet write (authorize_bet, settle_bet, approve_deposit, recycle_deposit, credit_promotion, etc.).

Labels: command, mode. command is the wallet command name (authorize_bet, settle_bet, approve_deposit, decline_deposit, credit_promotion, ...). mode is observe or enforce.

What it counts One increment per wallet write where the request brand (X-Brand-Id) disagreed with the target row's brand_id. In observe mode the write proceeds using the REQUEST brand; in enforce mode the command raises CrossBrandRejected and the route returns the v1 reject envelope.

When it normally fires

  • Pre-flip soak: must reach zero before Phase 16 hard flip (gate: wallet_cross_brand_rejected_total{mode="observe"} == 0 across the soak window).
  • Post-flip steady state: zero.
  • Any increment in production traffic indicates a real brand-isolation bug or attacker probing.

Alarm thresholds

  • Warning: any non-zero mode=observe rate during the soak window.
  • Page: >=1/min of mode=enforce (real money path is rejecting).
  • Hard incident: sustained mode=enforce rejections AND user reports of failed bets/deposits.

What it means when it fires

  • (a) an internal caller is not forwarding X-Brand-Id correctly;
  • (b) the target row (player, deposit, etc.) carries the wrong brand_id due to historical backfill error;
  • (c) a deliberate cross-brand request reached wallet (rare; the edge should reject earlier);
  • (d) X-Brand-Id was stripped by a misconfigured proxy.

In observe mode the write still lands in the REQUEST brand -- this is intentional money-safety behavior so a forgotten X-Brand-Id in observe rollout does not silently dump money in the wrong brand.
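
The observe / enforce split can be sketched like this. The function signature and the dict counter are assumptions made for illustration; the real _check_player_brand may differ. The behaviour shown (count on mismatch, proceed with the request brand in observe, raise CrossBrandRejected in enforce) is the one described in this section.

```python
class CrossBrandRejected(Exception):
    """Raised in enforce mode when request brand and target brand disagree."""


def check_player_brand(command: str, request_brand: str, target_brand: str,
                       mode: str, counter: dict) -> str:
    """Return the brand the write lands in, or raise in enforce mode."""
    if request_brand == target_brand:
        return request_brand
    key = (command, mode)  # mirrors the command / mode labels
    counter[key] = counter.get(key, 0) + 1
    if mode == "enforce":
        raise CrossBrandRejected(f"{command}: {request_brand} != {target_brand}")
    return request_brand  # observe: write proceeds using the REQUEST brand
```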

Diagnosis steps

  1. kubectl logs deploy/wallet-service | grep "wallet cross-brand mismatch" | tail -50 -- look at command, request_brand, target_brand, mode.
  2. Filter by command to see if it concentrates on one operation (e.g. only approve_deposit). That narrows which caller is misbehaving.
  3. Cross-check the originating service: deposit-class commands come from admin_service or plisio callbacks; bet-class from game_service; promotion-class from promotion_service.
  4. If target_brand is consistently NULL, the row's brand_id was never backfilled -- run the Phase 12 verify script.
  5. Inspect a sample request_id: grep across services for the same request_id to trace where X-Brand-Id was dropped.

Likely fix

  • Single-caller bug: hot-fix the caller's brand-forwarding code, OR flip just that caller's MULTI_BRAND_ENFORCEMENT to observe to buy time (writes still land in request brand, no money misrouted).
  • Backfill miss: run the Phase 12 backfill verify + repair script (servers_v2/tools/multi_brand_backfill/).
  • Edge bug: revert gateway to observe; investigate brand resolution in the affected route.

Escalation Wallet domain owner immediately. Page deploy lead if mid-flip.


brand_resolution_failed_total{service="gateway"}

Emitter: servers_v2/gateway/app/middleware/brand_enforcement.py (_increment_failure, line ~89). This counter is sometimes referred to as gateway_brand_mismatch_total; the emitted counter name is brand_resolution_failed_total with service="gateway", which is the spec-canonical name used across the rollout runbook and dashboards.

Labels: reason, service="gateway". Reasons:

  • jwt_domain_mismatch -- JWT carries a brand that disagrees with the domain-resolved brand.
  • jwt_missing_brand -- JWT decoded but lacks a brand_id claim (legacy tokens; expected during the post-Phase-5 drain window).
  • unknown_domain -- JWT carries a brand but the request's domain cannot be resolved (no Redis entry for the host).

What it counts One increment per request reaching BrandEnforcementMiddleware where the JWT brand and the domain-resolved brand disagree (or either side is missing). In observe the request proceeds using the domain-resolved brand; in enforce it rejects with the v1 envelope.
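
As a reading aid, the three reasons can be expressed as a small classifier. This is a sketch only: the function name is hypothetical and the order in which the checks run is an assumption, chosen so that unknown_domain matches its definition above (the JWT carries a brand but the domain does not resolve).

```python
from typing import Optional


def resolution_failure(jwt_brand: Optional[str], domain_brand: Optional[str]) -> Optional[str]:
    """Return the reason label for a failed brand resolution, or None on agreement."""
    if jwt_brand is None:
        return "jwt_missing_brand"    # legacy token without a brand_id claim
    if domain_brand is None:
        return "unknown_domain"       # JWT has a brand, host has no mapping
    if jwt_brand != domain_brand:
        return "jwt_domain_mismatch"  # both present, but they disagree
    return None
```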

When it normally fires

  • jwt_missing_brand: non-zero is expected for 2 * JWT_EXPIRE_MIN after Phase 5 deploy. Must reach zero before Phase 16 flip.
  • jwt_domain_mismatch: should be at or near zero in steady state. A low rate from scanners / stale mobile clients is expected baseline noise.
  • unknown_domain: zero except during a Redis domain-binding rewrite or a brand onboarding mid-rollout.

Alarm thresholds

  • Warning: >=10/hr of jwt_domain_mismatch from a single user-agent class.
  • Page: >=1/min of jwt_domain_mismatch clustered on a legitimate browser user-agent and one brand domain.
  • Hard incident: unknown_domain sustained -- a brand's traffic is not resolving at the edge; users are locked out.

What it means when it fires

  • (a) Redis domain:agent:* or domain:level:* mapping was changed and the new mapping disagrees with what JWT minting saw at login time;
  • (b) refresh-token rotation is minting brand-unaware tokens post-Phase-5 (a Phase 5 implementation bug);
  • (c) attacker is replaying a JWT minted on brand A against brand B's domain;
  • (d) for unknown_domain: the LB forwarded a Host not present in any domain:agent:* entry (typically a brand-disable that forgot to keep at least one mapping, or DNS pointing at the wrong LB).

Diagnosis steps

  1. kubectl logs deploy/gateway | grep "brand resolution failed" | tail -50 -- look at reason, request_brand_id, jwt_brand_id, path, mode.
  2. If counts cluster on one source IP / user-agent: scanner or stale client; refine alert threshold.
  3. If counts cluster on a legitimate browser UA + one brand domain: redis-cli KEYS 'domain:*' -- look for a recent change to the affected brand's mapping. Check audit log for brand-status flips.
  4. If jwt_missing_brand is still non-zero >2 * JWT_EXPIRE_MIN after Phase 5 deploy: the player_service / agent_service refresh path is minting brand-unaware tokens. See Phase 5 in the isolation plan.

Likely fix

  • Baseline noise: ignore; refine the alert threshold.
  • Domain-mapping change: re-verify the Redis mapping for that domain; inspect the LB and any DNS changes; revert the mapping change if unintended.
  • Refresh-token bug: hot-fix the issuing service to reject brand-unaware refresh tokens; force re-login for affected sessions.
  • Phase 16 flip caused a real spike: revert the affected service to observe per the mid-flip revert procedure.

Escalation Platform backend on-call. Wallet domain owner if money paths are affected.


agent_brand_not_allowed_total (a.k.a. agent_brand_allow_list_rejected)

Emitter: servers_v2/agent_service/app/middleware/agent_brand_enforcement.py (_increment_failure, line ~109). Note: this counter is sometimes referred to as agent_brand_allow_list_rejected_total; the emitted metric name is agent_brand_not_allowed_total. The middleware also emits the spec-canonical brand_resolution_failed_total with the same reason for cross-service dashboards.

Labels: reason, service="agent_service". Reasons:

  • agent_brand_not_allowed -- the (agent_id, brand_id) pair is not in the enabled allow-list.
  • unresolved_brand -- the request reached the middleware but upstream brand resolution attached no brand_id.

What it counts One increment per authenticated agent_service request whose (agent_id, brand_id) does not appear in agent_brand WHERE status='enabled' (cached in-process with 60s TTL + Redis pub/sub invalidation).
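
The caching behaviour matters for diagnosis, so here is a minimal sketch of a per-process allow-list cache with a 60s TTL and explicit invalidation. The class name, the loader callable, and the injectable clock are assumptions for illustration; the real middleware's structure is not shown in this playbook.

```python
import time


class AllowListCache:
    """In-process cache of enabled (agent_id, brand_id) pairs."""

    def __init__(self, loader, ttl_s: float = 60.0, clock=time.monotonic):
        self._loader = loader  # returns the enabled pairs, e.g. from agent_brand
        self._ttl = ttl_s
        self._clock = clock
        self._pairs = set()
        self._expires = float("-inf")  # forces a load on first use

    def invalidate(self) -> None:
        # What a pub/sub listener would call on AGENT_BRAND_CHANGED.
        self._expires = float("-inf")

    def is_allowed(self, agent_id, brand_id) -> bool:
        now = self._clock()
        if now >= self._expires:
            self._pairs = set(self._loader())  # snapshot; later inserts are invisible
            self._expires = now + self._ttl
        return (agent_id, brand_id) in self._pairs
```

A row inserted just after a reload stays invisible until the TTL lapses or invalidate() runs, which is the stale-cache case this counter surfaces.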

When it normally fires

  • Pre-Phase-12 re-seed: non-zero is expected if the autoseed missed any agent.
  • Post-flip steady state: zero. A non-zero rate indicates either the allow-list cache is stale or an agent was unbound from a brand they are still trying to access.

Alarm thresholds

  • Warning: >=5/hr from a single agent_id (likely stale cache).
  • Page: >=1/min sustained in enforce mode (real agents are locked out).
  • Hard incident: a known-good agent_id appears repeatedly across multiple brands AND the agent_brand row exists -- the cache invalidation path is broken.

What it means when it fires

  • (a) agent_brand row is missing for an agent that was created without the autoseed trigger firing (Phase 12 re-seed gap);
  • (b) an admin set agent_brand.status to a non-enabled value for an agent who is still actively logged in (their JWT remains valid until expiration);
  • (c) the per-process allow-list cache is stale and a recent insert has not been picked up (TTL 60s OR AGENT_BRAND_CHANGED pub/sub message dropped);
  • (d) unresolved_brand reason: brand resolution upstream silently failed -- treated as a separate alarm class (see rollout runbook diagnosis).

Diagnosis steps

  1. kubectl logs deploy/agent-service | grep "agent brand allow-list check failed" | tail -50 -- look at agent_id, request_brand_id, path, mode, reason.
  2. SQL check: SELECT * FROM agent_brand WHERE agent_id=<id> AND brand_id=<id>; -- row missing or status != 'enabled' confirms cause (a) or (b).
  3. If row exists and is enabled: cache is stale. Force a refresh by restarting one agent_service pod, OR redis-cli PUBLISH AGENT_BRAND_CHANGED "" to invalidate every pod's cache.
  4. For unresolved_brand: see the rollout runbook's brand_resolution_failed_total{reason="unresolved_brand"} entry.

Likely fix

  • Missing seed: insert (agent_id, brand_id, 'enabled') into agent_brand via the audited admin endpoint (do NOT INSERT directly; the audit row is the operator trail).
  • Stale cache: publish AGENT_BRAND_CHANGED or rolling-restart agent_service.
  • Status mistakenly disabled: re-enable through the admin audit-tracked write.

Escalation Agent domain owner if not resolved in 15 min. Customer support if specific named agents are reporting login failures.


game_callback_brand_unresolved_total

Emitter: servers_v2/game_service/app/services/account_namespace.py::record_unresolved_callback (line ~129), invoked from servers_v2/game_service/app/services/callback_brand.py::resolve_callback_brand when the inbound namespaced account does not reverse-parse to any known brand_code prefix.

Labels: provider (one of: ho, wc, mg, bti, bt1, splus, digitain) plus optional reason for the legacy raw-fallback rejection mode added by P1-3:

  • no reason label -- the inbound namespaced account did not match any brand_code prefix. This is the original counter shape.
  • reason="raw_account_in_enforce" -- the inbound callback omitted the namespaced form and fell into the legacy raw-account lookup. Hard-rejected in enforce (Bug #3 fix); logged + counted then proceeds in observe; silent fall-through in off.
  • reason="raw_guid_in_enforce" -- same but for the legacy numeric guid lookup (e.g. BTI cust_id). guid is globally unique today, but defense-in-depth: enforce hard-rejects so an upstream caller is forced to plumb the brand through the provider envelope. Observe logs + counts then proceeds.
  • reason="raw_brand_unresolved" -- T8/T9 catch-all: the inbound callback presented a raw account/guid form (no namespaced prefix) AND the brand resolution downstream of the raw-fallback also returned None (player not found, or found but brand_id NULL). Hard rejected in enforce, DLQ'd, and a callback-shaped 200-ack returned to the provider so retries do not pile up. Distinct from the two raw_*_in_enforce labels above because those fire on the FALLBACK policy decision; this one fires when the fallback executed but the resolver still couldn't attribute the row to any brand. Operator action is identical to the other raw-fallback labels: migrate the upstream provider to namespaced form.

A non-zero rate of either reason="raw_*_in_enforce" label or reason="raw_brand_unresolved" is a hard block on the Phase 16 enforce flip (see the rollout runbook's Release Gate); operators must migrate the upstream provider to send the namespaced form before flipping.

What it counts One increment per inbound provider callback whose {brand_code}_{account} outbound account string did not match any known brand prefix, OR (P1-3) whose adapter fell into the brand-unaware legacy raw-account / raw-guid path. Reverse-parse failures are also persisted to the game_callback_dead_letter table (alembic 0030) for later replay; in enforce mode the raw-fallback rejections are DLQ'd as well so operators have an audit trail of which providers need migration.
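
The reverse-parse step can be sketched as below. This is an assumption-laden illustration: the function name, the longest-prefix-first rule, and the example brand codes in the usage note are hypothetical. Only the {brand_code}_{account} shape comes from this playbook.

```python
from typing import Optional, Set, Tuple


def reverse_parse(namespaced: str, brand_codes: Set[str]) -> Optional[Tuple[str, str]]:
    """Split '{brand_code}_{account}' into (brand_code, account), or None if unresolved."""
    # Try longer codes first so a code that itself contains '_' is not shadowed
    # by a shorter code sharing the same prefix.
    for code in sorted(brand_codes, key=len, reverse=True):
        prefix = code + "_"
        if namespaced.startswith(prefix) and len(namespaced) > len(prefix):
            return code, namespaced[len(prefix):]
    return None  # caller would count game_callback_brand_unresolved_total and DLQ
```

With hypothetical codes {"acme", "acme_eu"}, the string "acme_eu_player1" resolves to brand acme_eu, while a raw "player1" returns None, which is the unresolved path this counter records.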

When it normally fires

  • Steady state: zero.
  • Brand-onboarding window: low non-zero is expected for a few minutes after a new brand is created until the per-process reverse-parse cache TTL (60s) refreshes OR the BRAND_CATALOG_CHANGED pub/sub propagates.
  • Provider-side test traffic: known-bad-account allowlist may generate a slow trickle.

Alarm thresholds

  • Warning: >=1/min from any provider.
  • Page: >=10/min sustained from any provider (real money may be stuck on the provider side).
  • Hard incident: sustained from multiple providers simultaneously -- suspect a brand catalog corruption.

What it means when it fires

  • (a) per-process reverse-parse cache stale after a new brand was created (most common);
  • (b) provider sent a callback for a player that never existed (out-of-band test traffic);
  • (c) outbound namespacing for that provider is broken (e.g. provider truncated the account string below brand_code length);
  • (d) brand catalog corruption (a brand_code row was deleted while outbound traffic still references it).

Real money impact: callbacks that fail reverse-parse are 200-acked to the provider (per the documented policy) but written to DLQ. No double-spend risk; provider-side state may diverge until DLQ is replayed.

DLQ persistence is durable: post Codex-R2, record_dead_letter opens an independent AsyncSession from app.state.pool and commits inside that session before the resolver returns None. The row therefore survives the provider route's rejection-and-rollback path (every adapter exits its async with session_maker() as session: block without committing the caller's transaction). The only remaining loss path is a DB error during the DLQ INSERT itself -- that is still swallowed + WARN-logged so the provider response is never blocked, but the game_callback_brand_unresolved_total metric will still fire so operators see the rejection.

Diagnosis steps

  1. kubectl logs deploy/game-service | grep "game callback brand-unresolved" | tail -50 -- look at provider, namespaced_account, reason.
  2. psql -c "SELECT id, provider, outbound_account_raw, parse_failure_reason, created_at FROM game_callback_dead_letter WHERE replay_status='pending' ORDER BY created_at DESC LIMIT 20;" -- list recent DLQ rows.
  3. Confirm the brand catalog has the expected brand_code: psql -c "SELECT brand_id, brand_code, status FROM brand;".
  4. If a recent brand was added: redis-cli PUBLISH BRAND_CATALOG_CHANGED "" OR rolling-restart game_service pods to refresh per-process cache.
  5. Inspect a sample namespaced account: does it actually start with a known brand_code? If not, the outbound namespacing helper for that provider is broken.

Likely fix

  • Stale cache: publish BRAND_CATALOG_CHANGED or rolling-restart game_service.
  • Provider noise: add an allowlist for known-bad accounts at the game callback edge.
  • Outbound bug: hot-fix the affected provider's namespacing helper (account_namespace.py); replay DLQ rows after fix lands.
  • All cases: follow the "Game callback DLQ procedure" in the rollout runbook to replay or abandon affected rows.

Escalation Game domain owner if not resolved in 30 min. Provider relations if provider-side reconciliation is needed.


game_callback_verification_bypassed_total (added by T1-D-C2)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_game_callback_verification_bypassed, called from each of:

  • servers_v2/game_service/app/api/routes/ho_callback.py::_verify_ho_callback
  • servers_v2/game_service/app/api/routes/mg_callback.py::_verify_mg_callback
  • servers_v2/game_service/app/api/routes/wc_callback.py::_verify_wc_callback

every time the request reaches the if not settings.VERIFY_CALLBACKS: return ... bypass branch.

Labels: provider. One of ho, mg, wc. Other values collapse to unknown.

What it counts One increment per inbound HO / MG / WC provider callback that short-circuited signature verification because VERIFY_CALLBACKS=False. The handler then resolves the player by account, calls wallet_service to settle/credit/rollback, and returns the provider envelope -- with NO authentication of the caller. The /HO/, /mg/, /wc/ paths are listed in gateway PUBLIC_PATH_PREFIXES, so anyone on the public internet can drive the bypass branch.

When it normally fires Steady state in production: zero. The boot-time guard assert_verify_callbacks_safe_for_runtime in game_service/app/core/boot_guards.py refuses to start game_service when the runtime env resolves to a production-grade label (production / prod / staging) and VERIFY_CALLBACKS=false. Outside production-grade environments (e.g. local dev) the counter may fire when an engineer intentionally disables verification to iterate locally without provider credentials; the boot path also emits a loud WARN in that case.
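
A minimal sketch of such a boot guard follows. It is not the code in boot_guards.py: the function name is shortened, and the rule that any one of the three variables carrying a production-grade label is enough is an assumption. The variable names and labels are the ones listed in this section.

```python
PRODUCTION_LABELS = {"production", "prod", "staging"}
ENV_VARS = ("RGB_ENV", "APP_ENV", "ENVIRONMENT")


def assert_verify_callbacks_safe(env: dict, verify_callbacks: bool) -> None:
    """Raise at boot when verification is off in a production-grade runtime."""
    labels = {env.get(name, "").strip().lower() for name in ENV_VARS}
    is_production_grade = bool(labels & PRODUCTION_LABELS)  # assumed: any one label suffices
    if is_production_grade and not verify_callbacks:
        raise RuntimeError("VERIFY_CALLBACKS=false is not allowed in a production-grade runtime")
```

The sketch also shows why unset labels are dangerous: with an empty environment the guard passes even though VERIFY_CALLBACKS is false, so a production pod missing its labels starts with verification disabled.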

Alarm thresholds

  • Page immediately on ANY non-zero rate in production. This is a security incident: the public internet is currently driving unauthenticated wallet writes through this pod.
  • Page on sustained non-zero rate in staging if staging is being used to soak production-shaped traffic.
  • Dev: tracked but not paged.

What it means when it fires

  • (a) A production pod started with VERIFY_CALLBACKS=false and the boot guard did not catch it. Likely cause: the deploy template's env injection happened AFTER the boot guard ran (e.g. sidecar inject delay, wrong env var precedence between the deploy template and the cluster default), or the runtime-env labels (RGB_ENV / APP_ENV / ENVIRONMENT) are unset on the pod so the guard treated it as non-production. Either way: the public internet can currently POST arbitrary settle/credit/rollback payloads to wallet_service through this pod.
  • (b) Non-prod env intentionally has it disabled (expected during local dev). Log + counter is the documentation of that choice.

Diagnosis steps

  1. Confirm the runtime label on the affected pod: kubectl exec deploy/game-service -- printenv | grep -E '^(RGB_ENV|APP_ENV|ENVIRONMENT|VERIFY_CALLBACKS)='. In a real production pod every one of RGB_ENV / APP_ENV / ENVIRONMENT should resolve to production / prod / staging AND VERIFY_CALLBACKS=true.
  2. Check the boot logs for the WARN line (VERIFY_CALLBACKS=False detected at boot in non-production runtime): kubectl logs deploy/game-service | grep "VERIFY_CALLBACKS=False detected at boot". Its presence proves the pod considered itself non-production at boot -- which is itself the bug if the pod is actually serving prod traffic.
  3. Cross-check the gateway: is this game_service deployment behind the public gateway? If yes, treat as in-progress incident and begin containment (block the game_service backend at the gateway, or scale to zero and roll a corrected pod).
  4. Check the per-provider rate breakdown -- a single provider firing may indicate a single misrouted backend; all three firing together is a deployment-wide misconfiguration.

Likely fix

  • Hot-fix the deploy template / Helm values so VERIFY_CALLBACKS=true AND one of RGB_ENV / APP_ENV / ENVIRONMENT is set to production (or staging for staging clusters). Roll the game_service pods.
  • Audit other game_service replicas for the same misconfiguration.
  • Open an incident ticket and review wallet ledger entries since the first non-zero counter sample for any unexpected settle/credit/rollback rows; reconcile manually with the upstream provider.

Escalation Security on-call immediately for any production firing -- this is a public-internet-facing money-flow control failure, not a configuration drift. Game domain owner for the deploy-template root cause and ledger reconciliation.


admin_operator_id_mismatch_total (added by P1-β)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_admin_operator_id_mismatch (line ~253). Called from admin_service write routes (e.g. admin_service/app/api/routes/player.py) when the JWT-vs-header operator-id cross-check fails.

Labels: reason. Bounded enum:

  • header_vs_jwt -- the X-Operator-Id LB header value disagrees with the JWT admin_id claim.
  • jwt_missing_claim -- the admin JWT lacks an admin_id claim entirely (legacy token reached a write route).

What it counts One increment per admin write request that failed the operator-id JWT cross-check. The route handler MUST hard-reject with HTTP 403; this counter exists so on-call can alarm independently of the route status code.

When it normally fires Steady state: zero. The cross-check is security-sensitive and no legitimate admin flow produces either reason.

Alarm thresholds

  • Page: ANY non-zero rate. Both reasons indicate either a spoofing attempt OR an SSO/LB misconfiguration.
  • Hard incident: sustained header_vs_jwt from one source -- treat as active spoof attempt.

What it means when it fires

  • (a) header_vs_jwt: someone (likely a misbehaving LB or a compromised internal proxy) injected an X-Operator-Id header whose value does not match the authenticated admin's identity. Also fires if SSO is rotating identities mid-session.
  • (b) jwt_missing_claim: a legacy admin token (issued before the P1-β admin_id claim was required) reached a money-mutating route. The token issuer needs to be cycled.

Diagnosis steps

  1. kubectl logs deploy/admin-service | grep "admin_operator_id_mismatch\|operator-id mismatch" | tail -50 -- look at the request path, source IP, JWT admin_id, header value.
  2. If reason=header_vs_jwt: trace the source IP back through the LB. The LB header should be set ONLY by the trusted ingress; if it is set elsewhere, that is the misconfig.
  3. If reason=jwt_missing_claim: query a sample of issued admin tokens for the admin_id claim. Cycle the admin auth issuer if any current tokens lack the claim.
  4. Cross-reference against admin_audit table for the same request_id -- the route handler logs the rejection there.

Likely fix

  • Spoof / LB misconfig: revoke the spoofed admin's session, audit recent writes by that admin_id, fix the LB header injection rule.
  • Missing JWT claim: bounce the admin auth issuer to mint claim-bearing tokens; force admin re-login.
  • In both cases: review admin_audit for any rows committed before the cross-check started rejecting (those rows would carry the spoofed operator_id).

Escalation Security on-call IMMEDIATELY for any non-zero rate. Platform on-call in parallel for SSO / LB infrastructure investigation.


admin_audit_write_failed_total (added by P1-β)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_admin_audit_write_failed (line ~301). Called from servers_v2/admin_service/app/services/admin_audit.py::write_money_admin_audit (line ~152) when the post-money-write audit row INSERT fails.

Labels: route. Bounded enum: adjust_player, approve_deposit, decline_deposit, recycle_deposit, coin_agree_deposit, shooter_agree_deposit, review_agree_withdraw, review_decline_withdraw, pay_agree_withdraw, pay_decline_withdraw, agree_agent_withdraw, decline_agent_withdraw, pay_agent_withdraw. Unknown collapses to unknown.
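
The "unknown collapses to unknown" rule keeps label cardinality bounded. A minimal sketch of that normalization, assuming the helper name `bounded_route_label` (hypothetical) and exact string matching:

```python
from typing import Optional

KNOWN_ROUTES = frozenset({
    "adjust_player", "approve_deposit", "decline_deposit", "recycle_deposit",
    "coin_agree_deposit", "shooter_agree_deposit", "review_agree_withdraw",
    "review_decline_withdraw", "pay_agree_withdraw", "pay_decline_withdraw",
    "agree_agent_withdraw", "decline_agent_withdraw", "pay_agent_withdraw",
})


def bounded_route_label(route: Optional[str]) -> str:
    """Map any route name outside the enum to the single 'unknown' label value."""
    return route if route in KNOWN_ROUTES else "unknown"
```

A dashboard query grouped by route therefore never sees more than 14 series for this counter.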

What it counts One increment per money-mutating admin endpoint whose admin_audit row INSERT failed AFTER the wallet_service call already committed. The money write CANNOT be rolled back; the audit gap must be reconciled manually.

When it normally fires Steady state: zero. Failures here mean either the audit DB session crashed mid-write or the append-only triggers (alembic 0032) fired (which is itself a security event).

Alarm thresholds

  • Page: ANY non-zero rate. Money committed without audit row; on-call must reconcile.
  • Hard incident: any log line containing admin_audit is append-only (the trigger fired) -- treat as active intrusion attempt OR a runaway data-cleanup script.

What it means when it fires

  • (a) DB session error during the audit INSERT (transient: pool exhaustion, network blip, connection reset);
  • (b) the audit-side INSERT was rejected by the append-only triggers (trg_admin_audit_no_update_delete, trg_admin_audit_no_truncate) -- impossible for a fresh INSERT but possible if a misbehaving caller passed an UPDATE path;
  • (c) schema drift after migration 0032 (audit table missing a column that the helper writes);
  • (d) the append-only audit role was misconfigured to forbid INSERT (defence-in-depth role hardening misapplied).

Diagnosis steps

  1. kubectl logs deploy/admin-service | grep "admin_audit write failed AFTER money write committed" | tail -50 -- the helper logs the full traceback at ERROR. Look at the route, target_type, target_id, brand_id, operator, err.
  2. Identify the affected money write in the wallet_service ledger: psql -c "SELECT * FROM wallet.wallet_ledger WHERE brand_id=<bid> AND created_at BETWEEN '<start>' AND '<end>' ORDER BY created_at DESC LIMIT 50;".
  3. Reconstruct the audit row manually using the logged before / after payload and the route name. Insert via the audited admin endpoint OR a controlled SQL INSERT with on-call manager approval.
  4. If logs contain admin_audit is append-only: page security on-call immediately; do not retry the audit insert until the cause of the trigger fire is understood.

Likely fix

  • Transient DB error: retry the manual reconciliation INSERT after pool health recovers.
  • Trigger fire: investigate the call site -- the append-only trigger should never fire on a fresh INSERT. If it does, either schema drift or an attacker is the cause.
  • Schema drift: deploy the missing column / constraint; backfill the missed audit rows from the wallet ledger window.

Escalation Platform on-call AND wallet domain owner immediately. Audit gap is treated as a compliance incident.


wallet_brand_signature_replay_redis_outage_total (delivered P2-δ)

Emitter: servers_v2/wallet_service/app/services/brand_signature_verifier.py::_record_replay_redis_outage (line ~168), called from _claim_signature_replay_slot whenever the Redis SETNX on brand_sig_seen:{caller}:{request_id} raises.

Labels: caller_service (per-caller visibility into who was affected; truncated to 32 chars), mode (one of observe / enforce). Emitted in BOTH modes so the alarm fires even while we are still rolling out and observe-mode does not yet hard-reject.

What it counts One increment per inbound brand-scoped wallet write where the replay-cache Redis SETNX raised an exception (Redis pod down, network blip, auth expiry, etc.). The verifier logs at WARN and fails open (proceeds without the replay check) so signed wallet writes do not wedge during a Redis outage. HMAC + skew still hold, so this is a defence-in-depth degradation, not a security breach.
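
The fail-open behaviour can be sketched as follows. This is an illustration under stated assumptions, not the verifier's code: the function name, the `on_outage` callback, and the 300-second TTL are hypothetical, and `redis_client` is any object exposing a redis-py style `set(name, value, nx=..., ex=...)`.

```python
from typing import Callable


def claim_replay_slot(
    redis_client,
    caller: str,
    request_id: str,
    mode: str,
    on_outage: Callable[[str, str], None],
    ttl_seconds: int = 300,  # assumption: the real TTL is not stated in this playbook
) -> bool:
    """Return True if the write may proceed, False if the slot was already claimed."""
    key = f"brand_sig_seen:{caller}:{request_id}"
    try:
        # SET key value NX EX ttl: truthy only for the first claimant.
        claimed = redis_client.set(key, "1", nx=True, ex=ttl_seconds)
    except Exception:
        # Counter increment in BOTH modes; caller label truncated to 32 chars.
        on_outage(caller[:32], mode)
        return True  # fail open: replay check skipped, HMAC + skew still apply
    return bool(claimed)
```

The key point for on-call is the `except` branch: every increment of this counter is a write that went through without a replay check.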

When it normally fires Steady state: zero. Any non-zero rate correlates 1:1 with Redis outage windows for the wallet pod's Redis client.

Alarm thresholds

  • Warning: any non-zero rate sustained >1 min.
  • Page: >=10/min sustained -- replay defence is degraded for the whole brand-signed wallet write surface.

What it means when it fires The brand_sig_seen:* Redis keys cannot be claimed; signed wallet writes are proceeding without replay protection. HMAC + skew still hold so this is a defence-in-depth degradation, not a security breach.

Diagnosis steps

  1. Confirm the Redis incident: check the wallet_service Redis client connectivity (kubectl logs deploy/wallet-service | grep "brand-signature replay cache unavailable").
  2. Cross-reference with the wallet outbox publisher Redis health (same Redis target -- both should flap together).
  3. Filter the counter by caller_service -- a single caller spiking while others stay quiet usually means a pod-local network blip, not a Redis incident.
  4. Once Redis is back: confirm the counter rate returns to zero.

Likely fix Resolve the underlying Redis outage. Consider rotating BRAND_SIGNING_KEY if the outage window was long enough to allow captured-signature replay (rotation invalidates any captured signature).

Escalation Wallet domain owner AND platform on-call (Redis).


plisio_callback_replay_blocked_total (added by T1-D-C5/I6)

Emitter: servers_v2/wallet_service/app/api/routes/plisio.py::plisio_callback via rgb_contracts.infra.brand_observability.record_plisio_callback_replay_blocked.

Labels: status (one of the PlisioTxnStatus lower-cased values, e.g. pending, completed, expired; spaced labels like pending internal are normalized to pending_internal).

What it counts One increment per Plisio callback rejected by the (order_number, status, txn_id_first_16) Redis SETNX guard. The guard's TTL is 24h; a duplicate delivery within that window is acknowledged with the success-shaped _ok() payload (so Plisio stops retrying) but the deposit / coin_transaction state mutation is skipped.
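
A sketch of how the guard tuple and the status label could be derived. The key prefix `plisio_cb_seen`, the separators, and both function names are hypothetical; only the tuple shape, the 24h TTL, and the label normalization come from the text above.

```python
REPLAY_TTL_SECONDS = 24 * 60 * 60  # 24h guard window


def normalize_status_label(status: str) -> str:
    """Lower-case the status and join whitespace-separated words with underscores."""
    return "_".join(status.strip().lower().split())


def replay_guard_key(order_number: str, status: str, txn_id: str) -> str:
    """Build a SETNX key from (order_number, status, first 16 chars of txn_id)."""
    # Assumption: prefix and separators are illustrative, not the real key layout.
    return f"plisio_cb_seen:{order_number}:{normalize_status_label(status)}:{txn_id[:16]}"
```

Because status is part of the tuple, a legitimate pending callback followed by a completed callback for the same order produces two different keys and is not counted as a replay.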

Alarm thresholds

  • Warning: >=1/min sustained for >5 min.
  • Page: >=10/min sustained -- likely active replay attack against the webhook surface.

Diagnosis steps

  1. Pull the WARN log line: kubectl logs deploy/wallet-service | grep "Plisio callback replay blocked". The order_first_8 and status labels there match the metric labels.
  2. Cross-reference order_first_8 against the Plisio dashboard: confirm whether the merchant API shows a single delivery or multiple.
  3. If Plisio shows a single delivery and our counter shows replays, pivot to "captured signature": rotate PLISIO_SECRET immediately and audit the coin_transaction rows for the affected order_number against the Plisio source-of-truth.

Escalation Payments domain owner. Treat >=10/min sustained as a security incident.


plisio_callback_replay_redis_outage_total (added by T1-D-C5/I6)

Emitter: servers_v2/wallet_service/app/api/routes/plisio.py::_claim_plisio_callback_slot via rgb_contracts.infra.brand_observability.record_plisio_callback_replay_redis_outage.

Labels: env -- one of production (boot env resolved as production/staging/prod via rgb_contracts.infra.runtime_env.is_production_runtime_env) or non_production (everything else).

What it counts One increment per inbound Plisio callback whose replay-cache SETNX raised an exception (Redis pod down, network blip, auth expiry). Behaviour by env after the increment:

  • env="production": handler fails closed, returns HTTP 503 to Plisio so it retries the legit callback. Better to retry a legitimate delivery than re-credit a replay.
  • env="non_production": handler logs WARN and proceeds best-effort so dev/staging workflows keep moving.
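
The two branches above reduce to a small decision table. A minimal sketch, assuming case-insensitive env matching and the hypothetical names `env_label` and `outage_decision`; the real resolution lives in is_production_runtime_env.

```python
from typing import Optional, Tuple

PRODUCTION_ENVS = frozenset({"production", "staging", "prod"})


def env_label(raw_env: Optional[str]) -> str:
    """Collapse the boot env into the two bounded label values."""
    # Assumption: matching ignores case and surrounding whitespace.
    normalized = (raw_env or "").strip().lower()
    return "production" if normalized in PRODUCTION_ENVS else "non_production"


def outage_decision(raw_env: Optional[str]) -> Tuple[str, Optional[int]]:
    """Return (label, HTTP status to send); None means proceed best-effort."""
    label = env_label(raw_env)
    return (label, 503) if label == "production" else (label, None)
```

Note that staging resolves to env="production" here, so a staging Redis outage will also show up under the production label.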

Alarm thresholds

  • Warning: env="production" AND any non-zero rate sustained >1 min.
  • Page: env="production" AND >=10/min sustained -- the deposit pipeline is wedging until Redis recovers.

Diagnosis steps

  1. Confirm the Redis incident: check the wallet_service Redis client connectivity (kubectl logs deploy/wallet-service | grep "Plisio callback replay cache unavailable").
  2. Cross-reference with the wallet outbox publisher Redis health -- same Redis target, so both should be flapping together.
  3. Once Redis is back: confirm the counter rate returns to zero and Plisio's retry queue drains.

Escalation Platform on-call (Redis) AND payments domain owner.


wallet_topology_default_brand_unprimed_total (delivered P2-δ)

Emitter: servers_v2/wallet_service/app/services/wallet_topology_store.py::_record_default_brand_unprimed (line ~1080), invoked from _brand_filter_clause when a topology or policy lookup arrives with brand_id=None AND the default-brand cache has not been primed at lifespan startup (_resolve_default_brand_id).

Labels: none -- the unprimed default-brand cache is a process-wide condition, not per-command. The companion counter wallet_topology_default_brand_fallback_total{caller_origin} already gives the per-call-site breakdown when the default brand is primed.

What it counts One increment per topology / policy lookup that arrived with no caller-supplied brand_id AND found the cached default brand id unset. Because _brand_filter_clause hard-fails in that branch (post-0024c the legacy IS NULL filter matches zero rows and would silently break every downstream lookup), every increment maps 1:1 to a RuntimeError raised to the caller. The counter exists so the alarm fires before the user-visible error surfaces.
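
The 1:1 relationship between increments and RuntimeError can be sketched with a small stand-in class. `DefaultBrandCache` and its attribute names are hypothetical; only the branch logic and the error message mirror the behaviour described in this section.

```python
from typing import Optional


class DefaultBrandCache:
    """Stand-in for the process-wide default-brand cache."""

    def __init__(self) -> None:
        self.default_brand_id: Optional[int] = None  # primed at lifespan startup
        self.unprimed_total = 0  # stands in for the Prometheus counter

    def resolve(self, brand_id: Optional[int]) -> int:
        """Return the brand to filter on, hard-failing if no default is primed."""
        if brand_id is not None:
            return brand_id
        if self.default_brand_id is None:
            self.unprimed_total += 1
            raise RuntimeError("default-brand cache not primed")
        return self.default_brand_id
```

Callers that pass brand_id explicitly never touch the unprimed branch, which is why migrating call sites is listed as an alternative to fixing boot order.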

When it normally fires Steady state: zero. The lifespan handler in wallet_service/app/main.py MUST call _resolve_default_brand_id at startup; if it does, this counter never increments. Any non-zero rate means lifespan startup did not prime the cache (incomplete deploy template, broken boot order, or a test harness that bypassed lifespan).

Alarm thresholds

  • Page: ANY non-zero rate. Every increment is a hard failure on the calling wallet command.
  • Hard incident: sustained non-zero AND user-visible 5xx on wallet write paths -- the entire wallet write surface for the default brand is wedged.

What it means when it fires The wallet pod started without priming the default-brand cache. Until the cache is primed (or the calling code is updated to pass brand_id explicitly) every legacy caller without request brand context fails. This is distinct from wallet_topology_default_brand_fallback_total, which counts successful default-brand fallbacks (cache primed, brand resolved).

Diagnosis steps

  1. kubectl logs deploy/wallet-service | grep "default-brand cache" -- check whether the lifespan handler logged the prime call.
  2. Restart the affected wallet pod; the lifespan handler should prime the cache on boot. Re-check the counter -- if it stays non-zero, the lifespan code path is broken.
  3. Cross-check the affected request: _brand_filter_clause raises RuntimeError("default-brand cache not primed"), so the calling command will surface the error to the user.

Likely fix

  • Boot-order regression: confirm app/main.py lifespan calls _resolve_default_brand_id before the first inbound request.
  • Test harness leaking into prod: ensure no test-only entry point is mounted on the production app.
  • Migration gap: confirm migration 0023_seed_default_brand.py ran and the brand table has at least one row with the documented default brand_code.

Escalation Wallet domain owner immediately. Page deploy lead if a recent deploy correlates with the spike.


recovery_sms_delivery_failed_total (added by Codex P1-#6)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/recovery_brand.py::record_recovery_sms_delivery_failed, called from _send_recovery_code in servers_v2/player_service/app/api/routes/recovery.py and servers_v2/agent_service/app/api/routes/recovery.py whenever the SMS path returns not-sent OR raises.

Labels: service. One of player_service / agent_service.

What it counts One increment per /recovery/request whose SMS branch could not deliver -- either deliver_sms returned False (provider not configured / provider replied not-sent) or the call raised. The user-visible response is unchanged (canonical "if account exists" envelope) so the SMS failure does not become an existence oracle; this counter is the ONLY operator-visible signal.

When it normally fires A small steady rate is expected when the SMS provider is partially configured (e.g. dev environment). Production should be at or near zero. Any sustained non-zero rate against service="player_service" or service="agent_service" means real users are not receiving recovery codes.

Alarm thresholds

  • Warning: >5 per minute per service for 5+ minutes (provider degraded).
  • Page: >50 per minute per service (provider down -- recovery flow effectively offline).

What it means when it fires

  • (a) service="player_service": real player password recoveries are silently failing. The audit row still lands as REQUEST / success because the player WAS matched; only the SMS side channel failed.
  • (b) service="agent_service": same, for agent recoveries.

Diagnosis steps

  1. Check the SMS provider status (CoolSMS / BulkSMS / iCode dashboard per the configured platform).
  2. Search recent logs for recovery sms returned not-sent or recovery sms delivery exception lines (these now carry only contact_hash=<...> -- no PII).
  3. Cross-reference the recovery_audit table: rows with action='REQUEST', outcome='success' against the affected window are the requests that hit the SMS path.
  4. Confirm the SMS provider env vars have not been rotated / stripped on the affected pod.

Likely fix Restore the SMS provider credentials, retry. There is no automatic retry of the SMS itself -- the user has to re-issue /recovery/request.

Escalation Player-platform owner if the provider is healthy but the counter keeps firing (likely an issue with our deliver_sms integration rather than the provider).


recovery_email_unprovisioned_total (added by Codex P1-#6)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/recovery_brand.py::record_recovery_email_unprovisioned, called from _send_recovery_code in servers_v2/player_service/app/api/routes/recovery.py and servers_v2/agent_service/app/api/routes/recovery.py whenever the email path is selected.

Labels: service. One of player_service / agent_service.

What it counts One increment per /recovery/request with contact_type='email'. There is no transactional email provider in v2 yet, so this counter is the operator-visible signal that the pending-impl gap is being exercised by real traffic. Previously the route logged addr=<raw> code=<raw> to compensate -- that PII / OTP leakage was removed in P1-#6 and replaced with this counter + a hashed log line.

When it normally fires Background rate in any env that allows email recovery on the front end. Because there is no provider wired in, every increment is effectively a "user did not receive a recovery code" event.

Alarm thresholds

  • Warning: any sustained non-zero rate. The frontend should be hiding the email-recovery option when no provider is wired.
  • Page: persistent >1/min for >24h (means the gap is actively affecting users in the field).

What it means when it fires The email recovery code was generated, the audit row landed, the short-code key was stashed in Redis -- but no email left the service. The user will time out on the recovery flow.

Diagnosis steps

  1. Confirm the provider gap is still real: grep for any email-bearing SMTP / SES integration in servers_v2/.
  2. Check the frontend flow for the affected service to see how often the email recovery option is being offered.
  3. Cross-reference recovery_audit for rows with metadata->>'channel' = 'email'.

Likely fix Provision a transactional email provider (SES / Postmark / similar) and wire it into rgb_contracts.infra as a sibling of sms.py. Until then: hide the email-recovery option in the frontend or route those flows back to SMS.

Escalation Player-platform owner. Track as a recovery-flow follow-up to P1-#6.


gateway_session_legacy_key_used_total (retired)

This counter and the gateway legacy session dual-read branch were removed by the Phase 4E dual-read sunset cleanup. Gateway now reads only user-session:brand:{brand_id}:{account} and no longer accepts the pre-Phase-4E user-session-{account} key.

Use gateway_jwt_session_missing_total for the retained JWT-valid-but-session-gone signal. If old dashboards still query gateway_session_legacy_key_used_total, delete or archive those panels; they refer to the pre-sunset transition path.

Diagnosis steps

  1. Confirm the request has a resolved brand: check brand_resolution_failed_total{service="gateway"} and gateway logs for domain-resolution failures.
  2. Confirm the player has an active brand-prefixed Redis row: redis-cli --scan --pattern 'user-session:brand:*:<account>'.
  3. If the JWT is valid but the session key is absent, treat it as an expired or evicted session and use the normal login/session incident path keyed by gateway_jwt_session_missing_total.

Escalation Platform backend on-call.


gateway_jwt_session_missing_total (added by T1-D-C1)

Emitter: servers_v2/gateway/app/middleware/auth.py::_record_jwt_session_missing, called from JWTAuthMiddleware._try_attach_auth_state whenever a JWT verifies successfully but the brand-prefixed Redis session key does not return a matching session id.

Labels: service="gateway". No reason label -- the upstream brand_resolution_failed_total covers brand-mismatch cases. This counter is a single dashboard signal for "JWT looked good, but session was gone."

What it counts One increment per request where:

  • the JWT signature, kid, alg, sub, and required claims all validated, AND
  • the brand-prefixed Redis lookup returned None or did not match the JWT-bound session_id.

The user sees the v1 AUTH_REQUIRED envelope -- the same shape they would see for a bad/expired JWT -- but operationally this is a different class of failure that needs a different response.
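
The second condition can be sketched against a plain mapping standing in for Redis. The function names are hypothetical; the key format is the brand-prefixed one documented in this playbook.

```python
from typing import Mapping, Optional


def session_key(brand_id: int, account: str) -> str:
    """Brand-prefixed session key read by the gateway."""
    return f"user-session:brand:{brand_id}:{account}"


def jwt_session_missing(
    store: Mapping[str, str],
    brand_id: int,
    account: str,
    jwt_session_id: str,
) -> bool:
    """True when a verified JWT has no matching session row (one counter increment)."""
    stored: Optional[str] = store.get(session_key(brand_id, account))
    return stored is None or stored != jwt_session_id
```

A row stored under a different brand_id is indistinguishable from a missing row in this check, which is why cross-brand desync shows up in this counter.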

When it normally fires

  • Steady state: a small baseline rate is expected from session-revoke flows (admin force-logout, user "log out other devices") and from JWT_EXPIRE_MIN slop where the JWT is still in TTL but Redis has already expired the session row.
  • Right after a Redis flush / failover: a brief spike is expected as every active client hits a missing session.
  • After a MULTI_BRAND_ENFORCEMENT flip from observe to enforce: may briefly fire if the flip exposes brand-mismatched lookups that were silently routed to the wrong brand under observe.

Alarm thresholds

  • Warning: a sustained rate >10x the historical baseline for the affected pod.
  • Page: >1/sec sustained -- many users are seeing AUTH_REQUIRED while their JWTs are still valid. Almost certainly a Redis-side incident or a cutover regression.
  • Hard incident: this counter spikes to match the gateway's total request rate -- the entire authenticated surface is locked out.

What it means when it fires

  • (a) Redis LRU eviction pressure dropped session rows before JWT_EXPIRE_MIN elapsed -- check Redis memory utilization.
  • (b) Redis failover dropped non-replicated keys (sessions are not AOF-persisted in some envs); affected users see one AUTH_REQUIRED and have to log in again.
  • (c) An admin / user-driven session revoke ran for the affected account; expected.
  • (d) The brand-prefixed key was written under a different brand_id than the gateway is reading (cross-brand desync) -- in which case brand_resolution_failed_total should also be firing for the same requests.
  • (e) player_service failed to write the session row at login (rare; the route fails the login response in that case so the user never gets a JWT to start with).

Diagnosis steps

  1. redis-cli INFO memory -- check evicted_keys rate against the counter rate. A correlation suggests cause (a).
  2. redis-cli --scan --pattern 'user-session:brand:*' | wc -l and compare to the gateway's recently-active player count -- if the session pool is thin, eviction pressure or failover loss is likely.
  3. kubectl logs deploy/gateway | grep "JWT session lookup failed" -- if this is firing too, the lookup raised before missing was recorded; suggests a Redis client bug or a connectivity blip.
  4. Sample a single affected request_id end-to-end: gateway logs -> player_service login logs from the matching JWT issue time. If no login log exists at all, the JWT is stale or forged (different incident class).

Likely fix

  • Redis eviction pressure: scale Redis up, or lower JWT_EXPIRE_MIN so TTL aligns with capacity.
  • Redis failover loss: enable AOF persistence on the session Redis, OR accept the one-time relogin and document.
  • Cross-brand desync: see brand_resolution_failed_total{reason="jwt_domain_mismatch"} for the canonical playbook.
  • Cutover regression: if this counter spikes immediately after a gateway or player_service deploy, revert and investigate the session-key contract before re-rolling.

Escalation Platform backend on-call. Page Redis owner if the rate correlates with evicted_keys or failover events.


gateway_jwt_unknown_kid_total (added by T4-D-I4)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_gateway_jwt_unknown_kid, called from JWTAuthMiddleware._try_attach_auth_state whenever the underlying JwtVerifier.decode raises JwtVerificationError("unknown kid ...").

Labels: service="gateway". Bounded.

What it counts One increment per request whose Authorization carries a JWT whose kid header is not present in the active JWT_PUBLIC_KEYS map. The user sees the v1 AUTH_REQUIRED envelope.

When it normally fires

  • Mid-rotation: tokens minted under the new kid arrive at gateway pods whose verifier cache still carries only the old key set. Spike drains as soon as the operator publishes the jwt_public_keys_changed Redis channel and the per-pod listener invalidates the cache.
  • Right after the operator drops the old kid from JWT_PUBLIC_KEYS but before all clients refresh -- legacy still-valid-by-TTL tokens fail.

Alarm thresholds

  • Warning: any sustained non-zero rate that lasts longer than the publish→rebuild latency budget (~5 s in healthy clusters).
  • Page: sustained non-zero rate AND no concurrent rotation activity -- indicates either a misconfigured signer or attempted forgery.

Diagnosis steps

  1. Cross-reference: was a tools/jwt_kid/rotate_jwt_kid.sh recently run? If yes, watch for drain.
  2. If no rotation is in flight, dump a sample request -- the JWT header carries a kid parameter. Compare it against the operator-side allow-list.
  3. If the kid is recognised but no pod accepts it, check the pub/sub listener: app.state.jwt_pubsub_task should be a live task, and app.state.jwt_verifier should be a non-None JwtVerifier carrying the expected kids in verifier.kids().
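
For step 2, the kid can be read from a captured token offline without verifying the signature. A minimal sketch using only the standard library; the function names are hypothetical and the result must never be used for authentication, only for diagnosis.

```python
import base64
import json
from typing import Collection, Optional


def jwt_header_kid(token: str) -> Optional[str]:
    """Decode the (unverified) JOSE header segment and return its kid, if any."""
    header_b64 = token.split(".", 1)[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)  # restore stripped padding
    header = json.loads(base64.urlsafe_b64decode(padded))
    return header.get("kid")


def kid_is_unknown(token: str, active_kids: Collection[str]) -> bool:
    """True when the token's kid is absent from the active key set."""
    return jwt_header_kid(token) not in active_kids
```

Feed `active_kids` from the operator-side allow-list; a kid that appears in neither the old nor the new key set points away from rotation lag and towards a misconfigured signer or forgery.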

Escalation Platform backend on-call. If forgery is suspected, page security and freeze rotations.


admin_ws_legacy_query_token_total (added by T4-D-I7)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_admin_ws_legacy_query_token, called from the admin /api/v1/ws endpoint when the JWT was supplied via the legacy ?token= query string instead of the Sec-WebSocket-Protocol subprotocol.

Labels: none.

What it counts One increment per accepted WebSocket connection that authenticated via the legacy query-string fallback. Production-runtime requests using the query-string path are rejected (close code 4003) and do NOT increment this counter -- this counter is the dev/cutover indicator, not the attack indicator.

Alarm thresholds

  • Pre-cutover: expected to be non-zero. Watch for the trend to flatten to zero as front-ends migrate to the subprotocol API.
  • Post-cutover: must be zero. Set admin_ws_allow_query_token=False in production once it has been zero for the full soak window (recommend 7 days) so the legacy branch refuses new traffic permanently.

Escalation None as a steady-state alert. Send to the Backoffice frontend team when the counter has been silent long enough to flip the production flag.


recovery_rate_limit_redis_outage_total (added by T4-D-I5)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/recovery_brand.py::record_recovery_rate_limit_redis_outage, called from rate_limit_check_and_increment whenever a Redis exception is raised on the per-contact bucket of /recovery/request (and similar SMS-DOS-target buckets) AND the runtime resolves to a production env.

Labels: bucket="contact"|"ip"|"token". Bounded.

What it counts One increment per fail-closed rejection. The user sees the same generic envelope they would see for a normal rate-limit hit (no oracle leak between "you are limited" and "Redis is down so we denied you"). For per-IP and per-token buckets the Redis outage fails OPEN -- those buckets have no contact oracle and a fail-closed IP path locks out everyone behind a NAT during a Redis blip.
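
The per-bucket fail mode can be written down as a one-line policy. A sketch under the assumption that only the contact bucket is flagged fail-closed; the names `FAIL_CLOSED_BUCKETS` and `allow_on_redis_outage` are hypothetical.

```python
FAIL_CLOSED_BUCKETS = frozenset({"contact"})  # assumption: only per-contact buckets


def allow_on_redis_outage(bucket: str, is_production: bool) -> bool:
    """Decide whether a request proceeds when the rate-limit Redis call raised."""
    fail_closed = is_production and bucket in FAIL_CLOSED_BUCKETS
    return not fail_closed
```

Only the `False` outcome corresponds to a counter increment; per-IP and per-token buckets pass through even in production.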

When it normally fires

  • Should be zero. Any non-zero rate is either a Redis outage in flight or a route added with the wrong fail-mode flag.

Alarm thresholds

  • Warning: any non-zero rate.
  • Page: sustained non-zero rate AND no concurrent Redis incident -- indicates a routing / configuration bug.

Diagnosis steps

  1. Cross-reference Redis health: are evicted_keys / connected_clients abnormal? If yes, this is the safety gate firing as designed.
  2. Check whether legitimate users are reporting "I never got the SMS code" -- if yes, file a follow-up to lift the limit cap once Redis recovers (the in-window denials cannot be retroactively allowed).
  3. If Redis is healthy, inspect the route handler stack: only per-contact buckets should pass fail_closed_in_production=True.

Escalation Platform backend on-call + Redis owner.


gateway_rate_limit_redis_outage_total (added by T6-D2)

Emitter: servers_v2/gateway/app/middleware/rate_limit.py::_handle_redis_outage, called every time the gateway's per-IP sliding-window rate limiter catches a Redis exception. Emitted regardless of whether the request is then allowed or rejected, so the metric tracks the outage rate independent of the policy branch taken.

Labels: env (resolved via resolve_runtime_env, default dev) and reason (the Exception.__class__.__name__ of the Redis failure, e.g. ConnectionError, TimeoutError). Both bounded.

What it counts One increment per request that hit the Redis-outage handler. The follow-up policy branch depends on is_production_runtime_env():

  • Production-grade runtime: fail-closed -- the response is HTTP 503 with Retry-After set; the limiter refuses to silently bypass the limit during the exact window an attacker is most likely to use to brute-force login / SMS / withdrawal flows.
  • Dev / staging-without-prod-label: fail-open -- the request is allowed through so a developer running without a local Redis is not blocked. The counter still increments so dashboards see the outage.
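
The "count first, then branch" ordering can be sketched as below. The function signature, the `collections.Counter` stand-in for the metric, and the 5-second Retry-After value are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def handle_redis_outage(
    exc: Exception,
    env: str,
    is_production: bool,
    outage_total: Counter,
    retry_after_seconds: int = 5,  # assumption: the real value is not documented here
) -> Tuple[Optional[int], Dict[str, str]]:
    """Record the outage, then return (status, headers); status None means allow."""
    # Always increment, regardless of which policy branch follows.
    outage_total[(env, type(exc).__name__)] += 1
    if is_production:
        return 503, {"Retry-After": str(retry_after_seconds)}
    return None, {}
```

Because the increment precedes the branch, a dev pod without Redis and a production pod in an outage both appear on the dashboard, separated only by the env label.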

When it normally fires

  • Should be zero in production. Any non-zero rate is either a Redis outage in flight or a network blip on the gateway -> Redis path.
  • Non-zero in dev is expected when working without Redis running locally; do not alarm.

Alarm thresholds

  • Warning: any non-zero rate with env="production".
  • Page: sustained env="production" non-zero rate AND no concurrent Redis incident -- indicates a routing / configuration bug (e.g. wrong REDIS_URL, gateway pinned to a stale Redis IP).

Diagnosis steps

  1. Cross-reference Redis health: are evicted_keys / connected_clients abnormal? If yes, this is the safety gate firing as designed -- confirm the on-call Redis playbook is engaged.
  2. Check the gateway logs for the matching Rate limiter Redis error warning and the reason label -- a uniform ConnectionError spike points at a network partition; a uniform TimeoutError spike points at Redis CPU saturation.
  3. Cross-reference user reports: production fail-closed mode returns HTTP 503 from /api/v1/user/login, /api/v1/user/register, and the SMS endpoints -- treat any user-visible "service unavailable" spike as the same incident.
  4. If Redis is healthy and gateway_rate_limit_redis_outage_total is climbing, inspect the gateway pod's network path to Redis (DNS resolution, TLS handshake, security-group rules).

Escalation Platform backend on-call + Redis owner. The fail-closed posture is intentional -- do not "temporarily" disable it without a decision from the security on-call.


internal_caller_token_legacy_rejected_total (added by T4-D-I2)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_internal_caller_token_legacy_rejected, called from every *_service/app/api/deps.py::require_internal_service_token when PER_CALLER_TOKEN_REQUIRED=on and the request fell through to the legacy single-token fallback path.

Labels: consumer (the receiving service) and caller (the asserted X-Caller-Service header, or "unknown" when absent). Bounded -- both truncated to 32 chars.

What it counts One increment per request rejected with HTTP 403 because the operator flipped PER_CALLER_TOKEN_REQUIRED on but the request still arrived via the legacy single-token path -- meaning the caller has not been migrated to its INTERNAL_SERVICE_TOKEN_<CALLER> env var yet.
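
A sketch of the token resolution order that leads to this rejection. The legacy variable name `INTERNAL_SERVICE_TOKEN`, the function name, and the three return values are assumptions; only the INTERNAL_SERVICE_TOKEN_<CALLER> naming and the PER_CALLER_TOKEN_REQUIRED gate come from this playbook.

```python
import hmac
from typing import Mapping


def _same(a: str, b: str) -> bool:
    """Constant-time comparison of two token strings."""
    return hmac.compare_digest(a.encode(), b.encode())


def check_internal_token(
    presented: str,
    caller: str,
    env: Mapping[str, str],
    per_caller_required: bool,
) -> str:
    """Return 'ok', 'legacy_rejected' (counter increments, HTTP 403) or 'denied'."""
    per_caller = env.get(f"INTERNAL_SERVICE_TOKEN_{caller.upper()}")
    if per_caller and _same(presented, per_caller):
        return "ok"
    legacy = env.get("INTERNAL_SERVICE_TOKEN")  # assumption: legacy single-token name
    if legacy and _same(presented, legacy):
        return "legacy_rejected" if per_caller_required else "ok"
    return "denied"
```

The `legacy_rejected` outcome is the only one this counter tracks: the token was valid under the old scheme, so the fix is a deploy of the per-caller variable rather than a credential investigation.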

When it normally fires

  • Should be zero at all times once Phase 16 (the Multi-Brand hard flip) has completed and PER_CALLER_TOKEN_REQUIRED=on has rolled out.
  • Pre-flip: must be zero before the flip can proceed (see docs/runbooks/multi-brand/phase-16-hard-flip-checklist.md).
  • Right after a new internal HTTP client is added: a spike points at the new client forgetting to set its per-caller env var.

Alarm thresholds

  • Warning: any non-zero rate.
  • Page: sustained non-zero rate for the same (consumer, caller) pair lasting more than 5 minutes -- a known caller is broken.

Diagnosis sequence

  1. Identify the offending (consumer, caller) pair from the counter labels.
  2. On the caller's container, printenv | grep INTERNAL_SERVICE_TOKEN_<UPPER(caller)> -- if empty, the env var was never deployed.
  3. On the consumer's container, printenv PER_CALLER_TOKEN_REQUIRED -- if on, this is the expected rejection path. If unset / off, the rejection path should not have triggered; investigate.
  4. Either deploy the missing per-caller env var OR roll back PER_CALLER_TOKEN_REQUIRED on the consumer until the caller is migrated.

Escalation Platform backend on-call + the team that owns the offending caller service.


agent_balance_legacy_write_total (added by T4-D-I3)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_agent_balance_legacy_write, called from admin_service POST /api/v1/agents/{id}/adjust-balance on every invocation while the route still mutates agent.balance directly (anti-pattern -- should be wallet_service).

Labels: change_type="add"|"sub"|"unknown".

What it counts Each direct-write call. Goal: zero, after the route has been migrated to wallet_service.

Alarm thresholds

  • Pre-migration: expected non-zero baseline (admin balance adjustments are normal back-office activity).
  • Post-migration: must be zero. Any non-zero rate after the migration PR has shipped indicates the old code path is still reachable (rollback, A/B traffic, etc.) and needs investigation.

Escalation Platform backend on-call.


agent_cross_brand_rejected_total

Emitter: servers_v2/agent_service/app/services/brand_check.py::_record_mismatch (line ~105). Invoked from agent-write helpers when the request brand disagrees with the target row's brand_id.

Labels: command, mode. command is the agent operation name (e.g. agent_message_read, agent_password_reset). mode is observe or enforce.

What it counts One increment per agent_service write whose target row brand differs from the request brand. Observe mode proceeds with the request brand; enforce mode raises CrossBrandRejected and the route returns the v1 reject envelope.
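The observe/enforce split can be sketched as below. This is a simplified stand-in for the shape of a brand check, not the contents of brand_check.py: check_brand, the local CrossBrandRejected class, and the in-memory counter are hypothetical names used only for illustration.

```python
from collections import Counter


class CrossBrandRejected(Exception):
    """Local stand-in for the real exception of the same name."""


# Hypothetical in-memory stand-in for agent_cross_brand_rejected_total.
mismatch_counter = Counter()


def check_brand(command, request_brand, target_brand, mode):
    """Return the brand a write proceeds under, or raise in enforce mode."""
    if request_brand == target_brand:
        return request_brand  # no mismatch, nothing is counted

    # Both modes count the mismatch, labelled by command and mode.
    mismatch_counter[(command, mode)] += 1
    if mode == "enforce":
        raise CrossBrandRejected(
            f"{command}: request brand {request_brand!r} "
            f"!= target brand {target_brand!r}"
        )
    # Observe: record what enforce WOULD have rejected, then proceed.
    return request_brand
```

Because observe mode still increments the counter, a clean soak window in observe is evidence that enforce mode would not reject live traffic.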

When it normally fires

  • Pre-flip soak: must reach zero before Phase 16 hard flip.
  • Post-flip steady state: zero.

Alarm thresholds

  • Warning: any non-zero mode=observe rate during the soak window.
  • Page: >=1/min of mode=enforce.

Diagnosis steps

  1. kubectl logs deploy/agent-service | grep "agent cross-brand mismatch" -- look at command, request_brand, target_brand, mode.
  2. If concentrated on one command, the originating caller is not forwarding X-Brand-Id correctly.
  3. Cross-check the originating service via request_id.

Likely fix Same shape as wallet_cross_brand_rejected_total: hot-fix the caller's brand-forwarding code OR flip the affected service to observe.

Escalation Agent domain owner immediately for mode="enforce" rates.


game_cross_brand_rejected_total

Emitter: servers_v2/game_service/app/services/brand_check.py::_record_mismatch (line ~105). Invoked from game-write helpers (e.g. callback-driven state mutations) when the request brand disagrees with the target row's brand_id.

Labels: command, mode. Same shape as the wallet variant.

What it counts One increment per game_service write whose target row brand differs from the request brand. Observe mode proceeds with the request brand; enforce mode raises CrossBrandRejected.

When it normally fires Same gates as wallet_cross_brand_rejected_total; pre-flip soak must reach zero.

Diagnosis steps

  1. Filter game_service logs for event=cross_brand_rejected. Look at command, request_brand, target_brand, mode.
  2. Most common cause: a brand-unaware game callback path reverse-parses to brand A while the persisted transaction state row was written under brand B. Cross-check the originating reverse-parse; see also game_callback_brand_unresolved_total.

Likely fix Same shape as the wallet variant.

Escalation Game domain owner immediately for mode="enforce" rates.


promotion_cross_brand_rejected_total

Emitter: servers_v2/promotion_service/app/services/brand_check.py::_record_mismatch (line ~105). Invoked from promotion-write helpers (coupon use, saga steps, settlement projections) when the request brand disagrees with the target row's brand_id.

Labels: command, mode. Same shape as the wallet variant.

Diagnosis steps

  1. kubectl logs deploy/promotion-service | grep "promotion cross-brand mismatch".
  2. Most common cause: a coupon row pre-dates the brand backfill OR a saga step lost the brand context across a Redis Stream boundary; check the upstream wallet event payload for brand_id.

Likely fix / escalation Same shape as the wallet variant. Promotion domain owner.


recon_cross_brand_rejected_total

Emitter: servers_v2/recon_service/app/services/brand_check.py::_record_mismatch (line ~115). Invoked when a recon match attempts to mutate a row whose brand differs from the matched deposit's brand.

Labels: command, mode. Same shape as the wallet variant.

Diagnosis steps

  1. kubectl logs deploy/recon-service | grep "recon cross-brand mismatch".
  2. Most common cause: shooter_sms.brand_id was inferred incorrectly at match time; cross-check the matched player_deposit.brand_id.

Likely fix / escalation Same shape as the wallet variant. Recon domain owner.


rolling_cross_brand_rejected_total

Emitter: servers_v2/rolling_service/app/services/brand_check.py::_record_mismatch (line ~108). Invoked from the rolling event consumer and internal rolling routes when the request brand disagrees with the target rolling row's brand_id.

Labels: command, mode. Same shape as the wallet variant.

Diagnosis steps

  1. kubectl logs deploy/rolling-service | grep "rolling cross-brand mismatch".
  2. Most common cause: a wallet event landed on the rolling stream without a brand_id in the envelope (legacy schema_version=1 event escaped the soak window). Cross-check event_legacy_schema_total{stream,consumer="rolling_service"}.
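When checking step 2 against captured stream payloads, a small filter like the one below can help spot envelopes that carry no brand. It is a sketch only: the event_id key and the dict-shaped envelope are assumptions for the example, and only the brand_id and schema_version fields are named by this playbook.

```python
def find_brandless_events(envelopes):
    """Return (event_id, schema_version) for envelopes lacking a brand_id.

    envelopes: iterable of dicts decoded from the rolling stream.
    """
    suspects = []
    for envelope in envelopes:
        if not envelope.get("brand_id"):  # missing, None, or empty string
            suspects.append(
                (envelope.get("event_id"), envelope.get("schema_version"))
            )
    return suspects
```

If every suspect reports schema_version 1, the legacy-schema explanation fits; a brandless envelope with a newer schema version would point at a producer bug instead.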

Likely fix / escalation Same shape as the wallet variant. Rolling domain owner.


pii_aes_unconfigured_total (T6-F2)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/aes_guard.py, called by encrypt_pii_strict / decrypt_pii_strict whenever the service-level AES_KEY / AES_IV are missing or empty and the helper falls back to plaintext (allowed only outside production).

Labels: service (e.g. admin_service, agent_service, player_service), op (one of encrypt, decrypt).

What it counts One increment per call that bypassed AES because the key material was not configured. In production the boot guard refuses to start a service with empty key material, so a non-zero rate in production means either the boot guard was bypassed (a deploy regression) or a code path constructed its own settings shim that did not inherit the keys.
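The relationship between the boot guard and the plaintext fallback can be sketched as follows. This is a hedged illustration, not the code in aes_guard.py: boot_guard, encrypt_pii_sketch, the environment string, and the injected cipher callable are hypothetical, and no real cryptography is performed here.

```python
from collections import Counter

# Hypothetical in-memory stand-in for pii_aes_unconfigured_total.
unconfigured = Counter()


def boot_guard(environment, aes_key, aes_iv):
    """Refuse to start in production when key material is missing or empty."""
    if environment == "production" and not (aes_key and aes_iv):
        raise RuntimeError("AES_KEY / AES_IV must be configured in production")


def encrypt_pii_sketch(value, aes_key, aes_iv, service, cipher):
    """Encrypt via the injected cipher, or fall back to plaintext and count it.

    cipher: callable (value, key, iv) -> ciphertext, supplied by the caller.
    """
    if not (aes_key and aes_iv):
        # Dev fallback: the value is stored in the clear and the bypass is counted.
        unconfigured[(service, "encrypt")] += 1
        return value
    return cipher(value, aes_key, aes_iv)
```

Under this shape the fallback branch is unreachable in production as long as boot_guard ran at startup, which is why any production increment implies the guard was bypassed or the keys were lost on the way to the helper.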

When it normally fires

  • Local development without .env. Expected non-zero. The strict helpers log a WARN per call so the developer sees the fallback.
  • CI / unit-test paths that intentionally exercise the dev fallback. These do not export Prometheus, so the counter is invisible there.

Alarm thresholds

  • Production: ANY non-zero rate is a page. PII at rest is supposed to be encrypted; a single increment means at least one row was written in the clear.
  • Staging: warn-on-non-zero-for-15-min so we catch boot-guard drift before a prod cutover.

Diagnosis sequence

  1. Identify the offending service via the service label.
  2. SSH into the pod and dump the resolved settings -- AES_KEY and AES_IV must be non-empty SecretStr values. If empty, the boot guard either never ran or was patched out.
  3. If the write came through an admin/agent route, check the request path against the audit log to identify which row was written. The row's PII fields are plaintext and must be re-encrypted via a one-shot rotation script.
  4. Fix the env / settings, redeploy, then re-encrypt affected rows.

Escalation Security on-call. Compliance must be notified for any PII written in the clear in production.


pii_aes_gcm_decrypt_failed_total

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/aes_guard.py, from decrypt_pii_strict when a versioned aesgcm:v1: PII ciphertext cannot authenticate.

What it counts One increment per invalid AES-256-GCM ciphertext. Unlike legacy base64-shaped AES-CBC/plaintext ambiguity, the aesgcm:v1: prefix is an unambiguous encrypted-data marker, so production reads raise AesPiiDecryptError immediately rather than returning the value.
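The prefix-based distinction above can be sketched as a small decision function. It is an illustration under assumptions: handle_decrypt_failure and the local AesPiiDecryptError class are hypothetical, and the legacy branch (return the stored value when strict decrypt is off) is inferred from the description rather than confirmed by the source.

```python
GCM_PREFIX = "aesgcm:v1:"


class AesPiiDecryptError(Exception):
    """Local stand-in for the real exception of the same name."""


def handle_decrypt_failure(stored, strict_decrypt):
    """Decide what a failed decrypt does with the stored value (sketch only)."""
    if stored.startswith(GCM_PREFIX):
        # Unambiguous marker: the value was encrypted, so failure is always fatal.
        raise AesPiiDecryptError("aesgcm:v1 ciphertext failed authentication")
    if strict_decrypt:
        # Legacy CBC/plaintext ambiguity, strict mode on: refuse to guess.
        raise AesPiiDecryptError("legacy value failed to decrypt")
    # Assumed legacy behaviour with strict mode off: treat the value as plaintext.
    return stored
```

Note that strict_decrypt never influences the first branch, which matches the warning in the diagnosis sequence that toggling AES_STRICT_DECRYPT does not help with versioned ciphertexts.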

Diagnosis sequence

  1. Identify the service label and affected request path.
  2. Check whether the row was manually edited, truncated, or encrypted with the wrong AES_KEY.
  3. Restore from a known-good row backup or re-encrypt from a trusted source of truth. Do not disable AES_STRICT_DECRYPT; that flag only applies to ambiguous legacy CBC/plaintext reads.

admin_ws_rate_limited_total (T6-F3)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_admin_ws_rate_limited, called from the admin /api/v1/ws endpoint when a connection attempt is rejected by the per-IP accept-rate limiter installed in T4-D-I7.

Labels: reason -- one of:

  • per_ip_window_exceeded -- IP exceeded the configured accept-per-min budget (admin_ws_rate_limit_per_min).
  • redis_outage -- the limiter could not consult Redis and failed closed in production (or failed open in dev), following the same env-aware policy as the gateway global limiter.

What it counts One increment per rejected WebSocket accept attempt. Each rejection closes the socket with code 4029 (custom for "ws rate limited") and does NOT mint an admin session.
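A per-IP accept limiter of this kind can be sketched with a fixed window, as below. This is an in-memory illustration only: the real limiter consults Redis, and PerIpAcceptLimiter, the fixed-window strategy, the injectable clock, and allow_on_outage are all assumptions made for the example. The 4029 close code, the 60 s default window, and the production fail-closed / dev fail-open policy come from this playbook.

```python
import time

WS_CLOSE_RATE_LIMITED = 4029  # custom close code sent on rejection


class PerIpAcceptLimiter:
    """Fixed-window accept counter keyed by client IP (in-memory sketch)."""

    def __init__(self, per_min, window_s=60, clock=time.monotonic):
        self._per_min = per_min
        self._window_s = window_s
        self._clock = clock
        self._counts = {}  # ip -> (window_index, accepts_seen_in_window)

    def allow(self, ip):
        """Return True if this accept attempt fits the IP's budget."""
        window = int(self._clock() // self._window_s)
        seen_window, count = self._counts.get(ip, (window, 0))
        if seen_window != window:
            count = 0  # a new window started, reset the budget
        count += 1
        self._counts[ip] = (window, count)
        return count <= self._per_min


def allow_on_outage(environment):
    """Env-aware policy when the backing store is unreachable."""
    return environment != "production"  # fail closed in prod, open elsewhere
```

A caller would close the socket with WS_CLOSE_RATE_LIMITED whenever allow() returns False, before any admin session is minted. A fixed window permits short bursts across a window boundary; a sliding window would be stricter at the cost of more state.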

When it normally fires

  • Burst from a single browser tab refresh storm. Should self-correct within one window (60 s default).
  • Operator dashboard mass-reload during deploy. Mostly harmless.

Alarm thresholds

  • Warning: sustained non-zero rate from a single IP across more than 3 windows. Could indicate a misbehaving frontend client.
  • Page: reason="redis_outage" non-zero in production. The limiter is the only protection between the admin panel and unbounded session churn, and any Redis outage lasting longer than ~30 s should be investigated regardless of which counter fires.

Diagnosis sequence

  1. Break the rejections down by IP using the access logs (the limiter does not emit IP as a label, to keep cardinality bounded).
  2. If the rejections come from a single IP, ask the BO frontend team whether the user reports a refresh loop or a stuck reconnect timer, and have the offending tab closed or reloaded.
  3. If reason="redis_outage", cross-check Redis pod health and the service's own redis_connect_errors_total.

Escalation Backoffice team for client-side issues; platform on-call for Redis outages.


See Also

  • docs/runbooks/multi-brand/multi-brand-isolation-rollout.md -- Diagnosis Playbook section for brand_resolution_failed_total, wallet_idempotency_brand_split_total, security_downgrade_total, per-brand request_total, and per-brand wallet outbox freshness.
  • docs/runbooks/multi-brand/phase-16-hard-flip-checklist.md -- pre-flip gate that consumes every counter above. Every entry in this file maps to a Phase 16 pre-flip checkbox or to a post-flip page-on-non-zero alarm.
  • docs/specs/multi-brand/2026-04-27-multi-brand-isolation-spec.md -- canonical counter definitions and acceptance criteria.
  • servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py -- shared helper module for spec counters.