Multi-Brand Counter Diagnosis Playbook

Status

Ready

2026-07-17 Aggregator-only update: Native provider callbacks, account-prefix reverse parsing, BRAND_CATALOG_CHANGED, the callback DLQ replay utility, VERIFY_CALLBACKS, and the game_callback_* native-callback counters were retired. Sections explicitly marked "Retired" below are historical incident context only and must not be used as current commands. The active boundary is documented in docs/services/game-service.md.

Date

2026-04-28

Last Verified Commit

Use git log -- <this file> for current last-touch history; this field is intentionally not pinned to a static hash so it does not become stale after unrelated commits.

Owners

Platform Backend
Wallet Domain
Agent Domain

Purpose

Operator-facing entry point for the multi-brand observability counters. On-call gets paged at 03:00 with no time to read source. Each section maps one counter to: what it counts, when it normally fires, alarm thresholds, what it means when it fires, diagnosis steps, likely fix, and who to escalate to.

Counters are organized by emitting service. The legacy brand_resolution_failed_total{reason} and wallet_idempotency_brand_split_total counters already have entries in the rollout runbook's Diagnosis Playbook section (docs/runbooks/multi-brand/multi-brand-isolation-rollout.md); this file picks up the brand-signature, brand-check, allow-list, and admin-audit counters that lacked an entry.

Common conventions

"Soak window" means the same window used before the Phase 16 flip (see phase-16-hard-flip-checklist.md). Default 120 min.
"Observe mode" never returns end-user errors; the counter records what enforce mode WOULD have rejected.
"Enforce mode" hard-rejects with the v1 envelope (HTTP 200 + {"status":3,"msg":"...","data":null}) for HTTP entry paths, or raises CrossBrandRejected for wallet command paths.
All thresholds are per-instance unless explicitly summed.
Log greps use loguru's extra= fields; in production these surface as JSON keys.
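
Because enforce-mode HTTP rejections arrive with HTTP 200, a status-code check alone will not detect them. The following is a minimal sketch, assuming only the envelope shape quoted above; `is_v1_reject` is a hypothetical helper name, and treating `status == 3` as the reject marker is taken from the example envelope rather than from a full status-code table.

```python
import json


def is_v1_reject(http_status: int, body: str) -> bool:
    """Return True when a response looks like the v1 reject envelope."""
    if http_status != 200:
        return False  # the v1 reject envelope is delivered with HTTP 200
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # not JSON, so not a v1 envelope
    if not isinstance(payload, dict):
        return False
    # Assumption: status 3 with a null data field marks the rejection.
    return payload.get("status") == 3 and payload.get("data") is None
```

A smoke test or synthetic probe that only asserts on the HTTP status would report these rejections as successes, so it needs a body check along these lines.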

`wallet_brand_signature_failed_total`

Emitter: servers_v2/wallet_service/app/services/brand_signature_verifier.py (_record_failure, line ~89)

Labels: caller_service, reason, mode. Reasons:

missing_headers:<list> -- one or more of X-Brand-Signature, X-Brand-Signature-Timestamp, X-Caller-Service, X-Request-Id was absent.
invalid_timestamp -- timestamp header was not parseable as int.
signature_mismatch -- HMAC did not match.
signature_replay -- a valid signature was presented twice within the 600s replay-cache window (see wallet_brand_signature_replay_blocked_total below for the alias-counter view).
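
To make the first three reasons concrete, here is a hedged sketch of how a verifier could arrive at each one. It is not the code in brand_signature_verifier.py: the signed message layout, the comma-joined header list, and the function name `classify_failure` are assumptions for illustration, and the skew and replay checks are left out.

```python
import hashlib
import hmac

REQUIRED_HEADERS = (
    "X-Brand-Signature",
    "X-Brand-Signature-Timestamp",
    "X-Caller-Service",
    "X-Request-Id",
)


def classify_failure(headers: dict, key: bytes):
    """Return a failure reason string, or None when these checks pass."""
    missing = [name for name in REQUIRED_HEADERS if not headers.get(name)]
    if missing:
        # Assumption: the <list> part is a comma-joined list of header names.
        return "missing_headers:" + ",".join(missing)
    try:
        timestamp = int(headers["X-Brand-Signature-Timestamp"])
    except ValueError:
        return "invalid_timestamp"
    # Assumption: hypothetical message layout, not the production one.
    message = "|".join(
        [headers["X-Caller-Service"], headers["X-Request-Id"], str(timestamp)]
    ).encode()
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, headers["X-Brand-Signature"]):
        return "signature_mismatch"
    return None
```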

What it counts: One increment per inbound brand-scoped wallet write whose X-Brand-Signature envelope failed verification, tagged with the calling service, the failure reason, and the active enforcement mode.

When it normally fires

During Stage A/B of the per-caller signing rollout (cfd89b05): missing_headers:* is expected to be non-zero from any caller that has not yet been updated. Use the per-caller wallet_brand_signature_missing_total counter to see which caller; that is a rollout signal, not an alarm.
Once WALLET_BRAND_SIGNATURE_REQUIRE=on: expected steady-state rate is zero.
invalid_timestamp / signature_mismatch / signature_replay are always abnormal at any rollout stage.

Alarm thresholds

Warning: any signature_mismatch or invalid_timestamp increment in a 5-min window.
Page: >=10/hr of any reason other than missing_headers:* in enforce mode, OR ANY rate of signature_mismatch in production.
Hard incident: sustained signature_mismatch from a single caller -- treat as an active spoof attempt.

What it means when it fires: An internal caller sent a wallet write whose HMAC envelope did not verify. Possible causes:

(a) caller was not issued BRAND_SIGNING_KEY, or has a stale value after rotation;
(b) clock skew >300s between caller and wallet pods (the verify_brand_assertion helper enforces +/- 300s);
(c) a rotated BRAND_SIGNING_KEY was not propagated to every pod before the old key was removed;
(d) attempted spoof from outside the trust boundary (rare; the signing key is internal).

Diagnosis steps

kubectl logs deploy/wallet-service | grep "wallet brand-signature failure" | tail -50 -- look at caller_service, reason, mode extras.
If reason=missing_headers:*: check wallet_brand_signature_missing_total{caller_service} for the same caller; if non-zero, that caller's HTTP client is not injecting the headers. Inspect the deploy of that caller.
If reason=signature_mismatch: confirm both pods have the same BRAND_SIGNING_KEY version: kubectl exec deploy/<caller> -- printenv BRAND_SIGNING_KEY | sha256sum and compare to wallet pod.
If reason=invalid_timestamp: check the raw header value in logs (the verifier logs the failed parse). Likely a bug in the caller's timestamp formatter.
If clock skew suspected: kubectl exec into both pods and run date -u. Compare against NTP sync state.

Likely fix

Missing headers from a known caller: deploy the caller's HTTP client fix that injects all four headers (see rgb_contracts.infra.brand_signature.sign_brand_assertion).
Mismatch after rotation: re-roll the affected caller pods to pick up the new BRAND_SIGNING_KEY. See "BRAND_SIGNING_KEY rotation" in the rollout runbook.
Clock skew: re-sync NTP on both pods; consider increasing the 600s replay cache margin only with security review.
Spoof: rotate BRAND_SIGNING_KEY immediately and audit the caller pod for compromise.

Escalation: Wallet domain owner if not resolved in 15 min. Security on-call immediately on any signature_mismatch from a caller whose signing key has not been rotated recently.

`wallet_brand_signature_replay_blocked_total`

Emitter: same file as above. Surfaces as wallet_brand_signature_failed_total{reason="signature_replay"} -- not a separate metric name. The "replay blocked" view is the slice of that counter where reason=signature_replay.

Labels: same (caller_service, reason="signature_replay", mode).

What it counts: One increment when a caller presented a previously-seen valid (caller_service, request_id) signature pair within the 600s replay-cache window. Defense-in-depth on top of HMAC + skew.

When it normally fires: Steady-state rate is zero. Legitimate clients never replay a valid signature within 600s because request_id is unique per request.

Alarm thresholds

Warning: any single increment in observe mode (operator should understand the cause before flipping enforce).
Page: >=1/min sustained in enforce mode -- a caller is reusing request IDs OR a captured signature is being replayed.
Hard incident: sustained replay from a caller whose request_id generator is supposed to be UUID4 -- suspect signature capture.

What it means when it fires

(a) caller's request_id generator is not actually unique within the 10-min window (e.g. ids derived from a counter or a timestamp that collide);
(b) caller is retrying without re-signing -- the original signed envelope is being replayed by the HTTP client retry logic;
(c) an attacker captured a valid signed request and is replaying it (real intrusion).
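
The replay window behaves like a set-if-absent claim with a 600s expiry. The sketch below uses an in-memory dictionary as a stand-in for Redis so the behaviour can be seen in isolation; `ReplayCache` and its injectable clock are hypothetical, and only the key layout and the 600s window come from this playbook.

```python
import time


class ReplayCache:
    """In-memory stand-in for a Redis set-if-absent claim with expiry."""

    def __init__(self, ttl_seconds: int = 600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def claim(self, caller: str, request_id: str) -> bool:
        """Return True on first sight, False when the pair is a replay."""
        key = f"brand_sig_seen:{caller}:{request_id}"
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # still inside the replay window
        self._expires_at[key] = now + self._ttl
        return True
```

A caller that reuses a request_id inside the window gets `False` on the second claim, which is the condition that increments the counter.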

Diagnosis steps

kubectl logs deploy/wallet-service | grep '"reason":"signature_replay"' | tail -50 -- look at caller_service. The X-Request-Id is in the request header but not in this counter's logged extras; check the caller's logs for the same window.
Inspect the caller's HTTP retry policy. If it retries on 5xx without re-signing, that is the bug -- a retried request needs a fresh timestamp + signature.
Check if the request_id pattern looks like a counter or timestamp instead of UUID4 -- if so, the caller's id generator is collision-prone.
Redis cache state: redis-cli TTL brand_sig_seen:<caller>:<request_id> -- a positive value (at most 600) confirms the key exists and the previous valid signature is still cached; -2 means the key does not exist.

Likely fix

Caller retry loop: patch the caller to re-sign on each retry attempt (timestamp + signature must be regenerated). The Phase 5B cfd89b05 signing helpers do this correctly when invoked per-attempt; older hand-rolled clients may sign once and reuse.
Non-unique request_id: switch the caller to UUID4 generation.
Capture/replay attack: rotate BRAND_SIGNING_KEY immediately (replay cache will not help once the key changes), page security.
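
The retry fix can be sketched as a loop that rebuilds the timestamp and signature on every attempt instead of reusing the first envelope. This is an illustration under assumptions: `sign` uses a hypothetical message layout (the real helper is sign_brand_assertion), and `send_with_retries` is not an existing client function. The source only requires a fresh timestamp and signature per attempt, so that is all the sketch regenerates; whether a retry should also carry a new X-Request-Id is not stated here.

```python
import hashlib
import hmac
import time


def sign(key: bytes, caller: str, request_id: str, timestamp: int) -> str:
    # Assumption: hypothetical message layout for illustration only.
    message = f"{caller}|{request_id}|{timestamp}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def send_with_retries(send, key, caller, request_id, attempts=3, clock=time.time):
    """Call send(headers) until it reports success, re-signing each time."""
    for _ in range(attempts):
        timestamp = int(clock())  # fresh timestamp for this attempt
        headers = {
            "X-Caller-Service": caller,
            "X-Request-Id": request_id,
            "X-Brand-Signature-Timestamp": str(timestamp),
            "X-Brand-Signature": sign(key, caller, request_id, timestamp),
        }
        if send(headers):
            return True
    return False
```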

Escalation: Wallet domain owner if not resolved in 15 min. Security on-call immediately for sustained replay from a caller with verified-correct signing.

`wallet_brand_signature_misconfigured_total` (added by P1-4)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_wallet_brand_signature_misconfigured, called from servers_v2/wallet_service/app/services/brand_signature_verifier.py::verify_request_brand_signature when the verifier observes an empty BRAND_SIGNING_KEY at request time.

Labels: mode. One of observe or enforce (the off branch returns True without recording).

What it counts: One increment per inbound brand-scoped wallet write that reached the verifier while BRAND_SIGNING_KEY was empty. In observe mode the write proceeds (rollout-safe); in enforce mode the write is rejected (return False -- runtime fail-closed defense-in-depth on top of the boot-time guard assert_brand_signing_key_configured in app/main.py).

When it normally fires: Steady state is zero. Boot-time guard refuses to start wallet_service in the bad combination (MULTI_BRAND_ENFORCEMENT=enforce AND WALLET_BRAND_SIGNATURE_REQUIRE=on AND empty BRAND_SIGNING_KEY). Any non-zero rate means either:

the boot guard was bypassed (test env, weird container restart race, deploy template stripped the env var after boot);
the deploy template provisioned BRAND_SIGNING_KEY= (literal empty string) instead of the secret value.

Alarm thresholds

Page: ANY non-zero rate. Empty signing key in enforce mode means every brand-scoped wallet write handled by that pod is being rejected; in observe mode it means HMAC verification is silently disabled.
Hard incident: sustained mode="enforce" -- production wallet writes handled by that pod are failing; users see write errors.

What it means when it fires

(a) mode="enforce": the runtime fail-closed branch is firing. Every brand-scoped wallet write reaching this pod is rejected (the verifier returns False). The boot guard SHOULD have prevented the pod from starting; investigate why it didn't.
(b) mode="observe": HMAC verification is disabled on this pod. No writes are blocked but the brand-signature defense is dead -- operator MUST fix before flipping WALLET_BRAND_SIGNATURE_REQUIRE=on (and certainly before flipping MULTI_BRAND_ENFORCEMENT=enforce).
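
The two layers described in this section, the boot-time guard and the runtime empty-key branch, can be summarised as a pair of small decision functions. This is a sketch of the documented behaviour, not the service code; both function names are hypothetical.

```python
def boot_guard_blocks_start(enforcement: str, require: str, signing_key: str) -> bool:
    """True for the one combination the boot-time guard refuses to start in."""
    return enforcement == "enforce" and require == "on" and signing_key == ""


def empty_key_decision(mode: str):
    """Return (allow_write, record_counter) when BRAND_SIGNING_KEY is empty."""
    if mode == "off":
        return True, False  # off branch returns True without recording
    if mode == "observe":
        return True, True  # write proceeds, counter increments
    return False, True  # enforce: fail closed, counter increments
```

Seeing `mode="enforce"` on this counter therefore implies the first function should have returned True at boot for that pod, which is why the boot path is the first thing to investigate.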

Diagnosis steps

kubectl logs deploy/wallet-service | grep "BRAND_SIGNING_KEY empty" | tail -50 -- look at mode extras. The error message includes the Vault path operators must read from.
kubectl exec deploy/wallet-service -- printenv BRAND_SIGNING_KEY | wc -c -- confirm the env var is set on the pod (>1 byte expected).
Compare to the deploy template / Vault binding for the affected environment.
Check the boot logs for the affected pod: did the pod start cleanly or did the guard raise (in which case the pod is in CrashLoopBackOff and the runtime counter wouldn't be firing)?

Likely fix

Re-provision BRAND_SIGNING_KEY from the secrets manager (Vault path secret/multi-brand/brand-signing/<env>); roll the wallet pods.
Audit the deploy template -- if it provisioned an empty value, fix the template before the next deploy.

Escalation: Wallet domain owner immediately for mode="enforce" rates. Security on-call if the empty value cannot be explained by a deploy-template bug (suspect tampering).

`wallet_cross_brand_rejected_total`

Emitter: servers_v2/wallet_service/app/services/brand_check.py (_record_mismatch, line ~107). Invoked from wallet_bucket_commands.py::_check_player_brand on every brand-scoped wallet write (authorize_bet, settle_bet, approve_deposit, recycle_deposit, credit_promotion, etc.).

Labels: command, mode. command is the wallet command name (authorize_bet, settle_bet, approve_deposit, decline_deposit, credit_promotion, ...). mode is observe or enforce.

What it counts: One increment per wallet write where the request brand (X-Brand-Id) disagreed with the target row's brand_id. In observe mode the write proceeds using the REQUEST brand; in enforce mode the command raises CrossBrandRejected and the route returns the v1 reject envelope.

When it normally fires

Pre-flip soak: must reach zero before Phase 16 hard flip (gate: wallet_cross_brand_rejected_total{mode="observe"} == 0 across the soak window).
Post-flip steady state: zero.
Any increment in production traffic indicates a real brand-isolation bug or attacker probing.

Alarm thresholds

Warning: any non-zero mode=observe rate during the soak window.
Page: >=1/min of mode=enforce (real money path is rejecting).
Hard incident: sustained mode=enforce rejections AND user reports of failed bets/deposits.

What it means when it fires

(a) an internal caller is not forwarding X-Brand-Id correctly;
(b) the target row (player, deposit, etc.) carries the wrong brand_id due to historical backfill error;
(c) a deliberate cross-brand request reached wallet (rare; the edge should reject earlier);
(d) X-Brand-Id was stripped by a misconfigured proxy.

In observe mode the write still lands in the REQUEST brand -- this is intentional money-safety behavior so a forgotten X-Brand-Id in observe rollout does not silently dump money in the wrong brand.
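
The observe/enforce split for this check can be sketched as follows. The function is an illustration of the documented behaviour only: `check_player_brand` and the `record` callback are hypothetical stand-ins for _check_player_brand and _record_mismatch, and the sketch assumes the mode is either observe or enforce.

```python
class CrossBrandRejected(Exception):
    """Stand-in for the exception named in this playbook."""


def check_player_brand(command, request_brand, target_brand, mode, record):
    """Return the brand the write should use, or raise in enforce mode."""
    if request_brand == target_brand:
        return request_brand
    record(command, mode)  # one increment per mismatching write
    if mode == "enforce":
        raise CrossBrandRejected(
            f"{command}: request={request_brand} target={target_brand}"
        )
    return request_brand  # observe: proceed using the REQUEST brand
```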

Diagnosis steps

kubectl logs deploy/wallet-service | grep "wallet cross-brand mismatch" | tail -50 -- look at command, request_brand, target_brand, mode.
Filter by command to see if it concentrates on one operation (e.g. only approve_deposit). That narrows which caller is misbehaving.
Cross-check the originating service: deposit-class commands come from admin_service or plisio callbacks; bet-class from game_service; promotion-class from promotion_service.
If target_brand is consistently NULL, the row's brand_id was never backfilled -- run the Phase 12 verify script.
Inspect a sample request_id: grep across services for the same request_id to trace where X-Brand-Id was dropped.

Likely fix

Single-caller bug: hot-fix the caller's brand-forwarding code, OR flip just that caller's MULTI_BRAND_ENFORCEMENT to observe to buy time (writes still land in request brand, no money misrouted).
Backfill miss: run the Phase 12 backfill verify + repair script (servers_v2/tools/multi_brand_backfill/).
Edge bug: revert gateway to observe; investigate brand resolution in the affected route.

Escalation: Wallet domain owner immediately. Page deploy lead if mid-flip.

`brand_resolution_failed_total{service="gateway"}`

Emitter: servers_v2/gateway/app/middleware/brand_enforcement.py (_increment_failure, line ~89). This counter has also been referred to as gateway_brand_mismatch_total; the actual emitted counter name is brand_resolution_failed_total with service="gateway", which is the spec-canonical name used across the rollout runbook and dashboards.

Labels: reason, service="gateway". Reasons:

jwt_domain_mismatch -- JWT carries a brand that disagrees with the domain-resolved brand.
jwt_missing_brand -- JWT decoded but lacks a brand_id claim (legacy tokens; expected during the post-Phase-5 drain window).
unknown_domain -- JWT carries a brand but the request's domain cannot be resolved (no Redis entry for the host).

What it counts: One increment per request reaching BrandEnforcementMiddleware where the JWT brand and the domain-resolved brand disagree (or either side is missing). In observe the request proceeds using the domain-resolved brand; in enforce it rejects with the v1 envelope.
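
As a reading aid, the three reasons can be expressed as one small comparison. This is a sketch, not the middleware: `resolution_failure_reason` is a hypothetical name, `None` stands for "missing", and the order in which the two missing cases are checked is an assumption.

```python
def resolution_failure_reason(jwt_brand, domain_brand):
    """Return the failure reason, or None when both sides agree."""
    if jwt_brand is None:
        return "jwt_missing_brand"  # token decoded but has no brand_id claim
    if domain_brand is None:
        return "unknown_domain"  # JWT has a brand, host did not resolve
    if jwt_brand != domain_brand:
        return "jwt_domain_mismatch"
    return None
```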

When it normally fires

jwt_missing_brand: non-zero is expected for 2 * JWT_EXPIRE_MIN after Phase 5 deploy. Must reach zero before Phase 16 flip.
jwt_domain_mismatch: near zero in steady state. A low rate from scanners / stale mobile clients is expected baseline noise.
unknown_domain: zero except during a Redis domain-binding rewrite or a brand onboarding mid-rollout.

Alarm thresholds

Warning: >=10/hr of jwt_domain_mismatch from a single user-agent class.
Page: >=1/min of jwt_domain_mismatch clustered on a legitimate browser user-agent and one brand domain.
Hard incident: unknown_domain sustained -- a brand's traffic is not resolving at the edge; users are locked out.

What it means when it fires

(a) Redis domain:agent:* or domain:level:* mapping was changed and the new mapping disagrees with what JWT minting saw at login time;
(b) refresh-token rotation is minting brand-unaware tokens post-Phase-5 (a Phase 5 implementation bug);
(c) attacker is replaying a JWT minted on brand A against brand B's domain;
(d) for unknown_domain: the LB forwarded a Host not present in any domain:agent:* entry (typically a brand-disable that forgot to keep at least one mapping, or DNS pointing at the wrong LB).

Diagnosis steps

kubectl logs deploy/gateway | grep "brand resolution failed" | tail -50 -- look at reason, request_brand_id, jwt_brand_id, path, mode.
If counts cluster on one source IP / user-agent: scanner or stale client; refine alert threshold.
If counts cluster on a legitimate browser UA + one brand domain: redis-cli --scan --pattern 'domain:*' (prefer this over KEYS, which blocks Redis while it walks the whole keyspace) -- look for a recent change to the affected brand's mapping. Check audit log for brand-status flips.
If jwt_missing_brand is still non-zero >2 * JWT_EXPIRE_MIN after Phase 5 deploy: the player_service / agent_service refresh path is minting brand-unaware tokens. See Phase 5 in the isolation plan.

Likely fix

Baseline noise: ignore; refine the alert threshold.
Domain-mapping change: re-verify the Redis mapping for that domain; inspect the LB and any DNS changes; revert the mapping change if unintended.
Refresh-token bug: hot-fix the issuing service to reject brand-unaware refresh tokens; force re-login for affected sessions.
Phase 16 flip caused a real spike: revert the affected service to observe per the mid-flip revert procedure.

Escalation: Platform backend on-call. Wallet domain owner if money paths are affected.

`agent_brand_not_allowed_total` (a.k.a. `agent_brand_allow_list_rejected`)

Emitter: servers_v2/agent_service/app/middleware/agent_brand_enforcement.py (_increment_failure, line ~109). Note: this counter has also been referred to as agent_brand_allow_list_rejected_total; the actual emitted metric name is agent_brand_not_allowed_total. The middleware also emits the spec-canonical brand_resolution_failed_total with the same reason for cross-service dashboards.

Labels: reason, service="agent_service". Reasons:

agent_brand_not_allowed -- the (agent_id, brand_id) pair is not in the enabled allow-list.
unresolved_brand -- the request reached the middleware but upstream brand resolution attached no brand_id.

What it counts: One increment per authenticated agent_service request whose (agent_id, brand_id) does not appear in agent_brand WHERE status='enabled' (cached in-process with 60s TTL + Redis pub/sub invalidation).

When it normally fires

Pre-Phase-12 re-seed: non-zero is expected if the autoseed missed any agent.
Post-flip steady state: zero. A non-zero rate indicates either the allow-list cache is stale or an agent was unbound from a brand they are still trying to access.

Alarm thresholds

Warning: >=5/hr from a single agent_id (likely stale cache).
Page: >=1/min sustained in enforce mode (real agents are locked out).
Hard incident: a known-good agent_id appears repeatedly across multiple brands AND the agent_brand row exists -- the cache invalidation path is broken.

What it means when it fires

(a) agent_brand row is missing for an agent that was created without the autoseed trigger firing (Phase 12 re-seed gap);
(b) an admin disabled agent_brand.status for an agent who is still actively logged in (their JWT remains valid until expiration);
(c) the per-process allow-list cache is stale and a recent insert has not been picked up (TTL 60s OR AGENT_BRAND_CHANGED pub/sub message dropped);
(d) unresolved_brand reason: brand resolution upstream silently failed -- treated as a separate alarm class (see rollout runbook diagnosis).
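
Cause (c) is easier to reason about with a model of the cache in hand. The sketch below is a hypothetical in-process cache with a 60s TTL and an explicit invalidation hook; `AllowListCache` and `load_enabled_pairs` are illustrative names, not the service's classes.

```python
import time


class AllowListCache:
    """Per-process cache of enabled (agent_id, brand_id) pairs."""

    def __init__(self, load_enabled_pairs, ttl_seconds=60, clock=time.monotonic):
        self._load = load_enabled_pairs
        self._ttl = ttl_seconds
        self._clock = clock
        self._pairs = set()
        self._loaded_at = None

    def invalidate(self):
        """Intended to be called when an AGENT_BRAND_CHANGED message arrives."""
        self._loaded_at = None

    def is_allowed(self, agent_id, brand_id) -> bool:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._pairs = set(self._load())  # reload from the source of truth
            self._loaded_at = now
        return (agent_id, brand_id) in self._pairs
```

In this model a newly inserted row stays invisible until either the TTL elapses or `invalidate` runs, so a dropped pub/sub message shows up as rejections lasting up to one TTL per pod.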

Diagnosis steps

kubectl logs deploy/agent-service | grep "agent brand allow-list check failed" | tail -50 -- look at agent_id, request_brand_id, path, mode, reason.
SQL check: SELECT * FROM agent_brand WHERE agent_id=<agent_id> AND brand_id=<brand_id>; -- row missing or status != 'enabled' confirms cause (a) or (b).
If row exists and is enabled: cache is stale. Force a refresh by restarting one agent_service pod, OR redis-cli PUBLISH AGENT_BRAND_CHANGED "" to invalidate every pod's cache.
For unresolved_brand: see the rollout runbook's brand_resolution_failed_total{reason="unresolved_brand"} entry.

Likely fix

Missing seed: insert (agent_id, brand_id, 'enabled') into agent_brand via the audited admin endpoint (do NOT INSERT directly; the audit row is the operator trail).
Stale cache: publish AGENT_BRAND_CHANGED or rolling-restart agent_service.
Status mistakenly disabled: re-enable through the admin audit-tracked write.

Escalation: Agent domain owner if not resolved in 15 min. Customer support if specific named agents are reporting login failures.

Retired: `game_callback_brand_unresolved_total`

This metric belonged to the removed provider-direct callback and account-prefix reverse-parsing implementation. It has no emitter in the Aggregator-only runtime and must not be used for alerts, release gates or incident recovery.

The game_callback_dead_letter table is retained only for migration history; it is not a live callback queue. Replaying its unauthenticated historical payloads cannot reconstruct Aggregator provider authentication or signed player context and is prohibited.

For current callback incidents, inspect the complete chain instead:

Aggregator provider-authentication and external-player mapping logs.
Gateway exact Provider method/path/query/body forwarding and relay rejection logs.
game_service relay-signature and signed-context validation.
player_service brand ownership resolution and wallet_service idempotency/ledger results.

Recovery uses the provider retry/reconciliation process through Aggregator. Never post a stored callback payload directly to RGB.

Retired: `game_callback_verification_bypassed_total`

This metric and the VERIFY_CALLBACKS bypass belonged to the deleted native HO/MG/WC callback routes. The current public surface accepts only the exact Aggregator relay allow-list and always requires the length-framed relay HMAC plus signed player context. There is no production bypass switch.

Any native provider path becoming reachable, or any relay accepted without valid credentials and context, is an immediate security incident: block the route, verify the deployed gateway/game_service revision, and reconcile affected wallet ledger entries with Aggregator.

Operator Account Attribution

admin_service now derives operator attribution only from the verified admin JWT account claim. Caller-supplied operator identity is not accepted in request bodies or headers.

When attribution fails

Writes are rejected with HTTP 400 when the verified admin JWT has no username / account claim. This indicates a token-issuer bug or a stale test fixture.
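
A minimal sketch of the attribution rule, assuming the verified claims arrive as a dictionary. The claim key names `account` and `username` and the function name `operator_from_claims` are assumptions; only the HTTP 400 outcome and the error text come from this playbook.

```python
def operator_from_claims(claims: dict):
    """Return (http_status, operator_or_error) from verified JWT claims."""
    # Assumption: the account string lives under one of these two keys.
    account = claims.get("account") or claims.get("username")
    if not isinstance(account, str) or not account.strip():
        return 400, "Authenticated admin account required"
    return 200, account
```

Nothing from the request body or headers is consulted, which mirrors the rule that caller-supplied operator identity is not accepted.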

Diagnosis steps

Check the admin-service response body for Authenticated admin account required.
Inspect the token issuer config and confirm new admin tokens include an immutable account claim.
Cross-reference admin_audit.operator_id; despite the historical column name, it stores the account string.

Likely fix

Cycle the admin auth issuer to mint account-bearing tokens, then force admin re-login.

Escalation: Security on-call IMMEDIATELY for any non-zero rate. Platform on-call in parallel for SSO / LB infrastructure investigation.

`admin_audit_write_failed_total` (added by P1-β)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_admin_audit_write_failed (line ~301). Called from servers_v2/admin_service/app/services/admin_audit.py::write_money_admin_audit (line ~152) when the post-money-write audit row INSERT fails.

Labels: route. Bounded enum: adjust_player, approve_deposit, decline_deposit, recycle_deposit, coin_agree_deposit, shooter_agree_deposit, review_agree_withdraw, review_decline_withdraw, pay_agree_withdraw, pay_decline_withdraw, agree_agent_withdraw, decline_agent_withdraw, pay_agent_withdraw. Unknown collapses to unknown.

What it counts: One increment per money-mutating admin endpoint whose admin_audit row INSERT failed AFTER the wallet_service call already committed. The money write CANNOT be rolled back; the audit gap must be reconciled manually.
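
The shape of this failure path can be sketched as below. It is an illustration, not write_money_admin_audit itself: the function names are hypothetical, and swallowing the exception after recording it is an assumption about how the helper behaves once the money write has committed.

```python
KNOWN_ROUTES = frozenset({
    "adjust_player", "approve_deposit", "decline_deposit", "recycle_deposit",
    "coin_agree_deposit", "shooter_agree_deposit", "review_agree_withdraw",
    "review_decline_withdraw", "pay_agree_withdraw", "pay_decline_withdraw",
    "agree_agent_withdraw", "decline_agent_withdraw", "pay_agent_withdraw",
})


def route_label(route: str) -> str:
    """Keep label cardinality bounded: anything unexpected becomes 'unknown'."""
    return route if route in KNOWN_ROUTES else "unknown"


def write_audit_after_money_write(route, insert_audit_row, record_failure) -> bool:
    """Attempt the audit INSERT; record a failure instead of raising."""
    try:
        insert_audit_row()
    except Exception:
        # The money write already committed, so there is nothing to roll back.
        record_failure(route_label(route))
        return False
    return True
```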

When it normally fires: Steady state is zero. Failures here mean either the audit DB session crashed mid-write or the append-only triggers (alembic 0032) fired (which is itself a security event).

Alarm thresholds

Page: ANY non-zero rate. Money committed without audit row; on-call must reconcile.
Hard incident: any log line containing admin_audit is append-only (the trigger fired) -- treat as active intrusion attempt OR a runaway data-cleanup script.

What it means when it fires

(a) DB session error during the audit INSERT (transient: pool exhaustion, network blip, connection reset);
(b) the audit-side INSERT was rejected by the append-only triggers (trg_admin_audit_no_update_delete, trg_admin_audit_no_truncate) -- impossible for a fresh INSERT but possible if a misbehaving caller passed an UPDATE path;
(c) schema drift after migration 0032 (audit table missing a column that the helper writes);
(d) the append-only audit role was misconfigured to forbid INSERT (defence-in-depth role hardening misapplied).

Diagnosis steps

kubectl logs deploy/admin-service | grep "admin_audit write failed AFTER money write committed" | tail -50 -- the helper logs the full traceback at ERROR. Look at the route, target_type, target_id, brand_id, operator, err.
Identify the affected money write in wallet_service ledger: psql -c "SELECT * FROM wallet.wallet_ledger WHERE brand_id=<bid> AND created_at BETWEEN <window_start> AND <window_end> ORDER BY created_at DESC LIMIT 50;".
Reconstruct the audit row manually using the logged before / after payload and the route name. Insert via the audited admin endpoint OR a controlled SQL INSERT with on-call manager approval.
If logs contain admin_audit is append-only: page security on-call immediately; do not retry the audit insert until the cause of the trigger fire is understood.

Likely fix

Transient DB error: retry the manual reconciliation INSERT after pool health recovers.
Trigger fire: investigate the call site -- the append-only trigger should never fire on a fresh INSERT. If it does, either schema drift or an attacker is the cause.
Schema drift: deploy the missing column / constraint; backfill the missed audit rows from the wallet ledger window.

Escalation: Platform on-call AND wallet domain owner immediately. Audit gap is treated as a compliance incident.

`wallet_brand_signature_replay_redis_outage_total` (delivered P2-δ)

Emitter: servers_v2/wallet_service/app/services/brand_signature_verifier.py::_record_replay_redis_outage (line ~168), called from _claim_signature_replay_slot whenever the Redis SETNX on brand_sig_seen:{caller}:{request_id} raises.

Labels: caller_service (per-caller visibility into who was affected; truncated to 32 chars), mode (one of observe / enforce). Emitted in BOTH modes so the alarm fires even while we are still rolling out and observe-mode does not yet hard-reject.

What it counts: One increment per inbound brand-scoped wallet write where the replay-cache Redis SETNX raised an exception (Redis pod down, network blip, auth expiry, etc.). The verifier logs at WARN and fails open (proceeds without the replay check) so signed wallet writes do not wedge during a Redis outage. HMAC + skew still hold, so this is a defence-in-depth degradation, not a security breach.
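
The fail-open behaviour can be sketched as a wrapper around the slot claim. The names here are hypothetical; the 32-character truncation of the caller label and the fail-open outcome are the only parts taken from this section.

```python
def replay_check_passes(claim_slot, caller_service, mode, record_outage) -> bool:
    """Return True when the write may proceed past the replay check."""
    try:
        return claim_slot()  # True on first sight, False on a replay
    except Exception:
        # Replay cache unreachable: count it, then fail open so signed
        # wallet writes do not wedge while Redis is down.
        record_outage(caller_service[:32], mode)
        return True
```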

When it normally fires: Steady state is zero. Any non-zero rate correlates 1:1 with Redis outage windows for the wallet pod's Redis client.

Alarm thresholds

Warning: any non-zero rate sustained >1 min.
Page: >=10/min sustained -- replay defence is degraded for the whole brand-signed wallet write surface.

What it means when it fires: The brand_sig_seen:* Redis keys cannot be claimed; signed wallet writes are proceeding without replay protection. HMAC + skew still hold so this is a defence-in-depth degradation, not a security breach.

Diagnosis steps

Confirm the Redis incident: check the wallet_service Redis client connectivity (kubectl logs deploy/wallet-service | grep "brand-signature replay cache unavailable").
Cross-reference with the wallet outbox publisher Redis health (same Redis target -- both should flap together).
Filter the counter by caller_service -- a single caller spiking while others stay quiet usually means a pod-local network blip, not a Redis incident.
Once Redis is back: confirm the counter rate returns to zero.

Likely fix: Resolve the underlying Redis outage. Consider rotating BRAND_SIGNING_KEY if the outage window was long enough to allow captured-signature replay (rotation invalidates any captured signature).

Escalation: Wallet domain owner AND platform on-call (Redis).

`plisio_callback_replay_blocked_total` (added by T1-D-C5/I6)

Emitter: servers_v2/wallet_service/app/api/routes/plisio.py::plisio_callback via rgb_contracts.infra.brand_observability.record_plisio_callback_replay_blocked.

Labels: status (one of the PlisioTxnStatus lower-cased values, e.g. pending, completed, expired; spaced labels like pending internal are normalized to pending_internal).
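
The label normalization described above amounts to lower-casing and joining words with underscores. A minimal sketch, with `status_label` as a hypothetical helper name:

```python
def status_label(status: str) -> str:
    """Lower-case a status and join its words with underscores."""
    return "_".join(status.strip().lower().split())
```

For example, `status_label("pending internal")` yields `pending_internal`, which is the form to use in dashboard queries.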

What it counts: One increment per Plisio callback rejected by the (order_number, status, txn_id_first_16) Redis SETNX guard. The guard's TTL is 24h; a duplicate delivery within that window is acknowledged with the success-shaped _ok() payload (so Plisio stops retrying) but the deposit / coin_transaction state mutation is skipped.

Alarm thresholds

Warning: >=1/min sustained for >5 min.
Page: >=10/min sustained -- likely active replay attack against the webhook surface.

Diagnosis steps

Pull the WARN log line: kubectl logs deploy/wallet-service | grep "Plisio callback replay blocked". The status field there matches the metric label; order_first_8 appears only in the log line and is the value to use in the next step.
Cross-reference order_first_8 against the Plisio dashboard: confirm whether the merchant API shows a single delivery or multiple.
If Plisio shows a single delivery and our counter shows replays, pivot to "captured signature": rotate PLISIO_SECRET immediately and audit the coin_transaction rows for the affected order_number against the Plisio source-of-truth.

Escalation: Payments domain owner. Treat >=10/min sustained as a security incident.

`plisio_callback_replay_redis_outage_total` (added by T1-D-C5/I6)

Emitter: servers_v2/wallet_service/app/api/routes/plisio.py::_claim_plisio_callback_slot via rgb_contracts.infra.brand_observability.record_plisio_callback_replay_redis_outage.

Labels: env -- one of production (boot env resolved as production/staging/prod via rgb_contracts.infra.runtime_env.is_production_runtime_env) or non_production (everything else).

What it counts: One increment per inbound Plisio callback whose replay-cache SETNX raised an exception (Redis pod down, network blip, auth expiry). Behaviour by env after the increment:

env="production": handler fails closed, returns HTTP 503 to Plisio so it retries the legit callback. Better to retry a legitimate delivery than re-credit a replay.
env="non_production": handler logs WARN and proceeds best-effort so non-production workflows keep moving.
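
The env-dependent behaviour can be sketched as two small functions. These are illustrations: `env_label` mirrors the documented mapping but is not is_production_runtime_env, the case-insensitive match is an assumption, and `on_replay_cache_error` is a hypothetical name.

```python
PRODUCTION_ENVS = frozenset({"production", "staging", "prod"})


def env_label(runtime_env: str) -> str:
    """Map a boot env name onto the two-valued env label."""
    if runtime_env.strip().lower() in PRODUCTION_ENVS:
        return "production"
    return "non_production"


def on_replay_cache_error(runtime_env: str, record_outage):
    """Return 503 to fail closed, or None to proceed best-effort."""
    label = env_label(runtime_env)
    record_outage(label)  # the counter increments in both cases
    if label == "production":
        return 503  # Plisio retries the callback later
    return None
```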

Alarm thresholds

Warning: env="production" AND any non-zero rate sustained >1 min.
Page: env="production" AND >=10/min sustained -- the deposit pipeline is wedging until Redis recovers.

Diagnosis steps

Confirm the Redis incident: check the wallet_service Redis client connectivity (kubectl logs deploy/wallet-service | grep "Plisio callback replay cache unavailable").
Cross-reference with the wallet outbox publisher Redis health -- same Redis target, so both should be flapping together.
Once Redis is back: confirm the counter rate returns to zero and Plisio's retry queue drains.

Escalation: Platform on-call (Redis) AND payments domain owner.

`wallet_topology_default_brand_unprimed_total` (delivered P2-δ)

Emitter: servers_v2/wallet_service/app/services/wallet_topology_store.py::_record_default_brand_unprimed (line ~1080), invoked from _brand_filter_clause when a topology or policy lookup arrives with brand_id=None AND the default-brand cache has not been primed at lifespan startup (_resolve_default_brand_id).

Labels: none -- the unprimed default-brand cache is a process-wide condition, not per-command. The companion counter wallet_topology_default_brand_fallback_total{caller_origin} already gives the per-call-site breakdown when the default brand is primed.

What it counts One increment per topology / policy lookup that arrived with no caller-supplied brand_id AND found the cached default brand id unset. Because _brand_filter_clause hard-fails in that branch (post-0024c the legacy IS NULL filter matches zero rows and would silently break every downstream lookup), every increment maps 1:1 to a RuntimeError raised to the caller. The counter exists so the alarm fires before the user-visible error surfaces.
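
A minimal sketch of the 1:1 relationship between the counter and the RuntimeError, assuming a module-level cache and a plain Counter as a stand-in for the metric. The function names `prime_default_brand` and `brand_filter_value` are hypothetical and are not the names used in wallet_topology_store.py.

```python
from collections import Counter
from typing import Optional

# Hypothetical stand-ins for the process-wide cache and the metric.
_state = {"default_brand_id": None}
unprimed_total = Counter()


def prime_default_brand(brand_id: int) -> None:
    """What the lifespan startup hook is expected to do once per process."""
    _state["default_brand_id"] = brand_id


def brand_filter_value(brand_id: Optional[int]) -> int:
    """Resolve the brand to filter on, hard-failing when nothing is available."""
    if brand_id is not None:
        return brand_id  # the caller supplied a brand: the cache is not consulted
    cached = _state["default_brand_id"]
    if cached is None:
        # Count first, then fail: every increment pairs with one raised error.
        unprimed_total["wallet_service"] += 1
        raise RuntimeError("default-brand cache not primed")
    return cached
```

Callers that pass brand_id explicitly never touch the counter, which is why updating the calling code is listed as a valid mitigation alongside priming the cache.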

When it normally fires Steady state: zero. The lifespan handler in wallet_service/app/main.py MUST call _resolve_default_brand_id at startup; if it does, this counter never increments. Any non-zero rate means lifespan startup did not prime the cache (incomplete deploy template, broken boot order, or a test harness that bypassed lifespan).

Alarm thresholds

Page: ANY non-zero rate. Every increment is a hard failure on the calling wallet command.
Hard incident: sustained non-zero AND user-visible 5xx on wallet write paths -- the entire wallet write surface for the default brand is wedged.

What it means when it fires The wallet pod started without priming the default-brand cache. Until the cache is primed (or the calling code is updated to pass brand_id explicitly) every legacy caller without request brand context fails. This is distinct from wallet_topology_default_brand_fallback_total, which counts successful default-brand fallbacks (cache primed, brand resolved).

Diagnosis steps

kubectl logs deploy/wallet-service | grep "default-brand cache" -- check whether the lifespan handler logged the prime call.
Restart the affected wallet pod; the lifespan handler should prime the cache on boot. Re-check the counter -- if it stays non-zero, the lifespan code path is broken.
Cross-check the affected request: _brand_filter_clause raises RuntimeError("default-brand cache not primed"), so the calling command will surface the error to the user.

Likely fix

Boot-order regression: confirm app/main.py lifespan calls _resolve_default_brand_id before the first inbound request.
Test harness leaking into prod: ensure no test-only entry point is mounted on the production app.
Migration gap: confirm migration 0023_seed_default_brand.py ran and the brand table has at least one row with the documented default brand_code.

Escalation Wallet domain owner immediately. Page deploy lead if a recent deploy correlates with the spike.

`recovery_sms_delivery_failed_total` (added by Codex P1-#6)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/recovery_brand.py::record_recovery_sms_delivery_failed, called from _send_recovery_code in servers_v2/player_service/app/api/routes/recovery.py and servers_v2/agent_service/app/api/routes/recovery.py whenever the SMS path returns not-sent OR raises.

Labels: service. One of player_service / agent_service.

What it counts One increment per /recovery/request whose SMS branch could not deliver -- either deliver_sms returned False (provider not configured / provider replied not-sent) or the call raised. The user-visible response is unchanged (canonical "if account exists" envelope) so the SMS failure does not become an existence oracle; this counter is the ONLY operator-visible signal.

When it normally fires A small steady rate is expected when the SMS provider is partially configured (e.g. dev environment). Production should be at or near zero. Any sustained non-zero rate against service="player_service" or service="agent_service" means real users are not receiving recovery codes.

Alarm thresholds

Warn: >5 per minute per service for 5+ minutes (provider degraded).
Page: >50 per minute per service (provider down -- recovery flow effectively offline).
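
As a rough sketch of how these two thresholds combine, the function below classifies a series of per-minute failure counts for one service. It is illustrative only: the function name and the input shape are assumptions, and the real alert rules live in the monitoring stack rather than in application code.

```python
from __future__ import annotations


def classify_sms_failure_rate(per_minute: list[int]) -> str:
    """Classify failures-per-minute samples for one service, oldest first."""
    # Page: the latest minute alone exceeds 50 failures.
    if per_minute and per_minute[-1] > 50:
        return "page"
    # Warn: more than 5 failures in each of the last five minutes.
    last_five = per_minute[-5:]
    if len(last_five) == 5 and all(v > 5 for v in last_five):
        return "warn"
    return "ok"
```

Evaluate each service label separately; summing player_service and agent_service would trip the thresholds earlier than intended.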

What it means when it fires

(a) service="player_service": real player password recoveries are silently failing. The audit row still lands as REQUEST / success because the player WAS matched; only the SMS side channel failed.
(b) service="agent_service": same, for agent recoveries.

Diagnosis steps

Check the SMS provider status (CoolSMS / BulkSMS / iCode dashboard per the configured platform).
Search recent logs for recovery sms returned not-sent or recovery sms delivery exception lines (these now carry only contact_hash=<...> -- no PII).
Cross-reference the recovery_audit table: rows with action='REQUEST', outcome='success' against the affected window are the requests that hit the SMS path.
Confirm the SMS provider env vars have not been rotated / stripped on the affected pod.

Likely fix Restore the SMS provider credentials, retry. There is no automatic retry of the SMS itself -- the user has to re-issue /recovery/request.

Escalation Player-platform owner if the provider is healthy but the counter keeps firing (likely an issue with our deliver_sms integration rather than the provider).

`recovery_email_unprovisioned_total` (added by Codex P1-#6)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/recovery_brand.py::record_recovery_email_unprovisioned, called from _send_recovery_code in servers_v2/player_service/app/api/routes/recovery.py and servers_v2/agent_service/app/api/routes/recovery.py whenever the email path is selected.

Labels: service. One of player_service / agent_service.

What it counts One increment per /recovery/request with contact_type='email'. There is no transactional email provider in v2 yet, so this counter is the operator-visible signal that the pending-impl gap is being exercised by real traffic. Previously the route logged addr=<raw> code=<raw> to compensate -- that PII / OTP leakage was removed in P1-#6 and replaced with this counter + a hashed log line.

When it normally fires Background rate in any env that allows email recovery on the front end. Because there is no provider wired in, every increment is effectively a "user did not receive a recovery code" event.

Alarm thresholds

Warn: any sustained non-zero rate. The frontend should be hiding the email-recovery option when no provider is wired.
Page: persistent >1/min for >24h (means the gap is actively affecting users in the field).

What it means when it fires The email recovery code was generated, the audit row landed, the short-code key was stashed in Redis -- but no email left the service. The user will time out on the recovery flow.

Diagnosis steps

Confirm the provider gap is still real: grep for any email-bearing SMTP / SES integration in servers_v2/.
Check the frontend flow for the affected service to see how often the email recovery option is being offered.
Cross-reference recovery_audit for rows with metadata->>'channel' = 'email'.

Likely fix Provision a transactional email provider (SES / Postmark / similar) and wire it into rgb_contracts.infra as a sibling of sms.py. Until then: hide the email-recovery option in the frontend or route those flows back to SMS.

Escalation Player-platform owner. Track as a recovery-flow follow-up to P1-#6.

`gateway_session_legacy_key_used_total` (retired)

This counter and the gateway legacy session dual-read branch were removed by the Phase 4E dual-read sunset cleanup. Gateway now reads only user-session:brand:{brand_id}:{account} and no longer accepts the pre-Phase-4E user-session-{account} key.

Use gateway_jwt_session_missing_total for the retained JWT-valid-but-session-gone signal. If old dashboards still query gateway_session_legacy_key_used_total, delete or archive those panels; they refer to the pre-sunset transition path.

Diagnosis steps

Confirm the request has a resolved brand: check brand_resolution_failed_total{service="gateway"} and gateway logs for domain-resolution failures.
Confirm the player has an active brand-prefixed Redis row: redis-cli --scan --pattern 'user-session:brand:*:<account>'.
If the JWT is valid but the session key is absent, treat it as an expired or evicted session and use the normal login/session incident path keyed by gateway_jwt_session_missing_total.

Escalation Platform backend on-call.

`gateway_jwt_session_missing_total` (added by T1-D-C1)

Emitter: servers_v2/gateway/app/middleware/auth.py::_record_jwt_session_missing, called from JWTAuthMiddleware._try_attach_auth_state whenever a JWT verifies successfully but the brand-prefixed Redis session key does not return a matching session id.

Labels: service="gateway". No reason label -- the upstream brand_resolution_failed_total covers brand-mismatch cases. This counter is a single dashboard signal for "JWT looked good, but session was gone."

What it counts One increment per request where:

the JWT signature, kid, alg, sub, and required claims all validated, AND
the brand-prefixed Redis lookup returned None or did not match the JWT-bound session_id.

The user sees the v1 AUTH_REQUIRED envelope -- the same shape they would see for a bad/expired JWT -- but operationally this is a different class of failure that needs a different response.
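
The second condition can be sketched as below, using a plain mapping in place of Redis. The key format is the one documented for the gateway (user-session:brand:{brand_id}:{account}); the assumption that the stored value is the session id itself, and the function names, are illustrative.

```python
from collections import Counter
from typing import Mapping

# Hypothetical stand-in for gateway_jwt_session_missing_total{service}.
jwt_session_missing = Counter()


def session_key(brand_id: int, account: str) -> str:
    """Build the brand-prefixed session key."""
    return f"user-session:brand:{brand_id}:{account}"


def session_is_live(
    store: Mapping[str, str], brand_id: int, account: str, jwt_session_id: str
) -> bool:
    """Run after the JWT itself has verified; checks only the session row."""
    stored = store.get(session_key(brand_id, account))
    if stored is None or stored != jwt_session_id:
        # Missing row and mismatched session id land on the same counter.
        jwt_session_missing["gateway"] += 1
        return False
    return True
```

Note that a row written under a different brand_id looks identical to a missing row from the gateway's side, which is why cause (d) has to be confirmed through brand_resolution_failed_total rather than this counter.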

When it normally fires

Steady state: a small baseline rate is expected from session-revoke flows (admin force-logout, user "log out other devices") and from JWT_EXPIRE_MIN slop where the JWT is still in TTL but Redis has already expired the session row.
Right after a Redis flush / failover: a brief spike is expected as every active client hits a missing session.
After a MULTI_BRAND_ENFORCEMENT flip from observe to enforce: may briefly fire if the flip exposes brand-mismatched lookups that were silently routed to the wrong brand under observe.

Alarm thresholds

Warning: a sustained rate >10x the historical baseline for the affected pod.
Page: >1/sec sustained -- many users are seeing AUTH_REQUIRED while their JWTs are still valid. Almost certainly a Redis-side incident or a cutover regression.
Hard incident: this counter spikes to match the gateway's total request rate -- the entire authenticated surface is locked out.

What it means when it fires

(a) Redis evict-LRU pressure dropped session rows before JWT_EXPIRE_MIN -- check Redis memory utilization.
(b) Redis failover dropped non-replicated keys (sessions are not AOF-persisted in some envs); affected users see one AUTH_REQUIRED and have to log in again.
(c) An admin / user-driven session revoke ran for the affected account; expected.
(d) The brand-prefixed key was written under a different brand_id than the gateway is reading (cross-brand desync) -- in which case brand_resolution_failed_total should also be firing for the same requests.
(e) player_service failed to write the session row at login (rare; the route fails the login response in that case so the user never gets a JWT to start with).

Diagnosis steps

redis-cli INFO stats -- check the evicted_keys rate against the counter rate (evicted_keys is reported in the stats section; use redis-cli INFO memory for used_memory vs maxmemory). A correlation suggests cause (a).
redis-cli --scan --pattern 'user-session:brand:*' | wc -l and compare to the gateway's recently-active player count -- if the session pool is thin, eviction pressure or failover loss is likely.
kubectl logs deploy/gateway | grep "JWT session lookup failed" -- if this line is also appearing, the lookup raised before the miss was recorded, which suggests a Redis client bug or a connectivity blip.
Sample a single affected request_id end-to-end: gateway logs -> player_service login logs from the matching JWT issue time. If no login log exists at all, the JWT is stale or forged (different incident class).

Likely fix

Redis evict pressure: scale Redis up, or lower JWT_EXPIRE_MIN so TTL aligns with capacity.
Redis failover loss: enable AOF persistence on the session Redis, OR accept the one-time relogin and document.
Cross-brand desync: see brand_resolution_failed_total{reason="jwt_domain_mismatch"} for the canonical playbook.
Cutover regression: if this counter spikes immediately after a gateway or player_service deploy, revert and investigate the session-key contract before re-rolling.

Escalation Platform backend on-call. Page Redis owner if the rate correlates with evicted_keys or failover events.

`gateway_jwt_unknown_kid_total` (added by T4-D-I4)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_gateway_jwt_unknown_kid, called from JWTAuthMiddleware._try_attach_auth_state whenever the underlying JwtVerifier.decode raises JwtVerificationError("unknown kid ...").

Labels: service="gateway". Bounded.

What it counts One increment per request whose Authorization carries a JWT whose kid header is not present in the active JWT_PUBLIC_KEYS map. The user sees the v1 AUTH_REQUIRED envelope.

When it normally fires

Mid-rotation: tokens minted under the new kid arrive at gateway pods whose verifier cache still carries only the old key set. The spike drains as soon as the operator publishes on the jwt_public_keys_changed Redis channel and the per-pod listener invalidates the cache.
Right after the operator drops the old kid from JWT_PUBLIC_KEYS but before all clients refresh -- legacy still-valid-by-TTL tokens fail.

Alarm thresholds

Warning: any sustained non-zero rate that lasts longer than the publish→rebuild latency budget (~5 s in healthy clusters).
Page: sustained non-zero rate AND no concurrent rotation activity -- indicates either a misconfigured signer or attempted forgery.

Diagnosis sequence

Cross-reference: was a tools/jwt_kid/rotate_jwt_kid.sh recently run? If yes, watch for drain.
If no rotation is in flight, dump a sample request -- the JWT header carries the kid header parameter. Compare it against the operator-side allow-list.
If the kid is recognised but no pod accepts it, check the pub/sub listener: app.state.jwt_pubsub_task should be a live task, and app.state.jwt_verifier should be a non-None JwtVerifier carrying the expected kids in verifier.kids().

Escalation Platform backend on-call. If forgery is suspected, page security and freeze rotations.

`admin_ws_legacy_query_token_total` (added by T4-D-I7)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_admin_ws_legacy_query_token, called from the admin /api/v1/ws endpoint when the JWT was supplied via the legacy ?token= query string instead of the Sec-WebSocket-Protocol subprotocol.

Labels: none.

What it counts One increment per accepted WebSocket connection that authenticated via the legacy query-string fallback. Production-runtime requests using the query-string path are rejected (close code 4003) and do NOT increment this counter -- this counter is the dev/cutover indicator, not the attack indicator.

Alarm thresholds

Pre-cutover: expected to be non-zero. Watch for the trend to flatten to zero as front-ends migrate to the subprotocol API.
Post-cutover: must be zero. Set admin_ws_allow_query_token=False in production once it has been zero for the full soak window (recommend 7 days) so the legacy branch refuses new traffic permanently.

Escalation None as a steady-state alert. Send to the Backoffice frontend team when the counter has been silent long enough to flip the production flag.

`recovery_rate_limit_redis_outage_total` (added by T4-D-I5)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/recovery_brand.py::record_recovery_rate_limit_redis_outage, called from rate_limit_check_and_increment whenever a Redis exception is raised on the per-contact bucket of /recovery/request (and similar SMS-DOS-target buckets) AND the runtime resolves to a production env.

Labels: bucket="contact"|"ip"|"token". Bounded.

What it counts One increment per fail-closed rejection. The user sees the same generic envelope they would see for a normal rate-limit hit (no oracle leak between "you are limited" and "Redis is down so we denied you"). For per-IP and per-token buckets the Redis outage fails OPEN -- those buckets have no contact oracle and a fail-closed IP path locks out everyone behind a NAT during a Redis blip.
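
The per-bucket fail mode can be sketched as a single decision function. This is an illustration under stated assumptions: `allow_on_redis_error` is a hypothetical name, the Counter stands in for the metric, and the real helper takes a fail_closed_in_production flag from the route rather than deriving it from the bucket name as done here.

```python
from collections import Counter

# Hypothetical stand-in for recovery_rate_limit_redis_outage_total{bucket}.
redis_outage = Counter()


def allow_on_redis_error(bucket: str, is_production: bool) -> bool:
    """Decide whether a request proceeds when the rate-limit Redis call raised."""
    # Only the per-contact bucket fails closed, and only in production.
    fail_closed = bucket == "contact" and is_production
    if fail_closed:
        redis_outage[bucket] += 1  # one increment per fail-closed rejection
        return False
    # Per-IP and per-token buckets (and all non-production envs) fail open.
    return True
```

In this sketch the ip and token paths never increment the counter, so a non-zero value on a non-contact bucket label would point at a route using the wrong fail mode.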

When it normally fires

Should be zero. Any non-zero rate is either a Redis outage in flight or a route added with the wrong fail-mode flag.

Alarm thresholds

Warning: any non-zero rate.
Page: sustained non-zero rate AND no concurrent Redis incident -- indicates a routing / configuration bug.

Diagnosis sequence

Cross-reference Redis health: are evicted_keys / connected_clients abnormal? If yes, this is the safety gate firing as designed.
Check whether legitimate users are reporting "I never got the SMS code" -- if yes, file a follow-up to lift the limit cap once Redis recovers (the in-window denials cannot be retroactively allowed).
If Redis is healthy, inspect the route handler stack: only per-contact buckets should pass fail_closed_in_production=True.

Escalation Platform backend on-call + Redis owner.

`gateway_rate_limit_redis_outage_total` (added by T6-D2)

Emitter: servers_v2/gateway/app/middleware/rate_limit.py::_handle_redis_outage, called every time the gateway's per-IP sliding-window rate limiter catches a Redis exception. Emitted regardless of whether the request is then allowed or rejected, so the metric tracks the outage rate independent of the policy branch taken.

Labels: env (resolved via resolve_runtime_env, default dev) and reason (the Exception.__class__.__name__ of the Redis failure, e.g. ConnectionError, TimeoutError). Both bounded.

What it counts One increment per request that hit the Redis-outage handler. The follow-up policy branch depends on is_production_runtime_env():

Production-grade runtime: fail-closed -- the response is HTTP 503 with Retry-After set; the limiter refuses to silently bypass the limit during the exact window an attacker is most likely to use to brute-force login / SMS / withdrawal flows.
Dev / staging-without-prod-label: fail-open -- the request is allowed through so a developer running without a local Redis is not blocked. The counter still increments so dashboards see the outage.
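
A minimal sketch of the two branches, assuming a handler that returns either a (status, headers) rejection or None to let the request through. The function signature, the Retry-After value, and the Counter stand-in are assumptions; the increment-before-branch ordering and the reason label derived from the exception class name follow the description above.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical stand-in for gateway_rate_limit_redis_outage_total{env,reason}.
outage_total = Counter()


def handle_redis_outage(
    exc: Exception, env: str, is_production: bool
) -> tuple[int, dict] | None:
    """Return a rejection to send, or None when the request may proceed."""
    # Count first, so the metric is independent of the policy branch taken.
    outage_total[(env, type(exc).__name__)] += 1
    if is_production:
        # Fail closed. The Retry-After value here is an illustrative choice.
        return 503, {"Retry-After": "5"}
    return None  # fail open for local development without Redis
```

Keeping the reason label to the exception class name keeps cardinality bounded while still separating connection failures from timeouts.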

When it normally fires

Should be zero in production. Any non-zero rate is either a Redis outage in flight or a network blip on the gateway -> Redis path.
Non-zero in dev is expected when working without Redis running locally; do not alarm.

Alarm thresholds

Warning: any non-zero rate with env="production".
Page: sustained env="production" non-zero rate AND no concurrent Redis incident -- indicates a routing / configuration bug (e.g. wrong REDIS_URL, gateway pinned to a stale Redis IP).

Diagnosis sequence

Cross-reference Redis health: are evicted_keys / connected_clients abnormal? If yes, this is the safety gate firing as designed -- confirm the on-call Redis playbook is engaged.
Check the gateway logs for the matching Rate limiter Redis error warning and the reason label -- a uniform ConnectionError spike points at a network partition; a uniform TimeoutError spike points at Redis CPU saturation.
Cross-reference user reports: production fail-closed mode returns HTTP 503 from /api/v1/user/login, /api/v1/user/register, and the SMS endpoints -- treat any user-visible "service unavailable" spike as the same incident.
If Redis is healthy and gateway_rate_limit_redis_outage_total is climbing, inspect the gateway pod's network path to Redis (DNS resolution, TLS handshake, security-group rules).

Escalation Platform backend on-call + Redis owner. The fail-closed posture is intentional -- do not "temporarily" disable it without a decision from the security on-call.

`internal_caller_token_legacy_rejected_total` (added by T4-D-I2)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_internal_caller_token_legacy_rejected, called from every *_service/app/api/deps.py::require_internal_service_token when PER_CALLER_TOKEN_REQUIRED=on and the request fell through to the legacy single-token fallback path.

Labels: consumer (the receiving service) and caller (the asserted X-Caller-Service header, or "unknown" when absent). Bounded -- both truncated to 32 chars.

What it counts One increment per request rejected with HTTP 403 because the operator flipped PER_CALLER_TOKEN_REQUIRED on but the request still arrived via the legacy single-token path -- meaning the caller has not been migrated to its INTERNAL_SERVICE_TOKEN_<CALLER> env var yet.
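
The rejection path can be sketched roughly as follows. Several details are assumptions made for the example and are not confirmed by the source: the legacy variable name INTERNAL_SERVICE_TOKEN, the 401 for a token that matches nothing, the consumer holding the per-caller secrets in its own environment, and the `authorize` function itself. The per-caller variable pattern, the PER_CALLER_TOKEN_REQUIRED flag, the "unknown" caller, and the 32-character truncation come from the text above.

```python
from __future__ import annotations

import hmac
from collections import Counter

# Hypothetical stand-in for internal_caller_token_legacy_rejected_total.
legacy_rejected = Counter()


def _matches(presented: str, expected: str) -> bool:
    """Constant-time comparison; an empty expected value never matches."""
    return bool(expected) and hmac.compare_digest(presented.encode(), expected.encode())


def authorize(env: dict, consumer: str, caller_header: str | None, token: str) -> int:
    """Return the HTTP status the consumer would answer with."""
    caller = (caller_header or "unknown")[:32]
    if _matches(token, env.get(f"INTERNAL_SERVICE_TOKEN_{caller.upper()}", "")):
        return 200  # migrated caller using its per-caller token
    if _matches(token, env.get("INTERNAL_SERVICE_TOKEN", "")):
        # The request fell through to the legacy single-token path.
        if env.get("PER_CALLER_TOKEN_REQUIRED") == "on":
            legacy_rejected[(consumer[:32], caller)] += 1
            return 403
        return 200
    return 401  # assumption: an unknown token is rejected without counting
```

Under these assumptions the counter only moves when the legacy token is valid but no longer acceptable, which is what makes a non-zero rate a migration signal rather than an attack signal.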

When it normally fires

Should be zero at all times once Phase 16 (the Multi-Brand hard flip) has completed and PER_CALLER_TOKEN_REQUIRED=on rolled out.
Pre-flip: must be zero before the flip can proceed (see docs/runbooks/multi-brand/phase-16-hard-flip-checklist.md).
Right after a new internal HTTP client is added: a spike points at the new client forgetting to set its per-caller env var.

Alarm thresholds

Warning: any non-zero rate.
Page: sustained non-zero rate for the same (consumer, caller) pair lasting more than 5 minutes -- a known caller is broken.

Diagnosis sequence

Identify the offending (consumer, caller) pair from the counter labels.
On the caller's container, printenv | grep INTERNAL_SERVICE_TOKEN_<UPPER(caller)> -- if empty, the env var was never deployed.
On the consumer's container, printenv PER_CALLER_TOKEN_REQUIRED -- if on, this is the expected rejection path. If unset / off, the rejection path should not have triggered; investigate.
Either deploy the missing per-caller env var OR roll back PER_CALLER_TOKEN_REQUIRED on the consumer until the caller is migrated.

Escalation Platform backend on-call + the team that owns the offending caller service.

`agent_balance_legacy_write_total` (added by T4-D-I3)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_agent_balance_legacy_write, called from direct agent.balance mutation paths that still bypass the future wallet_service agent-money command boundary:

admin_service POST /api/v1/agents/{id}/adjust-balance
admin_service POST /api/v1/agents/withdrawals/{id}/decline
admin_service POST /api/v1/agents/withdrawals/{id}/pay-decline
agent_service POST /agent/withdraw/add
agent_service legacy alias POST /withdraw/add

Labels: change_type="add"|"sub"|"unknown".

What it counts Each direct-write call. Goal: zero after all agent-money writes, including withdrawal refund compensation paths, have migrated to wallet_service.

Alarm thresholds

Pre-migration: expected non-zero baseline (admin balance adjustments are normal back-office activity).
Post-migration: must be zero. Any non-zero rate after the migration PR has shipped indicates the old code path is still reachable (rollback, A/B traffic, etc.) and needs investigation.

Escalation Platform backend on-call.

`agent_cross_brand_rejected_total`

Emitter: servers_v2/agent_service/app/services/brand_check.py::_record_mismatch (line ~105). Invoked from agent-write helpers when the request brand disagrees with the target row's brand_id.

Labels: command, mode. command is the agent operation name (e.g. agent_message_read, agent_password_reset). mode is observe or enforce.

What it counts One increment per agent_service write whose target row brand differs from the request brand. Observe proceeds with request brand; enforce raises CrossBrandRejected and the route returns the v1 reject envelope.
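
The observe/enforce split can be sketched as below. The `check_brand` helper and the Counter are hypothetical stand-ins, and the CrossBrandRejected class here is a local placeholder for the real exception; the behaviour (count in both modes, proceed with the request brand under observe, raise under enforce) follows the description above.

```python
from collections import Counter


class CrossBrandRejected(Exception):
    """Local placeholder for the exception raised in enforce mode."""


# Hypothetical stand-in for agent_cross_brand_rejected_total{command,mode}.
cross_brand_rejected = Counter()


def check_brand(command: str, request_brand: int, target_brand: int, mode: str) -> int:
    """Return the brand the write should proceed under, or raise in enforce mode."""
    if request_brand == target_brand:
        return request_brand
    # A mismatch is counted in both modes so observe-mode soak data is usable.
    cross_brand_rejected[(command, mode)] += 1
    if mode == "enforce":
        raise CrossBrandRejected(
            f"{command}: request brand {request_brand} != target brand {target_brand}"
        )
    return request_brand  # observe: proceed with the request brand
```

The observe-mode count is therefore a preview of what enforce mode would reject, which is why it must reach zero before the hard flip.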

When it normally fires

Pre-flip soak: must reach zero before Phase 16 hard flip.
Post-flip steady state: zero.

Alarm thresholds

Warning: any non-zero mode=observe rate during the soak window.
Page: >=1/min of mode=enforce.

Diagnosis steps

kubectl logs deploy/agent-service | grep "agent cross-brand mismatch" -- look at command, request_brand, target_brand, mode.
If concentrated on one command, the originating caller is not forwarding X-Brand-Id correctly.
Cross-check the originating service via request_id.

Likely fix Same shape as wallet_cross_brand_rejected_total: hot-fix the caller's brand-forwarding code OR flip the affected service to observe.

Escalation Agent domain owner immediately for mode="enforce" rates.

`game_cross_brand_rejected_total`

Emitter: servers_v2/game_service/app/services/brand_check.py::_record_mismatch (line ~105). Invoked from game-write helpers (e.g. callback-driven state mutations) when the request brand disagrees with the target row's brand_id.

Labels: command, mode. Same shape as the wallet variant.

What it counts One increment per game_service write whose target row brand differs from the request brand. Observe proceeds with request brand; enforce raises CrossBrandRejected.

When it normally fires Same gates as wallet_cross_brand_rejected_total; pre-flip soak must reach zero.

Diagnosis steps

Filter game_service logs for event=cross_brand_rejected. Look at command, request_brand, target_brand, mode.
For callback-driven writes, compare the Aggregator-authenticated external player, signed relay context, player_service lookup result and the persisted provider transaction's brand_id. The signed identity and authoritative player row must resolve to the same brand; no account-prefix reverse parsing or fallback exists.
Correlate the Aggregator, gateway, game_service, player_service and wallet_service logs using the same request ID before changing or replaying any money state.

Likely fix Same shape as the wallet variant.

Escalation Game domain owner immediately for mode="enforce" rates.

`promotion_cross_brand_rejected_total`

Emitter: servers_v2/promotion_service/app/services/brand_check.py::_record_mismatch (line ~105). Invoked from promotion-write helpers (coupon use, saga steps, settlement projections) when the request brand disagrees with the target row's brand_id.

Labels: command, mode. Same shape as the wallet variant.

Diagnosis steps

kubectl logs deploy/promotion-service | grep "promotion cross-brand mismatch".
Most common cause: a coupon row pre-dates the brand backfill OR a saga step lost the brand context across a Redis Stream boundary; check the upstream wallet event payload for brand_id.

Likely fix / escalation Same shape as the wallet variant. Promotion domain owner.

`recon_cross_brand_rejected_total`

Emitter: servers_v2/recon_service/app/services/brand_check.py::_record_mismatch (line ~115). Invoked when a recon match attempts to mutate a row whose brand differs from the matched deposit's brand.

Labels: command, mode. Same shape as the wallet variant.

Diagnosis steps

kubectl logs deploy/recon-service | grep "recon cross-brand mismatch".
Most common cause: shooter_sms.brand_id was inferred incorrectly at match time; cross-check the matched player_deposit.brand_id.

Likely fix / escalation Same shape as the wallet variant. Recon domain owner.

`rolling_cross_brand_rejected_total`

Emitter: servers_v2/rolling_service/app/services/brand_check.py::_record_mismatch (line ~108). Invoked from the rolling event consumer and internal rolling routes when the request brand disagrees with the target rolling row's brand_id.

Labels: command, mode. Same shape as the wallet variant.

Diagnosis steps

kubectl logs deploy/rolling-service | grep "rolling cross-brand mismatch".
Most common cause: a wallet event landed on the rolling stream without a brand_id in the envelope (legacy schema_version=1 event escaped the soak window). Cross-check event_legacy_schema_total{stream,consumer="rolling_service"}.

Likely fix / escalation Same shape as the wallet variant. Rolling domain owner.

`pii_aes_unconfigured_total` (T6-F2)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/aes_guard.py, called by encrypt_pii_strict / decrypt_pii_strict whenever the service-level AES_KEY / AES_IV are missing or empty and the helper falls back to plaintext (allowed only outside production).

Labels: service (e.g. admin_service, agent_service, player_service), op (one of encrypt, decrypt).

What it counts One increment per call that bypassed AES because the key material was not configured. In production the boot guard refuses to start a service with empty key material, so a non-zero rate in production means either the boot guard was bypassed (a deploy regression) or a code path constructed its own settings shim that did not inherit the keys.

When it normally fires

Local development without .env. Expected non-zero. The strict helpers log a WARN per call so the developer sees the fallback.
CI / unit-test paths that intentionally exercise the dev fallback. These do not export Prometheus, so the counter is invisible there.

Alarm thresholds

Production: ANY non-zero rate is a page. PII at rest is supposed to be encrypted; a single increment means at least one row was written in the clear.
Staging: warn-on-non-zero-for-15-min so we catch boot-guard drift before a prod cutover.

Diagnosis sequence

Identify the offending service via the service label.
Exec into the pod (kubectl exec) and dump the resolved settings -- AES_KEY and AES_IV must be non-empty SecretStr values. If either is empty, the boot guard either never ran or was patched out.
If the call came from an admin/agent route, check the request path against the audit log to identify which row was written. That row's PII fields are plaintext and must be re-encrypted via a one-shot rotation script.
Fix the env / settings, redeploy, then re-encrypt affected rows.

Escalation Security on-call. Compliance must be notified for any PII written in the clear in production.

`pii_aes_gcm_decrypt_failed_total`

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/aes_guard.py, from decrypt_pii_strict when a versioned aesgcm:v1: PII ciphertext cannot authenticate.

What it counts One increment per invalid AES-256-GCM ciphertext. Unlike legacy base64-shaped AES-CBC/plaintext ambiguity, the aesgcm:v1: prefix is an unambiguous encrypted-data marker, so production reads raise AesPiiDecryptError immediately rather than returning the value.

Diagnosis sequence

Identify the service label and affected request path.
Check whether the row was manually edited, truncated, or encrypted with the wrong AES_KEY.
Restore from a known-good row backup or re-encrypt from a trusted source of truth. Do not disable AES_STRICT_DECRYPT; that flag only applies to ambiguous legacy CBC/plaintext reads.

`admin_ws_rate_limited_total` (T6-F3)

Emitter: servers_v2/shared/contracts/src/rgb_contracts/infra/brand_observability.py::record_admin_ws_rate_limited, called from the admin /api/v1/ws endpoint when a connection attempt is rejected by the per-IP accept-rate limiter installed in T4-D-I7.

Labels: reason -- one of:

per_ip_window_exceeded -- IP exceeded the configured accept-per-min budget (admin_ws_rate_limit_per_min).
redis_outage -- the limiter could not consult Redis and failed closed in production (or failed open in dev), following the same env-aware policy as the gateway global limiter.

What it counts One increment per rejected WebSocket accept attempt. Each rejection closes the socket with code 4029 (custom for "ws rate limited") and does NOT mint an admin session.
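
A per-IP accept budget of this kind can be sketched with an in-memory fixed-window counter. This is an assumption-heavy illustration: the real limiter is Redis-backed and its window algorithm is not described here, and the class and constant names are hypothetical. Only the 60 s default window and close code 4029 come from this section.

```python
from collections import defaultdict

WS_CLOSE_RATE_LIMITED = 4029  # close code used for "ws rate limited"


class PerIpAcceptLimiter:
    """Fixed-window accept counter keyed by (ip, window index)."""

    def __init__(self, per_min: int, window_s: int = 60):
        self.per_min = per_min
        self.window_s = window_s
        # Sketch only: old windows are never purged from this dict.
        self._hits = defaultdict(int)

    def try_accept(self, ip: str, now: float) -> bool:
        """Record one accept attempt; False means close with WS_CLOSE_RATE_LIMITED."""
        bucket = (ip, int(now // self.window_s))
        self._hits[bucket] += 1
        return self._hits[bucket] <= self.per_min
```

Passing the clock in as `now` keeps the sketch deterministic. A refresh storm from one tab exhausts only that IP's bucket and recovers at the next window boundary, matching the self-correcting behaviour described for this counter.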

When it normally fires

Burst from a single browser tab refresh storm. Should self-correct within one window (60 s default).
Operator dashboard mass-reload during deploy. Mostly harmless.

Alarm thresholds

Warning: sustained non-zero rate from a single IP across more than 3 windows. Could indicate a misbehaving frontend client.
Page: reason="redis_outage" non-zero in production -- the limiter is the only protection between the admin panel and unbounded session churn, and a Redis outage that lasts longer than ~30 s should be investigated regardless of which counter fires.

Diagnosis sequence

Pivot the counter by IP from access logs (the limiter does not emit IP as a label to keep cardinality bounded).
If the rejections come from a single IP, ask the BO frontend team whether the user reports a refresh loop or a stuck reconnect timer. Have the user close or reload the offending tab.
If reason="redis_outage", cross-check Redis pod health and the service's own redis_connect_errors_total.

Escalation

Backoffice team for client-side issues; platform on-call for Redis outages.
