Multi-Brand Isolation Rollout Runbook

Status

Ready

2026-07-17 Aggregator-only update: Any native game-provider callback, account-prefix reverse-parsing, callback-DLQ replay, BRAND_CATALOG_CHANGED, or VERIFY_CALLBACKS procedure later in this historical rollout record is retired and must not be executed. Current game ingress is limited to the signed provider-protocol relay-backed Seamless boundary documented in docs/services/game-service.md.

Last Verified Commit

Use git log -- <this file> for current last-touch history; this field is intentionally not pinned to a static hash so it does not become stale after unrelated commits.

Date

2026-04-29

Owners

Platform Backend
Player Domain
Wallet Domain
Agent Domain

Affected Services

gateway
player_service
wallet_service
wallet_worker
rolling_service
rolling_worker
promotion_service
promotion_worker
game_service
agent_service
agent_worker
admin_service
admin_worker
recon_service
recon_worker

Related Documents

docs/adr/ADR-009-multi-brand-domain-routed-isolation.md
docs/specs/multi-brand/2026-04-27-multi-brand-isolation-spec.md
docs/plans/multi-brand/2026-04-27-multi-brand-isolation-plan.md
docs/runbooks/local-docker/local-docker-full-stack.md
docs/runbooks/wallet/ruby-wallet-topology-rollout.md

On-call TL;DR

This rollout enables multi-brand isolation on a previously single-brand platform. The biggest risks are:

a Phase 16 enforcement flip that locks legitimate users out of one or more brands (revert per Phase 16 mid-flip procedure)
a brand-resolution failure that misroutes money (use Diagnosis Playbook below; flip the affected service to observe to buy time)
an Aggregator callback HMAC, brand credential, or wallet mapping failure (inspect gateway/game_service logs and Aggregator callback logs)

Who to page: wallet domain owner for any wallet-related alert; agent domain owner for agent_brand allow-list issues; platform on-call for everything else.

Where to look:

Per-brand request rates and rejection counters in the multi-brand-isolation Grafana dashboard
Per-service enforcement mode via multi_brand_enforcement_mode gauge
Live brand catalog: psql -c "SELECT brand_id, brand_code, status FROM brand" (canonical source); Redis KEYS domain:agent:* and KEYS domain:level:* for active domain bindings

Goal

Validate and roll out multi-brand isolation safely. After this rollout the platform can host more than one brand on the same servers_v2 runtime, with brand-scoped player identity, wallet state, rolling, settlement, and configuration.

This runbook is not approved for production enablement of a second brand until the related spec, implementation plan, tests, and migrations are complete and the release gate below passes.

Preconditions

ADR-009 is Accepted.
Spec status is Approved.
Implementation plan done definition is satisfied.
Migrations are reviewed and reversible for non-production environments.
A default brand has been seeded; every existing row has been backfilled with brand_id = default.
An agent_brand row of the form (agent_id, default_brand_id, status='enabled') has been seeded for every existing agent row; without it Phase 16 hard-flip will reject every existing agent.
The Redis maps domain:agent:* and domain:level:* have been rewritten to bind every existing domain to the default brand.
Worker health reflects forward progress per ADR-003.
Wallet ledger and outbox failures surface per ADR-004.
Alerting exists for every counter on which the Phase 16 release gate or an in-runbook alarm depends. Every name below maps to an emitted counter in servers_v2/shared/contracts/src/rgb_contracts/infra/ or per-service code; full per-counter playbooks live in diagnosis-playbook.md.
- Brand resolution / forwarding — brand_resolution_failed_total{reason,service} (the service="gateway" slice is the canonical edge alarm), agent_brand_not_allowed_total, recovery_brand_mismatch (envelope reason; surfaces via brand_resolution_failed_total{reason="recovery_brand_mismatch"}).
- Wallet brand-signature defence-in-depth — wallet_brand_signature_failed_total{caller_service,reason,mode}, wallet_brand_signature_missing_total{caller_service}, wallet_brand_signature_misconfigured_total{mode} (P1-4), wallet_brand_signature_replay_blocked_total (failure-reason slice of the previous), wallet_brand_signature_replay_redis_outage_total{caller_service,mode} (delivered P2-δ).
- Wallet topology — wallet_topology_default_brand_unprimed_total (delivered P2-δ; pages on any non-zero rate — every increment is a hard runtime failure), wallet_topology_default_brand_fallback_total{caller_origin} (per-brand-onboarding signal).
- Cross-brand rejections (per service) — wallet_cross_brand_rejected_total{command,mode}, agent_cross_brand_rejected_total{command,mode}, game_cross_brand_rejected_total{command,mode}, promotion_cross_brand_rejected_total{command,mode}, recon_cross_brand_rejected_total{command,mode}, rolling_cross_brand_rejected_total{command,mode}.
- Aggregator callback edge — the required RGB relay target is complete and HTTPS; invalid relay key, timestamp, signature, path scope, provider authentication or signed player context is rejected; all native provider callback paths return 404.
- Per-caller internal-token rollout — internal_caller_token_legacy_total{consumer,caller} (Stage C drain signal), internal_caller_token_legacy_rejected_total{consumer,caller} (T4-D-I2; pages on any non-zero rate once PER_CALLER_TOKEN_REQUIRED=on).
- Event schema drain — event_legacy_schema_total{stream,consumer} (must be zero before flip).
- Gateway session + JWT — gateway_jwt_session_missing_total (T1-D-C1; baseline + spike alarm after the Phase 4E dual-read fallback removal), gateway_jwt_unknown_kid_total (T4-D-I4; tracks RS256 kid rotation pub/sub propagation).
- Admin operator-identity + audit — admin_operator_id_mismatch_total{reason} (P1-3; pages on any non-zero rate -- spoof or legacy token), admin_audit_write_failed_total{route} (P1-4; pages on any non-zero rate -- money committed without audit row), agent_balance_legacy_write_total{change_type} (T4-D-I3; baseline non-zero pre-migration to wallet_service).
- Admin WebSocket auth (T4-D-I7) — admin_ws_legacy_query_token_total (cutover trend, must be zero before flipping admin_ws_allow_query_token=False in production), admin_ws_rate_limited_total{reason}.
- PII boot guard (T1-D-C3) — pii_aes_unconfigured_total{service,op} (pages on any non-zero rate in production -- the boot guard refuses to start prod-grade pods, so a non-zero prod sample is a high-severity alert).
- Plisio webhook hardening (T1-D-C5/I6) — plisio_callback_replay_blocked_total{status}, plisio_callback_replay_redis_outage_total{env}.
- Recovery flow (Codex P1-#5/#6, T4-D-I5) — recovery_sms_delivery_failed_total{service}, recovery_email_unprovisioned_total{service}, recovery_rate_limit_redis_outage_total{bucket} (T4-D-I5; pages on any non-zero rate when no concurrent Redis incident).
- Enforcement-mode posture — security_downgrade_total{service} (page; service started in non-enforce mode while >1 brand enabled), multi_brand_enforcement_mode{service} gauge.
- Per-brand traffic — per-brand wallet outbox publication freshness (existing wallet outbox alert, scoped per brand label) and per-brand request_total{brand_code,service} zero-traffic alert.
Game provider operations have verified Aggregator launch-to-callback player mappings and signed relay context for every enabled provider protocol.

Local Docker Validation

Start the full local stack using the local Docker runbook.
Apply multi-brand migrations and run the default brand backfill.
Confirm brand, brand_config, and agent_brand exist and that every brand-scoped table has zero NULL brand_id rows.
Seed a second brand brand2 with a distinct brand_code, a distinct default currency, and at least one bound domain.
Bind one domain:agent:* and one domain:level:* Redis entry to brand2.
Confirm gateway, player_service, wallet_service, rolling_service, promotion_service, game_service, agent_service, admin_service, and recon_service health is green.
Confirm every worker reports forward-progress readiness.
Run unit and contract tests for:
- shared brand contracts
- migrations and backfill
- gateway brand resolution and JWT/domain mismatch
- player_service cross-brand registration and recovery
- agent_service allow-list enforcement
- wallet_service cross-brand command rejection per command
- wallet_service per-brand topology and policy resolution
- rolling_service brand-scoped consumption and progress
- promotion_service per-brand resolution
- game_service Aggregator-only catalog/launch, relay HMAC/context and provider-protocol wallet handlers
- recon_service brand-scoped match and approval
- admin_service brand catalog, brand_config, agent_brand CRUD
- admin_service staff removal (the seven deleted legacy route files no longer mount any router; previously served paths under /api/admin/user/*, /api/admin/pushbullet/*, /api/admin/shooter/*, /api/admin/web/rules/*, /api/admin/web/faq/*, /api/admin/web/config/* return 404)
- observability log fields and metric labels
Run two-brand local end-to-end flows:
- register the same account string under default and brand2, confirm two distinct player_id values
- login on default, attempt a request with that JWT against brand2's domain, confirm gateway rejects it
- deposit, bet, settle, rolling, withdraw on each brand independently, confirm no cross-brand state leak
- issue a coupon on brand2, use it on brand2, attempt the same coupon code on default, confirm rejection
- launch a game for brand2, send the provider callback through Aggregator, and confirm the signed external-player context resolves into brand2 with no cross-brand wallet mutation
- run a recon match against a brand2 deposit, confirm the approval resolves into brand2 only
- change a brand_config value on brand2 (e.g. cashback rate) and confirm brand2 settlements use the new value while default settlements continue to use the global default
Confirm every money movement has immutable wallet_ledger rows carrying brand_id.
Confirm wallet, rolling, promotion, and player outbox events carry brand_id in payload.
Confirm structured logs and Prometheus metrics carry brand labels.

Running the two-brand E2E harness

Phase 14 ships a self-contained pytest harness that orchestrates the local two-brand validation step (the "Run two-brand local end-to-end flows" step above, item 9 of Local Docker Validation) against a running servers_v2 Docker stack.

Layout:

Seed script: servers_v2/tools/multi_brand_backfill/setup_local_two_brand.py -- idempotent. Ensures default and brand2 exist, ensures the deterministic seed agent exists, seeds the standard brand2 brand_config keys with cashback_rate=0.05 so per-brand overrides are visible, seeds agent_brand, brand-scoped agent_setting, and agent_domain for (agent_id=1, default) and (agent_id=1, brand2), and writes the two domain:agent:* Redis bindings:
```
domain:agent:default.local -> {"brand_id": <default>, "agent_id": 1}
domain:agent:brand2.local  -> {"brand_id": <brand2>,  "agent_id": 1}
```
Harness: servers_v2/tests/test_multi_brand_two_brand_e2e.py. Gated behind both the local_docker_e2e pytest marker AND LOCAL_DOCKER_E2E=1. A normal uv run pytest invocation skips the whole module (the pure-logic helpers in tests/test_setup_local_two_brand_unit.py continue to run).

Prerequisites:

servers_v2 docker compose stack is up and healthy (docker compose -f docker-compose.yml -f docker-compose.dev.yml ps shows every service healthy).
The seed script can run against a migration-only local stack. It also works against an already bootstrapped stack; existing seed rows are upserted or left in place. tools/multi_brand_backfill/verify.py must exit 0 after the seed (the harness re-runs it as a session-scoped guard).
Host name resolution: any browser-side test would need default.local and brand2.local in /etc/hosts pointing at 127.0.0.1. The harness uses the Host: request header instead so no /etc/hosts editing is required.
MULTI_BRAND_ENFORCEMENT=observe must be set explicitly for this harness (the recipe below exports it as RGB_MULTI_BRAND_ENFORCEMENT). The repository's compose default is now enforce; Phase 14's local two-brand harness still validates the observe-mode soak behavior. Do NOT run this harness with enforce set.
RGB_JWT_KID, RGB_JWT_PRIVATE_KEY, and RGB_JWT_PUBLIC_KEYS are set in .env.compose.local; login and admin-token checks are part of the harness. Local env files may use escaped \n PEM newlines.
NO_PROXY includes localhost,127.0.0.1,::1 so Python/httpx does not send local Docker traffic through a macOS or corporate HTTP proxy.
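
The Host-header approach from the prerequisites can be sketched with the standard library. This is an illustration, not the harness code: the gateway port (8000) and the /health path are assumptions, and only default.local and brand2.local come from the seed script.

```python
from urllib.request import Request

# Assumption: the gateway is published on localhost:8000; adjust to your compose port mapping.
GATEWAY_BASE = "http://127.0.0.1:8000"


def brand_request(path: str, brand_host: str) -> Request:
    """Build a request that connects to localhost but presents a brand domain as Host."""
    return Request(f"{GATEWAY_BASE}{path}", headers={"Host": brand_host})


# Same TCP target, two different brand domains; no /etc/hosts entry is involved.
default_req = brand_request("/health", "default.local")
brand2_req = brand_request("/health", "brand2.local")
```

Because the brand is carried only in the Host header, the same localhost socket can exercise both brands in one test session.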

Run it (preferred: this is the same recipe that make verify-servers-v2-two-brand executes, reproduced here so an operator can step through it manually):

cd servers_v2
export RGB_MULTI_BRAND_ENFORCEMENT=observe
docker compose -f docker-compose.yml -f docker-compose.dev.yml --env-file .env.compose.local up -d
# Load .env.compose.local into the current shell. Do NOT use
# ``set -a; . ./.env.compose.local; set +a`` — if your local
# PEM blocks are unescaped multi-line, the shell parses
# ``-----BEGIN PRIVATE KEY-----`` as a command and the source
# aborts. ``tools/dump_env_compose.py`` (BEGIN/END-aware Python
# parser) emits shell-safe export lines for both shapes.
eval "$(uv run python -m tools.dump_env_compose .env.compose.local)"
export DATABASE_URL="postgresql+psycopg://${RGB_POSTGRES_USER}:${RGB_POSTGRES_PASSWORD}@localhost:5433/${RGB_POSTGRES_DB}"
export REDIS_URL=redis://localhost:6381
export NO_PROXY=localhost,127.0.0.1,::1
export no_proxy=localhost,127.0.0.1,::1
# E2E harness needs to know which caller token to attach.
export E2E_INTERNAL_CALLER_SERVICE=gateway
export E2E_INTERNAL_SERVICE_TOKEN="$RGB_INTERNAL_SERVICE_TOKEN_GATEWAY"
uv run python -m tools.multi_brand_backfill.setup_local_two_brand
LOCAL_DOCKER_E2E=1 \
    uv run pytest tests/test_multi_brand_two_brand_e2e.py -v -m local_docker_e2e

Expected output: all collected tests pass with no callback-compatibility skip. Any failure indicates a real isolation regression -- see the relevant section of the Diagnosis Playbook below.

The harness exercises:

Two brands seeded with distinct brand_id and brand_code.
Same account string under both brands -> two distinct player_id values (UNIQUE (brand_id, account) is in effect).
Per-brand wallet read isolation across two players sharing one account string.
cashback_rate override on brand2 resolves to 0.05 while default does not see that value.
Native provider callback paths remain absent, and an unsigned request to the Aggregator relay namespace is rejected.
Externally-supplied X-Brand-Id is stripped at the gateway.
Every legacy back-office path removed in Phase 12 returns 404 against the live admin_service.
At least one Prometheus metrics surface exposes a brand / brand_id / brand_code label.
A JWT issued for default replayed against brand2's domain does not 5xx and (best-effort) increments brand_resolution_failed_total on the gateway.

Staging Validation

Deploy services and migrations to staging.
Run the default brand backfill in staging; confirm zero NULL brand_id rows.
Rewrite staging domain:agent:* and domain:level:* Redis values to the brand-aware shape; confirm every existing domain still resolves to default.
Run all unit, contract, and integration suites against staging.
Re-run the two-brand end-to-end flows from local Docker validation against staging.
Compare wallet bucket totals with ledger-derived totals per brand.
Confirm admin brand_config writes are audited.
Confirm agent_brand writes are audited.
Confirm topology and policy activation in staging is brand-scoped and does not affect the other brand.
Confirm every supported provider authenticates at Aggregator, resolves the launch-created external-player mapping, and reaches RGB only through the signed provider-protocol relay. Invalid signatures and all native paths must fail closed.
Confirm operational dashboards show brand labels for player-, wallet-, rolling-, promotion-, and game-scoped metrics.
Confirm the three brand-resolution alerts and the per-brand wallet outbox freshness alert fire in a synthetic-failure drill.
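
The per-brand comparison of wallet bucket totals against ledger-derived totals can be sketched as below. The input shapes are assumptions made for illustration (a mapping of brand_id to bucket total, and an iterable of (brand_id, amount) ledger rows); the real check would read these from wallet_service tables.

```python
from collections import defaultdict
from decimal import Decimal


def ledger_divergence(bucket_totals, ledger_rows):
    """Return {brand_id: (bucket_total, ledger_total)} for every brand whose totals disagree."""
    ledger_totals = defaultdict(Decimal)
    for brand_id, amount in ledger_rows:
        ledger_totals[brand_id] += Decimal(amount)

    diverged = {}
    # Check brands seen on either side so a brand missing from one source is also reported.
    for brand_id in set(bucket_totals) | set(ledger_totals):
        bucket = Decimal(bucket_totals.get(brand_id, "0"))
        ledger = ledger_totals.get(brand_id, Decimal("0"))
        if bucket != ledger:
            diverged[brand_id] = (bucket, ledger)
    return diverged


# Hypothetical sample data: brand 1 reconciles, brand 2 does not.
buckets = {1: "150.00", 2: "40.00"}
ledger = [(1, "100.00"), (1, "50.00"), (2, "45.00")]
result = ledger_divergence(buckets, ledger)
```

Decimal is used rather than float so that money comparisons are exact. Any non-empty result is a reason to stop and investigate before proceeding.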

Release Gate

Do not enable a second brand in any shared or production-like environment if any of the following are true:

shared brand contracts, migrations, or backfill are missing or failing
any brand-scoped table contains rows with NULL brand_id
the target environment has not progressed MULTI_BRAND_ENFORCEMENT to enforce after the soak window in observe mode (a second brand cannot be enabled while any service is still in observe or off)
the configured RGB team is not present in Aggregator's required relay set, its target is not HTTPS, its provider allow-list is incomplete, or its relay credential does not match the exact team credential loaded by game_service
the soak window has not been at least 2 * JWT_EXPIRE_MIN minutes (default JWT_EXPIRE_MIN = 60, so at least 120 minutes) measured from Phase 5 deploy time (the last moment a brand-unaware JWT could be issued), so that JWTs issued before brand-aware login go live have all expired
any JWT in the active session population still lacks a brand_id claim
any wallet_service command path can mutate a row whose brand differs from the request brand
per-brand topology or policy activation can affect another brand's active document
per-brand topology or policy resolution returns an active document belonging to a different brand than the requesting brand
per-brand rolling completion ratios, cashback / rebate / lossback rates, or payment channel selection do not resolve from brand_config (with documented global default fallback) as specified
any provider callback can bypass Aggregator authentication, signed relay context, authoritative player_service brand resolution or wallet idempotency
any brand-scoped query in any service is missing a brand_id filter
any path previously served by the seven deleted admin_service legacy route files still responds with anything other than 404
the three brand-resolution alerts or per-brand wallet outbox freshness alert are not wired
structured logs or Prometheus metrics are missing brand labels
two-brand local Docker or staging end-to-end flows fail at any step
the project integer in admin_service/app/tasks/tag.py is still in use anywhere
any global_var row whose key is duplicated by a brand_config entry still exists (the deletion list documented in plan Phase 12 has not been applied)
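
The time-based part of the gate (the soak window of at least 2 * JWT_EXPIRE_MIN minutes measured from the Phase 5 deploy) is simple arithmetic. A minimal sketch follows; the function name and the sample deploy timestamp are illustrative only.

```python
from datetime import datetime, timedelta, timezone

JWT_EXPIRE_MIN = 60  # default stated in the release gate


def earliest_enable_time(phase5_deploy_at: datetime, jwt_expire_min: int = JWT_EXPIRE_MIN) -> datetime:
    """Earliest moment the soak-window condition can be satisfied."""
    return phase5_deploy_at + timedelta(minutes=2 * jwt_expire_min)


# Illustrative timestamp, not a real deploy time.
deploy = datetime(2026, 4, 29, 10, 0, tzinfo=timezone.utc)
earliest = earliest_enable_time(deploy)
```

This covers only the clock condition; every other item in the gate still has to pass independently.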

Rollback Trigger

Pause rollout or roll back immediately if:

brand_resolution_failed_total rises above the documented threshold
wallet_cross_brand_rejected_total becomes non-zero in normal traffic
Aggregator relay authentication, identity resolution or downstream wallet settlement failures rise above the documented baseline
a player can register or log in under the wrong brand
a wallet command mutates a row whose brand differs from the request brand
per-brand topology or policy activation alters another brand's active document
ledger writes fail or are delayed beyond the wallet release threshold
bucket totals diverge from ledger-derived totals for any brand
a game provider callback for one brand routes into a different brand's player
a recon approval resolves into the wrong brand

Operational Reference

Grafana dashboard

Dashboard name: multi-brand-isolation. Required panels:

Per-brand request rate (stacked): sum by (brand_code) (rate(request_total[1m]))
Brand-resolution failures (by reason): sum by (reason, service) (rate(brand_resolution_failed_total[5m]))
Wallet cross-brand rejections (by command and mode): sum by (command, mode) (rate(wallet_cross_brand_rejected_total[5m]))
Aggregator callback relay HTTP failures (by route/status): sum by (path, status) (rate(rgb_http_requests_total{service="game_service",path=~"/aggregator/provider/.*",status!~"2.."}[5m]))
Idempotency cross-brand splits: rate(wallet_idempotency_brand_split_total[5m])
Enforcement mode per service (gauge): multi_brand_enforcement_mode
Brand-resolution latency (p99): histogram_quantile(0.99, sum by (le, service) (rate(brand_resolution_latency_seconds_bucket[5m])))
Security downgrade events: increase(security_downgrade_total[1h])
Admin operator-id mismatch (P1-3, by reason): sum by (reason) (rate(admin_operator_id_mismatch_total[5m]))
Admin audit-write failure (P1-4, by route): sum by (route) (rate(admin_audit_write_failed_total[5m]))

Service name vs `X-Caller-Service` / metric label

The runbook below uses two distinct identifiers that operators must not confuse when grepping metrics, logs, or HTTP traces:

| service (compose / repo) | `X-Caller-Service` header + `caller_service` metric label | env-var suffix (per-caller token) |
| --- | --- | --- |
| `gateway` | `gateway` | `GATEWAY` |
| `admin_service` | `admin` | `ADMIN` |
| `agent_service` | `agent` | `AGENT` |
| `game_service` | `game` | `GAME` |
| `promotion_service` | `promotion` | `PROMOTION` |
| `recon_service` | `recon` | `RECON` |
| `rolling_service` | `rolling` | `ROLLING` |
| `player_service` | `player` | `PLAYER` |

When you see caller_service=admin in a Prometheus query, that is the admin_service process. There are no *_service strings on the wire or in label values — only repo / compose use the _service suffix. HMAC signature payloads, missing-signature counters, and per-caller token env-var lookups all key on the short label.
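
The mapping in the table follows one pattern: drop the _service suffix to get the wire label, then upper-case it for the env-var suffix. A small sketch of that pattern is below. The helper names are hypothetical and this is derived from the table, not from the repository code.

```python
def caller_label(service_name: str) -> str:
    """Map a compose/repo service name to its X-Caller-Service / caller_service label."""
    # gateway has no suffix, so it maps to itself.
    return service_name.removesuffix("_service")


def per_caller_token_env(service_name: str) -> str:
    """Name of the per-caller token env var for a given calling service."""
    return f"INTERNAL_SERVICE_TOKEN_{caller_label(service_name).upper()}"
```

For example, a query filtered on caller_service="admin" corresponds to the admin_service process, and its token lives in INTERNAL_SERVICE_TOKEN_ADMIN.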

`INTERNAL_SERVICE_TOKEN_*` rotation

Each consumer service (wallet_service, rolling_service, promotion_service, etc.) accepts a set of caller-service tokens (INTERNAL_SERVICE_TOKEN_GATEWAY, INTERNAL_SERVICE_TOKEN_AGENT, ...). The consumer's require_internal_service_token dependency reads both the active env var AND the _PREV overlap variant; rotation is per token with the following steps:

Set _PREV on the consumer. On every consumer instance (wallet/rolling/promotion/game/player/agent), set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)>_PREV to the current outgoing value AND keep INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> pointing at the SAME outgoing value. (At this stage both env vars hold the same value -- nothing changes for callers.) Deploy.
Flip active to the new value on the consumer. Set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> to the new secret; _PREV still holds the outgoing one. Deploy. The consumer now accepts BOTH values via has_valid_per_caller_token_with_source(...) (shared rgb_contracts.infra.internal_auth).
Roll the caller service to the new token. Set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> to the new secret on the producer side; deploy. Watch internal_caller_token_prev_used_total{consumer,caller} — every request still authenticating via the old key fires this counter. When the rate hits zero, every caller instance is on the new secret.
Remove _PREV from the consumer. Drop INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)>_PREV from env; deploy. The counter must remain at zero post-deploy.

Rollback at any step: revert the consumer env vars and redeploy — the old token regains acceptance immediately.
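
The overlap behaviour in the steps above can be sketched as follows. This is not the repository's has_valid_per_caller_token_with_source; match_caller_token is a hypothetical stand-in that assumes the consumer compares the presented token against the active value first and the _PREV value second.

```python
import hmac
from typing import Mapping, Optional


def match_caller_token(presented: str, caller: str, env: Mapping[str, str]) -> Optional[str]:
    """Return "active", "prev", or None depending on which configured token matches."""
    base = f"INTERNAL_SERVICE_TOKEN_{caller.upper()}"
    active = env.get(base, "")
    prev = env.get(f"{base}_PREV", "")
    # Constant-time comparison; an empty configured value never matches.
    if active and hmac.compare_digest(presented.encode(), active.encode()):
        return "active"
    if prev and hmac.compare_digest(presented.encode(), prev.encode()):
        return "prev"  # the point where a prev-used counter would increment
    return None


# Step 2 state: active holds the new secret, _PREV holds the outgoing one.
step2_env = {
    "INTERNAL_SERVICE_TOKEN_GATEWAY": "new-secret",
    "INTERNAL_SERVICE_TOKEN_GATEWAY_PREV": "old-secret",
}
# Step 4 state: _PREV removed.
step4_env = {"INTERNAL_SERVICE_TOKEN_GATEWAY": "new-secret"}
```

During step 2 and step 3 both secrets authenticate; after step 4 only the new one does, which is why the prev-used rate must be zero before _PREV is dropped.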

`BRAND_SIGNING_KEY` rotation

Same overlap shape, applied to the wallet brand-signature verifier:

Set BRAND_SIGNING_KEY_PREV on every wallet_service instance to the current outgoing key; keep BRAND_SIGNING_KEY at the same outgoing value. Deploy.
Flip BRAND_SIGNING_KEY on wallet_service to the new key; _PREV still holds the outgoing one. Deploy. The verifier (verify_request_brand_signature(... signing_key, signing_key_prev, ...)) now accepts signatures produced with either key.
Roll the 7 signing callers (gateway, agent_service, admin_service, game_service, promotion_service, recon_service, rolling_service) to the new key; deploy. Watch wallet_brand_signature_prev_key_used_total{caller_service}; when the rate hits zero, every caller has cut over.
Remove BRAND_SIGNING_KEY_PREV from wallet_service; deploy.

The verifier requires the active BRAND_SIGNING_KEY to be set: _PREV alone is treated as a misconfiguration (empty active key) and fails closed in enforce mode. This prevents leaving the old key in place indefinitely.
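
A minimal sketch of a two-key verifier with this fail-closed rule is shown below. It is not the repository's verify_request_brand_signature: the function names, the HMAC-SHA256 choice, the hex encoding, and the payload layout are all assumptions made for illustration.

```python
import hashlib
import hmac


def sign(payload: bytes, key: str) -> str:
    """Hypothetical signer: hex HMAC-SHA256 of the payload."""
    return hmac.new(key.encode(), payload, hashlib.sha256).hexdigest()


def verify_brand_signature(payload: bytes, signature_hex: str,
                           signing_key: str, signing_key_prev: str = "") -> str:
    """Return "active", "prev", "invalid", or "misconfigured"."""
    if not signing_key:
        # _PREV alone is never enough: an empty active key fails closed.
        return "misconfigured"
    for label, key in (("active", signing_key), ("prev", signing_key_prev)):
        if not key:
            continue
        expected = sign(payload, key)
        if hmac.compare_digest(expected.encode(), signature_hex.encode()):
            return label
    return "invalid"


# Hypothetical payload layout; the real signed fields are defined by the shared contracts.
payload = b"caller=gateway;brand_id=2;path=/wallet/deposit"
old_sig = sign(payload, "old-key")
new_sig = sign(payload, "new-key")
```

A "prev" result is where a counter such as wallet_brand_signature_prev_key_used_total would be incremented, which is the signal watched in step 3.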

Per-caller-service token rollout

Phase 5B introduced per-caller tokens. Each consumer service accepts both the legacy single INTERNAL_SERVICE_TOKEN AND a per-caller variant INTERNAL_SERVICE_TOKEN_<UPPER(caller)>. The rollout has three stages:

Stage A (now): legacy token still active everywhere; per-caller tokens not provisioned. The internal_caller_token_legacy_total{consumer,caller} counter shows 100% legacy traffic.
Stage B (transition): provision per-caller env vars on every consumer; deploy. Both tokens are accepted; per-caller takes precedence. The counter shows mixed legacy + per-caller usage.
Stage C (terminal): revoke the shared INTERNAL_SERVICE_TOKEN from every consumer; deploy. From this point only per-caller tokens are accepted. The counter must remain at zero or only fire for misconfigured callers.

The per-caller env-var names are:

INTERNAL_SERVICE_TOKEN_GATEWAY (consumed by every other service when receiving from gateway)
INTERNAL_SERVICE_TOKEN_AGENT
INTERNAL_SERVICE_TOKEN_ADMIN
INTERNAL_SERVICE_TOKEN_RECON
INTERNAL_SERVICE_TOKEN_GAME
INTERNAL_SERVICE_TOKEN_PROMOTION
INTERNAL_SERVICE_TOKEN_ROLLING

Each consumer reads the relevant subset (e.g. wallet_service reads all 7 since every other service may call into wallet).

Pre-Stage-C release gate:

internal_caller_token_legacy_total{consumer="wallet_service"} is zero across the soak window
internal_caller_token_legacy_total{consumer="player_service"} is zero across the soak window
(repeat for every consumer)

To roll Stage C: remove the INTERNAL_SERVICE_TOKEN env var from each consumer's deploy template and roll one consumer at a time. Mid-flip abort: re-provision the env var and roll back.

The audit test servers_v2/tests/test_per_caller_header_audit.py is a regression net that asserts every internal HTTP client in the stack sends X-Caller-Service. Run it before each Stage C deploy:

cd servers_v2 && uv run pytest tests/test_per_caller_header_audit.py -v

JWT RS256 key (`JWT_PRIVATE_KEY` / `JWT_PUBLIC_KEY`) rotation

JWT signing uses RS256 with kid header. The private key (JWT_PRIVATE_KEY) is provisioned only into player_service and agent_service; the public key (JWT_PUBLIC_KEY) goes everywhere that verifies JWTs (gateway and any other verifier). Verifiers maintain an active key set keyed by kid. Rotation procedure:

Generate a new RSA keypair with a new kid (e.g. kid=YYYYMMDD-N). Add it to the verifiers' active key set under its kid (env var JWT_PUBLIC_KEYS is a JSON map of kid -> public_key); deploy the verifier change first.
Roll the verifier deploy across gateway and any other verifying service; confirm /health lists both old and new kid as accepted.
Update JWT_PRIVATE_KEY and JWT_KID on player_service and agent_service to the new keypair; restart. New tokens issued from this point use the new kid.
Overlap window: keep the old kid in the verifier's active set for at least JWT_EXPIRE_MIN + REFRESH_TOKEN_TTL + 60min buffer (default: 60min + REFRESH_TOKEN_TTL + 60min). All in-flight tokens drain by the end of this window.
Remove the old kid from JWT_PUBLIC_KEYS on the verifiers; re-deploy. Any token presented after this point that uses the old kid is rejected as unknown_kid.

Emergency rotation (key compromise): skip step 4's overlap. Remove the old kid immediately on verifiers; force re-login for every session by rolling a global session-invalidation flag. Customer impact: every active session must re-authenticate. Document in #incidents.

JWT signing tests must include: (a) alg=none rejected, (b) alg=HS256 rejected when configured for RS256, (c) JWT with unknown kid rejected, (d) JWT signed with old kid accepted during overlap, (e) JWT signed with old kid rejected after overlap.
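
Cases (a) through (c) can be decided from the JWT header alone, before any signature check. The sketch below shows only that pre-check, using the standard library. It does not verify the RS256 signature, which still has to be done with a JWT library against the selected key. All function names and the sample kid values are hypothetical.

```python
import base64
import json
from typing import Mapping, Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def select_public_key(token: str, public_keys: Mapping[str, str]) -> str:
    """Return the public key for the token's kid, or raise ValueError with a reason."""
    header = json.loads(_b64url_decode(token.split(".")[0]))
    if not isinstance(header, dict):
        raise ValueError("malformed_header")
    if header.get("alg") != "RS256":
        raise ValueError("alg_not_allowed")  # covers alg=none and alg=HS256
    kid = header.get("kid")
    if kid not in public_keys:
        raise ValueError("unknown_kid")
    return public_keys[kid]


def rejection_reason(token: str, public_keys: Mapping[str, str]) -> Optional[str]:
    """Convenience wrapper: None when the header passes, else the reason string."""
    try:
        select_public_key(token, public_keys)
    except ValueError as exc:
        return str(exc)
    return None


def fake_token(header: dict) -> str:
    """Build an unsigned token-shaped string for exercising the header pre-check."""
    encoded = base64.urlsafe_b64encode(json.dumps(header).encode()).rstrip(b"=").decode()
    return f"{encoded}.e30.signature-placeholder"


# Overlap state: both the old and the new kid are in the active set.
overlap_keys = {"20260101-1": "OLD_PUBLIC_PEM", "20260429-1": "NEW_PUBLIC_PEM"}
# After step 5: the old kid has been removed.
post_overlap_keys = {"20260429-1": "NEW_PUBLIC_PEM"}
```

The same token header that selects the old key during the overlap is rejected as unknown_kid once the old kid is removed, which mirrors cases (d) and (e) at the key-selection level.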

Brand-signature enforcement flip (`WALLET_BRAND_SIGNATURE_REQUIRE`)

wallet_service accepts an X-Brand-Signature header on brand-scoped writes from internal callers. Whether unsigned writes are rejected in enforce mode is gated by the env var WALLET_BRAND_SIGNATURE_REQUIRE on wallet_service:

Default: "off" (transitional). Wallet still verifies signatures when present, but unsigned writes are accepted and a counter (wallet_brand_signature_missing_total{caller_service=*}) is incremented per missing-signature observation, tagged with the reporting X-Caller-Service.
Target: "on". Unsigned brand-scoped writes are rejected with 403 brand_signature_missing. This is the actual security control; if the var is never flipped, brand-signature enforcement is dead even after the global enforce-mode flip.

Rollout stages:

Stage A (current): the 7 wallet write callers (compose names: gateway, agent_service, admin_service, game_service, promotion_service, recon_service, rolling_service; their caller_service metric labels are gateway, agent, admin, game, promotion, recon, rolling respectively per the table above) all sign brand-scoped writes to wallet_service. Wallet verifies presence + signature when supplied but does not reject unsigned writes. player_service has no outbound wallet HTTP client and therefore is not in this list (see the table-freeze section below for the broader DB-writer set, which is a distinct concept).
Stage B (soak): monitor wallet_brand_signature_missing_total{caller_service=*} over the full soak window (same window as the enforce-mode soak). Expect exactly zero increments. Any non-zero value identifies a misconfigured caller (missing BRAND_SIGNING_KEY, missing header injection in the HTTP client, etc.) and blocks the flip until fixed and re-soaked.
Stage C (flip): set WALLET_BRAND_SIGNATURE_REQUIRE=on on every wallet_service instance, restart wallet. Verify rejection works by sending a hand-crafted unsigned request to a brand-scoped endpoint and confirming a 403 response.

Rollback: set WALLET_BRAND_SIGNATURE_REQUIRE=off, restart wallet. Unsigned writes are accepted again immediately on restart. Use only if a real production caller turns out to be missing the header (and file a P0 to fix the caller before re-attempting the flip).
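
The effect of the flag on the missing-signature branch can be sketched as a small decision function. This assumes the service is already in enforce mode and models only the presence check; the function name and outcome strings are hypothetical, and raising on an unrecognised flag value is a choice made for this sketch, not documented behaviour.

```python
def unsigned_write_outcome(signature_present: bool, require: str) -> str:
    """Decide what happens to a brand-scoped write based on WALLET_BRAND_SIGNATURE_REQUIRE."""
    if require not in ("on", "off"):
        raise ValueError(f"unexpected WALLET_BRAND_SIGNATURE_REQUIRE value: {require!r}")
    if signature_present:
        # Signatures are verified whenever supplied, in both modes.
        return "verify_signature"
    if require == "on":
        return "reject_403_brand_signature_missing"
    # Transitional default: accept, but count the observation per caller.
    return "accept_and_count"
```

The Stage B soak is watching for any "accept_and_count" outcome; each one would become a 403 after the Stage C flip.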

Secret storage and provisioning

Where each shared secret lives:

JWT_PRIVATE_KEY, JWT_PUBLIC_KEYS (a JSON map of kid -> public_key), JWT_KID: provisioned via the secrets manager (Vault path secret/multi-brand/jwt/<env>); read access scoped to player_service and agent_service deploy roles for the private key, all-services for public keys.
BRAND_SIGNING_KEY: provisioned via the secrets manager (Vault path secret/multi-brand/brand-signing/<env>); read access scoped to gateway, agent_service, and wallet_service deploy roles.
INTERNAL_SERVICE_TOKEN_*: one path per consumer-caller pair (secret/multi-brand/internal-tokens/<consumer>/<caller>/<env>).
Production deploys must verify these are present at boot via a startup probe; a missing secret must fail the boot rather than fall back silently.
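
A boot-time presence check of this kind might look like the sketch below. The helper names are hypothetical, and the list of required variables per service is something each service would define for itself.

```python
import os
from typing import Iterable, List, Mapping


def missing_secrets(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the required names that are absent or blank in the environment."""
    return [name for name in required if not env.get(name, "").strip()]


def startup_probe(required: Iterable[str], env: Mapping[str, str] = os.environ) -> None:
    """Abort the process when any required secret is missing; never fall back silently."""
    missing = missing_secrets(required, env)
    if missing:
        raise SystemExit(f"refusing to boot: missing secrets: {', '.join(sorted(missing))}")


# Illustrative environment: the signing key is absent.
sample_env = {"JWT_KID": "20260429-1", "BRAND_SIGNING_KEY": ""}
absent = missing_secrets(["JWT_KID", "BRAND_SIGNING_KEY"], sample_env)
```

Treating a blank value the same as an absent one avoids the case where a templating error injects an empty string and the service starts with an unusable secret.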

Application-level table-freeze flag

Several migration steps (0027 PK rewrites, partial-unique-index rebuilds) and Production Rollback B require pausing application writes to a specific table family while the migration / repair runs. The mechanism is a Redis-backed flag set:

Key pattern: table_freeze:{table_name} (string, value 1 when frozen, key absent when unfrozen).
Every writer (wallet_service, player_service, agent_service, etc.) checks the flag before any INSERT/UPDATE on a frozen table and returns the documented 503 service_paused envelope to the caller when the flag is set. Reads are not blocked.
Set: redis-cli SET table_freeze:wallet_topology 1 EX 1800 (30 min TTL safety net; the TTL prevents a forgotten flag from permanently locking writes).
Clear: redis-cli DEL table_freeze:wallet_topology.
All set/clear operations are logged in #incidents with the operator identity, table name, and reason.
Writers cache the flag check for 1 second to avoid Redis hot-pathing; this means the freeze takes effect within 1 second of the SET.
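
The writer-side check with its 1-second cache can be sketched as below. The class name is hypothetical, and the Redis read is injected as a plain callable so the sketch stays self-contained; a real writer would pass its Redis client's GET. Only the key pattern and the 1-second cache come from this section.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TableFreezeCheck:
    """Caches the table_freeze:{table_name} lookup for a short interval."""

    def __init__(self, redis_get: Callable[[str], Optional[str]],
                 clock: Callable[[], float] = time.monotonic,
                 cache_seconds: float = 1.0) -> None:
        self._get = redis_get
        self._clock = clock
        self._cache_seconds = cache_seconds
        self._cache: Dict[str, Tuple[float, bool]] = {}

    def is_frozen(self, table_name: str) -> bool:
        now = self._clock()
        cached = self._cache.get(table_name)
        if cached is not None and now - cached[0] < self._cache_seconds:
            return cached[1]
        # Key present means frozen; key absent means writes are allowed.
        frozen = self._get(f"table_freeze:{table_name}") is not None
        self._cache[table_name] = (now, frozen)
        return frozen


# Demonstration with an in-memory dict standing in for Redis and a manual clock.
store: Dict[str, str] = {}
now = [0.0]
check = TableFreezeCheck(store.get, clock=lambda: now[0])

before_set = check.is_frozen("wallet_topology")    # flag absent
store["table_freeze:wallet_topology"] = "1"        # operator sets the flag
within_cache = check.is_frozen("wallet_topology")  # stale cached answer
now[0] = 1.5
after_cache = check.is_frozen("wallet_topology")   # cache expired, flag observed
```

The demonstration shows why an operator should wait at least one second after the SET before starting the migration step: a writer may still act on a cached "not frozen" answer until its cache entry expires.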

Admin VPN ingress source of truth

The IP allow-list for admin_service ingress is maintained in the infrastructure repo (infra/lb/admin-allowlist.tf). Onboarding a new operator IP requires a PR to that repo plus security-team approval; the PR triggers an LB config push. Off-boarding requires the same PR flow. The source of truth is the Terraform module; ad-hoc LB UI changes are forbidden and surfaced by drift-detection on the infra repo.

Pen-test release gate

A second brand cannot be enabled in production until an external penetration test of the multi-brand isolation surface has completed and findings are remediated. Scope:

gateway brand resolution (Host/Origin spoof, JWT replay, JWT forgery, X-Brand-Id injection)
wallet_service cross-brand command rejection under observe and enforce
admin_service write surface (network isolation, JWT account attribution, audit-row tamper)
Aggregator relay-backed Seamless (provider authentication, signed identity, exact native protocol preservation and retry/idempotency)
MULTI_BRAND_ENFORCEMENT downgrade (operator-side abuse, audit trail)

Diagnosis Playbook

When any of the multi-brand alerts fires, use this section before considering rollback.

For the brand-signature, brand-check, agent-allow-list, Aggregator-relay, and admin-audit signals that lack a section here, see the dedicated operator-facing entry points in diagnosis-playbook.md. That file covers wallet_brand_signature_failed_total, wallet_brand_signature_replay_blocked_total, wallet_cross_brand_rejected_total, brand_resolution_failed_total{service="gateway"}, agent_brand_not_allowed_total, admin_operator_id_mismatch_total, and admin_audit_write_failed_total, plus the brand-signature replay-cache + default-brand-cache counters (wallet_brand_signature_replay_redis_outage_total and wallet_topology_default_brand_unprimed_total, both delivered by P2-δ), the per-service *_cross_brand_rejected_total family for agent_service / game_service / promotion_service / recon_service / rolling_service, and the post-Tier-1/Tier-4 hardening counters (gateway_jwt_session_missing_total, gateway_jwt_unknown_kid_total, pii_aes_unconfigured_total, plisio_callback_replay_blocked_total, plisio_callback_replay_redis_outage_total, internal_caller_token_legacy_rejected_total, agent_balance_legacy_write_total, admin_ws_legacy_query_token_total, admin_ws_rate_limited_total, recovery_sms_delivery_failed_total, recovery_email_unprovisioned_total, recovery_rate_limit_redis_outage_total).

`brand_resolution_failed_total{reason="jwt_domain_mismatch"}`

What it means: a player or agent JWT was presented at a domain whose brand differs from the JWT's brand claim, OR the JWT lacks brand_id.

Diagnosis:

Query the structured logs for the past 5 min filtered on the counter increment. Look at: request_id, jwt.brand_id, request.state.brand_id, domain.
If the counts cluster on one source IP / one user-agent, it is likely a scanner or a stale mobile client; expected baseline noise.
If the counts cluster on legitimate browser user-agents and one brand's domain, look for a recent domain:agent:* or domain:level:* Redis change.
If JWTs lacking brand_id are still appearing more than 2 * JWT_EXPIRE_MIN after Phase 5 deploy, refresh-token rotation is minting brand-unaware tokens; check Phase 5 implementation in player_service/agent_service.

Mitigation:

Baseline noise: ignore; refine the alert threshold.
Single domain spike: re-verify the Redis mapping for that domain; inspect the LB and any DNS changes.
Refresh-token issue: hot-fix player_service/agent_service to reject refreshes from brand-unaware refresh tokens; force re-login for affected sessions.
If the spike is correlated to a Phase 16 flip and is causing real user pain: revert the affected service to observe.
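For triage it helps to separate the two conditions that share this reason label. The sketch below is illustrative only: the function name and the returned detail strings are hypothetical, and the claims mapping is assumed to be an already-verified JWT payload.

```python
from typing import Mapping, Optional


def jwt_domain_mismatch(claims: Mapping[str, object],
                        domain_brand_id: int) -> Optional[str]:
    """Return a detail string when the request counts as jwt_domain_mismatch."""
    brand = claims.get("brand_id")
    if brand is None:
        # Brand-unaware token, e.g. minted before Phase 5 or by a stale refresh path.
        return "missing_brand_claim"
    if brand != domain_brand_id:
        # Token for one brand presented at another brand's domain.
        return "brand_differs"
    return None
```

Logging which branch fired makes it easier to tell a refresh-token rotation problem (missing claim) from a domain mapping change (differing brand).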

`wallet_cross_brand_rejected_total{mode="enforce"}`

What it means: a wallet command targeted a row whose brand_id differs from the request brand. In enforce mode the command is rejected.

Diagnosis:

Filter wallet_service logs for the past 5 min on event=cross_brand_rejected. Look at: command, caller_service (from the INTERNAL_SERVICE_TOKEN_* token used), request_brand, target_row_brand, request_id.
If caller_service is consistent (e.g. always recon — the label value for the recon_service process; see the service-name vs caller-label table earlier in this document), that caller is not forwarding X-Brand-Id correctly OR the originating data is wrong (e.g. a deposit row carries the wrong brand_id).
If caller_service varies, something at the gateway / agent_service edge is leaking the wrong brand into the X-Brand-Id propagation.

Mitigation:

Single-caller bug: hot-fix the caller's brand-forwarding code OR flip just that caller's MULTI_BRAND_ENFORCEMENT to observe to buy time (writes proceed using request brand; no money lands in wrong brand because wallet uses request brand authoritatively in observe).
Edge bug: revert gateway to observe; investigate brand resolution in the affected route.
If the rate is high and money flow is impacted: full revert to observe per the Phase 16 mid-flip revert procedure.
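The difference between observe and enforce for a cross-brand command can be summarized in a few lines. This is a sketch of the behavior described in this runbook, not wallet_service code; BrandDecision and decide_cross_brand are hypothetical names, and any mode other than observe or enforce is deliberately left out of scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BrandDecision:
    allowed: bool                    # does the command proceed?
    counted: bool                    # does wallet_cross_brand_rejected_total{mode} increment?
    effective_brand: Optional[int]   # brand the write is attributed to, if it proceeds


def decide_cross_brand(request_brand: int, row_brand: int, mode: str) -> BrandDecision:
    if mode not in ("observe", "enforce"):
        raise ValueError(f"unsupported mode for this sketch: {mode!r}")
    if request_brand == row_brand:
        return BrandDecision(True, False, request_brand)
    if mode == "enforce":
        return BrandDecision(False, True, None)
    # observe: count the mismatch, proceed with the request brand as authority
    return BrandDecision(True, True, request_brand)
```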

Aggregator callback relay failure

What it means: the callback failed at one of the current trust boundaries: provider authentication in Aggregator, launch-created player mapping, Aggregator-to-RGB relay HMAC/context validation, authoritative player/brand resolution, or the idempotent wallet/ledger command.

Diagnosis:

Correlate Aggregator, gateway, game_service, player_service and wallet_service logs by request ID. Record the provider, operation and HTTP status, but never log secrets, raw tokens or full callback payloads.
A 401 from Aggregator means native provider authentication or identity mapping failed. A 401 from game_service means relay key/timestamp/signature/context failed. A 409 from game_service means the exact Inbox event is already leased or conflicts; Aggregator converts a live lease into a retryable 503. A 502 from Aggregator after RGB means the result-envelope signature/scope was invalid. A 503 means an authoritative dependency or local projection write was unavailable.
Verify the exact method/path/query bytes/body bytes used to compute the relay signature. Do not parse and re-encode query strings during diagnosis.
Confirm the signed external player and brand resolve to the same active RGB player, then inspect provider_relay_inbox, provider_relay_mutation_outbox, the Provider transaction and wallet idempotency key for the same event/request hash.

Mitigation:

Credential/configuration drift: restore the matching team-scoped relay keys and complete provider allow-list, then roll both ends.
Identity mapping failure: repair the launch/mapping path; let the provider retry through Aggregator after the fix.
Downstream outage: keep the Aggregator receipt UNKNOWN and release the RGB lease so Provider retry policy remains authoritative. Do not loop-retry money callbacks inside one public request; exact retries are recovered by the two durable stores.
Reconcile provider and RGB ledger state before any manual adjustment. Historical raw callback payloads must never be posted directly to RGB.
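The warning about exact bytes in the diagnosis steps can be demonstrated with a small HMAC sketch. The canonical message layout below (method, path, query, timestamp and body hash joined by newlines) is an assumption made for illustration; the real layout is defined by the relay contract, and relay_signature / verify_relay_signature are hypothetical names.

```python
import hashlib
import hmac


def relay_signature(key: bytes, method: str, path: bytes, query: bytes,
                    body: bytes, timestamp: str) -> str:
    """HMAC-SHA256 over the raw request bytes (assumed layout)."""
    message = b"\n".join([
        method.upper().encode("ascii"),
        path,                                   # raw path bytes as received
        query,                                  # raw query bytes, never re-encoded
        timestamp.encode("ascii"),
        hashlib.sha256(body).hexdigest().encode("ascii"),
    ])
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_relay_signature(key: bytes, method: str, path: bytes, query: bytes,
                           body: bytes, timestamp: str, presented: str) -> bool:
    expected = relay_signature(key, method, path, query, body, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("ascii"), presented.encode("utf-8"))
```

Two query strings that decode to the same parameters, such as a=1&b=%20 and a=1&b=+, produce different signatures, which is why parsing and re-encoding during diagnosis yields a false mismatch.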

`security_downgrade_total` (P0 page)

What it means: a service started with MULTI_BRAND_ENFORCEMENT != 'enforce' while more than one brand is enabled. Either an operator flipped the flag back during incident response, a deploy template lost the enforce env var, or CI rolled out a service with the wrong default.

Diagnosis:

Identify which service: security_downgrade_total{service} label.
Query multi_brand_enforcement_mode{service} for the same service to confirm the current mode (1=observe, 0=off).
Check the deploy history for the affected service in the past 24h; look for a config change that removed or modified MULTI_BRAND_ENFORCEMENT.
Page the deploy lead AND the security on-call simultaneously. This is treated as a security incident until proven otherwise.

Mitigation:

If the downgrade is intentional (legitimate incident response that required observe): document in #incidents, ack the alert with a silence window, and plan re-enforce per the Phase 16 procedure.
If the downgrade is unintentional: re-set MULTI_BRAND_ENFORCEMENT=enforce in the env, restart the affected service, confirm /health and multi_brand_enforcement_mode show enforce. Investigate the deploy pipeline; consider a startup probe that aborts boot if the env var is missing post-Phase-16.
Either way: never silently let the downgrade persist while >1 brand is enabled.
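The alert condition and the suggested startup probe can be sketched together. The names below are hypothetical, the enabled brand count is assumed to be available at boot, and the print call stands in for incrementing security_downgrade_total{service}.

```python
from typing import Mapping


def is_security_downgrade(mode: str, enabled_brand_count: int) -> bool:
    """True when a non-enforce mode coexists with more than one enabled brand."""
    return mode != "enforce" and enabled_brand_count > 1


def enforcement_mode_at_boot(env: Mapping[str, str], enabled_brand_count: int) -> str:
    mode = env.get("MULTI_BRAND_ENFORCEMENT", "").strip()
    if not mode:
        # Post-Phase-16 posture: a missing variable aborts boot instead of defaulting.
        raise RuntimeError("MULTI_BRAND_ENFORCEMENT is not set; refusing to boot")
    if is_security_downgrade(mode, enabled_brand_count):
        # Stand-in for the counter increment that pages on-call.
        print(f"security downgrade: mode={mode} brands={enabled_brand_count}")
    return mode
```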

`brand_resolution_failed_total{reason="unresolved_brand"}` (page in enforce)

What it means: an authenticated agent_service request reached the allow-list middleware (agent_brand_enforcement.py) but no brand_id was attached by the upstream brand-resolution middleware. Distinct from unknown_domain (HTTP host had no brand mapping at the edge): unresolved_brand fires when resolution itself failed silently and a request slipped through to authenticated routes without context.

Diagnosis:

Filter agent_service logs for the past 5 min on event=agent brand allow-list check failed and reason=unresolved_brand. Look at: agent_id, path, request_id, plus the upstream Host header.
Cross-check the brand-resolution middleware logs for the same request_id. Either the resolver crashed (look for a stack trace) or the LB forwarded an unmapped Host.
If the spike correlates with a recent deploy of AgentBrandResolutionMiddleware or a brand-catalog change, suspect a cache miss on domain:agent:* or domain:level:*.

Mitigation:

Resolver bug: hot-fix; consider flipping agent_service to observe until rolled out.
Missing domain mapping: add the domain:agent:* Redis entry.
LB / DNS misconfig: route Host correctly; reject unmapped hosts at the edge instead of letting them reach agent surfaces.

`brand_resolution_failed_total{reason="missing_header"}` (page in enforce)

What it means: a brand-scoped internal handler received a request without X-Brand-Id. In enforce mode this is rejected.

Diagnosis:

Filter logs on the affected service for the past 5 min. Look at: request_id, caller_service (from the internal token used), route.
Most common cause: a recently-deployed caller service that did not propagate X-Brand-Id. Identify the caller, find the recent deploy, look at the change.
Less common: an external request slipped past the gateway / agent strip and hit an internal handler directly.

Mitigation:

Hot-fix the caller to forward X-Brand-Id, OR temporarily flip the affected consumer to observe until the caller rolls out.
If external request leakage: investigate edge configuration; the X-Brand-Id strip may be misconfigured.

`wallet_idempotency_brand_split_total` (warning)

What it means: an inbound idempotency key matches an existing key that was recorded under a different brand. Indicates client misrouting or replay.

Diagnosis:

Filter wallet_service logs on event=idempotency_brand_split. Compare request_brand vs the existing-row brand and the caller_service.
If concentrated on one caller, that caller is sending the same idempotency key across brand-namespace boundaries (e.g. account-derived keys without brand prefix).

Mitigation:

Caller fix: prefix idempotency keys with brand_code or brand_id on the caller side.
Not a money-impact incident in itself; observe-mode would proceed with request brand. Track the rate.
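A caller-side fix can be as small as the helper below. The brand:{id}:{key} layout is an assumption for illustration, not a documented key format, and brand_scoped_key is a hypothetical name.

```python
def brand_scoped_key(brand_id: int, raw_key: str) -> str:
    """Namespace an idempotency key by brand so brands cannot collide."""
    if brand_id <= 0:
        raise ValueError("brand_id must be a positive integer")
    if not raw_key:
        raise ValueError("raw_key must be non-empty")
    return f"brand:{brand_id}:{raw_key}"
```

With this shape, the same account-derived key on two brands yields two distinct idempotency keys, so a legitimate request on one brand can no longer collide with a row on another.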

Per-brand wallet outbox publication freshness (page)

What it means: the wallet outbox publisher for one specific brand is not making forward progress (per ADR-003).

Diagnosis:

Query the wallet outbox table filtered on the affected brand: SELECT count(*) FROM wallet.wallet_outbox WHERE brand_id = ? AND published_at IS NULL and the oldest unpublished created_at.
Compare to other brands; if only one brand is stuck, look for a data condition specific to that brand (e.g. a poison-pill payload).
Check Redis Stream lag on the wallet stream.

Mitigation:

Standard wallet outbox playbook applies; brand label scopes the investigation.
If a poison-pill: route the offending row to dead-letter, fix the payload, re-publish.
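The backlog query from the diagnosis steps can return the unpublished count and the oldest created_at in one round trip. The sketch below runs against an in-memory SQLite stand-in so it is self-contained; production is assumed to be PostgreSQL, where the parameter placeholder depends on the driver, and the demo table carries only the columns named in this runbook.

```python
import sqlite3
from typing import Optional, Tuple

BACKLOG_SQL = """
SELECT count(*), min(created_at)
FROM wallet.wallet_outbox
WHERE brand_id = ? AND published_at IS NULL
"""


def outbox_backlog(conn: sqlite3.Connection, brand_id: int) -> Tuple[int, Optional[str]]:
    """Return (unpublished row count, oldest unpublished created_at)."""
    row = conn.execute(BACKLOG_SQL, (brand_id,)).fetchone()
    return row[0], row[1]


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway database shaped like the real table (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS wallet")
    conn.execute(
        "CREATE TABLE wallet.wallet_outbox ("
        "id INTEGER PRIMARY KEY, brand_id INTEGER, created_at TEXT, published_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO wallet.wallet_outbox (brand_id, created_at, published_at) VALUES (?, ?, ?)",
        [
            (1, "2026-04-29T10:00:00", None),
            (1, "2026-04-29T10:05:00", None),
            (1, "2026-04-29T09:00:00", "2026-04-29T09:00:01"),
            (2, "2026-04-29T10:01:00", "2026-04-29T10:01:02"),
        ],
    )
    return conn
```

Running outbox_backlog for each brand and comparing the results shows whether the stall is specific to one brand.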

Per-brand `request_total` zero-traffic alert (warning)

What it means: a previously-active brand stopped receiving requests for >15 min. Either brand outage, domain misconfiguration, or upstream LB / DNS change.

Diagnosis:

Confirm via curl that the brand's domain still resolves and the LB forwards to gateway.
Check domain:agent:{host} and domain:level:{host} Redis entries for the brand's domains; ensure they still bind to the correct brand_id.
Check brand-create / brand-disable audit log for any recent change to that brand's status.

Mitigation:

Domain misconfig: re-bind the Redis entries; confirm via a curl request.
Brand status accidentally disabled: re-enable through admin audit-tracked write.

`wallet_cross_brand_rejected_total{mode="observe"}` (soak-window warning)

What it means: a cross-brand wallet command was processed in observe mode (proceeded using request brand). Above zero before Phase 16 flip indicates an internal caller is not forwarding X-Brand-Id correctly.

Diagnosis: same as the enforce-mode diagnosis above; in observe the counter increments but no end-user 4xx is generated.

Mitigation: must be driven to zero before Phase 16 hard-flip. Hot-fix the caller; do not flip enforce until the soak counter is clean.

Retired game callback DLQ procedure

The native-callback dead-letter writer and replay CLI were removed with the Aggregator-only cutover. Alembic migration 0030 and the game_callback_dead_letter table remain only to preserve schema history; the table is not a live queue or an authoritative recovery source.

Do not replay rows from this table. Historical payloads cannot recreate native provider authentication, Aggregator player resolution or signed relay context, and direct replay could create an unauthenticated money-write path.

Current recovery is provider retry/reconciliation through Aggregator, protected by RGB provider-transaction and wallet idempotency. If automatic retry is exhausted, reconcile both ledgers and use the audited wallet adjustment process; never POST a stored callback body directly to gateway or game_service.

Rollback Steps

For non-production environments:

Stop player-facing traffic to brand-aware paths.
Stop workers that consume brand-aware events.
Export brand, brand_config, agent_brand, player, wallet, rolling, promotion, and game tables for investigation.
If destructive schema changes were applied since the last snapshot, restore from the pre-rollout snapshot.
Re-bind the domain:agent:* and domain:level:* Redis values back to the single-brand shape if necessary.
Restart services and workers.
Re-run health checks and the local/staging smoke checks.

For production:

The following procedures are approved for production. Each must be executed only after a documented incident is opened and the deploy lead is paged.

Production rollback A -- Phase 16 enforcement regression

Symptoms: alert fires on wallet_cross_brand_rejected_total{mode="enforce"}, brand_resolution_failed_total, or sustained Aggregator relay authentication/identity/downstream settlement failures causing user pain.

Procedure: execute the Phase 16 mid-flip revert procedure from the plan (gateway -> agent -> recon -> player -> promotion -> rolling -> wallet; each step flips the env var to observe, restarts the service, and confirms /health shows observe). Total budget: 30 min. After completion, security_downgrade_total will fire (expected during the incident). Re-attempt the flip only after the root cause is fixed and a new soak window has been observed.
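The revert sequence can be sketched as a loop that stops at the first service that fails to report observe. The three callables are hypothetical hooks into whatever deploy tooling is in use; the authoritative steps remain those in the plan.

```python
from typing import Callable, List, Sequence

# Order taken from the procedure above.
REVERT_ORDER = ("gateway", "agent", "recon", "player", "promotion", "rolling", "wallet")


def revert_to_observe(set_mode: Callable[[str, str], object],
                      restart: Callable[[str], object],
                      health_mode: Callable[[str], str],
                      order: Sequence[str] = REVERT_ORDER) -> List[str]:
    """Flip each service to observe in order; return the services completed."""
    done: List[str] = []
    for service in order:
        set_mode(service, "observe")      # flip MULTI_BRAND_ENFORCEMENT
        restart(service)
        reported = health_mode(service)   # mode reported by /health
        if reported != "observe":
            raise RuntimeError(
                f"{service} reports {reported!r} after restart; completed: {done}"
            )
        done.append(service)
    return done
```

Stopping on the first failed confirmation keeps the operator from flipping downstream services while an upstream one is still in an unknown state.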

Production rollback B -- Backfill corruption (rows with wrong brand_id)

Symptoms: counts of rows with brand_id != expected discovered after 0024b backfill OR wallet_idempotency_brand_split_total non-zero after Phase 6 deploy.

Procedure:

Pause writes to the affected table family via the application-level table-freeze flag.
Open a separate incident; do NOT proceed without a verified correction plan.
For deposit-class corruption (deposit row carries wrong brand leading to misrouted Plisio callback): run a reconciliation script that cross-references the originating request domain (logged at creation time) against the deposit row's brand_id; quarantine mismatches; manually correct after wallet team review.
For player-class corruption (player_id assigned to wrong brand): the row is unrecoverable by automated means; open a customer-support ticket per affected player.
Restore from PITR is the last resort and must be approved by the on-call manager + DBA + a wallet domain owner; restore loses every transaction since the PITR target. Document the cutoff window before any restore decision.

Production rollback C -- Mid-flip abort

Already covered by Phase 16 mid-flip revert above.

Production rollback D -- agent_brand seed race remediation

Symptoms: agents report that they cannot log in after the Phase 16 flip; the allow-list counter on agent_service shows allow-list-failed for valid agents.

Procedure: run the Phase 12 re-seed verification SQL (SELECT count ... LEFT JOIN agent_brand WHERE ab.agent_id IS NULL); for every missing agent, insert (agent_id, default_brand_id, 'enabled'). Confirm the agent can log in again. Root-cause why the autoseed trigger missed the agent (was the trigger removed prematurely?).

Production rollback E -- Brand-config regression

Symptoms: a brand_config write produces unexpected behavior (e.g. rolling ratio set to 0 by mistake).

Procedure: every brand_config write is audited (per spec) with the prior value. Run UPDATE brand_config SET value = '<prior_value>' WHERE brand_id = X AND key = Y; from the audit log entry. Investigate the original change-control failure separately.

General production constraints:

Never run destructive rollback without an approved incident plan and on-call manager sign-off.
Prefer traffic pause and provider callback queuing while ledger and authorization state is inspected, before any data mutation.
Every production rollback action is logged in #incidents and added to the post-incident report.

Smoke Checks

GET /health succeeds for every affected service.
gateway resolves a sample request against default's domain and attaches request.state.brand_id.
gateway rejects a sample request whose JWT brand does not match the domain brand.
wallet_service topology and policy read endpoints return per-brand active documents for both default and brand2.
a small bet on default authorizes, settles, produces ledger rows, and the rows carry brand_id = default.
a small bet on brand2 authorizes, settles, produces ledger rows, and the rows carry brand_id = brand2.
a synthetic cross-brand wallet command increments wallet_cross_brand_rejected_total and returns a stable error envelope.
a callback with an invalid relay key/signature/context is rejected, while every native provider callback path remains 404.
duplicate request IDs remain idempotent within a brand and remain rejected on payload mismatch within a brand.
removed back-office routes return 404: every path previously served by legacy_admin_v2.py, legacy_auth.py, legacy_agents_v2.py, legacy_agent_withdrawals_v2.py, legacy_meta_v2.py, legacy_recon.py, and legacy_web_content.py.

Delivered hardening

Items moved out of the rollout backlog because the code, migrations, counters, and tests have all landed on main. Each entry points to the canonical operator playbook for the alert it surfaces.

Recovery flow brand isolation (delivered)

Brand-aware recovery is live for both player and agent surfaces. servers_v2/player_service/app/api/routes/recovery.py and servers_v2/agent_service/app/api/routes/recovery.py ship a brand-precedence resolver that prefers token-embedded brand_id, falls back to request domain when the token lacks the claim, and hard-rejects on resolved-brand-vs-underlying-subject mismatch regardless of MULTI_BRAND_ENFORCEMENT mode. recovery_brand_mismatch surfaces via brand_resolution_failed_total{reason="recovery_brand_mismatch"}.
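The precedence rule reads as follows in code. This is a sketch of the described behavior, not the contents of recovery.py; the names are hypothetical, and raising ValueError when neither a token brand nor a domain brand is available is an assumption the text above does not cover.

```python
from typing import Optional


class RecoveryBrandMismatch(Exception):
    """Resolved brand differs from the brand of the account being recovered."""


def resolve_recovery_brand(token_brand_id: Optional[int],
                           domain_brand_id: Optional[int],
                           subject_brand_id: int) -> int:
    # 1. Prefer the brand embedded in the token; 2. fall back to the request domain.
    resolved = token_brand_id if token_brand_id is not None else domain_brand_id
    if resolved is None:
        raise ValueError("no brand available from token or domain")  # assumed behavior
    # 3. Hard-reject on mismatch, regardless of MULTI_BRAND_ENFORCEMENT mode.
    if resolved != subject_brand_id:
        raise RecoveryBrandMismatch(
            f"resolved brand {resolved} != subject brand {subject_brand_id}"
        )
    return resolved
```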

Audit guarantees: alembic migration 0033_recovery_audit_table.py adds recovery_audit and every observable outcome (REQUEST / VERIFY / RESET success or rejection, plus RATE_LIMITED / TOKEN_INVALID / BRAND_MISMATCH typed rejections) inserts one append-only row carrying brand_id, SHA-256 hash of contact, SHA-256 hash of source IP, and a per-action JSONB metadata blob (NEVER the raw contact, token, or password). RESET commits the audit row in the SAME transaction as the password UPDATE so an audit-side failure rolls back the recovery write.

PII / OTP-leakage hardening (Codex P1-#5, P1-#6, T1-D-C4): every SMS path now logs contact_hash=<sha256_first_8_hex> only and emits recovery_sms_delivery_failed_total{service} / recovery_email_unprovisioned_total{service} for operator visibility. RECOVERY_EXPOSE_TOKEN_FOR_TESTS=on is the only escape hatch and is boot-guarded against any production-grade RGB_ENV / APP_ENV / ENVIRONMENT label. T4-D-I5 added per-contact rate-limit fail-closed in production via recovery_rate_limit_redis_outage_total{bucket}.
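A sketch of the contact_hash log field, assuming UTF-8 encoding and no normalization of the contact before hashing (the services may normalize differently); the helper names and the log line layout are illustrative.

```python
import hashlib


def contact_hash(contact: str) -> str:
    """First 8 hex characters of SHA-256(contact), for log correlation only."""
    return hashlib.sha256(contact.encode("utf-8")).hexdigest()[:8]


def sms_failure_log_line(contact: str, service: str) -> str:
    # The raw contact never appears in the line; only the truncated hash does.
    return f"event=recovery_sms_delivery_failed service={service} contact_hash={contact_hash(contact)}"
```

The truncated hash lets operators correlate log lines for one contact, but it is not a unique identifier: collisions across contacts are possible at 32 bits.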

Operator entry: every recovery-related counter has a dedicated section in diagnosis-playbook.md.

Admin operator account attribution (delivered, P1-3)

admin_service resolves the immutable operator account from the verified admin JWT username / account claim on every write route via require_operator_id in app/api/deps.py. Missing account claims hard-reject with HTTP 400. Operator identity is now a JWT-bound security boundary, not a network-only contract; the LB allow-list still exists as defence-in-depth.

The dedicated playbook is in diagnosis-playbook.md under operator account attribution. Missing account claims indicate a token issuer bug and should be fixed before writes are exposed.

Audit row tamper-evidence (delivered, alembic 0032)

admin_audit is DB-level append-only as of alembic migration 0032. Two BEFORE triggers on the table unconditionally raise:

trg_admin_audit_no_update_delete -- BEFORE UPDATE OR DELETE, FOR EACH ROW. Catches both single-row and bulk UPDATE/DELETE FROM admin_audit statements; PG fires the trigger per row visited so the first row aborts the whole statement.
trg_admin_audit_no_truncate -- BEFORE TRUNCATE, FOR EACH STATEMENT. PG forbids row-level TRUNCATE triggers, so this is its own DDL.

Both triggers call rgb.admin_audit_append_only(), which raises EXCEPTION 'admin_audit is append-only; <op> not permitted on rgb.admin_audit'. The exception is intentionally loud.

admin_audit_write_failed_total{route} pages on any non-zero rate; the route handlers in admin_service/app/services/admin_audit.py emit it whenever the post-money-write admin_audit INSERT fails, which after migration 0032 means either a transient DB error or the append-only trigger firing on a misbehaving caller. See the dedicated entry in diagnosis-playbook.md.

Operator interpretation: any "missing audit row" alarm or any log line containing admin_audit is append-only means the trigger fired, which is itself a security event. The expected steady-state rate of these errors is zero. A non-zero rate means either:

An attacker / compromised service is attempting to delete prior audit rows -- treat as an active intrusion.
A migration / data-cleanup script is incorrectly trying to UPDATE or DELETE from admin_audit -- still a security event; page the platform on-call before retrying anything.

Wallet ledger + balance log tamper-evidence (delivered, alembic 0036)

wallet.wallet_ledger (per-bucket money-movement audit trail) and rgb.player_balance_log (legacy compatibility ledger) are DB-level append-only as of alembic migration 0036. The trigger posture mirrors 0032's admin_audit shape:

trg_wallet_ledger_no_update_delete + trg_wallet_ledger_no_truncate call wallet.wallet_ledger_append_only() and raise EXCEPTION 'wallet_ledger is append-only; <op> not permitted on wallet.wallet_ledger'.
trg_player_balance_log_no_update_delete + trg_player_balance_log_no_truncate call rgb.player_balance_log_append_only() and raise the analogous exception.

Each function lives in the same schema as the table it guards (wallet vs. rgb) so cross-schema permission review stays simple.

Operator interpretation: any log line containing wallet_ledger is append-only or player_balance_log is append-only is a security event with the same handling as the admin_audit is append-only case above -- page the platform on-call before retrying anything. Steady-state rate is zero.

Recovery audit table (delivered, alembic 0033)

recovery_audit is created by alembic migration 0033 and consumed by both the player and agent recovery routes. Every recovery outcome inserts a typed append-only row. See "Recovery flow brand isolation" above for the full contract; the dedicated counters (recovery_sms_delivery_failed_total, recovery_email_unprovisioned_total, recovery_rate_limit_redis_outage_total) all have entries in diagnosis-playbook.md.

Retired native-callback DLQ implementation

The former native callback resolver and independent DLQ transaction are no longer part of the runtime. Migration 0030 remains in the immutable Alembic chain, but its table is historical and has no writer or replay command.

Tier 1 hardening sweep (delivered, T1-D-C1 .. T1-D-C5)

T1-D-C1: gateway brand-aware session key lookup. The Phase 4E dual-read fallback and its gateway_session_legacy_key_used_total metric are retired; gateway_jwt_session_missing_total remains the JWT-valid-but-session-gone signal.
T1-D-C2 is historical: the former VERIFY_CALLBACKS switch and native provider routes were removed. Aggregator relay verification is mandatory.
T1-D-C3: AES PII boot guard (assert_aes_pii_keys_configured) + removal of silent plaintext fallback + pii_aes_unconfigured_total{service,op}.
T1-D-C4: registration + agent auth log redaction with RECOVERY_EXPOSE_TOKEN_FOR_TESTS boot-guard escape hatch.
T1-D-C5 / I6: Plisio HMAC constant-time compare + replay protection (plisio_callback_replay_blocked_total{status} / plisio_callback_replay_redis_outage_total{env}) + log redaction.

Every counter has a per-counter playbook entry in diagnosis-playbook.md.

Tier 4 hardening sweep (delivered, T4-D-I2 .. T4-D-I7)

T4-D-I2: PER_CALLER_TOKEN_REQUIRED gate + internal_caller_token_legacy_rejected_total{consumer,caller}.
T4-D-I3: admin_audit rows on agent-mutating + sensitive player admin routes + agent_balance_legacy_write_total{change_type} for the legacy direct-write surface.
T4-D-I4: gateway JWT verifier hot-reload via Redis pub/sub on jwt_public_keys_changed + gateway_jwt_unknown_kid_total.
T4-D-I5: recovery /request per-contact bucket fail-closed in production + recovery_rate_limit_redis_outage_total{bucket}.
T4-D-I7: admin WebSocket token via Sec-WebSocket-Protocol subprotocol + per-IP rate limit + JWT-exp lifetime gate + admin_ws_legacy_query_token_total / admin_ws_rate_limited_total{reason}.

Every counter has a per-counter playbook entry in diagnosis-playbook.md.

Tier 6 hardening sweep (delivered, T6-B1 .. T6-E4)

T6-B1: admin_service migrated off legacy HS256 to RS256 admin JWTs; JWT_PRIVATE_KEY, JWT_PUBLIC_KEYS (JSON {kid: pem}), and JWT_KID are now required env vars. T7-B4 follow-up adds the assert_jwt_public_keys_configured boot guard so production refuses to boot when JWT_PUBLIC_KEYS is empty (otherwise decode_admin_token would fall back to the in-process test-keypair singleton and silently accept forged tokens).
T6-B2: brand_id propagation from admin → wallet / rolling / promotion / recon. Admin's outbound clients now stamp X-Brand-Id + X-Brand-Signature so downstream brand-isolation checks see the right brand.
T6-C1: 5 missed admin_audit-emitting agent routes covered (parity with T4-D-I3 surface). Audit is written AFTER the money / state mutation succeeds; failure surfaces via admin_audit_write_failed_total{route}.
T6-C2: agent_brand seeded explicitly on create_agent (alembic 0028 dropped the autoseed trigger that previously did this in the DB). T7-A3 closed the follow-on agent_domain brand_id gap introduced in this same change.
T6-C3: append-only triggers extended to wallet_ledger + player_balance_log via alembic 0036; the operator-facing entry is "Wallet ledger + balance log tamper-evidence" above.
T6-D1: gateway JWT verifier hot-reload now refreshes from Settings (not just env) on the jwt_public_keys_changed Redis channel so a key rotation actually takes effect without a gateway restart.
T6-D2: gateway per-IP rate limiter fails CLOSED on Redis outage in production-grade runtimes (HTTP 503 + Retry-After); dev/test remains fail-open. The new gateway_rate_limit_redis_outage_total{env, reason} counter has a dedicated playbook entry in diagnosis-playbook.md.
T6-E1: BrandContext now rejects X-Brand-Id <= 0 (and any non-numeric value) as missing. Callers that need to operate without a brand must omit the header entirely; sending X-Brand-Id: 0 no longer silently creates a "brand 0" write.
T6-E2: strict brand_id resolver in admin paths (adjust_player + player_finance) -- int(brand_id or 0) removed; missing brand surfaces a clear envelope to the operator instead of a split-brain wallet+audit row. T7-B1 hoisted the helper into app/services/brand_helpers.py so both files share it.
T6-E4: BRAND_AWARE_PUBLISHING_BEGINS sentinel removed from wallet + player outbox pollers. T7-B2 finished the cleanup -- agent + rolling pollers no longer emit either, the rgb_contracts.events.base constant + helper are dropped, and consumers compare against an inline literal so the defensive skip survives without a contracts-side dependency.
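The T6-E1 header rule above can be sketched as a small parser. parse_brand_id is a hypothetical name and this is not the BrandContext implementation; it only illustrates treating zero, negative, and non-numeric values the same as an absent header.

```python
from typing import Optional


def parse_brand_id(header_value: Optional[str]) -> Optional[int]:
    """Return a positive brand id, or None when the header counts as missing."""
    if header_value is None:
        return None
    text = header_value.strip()
    # ASCII digits only: rejects "", "-5", "1.0", "abc" and non-ASCII digit forms.
    if not text.isascii() or not text.isdigit():
        return None
    value = int(text)
    return value if value > 0 else None
```

Returning None for X-Brand-Id: 0 is what prevents the silent "brand 0" write: the caller falls into the same missing-header path as a request with no header at all.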

Tier 8 hardening sweep (delivered, T8-A .. T8-D)

T8-A: comprehensive grep sweep closed every brand_id INSERT gap that the prior iterations missed. Reach was wider than T6/T7 (raw text() + SQLAlchemy insert(<Model>) + dotted/relative table refs); the fixes landed in agent_service legacy + create_agent paths (commit 7b2e58e8) and a follow-on grep sweep (commit 03748bf8).
T8-B is historical: it hardened the former account-prefix callback resolver. The entire native resolver was later removed by the Aggregator-only cutover; current identity comes only from signed relay context plus authoritative player_service verification.
T8-C: Hotfix — admin deps were passing only secret_key (not the full Settings) to decode_admin_token, breaking the RS256 resolver and returning 401 on every production admin call once the T6-B1 RS256 migration shipped. The two-line fix at admin_service/app/api/deps.py is committed at fed36449. Test fixtures were already passing the full Settings, which is why this surfaced in production rather than CI; T8-C also adds a regression test that uses a Settings-less call to catch the pattern in CI.
T8-D: admin BrandResolutionMiddleware observability + skip-on-liveness -- the silent Redis exception path in admin's brand resolver now records brand_resolution_failed_total{service="admin_service", reason="redis_error"}, and /health / /healthz / /docs / /openapi.json / /redoc skip the Redis hop entirely so a tight probe schedule cannot amplify a Redis brownout into observable route latency.

Tier 9 brand_id INSERT sweep (delivered, T9-A .. T9-G)

T9 is the last-mile audit on every brand-scoped INSERT in both the v2 service tree and the legacy servers/ tree. The companion audit baseline 2026-04-28-brand-id-insert-audit.md captures the classifier inputs from migration 0024a plus the brand-nullable-by-design exemptions from 0037 so that re-running the documented audit commands any time the codebase changes will reproduce the steady-state result RAW FIX=0, FIX-TEST=0, SA FIX=0, SA FIX-TEST=0.

T9-A: the audit baseline -- exhaustive grep + classifier results were captured in docs/runbooks/multi-brand/2026-04-28-brand-id-insert-audit.md. Initial deltas: RAW FIX=55, SA FIX=3, FIX-TEST=6.
T9-B / T9-B-cleanup: admin_service back-office CRUD INSERTs closed across promotion.py, player.py, player_finance.py, agent.py, banking.py, finance_config.py, and system.py using the new shared app/api/brand_ctx.py helper (resolve_brand_id + fail-loud brand_required_envelope, status=54). Test fixture in test_money_admin_audit.py now installs BrandResolutionMiddleware and defaults X-Brand-Id=1 in the shared auth header helper.
T9-C: per-brand iteration in 4 admin scheduled tasks (stat.py::player_stat_acc / agent_stat_day, tag.py::_handle_tagging, expire.py::expire_coupons, check_expired.py::check_deposit / check_coupon). brand_id is sourced directly from the per-row source (p.brand_id for stats, (SELECT brand_id FROM player WHERE guid = :pid) subquery for tags, RETURNING brand_id from the UPDATE for expirations).
T9-D: agent_service legacy /coupon/grant + /coupon/recycle (3 INSERT sites in legacy_routes.py) and the local-dev bootstrap_local_data.py (4 sites: bank, bank_card_group, bank_card, agent_setting -- hard-coded brand=1 with dev-only rationale comments).
T9-E: relaxed shooter_sms.brand_id to nullable via alembic 0037 (the recon ingest path receives raw SMS payloads BEFORE any brand correlation is possible; brand_id is rewritten at match-time from the matched player_deposit). A new fail-loud helper recon_store.assert_sms_brand_after_match raises if the match-time UPDATE fails to land brand_id. Closed the 3 SA FIX sites (wallet_service::_log, player_service::PlayerLogin, rolling_service::PlayerRollingLog) by sourcing brand_id from request brand with player/rolling fallback. Closed 2 promo config sites (lossback_config, rebate_config) with the status=54 fail-loud envelope.
T9-F: agent coupon cross-brand attribution -- both coupon.py::grant_coupon and recycle_coupon now hard-reject the request with a per-player brand check (token.brand_id != player.brand_id) BEFORE any UPDATE / INSERT lands. The agent aggregate is brand-global; the signed token is the request's authority for brand scope. Regression tests pin the refusal so a future regression surfaces at unit-test time.
T9-G: docs sync (this entry) + classifier hardening so docstring / regex-literal mentions of INSERT INTO ... stop being mis-classified as production sites. The classifier's new BRAND_NULLABLE_BY_DESIGN set documents the 0037 exception so re-running it never re-flags shooter_sms.
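The two fail-loud guards described in T9-B and T9-F can be sketched as follows. This is a minimal illustration, not the code in app/api/brand_ctx.py or coupon.py: the function signatures, the envelope field names, the route strings, and the use of PermissionError for the cross-brand refusal are all assumptions. Only the helper names, the X-Brand-Id header, status=54, and the token.brand_id != player.brand_id comparison come from the entries above.

```python
from typing import Mapping, Optional

BRAND_REQUIRED_STATUS = 54  # status value named in T9-B


def resolve_brand_id(headers: Mapping[str, str]) -> Optional[int]:
    """Return the positive integer in X-Brand-Id, or None if absent or malformed."""
    raw = headers.get("X-Brand-Id")
    if raw is None:
        return None
    try:
        brand_id = int(raw)
    except ValueError:
        return None
    return brand_id if brand_id > 0 else None


def brand_required_envelope(route: str) -> dict:
    """Fail-loud response body (field names are assumed for illustration)."""
    return {
        "status": BRAND_REQUIRED_STATUS,
        "message": f"Brand context required for {route}",
    }


def create_tag(headers: Mapping[str, str], name: str) -> dict:
    """Hypothetical back-office route: refuse before any INSERT without a brand."""
    brand_id = resolve_brand_id(headers)
    if brand_id is None:
        return brand_required_envelope("player/tag/create")
    # A real route would INSERT here with brand_id bound as a parameter.
    return {"status": 0, "data": {"brand_id": brand_id, "name": name}}


def grant_coupon(token_brand_id: int, player_brand_id: int) -> str:
    """Hypothetical agent route: hard-reject before any UPDATE / INSERT lands."""
    if token_brand_id != player_brand_id:
        raise PermissionError("coupon grant refused: player belongs to another brand")
    return "granted"
```

In both guards the refusal happens before the write path is reached, which is the property the T9 regression tests pin.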

Tier 14 brand_id RW (SELECT/UPDATE/DELETE) sweep (delivered, T14-A .. T14-C)

T14 is the symmetric cycle-closing batch for the read/update/delete half of the brand-leak surface. Every T6-T13 cross-review only audited INSERT statements; Codex's T14 review surfaced 8 explicit RW leaks in admin SELECT/UPDATE/DELETE statements that no prior classifier had ever looked at. T14 ships the matching audit infrastructure, closes every Codex-flagged leak, and gates Phase 16 on the new classifier reporting RW FIX=0.

T14-A: the audit baseline -- tools/audit/brand_id_rw_classifier.py (sibling of the INSERT classifier; conservative bias, ties go to FIX) plus the bounded universe doc 2026-04-29-brand-id-rw-audit.md. The classifier reads the same 122-table inventory from migration 0024a, supports the # brand-global: <reason> opt-out marker, and allow-lists admin_stat_acc / admin_stat_day (truly brand-global aggregates per data-ownership.md). Initial deltas: RW FIX=350 (235 SELECT + 88 UPDATE + 27 DELETE), FIX-TEST=3.
T14-B: closed the 10 explicit Codex-T14 sites (every active-exploit vector in admin_service back-office routes): agree_agent_withdrawal + sibling decline_ / pay_ / list_agent_withdrawals / agent_withdrawal_stats in agent_finance.py; the 13 sibling stats endpoints in stats.py (provider_stat_day, player_stat_day, agent_stat_day, promotion_stat_day, usdt_stat_day, player_provider_stat_day, rebate_stat_day, player_stat_acc, agent_stat_acc, plus the stats_top.online_count legacy SELECT COUNT(*) FROM player); set_primary_level_domain (4 SQL statements all needed brand_id pinning) + registration_list in player.py; list_rolling in player_finance.py; list_promotions in promotion.py (Codex's 9d981cc9 fixed coupons/list but missed promotions/list); plus the middleware/brand.py docstring fix to remove the stale "fall back to body.brand_ids as before" language. stats.py::stats_player_per_provider also got the table-name fix from player_provider_stat to player_provider_stat_day (the original SQL was a runtime-500 bomb). New regression suite servers_v2/admin_service/tests/test_admin_rw_brand_scope_t14.py (12 tests) pins the same-brand SQL shape, the missing-brand-context fail-loud, and the cross-brand isolation contract per route.
T14-C: docs sync (this entry) + Phase 16 gate. The RW classifier is now wired into .github/workflows/brand-id-audit.yml alongside the INSERT classifier (both must report FIX=0 for the workflow to pass) and into tools/multi_brand_backfill/phase16_dryrun.py as a hard precondition. The Makefile exposes make audit-brand-id-rw as the operator-facing wrapper.
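The classification rules named in T14-A (conservative bias, the # brand-global: <reason> opt-out marker, and the allow-listed brand-global aggregates) can be sketched roughly as below. This is a simplified stand-in, not tools/audit/brand_id_rw_classifier.py: the table set is a two-entry placeholder for the 122-table inventory, and the regexes and the classify signature are assumptions.

```python
import re

# Placeholder for the brand-scoped inventory read from migration 0024a.
BRAND_SCOPED_TABLES = {"player", "player_deposit"}
# Allow-listed brand-global aggregates named in T14-A.
BRAND_GLOBAL_TABLES = {"admin_stat_acc", "admin_stat_day"}

OPT_OUT = re.compile(r"#\s*brand-global:\s*\S+")
TABLE_REF = re.compile(r"\b(?:FROM|UPDATE|JOIN)\s+([a-z_][a-z0-9_]*)", re.IGNORECASE)
BRAND_TOKEN = re.compile(r"\bbrand_id\b")


def classify(sql: str, comment: str = "") -> str:
    """Classify one SELECT/UPDATE/DELETE statement as OK, FIX, or OOS-INTENTIONAL."""
    if OPT_OUT.search(comment):
        return "OOS-INTENTIONAL"  # explicit opt-out with a stated reason
    tables = {name.lower() for name in TABLE_REF.findall(sql)}
    scoped = (tables & BRAND_SCOPED_TABLES) - BRAND_GLOBAL_TABLES
    if not scoped:
        return "OK"  # touches no brand-scoped table
    if BRAND_TOKEN.search(sql):
        return "OK"  # statement carries a brand_id token
    return "FIX"  # conservative default: no evidence of scoping


def audit_exit_code(statements) -> int:
    """Non-zero when any statement is FIX, so a CI step can gate on it."""
    return 1 if any(classify(sql, note) == "FIX" for sql, note in statements) else 0
```

A token check like this only proves that brand_id appears in the statement, not that the predicate is correct, which is why the per-route regression suite in T14-B still matters.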

T15 supersedes the T14-D/E/F deferred-backlog plan. The 316 RW-FIX sites the T14-A baseline catalogued were swept end-to-end in Tier 15 (see the Tier 15 section below). make audit-brand-id-rw reports FIX=0 as of T15-E (commit d882362a) and stays at 0 through T16-A/B and the T16-C classifier upgrade.

Tier 15 brand_id RW exhaustive sweep (delivered, T15-A .. T15-E)

T15 closes the entire deferred RW-FIX backlog from T14. Each batch swept one chunk of the 316-site list, brand-scoped every SELECT / UPDATE / DELETE that touched a brand-scoped table, and added defence-in-depth predicates so the SQL itself refuses to scope-cross even when the authz layer is bypassed.

T15-A (005719c3) -- wallet_service SELECT/UPDATE/DELETE: every read of player, coin_transaction, player_deposit, player_withdraw, player_balance_log, bank, player_usdt, player_rolling, etc. now carries either AND brand_id = :brand_id or a JOIN-on-parent-brand predicate. The _record_rebate_transfer_stat helper was rewritten so that a cross-brand player_id collision can no longer leak another brand's account / agent_id / brand_id triple.
T15-B (e50c8c99) — promotion_service + rolling_service: brand-scoped every event/coupon/promotion/rolling read and write the two services issue. Includes the rolling FIFO progress writes (which previously ran as guid-only UPDATEs).
T15-C (ce8504ae) — player_service + agent_service + game_service + recon_service: brand-scoped the registration / login lookups, agent-tree reads, supplier callbacks, and the reconciliation cross-checks.
T15-D (684f268e) — admin_service money + finance + banking back-office routes: every player_deposit / player_withdraw / player_balance_log / bank / bank_card / level-rebate / coupon-grant SQL the back-office issues now carries the operator brand_id. _resolve_strict_brand_id is the canonical resolver for "operator OR row brand_id" decisions.
T15-E (d882362a) — admin_service residual RW + the entire FIX-TEST tail: the last admin SQL queries that lacked a brand predicate, plus every remaining test-fixture INSERT/UPDATE that was still issuing brand-less queries against the post-0024c schema. After T15-E the audit reports RW FIX=0 / FIX-TEST=0.
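The defence-in-depth predicate can be illustrated with an in-memory SQLite table. This is a sketch under stated assumptions: the three-column player table, the sample rows, and the load_player helper are hypothetical, and the production services do not use sqlite3. It shows why a guid-only lookup is unsafe when two brands can hold the same player id, and how AND brand_id = :brand_id makes the statement itself refuse to cross brands.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """Build a toy player table where two brands share the same guid."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE player (guid INTEGER, brand_id INTEGER, account TEXT)")
    conn.executemany(
        "INSERT INTO player (guid, brand_id, account) VALUES (?, ?, ?)",
        [(1001, 1, "alice@brand1"), (1001, 2, "bob@brand2")],
    )
    return conn


def load_player(conn: sqlite3.Connection, guid: int, brand_id: int) -> Optional[str]:
    """Brand-scoped read: the predicate is unconditional, never optional."""
    if brand_id is None:
        raise ValueError("Brand context required for load_player")
    row = conn.execute(
        "SELECT account FROM player WHERE guid = :guid AND brand_id = :brand_id",
        {"guid": guid, "brand_id": brand_id},
    ).fetchone()
    return row[0] if row else None
```

With the same guid, brand 1 and brand 2 each see only their own row, and a brand with no matching row gets nothing back even if an upstream authorization check were skipped.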

Tier 16 CR9 closure + classifier dynamic-SQL upgrade (delivered, T16-A .. T16-D)

T16 is the post-T15 cross-review batch. CR9 (the dedicated admin_service review with manual auditing of the back-office RW contract) surfaced 8 findings the static classifier could not catch — most notably the _query_log_table helper in admin_service/app/api/routes/logs.py whose SQL was built dynamically (WHERE 1=1 builder + f-string template) and so no static line carried a literal brand_id token to be inspected.

T16-A (a72ed1c8) — admin_service Critical + Important leaks closed: logs.py::_query_log_table made brand-required and the 4 real-table log endpoints (player_balance / agent_balance / player_login / agent_login) refuse without X-Brand-Id; agent.py::list_agents switched to JOIN agent_brand on the operator brand (per ADR-009); agent.py::create_agent enforces body.brand_ids ⊆ operator brands; player.py tag CRUD pinned to brand; finance_config.py upserts validate the level row belongs to the operator brand; system.py::update_global_var / update_settings / update_maintenance gated to super_admin via require_super_admin with a new admin_platform_global_access_total counter; player_finance.py::force_complete_rolling migrated to the canonical resolve_brand_id + brand_required_envelope + _resolve_strict_brand_id pattern; deps.py super-admin role set widened to {super_admin, super} for the legacy bo/admin token. Test count: 217 → 238 (21 new T16-A regressions).
T16-B (0e2f6fa8) -- wallet_service fail-loud brand context. Every if brand_id is not None: <append predicate> conditional in queries.py, wallet.py, and services/deposit_policy.py was converted in three places: the route boundary refuses the call with Brand context required for <route>; the helper requires brand_id: int (no Optional); and the SQL unconditionally carries AND brand_id = :brand_id. The 13 query routes, the /deposit pending+bank lookups, the validate_deposit_bonus_event helper, and the _record_rebate_transfer_stat SELECT were all converted. Plisio _resolve_manual_bonus_terms now returns zero-bonus when the deposit row is a legacy pre-backfill record without brand_id instead of issuing a brand-less event lookup.
T16-C (934d47f4) -- RW classifier dynamic-SQL upgrade. New FIX-DYNAMIC class detects the two construction patterns the static classifier missed:
1. WHERE 1=1 builders that append predicates via conditions.append(...), where the brand_id idiom may live in the helper that joins the conditions list, far from the matched text(f"...") call;
2. f-string SQL templates f"SELECT ... FROM {table} WHERE {where} ..." where both the table and the WHERE clause are templated.
FIX-DYNAMIC has the same exit-code semantics as FIX (non-zero when count > 0), so any dynamic-builder regression breaks CI. .venv/lib/... / site-packages/... paths are now treated as OOS-LEGACY so SQLAlchemy core internals don't surface in the new pass. Final classifier output: FIX=0 + FIX-DYNAMIC=0 across the repo.
T16-D (this batch) — docs sync. Bumps the LVC across multi-brand-isolation-rollout.md, 2026-04-29-brand-id-rw-audit.md, 2026-04-28-brand-id-insert-audit.md, phase-16-hard-flip-checklist.md, and diagnosis-playbook.md; documents the post-T15 RW headline (FIX=0 / OK=219+97+24 / OOS-INTENTIONAL=91+18+5); records the 119 OOS-INTENTIONAL breakdown by category (resolver helpers, scheduled tasks, brand-global tables, cross-brand admin views); calls out the 2 known false-positive FIX-TEST entries in test_agent_integration.py (assertion strings that match the production INSERT shape, not actual writes); adds the make audit-brand-id-inserts && make audit-brand-id-rw gate as a Phase-16 hard-flip precondition; records the dynamic-SQL classifier upgrade in the admin_service service doc.
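The two FIX-DYNAMIC construction patterns from T16-C can be approximated with a pair of regular expressions. This is a rough sketch only: the regexes, the rule that any brand_id token in the same snippet clears the finding, and the function names are assumptions rather than the behaviour of the real classifier.

```python
import re
from typing import Iterable

# Pattern 1: a "WHERE 1=1" base that later code extends with conditions.append(...).
WHERE_BUILDER = re.compile(r"WHERE\s+1\s*=\s*1", re.IGNORECASE)
# Pattern 2: an f-string where both the table and the WHERE clause are templated.
FSTRING_TEMPLATE = re.compile(
    r"""f["'][^"']*\bFROM\s+\{[^}]+\}[^"']*\bWHERE\s+\{[^}]+\}""",
    re.IGNORECASE,
)


def is_fix_dynamic(source: str) -> bool:
    """Flag dynamically built SQL that shows no brand_id token in the same snippet."""
    dynamic = bool(WHERE_BUILDER.search(source) or FSTRING_TEMPLATE.search(source))
    return dynamic and "brand_id" not in source


def count_fix_dynamic(snippets: Iterable[str]) -> int:
    """Count flagged snippets (for example, one snippet per function body)."""
    return sum(1 for snippet in snippets if is_fix_dynamic(snippet))


def exit_code(count: int) -> int:
    """Same semantics as FIX: non-zero when count > 0."""
    return 1 if count > 0 else 0
```

Scanning per function body rather than per line is the point of the upgrade: in the _query_log_table case no single line carried a literal brand_id token, so a line-level check had nothing to inspect.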

Known follow-ups

These items were identified during the cross-review but are intentionally out of scope of the in-progress rollout work. Each is tracked here so the spec coverage is honest and the next iteration can pick them up directly.

Future audit-row hardening

admin_audit and recovery_audit are append-only at the DB level via 0032 / 0033. The remaining hardening still on the table (not blocking Phase 16):

Restrict the application role to INSERT-only on admin_audit via a separate DB role (defence-in-depth on top of the trigger).
Hash-chain rows (prev_hash, row_hash) so any out-of-band catalog manipulation that bypasses the triggers is detectable on read.
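The hash-chain idea can be sketched in a few lines. This is an illustration of the proposed follow-up, not a delivered design: SHA-256, the all-zero genesis value, the canonical-JSON serialization, and the in-memory list standing in for admin_audit are all assumptions. Only the prev_hash and row_hash column names come from the item above.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # assumed prev_hash for the first row


def compute_row_hash(prev_hash: str, payload: Dict) -> str:
    """Hash the previous row's hash together with a canonical form of this row."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_row(chain: List[Dict], payload: Dict) -> None:
    """Append a row whose prev_hash points at the current tail of the chain."""
    prev_hash = chain[-1]["row_hash"] if chain else GENESIS_HASH
    chain.append(
        {
            "payload": payload,
            "prev_hash": prev_hash,
            "row_hash": compute_row_hash(prev_hash, payload),
        }
    )


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every link; any edited, removed, or reordered row breaks it."""
    prev_hash = GENESIS_HASH
    for row in chain:
        if row["prev_hash"] != prev_hash:
            return False
        if row["row_hash"] != compute_row_hash(prev_hash, row["payload"]):
            return False
        prev_hash = row["row_hash"]
    return True
```

Verification on read detects a row changed out of band even if the append-only triggers were bypassed, because every later row_hash depends on the rows before it.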

Operator audit trail for `MULTI_BRAND_ENFORCEMENT` flips

The cluster-wide MULTI_BRAND_ENFORCEMENT flip is the single most security-impactful operator action on the platform. As of the P1-5 hardening, flip_enforcement.py requires two new arguments:

--operator-id -- identity of the operator running the flip (no default; suggested format: email or LDAP id). Recorded in admin_audit.operator_id.
--reason -- human-readable justification (no default; suggested format: ticket id plus a short sentence). Recorded in admin_audit.new_value.reason.

Every invocation of the script (forward, rollback, dry-run) writes exactly one admin_audit row with action='ENFORCEMENT_MODE_FLIP', resource='MULTI_BRAND_ENFORCEMENT', resource_key=<env>, and new_value carrying the env, direction, target_mode, services and reason. The script reuses the same INSERT shape as servers_v2/admin_service/app/services/admin_audit.py.

The --no-audit escape hatch is reserved for offline rehearsals where no database is reachable; using it during a real flip is a process violation. If DATABASE_URL is unreachable and --no-audit is not passed, the script exits 3 (audit row write failed) rather than printing the plan and silently succeeding -- the audit trail is the whole point of wiring this in.

Auditing the audit trail: query SELECT * FROM admin_audit WHERE action = 'ENFORCEMENT_MODE_FLIP' ORDER BY created_at DESC to see every flip generated against the target environment. Combined with the append-only triggers above, this is the canonical record for "who flipped enforcement, when, and why" -- and it cannot be retroactively edited.
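The argument and exit-code contract described above can be sketched with argparse. This is a hypothetical reduction, not flip_enforcement.py itself: the real script's env, direction, target-mode, and services arguments are omitted, write_audit_row is an assumed callback standing in for the database INSERT, and ConnectionError stands in for an unreachable DATABASE_URL.

```python
import argparse
from typing import Callable, Dict, List

EXIT_OK = 0
EXIT_AUDIT_WRITE_FAILED = 3  # exit code named in the runbook


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="flip_enforcement.py")
    # No defaults: argparse exits with a usage error when either flag is missing.
    parser.add_argument("--operator-id", required=True, help="email or LDAP id")
    parser.add_argument("--reason", required=True, help="ticket id plus a short sentence")
    parser.add_argument("--no-audit", action="store_true", help="offline rehearsals only")
    return parser


def run(argv: List[str], write_audit_row: Callable[[Dict], None]) -> int:
    """Write exactly one audit row per invocation, or fail with exit code 3."""
    args = build_parser().parse_args(argv)
    if args.no_audit:
        return EXIT_OK  # rehearsal path: nothing is written
    row = {
        "action": "ENFORCEMENT_MODE_FLIP",
        "resource": "MULTI_BRAND_ENFORCEMENT",
        "operator_id": args.operator_id,
        "new_value": {"reason": args.reason},
    }
    try:
        write_audit_row(row)
    except ConnectionError:
        return EXIT_AUDIT_WRITE_FAILED  # never succeed silently without the audit row
    return EXIT_OK
```

Passing the writer in as a callback keeps the sketch testable without a database; the failure path returns 3 rather than falling through to success, matching the behaviour described for an unreachable database.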

Retired game callback DLQ table migration

Alembic migration 0030 (game_callback_dead_letter) is retained so existing databases and downgrade history remain valid. The Aggregator-only runtime has no writer or replay tool for this table. It must not be treated as a recovery queue; see “Retired game callback DLQ procedure” above.
