Multi-Brand Isolation Rollout Runbook

Status

Ready

Last Verified Commit

934d47f4 (T16-C classifier upgrade; T15-A..E swept the deferred RW backlog so make audit-brand-id-rw reports FIX=0 + FIX-TEST=0; T16-A closed the CR9 admin_service Critical+Important findings; T16-B fail-loud-converted every conditional brand predicate in wallet_service; T16-C extended the classifier to detect dynamic-SQL helpers; T16-D doc sync — this entry)

Date

2026-04-29

Owners

  • Platform Backend
  • Player Domain
  • Wallet Domain
  • Agent Domain

Affected Services

  • gateway
  • player_service
  • wallet_service
  • wallet_worker
  • rolling_service
  • rolling_worker
  • promotion_service
  • promotion_worker
  • game_service
  • agent_service
  • agent_worker
  • admin_service
  • admin_worker
  • recon_service
  • recon_worker
  • docs/adr/ADR-009-multi-brand-domain-routed-isolation.md
  • docs/specs/multi-brand/2026-04-27-multi-brand-isolation-spec.md
  • docs/plans/multi-brand/2026-04-27-multi-brand-isolation-plan.md
  • docs/runbooks/local-docker/local-docker-full-stack.md
  • docs/runbooks/wallet/ruby-wallet-topology-rollout.md

On-call TL;DR

This rollout enables multi-brand isolation on a previously single-brand platform. The biggest risks are:

  • a Phase 16 enforcement flip that locks legitimate users out of one or more brands (revert per Phase 16 mid-flip procedure)
  • a brand-resolution failure that misroutes money (use Diagnosis Playbook below; flip the affected service to observe to buy time)
  • a game-callback brand-unresolved failure (callbacks land in DLQ; no money loss but provider-side state may diverge until DLQ replayed)

Who to page: wallet domain owner for any wallet-related alert; agent domain owner for agent_brand allow-list issues; platform on-call for everything else.

Where to look:

  • Per-brand request rates and rejection counters in the multi-brand-isolation Grafana dashboard
  • Per-service enforcement mode via multi_brand_enforcement_mode gauge
  • Live brand catalog: psql -c "SELECT brand_id, brand_code, status FROM brand" (canonical source); Redis KEYS domain:agent:* and KEYS domain:level:* for active domain bindings

Goal

Validate and roll out multi-brand isolation safely. After this rollout the platform can host more than one brand on the same servers_v2 runtime, with brand-scoped player identity, wallet state, rolling, settlement, and configuration.

This runbook is not approved for production enablement of a second brand until the related spec, implementation plan, tests, and migrations are complete and the release gate below passes.

Preconditions

  • ADR-009 is Accepted.
  • Spec status is Approved.
  • Implementation plan done definition is satisfied.
  • Migrations are reviewed and reversible for non-production environments.
  • A default brand has been seeded; every existing row has been backfilled with brand_id = default.
  • An agent_brand row of the form (agent_id, default_brand_id, status='enabled') has been seeded for every existing agent row; without it Phase 16 hard-flip will reject every existing agent.
  • The Redis maps domain:agent:* and domain:level:* have been rewritten to bind every existing domain to the default brand.
  • Worker health reflects forward progress per ADR-003.
  • Wallet ledger and outbox failures surface per ADR-004.
  • Alerting exists for every counter on which the Phase 16 release gate or an in-runbook alarm depends. Every name below maps to an emitted counter in servers_v2/shared/contracts/src/rgb_contracts/infra/ or per-service code; full per-counter playbooks live in diagnosis-playbook.md.
    • Brand resolution / forwarding: brand_resolution_failed_total{reason,service} (the service="gateway" slice is the canonical edge alarm), agent_brand_not_allowed_total, recovery_brand_mismatch (envelope reason; surfaces via brand_resolution_failed_total{reason="recovery_brand_mismatch"}).
    • Wallet brand-signature defence-in-depth: wallet_brand_signature_failed_total{caller_service,reason,mode}, wallet_brand_signature_missing_total{caller_service}, wallet_brand_signature_misconfigured_total{mode} (P1-4), wallet_brand_signature_replay_blocked_total (failure-reason slice of the previous), wallet_brand_signature_replay_redis_outage_total{caller_service,mode} (delivered P2-δ).
    • Wallet topology: wallet_topology_default_brand_unprimed_total (delivered P2-δ; pages on any non-zero rate; every increment is a hard runtime failure), wallet_topology_default_brand_fallback_total{caller_origin} (per-brand-onboarding signal).
    • Cross-brand rejections (per service): wallet_cross_brand_rejected_total{command,mode}, agent_cross_brand_rejected_total{command,mode}, game_cross_brand_rejected_total{command,mode}, promotion_cross_brand_rejected_total{command,mode}, recon_cross_brand_rejected_total{command,mode}, rolling_cross_brand_rejected_total{command,mode}.
    • Game-callback edge: game_callback_brand_unresolved_total{provider}, game_callback_brand_unresolved_total{reason="raw_account_in_enforce"} and {reason="raw_guid_in_enforce"} (P1-3 brand-unaware fallback gate -- both must be zero before flip), game_callback_verification_bypassed_total{provider} (T1-D-C2; must be zero in production).
    • Per-caller internal-token rollout: internal_caller_token_legacy_total{consumer,caller} (Stage C drain signal), internal_caller_token_legacy_rejected_total{consumer,caller} (T4-D-I2; pages on any non-zero rate once PER_CALLER_TOKEN_REQUIRED=on).
    • Event schema drain: event_legacy_schema_total{stream,consumer} (must be zero before flip).
    • Gateway session + JWT: gateway_jwt_session_missing_total (T1-D-C1; baseline + spike alarm after the Phase 4E dual-read fallback removal), gateway_jwt_unknown_kid_total (T4-D-I4; tracks RS256 kid rotation pub/sub propagation).
    • Admin operator-identity + audit: admin_operator_id_mismatch_total{reason} (P1-3; pages on any non-zero rate -- spoof or legacy token), admin_audit_write_failed_total{route} (P1-4; pages on any non-zero rate -- money committed without audit row), agent_balance_legacy_write_total{change_type} (T4-D-I3; baseline non-zero pre-migration to wallet_service).
    • Admin WebSocket auth (T4-D-I7): admin_ws_legacy_query_token_total (cutover trend, must be zero before flipping admin_ws_allow_query_token=False in production), admin_ws_rate_limited_total{reason}.
    • PII boot guard (T1-D-C3): pii_aes_unconfigured_total{service,op} (pages on any non-zero rate in production -- the boot guard refuses to start prod-grade pods, so a non-zero prod sample is a high-severity alert).
    • Plisio webhook hardening (T1-D-C5/I6): plisio_callback_replay_blocked_total{status}, plisio_callback_replay_redis_outage_total{env}.
    • Recovery flow (Codex P1-#5/#6, T4-D-I5): recovery_sms_delivery_failed_total{service}, recovery_email_unprovisioned_total{service}, recovery_rate_limit_redis_outage_total{bucket} (T4-D-I5; pages on any non-zero rate when no concurrent Redis incident).
    • Enforcement-mode posture: security_downgrade_total{service} (page; service started in non-enforce mode while >1 brand enabled), multi_brand_enforcement_mode{service} gauge.
    • Per-brand traffic: per-brand wallet outbox publication freshness (existing wallet outbox alert, scoped per brand label) and per-brand request_total{brand_code,service} zero-traffic alert.
  • Game provider operations have signed off on the outbound account namespace change for every supported provider.

Local Docker Validation

  1. Start the full local stack using the local Docker runbook.
  2. Apply multi-brand migrations and run the default brand backfill.
  3. Confirm brand, brand_config, and agent_brand exist and that every brand-scoped table has zero NULL brand_id rows.
  4. Seed a second brand brand2 with a distinct brand_code, a distinct default currency, and at least one bound domain.
  5. Bind one domain:agent:* and one domain:level:* Redis entry to brand2.
  6. Confirm gateway, player_service, wallet_service, rolling_service, promotion_service, game_service, agent_service, admin_service, and recon_service health is green.
  7. Confirm every worker reports forward-progress readiness.
  8. Run unit and contract tests for:
    • shared brand contracts
    • migrations and backfill
    • gateway brand resolution and JWT/domain mismatch
    • player_service cross-brand registration and recovery
    • agent_service allow-list enforcement
    • wallet_service cross-brand command rejection per command
    • wallet_service per-brand topology and policy resolution
    • rolling_service brand-scoped consumption and progress
    • promotion_service per-brand resolution
    • game_service outbound namespacing and inbound reverse-parse per provider
    • recon_service brand-scoped match and approval
    • admin_service brand catalog, brand_config, agent_brand CRUD
    • admin_service staff removal (the seven deleted legacy route files no longer mount any router; previously served paths under /api/admin/user/*, /api/admin/pushbullet/*, /api/admin/shooter/*, /api/admin/web/rules/*, /api/admin/web/faq/*, /api/admin/web/config/* return 404)
    • observability log fields and metric labels
  9. Run two-brand local end-to-end flows:
    • register the same account string under default and brand2, confirm two distinct player_id values
    • login on default, attempt a request with that JWT against brand2's domain, confirm gateway rejects it
    • deposit, bet, settle, rolling, withdraw on each brand independently, confirm no cross-brand state leak
    • issue a coupon on brand2, use it on brand2, attempt the same coupon code on default, confirm rejection
    • send a game provider callback for brand2, confirm the outbound account is {brand_code}_{account} and the callback reverse-parses into brand2
    • run a recon match against a brand2 deposit, confirm the approval resolves into brand2 only
    • change a brand_config value on brand2 (e.g. cashback rate) and confirm brand2 settlements use the new value while default settlements continue to use the global default
  10. Confirm every money movement has immutable wallet_ledger rows carrying brand_id.
  11. Confirm wallet, rolling, promotion, and player outbox events carry brand_id in payload.
  12. Confirm structured logs and Prometheus metrics carry brand labels.
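
The outbound namespacing and inbound reverse-parse checked in step 9 can be illustrated with a minimal sketch. The function names and the longest-prefix rule are assumptions made for illustration; the actual game_service implementation is not shown in this runbook and may differ.

```python
from typing import Iterable, Optional, Tuple


def namespace_account(brand_code: str, account: str) -> str:
    """Build the outbound provider account as {brand_code}_{account}."""
    return f"{brand_code}_{account}"


def reverse_parse_account(
    namespaced: str, known_brand_codes: Iterable[str]
) -> Optional[Tuple[str, str]]:
    """Return (brand_code, account), or None when no known brand prefix matches.

    Assumption: the longest matching brand code wins, so a code containing an
    underscore is not shadowed by a shorter code that is also a prefix of it.
    """
    for code in sorted(known_brand_codes, key=len, reverse=True):
        prefix = f"{code}_"
        if namespaced.startswith(prefix) and len(namespaced) > len(prefix):
            return code, namespaced[len(prefix):]
    return None  # raw, brand-unaware account: the caller decides how to reject
```

For example, reverse_parse_account("brand2_alice_e2e", ["default", "brand2"]) yields ("brand2", "alice_e2e"), while a raw "alice" yields None. Prefix matching on its own cannot separate brand codes that are prefixes of one another followed by an underscore, which is one reason the account namespace collision case is in the pen-test scope.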

Running the two-brand E2E harness

Phase 14 ships a self-contained pytest harness that orchestrates the local two-brand validation step (item 9 above) against a running servers_v2 Docker stack.

Layout:

  • Seed script: servers_v2/tools/multi_brand_backfill/setup_local_two_brand.py -- idempotent. Ensures default and brand2 exist, ensures the deterministic seed agent exists, seeds the standard brand2 brand_config keys with cashback_rate=0.05 so per-brand overrides are visible, seeds agent_brand, brand-scoped agent_setting, and agent_domain for (agent_id=1, default) and (agent_id=1, brand2), and writes the two domain:agent:* Redis bindings:

    domain:agent:default.local -> {"brand_id": <default>, "agent_id": 1}
    domain:agent:brand2.local -> {"brand_id": <brand2>, "agent_id": 1}
  • Harness: servers_v2/tests/test_multi_brand_two_brand_e2e.py. Gated behind both the local_docker_e2e pytest marker AND LOCAL_DOCKER_E2E=1. A normal uv run pytest invocation skips the whole module (the pure-logic helpers in tests/test_setup_local_two_brand_unit.py continue to run).

Prerequisites:

  • servers_v2 docker compose stack is up and healthy (docker compose -f docker-compose.yml -f docker-compose.dev.yml ps shows every service healthy).
  • The seed script can run against a migration-only local stack. It also works against an already bootstrapped stack; existing seed rows are upserted or left in place. tools/multi_brand_backfill/verify.py must exit 0 after the seed (the harness re-runs it as a session-scoped guard).
  • Host name resolution: any browser-side test would need default.local and brand2.local in /etc/hosts pointing at 127.0.0.1. The harness uses the Host: request header instead so no /etc/hosts editing is required.
  • MULTI_BRAND_ENFORCEMENT=observe set explicitly for this harness. The repository's compose default is now enforce; Phase 14's local two-brand harness still validates the observe-mode soak behavior. Do NOT run this harness with enforce set.
  • RGB_JWT_KID, RGB_JWT_PRIVATE_KEY, and RGB_JWT_PUBLIC_KEYS are set in .env.compose.local; login and admin-token checks are part of the harness. Local env files may use escaped \n PEM newlines.
  • NO_PROXY includes localhost,127.0.0.1,::1 so Python/httpx does not send local Docker traffic through a macOS or corporate HTTP proxy.

Run it (preferred; this is the same recipe that make verify-servers-v2-two-brand executes, reproduced here so an operator can step through it manually):

cd servers_v2
export RGB_MULTI_BRAND_ENFORCEMENT=observe
docker compose -f docker-compose.yml -f docker-compose.dev.yml --env-file .env.compose.local up -d
# Load .env.compose.local into the current shell. Do NOT use
# ``set -a; . ./.env.compose.local; set +a`` — if your local
# PEM blocks are unescaped multi-line, the shell parses
# ``-----BEGIN PRIVATE KEY-----`` as a command and the source
# aborts. ``tools/dump_env_compose.py`` (BEGIN/END-aware Python
# parser) emits shell-safe export lines for both shapes.
eval "$(uv run python -m tools.dump_env_compose .env.compose.local)"
export DATABASE_URL="postgresql+psycopg://${RGB_POSTGRES_USER}:${RGB_POSTGRES_PASSWORD}@localhost:5433/${RGB_POSTGRES_DB}"
export REDIS_URL=redis://localhost:6381
export NO_PROXY=localhost,127.0.0.1,::1
export no_proxy=localhost,127.0.0.1,::1
# E2E harness needs to know which caller token to attach.
export E2E_INTERNAL_CALLER_SERVICE=gateway
export E2E_INTERNAL_SERVICE_TOKEN="$RGB_INTERNAL_SERVICE_TOKEN_GATEWAY"
uv run python -m tools.multi_brand_backfill.setup_local_two_brand
LOCAL_DOCKER_E2E=1 \
uv run pytest tests/test_multi_brand_two_brand_e2e.py -v -m local_docker_e2e

Expected output: 19 passed (or a small number skipped where a diagnostic endpoint is not present in the running image, e.g. the game_service internal/game/reverse-parse-account helper). Any FAIL indicates a real isolation regression -- see the relevant section of the Diagnosis Playbook below.

The harness exercises:

  • Two brands seeded with distinct brand_id and brand_code.
  • Same account string under both brands -> two distinct player_id values (UNIQUE (brand_id, account) is in effect).
  • Per-brand wallet read isolation across two players sharing one account string.
  • cashback_rate override on brand2 resolves to 0.05 while default does not see that value.
  • default_alice_e2e / brand2_alice_e2e reverse-parse to their respective brands at game_service.
  • Externally-supplied X-Brand-Id is stripped at the gateway.
  • Every legacy back-office path removed in Phase 12 returns 404 against the live admin_service.
  • At least one Prometheus metrics surface exposes a brand / brand_id / brand_code label.
  • A JWT issued for default replayed against brand2's domain does not 5xx and (best-effort) increments brand_resolution_failed_total on the gateway.
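
The X-Brand-Id stripping check in the list above can be pictured with a short sketch. It assumes the gateway attaches its own resolved X-Brand-Id when forwarding; forward_headers and STRIPPED_HEADERS are hypothetical names, not the gateway's real code.

```python
from typing import Dict, Mapping

# Assumption: X-Brand-Id is the only client-controllable brand header.
STRIPPED_HEADERS = frozenset({"x-brand-id"})


def forward_headers(incoming: Mapping[str, str], resolved_brand_id: int) -> Dict[str, str]:
    """Drop any client-supplied brand header, then attach the resolved one."""
    cleaned = {
        name: value
        for name, value in incoming.items()
        if name.lower() not in STRIPPED_HEADERS  # header names are case-insensitive
    }
    cleaned["X-Brand-Id"] = str(resolved_brand_id)
    return cleaned
```

The point of the sketch is ordering: the client value is discarded before the resolved value is set, so a spoofed header can never survive into the forwarded request.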

Staging Validation

  1. Deploy services and migrations to staging.
  2. Run the default brand backfill in staging; confirm zero NULL brand_id rows.
  3. Rewrite staging domain:agent:* and domain:level:* Redis values to the brand-aware shape; confirm every existing domain still resolves to default.
  4. Run all unit, contract, and integration suites against staging.
  5. Re-run the two-brand end-to-end flows from local Docker validation against staging.
  6. Compare wallet bucket totals with ledger-derived totals per brand.
  7. Confirm admin brand_config writes are audited.
  8. Confirm agent_brand writes are audited.
  9. Confirm topology and policy activation in staging is brand-scoped and does not affect the other brand.
  10. Confirm game provider callback reverse-parsing succeeds for every supported provider against staging traffic samples.
  11. Confirm operational dashboards show brand labels for player-, wallet-, rolling-, promotion-, and game-scoped metrics.
  12. Confirm the three brand-resolution alerts and the per-brand wallet outbox freshness alert fire in a synthetic-failure drill.
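
Step 6 (bucket totals versus ledger-derived totals per brand) reduces to a per-brand difference check. The sketch below assumes both sides have already been aggregated into {brand_code: total} mappings; how those totals are queried is not covered here, and diverging_brands is a hypothetical helper.

```python
from decimal import Decimal
from typing import Dict, Mapping


def diverging_brands(
    bucket_totals: Mapping[str, Decimal], ledger_totals: Mapping[str, Decimal]
) -> Dict[str, Decimal]:
    """Return {brand_code: bucket - ledger} for every brand whose totals differ."""
    diffs: Dict[str, Decimal] = {}
    for brand in sorted(set(bucket_totals) | set(ledger_totals)):
        delta = bucket_totals.get(brand, Decimal("0")) - ledger_totals.get(brand, Decimal("0"))
        if delta != 0:
            diffs[brand] = delta
    return diffs
```

An empty result means the check passes. A brand present on only one side is reported with its full total as the difference, which surfaces rows written under the wrong brand.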

Release Gate

Do not enable a second brand in any shared or production-like environment if any of the following are true:

  • shared brand contracts, migrations, or backfill are missing or failing
  • any brand-scoped table contains rows with NULL brand_id
  • the target environment has not progressed MULTI_BRAND_ENFORCEMENT to enforce after the soak window in observe mode (a second brand cannot be enabled while any service is still in observe or off)
  • game_callback_brand_unresolved_total{reason="raw_account_in_enforce"} OR {reason="raw_guid_in_enforce"} is non-zero across the soak window. Pre-flip every game-provider callback path must already be migrated to send the namespaced {brand_code}_{account} form; the legacy raw-account / raw-guid fallback is brand-unaware and is rejected outright in enforce mode (see Bug #3 / callback_brand.resolve_callback_brand). A non-zero counter means at least one provider still sends the raw form -- migrate the upstream caller and re-soak before flipping enforce.
  • the soak window has not been at least 2 * JWT_EXPIRE_MIN minutes (default JWT_EXPIRE_MIN = 60, so at least 120 minutes) measured from Phase 5 deploy time (the last moment a brand-unaware JWT could be issued), so that every JWT issued before brand-aware login went live has expired
  • any JWT in the active session population still lacks a brand_id claim
  • any wallet_service command path can mutate a row whose brand differs from the request brand
  • per-brand topology or policy activation can affect another brand's active document
  • per-brand topology or policy resolution returns an active document belonging to a different brand than the requesting brand
  • per-brand rolling completion ratios, cashback / rebate / lossback rates, or payment channel selection do not resolve from brand_config (with documented global default fallback) as specified
  • any game provider callback fails to reverse-parse the namespaced account
  • any brand-scoped query in any service is missing a brand_id filter
  • any path previously served by the seven deleted admin_service legacy route files still responds with anything other than 404
  • the three brand-resolution alerts or per-brand wallet outbox freshness alert are not wired
  • structured logs or Prometheus metrics are missing brand labels
  • two-brand local Docker or staging end-to-end flows fail at any step
  • the project integer in admin_service/app/tasks/tag.py is still in use anywhere
  • any global_var row whose key is duplicated by a brand_config entry still exists (the deletion list documented in plan Phase 12 has not been applied)

Rollback Trigger

Pause rollout or roll back immediately if:

  • brand_resolution_failed_total rises above the documented threshold
  • wallet_cross_brand_rejected_total becomes non-zero in normal traffic
  • game_callback_brand_unresolved_total becomes non-zero in normal traffic
  • a player can register or log in under the wrong brand
  • a wallet command mutates a row whose brand differs from the request brand
  • per-brand topology or policy activation alters another brand's active document
  • ledger writes fail or are delayed beyond the wallet release threshold
  • bucket totals diverge from ledger-derived totals for any brand
  • a game provider callback for one brand routes into a different brand's player
  • a recon approval resolves into the wrong brand

Operational Reference

Grafana dashboard

Dashboard name: multi-brand-isolation. Required panels:

  • Per-brand request rate (stacked): sum by (brand_code) (rate(request_total[1m]))
  • Brand-resolution failures (by reason): sum by (reason, service) (rate(brand_resolution_failed_total[5m]))
  • Wallet cross-brand rejections (by command and mode): sum by (command, mode) (rate(wallet_cross_brand_rejected_total[5m]))
  • Game callback brand unresolved (by provider): sum by (provider) (rate(game_callback_brand_unresolved_total[5m]))
  • Idempotency cross-brand splits: rate(wallet_idempotency_brand_split_total[5m])
  • Enforcement mode per service (gauge): multi_brand_enforcement_mode
  • Brand-resolution latency (p99): histogram_quantile(0.99, sum by (le, service) (rate(brand_resolution_latency_seconds_bucket[5m])))
  • Security downgrade events: increase(security_downgrade_total[1h])
  • Admin operator-id mismatch (P1-3, by reason): sum by (reason) (rate(admin_operator_id_mismatch_total[5m]))
  • Admin audit-write failure (P1-4, by route): sum by (route) (rate(admin_audit_write_failed_total[5m]))

Service name vs X-Caller-Service / metric label

The runbook below uses two distinct identifiers that operators must not confuse when grepping metrics, logs, or HTTP traces:

service (compose / repo) | X-Caller-Service header + caller_service metric label | env-var suffix (per-caller token)
gateway                  | gateway                                               | GATEWAY
admin_service            | admin                                                 | ADMIN
agent_service            | agent                                                 | AGENT
game_service             | game                                                  | GAME
promotion_service        | promotion                                             | PROMOTION
recon_service            | recon                                                 | RECON
rolling_service          | rolling                                               | ROLLING
player_service           | player                                                | PLAYER

When you see caller_service=admin in a Prometheus query, that is the admin_service process. There are no *_service strings on the wire or in label values — only repo / compose use the _service suffix. HMAC signature payloads, missing-signature counters, and per-caller token env-var lookups all key on the short label.
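
The mapping in the table above can be captured as a small lookup when scripting metric queries or env-var checks. This is a convenience sketch built only from the table rows; CALLER_LABELS and the helper names are hypothetical and do not exist in the repo.

```python
# Compose / repo service name -> short label used on the wire and in metrics.
CALLER_LABELS = {
    "gateway": "gateway",
    "admin_service": "admin",
    "agent_service": "agent",
    "game_service": "game",
    "promotion_service": "promotion",
    "recon_service": "recon",
    "rolling_service": "rolling",
    "player_service": "player",
}


def caller_label(service: str) -> str:
    """Return the X-Caller-Service / caller_service label for a compose service name."""
    return CALLER_LABELS[service]


def token_env_var(service: str) -> str:
    """Return the per-caller token env var name for a compose service name."""
    return f"INTERNAL_SERVICE_TOKEN_{caller_label(service).upper()}"
```

For example, token_env_var("admin_service") gives INTERNAL_SERVICE_TOKEN_ADMIN, and an unknown service name raises KeyError rather than silently producing a wrong label.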

INTERNAL_SERVICE_TOKEN_* rotation

Each consumer service (wallet_service, rolling_service, promotion_service, etc.) accepts a set of caller-service tokens (INTERNAL_SERVICE_TOKEN_GATEWAY, INTERNAL_SERVICE_TOKEN_AGENT, ...). The consumer's require_internal_service_token dependency reads both the active env var AND the _PREV overlap variant; rotation is per token with the following steps:

  1. Set _PREV on the consumer. On every consumer instance (wallet/rolling/promotion/game/player/agent), set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)>_PREV to the current outgoing value AND keep INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> pointing at the SAME outgoing value. (At this stage both env vars hold the same value -- nothing changes for callers.) Deploy.
  2. Flip active to the new value on the consumer. Set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> to the new secret; _PREV still holds the outgoing one. Deploy. The consumer now accepts BOTH values via has_valid_per_caller_token_with_source(...) (shared rgb_contracts.infra.internal_auth).
  3. Roll the caller service to the new token. Set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> to the new secret on the producer side; deploy. Watch internal_caller_token_prev_used_total{consumer,caller} — every request still authenticating via the old key fires this counter. When the rate hits zero, every caller instance is on the new secret.
  4. Remove _PREV from the consumer. Drop INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)>_PREV from env; deploy. The counter must remain at zero post-deploy.

Rollback at any step: revert the consumer env vars and redeploy — the old token regains acceptance immediately.
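
The overlap behaviour behind these four steps can be sketched as follows. This is not the shared has_valid_per_caller_token_with_source helper, whose signature is not shown in this runbook; token_source is a hypothetical stand-in that only illustrates the active / _PREV acceptance order.

```python
import hmac
from typing import Mapping, Optional


def token_source(presented: str, caller: str, env: Mapping[str, str]) -> Optional[str]:
    """Return "active", "prev", or None for a presented per-caller token."""
    base = f"INTERNAL_SERVICE_TOKEN_{caller.upper()}"
    for source, name in (("active", base), ("prev", f"{base}_PREV")):
        expected = env.get(name, "")
        # Empty values never match; compare_digest avoids timing side channels.
        if expected and hmac.compare_digest(presented.encode(), expected.encode()):
            return source
    return None
```

A "prev" result corresponds to the situation in which internal_caller_token_prev_used_total would fire in step 3. Once _PREV is removed in step 4, the old secret returns None.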

BRAND_SIGNING_KEY rotation

Same overlap shape, applied to the wallet brand-signature verifier:

  1. Set BRAND_SIGNING_KEY_PREV on every wallet_service instance to the current outgoing key; keep BRAND_SIGNING_KEY at the same outgoing value. Deploy.
  2. Flip BRAND_SIGNING_KEY on wallet_service to the new key; _PREV still holds the outgoing one. Deploy. The verifier (verify_request_brand_signature(... signing_key, signing_key_prev, ...)) now accepts signatures produced with either key.
  3. Roll the 6 signing callers (gateway, admin_service, game_service, promotion_service, recon_service, rolling_service) to the new key; deploy. Watch wallet_brand_signature_prev_key_used_total{caller_service} — when the rate hits zero, every caller has cut over.
  4. Remove BRAND_SIGNING_KEY_PREV from wallet_service; deploy.

The verifier requires the active BRAND_SIGNING_KEY to be set: _PREV alone is treated as a misconfiguration (empty active key) and fails closed in enforce mode. This prevents leaving the old key in place indefinitely.
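
A minimal sketch of a dual-key verifier with this fail-closed rule is shown below. It assumes HMAC-SHA256 over an opaque payload; the real signed payload layout and the actual verify_request_brand_signature signature are not documented here, so sign_payload and verify_with_overlap are hypothetical names.

```python
import hashlib
import hmac


def sign_payload(key: str, payload: bytes) -> str:
    """Assumed scheme: hex-encoded HMAC-SHA256 of the payload."""
    return hmac.new(key.encode(), payload, hashlib.sha256).hexdigest()


def verify_with_overlap(
    signature: str, payload: bytes, signing_key: str, signing_key_prev: str = ""
) -> str:
    """Return "active", "prev", "misconfigured", or "invalid"."""
    if not signing_key:
        return "misconfigured"  # _PREV alone never validates: fail closed
    presented = signature.encode()
    if hmac.compare_digest(presented, sign_payload(signing_key, payload).encode()):
        return "active"
    if signing_key_prev and hmac.compare_digest(
        presented, sign_payload(signing_key_prev, payload).encode()
    ):
        return "prev"  # where a prev-key-used counter would be incremented
    return "invalid"
```

Checking the active key first keeps the "prev" outcome meaningful as a cut-over signal: it is returned only for callers that have not yet moved to the new key.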

Per-caller-service token rollout

Phase 5B introduced per-caller tokens. Each consumer service accepts both the legacy single INTERNAL_SERVICE_TOKEN AND a per-caller variant INTERNAL_SERVICE_TOKEN_<UPPER(caller)>. The rollout has three stages:

  1. Stage A (now): legacy token still active everywhere; per-caller tokens not provisioned. The internal_caller_token_legacy_total{consumer,caller} counter shows 100% legacy traffic.

  2. Stage B (transition): provision per-caller env vars on every consumer; deploy. Both tokens are accepted; per-caller takes precedence. The counter shows mixed legacy + per-caller usage.

  3. Stage C (terminal): revoke the shared INTERNAL_SERVICE_TOKEN from every consumer; deploy. From this point only per-caller tokens are accepted. The counter must remain at zero or only fire for misconfigured callers.
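
The three stages can be read as one acceptance rule evaluated against whichever env vars are provisioned. The sketch below is an illustration under that assumption; classify_token is a hypothetical name and does not reproduce the real require_internal_service_token dependency.

```python
import hmac
from typing import Mapping


def _matches(presented: str, expected: str) -> bool:
    # An unset or empty secret never matches anything.
    return bool(expected) and hmac.compare_digest(presented.encode(), expected.encode())


def classify_token(presented: str, caller: str, env: Mapping[str, str]) -> str:
    """Return "per_caller", "legacy", or "rejected" for an internal request."""
    per_caller = env.get(f"INTERNAL_SERVICE_TOKEN_{caller.upper()}", "")
    if _matches(presented, per_caller):
        return "per_caller"  # per-caller takes precedence over the shared token
    if _matches(presented, env.get("INTERNAL_SERVICE_TOKEN", "")):
        return "legacy"  # where internal_caller_token_legacy_total would increment
    return "rejected"
```

In Stage A only the shared variable exists, so every request classifies as "legacy". In Stage B both outcomes appear. In Stage C the shared variable is gone, so a caller still sending the shared token is "rejected".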

The per-caller env-var names are:

  • INTERNAL_SERVICE_TOKEN_GATEWAY (consumed by every other service when receiving from gateway)
  • INTERNAL_SERVICE_TOKEN_AGENT
  • INTERNAL_SERVICE_TOKEN_ADMIN
  • INTERNAL_SERVICE_TOKEN_RECON
  • INTERNAL_SERVICE_TOKEN_GAME
  • INTERNAL_SERVICE_TOKEN_PROMOTION
  • INTERNAL_SERVICE_TOKEN_ROLLING

Each consumer reads the relevant subset (e.g. wallet_service reads all 7 since every other service may call into wallet).

Pre-Stage-C release gate:

  • internal_caller_token_legacy_total{consumer="wallet_service"} is zero across the soak window
  • internal_caller_token_legacy_total{consumer="player_service"} is zero across the soak window
  • (repeat for every consumer)

To roll Stage C: remove the INTERNAL_SERVICE_TOKEN env var from each consumer's deploy template and roll one consumer at a time. Mid-flip abort: re-provision the env var and roll back.

The audit test servers_v2/tests/test_per_caller_header_audit.py is a regression net that asserts every internal HTTP client in the stack sends X-Caller-Service. Run it before each Stage C deploy:

cd servers_v2 && uv run pytest tests/test_per_caller_header_audit.py -v

JWT RS256 key (JWT_PRIVATE_KEY / JWT_PUBLIC_KEY) rotation

JWT signing uses RS256 with a kid header. The private key (JWT_PRIVATE_KEY) is provisioned only into player_service and agent_service; the public keys (JWT_PUBLIC_KEYS, a JSON map of kid -> public_key) go everywhere that verifies JWTs (gateway and any other verifier). Verifiers maintain an active key set keyed by kid. Rotation procedure:

  1. Generate a new RSA keypair with a new kid (e.g. kid=YYYYMMDD-N). Add it to the verifiers' active key set under its kid (env var JWT_PUBLIC_KEYS is a JSON map of kid -> public_key); deploy the verifier change first.
  2. Roll the verifier deploy across gateway and any other verifying service; confirm /health lists both old and new kid as accepted.
  3. Update JWT_PRIVATE_KEY and JWT_KID on player_service and agent_service to the new keypair; restart. New tokens issued from this point use the new kid.
  4. Overlap window: keep the old kid in the verifier's active set for at least JWT_EXPIRE_MIN + REFRESH_TOKEN_TTL + 60min buffer (default: 60min + REFRESH_TOKEN_TTL + 60min). All in-flight tokens drain by the end of this window.
  5. Remove the old kid from JWT_PUBLIC_KEYS on the verifiers; re-deploy. Any token presented after this point that uses the old kid is rejected as unknown_kid.

Emergency rotation (key compromise): skip step 4's overlap. Remove the old kid immediately on verifiers; force re-login for every session by rolling a global session-invalidation flag. Customer impact: every active session must re-authenticate. Document in #incidents.

JWT signing tests must include: (a) alg=none rejected, (b) alg=HS256 rejected when configured for RS256, (c) JWT with unknown kid rejected, (d) JWT signed with old kid accepted during overlap, (e) JWT signed with old kid rejected after overlap.
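
Test cases (a) through (c) all concern the token header, which can be gated before any signature work. The sketch below uses only the standard library and assumes JWT_PUBLIC_KEYS has already been parsed into a dict; select_public_key is a hypothetical helper and it does not verify the signature, which still requires a real JWT library with the returned key.

```python
import base64
import json
from typing import Mapping


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def select_public_key(token: str, jwt_public_keys: Mapping[str, str]) -> str:
    """Return the public key for the token's kid, or raise ValueError(reason)."""
    try:
        header = json.loads(_b64url_decode(token.split(".")[0]))
    except ValueError as exc:  # covers bad base64, bad UTF-8, and bad JSON
        raise ValueError("malformed_header") from exc
    if not isinstance(header, dict) or header.get("alg") != "RS256":
        raise ValueError("unexpected_alg")  # rejects alg=none and alg=HS256
    kid = header.get("kid")
    if not isinstance(kid, str) or kid not in jwt_public_keys:
        raise ValueError("unknown_kid")
    return jwt_public_keys[kid]
```

Cases (d) and (e) then follow from the contents of the map: while the old kid is still listed its key is returned, and once it is removed the same token fails with unknown_kid.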

Brand-signature enforcement flip (WALLET_BRAND_SIGNATURE_REQUIRE)

wallet_service accepts an X-Brand-Signature header on brand-scoped writes from internal callers. Whether unsigned writes are rejected in enforce mode is gated by the env var WALLET_BRAND_SIGNATURE_REQUIRE on wallet_service:

  • Default: "off" (transitional). Wallet still verifies signatures when present, but unsigned writes are accepted and a counter (wallet_brand_signature_missing_total{caller_service=*}) is incremented per missing-signature observation, tagged with the reporting X-Caller-Service.
  • Target: "on". Unsigned brand-scoped writes are rejected with 403 brand_signature_missing. This is the actual security control; if the var is never flipped, brand-signature enforcement is dead even after the global enforce-mode flip.

Rollout stages:

  • Stage A (current): the 6 wallet write callers (compose names: gateway, admin_service, game_service, promotion_service, recon_service, rolling_service; their caller_service metric labels are gateway, admin, game, promotion, recon, rolling respectively per the table above) all sign brand-scoped writes to wallet_service. Wallet verifies presence + signature when supplied but does not reject unsigned writes. agent_service and player_service have no outbound wallet HTTP client and therefore are not in this list (see the table-freeze section below for the broader DB-writer set, which is a distinct concept).
  • Stage B (soak): monitor wallet_brand_signature_missing_total{caller_service=*} over the full soak window (same window as the enforce-mode soak). Expect exactly zero increments. Any non-zero value identifies a misconfigured caller (missing BRAND_SIGNING_KEY, missing header injection in the HTTP client, etc.) and blocks the flip until fixed and re-soaked.
  • Stage C (flip): set WALLET_BRAND_SIGNATURE_REQUIRE=on on every wallet_service instance, restart wallet. Verify rejection works by sending a hand-crafted unsigned request to a brand-scoped endpoint and confirming a 403 response.

Rollback: set WALLET_BRAND_SIGNATURE_REQUIRE=off, restart wallet. Unsigned writes are accepted again immediately on restart. Use only if a real production caller turns out to be missing the header (and file a P0 to fix the caller before re-attempting the flip).
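
The effect of the flag on an unsigned write can be sketched as a presence check. This covers only the missing-header path and assumes wallet_service is already in enforce mode; incrementing the missing counter in both flag states is also an assumption, and check_signature_presence is a hypothetical name.

```python
from collections import Counter
from typing import Mapping, Optional, Tuple


def check_signature_presence(
    headers: Mapping[str, str], caller_service: str, require: str, counters: Counter
) -> Optional[Tuple[int, str]]:
    """Return a (status, error_code) rejection, or None to continue processing."""
    if any(name.lower() == "x-brand-signature" for name in headers):
        return None  # present: signature verification happens in a later step
    # Assumption: the counter is recorded whether or not the write is rejected.
    counters[("wallet_brand_signature_missing_total", caller_service)] += 1
    if require == "on":
        return 403, "brand_signature_missing"
    return None  # "off": the unsigned write is accepted during the transition
```

With require set to "off" the counter is the only visible effect, which is why Stage B watches it for a full soak window before Stage C turns the same condition into a 403.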

Secret storage and provisioning

Where each shared secret lives:

  • JWT_PRIVATE_KEY, JWT_PUBLIC_KEYS (a JSON map of kid -> public_key), JWT_KID: provisioned via the secrets manager (Vault path secret/multi-brand/jwt/<env>); read access scoped to player_service and agent_service deploy roles for the private key, all-services for public keys.
  • BRAND_SIGNING_KEY: provisioned via the secrets manager (Vault path secret/multi-brand/brand-signing/<env>); read access scoped to gateway, agent_service, and wallet_service deploy roles.
  • INTERNAL_SERVICE_TOKEN_*: one path per consumer-caller pair (secret/multi-brand/internal-tokens/<consumer>/<caller>/<env>).
  • Production deploys must verify these are present at boot via a startup probe; missing secret fails the boot, not silent fallback.
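
A startup probe of the kind described in the last bullet could look like the sketch below. The list of required names per service is an assumption to be supplied by each service; missing_secrets and startup_probe are hypothetical names.

```python
import os
from typing import Iterable, List, Mapping


def missing_secrets(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the required names that are unset or blank, in the given order."""
    return [name for name in required if not env.get(name, "").strip()]


def startup_probe(required: Iterable[str], env: Mapping[str, str] = os.environ) -> None:
    """Abort the boot when any required secret is missing; never fall back silently."""
    missing = missing_secrets(required, env)
    if missing:
        raise SystemExit(f"refusing to start, missing secrets: {', '.join(missing)}")
```

A verifier-only service might call startup_probe(["JWT_PUBLIC_KEYS"]) before binding its port, so that a missing secret shows up as a failed deploy rather than as runtime authentication errors.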

Application-level table-freeze flag

Several migration steps (0027 PK rewrites, partial-unique-index rebuilds) and Production Rollback B require pausing application writes to a specific table family while the migration / repair runs. The mechanism is a Redis-backed flag set:

  • Key pattern: table_freeze:{table_name} (string, value 1 when frozen, key absent when unfrozen).
  • Every writer (wallet_service, player_service, agent_service, etc.) checks the flag before any INSERT/UPDATE on a frozen table and returns the documented 503 service_paused envelope to the caller when the flag is set. Reads are not blocked.
  • Set: redis-cli SET table_freeze:wallet_topology 1 EX 1800 (30 min TTL safety net; the TTL prevents a forgotten flag from permanently locking writes).
  • Clear: redis-cli DEL table_freeze:wallet_topology.
  • All set/clear operations are logged in #incidents with the operator identity, table name, and reason.
  • Writers cache the flag check for 1 second to avoid Redis hot-pathing; this means the freeze takes effect within 1 second of the SET.
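The writer-side check described above can be sketched roughly as follows. This is a minimal illustration, not the services' actual code: the class name TableFreezeCheck, the injected key_exists callable, and the injectable clock are all assumptions made so the sketch runs without a Redis client. In a real writer, key_exists would wrap something like a Redis EXISTS call.

```python
import time
from typing import Callable, Dict, Tuple


class TableFreezeCheck:
    """Hypothetical helper: caches the table_freeze:{table_name} lookup briefly."""

    def __init__(
        self,
        key_exists: Callable[[str], bool],  # assumption: wraps a Redis EXISTS call
        cache_seconds: float = 1.0,         # the 1-second cache from the runbook
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._key_exists = key_exists
        self._cache_seconds = cache_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bool]] = {}

    def is_frozen(self, table_name: str) -> bool:
        """Return True when writes to table_name should get 503 service_paused."""
        now = self._clock()
        cached = self._cache.get(table_name)
        if cached is not None and now - cached[0] < self._cache_seconds:
            return cached[1]
        # Key present means frozen; key absent means unfrozen.
        frozen = self._key_exists(f"table_freeze:{table_name}")
        self._cache[table_name] = (now, frozen)
        return frozen
```

Because the cached answer is reused for up to one second, a writer can keep accepting writes for that long after the SET, which matches the "within 1 second" bound above.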

Admin VPN ingress source of truth

The IP allow-list for admin_service ingress is maintained in the infrastructure repo (infra/lb/admin-allowlist.tf). Onboarding a new operator IP requires a PR to that repo plus security-team approval; the PR triggers an LB config push. Off-boarding requires the same PR flow. The source of truth is the Terraform module; ad-hoc LB UI changes are forbidden and surfaced by drift-detection on the infra repo.

Pen-test release gate

A second brand cannot be enabled in production until an external penetration test of the multi-brand isolation surface has completed and findings are remediated. Scope:

  • gateway brand resolution (Host/Origin spoof, JWT replay, JWT forgery, X-Brand-Id injection)
  • wallet_service cross-brand command rejection under observe and enforce
  • admin_service write surface (network isolation, X-Operator-Id capture, audit-row tamper)
  • game callback reverse-parse (account namespace collision, prefix attacks)
  • MULTI_BRAND_ENFORCEMENT downgrade (operator-side abuse, audit trail)

Diagnosis Playbook

When any of the multi-brand alerts fires, use this section before considering rollback.

For the brand-signature, brand-check, agent-allow-list, game-callback-resolve, and admin-audit counters that lack a section here, see the dedicated operator-facing entry points in diagnosis-playbook.md. That file covers:

  • wallet_brand_signature_failed_total, wallet_brand_signature_replay_blocked_total, wallet_cross_brand_rejected_total, brand_resolution_failed_total{service="gateway"}, agent_brand_not_allowed_total, game_callback_brand_unresolved_total, admin_operator_id_mismatch_total, and admin_audit_write_failed_total
  • the brand-signature replay-cache + default-brand-cache counters (wallet_brand_signature_replay_redis_outage_total and wallet_topology_default_brand_unprimed_total, both delivered by P2-δ)
  • the per-service *_cross_brand_rejected_total family for agent_service / game_service / promotion_service / recon_service / rolling_service
  • the post-Tier-1/Tier-4 hardening counters (gateway_jwt_session_missing_total, gateway_jwt_unknown_kid_total, pii_aes_unconfigured_total, plisio_callback_replay_blocked_total, plisio_callback_replay_redis_outage_total, game_callback_verification_bypassed_total, internal_caller_token_legacy_rejected_total, agent_balance_legacy_write_total, admin_ws_legacy_query_token_total, admin_ws_rate_limited_total, recovery_sms_delivery_failed_total, recovery_email_unprovisioned_total, recovery_rate_limit_redis_outage_total)

brand_resolution_failed_total{reason="jwt_domain_mismatch"}

What it means: a player or agent JWT was presented at a domain whose brand differs from the JWT's brand claim, OR the JWT lacks brand_id.

Diagnosis:

  1. Query the structured logs for the past 5 min filtered on the counter increment. Look at: request_id, jwt.brand_id, request.state.brand_id, domain.
  2. If the counts cluster on one source IP / one user-agent, it is likely a scanner or a stale mobile client; expected baseline noise.
  3. If the counts cluster on legitimate browser user-agents and one brand's domain, look for a recent domain:agent:* or domain:level:* Redis change.
  4. If JWTs lacking brand_id are still appearing more than 2 * JWT_EXPIRE_MIN after Phase 5 deploy, refresh-token rotation is minting brand-unaware tokens; check Phase 5 implementation in player_service/agent_service.
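The timing rule in step 4 can be expressed as a small check. This is a sketch under stated assumptions: the function name and the idea of passing the Phase 5 deploy timestamp explicitly are illustrative, and jwt_expire_min stands in for the JWT_EXPIRE_MIN setting (in minutes).

```python
from datetime import datetime, timedelta, timezone


def brand_unaware_jwts_still_expected(
    phase5_deployed_at: datetime, now: datetime, jwt_expire_min: int
) -> bool:
    """True while JWTs lacking brand_id can still be pre-deploy leftovers.

    After 2 * JWT_EXPIRE_MIN has elapsed since the Phase 5 deploy, any such
    token points at refresh-token rotation minting brand-unaware tokens.
    """
    grace = timedelta(minutes=2 * jwt_expire_min)
    return now <= phase5_deployed_at + grace


# Example: deploy at 12:00 UTC with a 30-minute expiry gives a 60-minute window.
deployed = datetime(2026, 4, 29, 12, 0, tzinfo=timezone.utc)
print(brand_unaware_jwts_still_expected(deployed, deployed + timedelta(minutes=45), 30))
print(brand_unaware_jwts_still_expected(deployed, deployed + timedelta(minutes=90), 30))
```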

Mitigation:

  • Baseline noise: ignore; refine the alert threshold.
  • Single domain spike: re-verify the Redis mapping for that domain; inspect the LB and any DNS changes.
  • Refresh-token issue: hot-fix player_service/agent_service to reject refreshes from brand-unaware refresh tokens; force re-login for affected sessions.
  • If the spike is correlated to a Phase 16 flip and is causing real user pain: revert the affected service to observe.

wallet_cross_brand_rejected_total{mode="enforce"}

What it means: a wallet command targeted a row whose brand_id differs from the request brand. In enforce mode this is rejected.

Diagnosis:

  1. Filter wallet_service logs for the past 5 min on event=cross_brand_rejected. Look at: command, caller_service (from the INTERNAL_SERVICE_TOKEN_* token used), request_brand, target_row_brand, request_id.
  2. If caller_service is consistent (e.g. always recon — the label value for the recon_service process; see the service-name vs caller-label table earlier in this document), that caller is not forwarding X-Brand-Id correctly OR the originating data is wrong (e.g. a deposit row carries the wrong brand_id).
  3. If caller_service varies, something at the gateway / agent_service edge is leaking the wrong brand into the X-Brand-Id propagation.
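The split between steps 2 and 3 amounts to asking whether one caller_service dominates the rejections. A minimal sketch, assuming the log lines have already been parsed into dicts with a caller_service field; the function name and the 0.9 dominance threshold are illustrative choices, not values from the alert definition.

```python
from collections import Counter
from typing import Iterable, Mapping


def classify_cross_brand_rejections(
    events: Iterable[Mapping[str, str]], dominance: float = 0.9
) -> str:
    """Label a batch of cross_brand_rejected events by caller concentration."""
    callers = Counter(event["caller_service"] for event in events)
    total = sum(callers.values())
    if total == 0:
        return "no_events"
    top_caller, top_count = callers.most_common(1)[0]
    if top_count / total >= dominance:
        # One caller dominates: suspect that caller's X-Brand-Id forwarding
        # or the originating data it submits.
        return f"single_caller:{top_caller}"
    # Rejections spread across callers: suspect the gateway / agent_service edge.
    return "varied_callers"
```

A result of single_caller:recon would point at the single-caller mitigation, while varied_callers points at the edge-bug mitigation.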

Mitigation:

  • Single-caller bug: hot-fix the caller's brand-forwarding code OR flip just that caller's MULTI_BRAND_ENFORCEMENT to observe to buy time (writes proceed using request brand; no money lands in wrong brand because wallet uses request brand authoritatively in observe).
  • Edge bug: revert gateway to observe; investigate brand resolution in the affected route.
  • If the rate is high and money flow is impacted: full revert to observe per the Phase 16 mid-flip revert procedure.

game_callback_brand_unresolved_total{provider}

What it means: a provider callback's outbound account string did not reverse-parse to a known brand. Real money may be stuck on the provider side.

Diagnosis:

  1. Filter game_service logs for the past 5 min on event=callback_brand_unresolved. Look at: provider, outbound_account_raw, request_id.
  2. The most common cause is a new brand was created in admin_service but the per-process reverse-parse cache in game_service has not refreshed yet (60s TTL or BRAND_CATALOG_CHANGED Redis pub/sub message missed).
  3. The next most common cause is a game provider sending a callback for a player that never existed (out-of-band test traffic from the provider; expected non-zero baseline that must be allow-listed).
  4. If neither, the outbound namespacing logic for that provider is broken (e.g. provider truncated the account string below brand_code length).

Mitigation:

  • Cache stale: trigger a manual BRAND_CATALOG_CHANGED publish from admin or rolling-restart game_service processes.
  • Provider noise: add an allowlist for known-bad accounts at the game callback edge.
  • Outbound namespacing bug: hot-fix the affected provider's namespacing helper; manually replay the dead-lettered callbacks (see DLQ Procedure below) once the fix lands.
  • All cases: ack/nack the original provider callback per provider policy. Failing-open (silently ack) loses the callback; failing-closed (5xx) triggers provider retries that may DoS. Default policy: 200-ack and write to DLQ; retry from DLQ after fix.

security_downgrade_total (P0 page)

What it means: a service started in MULTI_BRAND_ENFORCEMENT != 'enforce' while more than one brand is enabled. Either an operator flipped the flag back during incident response, a deploy template lost the enforce env var, or CI rolled out a service with the wrong default.

Diagnosis:

  1. Identify which service: security_downgrade_total{service} label.
  2. Query multi_brand_enforcement_mode{service} for the same service to confirm the current mode (1=observe, 0=off).
  3. Check the deploy history for the affected service in the past 24h; look for a config change that removed or modified MULTI_BRAND_ENFORCEMENT.
  4. Page the deploy lead AND the security on-call simultaneously. This is treated as a security incident until proven otherwise.

Mitigation:

  • If the downgrade is intentional (legitimate incident response that required observe): document in #incidents, ack the alert with a silence window, and plan re-enforce per the Phase 16 procedure.
  • If the downgrade is unintentional: re-set MULTI_BRAND_ENFORCEMENT=enforce in the env, restart the affected service, confirm /health and multi_brand_enforcement_mode show enforce. Investigate the deploy pipeline; consider a startup probe that aborts boot if the env var is missing post-Phase-16.
  • Either way: never silently let the downgrade persist while >1 brand is enabled.

brand_resolution_failed_total{reason="unresolved_brand"} (page in enforce)

What it means: an authenticated agent_service request reached the allow-list middleware (agent_brand_enforcement.py) but no brand_id was attached by the upstream brand-resolution middleware. Distinct from unknown_domain (HTTP host had no brand mapping at the edge): unresolved_brand fires when resolution itself failed silently and a request slipped through to authenticated routes without context.

Diagnosis:

  1. Filter agent_service logs for the past 5 min on event=agent brand allow-list check failed and reason=unresolved_brand. Look at: agent_id, path, request_id, plus the upstream Host header.
  2. Cross-check the brand-resolution middleware logs for the same request_id. Either the resolver crashed (look for a stack trace) or the LB forwarded an unmapped Host.
  3. If the spike correlates with a recent deploy of AgentBrandResolutionMiddleware or a brand-catalog change, suspect a cache miss on domain:agent:* or domain:level:*.

Mitigation:

  • Resolver bug: hot-fix; consider flipping agent_service to observe until rolled out.
  • Missing domain mapping: add the domain:agent:* Redis entry.
  • LB / DNS misconfig: route Host correctly; reject unmapped hosts at the edge instead of letting them reach agent surfaces.

brand_resolution_failed_total{reason="missing_header"} (page in enforce)

What it means: a brand-scoped internal handler received a request without X-Brand-Id. In enforce mode this is rejected.

Diagnosis:

  1. Filter logs on the affected service for the past 5 min. Look at: request_id, caller_service (from the internal token used), route.
  2. Most common cause: a recently-deployed caller service that did not propagate X-Brand-Id. Identify the caller, find the recent deploy, look at the change.
  3. Less common: an external request slipped past the gateway / agent strip and hit an internal handler directly.

Mitigation:

  • Hot-fix the caller to forward X-Brand-Id, OR temporarily flip the affected consumer to observe until the caller rolls out.
  • If external request leakage: investigate edge configuration; the X-Brand-Id strip may be misconfigured.

wallet_idempotency_brand_split_total (warning)

What it means: an inbound idempotency key matches an existing row that was recorded under a different brand. Indicates client misrouting or replay.

Diagnosis:

  1. Filter wallet_service logs on event=idempotency_brand_split. Compare request_brand vs the existing-row brand and the caller_service.
  2. If concentrated on one caller, that caller is sending the same idempotency key across brand-namespace boundaries (e.g. account-derived keys without brand prefix).

Mitigation:

  • Caller fix: prefix idempotency keys with brand_code or brand_id on the caller side.
  • Not a money-impact incident in itself; observe-mode would proceed with request brand. Track the rate.
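The caller-side fix can be as small as a key-building helper. This is a sketch only: the function name and the ":" separator are assumptions, and the real callers may already have their own key-construction conventions.

```python
def brand_scoped_idempotency_key(brand_code: str, raw_key: str) -> str:
    """Prefix an idempotency key with the brand so keys never collide across brands."""
    if not brand_code:
        # Refuse to build an unscoped key rather than fall back silently.
        raise ValueError("brand_code is required")
    return f"{brand_code}:{raw_key}"


# The same account-derived key now differs per brand.
print(brand_scoped_idempotency_key("default", "acct-1001-deposit-7"))
print(brand_scoped_idempotency_key("brand2", "acct-1001-deposit-7"))
```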

Per-brand wallet outbox publication freshness (page)

What it means: the wallet outbox publisher for one specific brand is not making forward progress (per ADR-003).

Diagnosis:

  1. Query the wallet outbox table filtered on the affected brand for the unpublished row count (SELECT count(*) FROM wallet.wallet_outbox WHERE brand_id = ? AND published_at IS NULL) and for the oldest unpublished created_at.
  2. Compare to other brands; if only one brand is stuck, look for a data condition specific to that brand (e.g. a poison-pill payload).
  3. Check Redis Stream lag on the wallet stream.

Mitigation:

  • Standard wallet outbox playbook applies; brand label scopes the investigation.
  • If a poison-pill: route the offending row to dead-letter, fix the payload, re-publish.

Per-brand request_total zero-traffic alert (warning)

What it means: a previously-active brand stopped receiving requests for >15 min. Either brand outage, domain misconfiguration, or upstream LB / DNS change.

Diagnosis:

  1. Confirm via curl that the brand's domain still resolves and the LB forwards to gateway.
  2. Check domain:agent:{host} and domain:level:{host} Redis entries for the brand's domains; ensure they still bind to the correct brand_id.
  3. Check brand-create / brand-disable audit log for any recent change to that brand's status.

Mitigation:

  • Domain misconfig: re-bind the Redis entries; confirm via a curl request.
  • Brand status accidentally disabled: re-enable through admin audit-tracked write.

wallet_cross_brand_rejected_total{mode="observe"} (soak-window warning)

What it means: a cross-brand wallet command was processed in observe mode (proceeded using request brand). A non-zero value before the Phase 16 flip indicates an internal caller is not forwarding X-Brand-Id correctly.

Diagnosis: same as the enforce-mode diagnosis above; in observe the counter increments but no end-user 4xx is generated.

Mitigation: must be driven to zero before Phase 16 hard-flip. Hot-fix the caller; do not flip enforce until the soak counter is clean.

Game callback DLQ procedure

game_service writes every brand-unresolved callback to the dead-letter table game_callback_dead_letter (alembic 0030). Each row holds the provider, the raw outbound account that failed reverse-parse, the full inbound payload (JSONB) for verbatim replay, the parse-failure reason, and the original request_id. Operator interaction is via the CLI tool servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py, which reads DATABASE_URL and GAME_SERVICE_URL from the environment (default GAME_SERVICE_URL=http://localhost:8012).

After the brand catalog is fixed, run these commands in order:

  1. List pending rows for the affected provider to scope the incident. The default --status is pending:

    python servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py \
    --list --provider mg

    Output is one line per row: #<id> provider=<p> status=<s> reason=<r> account=<a> created_at=<ts>.

  2. Inspect a candidate row in dry-run. Dry-run prints the planned POST target and the payload that would be replayed; it never mutates the row and never issues an HTTP request:

    python servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py \
    --id 42 --dry-run
  3. Replay the row. The tool POSTs payload_json back to the per-provider endpoint; on HTTP 2xx it sets replay_status='replayed', replayed_at=now(), replay_request_id=<new>. On HTTP 4xx/5xx the row stays pending and the tool exits with code 3 -- safe to retry. wallet_service idempotency on the underlying authorize/settle/rollback call ensures replays do not double-spend:

    python servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py \
    --id 42
  4. If the callback is unrecoverable (player deleted, payload corrupt, etc.), abandon the row. Abandonment marks money lost permanently and MUST carry a human operator id and a free-text reason -- both land in an admin_audit row (action='DLQ_ABANDON', resource='game_callback_dead_letter', resource_key=<row id>) that commits in the same transaction as the DLQ row flip:

    python servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py \
    --abandon --id 42 \
    --operator-id "ops:alice" \
    --reason "player guid 12345 deleted in Phase 14 cleanup"

    --operator-id and --reason are mandatory; the tool refuses to run with either missing, blank, or whitespace-only. --notes is still accepted (recorded on the DLQ row for operator-facing inspection) but is independent of the audit-bearing fields.

    If the audit-side INSERT fails (trigger fires, schema drift, audit table append-only protection rejects), the transaction rolls back: the DLQ row stays pending and the operator can retry. Exit code is 3 in that case so wrapper scripts notice the failure.

The tool refuses to act on rows that are already replayed or abandoned (exit code 3) so re-running a script against the same id is safe. DLQ persistence inside game_service is best-effort: a DB outage during the original rejection logs a WARN and falls through to the metric-only path -- it does not block the provider response.
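When several rows need replaying, a thin wrapper around the CLI can collect the exit codes so that code 3 is not missed. The sketch below uses only the flags documented above (--id, --dry-run); the wrapper function names are hypothetical, and it assumes it is run from the repository root with DATABASE_URL and GAME_SERVICE_URL already set.

```python
import subprocess
import sys
from typing import Callable, Dict, Iterable, List

TOOL = "servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py"


def replay_command(row_id: int, dry_run: bool = False) -> List[str]:
    """Build the argv for replaying (or dry-running) one DLQ row."""
    cmd = [sys.executable, TOOL, "--id", str(row_id)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_tool(cmd: List[str]) -> int:
    """Run the CLI and return its exit code without raising on failure."""
    return subprocess.run(cmd, check=False).returncode


def replay_rows(
    row_ids: Iterable[int], run: Callable[[List[str]], int] = run_tool
) -> Dict[int, int]:
    """Replay rows one at a time; returns {row_id: exit_code}."""
    return {row_id: run(replay_command(row_id)) for row_id in row_ids}
```

Any row whose exit code is 3 either stayed pending after a failed POST or was already replayed or abandoned, so it needs a manual look before the incident is closed. Abandonment is deliberately left out of the wrapper because it requires a human operator id and reason.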

Independent-transaction guarantee (post Codex-R2 fix). The DLQ INSERT is now performed by record_dead_letter on a fresh AsyncSession opened from app.state.pool, with an explicit commit() before the helper returns. This means the row survives even though the calling provider route returns its rejection envelope (e.g. _balance_resp(...), WCResponse.error(...), BTiTokenResponse(...)) and exits its async with session_maker() as session: block without committing the caller's transaction -- without this, AsyncSession's __aexit__ rollback would drop the staged DLQ row. Operators can therefore rely on this table being authoritative for replay; the only loss path remaining is the documented best-effort swallow of DB-side errors during the DLQ INSERT itself (which still emits a WARN log and the game_callback_brand_unresolved_total metric).

Rollback Steps

For non-production environments:

  1. Stop player-facing traffic to brand-aware paths.
  2. Stop workers that consume brand-aware events.
  3. Export brand, brand_config, agent_brand, player, wallet, rolling, promotion, and game tables for investigation.
  4. If destructive schema changes were applied since the last snapshot, restore from the pre-rollout snapshot.
  5. Re-bind the domain:agent:* and domain:level:* Redis values back to the single-brand shape if necessary.
  6. Restart services and workers.
  7. Re-run health checks and the local/staging smoke checks.

For production:

The following procedures are approved for production. Each must be executed only after a documented incident is opened and the deploy lead is paged.

Production rollback A -- Phase 16 enforcement regression

Symptoms: alert fires on wallet_cross_brand_rejected_total{mode="enforce"}, brand_resolution_failed_total, or game_callback_brand_unresolved_total post-flip with sustained elevated rate causing user pain.

Procedure: execute the Phase 16 mid-flip revert procedure from the plan (gateway -> agent -> recon -> player -> promotion -> rolling -> wallet, each step flips env var to observe + restarts + confirms /health shows observe). Total budget 30 min. Once the revert is complete, security_downgrade_total will fire (expected during the incident). Re-attempt the flip only after the root cause is fixed and a new soak window is observed.
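The ordering constraint in that procedure can be sketched as a loop that refuses to move on until each service reports observe. The plan document remains the authority; here flip_and_restart and health_mode are hypothetical callables standing in for whatever deploy tooling and /health probe are actually used.

```python
from typing import Callable, List

# Order taken from the procedure above.
REVERT_ORDER = ["gateway", "agent", "recon", "player", "promotion", "rolling", "wallet"]


def revert_to_observe(
    flip_and_restart: Callable[[str], None],
    health_mode: Callable[[str], str],
) -> List[str]:
    """Flip each service to observe in order, confirming /health after each step."""
    done: List[str] = []
    for service in REVERT_ORDER:
        flip_and_restart(service)  # set the env var to observe and restart
        if health_mode(service) != "observe":
            # Stop here: continuing would leave the revert in an unknown state.
            raise RuntimeError(f"{service} did not report observe; stop and escalate")
        done.append(service)
    return done
```

Stopping at the first service that fails to confirm keeps the operator's picture of which services are reverted accurate within the 30-minute budget.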

Production rollback B -- Backfill corruption (rows with wrong brand_id)

Symptoms: counts of rows with brand_id != expected discovered after 0024b backfill OR wallet_idempotency_brand_split_total non-zero after Phase 6 deploy.

Procedure:

  1. Pause writes to the affected table family via the application-level table-freeze flag.
  2. Open a separate incident; do NOT proceed without a verified correction plan.
  3. For deposit-class corruption (deposit row carries wrong brand leading to misrouted Plisio callback): run a reconciliation script that cross-references the originating request domain (logged at creation time) against the deposit row's brand_id; quarantine mismatches; manually correct after wallet team review.
  4. For player-class corruption (player_id assigned to wrong brand): the row is unrecoverable by automated means; open a customer-support ticket per affected player.
  5. Restore from PITR is the last resort and must be approved by the on-call manager + DBA + a wallet domain owner; restore loses every transaction since the PITR target. Document the cutoff window before any restore decision.

Production rollback C -- Mid-flip abort

Already covered by Phase 16 mid-flip revert above.

Production rollback D -- agent_brand seed race remediation

Symptoms: agents report that they cannot log in after the Phase 16 flip; the allow-list counter on agent_service shows allow-list-failed for valid agents.

Procedure: run the Phase 12 re-seed verification SQL (SELECT count ... LEFT JOIN agent_brand WHERE ab.agent_id IS NULL); for every missing agent, insert (agent_id, default_brand_id, 'enabled'). Confirm the agent can log in again. Root-cause why the autoseed trigger missed the agent (was the trigger removed prematurely?).

Production rollback E -- Brand-config regression

Symptoms: a brand_config write produces unexpected behavior (e.g. rolling ratio set to 0 by mistake).

Procedure: every brand_config write is audited (per spec) with the prior value. Run UPDATE brand_config SET value = '<prior_value>' WHERE brand_id = X AND key = Y; from the audit log entry. Investigate the original change-control failure separately.

General production constraints:

  • Never run destructive rollback without an approved incident plan and on-call manager sign-off.
  • Prefer traffic pause and provider callback queuing while ledger and authorization state is inspected, before any data mutation.
  • Every production rollback action is logged in #incidents and added to the post-incident report.

Smoke Checks

  • GET /health succeeds for every affected service.
  • gateway resolves a sample request against default's domain and attaches request.state.brand_id.
  • gateway rejects a sample request whose JWT brand does not match the domain brand.
  • wallet_service topology and policy read endpoints return per-brand active documents for both default and brand2.
  • a small bet on default authorizes, settles, produces ledger rows, and the rows carry brand_id = default.
  • a small bet on brand2 authorizes, settles, produces ledger rows, and the rows carry brand_id = brand2.
  • a synthetic cross-brand wallet command increments wallet_cross_brand_rejected_total and returns a stable error envelope.
  • a synthetic unparseable game callback increments game_callback_brand_unresolved_total and is rejected.
  • duplicate request IDs remain idempotent within a brand and remain rejected on payload mismatch within a brand.
  • removed back-office routes return 404: every path previously served by legacy_admin_v2.py, legacy_auth.py, legacy_agents_v2.py, legacy_agent_withdrawals_v2.py, legacy_meta_v2.py, legacy_recon.py, and legacy_web_content.py.

Delivered hardening

Items moved out of the rollout backlog because the code, migrations, counters, and tests have all landed on main. Each entry points to the canonical operator playbook for the alert it surfaces.

Recovery flow brand isolation (delivered)

Brand-aware recovery is live for both player and agent surfaces. servers_v2/player_service/app/api/routes/recovery.py and servers_v2/agent_service/app/api/routes/recovery.py ship a brand-precedence resolver that prefers token-embedded brand_id, falls back to request domain when the token lacks the claim, and hard-rejects on resolved-brand-vs-underlying-subject mismatch regardless of MULTI_BRAND_ENFORCEMENT mode. recovery_brand_mismatch surfaces via brand_resolution_failed_total{reason="recovery_brand_mismatch"}.
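The precedence rule can be pictured with a short sketch. It is not the code in recovery.py: the function and exception names are hypothetical, and treating a fully unresolved brand (no token claim and no domain mapping) as a rejection is an assumption of this sketch.

```python
from typing import Optional


class RecoveryBrandMismatch(Exception):
    """Raised when the resolved brand does not match the subject's brand."""


def resolve_recovery_brand(
    token_brand_id: Optional[int],
    domain_brand_id: Optional[int],
    subject_brand_id: int,
) -> int:
    """Prefer the token's brand_id, fall back to the request domain's brand."""
    resolved = token_brand_id if token_brand_id is not None else domain_brand_id
    # Hard-reject regardless of MULTI_BRAND_ENFORCEMENT mode.
    if resolved is None or resolved != subject_brand_id:
        raise RecoveryBrandMismatch("recovery_brand_mismatch")
    return resolved
```

Note that the token claim wins even when the request arrives on another brand's domain; the mismatch check is against the underlying subject, not against the domain.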

Audit guarantees: alembic migration 0033_recovery_audit_table.py adds recovery_audit and every observable outcome (REQUEST / VERIFY / RESET success or rejection, plus RATE_LIMITED / TOKEN_INVALID / BRAND_MISMATCH typed rejections) inserts one append-only row carrying brand_id, SHA-256 hash of contact, SHA-256 hash of source IP, and a per-action JSONB metadata blob (NEVER the raw contact, token, or password). RESET commits the audit row in the SAME transaction as the password UPDATE so an audit-side failure rolls back the recovery write.

PII / OTP-leakage hardening (Codex P1-#5, P1-#6, T1-D-C4): every SMS path now logs contact_hash=<sha256_first_8_hex> only and emits recovery_sms_delivery_failed_total{service} / recovery_email_unprovisioned_total{service} for operator visibility. RECOVERY_EXPOSE_TOKEN_FOR_TESTS=on is the only escape hatch and is boot-guarded against any production-grade RGB_ENV / APP_ENV / ENVIRONMENT label. T4-D-I5 added per-contact rate-limit fail-closed in production via recovery_rate_limit_redis_outage_total{bucket}.
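A log-safe contact hash of the shape contact_hash=<sha256_first_8_hex> could look like the sketch below. The function name is illustrative, and hashing the raw UTF-8 bytes without any normalisation or salting is an assumption; the services may normalise the contact first.

```python
import hashlib


def contact_hash(contact: str) -> str:
    """Return the first 8 hex characters of the SHA-256 of the contact."""
    return hashlib.sha256(contact.encode("utf-8")).hexdigest()[:8]


# The log line carries only the truncated hash, never the raw contact.
log_line = f"recovery sms delivery failed contact_hash={contact_hash('+15550100')}"
print(log_line)
```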

Operator entry: every recovery-related counter has a dedicated section in diagnosis-playbook.md.

Admin operator-id JWT cross-check (delivered, P1-3)

admin_service cross-checks the X-Operator-Id LB header against the verified admin JWT admin_id claim on every write route via require_operator_id in app/api/deps.py. Mismatches and legacy-token-without-claim cases hard-reject with HTTP 403 and increment admin_operator_id_mismatch_total{reason}. Operator identity is now a JWT-bound security boundary, not a network-only contract; the LB allow-list still exists as defence-in-depth.

The dedicated playbook is in diagnosis-playbook.md under admin_operator_id_mismatch_total. The release-gate alarm pages on any non-zero rate (reason=header_vs_jwt is a spoof attempt; reason=jwt_missing_claim means a legacy admin token reached a write route -- cycle the issuing service).

Audit row tamper-evidence (delivered, alembic 0032)

admin_audit is DB-level append-only as of alembic migration 0032. Two BEFORE triggers on the table unconditionally raise:

  • trg_admin_audit_no_update_delete -- BEFORE UPDATE OR DELETE, FOR EACH ROW. Catches both single-row and bulk UPDATE/DELETE FROM admin_audit statements; PG fires the trigger per row visited so the first row aborts the whole statement.
  • trg_admin_audit_no_truncate -- BEFORE TRUNCATE, FOR EACH STATEMENT. PG forbids row-level TRUNCATE triggers, so this is its own DDL.

Both triggers call rgb.admin_audit_append_only(), which raises EXCEPTION 'admin_audit is append-only; <op> not permitted on rgb.admin_audit'. The exception is intentionally loud.

admin_audit_write_failed_total{route} pages on any non-zero rate; the route handlers in admin_service/app/services/admin_audit.py emit it whenever the post-money-write admin_audit INSERT fails, which after migration 0032 means either a transient DB error or the append-only trigger firing on a misbehaving caller. See the dedicated entry in diagnosis-playbook.md.

Operator interpretation: any "missing audit row" alarm or any log line containing admin_audit is append-only means the trigger fired, which is itself a security event. The expected steady-state rate of these errors is zero. A non-zero rate means either:

  • An attacker / compromised service is attempting to delete prior audit rows -- treat as an active intrusion.
  • A migration / data-cleanup script is incorrectly trying to UPDATE or DELETE from admin_audit -- still a security event; page the platform on-call before retrying anything.

Wallet ledger + balance log tamper-evidence (delivered, alembic 0036)

wallet.wallet_ledger (per-bucket money-movement audit trail) and rgb.player_balance_log (legacy compatibility ledger) are DB-level append-only as of alembic migration 0036. The trigger posture mirrors 0032's admin_audit shape:

  • trg_wallet_ledger_no_update_delete + trg_wallet_ledger_no_truncate call wallet.wallet_ledger_append_only() and raise EXCEPTION 'wallet_ledger is append-only; <op> not permitted on wallet.wallet_ledger'.
  • trg_player_balance_log_no_update_delete + trg_player_balance_log_no_truncate call rgb.player_balance_log_append_only() and raise the analogous exception.

Each function lives in the same schema as the table it guards (wallet vs. rgb) so cross-schema permission review stays simple.

Operator interpretation: any log line containing wallet_ledger is append-only or player_balance_log is append-only is a security event with the same handling as the admin_audit is append-only case above -- page the platform on-call before retrying anything. Steady-state rate is zero.

Recovery audit table (delivered, alembic 0033)

recovery_audit is created by alembic migration 0033 and consumed by both the player and agent recovery routes. Every recovery outcome inserts a typed append-only row. See "Recovery flow brand isolation" above for the full contract; the dedicated counters (recovery_sms_delivery_failed_total, recovery_email_unprovisioned_total, recovery_rate_limit_redis_outage_total) all have entries in diagnosis-playbook.md.

Game-callback DLQ independent transaction (delivered, Codex-R2)

record_dead_letter in servers_v2/game_service/app/services/callback_brand.py opens a fresh AsyncSession from app.state.pool and commits the row inside that session before returning. The DLQ row therefore survives the provider route's rejection-and-rollback path. See the "Game callback DLQ procedure" section earlier in this runbook.

Tier 1 hardening sweep (delivered, T1-D-C1 .. T1-D-C5)

  • T1-D-C1: gateway brand-aware session key lookup. The Phase 4E dual-read fallback and its gateway_session_legacy_key_used_total metric are retired; gateway_jwt_session_missing_total remains the JWT-valid-but-session-gone signal.
  • T1-D-C2: VERIFY_CALLBACKS=true default + boot guard + game_callback_verification_bypassed_total{provider}.
  • T1-D-C3: AES PII boot guard (assert_aes_pii_keys_configured) + removal of silent plaintext fallback + pii_aes_unconfigured_total{service,op}.
  • T1-D-C4: registration + agent auth log redaction with RECOVERY_EXPOSE_TOKEN_FOR_TESTS boot-guard escape hatch.
  • T1-D-C5 / I6: Plisio HMAC constant-time compare + replay protection (plisio_callback_replay_blocked_total{status} / plisio_callback_replay_redis_outage_total{env}) + log redaction.
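For readers unfamiliar with the constant-time compare mentioned in T1-D-C5, the general technique looks like the sketch below. It is illustrative only: HMAC-SHA256 over the raw body with a hex signature is an assumption chosen for the example, not a statement of the provider's actual signing scheme, and the function name is hypothetical.

```python
import hashlib
import hmac


def signature_matches(secret: bytes, body: bytes, presented_hex: str) -> bool:
    """Compare an HMAC signature without leaking timing information."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ,
    # unlike ==, which can return early on the first mismatching byte.
    return hmac.compare_digest(expected.encode("ascii"), presented_hex.encode("utf-8"))
```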

Every counter has a per-counter playbook entry in diagnosis-playbook.md.

Tier 4 hardening sweep (delivered, T4-D-I2 .. T4-D-I7)

  • T4-D-I2: PER_CALLER_TOKEN_REQUIRED gate + internal_caller_token_legacy_rejected_total{consumer,caller}.
  • T4-D-I3: admin_audit rows on agent-mutating + sensitive player admin routes + agent_balance_legacy_write_total{change_type} for the legacy direct-write surface.
  • T4-D-I4: gateway JWT verifier hot-reload via Redis pub/sub on jwt_public_keys_changed + gateway_jwt_unknown_kid_total.
  • T4-D-I5: recovery /request per-contact bucket fail-closed in production + recovery_rate_limit_redis_outage_total{bucket}.
  • T4-D-I7: admin WebSocket token via Sec-WebSocket-Protocol subprotocol + per-IP rate limit + JWT-exp lifetime gate + admin_ws_legacy_query_token_total / admin_ws_rate_limited_total{reason}.

Every counter has a per-counter playbook entry in diagnosis-playbook.md.

Tier 6 hardening sweep (delivered, T6-B1 .. T6-E4)

  • T6-B1: admin_service migrated off legacy HS256 to RS256 admin JWTs; JWT_PRIVATE_KEY, JWT_PUBLIC_KEYS (JSON {kid: pem}), and JWT_KID are now required env vars. T7-B4 follow-up adds the assert_jwt_public_keys_configured boot guard so production refuses to boot when JWT_PUBLIC_KEYS is empty (otherwise decode_admin_token would fall back to the in-process test-keypair singleton and silently accept forged tokens).
  • T6-B2: brand_id propagation from admin → wallet / rolling / promotion / recon. Admin's outbound clients now stamp X-Brand-Id + X-Brand-Signature so downstream brand-isolation checks see the right brand.
  • T6-C1: 5 missed admin_audit-emitting agent routes covered (parity with T4-D-I3 surface). Audit is written AFTER the money / state mutation succeeds; failure surfaces via admin_audit_write_failed_total{route}.
  • T6-C2: agent_brand seeded explicitly on create_agent (alembic 0028 dropped the autoseed trigger that previously did this in the DB). T7-A3 closed the follow-on agent_domain brand_id gap introduced in this same change.
  • T6-C3: append-only triggers extended to wallet_ledger + player_balance_log via alembic 0036; the operator-facing entry is "Wallet ledger + balance log tamper-evidence" above.
  • T6-D1: gateway JWT verifier hot-reload now refreshes from Settings (not just env) on the jwt_public_keys_changed Redis channel so a key rotation actually takes effect without a gateway restart.
  • T6-D2: gateway per-IP rate limiter fails CLOSED on Redis outage in production-grade runtimes (HTTP 503 + Retry-After); dev/test remains fail-open. The new gateway_rate_limit_redis_outage_total{env, reason} counter has a dedicated playbook entry in diagnosis-playbook.md.
  • T6-E1: BrandContext now rejects X-Brand-Id <= 0 (and any non-numeric value) as missing. Callers that need to operate without a brand must omit the header entirely; sending X-Brand-Id: 0 no longer silently creates a "brand 0" write.
  • T6-E2: strict brand_id resolver in admin paths (adjust_player + player_finance) -- int(brand_id or 0) removed; missing brand surfaces a clear envelope to the operator instead of a split-brain wallet+audit row. T7-B1 hoisted the helper into app/services/brand_helpers.py so both files share it.
  • T6-E4: BRAND_AWARE_PUBLISHING_BEGINS sentinel removed from wallet + player outbox pollers. T7-B2 finished the cleanup -- agent + rolling pollers no longer emit either, the rgb_contracts.events.base constant + helper are dropped, and consumers compare against an inline literal so the defensive skip survives without a contracts-side dependency.
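The T6-E1 rule above can be sketched as a small header parser. This is a minimal illustration, not the actual BrandContext code: parse_brand_id is a hypothetical name, and tolerating surrounding whitespace is an assumption.

```python
from typing import Optional


def parse_brand_id(raw: Optional[str]) -> Optional[int]:
    """Return a positive brand id, or None when the header counts as missing.

    Hypothetical helper: mirrors the T6-E1 rule that X-Brand-Id <= 0 and any
    non-numeric value are treated exactly like an absent header.
    """
    if raw is None:
        return None
    value = raw.strip()
    # ASCII digits only: rejects "", "-1", "1.5", "abc" and non-ASCII digits.
    if not (value.isascii() and value.isdigit()):
        return None
    brand_id = int(value)
    # "0" parses cleanly but must never produce a "brand 0" write.
    return brand_id if brand_id > 0 else None
```

A caller that receives None would treat the request as brand-less, which is why callers that genuinely have no brand should omit the header rather than send 0.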

Tier 8 hardening sweep (delivered, T8-A .. T8-D)

  • T8-A: comprehensive grep sweep closed every brand_id INSERT gap that the prior iterations missed. Reach was wider than T6/T7 (raw text() + SQLAlchemy insert(<Model>) + dotted/relative table refs); the fixes landed in agent_service legacy + create_agent paths (commit 7b2e58e8) and a follow-on grep sweep (commit 03748bf8).
  • T8-B: eliminated the silent brand_id=0 fallback in game_service callbacks. The fallback used to swallow callbacks whose namespaced account did not reverse-parse, persisting a NULL-branded bet row; callbacks are now hard-rejected, DLQ'd, and counted under game_callback_brand_unresolved_total{provider, reason="raw_brand_unresolved"}. See the diagnosis playbook entry for the new label.
  • T8-C: Hotfix — admin deps were passing only secret_key (not the full Settings) to decode_admin_token, breaking the RS256 resolver and returning 401 on every production admin call once the T6-B1 RS256 migration shipped. The two-line fix at admin_service/app/api/deps.py is committed at fed36449. Test fixtures were already passing the full Settings, which is why this surfaced in production rather than CI; T8-C also adds a regression test that uses a Settings-less call to catch the pattern in CI.
  • T8-D: admin BrandResolutionMiddleware observability + skip-on-liveness -- the silent Redis exception path in admin's brand resolver now records brand_resolution_failed_total{service="admin_service", reason="redis_error"}, and /health / /healthz / /docs / /openapi.json / /redoc skip the Redis hop entirely so a tight probe schedule cannot amplify a Redis brownout into observable route latency.
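The T8-D behaviour can be illustrated with a framework-free sketch. Everything except the path list and the counter labels is an assumption: resolve_brand, the lookup callable, the in-process Counter standing in for the Prometheus counter, and returning None after a Redis error are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional

# Paths that T8-D says skip the Redis hop entirely.
LIVENESS_PATHS = frozenset(
    {"/health", "/healthz", "/docs", "/openapi.json", "/redoc"}
)

# Stand-in for brand_resolution_failed_total{service, reason}.
brand_resolution_failed_total: Counter = Counter()


def resolve_brand(
    path: str, host: str, lookup: Callable[[str], Optional[int]]
) -> Optional[int]:
    """Resolve a brand for a request, skipping probes and counting Redis errors."""
    if path in LIVENESS_PATHS:
        # Probes never touch Redis, so a brownout cannot slow them down.
        return None
    try:
        return lookup(host)
    except ConnectionError:
        # The failure is recorded instead of being swallowed silently.
        brand_resolution_failed_total[("admin_service", "redis_error")] += 1
        return None
```

With this shape, a probe hitting /healthz during a Redis outage neither waits on Redis nor inflates the failure counter; only real routes do.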

Tier 9 brand_id INSERT sweep (delivered, T9-A .. T9-G)

T9 is the last-mile audit on every brand-scoped INSERT in both the v2 service tree and the legacy servers/ tree. The companion audit baseline 2026-04-28-brand-id-insert-audit.md captures the classifier inputs from migration 0024a plus the brand-nullable-by-design exemptions from 0037 so that re-running the documented audit commands any time the codebase changes will reproduce the steady-state result RAW FIX=0, FIX-TEST=0, SA FIX=0, SA FIX-TEST=0.

  • T9-A: the audit baseline -- exhaustive grep + classifier results were captured in docs/runbooks/multi-brand/2026-04-28-brand-id-insert-audit.md. Initial deltas: RAW FIX=55, SA FIX=3, FIX-TEST=6.
  • T9-B / T9-B-cleanup: admin_service back-office CRUD INSERTs closed across promotion.py, player.py, player_finance.py, agent.py, banking.py, finance_config.py, and system.py using the new shared app/api/brand_ctx.py helper (resolve_brand_id + fail-loud brand_required_envelope, status=54). Test fixture in test_money_admin_audit.py now installs BrandResolutionMiddleware and defaults X-Brand-Id=1 in the shared auth header helper.
  • T9-C: per-brand iteration in 4 admin scheduled tasks (stat.py::player_stat_acc / agent_stat_day, tag.py::_handle_tagging, expire.py::expire_coupons, check_expired.py::check_deposit / check_coupon). brand_id is sourced directly from the per-row source (p.brand_id for stats, (SELECT brand_id FROM player WHERE guid = :pid) subquery for tags, RETURNING brand_id from the UPDATE for expirations).
  • T9-D: agent_service legacy /coupon/grant + /coupon/recycle (3 INSERT sites in legacy_routes.py) and the local-dev bootstrap_local_data.py (4 sites: bank, bank_card_group, bank_card, agent_setting -- hard-coded brand=1 with dev-only rationale comments).
  • T9-E: relaxed shooter_sms.brand_id to nullable via alembic 0037 (the recon ingest path receives raw SMS payloads BEFORE any brand correlation is possible; brand_id is rewritten at match-time from the matched player_deposit). A new fail-loud helper recon_store.assert_sms_brand_after_match raises if the match-time UPDATE fails to land brand_id. Closed the 3 SA FIX sites (wallet_service::_log, player_service::PlayerLogin, rolling_service::PlayerRollingLog) by sourcing brand_id from request brand with player/rolling fallback. Closed 2 promo config sites (lossback_config, rebate_config) with the status=54 fail-loud envelope.
  • T9-F: agent coupon cross-brand attribution -- both coupon.py::grant_coupon and recycle_coupon now hard-reject the request with a per-player brand check (token.brand_id != player.brand_id) BEFORE any UPDATE / INSERT lands. The agent aggregate is brand-global; the signed token is the request's authority for brand scope. Regression tests pin the refusal so a future regression surfaces at unit-test time.
  • T9-G: docs sync (this entry) + classifier hardening so docstring / regex-literal mentions of INSERT INTO ... stop being mis-classified as production sites. The classifier's new BRAND_NULLABLE_BY_DESIGN set documents the 0037 exception so re-running it never re-flags shooter_sms.
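The T9-F refusal can be sketched as a guard that runs before any write. The class and function names below are hypothetical stand-ins, not the contents of coupon.py; only the comparison token.brand_id != player.brand_id comes from the entry above.

```python
from dataclasses import dataclass
from typing import List, Tuple


class CrossBrandCouponError(Exception):
    """Raised when the signed token's brand and the player's brand disagree."""


@dataclass(frozen=True)
class AgentToken:
    agent_id: int
    brand_id: int  # the request's authority for brand scope


@dataclass(frozen=True)
class Player:
    guid: int
    brand_id: int


def assert_same_brand(token: AgentToken, player: Player) -> None:
    if token.brand_id != player.brand_id:
        raise CrossBrandCouponError(
            f"token brand {token.brand_id} != player brand {player.brand_id}"
        )


def grant_coupon(
    token: AgentToken, player: Player, writes: List[Tuple[int, int]]
) -> None:
    # The brand check runs first, so a refusal leaves no partial write behind.
    assert_same_brand(token, player)
    writes.append((player.guid, token.agent_id))
```

Placing the check ahead of the first write is the point of the pattern: a rejected request is observable only as an error, never as a half-applied grant.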

Tier 14 brand_id RW (SELECT/UPDATE/DELETE) sweep (delivered, T14-A .. T14-C)

T14 is the symmetric cycle-closing batch for the read/update/delete half of the brand-leak surface. Every T6-T13 cross-review only audited INSERT statements; Codex's T14 review surfaced 8 explicit RW leaks in admin SELECT/UPDATE/DELETE statements that no prior classifier had ever looked at. T14 ships the matching audit infrastructure, closes every Codex-flagged leak, and gates Phase 16 on the new classifier reporting RW FIX=0.

  • T14-A: the audit baseline -- tools/audit/brand_id_rw_classifier.py (sibling of the INSERT classifier; conservative bias, ties go to FIX) plus the bounded universe doc 2026-04-29-brand-id-rw-audit.md. The classifier reads the same 122-table inventory from migration 0024a, supports the # brand-global: <reason> opt-out marker, and allow-lists admin_stat_acc / admin_stat_day (truly brand-global aggregates per data-ownership.md). Initial deltas: RW FIX=350 (235 SELECT + 88 UPDATE + 27 DELETE), FIX-TEST=3.
  • T14-B: closed the 10 explicit Codex-T14 sites (every active-exploit vector in admin_service back-office routes): agree_agent_withdrawal + sibling decline_ / pay_ / list_agent_withdrawals / agent_withdrawal_stats in agent_finance.py; the 13 sibling stats endpoints in stats.py (provider_stat_day, player_stat_day, agent_stat_day, promotion_stat_day, usdt_stat_day, player_provider_stat_day, rebate_stat_day, player_stat_acc, agent_stat_acc, plus the stats_top.online_count legacy SELECT COUNT(*) FROM player); set_primary_level_domain (4 SQL statements all needed brand_id pinning) + registration_list in player.py; list_rolling in player_finance.py; list_promotions in promotion.py (Codex's 9d981cc9 fixed coupons/list but missed promotions/list); plus the middleware/brand.py docstring fix to remove the stale "fall back to body.brand_ids as before" language. stats.py::stats_player_per_provider also got the table-name fix from player_provider_stat to player_provider_stat_day (the original SQL was a runtime-500 bomb). New regression suite servers_v2/admin_service/tests/test_admin_rw_brand_scope_t14.py (12 tests) pins the same-brand SQL shape, the missing-brand-context fail-loud, and the cross-brand isolation contract per route.
  • T14-C: docs sync (this entry) + Phase 16 gate. The RW classifier is now wired into .github/workflows/brand-id-audit.yml alongside the INSERT classifier (both must report FIX=0 for the workflow to pass) and into tools/multi_brand_backfill/phase16_dryrun.py as a hard precondition. The Makefile exposes make audit-brand-id-rw as the operator-facing wrapper.

T15 supersedes the T14-D/E/F deferred-backlog plan. The 316 RW-FIX sites the T14-A baseline catalogued were swept end-to-end in Tier 15 (see the Tier 15 section below). make audit-brand-id-rw reports FIX=0 as of T15-E (commit d882362a) and stays at 0 through T16-A/B and the T16-C classifier upgrade.

Tier 15 brand_id RW exhaustive sweep (delivered, T15-A .. T15-E)

T15 closes the entire deferred RW-FIX backlog from T14. Each batch swept one chunk of the 316-site list, brand-scoped every SELECT / UPDATE / DELETE that touched a brand-scoped table, and added defence-in-depth predicates so the SQL itself refuses to scope-cross even when the authz layer is bypassed.

  • T15-A (005719c3) -- wallet_service SELECT/UPDATE/DELETE: every read of player, coin_transaction, player_deposit, player_withdraw, player_balance_log, bank, player_usdt, player_rolling, etc. now carries either AND brand_id = :brand_id or a JOIN-on-parent-brand predicate. The _record_rebate_transfer_stat helper was rewritten so the cross-brand player_id collision can no longer leak another brand's account / agent_id / brand_id triple.
  • T15-B (e50c8c99) — promotion_service + rolling_service: brand-scoped every event/coupon/promotion/rolling read and write the two services issue. Includes the rolling FIFO progress writes (which previously ran as guid-only UPDATEs).
  • T15-C (ce8504ae) — player_service + agent_service + game_service + recon_service: brand-scoped the registration / login lookups, agent-tree reads, supplier callbacks, and the reconciliation cross-checks.
  • T15-D (684f268e) — admin_service money + finance + banking back-office routes: every player_deposit / player_withdraw / player_balance_log / bank / bank_card / level-rebate / coupon-grant SQL the back-office issues now carries the operator brand_id. _resolve_strict_brand_id is the canonical resolver for "operator OR row brand_id" decisions.
  • T15-E (d882362a) — admin_service residual RW + the entire FIX-TEST tail: the last admin SQL queries that lacked a brand predicate, plus every remaining test-fixture INSERT/UPDATE that was still issuing brand-less queries against the post-0024c schema. After T15-E the audit reports RW FIX=0 / FIX-TEST=0.
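The defence-in-depth predicate can be demonstrated with an in-memory database. This sketch uses sqlite3 as a stand-in for the production database, and the three-column table layout is a hypothetical simplification; it only shows why AND brand_id = :brand_id matters when the same player_id exists under two brands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema for illustration only.
conn.execute(
    "CREATE TABLE player_balance_log "
    "(player_id INTEGER, brand_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO player_balance_log VALUES (?, ?, ?)",
    [
        (1001, 1, 50),   # player 1001 under brand 1
        (1001, 2, 900),  # colliding player_id under brand 2
        (1002, 1, 10),
    ],
)


def balance_log_for(db: sqlite3.Connection, player_id: int, brand_id: int) -> list:
    """Read amounts with the brand predicate carried unconditionally."""
    rows = db.execute(
        "SELECT amount FROM player_balance_log "
        "WHERE player_id = :player_id AND brand_id = :brand_id "
        "ORDER BY amount",
        {"player_id": player_id, "brand_id": brand_id},
    ).fetchall()
    return [row[0] for row in rows]
```

A guid-only query for player 1001 would return both rows; with the predicate in the SQL itself, the brand 2 row stays invisible to a brand 1 caller even if an upstream authorization check were skipped.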

Tier 16 CR9 closure + classifier dynamic-SQL upgrade (delivered, T16-A .. T16-D)

T16 is the post-T15 cross-review batch. CR9 (the dedicated admin_service review with manual auditing of the back-office RW contract) surfaced 8 findings the static classifier could not catch — most notably the _query_log_table helper in admin_service/app/api/routes/logs.py whose SQL was built dynamically (WHERE 1=1 builder + f-string template) and so no static line carried a literal brand_id token to be inspected.

  • T16-A (a72ed1c8) — admin_service Critical + Important leaks closed: logs.py::_query_log_table made brand-required and the 4 real-table log endpoints (player_balance / agent_balance / player_login / agent_login) refuse without X-Brand-Id; agent.py::list_agents switched to JOIN agent_brand on the operator brand (per ADR-009); agent.py::create_agent enforces body.brand_ids ⊆ operator brands; player.py tag CRUD pinned to brand; finance_config.py upserts validate the level row belongs to the operator brand; system.py::update_global_var / update_settings / update_maintenance gated to super_admin via require_super_admin with a new admin_platform_global_access_total counter; player_finance.py::force_complete_rolling migrated to the canonical resolve_brand_id + brand_required_envelope + _resolve_strict_brand_id pattern; deps.py super-admin role set widened to {super_admin, super} for the legacy bo/admin token. Test count: 217 → 238 (21 new T16-A regressions).
  • T16-B (0e2f6fa8) -- wallet_service fail-loud brand context. Every if brand_id is not None: <append predicate> conditional in queries.py, wallet.py, and services/deposit_policy.py was converted to: at the route boundary refuse the call with Brand context required for <route>; in the helper require brand_id: int (no Optional); in the SQL unconditionally carry AND brand_id = :brand_id. 13 query routes + the /deposit pending+bank lookups + the validate_deposit_bonus_event helper + the _record_rebate_transfer_stat SELECT were all converted. Plisio _resolve_manual_bonus_terms now returns zero-bonus when the deposit row is a legacy pre-backfill record without brand_id instead of issuing a brand-less event lookup.
  • T16-C (934d47f4) -- RW classifier dynamic-SQL upgrade. New FIX-DYNAMIC class detects the two construction patterns the static classifier missed:
    1. WHERE 1=1 builders that append predicates via conditions.append(...), where the brand_id idiom may live in the helper that joins the conditions list, far from the matched text(f"...") call;
    2. f-string SQL templates f"SELECT ... FROM {table} WHERE {where} ..." where both the table and the WHERE clause are templated.
    FIX-DYNAMIC has the same exit-code semantics as FIX (non-zero when count > 0), so any dynamic-builder regression breaks CI. .venv/lib/... / site-packages/... paths are now treated as OOS-LEGACY so SQLAlchemy core internals don't surface in the new pass. Final classifier output: FIX=0 + FIX-DYNAMIC=0 across the repo.
  • T16-D (this batch) — docs sync. Bumps the LVC across multi-brand-isolation-rollout.md, 2026-04-29-brand-id-rw-audit.md, 2026-04-28-brand-id-insert-audit.md, phase-16-hard-flip-checklist.md, and diagnosis-playbook.md; documents the post-T15 RW headline (FIX=0 / OK=219+97+24 / OOS-INTENTIONAL=91+18+5); records the 119 OOS-INTENTIONAL breakdown by category (resolver helpers, scheduled tasks, brand-global tables, cross-brand admin views); calls out the 2 known false-positive FIX-TEST entries in test_agent_integration.py (assertion strings that match the production INSERT shape, not actual writes); adds the make audit-brand-id-inserts && make audit-brand-id-rw gate as a Phase-16 hard-flip precondition; records the dynamic-SQL classifier upgrade in the admin_service service doc.
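To make the two T16-C construction patterns concrete, here is a toy heuristic. It is not the code in tools/audit/brand_id_rw_classifier.py: the regexes, the classify and exit_code names, and the simplifying assumption that every table is brand-scoped are all illustrative.

```python
import re
from typing import Iterable

# Pattern 1: a "WHERE 1=1" builder that gets predicates appended later.
WHERE_ONE_EQ_ONE = re.compile(r"\bWHERE\s+1\s*=\s*1\b", re.IGNORECASE)
# Pattern 2: an f-string where both the table and the WHERE clause are templated.
TEMPLATED_FROM_WHERE = re.compile(
    r"\bFROM\s+\{\w+\}.*?\bWHERE\s+\{\w+\}", re.IGNORECASE | re.DOTALL
)


def classify(function_source: str) -> str:
    """Label one function body (toy rule: any brand_id mention counts as OK)."""
    if "brand_id" in function_source:
        return "OK"
    if WHERE_ONE_EQ_ONE.search(function_source) or TEMPLATED_FROM_WHERE.search(
        function_source
    ):
        return "FIX-DYNAMIC"
    # Static, brand-less SQL: conservative bias, treat it as a FIX.
    return "FIX"


def exit_code(labels: Iterable[str]) -> int:
    """Non-zero when any FIX or FIX-DYNAMIC label is present."""
    return 1 if any(label in {"FIX", "FIX-DYNAMIC"} for label in labels) else 0
```

Scanning the whole function body rather than the single text(...) line is what lets a heuristic like this see a brand_id predicate that is appended to the conditions list several lines away from the SQL template.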

Known follow-ups

These items were identified during the cross-review but are intentionally out of scope of the in-progress rollout work. Each is tracked here so the spec coverage is honest and the next iteration can pick them up directly.

Future audit-row hardening

admin_audit and recovery_audit are append-only at the DB level via 0032 / 0033. The remaining hardening still on the table (not blocking Phase 16):

  • Restrict the application role to INSERT-only on admin_audit via a separate DB role (defence-in-depth on top of the trigger).
  • Hash-chain rows (prev_hash, row_hash) so any out-of-band catalog manipulation that bypasses the triggers is detectable on read.
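The hash-chain idea can be sketched in a few lines. This is one possible construction, not a committed design: SHA-256, canonical JSON serialisation, the all-zero genesis value, and the function names are assumptions; only the prev_hash and row_hash column names come from the item above.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

GENESIS_HASH = "0" * 64  # assumed prev_hash for the first row


def compute_row_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_row(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    prev_hash = chain[-1]["row_hash"] if chain else GENESIS_HASH
    chain.append(
        {
            "payload": payload,
            "prev_hash": prev_hash,
            "row_hash": compute_row_hash(prev_hash, payload),
        }
    )


def first_broken_row(chain: List[Dict[str, Any]]) -> Optional[int]:
    """Return the index of the first row that fails verification, else None."""
    expected_prev = GENESIS_HASH
    for index, row in enumerate(chain):
        if row["prev_hash"] != expected_prev:
            return index  # a row was removed or reordered before this one
        if row["row_hash"] != compute_row_hash(row["prev_hash"], row["payload"]):
            return index  # this row's payload was edited after the fact
        expected_prev = row["row_hash"]
    return None
```

Because each row commits to its predecessor's hash, an edit or deletion that bypasses the triggers shows up at read time as a break at a specific index.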

Operator audit trail for MULTI_BRAND_ENFORCEMENT flips

The cluster-wide MULTI_BRAND_ENFORCEMENT flip is the single most security-impactful operator action on the platform. As of the P1-5 hardening, flip_enforcement.py requires two new arguments:

  • --operator-id -- identity of the operator running the flip (no default; suggested format: email or LDAP id). Recorded in admin_audit.operator_id.
  • --reason -- human-readable justification (no default; suggested format: ticket id plus a short sentence). Recorded in admin_audit.new_value.reason.

Every invocation of the script (forward, rollback, dry-run) writes exactly one admin_audit row with action='ENFORCEMENT_MODE_FLIP', resource='MULTI_BRAND_ENFORCEMENT', resource_key=<env>, and new_value carrying the env, direction, target_mode, services and reason. The script reuses the same INSERT shape as servers_v2/admin_service/app/services/admin_audit.py.

The --no-audit escape hatch is reserved for offline rehearsals where no database is reachable; using it during a real flip is a process violation. If DATABASE_URL is unreachable and --no-audit is not passed, the script exits 3 (audit row write failed) rather than printing the plan and silently succeeding -- the audit trail is the whole point of wiring this in.
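The audit contract described above can be sketched as two small functions. This is not the code in flip_enforcement.py: the function names, the write_row callable, and the choice of OSError as the "database unreachable" signal are assumptions; the row fields and the exit code 3 are taken from the text.

```python
from typing import Any, Callable, Dict, Sequence

EXIT_OK = 0
EXIT_AUDIT_WRITE_FAILED = 3  # exit code named in the runbook


def build_flip_audit_row(
    operator_id: str,
    reason: str,
    env: str,
    direction: str,
    target_mode: str,
    services: Sequence[str],
) -> Dict[str, Any]:
    """Build the single admin_audit row written for one invocation."""
    if not operator_id.strip() or not reason.strip():
        raise ValueError("operator_id and reason are required; neither has a default")
    return {
        "action": "ENFORCEMENT_MODE_FLIP",
        "resource": "MULTI_BRAND_ENFORCEMENT",
        "resource_key": env,
        "operator_id": operator_id,
        "new_value": {
            "env": env,
            "direction": direction,
            "target_mode": target_mode,
            "services": list(services),
            "reason": reason,
        },
    }


def record_flip(
    row: Dict[str, Any],
    write_row: Callable[[Dict[str, Any]], None],
    no_audit: bool = False,
) -> int:
    """Write the audit row, failing loud when the database is unreachable."""
    if no_audit:
        return EXIT_OK  # offline rehearsal only
    try:
        write_row(row)
    except OSError:  # ConnectionError is a subclass of OSError
        return EXIT_AUDIT_WRITE_FAILED
    return EXIT_OK
```

The important property is the ordering of outcomes: an unreachable database without the escape hatch yields exit code 3, never a printed plan followed by a silent success.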

Auditing the audit trail: query SELECT * FROM admin_audit WHERE action = 'ENFORCEMENT_MODE_FLIP' ORDER BY created_at DESC to see every flip run against the target environment. Combined with the append-only triggers above, this is the canonical record for "who flipped enforcement, when, and why" -- and it cannot be retroactively edited.

Game callback DLQ table migration

Resolved by alembic migration 0030 (game_callback_dead_letter). servers_v2/game_service/app/services/callback_brand.py now persists every brand-unresolved callback to the table best-effort before logging + incrementing game_callback_brand_unresolved_total{provider}. DB errors during the INSERT are caught and logged at WARN -- they never block the provider response. Operator interaction is via servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py; see the "Game callback DLQ procedure" section above.
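The best-effort persistence described above can be sketched as follows. This is an illustration, not the contents of callback_brand.py: sqlite3 stands in for the production database, the two-column table layout is hypothetical, and an in-process Counter replaces the Prometheus counter.

```python
import json
import logging
import sqlite3
from collections import Counter

logger = logging.getLogger("callback_brand_sketch")

# Stand-in for game_callback_brand_unresolved_total{provider}.
unresolved_total: Counter = Counter()


def record_unresolved_callback(
    conn: sqlite3.Connection, provider: str, payload: dict
) -> None:
    """Persist best-effort, then log and count; never raise to the caller."""
    try:
        conn.execute(
            "INSERT INTO game_callback_dead_letter (provider, payload) VALUES (?, ?)",
            (provider, json.dumps(payload)),
        )
        conn.commit()
    except sqlite3.Error as exc:
        # A DB error is logged at WARN and must not block the provider response.
        logger.warning("DLQ insert failed for provider=%s: %s", provider, exc)
    # Logging and counting happen whether or not the row was persisted.
    logger.warning("brand-unresolved callback from provider=%s", provider)
    unresolved_total[provider] += 1
```

The counter therefore remains the reliable signal during a database incident, while the table holds whatever could be persisted for later replay.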