Multi-Brand Isolation Rollout Runbook
Status
Ready
Last Verified Commit
934d47f4 (T16-C classifier upgrade; T15-A..E swept the deferred RW
backlog so make audit-brand-id-rw reports FIX=0 + FIX-TEST=0;
T16-A closed the CR9 admin_service Critical+Important findings; T16-B
fail-loud-converted every conditional brand predicate in
wallet_service; T16-C extended the classifier to detect dynamic-SQL
helpers; T16-D doc sync — this entry)
Date
2026-04-29
Owners
- Platform Backend
- Player Domain
- Wallet Domain
- Agent Domain
Affected Services
- gateway
- player_service
- wallet_service
- wallet_worker
- rolling_service
- rolling_worker
- promotion_service
- promotion_worker
- game_service
- agent_service
- agent_worker
- admin_service
- admin_worker
- recon_service
- recon_worker
Related Docs
- docs/adr/ADR-009-multi-brand-domain-routed-isolation.md
- docs/specs/multi-brand/2026-04-27-multi-brand-isolation-spec.md
- docs/plans/multi-brand/2026-04-27-multi-brand-isolation-plan.md
- docs/runbooks/local-docker/local-docker-full-stack.md
- docs/runbooks/wallet/ruby-wallet-topology-rollout.md
On-call TL;DR
This rollout enables multi-brand isolation on a previously single-brand platform. The biggest risks are:
- a Phase 16 enforcement flip that locks legitimate users out of one or more brands (revert per Phase 16 mid-flip procedure)
- a brand-resolution failure that misroutes money (use the Diagnosis Playbook below; flip the affected service to observe to buy time)
- a game-callback brand-unresolved failure (callbacks land in the DLQ; no money is lost, but provider-side state may diverge until the DLQ is replayed)
Who to page: wallet domain owner for any wallet-related alert; agent
domain owner for agent_brand allow-list issues; platform on-call
for everything else.
Where to look:
- Per-brand request rates and rejection counters in the multi-brand-isolation Grafana dashboard
- Per-service enforcement mode via the multi_brand_enforcement_mode gauge
- Live brand catalog: psql -c "SELECT brand_id, brand_code, status FROM brand" (canonical source); Redis KEYS domain:agent:* and KEYS domain:level:* for active domain bindings
Goal
Validate and roll out multi-brand isolation safely. After this rollout the
platform can host more than one brand on the same servers_v2 runtime, with
brand-scoped player identity, wallet state, rolling, settlement, and
configuration.
This runbook is not approved for production enablement of a second brand until the related spec, implementation plan, tests, and migrations are complete and the release gate below passes.
Preconditions
- ADR-009 is Accepted.
- Spec status is Approved.
- Implementation plan done definition is satisfied.
- Migrations are reviewed and reversible for non-production environments.
- A default brand has been seeded; every existing row has been backfilled with brand_id = default.
- An agent_brand row of the form (agent_id, default_brand_id, status='enabled') has been seeded for every existing agent row; without it Phase 16 hard-flip will reject every existing agent.
- The Redis maps domain:agent:* and domain:level:* have been rewritten to bind every existing domain to the default brand.
- Worker health reflects forward progress per ADR-003.
- Wallet ledger and outbox failures surface per ADR-004.
- Alerting exists for every counter on which the Phase 16 release gate or an in-runbook alarm depends. Every name below maps to an emitted counter in servers_v2/shared/contracts/src/rgb_contracts/infra/ or per-service code; full per-counter playbooks live in diagnosis-playbook.md.
  - Brand resolution / forwarding: brand_resolution_failed_total{reason,service} (the service="gateway" slice is the canonical edge alarm), agent_brand_not_allowed_total, recovery_brand_mismatch (envelope reason; surfaces via brand_resolution_failed_total{reason="recovery_brand_mismatch"}).
  - Wallet brand-signature defence-in-depth: wallet_brand_signature_failed_total{caller_service,reason,mode}, wallet_brand_signature_missing_total{caller_service}, wallet_brand_signature_misconfigured_total{mode} (P1-4), wallet_brand_signature_replay_blocked_total (failure-reason slice of the previous), wallet_brand_signature_replay_redis_outage_total{caller_service,mode} (delivered P2-δ).
  - Wallet topology: wallet_topology_default_brand_unprimed_total (delivered P2-δ; pages on any non-zero rate -- every increment is a hard runtime failure), wallet_topology_default_brand_fallback_total{caller_origin} (per-brand-onboarding signal).
  - Cross-brand rejections (per service): wallet_cross_brand_rejected_total{command,mode}, agent_cross_brand_rejected_total{command,mode}, game_cross_brand_rejected_total{command,mode}, promotion_cross_brand_rejected_total{command,mode}, recon_cross_brand_rejected_total{command,mode}, rolling_cross_brand_rejected_total{command,mode}.
  - Game-callback edge: game_callback_brand_unresolved_total{provider}, game_callback_brand_unresolved_total{reason="raw_account_in_enforce"} and {reason="raw_guid_in_enforce"} (P1-3 brand-unaware fallback gate -- both must be zero before flip), game_callback_verification_bypassed_total{provider} (T1-D-C2; must be zero in production).
  - Per-caller internal-token rollout: internal_caller_token_legacy_total{consumer,caller} (Stage C drain signal), internal_caller_token_legacy_rejected_total{consumer,caller} (T4-D-I2; pages on any non-zero rate once PER_CALLER_TOKEN_REQUIRED=on).
  - Event schema drain: event_legacy_schema_total{stream,consumer} (must be zero before flip).
  - Gateway session + JWT: gateway_jwt_session_missing_total (T1-D-C1; baseline + spike alarm after the Phase 4E dual-read fallback removal), gateway_jwt_unknown_kid_total (T4-D-I4; tracks RS256 kid rotation pub/sub propagation).
  - Admin operator-identity + audit: admin_operator_id_mismatch_total{reason} (P1-3; pages on any non-zero rate -- spoof or legacy token), admin_audit_write_failed_total{route} (P1-4; pages on any non-zero rate -- money committed without audit row), agent_balance_legacy_write_total{change_type} (T4-D-I3; baseline non-zero pre-migration to wallet_service).
  - Admin WebSocket auth (T4-D-I7): admin_ws_legacy_query_token_total (cutover trend, must be zero before flipping admin_ws_allow_query_token=False in production), admin_ws_rate_limited_total{reason}.
  - PII boot guard (T1-D-C3): pii_aes_unconfigured_total{service,op} (pages on any non-zero rate in production -- the boot guard refuses to start prod-grade pods, so a non-zero prod sample is a high-severity alert).
  - Plisio webhook hardening (T1-D-C5/I6): plisio_callback_replay_blocked_total{status}, plisio_callback_replay_redis_outage_total{env}.
  - Recovery flow (Codex P1-#5/#6, T4-D-I5): recovery_sms_delivery_failed_total{service}, recovery_email_unprovisioned_total{service}, recovery_rate_limit_redis_outage_total{bucket} (T4-D-I5; pages on any non-zero rate when no concurrent Redis incident).
  - Enforcement-mode posture: security_downgrade_total{service} (page; service started in non-enforce mode while >1 brand enabled), multi_brand_enforcement_mode{service} gauge.
  - Per-brand traffic: per-brand wallet outbox publication freshness (existing wallet outbox alert, scoped per brand label) and per-brand request_total{brand_code,service} zero-traffic alert.
- Game provider operations have signed off on the outbound account namespace change for every supported provider.
Local Docker Validation
1. Start the full local stack using the local Docker runbook.
2. Apply multi-brand migrations and run the default brand backfill.
3. Confirm brand, brand_config, and agent_brand exist and that every brand-scoped table has zero NULL brand_id rows.
4. Seed a second brand brand2 with a distinct brand_code, a distinct default currency, and at least one bound domain.
5. Bind one domain:agent:* and one domain:level:* Redis entry to brand2.
6. Confirm gateway, player_service, wallet_service, rolling_service, promotion_service, game_service, agent_service, admin_service, and recon_service health is green.
7. Confirm every worker reports forward-progress readiness.
8. Run unit and contract tests for:
   - shared brand contracts
   - migrations and backfill
   - gateway brand resolution and JWT/domain mismatch
   - player_service cross-brand registration and recovery
   - agent_service allow-list enforcement
   - wallet_service cross-brand command rejection per command
   - wallet_service per-brand topology and policy resolution
   - rolling_service brand-scoped consumption and progress
   - promotion_service per-brand resolution
   - game_service outbound namespacing and inbound reverse-parse per provider
   - recon_service brand-scoped match and approval
   - admin_service brand catalog, brand_config, agent_brand CRUD
   - admin_service staff removal (the seven deleted legacy route files no longer mount any router; previously served paths under /api/admin/user/*, /api/admin/pushbullet/*, /api/admin/shooter/*, /api/admin/web/rules/*, /api/admin/web/faq/*, /api/admin/web/config/* return 404)
   - observability log fields and metric labels
9. Run two-brand local end-to-end flows:
   - register the same account string under default and brand2, confirm two distinct player_id values
   - login on default, attempt a request with that JWT against brand2's domain, confirm gateway rejects it
   - deposit, bet, settle, rolling, withdraw on each brand independently, confirm no cross-brand state leak
   - issue a coupon on brand2, use it on brand2, attempt the same coupon code on default, confirm rejection
   - send a game provider callback for brand2, confirm the outbound account is {brand_code}_{account} and the callback reverse-parses into brand2
   - run a recon match against a brand2 deposit, confirm the approval resolves into brand2 only
   - change a brand_config value on brand2 (e.g. cashback rate) and confirm brand2 settlements use the new value while default settlements continue to use the global default
10. Confirm every money movement has immutable wallet_ledger rows carrying brand_id.
11. Confirm wallet, rolling, promotion, and player outbox events carry brand_id in payload.
12. Confirm structured logs and Prometheus metrics carry brand labels.
Running the two-brand E2E harness
Phase 14 ships a self-contained pytest harness that orchestrates the
local two-brand validation step (item 9 above) against a running
servers_v2 Docker stack.
Layout:
- Seed script: servers_v2/tools/multi_brand_backfill/setup_local_two_brand.py -- idempotent. Ensures default and brand2 exist, ensures the deterministic seed agent exists, seeds the standard brand2 brand_config keys with cashback_rate=0.05 so per-brand overrides are visible, seeds agent_brand, brand-scoped agent_setting, and agent_domain for (agent_id=1, default) and (agent_id=1, brand2), and writes the two domain:agent:* Redis bindings:
  domain:agent:default.local -> {"brand_id": <default>, "agent_id": 1}
  domain:agent:brand2.local -> {"brand_id": <brand2>, "agent_id": 1}
- Harness: servers_v2/tests/test_multi_brand_two_brand_e2e.py. Gated behind both the local_docker_e2e pytest marker AND LOCAL_DOCKER_E2E=1. A normal uv run pytest invocation skips the whole module (the pure-logic helpers in tests/test_setup_local_two_brand_unit.py continue to run).
Prerequisites:
- servers_v2 docker compose stack is up and healthy (docker compose -f docker-compose.yml -f docker-compose.dev.yml ps shows every service healthy).
- The seed script can run against a migration-only local stack. It also works against an already bootstrapped stack; existing seed rows are upserted or left in place.
- tools/multi_brand_backfill/verify.py must exit 0 after the seed (the harness re-runs it as a session-scoped guard).
- Host name resolution: any browser-side test would need default.local and brand2.local in /etc/hosts pointing at 127.0.0.1. The harness uses the Host: request header instead so no /etc/hosts editing is required.
- MULTI_BRAND_ENFORCEMENT=observe set explicitly for this harness. The repository's compose default is now enforce; Phase 14's local two-brand harness still validates the observe-mode soak behavior. Do NOT run this harness with enforce set.
- RGB_JWT_KID, RGB_JWT_PRIVATE_KEY, and RGB_JWT_PUBLIC_KEYS are set in .env.compose.local; login and admin-token checks are part of the harness. Local env files may use escaped \n PEM newlines.
- NO_PROXY includes localhost,127.0.0.1,::1 so Python/httpx does not send local Docker traffic through a macOS or corporate HTTP proxy.
Run it (preferred — the same recipe make verify-servers-v2-two-brand
executes; reproducing here so an operator can step through manually):
cd servers_v2
export RGB_MULTI_BRAND_ENFORCEMENT=observe
docker compose -f docker-compose.yml -f docker-compose.dev.yml --env-file .env.compose.local up -d
# Load .env.compose.local into the current shell. Do NOT use
# ``set -a; . ./.env.compose.local; set +a`` — if your local
# PEM blocks are unescaped multi-line, the shell parses
# ``-----BEGIN PRIVATE KEY-----`` as a command and the source
# aborts. ``tools/dump_env_compose.py`` (BEGIN/END-aware Python
# parser) emits shell-safe export lines for both shapes.
eval "$(uv run python -m tools.dump_env_compose .env.compose.local)"
export DATABASE_URL="postgresql+psycopg://${RGB_POSTGRES_USER}:${RGB_POSTGRES_PASSWORD}@localhost:5433/${RGB_POSTGRES_DB}"
export REDIS_URL=redis://localhost:6381
export NO_PROXY=localhost,127.0.0.1,::1
export no_proxy=localhost,127.0.0.1,::1
# E2E harness needs to know which caller token to attach.
export E2E_INTERNAL_CALLER_SERVICE=gateway
export E2E_INTERNAL_SERVICE_TOKEN="$RGB_INTERNAL_SERVICE_TOKEN_GATEWAY"
uv run python -m tools.multi_brand_backfill.setup_local_two_brand
LOCAL_DOCKER_E2E=1 \
uv run pytest tests/test_multi_brand_two_brand_e2e.py -v -m local_docker_e2e
Expected output: 19 passed (or a small number skipped where a
diagnostic endpoint is not present in the running image, e.g. the
game_service internal/game/reverse-parse-account helper). Any FAIL
indicates a real isolation regression -- see the relevant section of
the Diagnosis Playbook below.
The harness exercises:
- Two brands seeded with distinct brand_id and brand_code.
- Same account string under both brands -> two distinct player_id values (UNIQUE (brand_id, account) is in effect).
- Per-brand wallet read isolation across two players sharing one account string.
- cashback_rate override on brand2 resolves to 0.05 while default does not see that value.
- default_alice_e2e / brand2_alice_e2e reverse-parse to their respective brands at game_service.
- Externally-supplied X-Brand-Id is stripped at the gateway.
- Every legacy back-office path removed in Phase 12 returns 404 against the live admin_service.
- At least one Prometheus metrics surface exposes a brand / brand_id / brand_code label.
- A JWT issued for default replayed against brand2's domain does not 5xx and (best-effort) increments brand_resolution_failed_total on the gateway.
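The {brand_code}_{account} namespacing that the harness checks at game_service can be illustrated with a minimal sketch. The function names and the longest-prefix rule below are assumptions made for illustration only; the real resolver (callback_brand.resolve_callback_brand) may behave differently.

```python
from __future__ import annotations


def namespace_account(brand_code: str, account: str) -> str:
    """Build the outbound provider account in the {brand_code}_{account} form."""
    return f"{brand_code}_{account}"


def reverse_parse_account(
    namespaced: str, known_brand_codes: set[str]
) -> tuple[str, str] | None:
    """Split an inbound provider account back into (brand_code, account).

    Hypothetical helper: tries the longest known brand code first so that a
    brand code containing "_" is not shadowed by a shorter one.
    """
    for code in sorted(known_brand_codes, key=len, reverse=True):
        prefix = f"{code}_"
        if namespaced.startswith(prefix) and len(namespaced) > len(prefix):
            return code, namespaced[len(prefix):]
    # Raw or unknown form: the caller should treat this as brand-unresolved.
    return None
```

With the harness accounts, reverse_parse_account("brand2_alice_e2e", {"default", "brand2"}) yields ("brand2", "alice_e2e"), while a raw account such as "alice" yields None.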
Staging Validation
- Deploy services and migrations to staging.
- Run the default brand backfill in staging; confirm zero NULL brand_id rows.
- Rewrite staging domain:agent:* and domain:level:* Redis values to the brand-aware shape; confirm every existing domain still resolves to default.
- Run all unit, contract, and integration suites against staging.
- Re-run the two-brand end-to-end flows from local Docker validation against staging.
- Compare wallet bucket totals with ledger-derived totals per brand.
- Confirm admin brand_config writes are audited.
- Confirm agent_brand writes are audited.
- Confirm topology and policy activation in staging is brand-scoped and does not affect the other brand.
- Confirm game provider callback reverse-parsing succeeds for every supported provider against staging traffic samples.
- Confirm operational dashboards show brand labels for player-, wallet-, rolling-, promotion-, and game-scoped metrics.
- Confirm the three brand-resolution alerts and the per-brand wallet outbox freshness alert fire in a synthetic-failure drill.
Release Gate
Do not enable a second brand in any shared or production-like environment if any of the following are true:
- shared brand contracts, migrations, or backfill are missing or failing
- any brand-scoped table contains rows with NULL brand_id
- the target environment has not progressed MULTI_BRAND_ENFORCEMENT to enforce after the soak window in observe mode (a second brand cannot be enabled while any service is still in observe or off)
- game_callback_brand_unresolved_total{reason="raw_account_in_enforce"} OR {reason="raw_guid_in_enforce"} is non-zero across the soak window. Pre-flip every game-provider callback path must already be migrated to send the namespaced {brand_code}_{account} form; the legacy raw-account / raw-guid fallback is brand-unaware and is rejected outright in enforce mode (see Bug #3 / callback_brand.resolve_callback_brand). A non-zero counter means at least one provider still sends the raw form -- migrate the upstream caller and re-soak before flipping enforce.
- the soak window has not been at least 2 * JWT_EXPIRE_MIN minutes (default JWT_EXPIRE_MIN = 60, so at least 120 minutes) measured from Phase 5 deploy time (the last moment a brand-unaware JWT could be issued), so that JWTs issued before brand-aware login went live have all expired
- any JWT in the active session population still lacks a brand_id claim
- any wallet_service command path can mutate a row whose brand differs from the request brand
- per-brand topology or policy activation can affect another brand's active document
- per-brand topology or policy resolution returns an active document belonging to a different brand than the requesting brand
- per-brand rolling completion ratios, cashback / rebate / lossback rates, or payment channel selection do not resolve from brand_config (with documented global default fallback) as specified
- any game provider callback fails to reverse-parse the namespaced account
- any brand-scoped query in any service is missing a brand_id filter
- any path previously served by the seven deleted admin_service legacy route files still responds with anything other than 404
- the three brand-resolution alerts or per-brand wallet outbox freshness alert are not wired
- structured logs or Prometheus metrics are missing brand labels
- two-brand local Docker or staging end-to-end flows fail at any step
- the project integer in admin_service/app/tasks/tag.py is still in use anywhere
- any global_var row whose key is duplicated by a brand_config entry still exists (the deletion list documented in plan Phase 12 has not been applied)
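The soak-window condition in the gate above is simple arithmetic: the earliest permissible flip is the Phase 5 deploy time plus 2 * JWT_EXPIRE_MIN minutes. A minimal sketch follows; the function names are hypothetical and the deploy timestamp is an illustrative value, not a recorded one.

```python
from datetime import datetime, timedelta, timezone


def earliest_enforce_flip(phase5_deployed_at: datetime, jwt_expire_min: int = 60) -> datetime:
    """Earliest moment the soak window (2 * JWT_EXPIRE_MIN minutes) has elapsed."""
    return phase5_deployed_at + timedelta(minutes=2 * jwt_expire_min)


def soak_window_satisfied(
    phase5_deployed_at: datetime, now: datetime, jwt_expire_min: int = 60
) -> bool:
    """True once at least 2 * JWT_EXPIRE_MIN minutes have passed since Phase 5 deploy."""
    return now >= earliest_enforce_flip(phase5_deployed_at, jwt_expire_min)


if __name__ == "__main__":
    # Illustrative timestamp only.
    deployed = datetime(2026, 4, 29, 10, 0, tzinfo=timezone.utc)
    print(earliest_enforce_flip(deployed).isoformat())
```

With the default JWT_EXPIRE_MIN of 60, a Phase 5 deploy at 10:00 UTC gives an earliest flip of 12:00 UTC. The other gate conditions still apply independently of this check.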
Rollback Trigger
Pause rollout or roll back immediately if:
- brand_resolution_failed_total rises above the documented threshold
- wallet_cross_brand_rejected_total becomes non-zero in normal traffic
- game_callback_brand_unresolved_total becomes non-zero in normal traffic
- a player can register or log in under the wrong brand
- a wallet command mutates a row whose brand differs from the request brand
- per-brand topology or policy activation alters another brand's active document
- ledger writes fail or are delayed beyond the wallet release threshold
- bucket totals diverge from ledger-derived totals for any brand
- a game provider callback for one brand routes into a different brand's player
- a recon approval resolves into the wrong brand
Operational Reference
Grafana dashboard
Dashboard name: multi-brand-isolation. Required panels:
- Per-brand request rate (stacked): sum by (brand_code) (rate(request_total[1m]))
- Brand-resolution failures (by reason): sum by (reason, service) (rate(brand_resolution_failed_total[5m]))
- Wallet cross-brand rejections (by command and mode): sum by (command, mode) (rate(wallet_cross_brand_rejected_total[5m]))
- Game callback brand unresolved (by provider): sum by (provider) (rate(game_callback_brand_unresolved_total[5m]))
- Idempotency cross-brand splits: rate(wallet_idempotency_brand_split_total[5m])
- Enforcement mode per service (gauge): multi_brand_enforcement_mode
- Brand-resolution latency (p99): histogram_quantile(0.99, sum by (le, service) (rate(brand_resolution_latency_seconds_bucket[5m])))
- Security downgrade events: increase(security_downgrade_total[1h])
- Admin operator-id mismatch (P1-3, by reason): sum by (reason) (rate(admin_operator_id_mismatch_total[5m]))
- Admin audit-write failure (P1-4, by route): sum by (route) (rate(admin_audit_write_failed_total[5m]))
Service name vs X-Caller-Service / metric label
The runbook below uses two distinct identifiers that operators must not confuse when grepping metrics, logs, or HTTP traces:
| service (compose / repo) | X-Caller-Service header + caller_service metric label | env-var suffix (per-caller token) |
|---|---|---|
| gateway | gateway | GATEWAY |
| admin_service | admin | ADMIN |
| agent_service | agent | AGENT |
| game_service | game | GAME |
| promotion_service | promotion | PROMOTION |
| recon_service | recon | RECON |
| rolling_service | rolling | ROLLING |
| player_service | player | PLAYER |
When you see caller_service=admin in a Prometheus query, that is the
admin_service process. There are no *_service strings on the wire
or in label values — only repo / compose use the _service suffix.
HMAC signature payloads, missing-signature counters, and per-caller
token env-var lookups all key on the short label.
INTERNAL_SERVICE_TOKEN_* rotation
Each consumer service (wallet_service, rolling_service,
promotion_service, etc.) accepts a set of caller-service tokens
(INTERNAL_SERVICE_TOKEN_GATEWAY, INTERNAL_SERVICE_TOKEN_AGENT, ...).
The consumer's require_internal_service_token dependency reads both
the active env var AND the _PREV overlap variant; rotation is per
token with the following steps:
- Set _PREV on the consumer. On every consumer instance (wallet/rolling/promotion/game/player/agent), set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)>_PREV to the current outgoing value AND keep INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> pointing at the SAME outgoing value. (At this stage both env vars hold the same value -- nothing changes for callers.) Deploy.
- Flip active to the new value on the consumer. Set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> to the new secret; _PREV still holds the outgoing one. Deploy. The consumer now accepts BOTH values via has_valid_per_caller_token_with_source(...) (shared rgb_contracts.infra.internal_auth).
- Roll the caller service to the new token. Set INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)> to the new secret on the producer side; deploy. Watch internal_caller_token_prev_used_total{consumer,caller}: every request still authenticating via the old key fires this counter. When the rate hits zero, every caller instance is on the new secret.
- Remove _PREV from the consumer. Drop INTERNAL_SERVICE_TOKEN_<UPPER(CALLER)>_PREV from env; deploy. The counter must remain at zero post-deploy.
Rollback at any step: revert the consumer env vars and redeploy — the old token regains acceptance immediately.
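The overlap acceptance that makes this rotation safe can be sketched as follows. This is an illustration under stated assumptions, not the shared has_valid_per_caller_token_with_source implementation: the function name check_caller_token, its return values, and the env mapping argument are all hypothetical.

```python
from __future__ import annotations

import hmac
from typing import Mapping


def check_caller_token(presented: str, caller: str, env: Mapping[str, str]) -> str | None:
    """Return "active", "prev", or None for a presented per-caller token.

    Hypothetical helper: reads INTERNAL_SERVICE_TOKEN_<CALLER> and its _PREV
    overlap variant from the supplied mapping and compares in constant time.
    """
    name = f"INTERNAL_SERVICE_TOKEN_{caller.upper()}"
    active = env.get(name, "")
    prev = env.get(f"{name}_PREV", "")
    if active and hmac.compare_digest(presented.encode(), active.encode()):
        return "active"
    if prev and hmac.compare_digest(presented.encode(), prev.encode()):
        # A real consumer would increment a "prev used" counter at this point.
        return "prev"
    return None
```

During the first step both variables hold the same value, so every request matches "active" and nothing changes for callers. After the second step, callers that have not yet rolled match "prev", which is the traffic the drain counter is meant to expose.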
BRAND_SIGNING_KEY rotation
Same overlap shape, applied to the wallet brand-signature verifier:
- Set BRAND_SIGNING_KEY_PREV on every wallet_service instance to the current outgoing key; keep BRAND_SIGNING_KEY at the same outgoing value. Deploy.
- Flip BRAND_SIGNING_KEY on wallet_service to the new key; _PREV still holds the outgoing one. Deploy. The verifier (verify_request_brand_signature(... signing_key, signing_key_prev, ...)) now accepts signatures produced with either key.
- Roll the 6 signing callers (gateway, admin_service, game_service, promotion_service, recon_service, rolling_service) to the new key; deploy. Watch wallet_brand_signature_prev_key_used_total{caller_service}: when the rate hits zero, every caller has cut over.
- Remove BRAND_SIGNING_KEY_PREV from wallet_service; deploy.
The verifier requires the active BRAND_SIGNING_KEY to be set;
_PREV alone is treated as a misconfiguration (empty active key) and
fails closed in enforce mode. This prevents leaving the old key in
place indefinitely.
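A minimal sketch of the two-key verification and the fail-closed rule for an empty active key. It assumes HMAC-SHA256 over an opaque payload; the real payload layout, digest choice, and the signature of verify_request_brand_signature are not specified here, so every name below is hypothetical.

```python
import hashlib
import hmac


def sign(payload: bytes, key: str) -> str:
    """Hex HMAC-SHA256 of the payload (assumed scheme, for illustration)."""
    return hmac.new(key.encode(), payload, hashlib.sha256).hexdigest()


def verify_with_overlap(
    payload: bytes, signature: str, signing_key: str, signing_key_prev: str = ""
) -> str:
    """Return "ok", "ok_prev", "failed", or "misconfigured"."""
    if not signing_key:
        # _PREV alone is never enough: an empty active key is a misconfiguration.
        return "misconfigured"
    presented = signature.encode()
    if hmac.compare_digest(presented, sign(payload, signing_key).encode()):
        return "ok"
    if signing_key_prev and hmac.compare_digest(
        presented, sign(payload, signing_key_prev).encode()
    ):
        # A real verifier would count this as previous-key usage.
        return "ok_prev"
    return "failed"
```

The "ok_prev" outcome corresponds to the traffic that should drain to zero before _PREV is removed.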
Per-caller-service token rollout
Phase 5B introduced per-caller tokens. Each consumer service accepts both
the legacy single INTERNAL_SERVICE_TOKEN AND a per-caller variant
INTERNAL_SERVICE_TOKEN_<UPPER(caller)>. The rollout has three stages:
- Stage A (now): legacy token still active everywhere; per-caller tokens not provisioned. The internal_caller_token_legacy_total{consumer,caller} counter shows 100% legacy traffic.
- Stage B (transition): provision per-caller env vars on every consumer; deploy. Both tokens are accepted; per-caller takes precedence. The counter shows mixed legacy + per-caller usage.
- Stage C (terminal): revoke the shared INTERNAL_SERVICE_TOKEN from every consumer; deploy. From this point only per-caller tokens are accepted. The counter must remain at zero or only fire for misconfigured callers.
The per-caller env-var names are:
- INTERNAL_SERVICE_TOKEN_GATEWAY (consumed by every other service when receiving from gateway)
- INTERNAL_SERVICE_TOKEN_AGENT
- INTERNAL_SERVICE_TOKEN_ADMIN
- INTERNAL_SERVICE_TOKEN_RECON
- INTERNAL_SERVICE_TOKEN_GAME
- INTERNAL_SERVICE_TOKEN_PROMOTION
- INTERNAL_SERVICE_TOKEN_ROLLING
Each consumer reads the relevant subset (e.g. wallet_service reads all 7 since every other service may call into wallet).
Pre-Stage-C release gate:
- internal_caller_token_legacy_total{consumer="wallet_service"} is zero across the soak window
- internal_caller_token_legacy_total{consumer="player_service"} is zero across the soak window
- (repeat for every consumer)
To roll Stage C: remove the INTERNAL_SERVICE_TOKEN env var from each
consumer's deploy template and roll one consumer at a time. Mid-flip
abort: re-provision the env var and roll back the consumer deploy.
The audit test
servers_v2/tests/test_per_caller_header_audit.py is a regression net
that asserts every internal HTTP client in the stack sends
X-Caller-Service. Run it before each Stage C deploy:
cd servers_v2 && uv run pytest tests/test_per_caller_header_audit.py -v
JWT RS256 key (JWT_PRIVATE_KEY / JWT_PUBLIC_KEY) rotation
JWT signing uses RS256 with kid header. The private key
(JWT_PRIVATE_KEY) is provisioned only into player_service and
agent_service; the public key (JWT_PUBLIC_KEY) goes everywhere
that verifies JWTs (gateway and any other verifier). Verifiers
maintain an active key set keyed by kid. Rotation procedure:
1. Generate a new RSA keypair with a new kid (e.g. kid=YYYYMMDD-N). Add it to the verifiers' active key set under its kid (env var JWT_PUBLIC_KEYS is a JSON map of kid -> public_key); deploy the verifier change first.
2. Roll the verifier deploy across gateway and any other verifying service; confirm /health lists both old and new kid as accepted.
3. Update JWT_PRIVATE_KEY and JWT_KID on player_service and agent_service to the new keypair; restart. New tokens issued from this point use the new kid.
4. Overlap window: keep the old kid in the verifier's active set for at least JWT_EXPIRE_MIN + REFRESH_TOKEN_TTL + 60min buffer (default: 60min + REFRESH_TOKEN_TTL + 60min). All in-flight tokens drain by the end of this window.
5. Remove the old kid from JWT_PUBLIC_KEYS on the verifiers; re-deploy. Any token presented after this point that uses the old kid is rejected as unknown_kid.
Emergency rotation (key compromise): skip step 4's overlap. Remove
the old kid immediately on verifiers; force re-login for every
session by rolling a global session-invalidation flag. Customer
impact: every active session must re-authenticate. Document in
#incidents.
JWT signing tests must include: (a) alg=none rejected, (b)
alg=HS256 rejected when configured for RS256, (c) JWT with unknown
kid rejected, (d) JWT signed with old kid accepted during overlap,
(e) JWT signed with old kid rejected after overlap.
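Cases (a) to (c) above can be decided from the JWT header alone, before any signature work. The sketch below shows only that header pre-check against a JWT_PUBLIC_KEYS-style JSON map; it does not verify the RS256 signature, and the function name and return strings are hypothetical.

```python
import base64
import json


def precheck_jwt_header(token: str, public_keys_json: str) -> str:
    """Return "ok", "bad_alg", "unknown_kid", or "malformed" from the header only.

    public_keys_json is a JSON map of kid -> public_key. Signature
    verification is deliberately out of scope for this sketch.
    """
    keys = json.loads(public_keys_json)
    try:
        header_b64 = token.split(".")[0]
        padded = header_b64 + "=" * (-len(header_b64) % 4)
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        # Covers bad base64, bad UTF-8, and bad JSON.
        return "malformed"
    if not isinstance(header, dict):
        return "malformed"
    if header.get("alg") != "RS256":
        # Rejects alg=none and alg=HS256 alike.
        return "bad_alg"
    kid = header.get("kid")
    if not isinstance(kid, str) or kid not in keys:
        return "unknown_kid"
    return "ok"
```

During the overlap window both the old and the new kid are present in the map, so tokens under either pass this check; once the old kid is removed, the same token falls through to "unknown_kid".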
Brand-signature enforcement flip (WALLET_BRAND_SIGNATURE_REQUIRE)
wallet_service accepts an X-Brand-Signature header on brand-scoped
writes from internal callers. Whether unsigned writes are rejected in
enforce mode is gated by the env var WALLET_BRAND_SIGNATURE_REQUIRE
on wallet_service:
- Default: "off" (transitional). Wallet still verifies signatures when present, but unsigned writes are accepted and a counter (wallet_brand_signature_missing_total{caller_service=*}) is incremented per missing-signature observation, tagged with the reporting X-Caller-Service.
- Target: "on". Unsigned brand-scoped writes are rejected with 403 brand_signature_missing. This is the actual security control; if the var is never flipped, brand-signature enforcement is dead even after the global enforce-mode flip.
Rollout stages:
- Stage A (current): the 6 wallet write callers (compose names: gateway, admin_service, game_service, promotion_service, recon_service, rolling_service; their caller_service metric labels are gateway, admin, game, promotion, recon, rolling respectively per the table above) all sign brand-scoped writes to wallet_service. Wallet verifies presence + signature when supplied but does not reject unsigned writes. agent_service and player_service have no outbound wallet HTTP client and therefore are not in this list (see the table-freeze section below for the broader DB-writer set, which is a distinct concept).
- Stage B (soak): monitor wallet_brand_signature_missing_total{caller_service=*} over the full soak window (same window as the enforce-mode soak). Expect exactly zero increments. Any non-zero value identifies a misconfigured caller (missing BRAND_SIGNING_KEY, missing header injection in the HTTP client, etc.) and blocks the flip until fixed and re-soaked.
- Stage C (flip): set WALLET_BRAND_SIGNATURE_REQUIRE=on on every wallet_service instance, restart wallet. Verify rejection works by sending a hand-crafted unsigned request to a brand-scoped endpoint and confirming a 403 response.
Rollback: set WALLET_BRAND_SIGNATURE_REQUIRE=off, restart wallet.
Unsigned writes are accepted again immediately on restart. Use only
if a real production caller turns out to be missing the header (and
file a P0 to fix the caller before re-attempting the flip).
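The effect of the flag on an incoming brand-scoped write reduces to a three-way decision, sketched here. The function name and the returned labels are hypothetical; only the off/on semantics come from the description above.

```python
def decide_brand_signature(require: str, signature_present: bool) -> str:
    """Map WALLET_BRAND_SIGNATURE_REQUIRE and header presence to an action.

    Returns one of:
      "verify"                              signature supplied, verify it
      "reject_403_brand_signature_missing"  unsigned write, flag is "on"
      "accept_and_count_missing"            unsigned write, flag is "off"
    """
    if signature_present:
        # Signatures are verified whenever present, in either mode.
        return "verify"
    if require == "on":
        return "reject_403_brand_signature_missing"
    # Transitional mode: accept, but record the missing-signature observation.
    return "accept_and_count_missing"
```

The Stage C verification step corresponds to the second branch: an unsigned request sent after the flip should come back as a 403.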
Secret storage and provisioning
Where each shared secret lives:
- JWT_PRIVATE_KEY, JWT_PUBLIC_KEYS (a JSON map of kid -> public_key), JWT_KID: provisioned via the secrets manager (Vault path secret/multi-brand/jwt/<env>); read access scoped to player_service and agent_service deploy roles for the private key, all-services for public keys.
- BRAND_SIGNING_KEY: provisioned via the secrets manager (Vault path secret/multi-brand/brand-signing/<env>); read access scoped to gateway, agent_service, and wallet_service deploy roles.
- INTERNAL_SERVICE_TOKEN_*: one path per consumer-caller pair (secret/multi-brand/internal-tokens/<consumer>/<caller>/<env>).
- Production deploys must verify these are present at boot via a startup probe; a missing secret fails the boot rather than falling back silently.
Application-level table-freeze flag
Several migration steps (0027 PK rewrites, partial-unique-index
rebuilds) and Production Rollback B require pausing application
writes to a specific table family while the migration / repair runs.
The mechanism is a Redis-backed flag set:
- Key pattern: table_freeze:{table_name} (string, value 1 when frozen, key absent when unfrozen).
- Every writer (wallet_service, player_service, agent_service, etc.) checks the flag before any INSERT/UPDATE on a frozen table and returns the documented 503 service_paused envelope to the caller when the flag is set. Reads are not blocked.
- Set: redis-cli SET table_freeze:wallet_topology 1 EX 1800 (30 min TTL safety net; the TTL prevents a forgotten flag from permanently locking writes).
- Clear: redis-cli DEL table_freeze:wallet_topology.
- All set/clear operations are logged in #incidents with the operator identity, table name, and reason.
- Writers cache the flag check for 1 second to avoid Redis hot-pathing; this means the freeze takes effect within 1 second of the SET.
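The 1-second cached check on the writer side might look like the sketch below. The class name, the injected fetch callable (standing in for a Redis lookup of the table_freeze:{table_name} key), and the injectable clock are all assumptions made so the example runs without Redis.

```python
import time
from typing import Callable, Dict, Tuple


class FreezeFlagCache:
    """Caches table-freeze lookups for a short TTL (1 second by default)."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # returns True when the key is set
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def is_frozen(self, table_name: str) -> bool:
        now = self._clock()
        cached = self._entries.get(table_name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        frozen = self._fetch(f"table_freeze:{table_name}")
        self._entries[table_name] = (now, frozen)
        return frozen
```

A writer would call is_frozen("wallet_topology") before an INSERT/UPDATE and return the 503 service_paused envelope when it is True. Because a cached "not frozen" answer can be up to one TTL old, a freeze becomes visible to each writer within about 1 second of the SET, which matches the behaviour described above.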
Admin VPN ingress source of truth
The IP allow-list for admin_service ingress is maintained in the
infrastructure repo (infra/lb/admin-allowlist.tf). Onboarding a
new operator IP requires a PR to that repo plus security-team
approval; the PR triggers an LB config push. Off-boarding requires
the same PR flow. The source of truth is the Terraform module; ad-hoc
LB UI changes are forbidden and surfaced by drift-detection on the
infra repo.
Pen-test release gate
A second brand cannot be enabled in production until an external penetration test of the multi-brand isolation surface has completed and findings are remediated. Scope:
- gateway brand resolution (Host/Origin spoof, JWT replay, JWT forgery, X-Brand-Id injection)
- wallet_service cross-brand command rejection under observe and enforce
- admin_service write surface (network isolation, X-Operator-Id capture, audit-row tamper)
- game callback reverse-parse (account namespace collision, prefix attacks)
- MULTI_BRAND_ENFORCEMENT downgrade (operator-side abuse, audit trail)
Diagnosis Playbook
When any of the multi-brand alerts fires, use this section before considering rollback.
For the brand-signature, brand-check, agent-allow-list,
game-callback-resolve, and admin-audit counters that lack a section
here, see the dedicated operator-facing entry points in
diagnosis-playbook.md. That file covers
wallet_brand_signature_failed_total,
wallet_brand_signature_replay_blocked_total,
wallet_cross_brand_rejected_total,
brand_resolution_failed_total{service="gateway"},
agent_brand_not_allowed_total,
game_callback_brand_unresolved_total,
admin_operator_id_mismatch_total, and
admin_audit_write_failed_total, plus the brand-signature
replay-cache + default-brand-cache counters
(wallet_brand_signature_replay_redis_outage_total and
wallet_topology_default_brand_unprimed_total, both delivered by
P2-δ), the per-service *_cross_brand_rejected_total family for
agent_service / game_service / promotion_service /
recon_service / rolling_service, and the post-Tier-1/Tier-4
hardening counters (gateway_jwt_session_missing_total,
gateway_jwt_unknown_kid_total, pii_aes_unconfigured_total,
plisio_callback_replay_blocked_total,
plisio_callback_replay_redis_outage_total,
game_callback_verification_bypassed_total,
internal_caller_token_legacy_rejected_total,
agent_balance_legacy_write_total,
admin_ws_legacy_query_token_total,
admin_ws_rate_limited_total,
recovery_sms_delivery_failed_total,
recovery_email_unprovisioned_total,
recovery_rate_limit_redis_outage_total).
brand_resolution_failed_total{reason="jwt_domain_mismatch"}
What it means: a player or agent JWT was presented at a domain whose
brand differs from the JWT's brand claim, OR the JWT lacks
brand_id.
Diagnosis:
- Query the structured logs for the past 5 min filtered on the counter increment. Look at: request_id, jwt.brand_id, request.state.brand_id, domain.
- If the counts cluster on one source IP / one user-agent, it is likely a scanner or a stale mobile client; expected baseline noise.
- If the counts cluster on legitimate browser user-agents and one brand's domain, look for a recent domain:agent:* or domain:level:* Redis change.
- If JWTs lacking brand_id are still appearing more than 2 * JWT_EXPIRE_MIN after Phase 5 deploy, refresh-token rotation is minting brand-unaware tokens; check Phase 5 implementation in player_service / agent_service.
Mitigation:
- Baseline noise: ignore; refine the alert threshold.
- Single domain spike: re-verify the Redis mapping for that domain; inspect the LB and any DNS changes.
- Refresh-token issue: hot-fix player_service / agent_service to reject refreshes from brand-unaware refresh tokens; force re-login for affected sessions.
- If the spike is correlated to a Phase 16 flip and is causing real user pain: revert the affected service to observe.
wallet_cross_brand_rejected_total{mode="enforce"}
What it means: a wallet command targeted a row whose brand_id
differs from the request brand. In enforce mode the command is rejected.
Diagnosis:
- Filter wallet_service logs for the past 5 min on event=cross_brand_rejected. Look at: command, caller_service (from the INTERNAL_SERVICE_TOKEN_* token used), request_brand, target_row_brand, request_id.
- If caller_service is consistent (e.g. always recon, the label value for the recon_service process; see the service-name vs caller-label table earlier in this document), that caller is not forwarding X-Brand-Id correctly OR the originating data is wrong (e.g. a deposit row carries the wrong brand_id).
- If caller_service varies, something at the gateway / agent_service edge is leaking the wrong brand into the X-Brand-Id propagation.
Mitigation:
- Single-caller bug: hot-fix the caller's brand-forwarding code OR flip just that caller's MULTI_BRAND_ENFORCEMENT to observe to buy time (writes proceed using request brand; no money lands in wrong brand because wallet uses request brand authoritatively in observe).
- Edge bug: revert gateway to observe; investigate brand resolution in the affected route.
- If the rate is high and money flow is impacted: full revert to observe per the Phase 16 mid-flip revert procedure.
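For orientation, the observe vs enforce behaviour described in this section can be sketched as below. This is an illustrative model only, not wallet_service code: check_cross_brand, Decision, and the in-memory counters stand-in are hypothetical names, and any mode other than observe or enforce is deliberately left out of scope.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Stand-in for the real metrics client (assumption for this sketch).
counters: Counter = Counter()


@dataclass(frozen=True)
class Decision:
    allowed: bool
    effective_brand: Optional[int]


def check_cross_brand(mode: str, request_brand: int, target_row_brand: int) -> Decision:
    """Model of a wallet command whose target row may carry a different brand."""
    if mode not in ("observe", "enforce"):
        raise ValueError(f"unsupported mode for this sketch: {mode!r}")
    if request_brand == target_row_brand:
        return Decision(True, request_brand)
    # Both modes count the event; only the mode label differs.
    counters[("wallet_cross_brand_rejected_total", mode)] += 1
    if mode == "enforce":
        return Decision(False, None)  # command rejected
    # observe: the command proceeds and the request brand is authoritative.
    return Decision(True, request_brand)
```

The point of the model is that flipping a caller to observe changes the outcome of the command, not the counter: the mode="observe" series keeps incrementing until the caller's brand forwarding is fixed.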
game_callback_brand_unresolved_total{provider}
What it means: a provider callback's outbound account string did not reverse-parse to a known brand. Real money may be stuck on the provider side.
Diagnosis:
- Filter game_service logs for the past 5 min on event=callback_brand_unresolved. Look at: provider, outbound_account_raw, request_id.
- The most common cause is that a new brand was created in admin_service but the per-process reverse-parse cache in game_service has not refreshed yet (60s TTL or BRAND_CATALOG_CHANGED Redis pub/sub message missed).
- The next most common cause is a game provider sending a callback for a player that never existed (out-of-band test traffic from the provider; expected non-zero baseline that must be allow-listed).
- If neither, the outbound namespacing logic for that provider is broken (e.g. provider truncated the account string below brand_code length).
Mitigation:
- Cache stale: trigger a manual BRAND_CATALOG_CHANGED publish from admin or rolling-restart game_service processes.
- Provider noise: add an allowlist for known-bad accounts at the game callback edge.
- Outbound namespacing bug: hot-fix the affected provider's namespacing helper; manually replay the dead-lettered callbacks (see DLQ Procedure below) once the fix lands.
- All cases: ack/nack the original provider callback per provider policy. Failing-open (silently ack) loses the callback; failing-closed (5xx) triggers provider retries that may DoS. Default policy: 200-ack and write to DLQ; retry from DLQ after fix.
security_downgrade_total (P0 page)
What it means: a service started in MULTI_BRAND_ENFORCEMENT != 'enforce'
while more than one brand is enabled. Either an operator flipped the
flag back during incident response, a deploy template lost the
enforce env var, or CI rolled out a service with the wrong default.
Diagnosis:
- Identify which service: security_downgrade_total{service} label.
- Query multi_brand_enforcement_mode{service} for the same service to confirm the current mode (1=observe, 0=off).
- Check the deploy history for the affected service in the past 24h; look for a config change that removed or modified MULTI_BRAND_ENFORCEMENT.
- Page the deploy lead AND the security on-call simultaneously. This is treated as a security incident until proven otherwise.
Mitigation:
- If the downgrade is intentional (legitimate incident response that required observe): document in #incidents, ack the alert with a silence window, and plan re-enforce per the Phase 16 procedure.
- If the downgrade is unintentional: re-set MULTI_BRAND_ENFORCEMENT=enforce in the env, restart the affected service, confirm /health and multi_brand_enforcement_mode show enforce. Investigate the deploy pipeline; consider a startup probe that aborts boot if the env var is missing post-Phase-16.
- Either way: never silently let the downgrade persist while >1 brand is enabled.
brand_resolution_failed_total{reason="unresolved_brand"} (page in enforce)
What it means: an authenticated agent_service request reached the
allow-list middleware (agent_brand_enforcement.py) but no
brand_id was attached by the upstream brand-resolution middleware.
Distinct from unknown_domain (HTTP host had no brand mapping at the
edge): unresolved_brand fires when resolution itself failed silently
and a request slipped through to authenticated routes without context.
Diagnosis:
- Filter agent_service logs for the past 5 min on event=agent brand allow-list check failed and reason=unresolved_brand. Look at: agent_id, path, request_id, plus the upstream Host header.
- Cross-check the brand-resolution middleware logs for the same request_id. Either the resolver crashed (look for a stack trace) or the LB forwarded an unmapped Host.
- If the spike correlates with a recent deploy of AgentBrandResolutionMiddleware or a brand-catalog change, suspect a cache miss on domain:agent:* or domain:level:*.
Mitigation:
- Resolver bug: hot-fix; consider flipping agent_service to observe until rolled out.
- Missing domain mapping: add the domain:agent:* Redis entry.
- LB / DNS misconfig: route Host correctly; reject unmapped hosts at the edge instead of letting them reach agent surfaces.
brand_resolution_failed_total{reason="missing_header"} (page in enforce)
What it means: a brand-scoped internal handler received a request
without X-Brand-Id. In enforce mode this is rejected.
Diagnosis:
- Filter logs on the affected service for the past 5 min; look at request_id, caller_service (from the internal token used), route.
- Most common cause: a recently-deployed caller service that did not propagate X-Brand-Id. Identify the caller, find the recent deploy, look at the change.
- Less common: an external request slipped past the gateway / agent strip and hit an internal handler directly.
Mitigation:
- Hot-fix the caller to forward X-Brand-Id, OR temporarily flip the affected consumer to observe until the caller rolls out.
- If external request leakage: investigate edge configuration; the X-Brand-Id strip may be misconfigured.
wallet_idempotency_brand_split_total (warning)
What it means: an inbound idempotency key matches an existing row that was recorded under a different brand. Indicates client misrouting or replay.
Diagnosis:
- Filter wallet_service logs on event=idempotency_brand_split. Compare request_brand vs the existing-row brand and the caller_service.
- If concentrated on one caller, that caller is sending the same idempotency key across brand-namespace boundaries (e.g. account-derived keys without brand prefix).
Mitigation:
- Caller fix: prefix idempotency keys with brand_code or brand_id on the caller side.
- Not a money-impact incident in itself; observe-mode would proceed with request brand. Track the rate.
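The caller-side fix can be as small as the sketch below. The helper name brand_scoped_idempotency_key and the ":" separator are assumptions for illustration; the only requirement taken from this runbook is that the key carries a brand prefix so that identical account-derived keys no longer collide across brands.

```python
def brand_scoped_idempotency_key(brand_code: str, raw_key: str) -> str:
    """Prefix an account-derived idempotency key with the brand code.

    The ':' separator is an assumption; rejecting it inside brand_code keeps
    the prefix unambiguous even when raw_key itself contains ':'.
    """
    if not brand_code or ":" in brand_code:
        raise ValueError("brand_code must be non-empty and must not contain ':'")
    if not raw_key:
        raise ValueError("raw_key must be non-empty")
    return f"{brand_code}:{raw_key}"
```

With this in place, the same account-derived key issued on default and on brand2 yields two distinct idempotency keys, so a replay stays idempotent within its brand without tripping the split counter.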
Per-brand wallet outbox publication freshness (page)
What it means: the wallet outbox publisher for one specific brand is not making forward progress (per ADR-003).
Diagnosis:
- Query the wallet outbox table filtered on the affected brand: SELECT count(*) FROM wallet.wallet_outbox WHERE brand_id = ? AND published_at IS NULL and the oldest unpublished created_at.
- Compare to other brands; if only one brand is stuck, look for a data condition specific to that brand (e.g. a poison-pill payload).
- Check Redis Stream lag on the wallet stream.
Mitigation:
- Standard wallet outbox playbook applies; brand label scopes the investigation.
- If a poison-pill: route the offending row to dead-letter, fix the payload, re-publish.
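The diagnosis query above, combined with the oldest unpublished created_at, can be sketched as a single statement. The example runs against an in-memory SQLite database purely so that it is self-contained; the production table lives in PostgreSQL, the id column and sample rows are made up, and the ? placeholder style depends on the driver in use.

```python
import sqlite3
from typing import Optional, Tuple

BACKLOG_SQL = """
SELECT count(*), min(created_at)
FROM wallet.wallet_outbox
WHERE brand_id = ? AND published_at IS NULL
"""


def outbox_backlog(conn: sqlite3.Connection, brand_id: int) -> Tuple[int, Optional[str]]:
    """Return (unpublished row count, oldest unpublished created_at) for one brand."""
    count, oldest = conn.execute(BACKLOG_SQL, (brand_id,)).fetchone()
    return count, oldest


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway stand-in for the wallet schema (illustration only)."""
    conn = sqlite3.connect(":memory:")
    # ATTACH gives SQLite a 'wallet' schema name so the query text matches.
    conn.execute("ATTACH DATABASE ':memory:' AS wallet")
    conn.execute(
        "CREATE TABLE wallet.wallet_outbox ("
        "id INTEGER PRIMARY KEY, brand_id INTEGER, created_at TEXT, published_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO wallet.wallet_outbox (brand_id, created_at, published_at) VALUES (?, ?, ?)",
        [
            (1, "2026-04-29T10:00:00", None),
            (1, "2026-04-29T09:00:00", None),
            (1, "2026-04-29T08:00:00", "2026-04-29T08:00:01"),
            (2, "2026-04-29T09:30:00", "2026-04-29T09:30:01"),
        ],
    )
    conn.commit()
    return conn
```

Running outbox_backlog for each brand side by side makes the "only one brand is stuck" comparison direct: a healthy brand returns a count near zero and no old created_at, while a stuck brand shows a growing count and an oldest timestamp that stops advancing.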
Per-brand request_total zero-traffic alert (warning)
What it means: a previously-active brand stopped receiving requests for >15 min. Either brand outage, domain misconfiguration, or upstream LB / DNS change.
Diagnosis:
- Confirm via curl that the brand's domain still resolves and the LB forwards to gateway.
- Check domain:agent:{host} and domain:level:{host} Redis entries for the brand's domains; ensure they still bind to the correct brand_id.
- Check brand-create / brand-disable audit log for any recent change to that brand's status.
Mitigation:
- Domain misconfig: re-bind the Redis entries; confirm via a curl request.
- Brand status accidentally disabled: re-enable through admin audit-tracked write.
wallet_cross_brand_rejected_total{mode="observe"} (soak-window warning)
What it means: a cross-brand wallet command was processed in observe
mode (proceeded using request brand). Above zero before Phase 16
flip indicates an internal caller is not forwarding X-Brand-Id
correctly.
Diagnosis: same as the enforce-mode diagnosis above; in observe the counter increments but no end-user 4xx is generated.
Mitigation: must be driven to zero before Phase 16 hard-flip. Hot-fix the caller; do not flip enforce until the soak counter is clean.
Game callback DLQ procedure
game_service writes every brand-unresolved callback to the dead-letter
table game_callback_dead_letter (alembic 0030). Each row holds the
provider, the raw outbound account that failed reverse-parse, the full
inbound payload (JSONB) for verbatim replay, the parse-failure reason,
and the original request_id. Operator interaction is via the CLI tool
servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py,
which reads DATABASE_URL and GAME_SERVICE_URL from the environment
(default GAME_SERVICE_URL=http://localhost:8012).
After the brand catalog is fixed, run these commands in order:
- List pending rows for the affected provider to scope the incident. The default --status is pending:
    python servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py \
      --list --provider mg
  Output is one line per row: #<id> provider=<p> status=<s> reason=<r> account=<a> created_at=<ts>.
- Inspect a candidate row in dry-run. Dry-run prints the planned POST target and the payload that would be replayed; it never mutates the row and never issues an HTTP request:
    python servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py \
      --id 42 --dry-run
- Replay the row. The tool POSTs payload_json back to the per-provider endpoint; on HTTP 2xx it sets replay_status='replayed', replayed_at=now(), replay_request_id=<new>. On HTTP 4xx/5xx the row stays pending and the tool exits with code 3 -- safe to retry. wallet_service idempotency on the underlying authorize/settle/rollback call ensures replays do not double-spend:
    python servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py \
      --id 42
- If the callback is unrecoverable (player deleted, payload corrupt, etc.), abandon the row. Abandonment marks money lost permanently and MUST carry a human operator id and a free-text reason -- both land in an admin_audit row (action='DLQ_ABANDON', resource='game_callback_dead_letter', resource_key=<row id>) that commits in the same transaction as the DLQ row flip:
    python servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py \
      --abandon --id 42 \
      --operator-id "ops:alice" \
      --reason "player guid 12345 deleted in Phase 14 cleanup"
  --operator-id and --reason are mandatory; the tool refuses to run with either missing, blank, or whitespace-only. --notes is still accepted (recorded on the DLQ row for operator-facing inspection) but is independent of the audit-bearing fields.
  If the audit-side INSERT fails (trigger fires, schema drift, audit table append-only protection rejects), the transaction rolls back: the DLQ row stays pending and the operator can retry. Exit code is 3 in that case so wrapper scripts notice the failure.
The tool refuses to act on rows that are already replayed or
abandoned (exit code 3) so re-running a script against the same id is
safe. DLQ persistence inside game_service is best-effort: a DB outage
during the original rejection logs a WARN and falls through to the
metric-only path -- it does not block the provider response.
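For teams that wrap the tool in their own scripts, the sketch below shows one way to assemble the abandon invocation and fail early on blank audit fields, mirroring the tool's own refusal. build_abandon_argv is a hypothetical wrapper helper, not part of the tool; only the flags and the script path are taken from this runbook.

```python
import sys
from typing import List

TOOL = "servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py"


def build_abandon_argv(row_id: int, operator_id: str, reason: str) -> List[str]:
    """Return the argv for abandoning one DLQ row.

    Blank or whitespace-only audit fields are rejected here so the wrapper
    fails before spawning a process the tool would refuse anyway.
    """
    if not operator_id.strip():
        raise ValueError("--operator-id must be non-blank")
    if not reason.strip():
        raise ValueError("--reason must be non-blank")
    return [
        sys.executable, TOOL,
        "--abandon", "--id", str(row_id),
        "--operator-id", operator_id,
        "--reason", reason,
    ]
```

A wrapper would typically pass the result to subprocess.run and treat return code 3 as "row left untouched, safe to retry", per the exit-code contract described above.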
Independent-transaction guarantee (post Codex-R2 fix). The DLQ
INSERT is now performed by record_dead_letter on a fresh AsyncSession
opened from app.state.pool, with an explicit commit() before the
helper returns. This means the row survives even though the calling
provider route returns its rejection envelope (e.g. _balance_resp(...),
WCResponse.error(...), BTiTokenResponse(...)) and exits its
async with session_maker() as session: block without committing the
caller's transaction -- without this, AsyncSession's __aexit__
rollback would drop the staged DLQ row. Operators can therefore rely on
this table being authoritative for replay; the only loss path remaining
is the documented best-effort swallow of DB-side errors during the DLQ
INSERT itself (which still emits a WARN log and the
game_callback_brand_unresolved_total metric).
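The difference between staging the DLQ row on the caller's session and writing it on an independent one can be reproduced in miniature. The sketch below uses sqlite3 connections as a stand-in for AsyncSession objects and a three-column table as a stand-in for game_callback_dead_letter; it illustrates the transaction behaviour only and is not the record_dead_letter implementation.

```python
import os
import sqlite3
import tempfile

SCHEMA = (
    "CREATE TABLE game_callback_dead_letter ("
    "id INTEGER PRIMARY KEY, provider TEXT, reason TEXT)"
)
INSERT = "INSERT INTO game_callback_dead_letter (provider, reason) VALUES (?, ?)"


def make_db() -> str:
    """Create a throwaway database file holding the simplified DLQ table."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    conn.close()
    return path


def staged_on_caller_session(path: str) -> None:
    """Anti-pattern: the DLQ row rides on the caller's transaction."""
    caller = sqlite3.connect(path)
    caller.execute(INSERT, ("mg", "staged"))
    caller.rollback()  # the route exits without committing; the row is dropped
    caller.close()


def recorded_on_fresh_session(path: str) -> None:
    """Pattern described above: separate session, explicit commit."""
    caller = sqlite3.connect(path)  # the caller's session, never committed
    fresh = sqlite3.connect(path)
    fresh.execute(INSERT, ("mg", "independent"))
    fresh.commit()  # durable before the helper returns
    fresh.close()
    caller.rollback()  # rejection path; does not touch the committed row
    caller.close()


def count_rows(path: str) -> int:
    conn = sqlite3.connect(path)
    (n,) = conn.execute("SELECT count(*) FROM game_callback_dead_letter").fetchone()
    conn.close()
    return n
```

Calling staged_on_caller_session leaves the table empty, while recorded_on_fresh_session leaves exactly one row, which is the property operators rely on when treating the DLQ table as authoritative for replay.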
Rollback Steps
For non-production environments:
- Stop player-facing traffic to brand-aware paths.
- Stop workers that consume brand-aware events.
- Export brand, brand_config, agent_brand, player, wallet, rolling, promotion, and game tables for investigation.
- If destructive schema changes were applied since the last snapshot, restore from the pre-rollout snapshot.
- Re-bind the domain:agent:* and domain:level:* Redis values back to the single-brand shape if necessary.
- Restart services and workers.
- Re-run health checks and the local/staging smoke checks.
For production:
The following procedures are approved for production. Each must be executed only after a documented incident is opened and the deploy lead is paged.
Production rollback A -- Phase 16 enforcement regression
Symptoms: alert fires on wallet_cross_brand_rejected_total{mode="enforce"}, brand_resolution_failed_total, or
game_callback_brand_unresolved_total post-flip with sustained
elevated rate causing user pain.
Procedure: execute the Phase 16 mid-flip revert procedure from the
plan (gateway -> agent -> recon -> player -> promotion -> rolling ->
wallet, each step flips env var to observe + restarts + confirms
/health shows observe). Total budget 30 min. After complete,
security_downgrade_total will fire (expected during incident).
Re-attempt flip only after root cause is fixed and a new soak window
is observed.
Production rollback B -- Backfill corruption (rows with wrong
brand_id)
Symptoms: counts of rows with brand_id != expected discovered after
0024b backfill OR wallet_idempotency_brand_split_total non-zero
after Phase 6 deploy.
Procedure:
- Pause writes to the affected table family via the application-level table-freeze flag.
- Open a separate incident; do NOT proceed without a verified correction plan.
- For deposit-class corruption (deposit row carries wrong brand leading to misrouted Plisio callback): run a reconciliation script that cross-references the originating request domain (logged at creation time) against the deposit row's brand_id; quarantine mismatches; manually correct after wallet team review.
- For player-class corruption (player_id assigned to wrong brand): the row is unrecoverable by automated means; open a customer-support ticket per affected player.
- Restore from PITR is the last resort and must be approved by the on-call manager + DBA + a wallet domain owner; restore loses every transaction since the PITR target. Document the cutoff window before any restore decision.
Production rollback C -- Mid-flip abort
Already covered by Phase 16 mid-flip revert above.
Production rollback D -- agent_brand seed race remediation
Symptoms: agents report that they cannot log in after the Phase 16 flip; the allow-list counter on agent_service shows allow-list-failed for valid agents.
Procedure: run the Phase 12 re-seed verification SQL (SELECT count ... LEFT JOIN agent_brand WHERE ab.agent_id IS NULL); for every
missing agent, insert (agent_id, default_brand_id, 'enabled').
Confirm the agent can log in again. Root-cause why the autoseed
trigger missed the agent (was the trigger removed prematurely?).
Production rollback E -- Brand-config regression
Symptoms: a brand_config write produces unexpected behavior (e.g.
rolling ratio set to 0 by mistake).
Procedure: every brand_config write is audited (per spec) with the
prior value. Take the prior value from the audit log entry and run UPDATE brand_config SET value = '<prior_value>' WHERE brand_id = X AND key = Y;
Investigate the original change-control failure separately.
General production constraints:
- Never run destructive rollback without an approved incident plan and on-call manager sign-off.
- Prefer traffic pause and provider callback queuing while ledger and authorization state is inspected, before any data mutation.
- Every production rollback action is logged in #incidents and added to the post-incident report.
Smoke Checks
- GET /health succeeds for every affected service.
- gateway resolves a sample request against default's domain and attaches request.state.brand_id.
- gateway rejects a sample request whose JWT brand does not match the domain brand.
- wallet_service topology and policy read endpoints return per-brand active documents for both default and brand2.
- a small bet on default authorizes, settles, produces ledger rows, and the rows carry brand_id = default.
- a small bet on brand2 authorizes, settles, produces ledger rows, and the rows carry brand_id = brand2.
- a synthetic cross-brand wallet command increments wallet_cross_brand_rejected_total and returns a stable error envelope.
- a synthetic unparseable game callback increments game_callback_brand_unresolved_total and is rejected.
- duplicate request IDs remain idempotent within a brand and remain rejected on payload mismatch within a brand.
- removed back-office routes return 404: every path previously served by legacy_admin_v2.py, legacy_auth.py, legacy_agents_v2.py, legacy_agent_withdrawals_v2.py, legacy_meta_v2.py, legacy_recon.py, and legacy_web_content.py.
Delivered hardening
Items moved out of the rollout backlog because the code, migrations,
counters, and tests have all landed on main. Each entry points to
the canonical operator playbook for the alert it surfaces.
Recovery flow brand isolation (delivered)
Brand-aware recovery is live for both player and agent surfaces.
servers_v2/player_service/app/api/routes/recovery.py and
servers_v2/agent_service/app/api/routes/recovery.py ship a
brand-precedence resolver that prefers token-embedded brand_id,
falls back to request domain when the token lacks the claim, and
hard-rejects on resolved-brand-vs-underlying-subject mismatch
regardless of MULTI_BRAND_ENFORCEMENT mode. recovery_brand_mismatch
surfaces via brand_resolution_failed_total{reason="recovery_brand_mismatch"}.
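The precedence rule above can be summarised in a few lines. This is a sketch under stated assumptions, not the code in recovery.py: resolve_recovery_brand and RecoveryBrandMismatch are hypothetical names, and raising ValueError when neither the token nor the domain yields a brand is this sketch's choice because that case is not specified here.

```python
from typing import Optional


class RecoveryBrandMismatch(Exception):
    """Resolved brand differs from the brand of the underlying subject."""


def resolve_recovery_brand(
    token_brand_id: Optional[int],
    domain_brand_id: Optional[int],
    subject_brand_id: int,
) -> int:
    """Prefer the token-embedded brand, fall back to the request domain,
    and hard-reject on mismatch with the subject's brand in every mode."""
    resolved = token_brand_id if token_brand_id is not None else domain_brand_id
    if resolved is None:
        # Not covered by the text above; failing loudly is an assumption.
        raise ValueError("no brand available from token or domain")
    if resolved != subject_brand_id:
        # Deliberately independent of MULTI_BRAND_ENFORCEMENT.
        raise RecoveryBrandMismatch(
            f"resolved brand {resolved} != subject brand {subject_brand_id}"
        )
    return resolved
```

Note that the token brand wins even when the domain disagrees with it; the domain only matters for tokens that predate the brand_id claim.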
Audit guarantees: alembic migration 0033_recovery_audit_table.py
adds recovery_audit and every observable outcome (REQUEST / VERIFY /
RESET success or rejection, plus RATE_LIMITED / TOKEN_INVALID /
BRAND_MISMATCH typed rejections) inserts one append-only row carrying
brand_id, SHA-256 hash of contact, SHA-256 hash of source IP, and a
per-action JSONB metadata blob (NEVER the raw contact, token, or
password). RESET commits the audit row in the SAME transaction as the
password UPDATE so an audit-side failure rolls back the recovery
write.
PII / OTP-leakage hardening (Codex P1-#5, P1-#6, T1-D-C4): every SMS
path now logs contact_hash=<sha256_first_8_hex> only and emits
recovery_sms_delivery_failed_total{service} /
recovery_email_unprovisioned_total{service} for operator visibility.
RECOVERY_EXPOSE_TOKEN_FOR_TESTS=on is the only escape hatch and is
boot-guarded against any production-grade RGB_ENV / APP_ENV /
ENVIRONMENT label. T4-D-I5 added per-contact rate-limit fail-closed
in production via recovery_rate_limit_redis_outage_total{bucket}.
Operator entry: every recovery-related counter has a dedicated
section in diagnosis-playbook.md.
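As an illustration of the contact_hash=<sha256_first_8_hex> log field mentioned above, a redaction helper might look like the sketch below. The function name and the strip-and-lowercase normalisation are assumptions; whether the real code normalises, and whether recovery_audit stores the full or truncated digest, should be checked against the service source.

```python
import hashlib


def contact_hash(contact: str) -> str:
    """Return the first 8 hex characters of the SHA-256 of a contact.

    Normalisation (strip + lowercase) is an assumption of this sketch so
    that the same mailbox or number hashes consistently across log lines.
    """
    normalized = contact.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:8]


def redacted_log_line(event: str, contact: str) -> str:
    """Build a log line that carries the hash and never the raw contact."""
    return f"event={event} contact_hash={contact_hash(contact)}"
```

An 8-character prefix is enough to correlate log lines for one contact during an incident, but it is not unique; treat a match as a lead, not as proof of identity.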
Admin operator-id JWT cross-check (delivered, P1-3)
admin_service cross-checks the X-Operator-Id LB header against
the verified admin JWT admin_id claim on every write route via
require_operator_id in app/api/deps.py. Mismatches and
legacy-token-without-claim cases hard-reject with HTTP 403 and
increment admin_operator_id_mismatch_total{reason}. Operator
identity is now a JWT-bound security boundary, not a network-only
contract; the LB allow-list still exists as defence-in-depth.
The dedicated playbook is in
diagnosis-playbook.md under
admin_operator_id_mismatch_total. The release-gate alarm pages on
any non-zero rate (reason=header_vs_jwt is a spoof attempt;
reason=jwt_missing_claim means a legacy admin token reached a
write route -- cycle the issuing service).
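The cross-check reduces to a comparison with two failure reasons. The sketch below is illustrative and is not require_operator_id itself: the function name is hypothetical, claims are passed as a plain dict that is assumed to be already signature-verified, and comparing as strings is an assumption about how the header and the admin_id claim are encoded.

```python
from typing import Mapping, Optional


def operator_id_mismatch_reason(
    header_operator_id: str, jwt_claims: Mapping[str, object]
) -> Optional[str]:
    """Return None when header and JWT agree, else the counter's reason label.

    jwt_claims must come from an already-verified admin JWT.
    """
    admin_id = jwt_claims.get("admin_id")
    if admin_id is None:
        return "jwt_missing_claim"  # legacy token reached a write route
    if str(admin_id) != header_operator_id:
        return "header_vs_jwt"  # possible spoof attempt
    return None
```

A route dependency would respond with HTTP 403 and increment admin_operator_id_mismatch_total{reason} whenever the returned reason is not None.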
Audit row tamper-evidence (delivered, alembic 0032)
admin_audit is DB-level append-only as of alembic migration
0032. Two BEFORE triggers on the table unconditionally raise:
- trg_admin_audit_no_update_delete -- BEFORE UPDATE OR DELETE, FOR EACH ROW. Catches both single-row and bulk UPDATE / DELETE FROM admin_audit statements; PG fires the trigger per row visited so the first row aborts the whole statement.
- trg_admin_audit_no_truncate -- BEFORE TRUNCATE, FOR EACH STATEMENT. PG forbids row-level TRUNCATE triggers, so this is its own DDL.
Both triggers call rgb.admin_audit_append_only(), which raises
EXCEPTION 'admin_audit is append-only; <op> not permitted on rgb.admin_audit'. The exception is intentionally loud.
admin_audit_write_failed_total{route} pages on any non-zero rate;
the route handlers in admin_service/app/services/admin_audit.py
emit it whenever the post-money-write admin_audit INSERT fails,
which after migration 0032 means either a transient DB error or
the append-only trigger firing on a misbehaving caller. See the
dedicated entry in diagnosis-playbook.md.
Operator interpretation: any "missing audit row" alarm or any
log line containing admin_audit is append-only means the trigger
fired, which is itself a security event. The expected steady-state
rate of these errors is zero. A non-zero rate means either:
- An attacker / compromised service is attempting to delete prior audit rows -- treat as an active intrusion.
- A migration / data-cleanup script is incorrectly trying to UPDATE or DELETE from admin_audit -- still a security event; page the platform on-call before retrying anything.
Wallet ledger + balance log tamper-evidence (delivered, alembic 0036)
wallet.wallet_ledger (per-bucket money-movement audit trail) and
rgb.player_balance_log (legacy compatibility ledger) are
DB-level append-only as of alembic migration 0036. The trigger
posture mirrors 0032's admin_audit shape:
- trg_wallet_ledger_no_update_delete + trg_wallet_ledger_no_truncate call wallet.wallet_ledger_append_only() and raise EXCEPTION 'wallet_ledger is append-only; <op> not permitted on wallet.wallet_ledger'.
- trg_player_balance_log_no_update_delete + trg_player_balance_log_no_truncate call rgb.player_balance_log_append_only() and raise the analogous exception.
Each function lives in the same schema as the table it guards
(wallet vs. rgb) so cross-schema permission review stays simple.
Operator interpretation: any log line containing
wallet_ledger is append-only or player_balance_log is append-only
is a security event with the same handling as the
admin_audit is append-only case above -- page the platform on-call
before retrying anything. Steady-state rate is zero.
Recovery audit table (delivered, alembic 0033)
recovery_audit is created by alembic migration 0033 and consumed
by both the player and agent recovery routes. Every recovery
outcome inserts a typed append-only row. See "Recovery flow brand
isolation" above for the full contract; the dedicated counters
(recovery_sms_delivery_failed_total,
recovery_email_unprovisioned_total,
recovery_rate_limit_redis_outage_total) all have entries in
diagnosis-playbook.md.
Game-callback DLQ independent transaction (delivered, Codex-R2)
record_dead_letter in servers_v2/game_service/app/services/callback_brand.py
opens a fresh AsyncSession from app.state.pool and commits the
row inside that session before returning. The DLQ row therefore
survives the provider route's rejection-and-rollback path. See the
"Game callback DLQ procedure" section earlier in this runbook.
Tier 1 hardening sweep (delivered, T1-D-C1 .. T1-D-C5)
- T1-D-C1: gateway brand-aware session key lookup. The Phase 4E dual-read fallback and its gateway_session_legacy_key_used_total metric are retired; gateway_jwt_session_missing_total remains the JWT-valid-but-session-gone signal.
- T1-D-C2: VERIFY_CALLBACKS=true default + boot guard + game_callback_verification_bypassed_total{provider}.
- T1-D-C3: AES PII boot guard (assert_aes_pii_keys_configured) + removal of silent plaintext fallback + pii_aes_unconfigured_total{service,op}.
- T1-D-C4: registration + agent auth log redaction with RECOVERY_EXPOSE_TOKEN_FOR_TESTS boot-guard escape hatch.
- T1-D-C5 / I6: Plisio HMAC constant-time compare + replay protection (plisio_callback_replay_blocked_total{status} / plisio_callback_replay_redis_outage_total{env}) + log redaction.
Every counter has a per-counter playbook entry in
diagnosis-playbook.md.
Tier 4 hardening sweep (delivered, T4-D-I2 .. T4-D-I7)
- T4-D-I2: PER_CALLER_TOKEN_REQUIRED gate + internal_caller_token_legacy_rejected_total{consumer,caller}.
- T4-D-I3: admin_audit rows on agent-mutating + sensitive player admin routes + agent_balance_legacy_write_total{change_type} for the legacy direct-write surface.
- T4-D-I4: gateway JWT verifier hot-reload via Redis pub/sub on jwt_public_keys_changed + gateway_jwt_unknown_kid_total.
- T4-D-I5: recovery /request per-contact bucket fail-closed in production + recovery_rate_limit_redis_outage_total{bucket}.
- T4-D-I7: admin WebSocket token via Sec-WebSocket-Protocol subprotocol + per-IP rate limit + JWT-exp lifetime gate + admin_ws_legacy_query_token_total / admin_ws_rate_limited_total{reason}.
Every counter has a per-counter playbook entry in
diagnosis-playbook.md.
Tier 6 hardening sweep (delivered, T6-B1 .. T6-E4)
- T6-B1: admin_service migrated off legacy HS256 to RS256 admin JWTs; JWT_PRIVATE_KEY, JWT_PUBLIC_KEYS (JSON {kid: pem}), and JWT_KID are now required env vars. T7-B4 follow-up adds the assert_jwt_public_keys_configured boot guard so production refuses to boot when JWT_PUBLIC_KEYS is empty (otherwise decode_admin_token would fall back to the in-process test-keypair singleton and silently accept forged tokens).
- T6-B2: brand_id propagation from admin → wallet / rolling / promotion / recon. Admin's outbound clients now stamp X-Brand-Id + X-Brand-Signature so downstream brand-isolation checks see the right brand.
- T6-C1: 5 missed admin_audit-emitting agent routes covered (parity with T4-D-I3 surface). Audit is written AFTER the money / state mutation succeeds; failure surfaces via admin_audit_write_failed_total{route}.
- T6-C2: agent_brand seeded explicitly on create_agent (alembic 0028 dropped the autoseed trigger that previously did this in the DB). T7-A3 closed the follow-on agent_domain brand_id gap introduced in this same change.
- T6-C3: append-only triggers extended to wallet_ledger + player_balance_log via alembic 0036; the operator-facing entry is "Wallet ledger + balance log tamper-evidence" above.
- T6-D1: gateway JWT verifier hot-reload now refreshes from Settings (not just env) on the jwt_public_keys_changed Redis channel so a key rotation actually takes effect without a gateway restart.
- T6-D2: gateway per-IP rate limiter fails CLOSED on Redis outage in production-grade runtimes (HTTP 503 + Retry-After); dev/test remains fail-open. The new gateway_rate_limit_redis_outage_total{env, reason} counter has a dedicated playbook entry in diagnosis-playbook.md.
- T6-E1: BrandContext now rejects X-Brand-Id <= 0 (and any non-numeric value) as missing. Callers that need to operate without a brand must omit the header entirely; sending X-Brand-Id: 0 no longer silently creates a "brand 0" write.
- T6-E2: strict brand_id resolver in admin paths (adjust_player + player_finance) -- int(brand_id or 0) removed; missing brand surfaces a clear envelope to the operator instead of a split-brain wallet+audit row. T7-B1 hoisted the helper into app/services/brand_helpers.py so both files share it.
- T6-E4: BRAND_AWARE_PUBLISHING_BEGINS sentinel removed from wallet + player outbox pollers. T7-B2 finished the cleanup -- agent + rolling pollers no longer emit either, the rgb_contracts.events.base constant + helper are dropped, and consumers compare against an inline literal so the defensive skip survives without a contracts-side dependency.
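The T6-E1 rule (non-positive and non-numeric X-Brand-Id values are treated as missing) can be sketched as a small parser. parse_brand_header is a hypothetical name, and trimming surrounding whitespace is this sketch's choice; it is not the BrandContext implementation.

```python
from typing import Optional


def parse_brand_header(raw: Optional[str]) -> Optional[int]:
    """Return a positive brand id, or None when the header counts as missing.

    None means "no brand supplied"; callers must not coerce it to 0,
    which is exactly the silent "brand 0" write T6-E1 removed.
    """
    if raw is None:
        return None
    text = raw.strip()
    # isascii() excludes non-ASCII digit characters that isdigit() accepts.
    if not text.isascii() or not text.isdigit():
        return None
    value = int(text)
    return value if value > 0 else None
```

Negative values fall out naturally because the leading "-" fails the digit check, so "-5", "0", "abc", and an empty header all resolve to None.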
Tier 8 hardening sweep (delivered, T8-A .. T8-D)
- T8-A: comprehensive grep sweep closed every brand_id INSERT gap that
the prior iterations missed. Reach was wider than T6/T7 (raw text()
- SQLAlchemy
insert(<Model>)+ dotted/relative table refs); the fixes landed in agent_service legacy + create_agent paths (commit7b2e58e8) and a follow-on grep sweep (commit03748bf8).
- SQLAlchemy
- T8-B: eliminated the silent brand_id=0 fallback in game_service callbacks. The fallback used to swallow callbacks whose namespaced account did not reverse-parse, persisting a NULL-branded bet row; such callbacks are now hard-rejected, DLQ'd, and counted under game_callback_brand_unresolved_total{provider, reason="raw_brand_unresolved"}. See the diagnosis playbook entry for the new label.
- T8-C: Hotfix -- admin deps were passing only secret_key (not the full Settings) to decode_admin_token, breaking the RS256 resolver and returning 401 on every production admin call once the T6-B1 RS256 migration shipped. The two-line fix at admin_service/app/api/deps.py is committed at fed36449. Test fixtures were already passing the full Settings, which is why this surfaced in production rather than CI; T8-C also adds a regression test that uses a Settings-less call to catch the pattern in CI.
- T8-D: admin BrandResolutionMiddleware observability + skip-on-liveness -- the silent Redis exception path in admin's brand resolver now records brand_resolution_failed_total{service="admin_service", reason="redis_error"}, and /health, /healthz, /docs, /openapi.json, and /redoc skip the Redis hop entirely so a tight probe schedule cannot amplify a Redis brownout into observable route latency.
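The skip-on-liveness behaviour from T8-D can be sketched as a small path predicate. The path list is taken from the entry above; the function name, the exact-match rule, and the trailing-slash normalisation are assumptions made for illustration, not details of BrandResolutionMiddleware.

```python
# Paths named in the T8-D entry; matching them exactly is an assumption.
LIVENESS_AND_DOCS_PATHS = frozenset(
    {"/health", "/healthz", "/docs", "/openapi.json", "/redoc"}
)


def should_skip_brand_resolution(path: str) -> bool:
    """True when the request should not make the Redis hop at all."""
    # Normalise a trailing slash so "/health/" is treated like "/health".
    normalised = path.rstrip("/") or "/"
    return normalised in LIVENESS_AND_DOCS_PATHS
```

Checking the path before any Redis call is what keeps probe traffic from adding load during a Redis brownout.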
Tier 9 brand_id INSERT sweep (delivered, T9-A .. T9-G)
T9 is the last-mile audit on every brand-scoped INSERT in both the
v2 service tree and the legacy servers/ tree. The companion audit
baseline
2026-04-28-brand-id-insert-audit.md
captures the classifier inputs from migration 0024a plus the
brand-nullable-by-design exemptions from 0037 so that re-running the
documented audit commands any time the codebase changes will reproduce
the steady-state result RAW FIX=0, FIX-TEST=0, SA FIX=0, SA FIX-TEST=0.
- T9-A: the audit baseline -- exhaustive grep + classifier results were captured in docs/runbooks/multi-brand/2026-04-28-brand-id-insert-audit.md. Initial deltas: RAW FIX=55, SA FIX=3, FIX-TEST=6.
- T9-B / T9-B-cleanup: admin_service back-office CRUD INSERTs closed across promotion.py, player.py, player_finance.py, agent.py, banking.py, finance_config.py, and system.py using the new shared app/api/brand_ctx.py helper (resolve_brand_id + fail-loud brand_required_envelope, status=54). Test fixture in test_money_admin_audit.py now installs BrandResolutionMiddleware and defaults X-Brand-Id=1 in the shared auth header helper.
- T9-C: per-brand iteration in 4 admin scheduled tasks (stat.py::player_stat_acc / agent_stat_day, tag.py::_handle_tagging, expire.py::expire_coupons, check_expired.py::check_deposit / check_coupon). brand_id is sourced directly from the per-row source (p.brand_id for stats, (SELECT brand_id FROM player WHERE guid = :pid) subquery for tags, RETURNING brand_id from the UPDATE for expirations).
- T9-D: agent_service legacy /coupon/grant + /coupon/recycle (3 INSERT sites in legacy_routes.py) and the local-dev bootstrap_local_data.py (4 sites: bank, bank_card_group, bank_card, agent_setting -- hard-coded brand=1 with dev-only rationale comments).
- T9-E: relaxed shooter_sms.brand_id to nullable via alembic 0037 (the recon ingest path receives raw SMS payloads BEFORE any brand correlation is possible; brand_id is rewritten at match-time from the matched player_deposit). A new fail-loud helper recon_store.assert_sms_brand_after_match raises if the match-time UPDATE fails to land brand_id. Closed the 3 SA FIX sites (wallet_service::_log, player_service::PlayerLogin, rolling_service::PlayerRollingLog) by sourcing brand_id from request brand with player/rolling fallback. Closed 2 promo config sites (lossback_config, rebate_config) with the status=54 fail-loud envelope.
- T9-F: agent coupon cross-brand attribution -- both coupon.py::grant_coupon and recycle_coupon now hard-reject the request with a per-player brand check (token.brand_id != player.brand_id) BEFORE any UPDATE / INSERT lands. The agent aggregate is brand-global; the signed token is the request's authority for brand scope. Regression tests pin the refusal so a future regression surfaces at unit-test time.
- T9-G: docs sync (this entry) + classifier hardening so docstring / regex-literal mentions of INSERT INTO ... stop being mis-classified as production sites. The classifier's new BRAND_NULLABLE_BY_DESIGN set documents the 0037 exception so re-running it never re-flags shooter_sms.
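The T9-F ordering (brand check first, write second) can be sketched as follows. Token, Player, CrossBrandRefused, and the in-memory writes list are hypothetical stand-ins introduced for illustration; only the token.brand_id != player.brand_id comparison comes from the entry above.

```python
from dataclasses import dataclass


class CrossBrandRefused(Exception):
    """Raised when the token brand and the player brand disagree."""


@dataclass(frozen=True)
class Token:  # hypothetical stand-in for the signed agent token
    agent_id: int
    brand_id: int


@dataclass(frozen=True)
class Player:  # hypothetical stand-in for a player row
    guid: int
    brand_id: int


def assert_same_brand(token: Token, player: Player) -> None:
    """Run before any UPDATE / INSERT in grant or recycle."""
    if token.brand_id != player.brand_id:
        raise CrossBrandRefused(
            f"token brand {token.brand_id} != player brand {player.brand_id}"
        )


def grant_coupon(token: Token, player: Player, writes: list) -> None:
    assert_same_brand(token, player)  # refuse first
    # The write is attributed to the token's brand, the request's authority.
    writes.append(("coupon_grant", player.guid, token.brand_id))
```

Because the check raises before the append, a refused request leaves no partial write behind, which is the property the regression tests are described as pinning.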
Tier 14 brand_id RW (SELECT/UPDATE/DELETE) sweep (delivered, T14-A .. T14-C)
T14 is the symmetric cycle-closing batch for the read/update/delete
half of the brand-leak surface. Every T6-T13 cross-review only
audited INSERT statements; Codex's T14 review surfaced 8 explicit
RW leaks in admin SELECT/UPDATE/DELETE statements that no prior
classifier had ever looked at. T14 ships the matching audit
infrastructure, closes every Codex-flagged leak, and gates Phase 16
on the new classifier reporting RW FIX=0.
- T14-A: the audit baseline -- tools/audit/brand_id_rw_classifier.py (sibling of the INSERT classifier; conservative bias, ties go to FIX) plus the bounded universe doc 2026-04-29-brand-id-rw-audit.md. The classifier reads the same 122-table inventory from migration 0024a, supports the # brand-global: <reason> opt-out marker, and allow-lists admin_stat_acc / admin_stat_day (truly brand-global aggregates per data-ownership.md). Initial deltas: RW FIX=350 (235 SELECT + 88 UPDATE + 27 DELETE), FIX-TEST=3.
- T14-B: closed the 10 explicit Codex-T14 sites (every active-exploit vector in admin_service back-office routes): agree_agent_withdrawal + sibling decline_ / pay_ / list_agent_withdrawals / agent_withdrawal_stats in agent_finance.py; the 13 sibling stats endpoints in stats.py (provider_stat_day, player_stat_day, agent_stat_day, promotion_stat_day, usdt_stat_day, player_provider_stat_day, rebate_stat_day, player_stat_acc, agent_stat_acc, plus the stats_top.online_count legacy SELECT COUNT(*) FROM player); set_primary_level_domain (4 SQL statements all needed brand_id pinning) + registration_list in player.py; list_rolling in player_finance.py; list_promotions in promotion.py (Codex's 9d981cc9 fixed coupons/list but missed promotions/list); plus the middleware/brand.py docstring fix to remove the stale "fall back to body.brand_ids as before" language. stats.py::stats_player_per_provider also got the table-name fix from player_provider_stat to player_provider_stat_day (the original SQL was a runtime-500 bomb). New regression suite servers_v2/admin_service/tests/test_admin_rw_brand_scope_t14.py (12 tests) pins the same-brand SQL shape, the missing-brand-context fail-loud, and the cross-brand isolation contract per route.
- T14-C: docs sync (this entry) + Phase 16 gate. The RW classifier is now wired into .github/workflows/brand-id-audit.yml alongside the INSERT classifier (both must report FIX=0 for the workflow to pass) and into tools/multi_brand_backfill/phase16_dryrun.py as a hard precondition. The Makefile exposes make audit-brand-id-rw as the operator-facing wrapper.
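To make the classification rule concrete, here is a deliberately simplified sketch of a per-site check with a conservative bias. It is not the logic of tools/audit/brand_id_rw_classifier.py: the function name, the regexes, and the SKIP label are assumptions, and only the # brand-global: opt-out marker and the admin_stat_acc / admin_stat_day allow-list come from the T14-A entry.

```python
import re

RW_VERB = re.compile(r"\b(SELECT|UPDATE|DELETE)\b", re.IGNORECASE)
# brand_id only counts when it appears after WHERE or ON (a predicate),
# not when it is merely a selected or assigned column.
SCOPED = re.compile(r"\b(?:WHERE|ON)\b.*\bbrand_id\b", re.IGNORECASE | re.DOTALL)
OPT_OUT = re.compile(r"#\s*brand-global:\s*\S+")
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Brand-global aggregates allow-listed in T14-A.
BRAND_GLOBAL_TABLES = frozenset({"admin_stat_acc", "admin_stat_day"})


def classify_site(site: str, brand_scoped_tables: frozenset) -> str:
    """Classify one SQL site; anything not provably scoped becomes FIX."""
    if not RW_VERB.search(site):
        return "SKIP"  # hypothetical label for non-RW sites
    if OPT_OUT.search(site):
        return "OOS-INTENTIONAL"
    touched = (set(WORD.findall(site)) & brand_scoped_tables) - BRAND_GLOBAL_TABLES
    if not touched:
        return "OK"
    return "OK" if SCOPED.search(site) else "FIX"
```

A real classifier would need to parse multi-line statements and follow helper calls; this sketch only shows why a missing predicate defaults to FIX rather than OK.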
T15 supersedes the T14-D/E/F deferred-backlog plan. The 316 RW-FIX sites the T14-A baseline catalogued were swept end-to-end in Tier 15 (see the Tier 15 section below). make audit-brand-id-rw reports FIX=0 as of T15-E (commit d882362a) and stays at 0 through T16-A/B and the T16-C classifier upgrade.
Tier 15 brand_id RW exhaustive sweep (delivered, T15-A .. T15-E)
T15 closes the entire deferred RW-FIX backlog from T14. Each batch swept one chunk of the 316-site list, brand-scoped every SELECT / UPDATE / DELETE that touched a brand-scoped table, and added defence-in-depth predicates so the SQL itself refuses to scope-cross even when the authz layer is bypassed.
- T15-A (005719c3) -- wallet_service SELECT/UPDATE/DELETE: every read of player, coin_transaction, player_deposit, player_withdraw, player_balance_log, bank, player_usdt, player_rolling, etc. now carries either AND brand_id = :brand_id or a JOIN-on-parent-brand predicate. The _record_rebate_transfer_stat helper was rewritten so the cross-brand player_id collision can no longer leak another brand's account / agent_id / brand_id triple.
- T15-B (e50c8c99) -- promotion_service + rolling_service: brand-scoped every event/coupon/promotion/rolling read and write the two services issue. Includes the rolling FIFO progress writes (which previously ran as guid-only UPDATEs).
- T15-C (ce8504ae) -- player_service + agent_service + game_service + recon_service: brand-scoped the registration / login lookups, agent-tree reads, supplier callbacks, and the reconciliation cross-checks.
- T15-D (684f268e) -- admin_service money + finance + banking back-office routes: every player_deposit / player_withdraw / player_balance_log / bank / bank_card / level-rebate / coupon-grant SQL the back-office issues now carries the operator brand_id. _resolve_strict_brand_id is the canonical resolver for "operator OR row brand_id" decisions.
- T15-E (d882362a) -- admin_service residual RW + the entire FIX-TEST tail: the last admin SQL queries that lacked a brand predicate, plus every remaining test-fixture INSERT/UPDATE that was still issuing brand-less queries against the post-0024c schema. After T15-E the audit reports RW FIX=0 / FIX-TEST=0.
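The defence-in-depth predicate can be demonstrated against an in-memory SQLite database. The three-column player_rolling schema below is an assumption made for the demo (the real table has more columns), and add_progress is a hypothetical helper; the point is only that a guid shared by two brands is updated in exactly one of them once the UPDATE carries AND brand_id = :brand_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema, for illustration only.
conn.execute(
    "CREATE TABLE player_rolling (guid INTEGER, brand_id INTEGER, progress INTEGER)"
)
# The same guid exists in two brands: the collision case the sweep guards against.
conn.executemany(
    "INSERT INTO player_rolling VALUES (?, ?, ?)",
    [(1001, 1, 0), (1001, 2, 0)],
)


def add_progress(conn: sqlite3.Connection, guid: int, brand_id: int, amount: int) -> int:
    """Brand-pinned UPDATE; returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE player_rolling SET progress = progress + :amount "
        "WHERE guid = :guid AND brand_id = :brand_id",
        {"amount": amount, "guid": guid, "brand_id": brand_id},
    )
    return cur.rowcount


changed = add_progress(conn, guid=1001, brand_id=1, amount=50)
rows = conn.execute(
    "SELECT brand_id, progress FROM player_rolling ORDER BY brand_id"
).fetchall()
```

A guid-only UPDATE on the same data would have changed both rows; with the brand predicate, a caller can also treat a row count of 0 as a signal that the brand context did not match.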
Tier 16 CR9 closure + classifier dynamic-SQL upgrade (delivered, T16-A .. T16-D)
T16 is the post-T15 cross-review batch. CR9 (the dedicated
admin_service review with manual auditing of the back-office RW
contract) surfaced 8 findings the static classifier could not
catch — most notably the _query_log_table helper in
admin_service/app/api/routes/logs.py whose SQL was built
dynamically (WHERE 1=1 builder + f-string template) and so no
static line carried a literal brand_id token to be inspected.
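The blind spot can be sketched with a function-scope heuristic: when SQL is assembled at runtime, the check has to look at the whole enclosing function rather than the text(...) line. Everything below is illustrative. query_log_table_sketch is an invented stand-in, not the body of _query_log_table, and the regexes are assumptions rather than the classifier's real rules.

```python
import re

WHERE_1_EQ_1 = re.compile(r"WHERE\s+1\s*=\s*1", re.IGNORECASE)
TEMPLATED_FROM = re.compile(r"FROM\s+\{\w+\}", re.IGNORECASE)
TEMPLATED_WHERE = re.compile(r"WHERE\s+\{\w+\}", re.IGNORECASE)


def classify_dynamic(function_source: str) -> str:
    """Return FIX-DYNAMIC when runtime-built SQL never mentions brand_id."""
    is_builder = bool(WHERE_1_EQ_1.search(function_source)) and (
        ".append(" in function_source
    )
    is_template = bool(TEMPLATED_FROM.search(function_source)) and bool(
        TEMPLATED_WHERE.search(function_source)
    )
    if not (is_builder or is_template):
        return "STATIC"  # hypothetical label: leave it to the line-based pass
    # Scan the whole function, not just the text(...) line.
    return "OK" if "brand_id" in function_source else "FIX-DYNAMIC"


# Invented example source; note that no line carries a literal brand_id token.
LEAKY = '''
def query_log_table_sketch(db, table, filters):
    conditions = ["1=1"]
    if filters.get("account"):
        conditions.append("account = :account")
    where = " AND ".join(conditions)
    return db.execute(text(f"SELECT * FROM {table} WHERE {where}"), filters)
'''

PINNED = LEAKY.replace(
    'conditions = ["1=1"]', 'conditions = ["1=1", "brand_id = :brand_id"]'
)
```

Here classify_dynamic(LEAKY) flags the helper while classify_dynamic(PINNED) accepts it, even though the text(...) line is identical in both.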
- T16-A (a72ed1c8) -- admin_service Critical + Important leaks closed: logs.py::_query_log_table made brand-required and the 4 real-table log endpoints (player_balance / agent_balance / player_login / agent_login) refuse without X-Brand-Id; agent.py::list_agents switched to JOIN agent_brand on the operator brand (per ADR-009); agent.py::create_agent enforces body.brand_ids ⊆ operator brands; player.py tag CRUD pinned to brand; finance_config.py upserts validate the level row belongs to the operator brand; system.py::update_global_var / update_settings / update_maintenance gated to super_admin via require_super_admin with a new admin_platform_global_access_total counter; player_finance.py::force_complete_rolling migrated to the canonical resolve_brand_id + brand_required_envelope + _resolve_strict_brand_id pattern; deps.py super-admin role set widened to {super_admin, super} for the legacy bo/admin token. Test count: 217 → 238 (21 new T16-A regressions).
- T16-B (0e2f6fa8) -- wallet_service fail-loud brand context. Every if brand_id is not None: <append predicate> conditional in queries.py, wallet.py, and services/deposit_policy.py was converted to a three-part pattern: at the route boundary, refuse the call with Brand context required for <route>; in the helper, require brand_id: int (no Optional); in the SQL, unconditionally carry AND brand_id = :brand_id. 13 query routes + the /deposit pending+bank lookups + the validate_deposit_bonus_event helper + the _record_rebate_transfer_stat SELECT all converted. Plisio _resolve_manual_bonus_terms now returns zero-bonus when the deposit row is a legacy pre-backfill record without brand_id instead of issuing a brand-less event lookup.
- T16-C (934d47f4) -- RW classifier dynamic-SQL upgrade. New FIX-DYNAMIC class detects the two construction patterns the static classifier missed:
  - WHERE 1=1 builders that append predicates via conditions.append(...) -- the brand_id idiom may live in the helper that joins the conditions list, far from the matched text(f"...") call;
  - f-string SQL templates f"SELECT ... FROM {table} WHERE {where} ..." where both the table and the WHERE clause are templated.
  Same exit-code semantics as FIX (non-zero when count > 0) so any dynamic-builder regression breaks CI. .venv/lib/.../site-packages/... paths are now treated as OOS-LEGACY so SQLAlchemy core internals don't surface in the new pass. Final classifier output: FIX=0 + FIX-DYNAMIC=0 across the repo.
- T16-D (this batch) -- docs sync. Bumps the LVC across multi-brand-isolation-rollout.md, 2026-04-29-brand-id-rw-audit.md, 2026-04-28-brand-id-insert-audit.md, phase-16-hard-flip-checklist.md, and diagnosis-playbook.md; documents the post-T15 RW headline (FIX=0 / OK=219+97+24 / OOS-INTENTIONAL=91+18+5); records the 119 OOS-INTENTIONAL breakdown by category (resolver helpers, scheduled tasks, brand-global tables, cross-brand admin views); calls out the 2 known false-positive FIX-TEST entries in test_agent_integration.py (assertion strings that match the production INSERT shape, not actual writes); adds the make audit-brand-id-inserts && make audit-brand-id-rw gate as a Phase-16 hard-flip precondition; records the dynamic-SQL classifier upgrade in the admin_service service doc.
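The three-part fail-loud conversion from T16-B can be sketched as a boundary check plus a helper whose signature no longer admits a missing brand. The names require_brand and build_deposit_query, and the id / amount / player_id columns, are hypothetical; the error message format and the unconditional AND brand_id = :brand_id predicate follow the entry above.

```python
from typing import Dict, Tuple


class BrandContextRequired(Exception):
    """Raised at the route boundary when no brand was resolved."""


def require_brand(brand_id: object, route: str) -> int:
    """Route-boundary check: refuse instead of running a brand-less query."""
    # bool is excluded explicitly because isinstance(True, int) is True.
    if isinstance(brand_id, bool) or not isinstance(brand_id, int) or brand_id <= 0:
        raise BrandContextRequired(f"Brand context required for {route}")
    return brand_id


def build_deposit_query(player_id: int, brand_id: int) -> Tuple[str, Dict[str, int]]:
    """brand_id is a plain int (no Optional); the predicate is unconditional."""
    sql = (
        "SELECT id, amount FROM player_deposit "
        "WHERE player_id = :player_id AND brand_id = :brand_id"
    )
    return sql, {"player_id": player_id, "brand_id": brand_id}
```

The design choice is that there is no code path left in which the predicate is skipped: a missing brand fails at the boundary, and everything past the boundary always binds :brand_id.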
Known follow-ups
These items were identified during the cross-review but are intentionally out of scope of the in-progress rollout work. Each is tracked here so the spec coverage is honest and the next iteration can pick them up directly.
Future audit-row hardening
admin_audit and recovery_audit are append-only at the DB level
via 0032 / 0033. The remaining hardening still on the table (not
blocking Phase 16):
- Restrict the application role to INSERT-only on admin_audit via a separate DB role (defence-in-depth on top of the trigger).
- Hash-chain rows (prev_hash, row_hash) so any out-of-band catalog manipulation that bypasses the triggers is detectable on read.
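Since the hash chain is still a proposal, the following is only one possible shape for it. The prev_hash and row_hash field names come from the bullet above; the SHA-256 choice, the canonical-JSON encoding, the all-zero genesis value, and the function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first row


def row_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_row(chain: list, payload: dict) -> None:
    """Append a row whose hash commits to every row before it."""
    prev = chain[-1]["row_hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev_hash": prev, "row_hash": row_hash(prev, payload)}
    )


def first_broken_index(chain: list) -> int:
    """Return the index of the first row that fails verification, or -1."""
    prev = GENESIS
    for i, row in enumerate(chain):
        expected = row_hash(prev, row["payload"])
        if row["prev_hash"] != prev or row["row_hash"] != expected:
            return i
        prev = row["row_hash"]
    return -1
```

Editing or deleting any row changes the hash every later row depends on, so a verification pass on read locates the first inconsistent row even when the triggers were bypassed.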
Operator audit trail for MULTI_BRAND_ENFORCEMENT flips
The cluster-wide MULTI_BRAND_ENFORCEMENT flip is the single
most security-impactful operator action on the platform. As of
the P1-5 hardening, flip_enforcement.py requires two new
arguments:
- --operator-id: identity of the operator running the flip (no default; suggested format: email or LDAP id). Recorded in admin_audit.operator_id.
- --reason: human-readable justification (no default; suggested format: ticket id plus a short sentence). Recorded in admin_audit.new_value.reason.
Every invocation of the script (forward, rollback, dry-run) writes
exactly one admin_audit row with action='ENFORCEMENT_MODE_FLIP',
resource='MULTI_BRAND_ENFORCEMENT', resource_key=<env>, and
new_value carrying the env, direction, target_mode, services and
reason. The script reuses the same INSERT shape as
servers_v2/admin_service/app/services/admin_audit.py.
The --no-audit escape hatch is reserved for offline rehearsals
where no database is reachable; using it during a real flip is a
process violation. If DATABASE_URL is unreachable and --no-audit
is not passed, the script exits 3 (audit row write failed) rather
than printing the plan and silently succeeding -- the audit trail
is the whole point of wiring this in.
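The argument and exit-code contract described above can be sketched with argparse. This is not the source of flip_enforcement.py: --operator-id, --reason, --no-audit, the audit-row field values, and exit code 3 come from the text, while --env, --target-mode, the function names, and the write_audit_row callback are hypothetical placeholders.

```python
import argparse

EXIT_AUDIT_WRITE_FAILED = 3  # exit code named in the text above


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="flip_enforcement.py")
    parser.add_argument("--operator-id", required=True)  # no default
    parser.add_argument("--reason", required=True)  # no default
    parser.add_argument("--no-audit", action="store_true")
    # Hypothetical flags: the script's other options are not documented here.
    parser.add_argument("--env", required=True)
    parser.add_argument("--target-mode", required=True)
    return parser


def build_audit_row(args: argparse.Namespace, direction: str, services: list) -> dict:
    """One admin_audit row per invocation (forward, rollback, or dry-run)."""
    return {
        "action": "ENFORCEMENT_MODE_FLIP",
        "resource": "MULTI_BRAND_ENFORCEMENT",
        "resource_key": args.env,
        "operator_id": args.operator_id,
        "new_value": {
            "env": args.env,
            "direction": direction,
            "target_mode": args.target_mode,
            "services": services,
            "reason": args.reason,
        },
    }


def run(args: argparse.Namespace, direction: str, services: list, write_audit_row) -> int:
    """Return the process exit code; write_audit_row is an injected callback."""
    if args.no_audit:
        return 0  # offline rehearsal only
    try:
        write_audit_row(build_audit_row(args, direction, services))
    except Exception:
        # Fail loudly rather than print the plan and silently succeed.
        return EXIT_AUDIT_WRITE_FAILED
    return 0
```

Marking both flags as required means argparse itself rejects an anonymous or unjustified flip before any plan is computed.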
Auditing the audit trail: query
SELECT * FROM admin_audit WHERE action = 'ENFORCEMENT_MODE_FLIP' ORDER BY created_at DESC to see every flip generated against the
target environment. Combined with the append-only triggers above,
this is the canonical record for "who flipped enforcement, when,
and why" -- and it cannot be retroactively edited.
Game callback DLQ table migration
Resolved by alembic migration 0030 (game_callback_dead_letter).
servers_v2/game_service/app/services/callback_brand.py now persists
every brand-unresolved callback to the table best-effort before logging + incrementing game_callback_brand_unresolved_total{provider}. DB errors during the INSERT are caught and logged at WARN -- they never block the provider response. Operator interaction is via servers_v2/tools/multi_brand_backfill/replay_game_callback_dlq.py; see the "Game callback DLQ procedure" section above.
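The best-effort persistence order (try the INSERT, swallow and log any failure, then log and count regardless) can be sketched as below. The function name, the injected insert_row callback, and the Counter standing in for the Prometheus metric are assumptions; this is not the code in callback_brand.py.

```python
import logging
from collections import Counter

logger = logging.getLogger("game_callback_dlq_sketch")
# Stand-in for game_callback_brand_unresolved_total{provider}.
unresolved_total = Counter()


def record_unresolved_callback(insert_row, provider: str, payload: dict) -> bool:
    """Persist best-effort, then log + count. Returns True if the row landed."""
    persisted = True
    try:
        insert_row({"provider": provider, "payload": payload})
    except Exception as exc:  # deliberately broad: the DLQ write must never raise
        persisted = False
        logger.warning("DLQ insert failed for provider=%s: %s", provider, exc)
    # Logging and counting happen whether or not the row was persisted.
    logger.warning("brand-unresolved callback from provider=%s", provider)
    unresolved_total[provider] += 1
    return persisted
```

Because the function never raises, the caller can build the provider response unconditionally; the counter still moves when the database is down, so the alert fires even if the DLQ row is lost.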