Phase 16 Hard-Flip On-Call Checklist
Status
Ready
Date
2026-04-28
Owners
- Platform Backend
- On-Call
Last Verified Commit
934d47f4 (T16-D doc sync — adds the make audit-brand-id-inserts && make audit-brand-id-rw exit-code gate as a hard precondition; both
must report FIX=0 AND FIX-DYNAMIC=0 before the flip starts)
Purpose
This is the linear cheat sheet the on-call follows when flipping
MULTI_BRAND_ENFORCEMENT from observe to enforce in a target
environment. The canonical multi-step description lives in
docs/plans/multi-brand/2026-04-27-multi-brand-isolation-plan.md Phase
16 and the deeper rationale is in
docs/runbooks/multi-brand/multi-brand-isolation-rollout.md. This
file is the consolidated handbook used during the live flip; it does
not introduce any new policy.
2026-05-06 in-repo posture update. The servers_v2/docker-compose.yml
defaults are now production-grade out of the box:
MULTI_BRAND_ENFORCEMENT=enforce, WALLET_BRAND_SIGNATURE_REQUIRE=on,
PER_CALLER_TOKEN_REQUIRED=on, VERIFY_CALLBACKS=true. Every wallet entry
point now envelope-rejects on a missing X-Brand-Id upfront
(bucket_wallet.py, wallet.py, topology.py, queries.py), so the
store-layer default-brand fallback is unreachable from any production
route. The wallet_topology_default_brand_fallback_total counter is
retained to verify that the fallback path never fires in soak. The
checks below remain the operator gate for the actual production rollout
(soak, sequencing, wall-clock budget, mid-flip abort/revert). A passing
local test suite is not a substitute for these checks.
The reproducible operator script that prints the per-environment plan
is at servers_v2/tools/multi_brand_backfill/flip_enforcement.py.
The offline rehearsal harness that drives this checklist non-mutably
(precondition checks + critical-counter gate + GO/NO-GO verdict) is at
tools/multi_brand_backfill/phase16_dryrun.sh. Run it BEFORE every
live flip:
tools/multi_brand_backfill/phase16_dryrun.sh \
--metrics-snapshot /tmp/${ENV}-metrics.txt
The wrapper invokes tools/multi_brand_backfill/phase16_dryrun.py
under the hood. Exit code 0 = GO, 1 = NO-GO. The harness is
dependency-light (stdlib only) so it runs from any operator laptop
without provisioning a per-service venv.
Pre-Flip Gate
Every check below must pass against the target environment before the
flip starts. Soak window: at least 2 * JWT_EXPIRE_MIN minutes from
the moment Phase 5 was deployed in the target environment (default
JWT_EXPIRE_MIN = 60, so 120 minutes minimum). If the refresh-token
TTL is longer than 2 * JWT_EXPIRE_MIN, extend the soak window to
2 * REFRESH_TOKEN_TTL.
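The soak-window rule above can be sketched as a small calculation. This is only an illustration, not part of the repo tooling: the function name is hypothetical, and it assumes REFRESH_TOKEN_TTL is expressed in minutes, like JWT_EXPIRE_MIN.

```python
from typing import Optional

DEFAULT_JWT_EXPIRE_MIN = 60  # default stated in this checklist


def soak_window_minutes(
    jwt_expire_min: int = DEFAULT_JWT_EXPIRE_MIN,
    refresh_token_ttl_min: Optional[int] = None,  # assumption: TTL given in minutes
) -> int:
    """Minimum soak window, in minutes, counted from the Phase 5 deploy."""
    if jwt_expire_min <= 0:
        raise ValueError("jwt_expire_min must be positive")
    base = 2 * jwt_expire_min
    # Extend only when the refresh-token TTL is longer than 2 * JWT_EXPIRE_MIN.
    if refresh_token_ttl_min is not None and refresh_token_ttl_min > base:
        return 2 * refresh_token_ttl_min
    return base


print(soak_window_minutes())          # 120
print(soak_window_minutes(60, 1440))  # 2880
```

With the default JWT_EXPIRE_MIN and no long-lived refresh token, the result is the 120-minute minimum quoted above.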
Audit gate (T16-D — added as hard precondition)
Before any other pre-flip check, the brand_id audit pair must pass on the deployed commit. Both classifiers must exit 0 (no FIX, no FIX-DYNAMIC, no FIX-TEST):
make audit-brand-id-inserts && make audit-brand-id-rw
Expected output (post-T16):
- INSERT: RAW: FIX=0 FIX-TEST=0, SA: FIX=0 FIX-TEST=0 (only the 2 documented assertion-string false positives in test_agent_integration.py:1063,1068 may surface; see 2026-04-28-brand-id-insert-audit.md § FIX-TEST callout).
- RW: SELECT FIX=0 FIX-DYNAMIC=0, UPDATE FIX=0 FIX-DYNAMIC=0, DELETE FIX=0 FIX-DYNAMIC=0.
If either classifier reports any FIX/FIX-DYNAMIC entry, STOP — the deployed commit has a brand-leak the audit can statically prove. Triage and revert before proceeding.
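As a cross-check on top of the make exit codes, the summary lines can be scanned for non-zero counts. The sketch below is hypothetical and assumes the output contains LABEL=N tokens shaped like the expected output above; the exit code of the make targets remains the authoritative gate. It treats FIX and FIX-DYNAMIC as blocking and only reports FIX-TEST, since two documented false positives may surface there.

```python
import re
from typing import Dict

# Matches FIX=N, FIX-DYNAMIC=N and FIX-TEST=N tokens in the audit summary.
_COUNT = re.compile(r"\b(FIX(?:-DYNAMIC|-TEST)?)=(\d+)")


def audit_totals(output: str) -> Dict[str, int]:
    """Sum every counter of each kind found in the audit output."""
    totals = {"FIX": 0, "FIX-DYNAMIC": 0, "FIX-TEST": 0}
    for label, value in _COUNT.findall(output):
        totals[label] += int(value)
    return totals


def audit_gate_passes(output: str) -> bool:
    """True only if counters were found and FIX / FIX-DYNAMIC are all zero."""
    if not _COUNT.search(output):
        return False  # no summary at all is treated as a failure, not a pass
    totals = audit_totals(output)
    return totals["FIX"] == 0 and totals["FIX-DYNAMIC"] == 0


sample = "SELECT FIX=0 FIX-DYNAMIC=0, UPDATE FIX=0 FIX-DYNAMIC=0, DELETE FIX=0 FIX-DYNAMIC=0"
print(audit_gate_passes(sample))  # True
```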
Soak gate
- Soak window has been observed for the documented length.
- brand_resolution_failed_total{reason="jwt_domain_mismatch"} counter is at zero across the soak window.
- brand_resolution_failed_total{reason="missing_header"} counter is at zero across the soak window (proves every internal caller forwards X-Brand-Id on brand-scoped routes).
- wallet_cross_brand_rejected_total{mode="observe"} counter is at zero across the soak window.
- wallet_brand_signature_failed_total{mode="observe"} counter is at zero across the soak window.
- event_legacy_schema_total{stream,consumer} counter is at zero across the soak window.
- agent_brand row count covers every active agent. Run:
  SELECT (SELECT count(*) FROM agent) AS agents, (SELECT count(DISTINCT agent_id) FROM agent_brand) AS bound;
  and confirm both numbers match. If they diverge, run the Phase 12 one-shot re-seed (see plan Phase 12) and re-check before proceeding.
- All callers send X-Caller-Service header (verified by test_per_caller_header_audit.py passing). Run:
  cd servers_v2 && uv run pytest tests/test_per_caller_header_audit.py -v
  and confirm 0 failures. This regression net catches any new internal HTTP client that forgets the per-caller identifier.
- internal_caller_token_legacy_total is zero across the soak window for every consumer service. Per-consumer query:
  sum by (consumer) (rate(internal_caller_token_legacy_total[15m])) == 0
  A non-zero value means at least one caller is still authenticating with the shared INTERNAL_SERVICE_TOKEN; resolve before proceeding.
- Stage C (revoke shared INTERNAL_SERVICE_TOKEN) has completed in this environment. Verify by printenv | grep INTERNAL_SERVICE_TOKEN on each consumer container; only the per-caller variants (INTERNAL_SERVICE_TOKEN_<CALLER>) should be present, the bare INTERNAL_SERVICE_TOKEN must be absent. See the per-caller-service token rollout section in docs/runbooks/multi-brand/multi-brand-isolation-rollout.md.
- PER_CALLER_TOKEN_REQUIRED=on is set on every internal-token-accepting service (wallet_service, agent_service, player_service, promotion_service, rolling_service, recon_service). Verify by printenv PER_CALLER_TOKEN_REQUIRED on each container. With this flag on, the legacy single-token fallback hard-rejects (403 + internal_caller_token_legacy_rejected_total) instead of accepting -- closing the spoof window where an attacker holding the legacy secret could claim ANY X-Caller-Service value that hadn't been migrated yet. T4-D-I2 added this gate; it must advance together with MULTI_BRAND_ENFORCEMENT=enforce because the boot-time advisory flags an out-of-sync state but does not prevent boot.
- WALLET_BRAND_SIGNATURE_REQUIRE=on in the target environment AND wallet_brand_signature_missing_total{caller_service=*} == 0 across the soak window. A non-zero value identifies a caller that is not signing brand-scoped writes; resolve before proceeding. See "Brand-signature enforcement flip" in docs/runbooks/multi-brand/multi-brand-isolation-rollout.md.
- BRAND_SIGNING_KEY is set (non-empty) on every wallet pod. Verify by querying sum(rate(wallet_brand_signature_misconfigured_total[15m])) == 0 across the soak window AND confirm the boot-time guard (assert_brand_signing_key_configured in wallet_service/app/main.py) did not surface in any pod's boot logs. The combination MULTI_BRAND_ENFORCEMENT=enforce AND WALLET_BRAND_SIGNATURE_REQUIRE=on AND empty BRAND_SIGNING_KEY is a fail-open: the boot guard refuses to start the pod, the runtime path fails closed if the guard is bypassed, but the safest gate is to prove the value is provisioned BEFORE the flip. Cross-check kubectl exec deploy/wallet-service -- printenv BRAND_SIGNING_KEY | wc -c returns >1 byte on every wallet pod.
- VERIFY_CALLBACKS=true on every game_service pod. Run kubectl exec deploy/game-service -- printenv | grep VERIFY_CALLBACKS and confirm VERIFY_CALLBACKS=true is present on every replica. Cross-check sum(rate(game_callback_verification_bypassed_total[15m])) == 0 across the soak window. The combination runtime=production AND VERIFY_CALLBACKS=false is a fail-open that exposes /HO/, /mg/, /wc/ (gateway public path prefixes) to the public internet without signature verification; the boot-time guard (assert_verify_callbacks_safe_for_runtime in game_service/app/core/boot_guards.py) refuses to start the pod in that combination, but the safest gate is to prove the value is provisioned BEFORE the flip. Any non-zero rate on the bypass counter means at least one pod started in a non-prod runtime label and is silently skipping verification -- root-cause before flipping.
- game_callback_brand_unresolved_total{reason="raw_account_in_enforce"} AND {reason="raw_guid_in_enforce"} are zero across the soak window. Non-zero means a game-provider callback path still sends the legacy raw-account / raw-guid form; that path is brand-unaware and is rejected outright in enforce mode (see Bug #3 / game_service/app/services/callback_brand.py). Migrate the upstream caller to send the namespaced {brand_code}_{account} form and re-soak before proceeding.
- wallet_topology_default_brand_fallback_total == 0 across the soak window. A non-zero value means the per-brand topology lookup silently fell back to the default brand for at least one request -- root-cause before the flip (likely a missing per-brand topology row).
- On-call has paged backup, opened the incident channel placeholder, and started the 30-minute wall-clock budget timer.
- On-call has the multi-brand counter Diagnosis Playbook open in a tab: diagnosis-playbook.md. Every pre-flip counter check above (signature, cross-brand, topology fallback) has a per-counter "what does it mean / what to do" section there, for the case where a counter is non-zero during the soak window and root-cause is needed before proceeding.
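The critical-zero counter checks in the soak gate can be approximated offline against a metrics snapshot in Prometheus text exposition format. The sketch below is not the phase16_dryrun.py harness; the function name and the counter subset are illustrative assumptions. A point-in-time snapshot is also a weaker signal than a range query over the whole soak window, because counters reset when a pod restarts.

```python
from typing import Dict, Sequence, Tuple

# (metric name, label selector substring); "" matches every series of the metric.
# Illustrative subset of the soak-gate counters listed above.
CRITICAL_COUNTERS: Sequence[Tuple[str, str]] = (
    ("brand_resolution_failed_total", 'reason="jwt_domain_mismatch"'),
    ("brand_resolution_failed_total", 'reason="missing_header"'),
    ("wallet_cross_brand_rejected_total", 'mode="observe"'),
    ("wallet_brand_signature_failed_total", 'mode="observe"'),
    ("event_legacy_schema_total", ""),
    ("wallet_topology_default_brand_fallback_total", ""),
)


def nonzero_critical_series(
    snapshot: str, critical: Sequence[Tuple[str, str]] = CRITICAL_COUNTERS
) -> Dict[str, float]:
    """Return every critical series whose snapshot value is not zero."""
    offenders: Dict[str, float] = {}
    for raw in snapshot.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            cut = line.rindex("}") + 1
            series, rest = line[:cut], line[cut:].split()
        else:
            series, *rest = line.split()
        if not rest:
            continue  # no sample value on this line
        name = series.split("{", 1)[0]
        value = float(rest[0])  # an optional trailing timestamp is ignored
        for metric, selector in critical:
            if name == metric and selector in series and value != 0:
                offenders[series] = value
    return offenders
```

An empty result corresponds to a GO on this gate; any entry names the series to look up in the Diagnosis Playbook.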
Flip Sequence
Run the operator script first to print a per-environment plan into the
incident channel for record-keeping. As of the P1-5 hardening the
script requires --operator-id and --reason and writes exactly one
admin_audit row per invocation (see "Audit row tamper-evidence" and
"Operator audit trail for MULTI_BRAND_ENFORCEMENT flips" in the
rollout runbook):
DATABASE_URL=<env DSN> \
python -m servers_v2.tools.multi_brand_backfill.flip_enforcement \
--env <staging|production> \
--operator-id <your-email-or-ldap-id> \
--reason "<ticket-id>: <short justification>"
If the script exits non-zero, stop. Exit 3 means the plan was
printed but the admin_audit row write failed -- the security trail
is missing and the flip MUST NOT proceed until the row lands.
Note: admin_audit is DB-level append-only as of migration 0032. Any
"missing audit row" alarm or "admin_audit is append-only" log line is
itself a security event -- page the platform on-call rather than
retrying.
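The exit-code handling described above reduces to a three-way decision. A minimal sketch, with a hypothetical helper name; only exit codes 0 and 3 have a meaning documented in this checklist, so every other non-zero code is treated as a generic stop.

```python
def flip_script_verdict(exit_code: int) -> str:
    """Map the operator script's exit code to the on-call's next action."""
    if exit_code == 0:
        return "proceed"
    if exit_code == 3:
        # Plan was printed, but the admin_audit row write failed.
        return "stop: audit row missing, do not flip until the row lands"
    return "stop: operator script failed"


print(flip_script_verdict(3))
```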
Then execute the steps below in order. Downstream first; edge last. Per-step budget: 5 minutes wall-clock. Total budget: 30 minutes.
For each step:
- Set MULTI_BRAND_ENFORCEMENT=enforce for the listed service in the environment's deploy config.
- Restart the API container and any worker container listed.
- Curl /health on the just-restarted service and confirm the response shows multi_brand_enforcement_mode=enforce.
- Query Prometheus for multi_brand_enforcement_mode{service="<svc>"} == 2.
- Query Prometheus for sum(rate(wallet_cross_brand_rejected_total{mode="enforce"}[1m])) at baseline (zero or known background).
- If any check fails twice or the step exceeds 5 minutes, abort and revert.
Steps:
- Step 1 -- wallet_service + wallet_worker.
- Step 2 -- rolling_service + rolling_worker.
- Step 3 -- promotion_service + promotion_worker.
- Step 4 -- player_service.
- Step 5 -- recon_service + recon_worker.
- Step 6 -- agent_service + agent_worker.
- Step 7 -- gateway.
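The per-step rule (abort if a check fails twice or the step exceeds 5 minutes) can be sketched as a small driver. This is a hypothetical helper, not part of flip_enforcement.py. It assumes each readiness check is a zero-argument callable returning True on success, and it counts failures per check; treat that reading of "fails twice" as an assumption.

```python
import time
from typing import Callable, Sequence, Tuple

STEP_BUDGET_S = 5 * 60  # per-step wall-clock budget from this checklist


def run_step_checks(
    checks: Sequence[Tuple[str, Callable[[], bool]]],
    budget_s: float = STEP_BUDGET_S,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, str]:
    """Run the checks in order; return (ok, reason)."""
    start = clock()
    for name, check in checks:
        failures = 0
        while not check():
            failures += 1
            if failures >= 2:
                return False, f"{name} failed twice"
            if clock() - start > budget_s:
                return False, f"budget exceeded while retrying {name}"
    # A step that passes every check but overruns the budget still aborts.
    if clock() - start > budget_s:
        return False, "budget exceeded"
    return True, "ok"
```

A False result means abort and revert; the reason string is a candidate for the --reason argument of the rollback invocation.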
Mid-Flip Abort Criteria
Abort and revert if any of the following fires:
- A single service step exceeds 5 minutes wall-clock.
- The readiness check fails twice on the same service.
- The wallet_cross_brand_rejected_total{mode="enforce"} alert fires during the flip.
- The total budget exceeds 30 minutes wall-clock.
Mid-Flip Revert Procedure
Run the operator script with --rollback to print the reverse plan
(same --operator-id + --reason requirements; the audit row records
the rollback intent and the abort reason):
DATABASE_URL=<env DSN> \
python -m servers_v2.tools.multi_brand_backfill.flip_enforcement \
--env <staging|production> \
--operator-id <your-email-or-ldap-id> \
--reason "abort: <which abort criterion fired>" \
--rollback
Then execute the steps in reverse order. Edge first; wallet last. Per-step budget unchanged.
- Revert step 1 -- gateway -> observe.
- Revert step 2 -- agent_service + agent_worker -> observe.
- Revert step 3 -- recon_service + recon_worker -> observe.
- Revert step 4 -- player_service -> observe.
- Revert step 5 -- promotion_service + promotion_worker -> observe.
- Revert step 6 -- rolling_service + rolling_worker -> observe.
- Revert step 7 -- wallet_service + wallet_worker -> observe.
After every reverted service shows observe again, document the
incident in #incidents and root-cause the failure before any retry.
A retry MUST start from a fresh soak window.
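The revert order is the forward order reversed, which a few lines can make explicit. The sketch below simply regenerates the revert list from the forward steps; the names are illustrative, and the operator script's --rollback output remains the source of truth.

```python
from typing import List, Sequence

# Forward flip order from the Flip Sequence section: downstream first, edge last.
FORWARD_STEPS: Sequence[str] = (
    "wallet_service + wallet_worker",
    "rolling_service + rolling_worker",
    "promotion_service + promotion_worker",
    "player_service",
    "recon_service + recon_worker",
    "agent_service + agent_worker",
    "gateway",
)


def revert_plan(forward: Sequence[str] = FORWARD_STEPS) -> List[str]:
    """Return the revert steps: edge first, wallet last."""
    return [
        f"Revert step {i} -- {svc} -> observe."
        for i, svc in enumerate(reversed(forward), start=1)
    ]


for line in revert_plan():
    print(line)
```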
Post-Flip Verification
After all 7 forward steps land cleanly:
- Every service reports enforce via Prometheus multi_brand_enforcement_mode == 2.
- security_downgrade_total counter remains at zero in the hour after the flip (would fire if any service later re-bootstrapped with the wrong mode).
- Synthetic test: a cross-brand JWT replay against gateway returns the documented reject envelope.
- Synthetic test: a JWT without brand_id against gateway returns the documented reject envelope.
- No spike in legitimate rejections in normal traffic.
- docs/runbooks/multi-brand/multi-brand-isolation-rollout.md release-gate table is updated to mark this environment as hard-enforce.
- Incident channel placeholder closed; on-call hands over.
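One pitfall with the first check: a query of the form multi_brand_enforcement_mode == 2 returns nothing for a service whose gauge is absent, so a missing series can look the same as a passing one. The sketch below compares the observed values against an expected service list. It assumes the service label values equal the service names used in the flip steps, which this checklist does not confirm.

```python
from typing import Dict, List, Sequence

ENFORCE = 2  # gauge value for enforce, per the flip-step checks

# Assumption: label values match the service names in the flip steps.
EXPECTED_SERVICES: Sequence[str] = (
    "wallet_service",
    "rolling_service",
    "promotion_service",
    "player_service",
    "recon_service",
    "agent_service",
    "gateway",
)


def services_not_enforcing(
    mode_by_service: Dict[str, float],
    expected: Sequence[str] = EXPECTED_SERVICES,
) -> List[str]:
    """List expected services whose gauge is missing or not equal to 2."""
    return [svc for svc in expected if mode_by_service.get(svc) != ENFORCE]


observed = {svc: 2.0 for svc in EXPECTED_SERVICES if svc != "gateway"}
print(services_not_enforcing(observed))  # ['gateway']
```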
Phase 17 — Post-flip hardening (not blocking Phase 16)
After Phase 16 stabilises (typically 2 weeks of soak with the critical-zero counter gate holding), execute these in order. Each is self-contained and reversible.
T18 closed the code paths for #1, #2, #3 + the supporting infrastructure for #4–#8 — what remains is the operator decision + env-var flip in each target environment.
- audit_registration refactor onto shared helper -- closed in commit 0939a91f (T18-B). The route now calls assert_player_brand_matches_operator; behaviour is identical, and counter cardinality matches the other 13 admin routes.
- Composite indexes on 6 high-volume tables -- closed in commit f5b15966 (T18-A) via migration 0039_brand_id_composite_indexes.py. Apply the migration in each target environment via the standard alembic flow:
  cd servers_v2/shared/rgb_db && uv run alembic upgrade head
  The indexes are built CONCURRENTLY, so no exclusive locks are taken against writers. Verify via \di+ rgb.ix_player_deposit_brand_status_create_ts etc.
- AES_STRICT_DECRYPT=on flip -- once pii_aes_legacy_plaintext_observed_total{service=*,op=decrypt} drains to zero across player_service / agent_service / admin_service for one full week, set the env var on each service:
  # On each player/agent/admin pod (or via the secrets manager):
  AES_STRICT_DECRYPT=on
  T18-E1 added a boot-time WARN that fires on every restart when the flag is OFF in production (warn_aes_strict_decrypt_recommendation) so on-call sees the deferred flip in the boot logs. After the flip the WARN goes silent. The strict-raise behaviour has been wired since T12-C2 -- the flag toggle is the only thing left to do.
- Legacy INTERNAL_SERVICE_TOKEN removal -- execute in two steps:
  - Verify internal_caller_token_legacy_total has held zero for two soak windows.
  - Set INTERNAL_SERVICE_TOKEN_DEPRECATED=on on every internal-token-accepting service (wallet_service, agent_service, player_service, promotion_service, rolling_service, recon_service, admin_service) AND remove the legacy INTERNAL_SERVICE_TOKEN env var in the same deploy:
    # On each consumer pod:
    INTERNAL_SERVICE_TOKEN_DEPRECATED=on
    unset INTERNAL_SERVICE_TOKEN
    # The per-caller variants INTERNAL_SERVICE_TOKEN_<CALLER>
    # remain set -- they are now the only auth path.
    T18-E2 added the boot guard assert_internal_service_token_deprecated_state -- if the deprecation flag is on AND the legacy var is still present the service refuses to boot, forcing the operator to remove the legacy var as a single atomic step.
- Gateway legacy session fallback removal -- delivered 2026-05-06 as part of the Phase 4E dual-read sunset cleanup. Gateway now reads only user-session:brand:{brand_id}:{account}; GATEWAY_SESSION_LEGACY_KEY_DISABLED and gateway_session_legacy_key_used_total are retired. The retained signal is gateway_jwt_session_missing_total.
  Historical incident notes that mention GATEWAY_SESSION_LEGACY_KEY_DISABLED describe the pre-sunset transition branch; that flag no longer exists in gateway.
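The boot-guard behaviour described for the legacy INTERNAL_SERVICE_TOKEN removal can be illustrated with a short sketch. This is not the code of assert_internal_service_token_deprecated_state; the function name is hypothetical, and it assumes that any presence of the legacy variable (even with an empty value) counts as "still present".

```python
import os
from typing import Mapping


def check_legacy_token_removed(env: Mapping[str, str] = os.environ) -> None:
    """Refuse to boot when the deprecation flag is on but the legacy var remains."""
    deprecated = env.get("INTERNAL_SERVICE_TOKEN_DEPRECATED", "").strip().lower() == "on"
    # Exact key match: per-caller INTERNAL_SERVICE_TOKEN_<CALLER> variants are allowed.
    if deprecated and "INTERNAL_SERVICE_TOKEN" in env:
        raise RuntimeError(
            "INTERNAL_SERVICE_TOKEN_DEPRECATED=on but INTERNAL_SERVICE_TOKEN is "
            "still set; remove the legacy variable in the same deploy"
        )


# Passes: the flag is on and only a per-caller variant remains.
check_legacy_token_removed(
    {"INTERNAL_SERVICE_TOKEN_DEPRECATED": "on", "INTERNAL_SERVICE_TOKEN_WALLET": "s3cret"}
)
```

Passing the environment as a mapping keeps the check easy to exercise without touching the real process environment.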
See Also
- docs/plans/multi-brand/2026-04-27-multi-brand-isolation-plan.md -- canonical Phase 16 step definition.
- docs/runbooks/multi-brand/multi-brand-isolation-rollout.md -- rationale, release gate, soak-window reasoning.
- docs/specs/multi-brand/2026-04-27-multi-brand-isolation-spec.md -- acceptance criteria and observability counter definitions.
- docs/adr/ADR-009-multi-brand-domain-routed-isolation.md -- the decision the flip enforces.
- servers_v2/tools/multi_brand_backfill/flip_enforcement.py -- the reproducible operator script that prints this plan.
- docs/runbooks/multi-brand/diagnosis-playbook.md -- per-counter operator-facing playbook for every multi-brand counter referenced in the pre-flip gate above.