Phase 16 Hard-Flip On-Call Checklist

Status

Ready

2026-07-18 game-boundary update: Native provider, account reverse-parse and normalized Aggregator-wallet checks are retired. The live release gate is the signed Provider-protocol relay described in docs/services/game-service.md.

Date

2026-04-28

Owners

Platform Backend
On-Call

Last Verified Commit

Use git log -- <this file> for current last-touch history; this field is intentionally not pinned to a static hash so it does not become stale after unrelated commits.

Purpose

This is the linear cheat sheet the on-call follows when flipping MULTI_BRAND_ENFORCEMENT from observe to enforce in a target environment. The canonical multi-step description lives in docs/plans/multi-brand/2026-04-27-multi-brand-isolation-plan.md Phase 16 and the deeper rationale is in docs/runbooks/multi-brand/multi-brand-isolation-rollout.md. This file is the consolidated handbook used during the live flip; it does not introduce any new policy.

2026-05-06 in-repo posture update. The servers_v2/docker-compose.yml defaults are now production-grade out of the box: MULTI_BRAND_ENFORCEMENT=enforce, WALLET_BRAND_SIGNATURE_REQUIRE=on, PER_CALLER_TOKEN_REQUIRED=on. Every wallet entry point (bucket_wallet.py, wallet.py, topology.py, queries.py) now envelope-rejects a missing X-Brand-Id upfront, so the store-layer default-brand fallback is unreachable from any production route. The wallet_topology_default_brand_fallback_total counter is retained to verify that the fallback path never fires during soak. The checks below remain the operator gate for the actual production rollout (soak, sequencing, wall-clock budget, mid-flip abort/revert). A passing local test suite is not a substitute for these checks.

The reproducible operator script that prints the per-environment plan is at servers_v2/tools/multi_brand_backfill/flip_enforcement.py.

The offline rehearsal harness that drives this checklist non-mutably (precondition checks + critical-counter gate + GO/NO-GO verdict) is at tools/multi_brand_backfill/phase16_dryrun.sh. Run it BEFORE every live flip:

tools/multi_brand_backfill/phase16_dryrun.sh \
    --metrics-snapshot /tmp/${ENV}-metrics.txt

The wrapper invokes tools/multi_brand_backfill/phase16_dryrun.py under the hood. Exit code 0 = GO, 1 = NO-GO. The harness is dependency-light (stdlib only) so it runs from any operator laptop without provisioning a per-service venv.

Pre-Flip Gate

Every check below must pass against the target environment before the flip starts. Soak window: at least 2 * JWT_EXPIRE_MIN minutes from the moment Phase 5 was deployed in the target environment (default JWT_EXPIRE_MIN = 60, so 120 minutes minimum). If the refresh-token TTL is longer than 2 * JWT_EXPIRE_MIN, extend the soak window to 2 * REFRESH_TOKEN_TTL.
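The soak-window rule above can be sketched as a small calculation. This is an illustrative helper, not part of the operator tooling: the function name min_soak_minutes is hypothetical, and it assumes REFRESH_TOKEN_TTL is expressed in minutes like JWT_EXPIRE_MIN.

```python
from typing import Optional


def min_soak_minutes(jwt_expire_min: int = 60,
                     refresh_token_ttl_min: Optional[int] = None) -> int:
    """Return the minimum soak window in minutes.

    Assumption: the refresh-token TTL is given in minutes.
    """
    base = 2 * jwt_expire_min
    # Extend only when the refresh-token TTL is longer than the base window.
    if refresh_token_ttl_min is not None and refresh_token_ttl_min > base:
        return 2 * refresh_token_ttl_min
    return base


print(min_soak_minutes())         # 120 with the default JWT_EXPIRE_MIN = 60
print(min_soak_minutes(60, 180))  # 360, because 180 > 2 * 60
```

A refresh-token TTL of 100 minutes would leave the window at 120, because it does not exceed 2 * JWT_EXPIRE_MIN.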

Audit gate (T16-D — added as hard precondition)

Before any other pre-flip check, the brand_id audit pair must pass on the deployed commit. Both classifiers must exit 0 (no FIX, no FIX-DYNAMIC, no FIX-TEST):

make audit-brand-id-inserts && make audit-brand-id-rw

Expected output (post-T16):

INSERT: RAW: FIX=0 FIX-TEST=0, SA: FIX=0 FIX-TEST=0 (only the 2 documented assertion-string false positives in test_agent_integration.py:1063,1068 may surface — see 2026-04-28-brand-id-insert-audit.md § FIX-TEST callout).
RW: SELECT FIX=0 FIX-DYNAMIC=0, UPDATE FIX=0 FIX-DYNAMIC=0, DELETE FIX=0 FIX-DYNAMIC=0.

If either classifier reports any FIX/FIX-DYNAMIC entry, STOP — the deployed commit has a brand-leak the audit can statically prove. Triage and revert before proceeding.
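As a cross-check on the expected output above, the summary counters can be tallied from captured classifier output. This is a minimal sketch that assumes the summary lines have the NAME=COUNT shape shown in the expected output; the function names tally and must_stop are hypothetical, and the classifiers' exit codes remain the authoritative gate.

```python
import re

# Matches FIX=0, FIX-DYNAMIC=0 and FIX-TEST=0 tokens in a summary line.
_COUNTER = re.compile(r"\b(FIX(?:-DYNAMIC|-TEST)?)=(\d+)")


def tally(summary: str) -> dict:
    """Sum every FIX / FIX-DYNAMIC / FIX-TEST counter found in the text."""
    totals = {"FIX": 0, "FIX-DYNAMIC": 0, "FIX-TEST": 0}
    for name, count in _COUNTER.findall(summary):
        totals[name] += int(count)
    return totals


def must_stop(summary: str) -> bool:
    """True when any FIX or FIX-DYNAMIC entry is reported."""
    totals = tally(summary)
    return totals["FIX"] > 0 or totals["FIX-DYNAMIC"] > 0


sample = (
    "INSERT: RAW: FIX=0 FIX-TEST=0, SA: FIX=0 FIX-TEST=0\n"
    "RW: SELECT FIX=0 FIX-DYNAMIC=0, UPDATE FIX=0 FIX-DYNAMIC=0, "
    "DELETE FIX=0 FIX-DYNAMIC=0."
)
print(must_stop(sample))  # False: every blocking counter is zero
```

FIX-TEST is tallied but not treated as blocking here, because the two documented false positives may surface; review any FIX-TEST entry against the audit note by hand.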

Soak gate

Flip Sequence

Run the operator script first to print a per-environment plan into the incident channel for record-keeping. As of the P1-5 hardening the script requires --operator-id and --reason and writes exactly one admin_audit row per invocation (see "Audit row tamper-evidence" and "Operator audit trail for MULTI_BRAND_ENFORCEMENT flips" in the rollout runbook):

DATABASE_URL=<env DSN> \
python -m servers_v2.tools.multi_brand_backfill.flip_enforcement \
    --env <staging|production> \
    --operator-id <your-email-or-ldap-id> \
    --reason "<ticket-id>: <short justification>"

If the script exits non-zero, stop. Exit 3 means the plan was printed but the admin_audit row write failed -- the security trail is missing and the flip MUST NOT proceed until the row lands.

Note: admin_audit is DB-level append-only as of migration 0032. Any "missing audit row" alarm or "admin_audit is append-only" log line is itself a security event -- page the platform on-call rather than retrying.
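The invocation and exit-code handling above can be wrapped as follows. This is a sketch only: build_flip_argv and next_action are hypothetical helper names, the flags are limited to the ones documented in this checklist, and only exit codes 0 and 3 are given a specific meaning because those are the only ones described here.

```python
from typing import List

VALID_ENVS = {"staging", "production"}


def build_flip_argv(env: str, operator_id: str, reason: str,
                    rollback: bool = False) -> List[str]:
    """Assemble the documented flip_enforcement invocation as an argv list."""
    if env not in VALID_ENVS:
        raise ValueError(f"env must be one of {sorted(VALID_ENVS)}")
    if not operator_id.strip() or not reason.strip():
        raise ValueError("--operator-id and --reason are both required")
    argv = [
        "python", "-m",
        "servers_v2.tools.multi_brand_backfill.flip_enforcement",
        "--env", env,
        "--operator-id", operator_id,
        "--reason", reason,
    ]
    if rollback:
        argv.append("--rollback")
    return argv


def next_action(exit_code: int) -> str:
    """Map the script's exit code to the on-call decision."""
    if exit_code == 0:
        return "proceed"
    if exit_code == 3:
        # The plan was printed, but the admin_audit row write failed.
        return "stop: audit row missing"
    return "stop: script failed"
```

The argv list can be passed to subprocess.run with DATABASE_URL set in the child environment, and the resulting returncode fed to next_action.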

Then execute the steps below in order. Downstream first; edge last. Per-step budget: 5 minutes wall-clock. Total budget: 30 minutes.

For each step:

Set MULTI_BRAND_ENFORCEMENT=enforce for the listed service in the environment's deploy config.
Restart the API container and any worker container listed.
Curl /health on the just-restarted service and confirm the response shows multi_brand_enforcement_mode=enforce.
Query Prometheus for multi_brand_enforcement_mode{service="<svc>"} == 2.
Query Prometheus for sum(rate(wallet_cross_brand_rejected_total{mode="enforce"}[1m])) at baseline (zero or known background).
If any check fails twice or the step exceeds 5 minutes, abort and revert.

Steps:

Step 1 -- wallet_service + wallet_worker.
Step 2 -- rolling_service + rolling_worker.
Step 3 -- promotion_service + promotion_worker.
Step 4 -- player_service.
Step 5 -- recon_service + recon_worker.
Step 6 -- agent_service + agent_worker.
Step 7 -- gateway.
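The two Prometheus checks run at every step can be generated per service so they are pasted identically each time. This is a sketch: step_queries and instant_query_url are hypothetical helpers, the service label value is assumed to equal the service name used in the step list, and the Prometheus base URL is a placeholder. /api/v1/query is the standard Prometheus instant-query endpoint.

```python
from urllib.parse import urlencode


def step_queries(service: str) -> dict:
    """Return the two PromQL expressions checked after restarting a service."""
    return {
        # Expect a result: the service reports enforce (gauge value 2).
        "mode": f'multi_brand_enforcement_mode{{service="{service}"}} == 2',
        # Expect baseline: zero or known background rejections.
        "rejections": (
            'sum(rate(wallet_cross_brand_rejected_total'
            '{mode="enforce"}[1m]))'
        ),
    }


def instant_query_url(prometheus_base: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urlencode({"query": expr})
    return prometheus_base.rstrip("/") + "/api/v1/query?" + query_string


queries = step_queries("wallet_service")
print(queries["mode"])
print(instant_query_url("http://prometheus.example:9090", queries["mode"]))
```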

Mid-Flip Abort Criteria

Abort and revert if any of the following fires:

A single service step exceeds 5 minutes wall-clock.
The readiness check fails twice on the same service.
The wallet_cross_brand_rejected_total{mode="enforce"} alert fires during the flip.
The total budget exceeds 30 minutes wall-clock.
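The four abort criteria above can be expressed as one predicate that a timer script evaluates after every check. This is a minimal sketch; FlipState and abort_reasons are hypothetical names, and the caller is assumed to track elapsed time and readiness failures itself.

```python
from dataclasses import dataclass
from typing import List

STEP_BUDGET_S = 5 * 60    # per-step wall-clock budget
TOTAL_BUDGET_S = 30 * 60  # total wall-clock budget


@dataclass
class FlipState:
    step_elapsed_s: float         # seconds spent on the current service step
    total_elapsed_s: float        # seconds since the flip started
    readiness_failures: int       # readiness failures on the current service
    rejection_alert_firing: bool  # wallet_cross_brand_rejected_total alert


def abort_reasons(state: FlipState) -> List[str]:
    """Return every abort criterion that currently holds (empty = continue)."""
    reasons = []
    if state.step_elapsed_s > STEP_BUDGET_S:
        reasons.append("step exceeded 5 minutes")
    if state.readiness_failures >= 2:
        reasons.append("readiness check failed twice")
    if state.rejection_alert_firing:
        reasons.append("cross-brand rejection alert fired")
    if state.total_elapsed_s > TOTAL_BUDGET_S:
        reasons.append("total budget exceeded 30 minutes")
    return reasons


print(abort_reasons(FlipState(120, 900, 1, False)))  # [] -> keep going
```

Any non-empty result maps directly onto the --reason "abort: ..." string required by the rollback invocation.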

Mid-Flip Revert Procedure

Run the operator script with --rollback to print the reverse plan (same --operator-id + --reason requirements; the audit row records the rollback intent and the abort reason):

DATABASE_URL=<env DSN> \
python -m servers_v2.tools.multi_brand_backfill.flip_enforcement \
    --env <staging|production> \
    --operator-id <your-email-or-ldap-id> \
    --reason "abort: <which abort criterion fired>" \
    --rollback

Then execute the steps in reverse order. Edge first; wallet last. Per-step budget unchanged.

Revert step 1 -- gateway -> observe.
Revert step 2 -- agent_service + agent_worker -> observe.
Revert step 3 -- recon_service + recon_worker -> observe.
Revert step 4 -- player_service -> observe.
Revert step 5 -- promotion_service + promotion_worker -> observe.
Revert step 6 -- rolling_service + rolling_worker -> observe.
Revert step 7 -- wallet_service + wallet_worker -> observe.
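The revert order is the forward order reversed, which can be derived rather than retyped under pressure. This is a sketch with a hypothetical revert_plan helper; the authoritative plan is still the one printed by the operator script with --rollback.

```python
# Forward flip order from this checklist: downstream first, edge last.
FORWARD_STEPS = [
    ("wallet_service", "wallet_worker"),
    ("rolling_service", "rolling_worker"),
    ("promotion_service", "promotion_worker"),
    ("player_service",),
    ("recon_service", "recon_worker"),
    ("agent_service", "agent_worker"),
    ("gateway",),
]


def revert_plan() -> list:
    """Render the revert steps: edge first, wallet last."""
    lines = []
    for number, containers in enumerate(reversed(FORWARD_STEPS), start=1):
        lines.append(
            f"Revert step {number} -- {' + '.join(containers)} -> observe."
        )
    return lines


for line in revert_plan():
    print(line)
```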

After every reverted service shows observe again, document the incident in #incidents and root-cause the failure before any retry. A retry MUST start from a fresh soak window.

Post-Flip Verification

After all 7 forward steps land cleanly:

Every service reports enforce via Prometheus multi_brand_enforcement_mode == 2.
security_downgrade_total counter remains at zero in the hour after the flip (would fire if any service later re-bootstrapped with the wrong mode).
Synthetic test: a cross-brand JWT replay against gateway returns the documented reject envelope.
Synthetic test: a JWT without brand_id against gateway returns the documented reject envelope.
No spike in legitimate rejections in normal traffic.
docs/runbooks/multi-brand/multi-brand-isolation-rollout.md release-gate table is updated to mark this environment as hard-enforce.
Incident channel placeholder closed; on-call hands over.
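The first two verification items above can be checked offline against a saved metrics snapshot. This is a sketch under stated assumptions: the snapshot is in the Prometheus text exposition format, the gauge carries a service label, and parse_samples and post_flip_problems are hypothetical helpers rather than the implementation of phase16_dryrun.py.

```python
import re

_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str):
    """Yield (name, labels, value) for each sample line in the snapshot."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match:
            labels = dict(_LABEL.findall(match["labels"] or ""))
            yield match["name"], labels, float(match["value"])


def post_flip_problems(text: str, expected_services: list) -> list:
    """List services not reporting enforce, plus any downgrade count."""
    modes = {}
    downgrades = 0.0
    for name, labels, value in parse_samples(text):
        if name == "multi_brand_enforcement_mode":
            modes[labels.get("service")] = value
        elif name == "security_downgrade_total":
            downgrades += value
    problems = [
        f"{svc}: mode is not enforce"
        for svc in expected_services
        if modes.get(svc) != 2
    ]
    if downgrades > 0:
        problems.append("security_downgrade_total is non-zero")
    return problems


snapshot = """
# TYPE multi_brand_enforcement_mode gauge
multi_brand_enforcement_mode{service="wallet_service"} 2
multi_brand_enforcement_mode{service="gateway"} 1
security_downgrade_total{service="gateway"} 0
"""
print(post_flip_problems(snapshot, ["wallet_service", "gateway"]))
```

A service missing from the snapshot is reported the same way as one in the wrong mode, so a scrape gap cannot pass silently.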

Phase 17 — Post-flip hardening (not blocking Phase 16)

After Phase 16 stabilises (typically 2 weeks of soak with the critical-zero counter gate holding), execute these in order. Each is self-contained and reversible.

T18 closed the code paths for #1, #2, #3 + the supporting infrastructure for #4–#8 — what remains is the operator decision + env-var flip in each target environment.

audit_registration refactor onto shared helper — closed in commit 0939a91f (T18-B). The route now calls assert_player_brand_matches_operator; behaviour identical, counter cardinality matches the other 13 admin routes.
Composite indexes on 6 high-volume tables — closed in commit f5b15966 (T18-A) via migration 0039_brand_id_composite_indexes.py. Apply the migration in each target environment via the standard alembic flow:
```
cd servers_v2/shared/rgb_db && uv run alembic upgrade head
```
The indexes are built CONCURRENTLY, so the migration takes no exclusive locks against writers. Verify in psql via \di+ rgb.ix_player_deposit_brand_status_create_ts and the other new indexes.
AES_STRICT_DECRYPT=on flip — once pii_aes_legacy_plaintext_observed_total{service=*,op=decrypt} drains to zero across player_service / agent_service / admin_service for one full week, set the env var on each service:
```
# On each player/agent/admin pod (or via the secrets manager):
AES_STRICT_DECRYPT=on
```
T18-E1 added a boot-time WARN (warn_aes_strict_decrypt_recommendation) that fires on every restart while the flag is OFF in production, so on-call sees the deferred flip in the boot logs. After the flip the WARN goes silent. The strict-raise behaviour has been wired since T12-C2, so the flag toggle is the only thing left to do.
Legacy INTERNAL_SERVICE_TOKEN removal — execute in two steps:
1. Verify internal_caller_token_legacy_total has held zero for two soak windows.
2. Set INTERNAL_SERVICE_TOKEN_DEPRECATED=on on every internal-token-accepting service (wallet_service, agent_service, player_service, promotion_service, rolling_service, recon_service, admin_service) AND remove the legacy INTERNAL_SERVICE_TOKEN env var in the same deploy:
```
# On each consumer pod:
INTERNAL_SERVICE_TOKEN_DEPRECATED=on
unset INTERNAL_SERVICE_TOKEN
# The per-caller variants INTERNAL_SERVICE_TOKEN_<CALLER>
# remain set — they are now the only auth path.
```
  T18-E2 added the boot guard assert_internal_service_token_deprecated_state — if the deprecation flag is on AND the legacy var is still present the service refuses to boot, forcing the operator to remove the legacy var as a single atomic step.
Gateway legacy session fallback removal — delivered 2026-05-06 as part of the Phase 4E dual-read sunset cleanup. Gateway now reads only user-session:brand:{brand_id}:{account}; GATEWAY_SESSION_LEGACY_KEY_DISABLED and gateway_session_legacy_key_used_total are retired. The retained signal is gateway_jwt_session_missing_total.

Historical incident notes that mention GATEWAY_SESSION_LEGACY_KEY_DISABLED describe the pre-sunset transition branch; that flag no longer exists in gateway.
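The boot-guard rule described in the legacy INTERNAL_SERVICE_TOKEN removal item can be pre-checked against a pod's environment before deploying. This is a sketch of the rule as stated in this checklist, not the implementation of assert_internal_service_token_deprecated_state; check_internal_token_state is a hypothetical name, and the flag is assumed to be on only when its value is exactly "on".

```python
def check_internal_token_state(env: dict) -> None:
    """Raise if the deprecation flag is on while the legacy var is still set."""
    deprecated = env.get("INTERNAL_SERVICE_TOKEN_DEPRECATED") == "on"
    # Exact key match: per-caller INTERNAL_SERVICE_TOKEN_<CALLER> vars are fine.
    legacy_present = "INTERNAL_SERVICE_TOKEN" in env
    if deprecated and legacy_present:
        raise RuntimeError(
            "INTERNAL_SERVICE_TOKEN_DEPRECATED=on but INTERNAL_SERVICE_TOKEN "
            "is still set; remove the legacy var in the same deploy"
        )


# Passes: flag on, legacy var removed, per-caller variant retained.
check_internal_token_state({
    "INTERNAL_SERVICE_TOKEN_DEPRECATED": "on",
    "INTERNAL_SERVICE_TOKEN_GATEWAY": "example-secret",
})
```

In practice the dict would be os.environ or the rendered deploy config for the target pod.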
