Phase 16 Hard-Flip On-Call Checklist

Status

Ready

Date

2026-04-28

Owners

  • Platform Backend
  • On-Call

Last Verified Commit

934d47f4 (T16-D doc sync — adds the make audit-brand-id-inserts && make audit-brand-id-rw exit-code gate as a hard precondition; both must report FIX=0 AND FIX-DYNAMIC=0 before the flip starts)

Purpose

This is the linear cheat sheet the on-call follows when flipping MULTI_BRAND_ENFORCEMENT from observe to enforce in a target environment. The canonical multi-step description lives in docs/plans/multi-brand/2026-04-27-multi-brand-isolation-plan.md Phase 16 and the deeper rationale is in docs/runbooks/multi-brand/multi-brand-isolation-rollout.md. This file is the consolidated handbook used during the live flip; it does not introduce any new policy.

2026-05-06 in-repo posture update. The servers_v2/docker-compose.yml defaults are now production-grade out of the box — MULTI_BRAND_ENFORCEMENT=enforce, WALLET_BRAND_SIGNATURE_REQUIRE=on, PER_CALLER_TOKEN_REQUIRED=on, VERIFY_CALLBACKS=true. Every wallet entry point now envelope-rejects on missing X-Brand-Id upfront (bucket_wallet.py, wallet.py, topology.py, queries.py), so the store-layer default-brand fallback is unreachable from any production route. The wallet_topology_default_brand_fallback_total counter is retained to verify the fallback path never fires in soak — the checks below remain the operator gate for the actual production rollout (soak, sequencing, wall-clock budget, mid-flip abort/revert). A passing local test suite is not a substitute for these checks.

The reproducible operator script that prints the per-environment plan is at servers_v2/tools/multi_brand_backfill/flip_enforcement.py.

The offline rehearsal harness that walks this checklist without mutating the target environment (precondition checks + critical-counter gate + GO/NO-GO verdict) is at tools/multi_brand_backfill/phase16_dryrun.sh. Run it BEFORE every live flip:

tools/multi_brand_backfill/phase16_dryrun.sh \
--metrics-snapshot /tmp/${ENV}-metrics.txt

The wrapper invokes tools/multi_brand_backfill/phase16_dryrun.py under the hood. Exit code 0 = GO, 1 = NO-GO. The harness is dependency-light (stdlib only) so it runs from any operator laptop without provisioning a per-service venv.

Pre-Flip Gate

Every check below must pass against the target environment before the flip starts. Soak window: at least 2 * JWT_EXPIRE_MIN minutes from the moment Phase 5 was deployed in the target environment (default JWT_EXPIRE_MIN = 60, so 120 minutes minimum). If the refresh-token TTL is longer than 2 * JWT_EXPIRE_MIN, extend the soak window to 2 * REFRESH_TOKEN_TTL.
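The soak-window rule above reduces to a small calculation. A minimal sketch, assuming REFRESH_TOKEN_TTL is expressed in minutes like JWT_EXPIRE_MIN (the unit is not stated here); soak_window_minutes is a hypothetical helper, not part of the repo tooling:

```python
from typing import Optional


def soak_window_minutes(jwt_expire_min: int = 60,
                        refresh_token_ttl_min: Optional[int] = None) -> int:
    """Return the minimum soak window in minutes.

    Assumption: the refresh-token TTL is given in minutes.
    """
    base = 2 * jwt_expire_min
    # Extend only when the refresh-token TTL outlasts the JWT-based window.
    if refresh_token_ttl_min is not None and refresh_token_ttl_min > base:
        return 2 * refresh_token_ttl_min
    return base


if __name__ == "__main__":
    print(soak_window_minutes())           # default JWT_EXPIRE_MIN = 60
    print(soak_window_minutes(60, 1440))   # hypothetical 24-hour refresh TTL
```

With the defaults this yields the 120-minute minimum quoted above; a refresh-token TTL of 120 minutes or less leaves the window unchanged.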

Audit gate (T16-D — added as hard precondition)

Before any other pre-flip check, the brand_id audit pair must pass on the deployed commit. Both classifiers must exit 0 (no FIX, no FIX-DYNAMIC, no FIX-TEST):

make audit-brand-id-inserts && make audit-brand-id-rw

Expected output (post-T16):

  • INSERT: RAW: FIX=0 FIX-TEST=0, SA: FIX=0 FIX-TEST=0 (only the 2 documented assertion-string false positives in test_agent_integration.py:1063,1068 may surface — see 2026-04-28-brand-id-insert-audit.md § FIX-TEST callout).
  • RW: SELECT FIX=0 FIX-DYNAMIC=0, UPDATE FIX=0 FIX-DYNAMIC=0, DELETE FIX=0 FIX-DYNAMIC=0.

If either classifier reports any FIX/FIX-DYNAMIC entry, STOP — the deployed commit has a brand-leak the audit can statically prove. Triage and revert before proceeding.
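As an illustration of the STOP rule, the sketch below scans a classifier summary for non-zero FIX or FIX-DYNAMIC counts. It assumes the summary uses the KEY=N tokens shown in the expected output above; the real classifier output format may differ, and blocking_counts is a hypothetical helper. The make exit code remains the authoritative gate.

```python
import re
from typing import List, Tuple

# Matches FIX=N, FIX-DYNAMIC=N and FIX-TEST=N tokens in a summary line.
_COUNT = re.compile(r"\b(FIX(?:-DYNAMIC|-TEST)?)=(\d+)")

# Per the STOP rule above, only these two categories block the flip.
BLOCKING = ("FIX", "FIX-DYNAMIC")


def blocking_counts(summary: str) -> List[Tuple[str, int]]:
    """Return every blocking (category, count) pair with a non-zero count."""
    return [
        (category, int(count))
        for category, count in _COUNT.findall(summary)
        if category in BLOCKING and int(count) > 0
    ]


if __name__ == "__main__":
    rw = "SELECT FIX=0 FIX-DYNAMIC=0, UPDATE FIX=0 FIX-DYNAMIC=0, DELETE FIX=0 FIX-DYNAMIC=0"
    print(blocking_counts(rw) or "clean")
```

An empty result corresponds to the clean state; any returned pair is a reason to stop.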

Soak gate

  • Soak window has been observed for the documented length.
  • brand_resolution_failed_total{reason="jwt_domain_mismatch"} counter is at zero across the soak window.
  • brand_resolution_failed_total{reason="missing_header"} counter is at zero across the soak window (proves every internal caller forwards X-Brand-Id on brand-scoped routes).
  • wallet_cross_brand_rejected_total{mode="observe"} counter is at zero across the soak window.
  • wallet_brand_signature_failed_total{mode="observe"} counter is at zero across the soak window.
  • event_legacy_schema_total{stream,consumer} counter is at zero across the soak window.
  • agent_brand row count covers every active agent. Run: SELECT (SELECT count(*) FROM agent) AS agents, (SELECT count(DISTINCT agent_id) FROM agent_brand) AS bound; and confirm both numbers match. If they diverge, run the Phase 12 one-shot re-seed (see plan Phase 12) and re-check before proceeding.
  • All callers send X-Caller-Service header (verified by test_per_caller_header_audit.py passing). Run: cd servers_v2 && uv run pytest tests/test_per_caller_header_audit.py -v and confirm 0 failures. This regression net catches any new internal HTTP client that forgets the per-caller identifier.
  • internal_caller_token_legacy_total is zero across the soak window for every consumer service. Per-consumer query: sum by (consumer) (rate(internal_caller_token_legacy_total[15m])) == 0. A non-zero value means at least one caller is still authenticating with the shared INTERNAL_SERVICE_TOKEN; resolve before proceeding.
  • Stage C (revoke shared INTERNAL_SERVICE_TOKEN) has completed in this environment. Verify by printenv | grep INTERNAL_SERVICE_TOKEN on each consumer container; only the per-caller variants (INTERNAL_SERVICE_TOKEN_<CALLER>) should be present, the bare INTERNAL_SERVICE_TOKEN must be absent. See the per-caller-service token rollout section in docs/runbooks/multi-brand/multi-brand-isolation-rollout.md.
  • PER_CALLER_TOKEN_REQUIRED=on is set on every internal-token-accepting service (wallet_service, agent_service, player_service, promotion_service, rolling_service, recon_service). Verify by printenv PER_CALLER_TOKEN_REQUIRED on each container. With this flag on, the legacy single-token fallback hard-rejects (403 + internal_caller_token_legacy_rejected_total) instead of accepting -- closing the spoof window where an attacker holding the legacy secret could claim ANY X-Caller-Service value that hadn't been migrated yet. T4-D-I2 added this gate; it must advance together with MULTI_BRAND_ENFORCEMENT=enforce because the boot-time advisory flags an out-of-sync state but does not prevent boot.
  • WALLET_BRAND_SIGNATURE_REQUIRE=on in the target environment AND wallet_brand_signature_missing_total{caller_service=*} == 0 across the soak window. A non-zero value identifies a caller that is not signing brand-scoped writes; resolve before proceeding. See "Brand-signature enforcement flip" in docs/runbooks/multi-brand/multi-brand-isolation-rollout.md.
  • BRAND_SIGNING_KEY is set (non-empty) on every wallet pod. Verify that sum(rate(wallet_brand_signature_misconfigured_total[15m])) == 0 holds across the soak window AND confirm that the boot-time guard (assert_brand_signing_key_configured in wallet_service/app/main.py) did not fire in any pod's boot logs. The combination MULTI_BRAND_ENFORCEMENT=enforce AND WALLET_BRAND_SIGNATURE_REQUIRE=on AND empty BRAND_SIGNING_KEY would be a fail-open if left unguarded: the boot guard refuses to start the pod and the runtime path fails closed if the guard is bypassed, but the safest gate is to prove the value is provisioned BEFORE the flip. Cross-check that kubectl exec deploy/wallet-service -- printenv BRAND_SIGNING_KEY | wc -c returns more than 1 byte on every wallet pod (an empty value prints only the trailing newline, which is 1 byte).
  • VERIFY_CALLBACKS=true on every game_service pod. Run kubectl exec deploy/game-service -- printenv | grep VERIFY_CALLBACKS and confirm VERIFY_CALLBACKS=true is present on every replica. Cross-check sum(rate(game_callback_verification_bypassed_total[15m])) == 0 across the soak window. The combination runtime=production AND VERIFY_CALLBACKS=false is a fail-open that exposes /HO/, /mg/, /wc/ (gateway public path prefixes) to the public internet without signature verification; the boot-time guard (assert_verify_callbacks_safe_for_runtime in game_service/app/core/boot_guards.py) refuses to start the pod in that combination, but the safest gate is to prove the value is provisioned BEFORE the flip. Any non-zero rate on the bypass counter means at least one pod started in a non-prod runtime label and is silently skipping verification — root-cause before flipping.
  • game_callback_brand_unresolved_total{reason="raw_account_in_enforce"} AND {reason="raw_guid_in_enforce"} are zero across the soak window. Non-zero means a game-provider callback path still sends the legacy raw-account / raw-guid form; that path is brand-unaware and is rejected outright in enforce mode (see Bug #3 / game_service/app/services/callback_brand.py). Migrate the upstream caller to send the namespaced {brand_code}_{account} form and re-soak before proceeding.
  • wallet_topology_default_brand_fallback_total == 0 across the soak window. A non-zero value means the per-brand topology lookup silently fell back to the default brand for at least one request -- root-cause before the flip (likely a missing per-brand topology row).
  • On-call has paged backup, opened the incident channel placeholder, and started the 30-minute wall-clock budget timer.
  • On-call has the multi-brand counter Diagnosis Playbook open in a tab: diagnosis-playbook.md. Every pre-flip counter check above (signature, cross-brand, topology fallback) has a per-counter "what does it mean / what to do" section there for the case where the soak window is non-zero and root-cause is needed before proceeding.
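The "zero across the soak window" checks above amount to confirming that no critical counter moved between the start and the end of the window. A rough sketch of that comparison, assuming two metrics snapshots in the Prometheus text exposition format are available; counters_that_moved is a hypothetical helper and is not the phase16_dryrun.py implementation. It compares every label set of each listed counter, which is stricter than the per-label checks in the list, and its simple regex does not handle label values that contain "}".

```python
import re
from typing import Dict, Tuple

# One sample line: name, optional {labels}, value (trailing timestamp ignored).
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")

# Counter names taken from the soak-gate list above.
CRITICAL_COUNTERS = (
    "brand_resolution_failed_total",
    "wallet_cross_brand_rejected_total",
    "wallet_brand_signature_failed_total",
    "event_legacy_schema_total",
    "internal_caller_token_legacy_total",
    "wallet_brand_signature_missing_total",
    "wallet_brand_signature_misconfigured_total",
    "game_callback_verification_bypassed_total",
    "game_callback_brand_unresolved_total",
    "wallet_topology_default_brand_fallback_total",
)

Key = Tuple[str, str]


def parse_samples(text: str) -> Dict[Key, float]:
    """Map (metric name, raw label string) to value for critical counters."""
    samples: Dict[Key, float] = {}
    for line in text.splitlines():
        match = _SAMPLE.match(line.strip())  # comment/blank lines never match
        if match and match.group(1) in CRITICAL_COUNTERS:
            samples[(match.group(1), match.group(2) or "")] = float(match.group(3))
    return samples


def counters_that_moved(start_text: str, end_text: str) -> Dict[Key, float]:
    """Return the delta for each critical series whose value changed."""
    start = parse_samples(start_text)
    end = parse_samples(end_text)
    return {
        key: value - start.get(key, 0.0)
        for key, value in end.items()
        if value != start.get(key, 0.0)
    }
```

An empty result means none of the listed counters changed between the two snapshots. A negative delta indicates a counter reset (for example a pod restart) and deserves the same scrutiny as an increase.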

Flip Sequence

Run the operator script first to print a per-environment plan into the incident channel for record-keeping. As of the P1-5 hardening the script requires --operator-id and --reason and writes exactly one admin_audit row per invocation (see "Audit row tamper-evidence" and "Operator audit trail for MULTI_BRAND_ENFORCEMENT flips" in the rollout runbook):

DATABASE_URL=<env DSN> \
python -m servers_v2.tools.multi_brand_backfill.flip_enforcement \
--env <staging|production> \
--operator-id <your-email-or-ldap-id> \
--reason "<ticket-id>: <short justification>"

If the script exits non-zero, stop. Exit 3 means the plan was printed but the admin_audit row write failed -- the security trail is missing and the flip MUST NOT proceed until the row lands.

Note: admin_audit is DB-level append-only as of migration 0032. Any "missing audit row" alarm or "admin_audit is append-only" log line is itself a security event -- page the platform on-call rather than retrying.

Then execute the steps below in order. Downstream first; edge last. Per-step budget: 5 minutes wall-clock. Total budget: 30 minutes.

For each step:

  1. Set MULTI_BRAND_ENFORCEMENT=enforce for the listed service in the environment's deploy config.
  2. Restart the API container and any worker container listed.
  3. Curl /health on the just-restarted service and confirm the response shows multi_brand_enforcement_mode=enforce.
  4. Query Prometheus for multi_brand_enforcement_mode{service="<svc>"} == 2.
  5. Query Prometheus for sum(rate(wallet_cross_brand_rejected_total{mode="enforce"}[1m])) at baseline (zero or known background).
  6. If any check fails twice or the step exceeds 5 minutes, abort and revert.

Steps:

  • Step 1 -- wallet_service + wallet_worker.
  • Step 2 -- rolling_service + rolling_worker.
  • Step 3 -- promotion_service + promotion_worker.
  • Step 4 -- player_service.
  • Step 5 -- recon_service + recon_worker.
  • Step 6 -- agent_service + agent_worker.
  • Step 7 -- gateway.
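The per-step rule in the numbered list above (retry a failed check once, abort on the second failure or when the step runs past 5 minutes) can be sketched as a small loop. This is an illustration only: verify_step and its callable-based checks are hypothetical, and the real /health and Prometheus probes are left to the caller.

```python
import time
from typing import Callable, Iterable

STEP_BUDGET_SECONDS = 5 * 60


def verify_step(checks: Iterable[Callable[[], bool]],
                budget_seconds: float = STEP_BUDGET_SECONDS,
                clock: Callable[[], float] = time.monotonic) -> bool:
    """Return True if every check passes within the step budget.

    Each check gets one retry; a second failure, or running past the
    budget, returns False, which the operator treats as abort-and-revert.
    """
    started = clock()
    for check in checks:
        failures = 0
        while not check():
            failures += 1
            if failures >= 2:
                return False
            if clock() - started > budget_seconds:
                return False
        if clock() - started > budget_seconds:
            return False
    return True


if __name__ == "__main__":
    # Stand-in checks; real ones would probe /health and Prometheus.
    print(verify_step([lambda: True, lambda: True]))
```

The clock parameter exists so the budget logic can be exercised without waiting five real minutes.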

Mid-Flip Abort Criteria

Abort and revert if any of the following fires:

  • A single service step exceeds 5 minutes wall-clock.
  • The readiness check fails twice on the same service.
  • The wallet_cross_brand_rejected_total{mode="enforce"} alert fires during the flip.
  • The total budget exceeds 30 minutes wall-clock.
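The four abort criteria above can be evaluated together from a small snapshot of the flip's state. A minimal sketch; FlipState and abort_reasons are hypothetical names introduced for illustration, and how the elapsed times and alert state are collected is left open.

```python
from dataclasses import dataclass
from typing import List

STEP_BUDGET_S = 5 * 60
TOTAL_BUDGET_S = 30 * 60


@dataclass
class FlipState:
    step_elapsed_s: float           # wall-clock seconds spent on the current step
    total_elapsed_s: float          # wall-clock seconds since the flip started
    readiness_failures: int         # readiness failures on the current service
    cross_brand_alert_firing: bool  # wallet_cross_brand_rejected_total{mode="enforce"} alert


def abort_reasons(state: FlipState) -> List[str]:
    """Return every abort criterion that currently fires (empty list = continue)."""
    reasons: List[str] = []
    if state.step_elapsed_s > STEP_BUDGET_S:
        reasons.append("step exceeded 5 minutes")
    if state.readiness_failures >= 2:
        reasons.append("readiness check failed twice")
    if state.cross_brand_alert_firing:
        reasons.append("cross-brand enforce alert fired")
    if state.total_elapsed_s > TOTAL_BUDGET_S:
        reasons.append("total budget exceeded 30 minutes")
    return reasons
```

Seven steps at the full 5-minute budget would take 35 minutes, so the 30-minute total can fire even when every individual step stays within its own limit.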

Mid-Flip Revert Procedure

Run the operator script with --rollback to print the reverse plan (same --operator-id + --reason requirements; the audit row records the rollback intent and the abort reason):

DATABASE_URL=<env DSN> \
python -m servers_v2.tools.multi_brand_backfill.flip_enforcement \
--env <staging|production> \
--operator-id <your-email-or-ldap-id> \
--reason "abort: <which abort criterion fired>" \
--rollback

Then execute the steps in reverse order. Edge first; wallet last. Per-step budget unchanged.

  • Revert step 1 -- gateway -> observe.
  • Revert step 2 -- agent_service + agent_worker -> observe.
  • Revert step 3 -- recon_service + recon_worker -> observe.
  • Revert step 4 -- player_service -> observe.
  • Revert step 5 -- promotion_service + promotion_worker -> observe.
  • Revert step 6 -- rolling_service + rolling_worker -> observe.
  • Revert step 7 -- wallet_service + wallet_worker -> observe.
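The revert list above is the forward sequence walked in reverse. The sketch below derives it from the forward order so the two cannot drift apart; FORWARD_ORDER and revert_plan are hypothetical names, and the actual output of flip_enforcement.py --rollback may be formatted differently.

```python
from typing import List, Tuple

# Forward flip order from the Flip Sequence section: downstream first, edge last.
FORWARD_ORDER: List[Tuple[str, ...]] = [
    ("wallet_service", "wallet_worker"),
    ("rolling_service", "rolling_worker"),
    ("promotion_service", "promotion_worker"),
    ("player_service",),
    ("recon_service", "recon_worker"),
    ("agent_service", "agent_worker"),
    ("gateway",),
]


def revert_plan() -> List[str]:
    """Return the revert steps: the forward order reversed, edge first."""
    return [
        f"Revert step {i} -- {' + '.join(group)} -> observe"
        for i, group in enumerate(reversed(FORWARD_ORDER), start=1)
    ]


if __name__ == "__main__":
    for line in revert_plan():
        print(line)
```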

After every reverted service shows observe again, document the incident in #incidents and root-cause the failure before any retry. A retry MUST start from a fresh soak window.

Post-Flip Verification

After all 7 forward steps land cleanly:

  • Every service reports enforce via Prometheus multi_brand_enforcement_mode == 2.
  • security_downgrade_total counter remains at zero in the hour after the flip (would fire if any service later re-bootstrapped with the wrong mode).
  • Synthetic test: a cross-brand JWT replay against gateway returns the documented reject envelope.
  • Synthetic test: a JWT without brand_id against gateway returns the documented reject envelope.
  • No spike in legitimate rejections in normal traffic.
  • docs/runbooks/multi-brand/multi-brand-isolation-rollout.md release-gate table is updated to mark this environment as hard-enforce.
  • Incident channel placeholder closed; on-call hands over.

Phase 17 — Post-flip hardening (not blocking Phase 16)

After Phase 16 stabilises (typically 2 weeks of soak with the critical-zero counter gate holding), execute these in order. Each is self-contained and reversible.

T18 closed the code paths for #1, #2, #3 + the supporting infrastructure for #4–#8 — what remains is the operator decision + env-var flip in each target environment.

  • audit_registration refactor onto shared helper — closed in commit 0939a91f (T18-B). The route now calls assert_player_brand_matches_operator; behaviour identical, counter cardinality matches the other 13 admin routes.

  • Composite indexes on 6 high-volume tables — closed in commit f5b15966 (T18-A) via migration 0039_brand_id_composite_indexes.py. Apply the migration in each target environment via the standard alembic flow:

    cd servers_v2/shared/rgb_db && uv run alembic upgrade head

    The indexes are built with CONCURRENTLY, so the build takes no exclusive lock against writers. Verify in psql via \di+ rgb.ix_player_deposit_brand_status_create_ts and the other new indexes.

  • AES_STRICT_DECRYPT=on flip — once pii_aes_legacy_plaintext_observed_total{service=*,op=decrypt} drains to zero across player_service / agent_service / admin_service for one full week, set the env var on each service:

    # On each player/agent/admin pod (or via the secrets manager):
    AES_STRICT_DECRYPT=on

    T18-E1 added a boot-time WARN that fires every restart when the flag is OFF in production (warn_aes_strict_decrypt_recommendation) so on-call sees the deferred flip in the boot logs. After the flip the WARN goes silent. The strict-raise behaviour has been wired since T12-C2, so the flag toggle is the only thing left to do.

  • Legacy INTERNAL_SERVICE_TOKEN removal — execute in two steps:

    1. Verify internal_caller_token_legacy_total has held zero for two soak windows.

    2. Set INTERNAL_SERVICE_TOKEN_DEPRECATED=on on every internal-token-accepting service (wallet_service, agent_service, player_service, promotion_service, rolling_service, recon_service, admin_service) AND remove the legacy INTERNAL_SERVICE_TOKEN env var in the same deploy:

      # On each consumer pod:
      INTERNAL_SERVICE_TOKEN_DEPRECATED=on
      unset INTERNAL_SERVICE_TOKEN
      # The per-caller variants INTERNAL_SERVICE_TOKEN_<CALLER>
      # remain set — they are now the only auth path.

      T18-E2 added the boot guard assert_internal_service_token_deprecated_state — if the deprecation flag is on AND the legacy var is still present the service refuses to boot, forcing the operator to remove the legacy var as a single atomic step.

  • Gateway legacy session fallback removal — delivered 2026-05-06 as part of the Phase 4E dual-read sunset cleanup. Gateway now reads only user-session:brand:{brand_id}:{account}; GATEWAY_SESSION_LEGACY_KEY_DISABLED and gateway_session_legacy_key_used_total are retired. The retained signal is gateway_jwt_session_missing_total.

    Historical incident notes that mention GATEWAY_SESSION_LEGACY_KEY_DISABLED describe the pre-sunset transition branch; that flag no longer exists in gateway.
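For the legacy INTERNAL_SERVICE_TOKEN removal item above, the boot-guard rule (deprecation flag on AND legacy variable still present means refuse to boot) can be sketched as follows. This is not the code of assert_internal_service_token_deprecated_state, whose implementation is not shown here; check_token_deprecation_state and BootGuardError are hypothetical names, and treating only the literal value "on" as enabled is an assumption.

```python
import os
from typing import Mapping


class BootGuardError(RuntimeError):
    """Raised to refuse service start-up (hypothetical exception type)."""


def check_token_deprecation_state(env: Mapping[str, str] = os.environ) -> None:
    """Raise if the deprecation flag is on while the legacy token is still set."""
    # Assumption: the flag counts as enabled only when its value is "on".
    deprecated = env.get("INTERNAL_SERVICE_TOKEN_DEPRECATED", "").strip().lower() == "on"
    legacy_present = "INTERNAL_SERVICE_TOKEN" in env
    if deprecated and legacy_present:
        raise BootGuardError(
            "INTERNAL_SERVICE_TOKEN_DEPRECATED=on but INTERNAL_SERVICE_TOKEN is "
            "still set; remove the legacy variable in the same deploy"
        )


if __name__ == "__main__":
    # Per-caller variants alone are fine; the hypothetical suffix is illustrative.
    check_token_deprecation_state({
        "INTERNAL_SERVICE_TOKEN_DEPRECATED": "on",
        "INTERNAL_SERVICE_TOKEN_WALLET": "example",
    })
    print("boot allowed")
```

Passing the environment as a mapping keeps the rule testable without touching the real process environment.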

See Also

  • docs/plans/multi-brand/2026-04-27-multi-brand-isolation-plan.md -- canonical Phase 16 step definition.
  • docs/runbooks/multi-brand/multi-brand-isolation-rollout.md -- rationale, release gate, soak-window reasoning.
  • docs/specs/multi-brand/2026-04-27-multi-brand-isolation-spec.md -- acceptance criteria and observability counter definitions.
  • docs/adr/ADR-009-multi-brand-domain-routed-isolation.md -- the decision the flip enforces.
  • servers_v2/tools/multi_brand_backfill/flip_enforcement.py -- the reproducible operator script that prints this plan.
  • docs/runbooks/multi-brand/diagnosis-playbook.md -- per-counter operator-facing playbook for every multi-brand counter referenced in the pre-flip gate above.