Incident Response Runbook
Status
Active
Date
2026-05-06
Owners
- Platform Backend (primary)
- On-call rotation (acknowledge within 5 min for SEV1, 15 min for SEV2)
Severity Definitions
| SEV | Examples | Time to ack | Time to fix |
|---|---|---|---|
| SEV1 | Money loss / wallet writer wedged / wallet API down / admin lockout | 5 min | 1 hour |
| SEV2 | Player-facing outage on a major flow (login, deposit, bet, withdraw) | 15 min | 4 hours |
| SEV3 | Single non-critical surface degraded (records page slow, recon delay > 1h) | 1 hour | 24 hours |
| SEV4 | Cosmetic, internal-only, or fully mitigated | next business day | as scheduled |
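The acknowledgement windows in the table can be encoded so that tooling computes the ack deadline instead of someone reading it off under pressure. The sketch below is illustrative only: `sev_ack_minutes` and `ack_deadline` are hypothetical names, not existing scripts in this repo, and `ack_deadline` assumes GNU `date` (the `-d @epoch` form).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the values are copied from the "Time to ack" column.

# Print the ack window in minutes for a severity label.
sev_ack_minutes() {
  case "$1" in
    SEV1) echo 5 ;;
    SEV2) echo 15 ;;
    SEV3) echo 60 ;;
    *)
      # SEV4 is "next business day", so it has no fixed minute window.
      echo "no fixed ack window for '$1'" >&2
      return 1 ;;
  esac
}

# Print the UTC ack deadline for an incident declared right now.
# Assumption: GNU date, which accepts -d @<epoch seconds>.
ack_deadline() {
  local minutes now
  minutes="$(sev_ack_minutes "$1")" || return 1
  now="$(date +%s)"
  date -u -d "@$((now + minutes * 60))" '+%H:%M UTC'
}
```

For example, `ack_deadline SEV2` prints a time 15 minutes from now in `HH:MM UTC` form, which can be pasted into the on-call channel when the incident is declared.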
Standard Response Loop
1. Acknowledge in the on-call channel within the SEV-defined window.
2. Snapshot context: current commit (`git log -1`), running env (`RGB_ENV`, set in `servers_v2/docker-compose.yml`), service health (`curl <gw>/health`), error rate (Prometheus), recent deploys.
3. Isolate blast radius: there is no gateway "drain wallet" switch. Pick the narrowest lever that contains the incident:

   3a. Platform-wide freeze (broadest, super_admin only). Activates the global `maintenance` record every brand observes:

       # admin_service is reachable directly (not via gateway). $ADMIN = admin base URL, $TOK = super_admin bearer.
       curl -sX PUT "$ADMIN/api/v1/maintenance" -H "Authorization: Bearer $TOK" \
         -H 'Content-Type: application/json' \
         -d '{"is_active": true, "message": "Emergency maintenance - INC-<id>"}'
       # Verify:
       curl -sX GET "$ADMIN/api/v1/maintenance" -H "Authorization: Bearer $TOK"

   3b. Halt deposits only (platform-global global_var). `DEPOSIT_MAINTENANCE_STATUS` lives in the `global_var` table; wallet_service rejects deposits with `Finance maintenance` when it is `1`:

       curl -sX PUT "$ADMIN/api/v1/global-vars/DEPOSIT_MAINTENANCE_STATUS" \
         -H "Authorization: Bearer $TOK" -H 'Content-Type: application/json' -d '{"value": "1"}'
       # SQL fallback if admin_service is down (run on the platform Postgres):
       #   UPDATE global_var SET value='1', updated_at=NOW() WHERE code='DEPOSIT_MAINTENANCE_STATUS';
       # Verify a deposit attempt now returns {"status":1,"msg":"Finance maintenance"}.

   There is no `WITHDRAW_MAINTENANCE_STATUS` global. To stop withdrawals (or to scope a halt to one brand), use the per-brand policy flags below.

   3c. Per-brand deposit/withdraw disable. `feature_policy`/`payment_policy` live in the `brand_config` table keyed by `(brand_id, key)`; wallet reads `deposit_enabled`/`withdraw_enabled` (`wallet_service/app/api/routes/wallet.py`, `queries.py`). Flip the relevant flag to `false` for the affected `brand_id` via the brand-config admin surface (or SQL on `brand_config` for that brand). Verify a deposit/withdraw for that brand returns `Finance maintenance`; other brands stay live.

   3d. Provider/bet traffic. Provider callbacks proxy through the gateway and wallet_service is the single writer, so blocking provider ingress at the gateway halts new authorizations without a wallet-side switch.

   Recovery (reverse, in order, only after root cause contained): set `is_active:false` / `value:"0"` / flags back to `true`, then re-verify a real deposit + withdraw + one provider settle round-trip succeeds and reconciliation (ledger vs. bucket for affected players) is clean before declaring all-clear.
4. Mitigate first, fix second: rollback over forward-fix when the bad commit is < 24h old. This repo builds from source rather than pulling tagged images, so rollback = revert + rebuild:

       # Find the bad commit and revert
       git log --oneline -10
       git revert <bad_sha>
       git push origin main
       # Rebuild and redeploy on the host
       cd servers_v2
       docker compose -f docker-compose.yml --env-file .env.compose.local up -d --build

5. Communicate: status update every 30 min for SEV1/2 even when nothing changes.
6. Resolve + write postmortem. Postmortem deadline: 5 business days for SEV1, 10 for SEV2.
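Recovery reverses the same calls used to isolate, so it can help to see the engage and release payloads for levers 3a and 3b side by side. The following is a dry-run sketch, not an existing tool: `lever_request` is a hypothetical name, it only prints the method, path, and JSON body, and the release body for 3a (`is_active` alone, without `message`) is an assumption about what the endpoint accepts.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints what would be sent to admin_service
# for levers 3a (freeze) and 3b (deposits). It never makes a network call.

lever_request() {
  local lever="$1" state="$2" incident="${3:-INC-<id>}"
  case "${lever}:${state}" in
    freeze:on)
      printf 'PUT /api/v1/maintenance {"is_active": true, "message": "Emergency maintenance - %s"}\n' "$incident" ;;
    freeze:off)
      # Assumption: is_active alone is enough to lift the freeze.
      printf 'PUT /api/v1/maintenance {"is_active": false}\n' ;;
    deposits:on)
      printf 'PUT /api/v1/global-vars/DEPOSIT_MAINTENANCE_STATUS {"value": "1"}\n' ;;
    deposits:off)
      printf 'PUT /api/v1/global-vars/DEPOSIT_MAINTENANCE_STATUS {"value": "0"}\n' ;;
    *)
      echo "usage: lever_request {freeze|deposits} {on|off} [incident-id]" >&2
      return 2 ;;
  esac
}
```

Running `lever_request freeze on INC-42` prints `PUT /api/v1/maintenance {"is_active": true, "message": "Emergency maintenance - INC-42"}`, and `lever_request deposits off` prints the matching release call with `{"value": "0"}`. Printing rather than sending keeps the helper safe to paste into an incident channel for review before anyone runs the real `curl`.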
Diagnostic Playbooks (existing)
- Multi-brand: `docs/runbooks/multi-brand/diagnosis-playbook.md`
- Worker health: `docs/runbooks/worker-health/worker-health-and-settlement-validation.md`
- Local Docker stack: `docs/runbooks/local-docker/local-docker-full-stack.md`
- Wallet topology rollout: `docs/runbooks/wallet/ruby-wallet-topology-rollout.md`
Postmortem Template
# Postmortem: <SEV> <one-line title>
## Date
YYYY-MM-DD
## Authors
- <on-call lead>
- <fix author>
## Summary
<1-3 sentences: what broke, blast radius, how we recovered.>
## Timeline (UTC)
- HH:MM — first symptom
- HH:MM — alert fired / first human noticed
- HH:MM — incident declared (SEV)
- HH:MM — mitigation applied
- HH:MM — fully resolved
## Impact
- users affected:
- revenue / money at risk:
- duration of degraded service:
## Root Cause
<the actual mechanism, not "a bug". Reference commit / config / external dependency.>
## Detection
<what alerted us; if a human noticed first, why did monitoring miss it?>
## Resolution
<what action ended the incident.>
## Action Items (must have owners + due dates)
- [ ] @owner — prevent: <change so this class of failure can't recur>
- [ ] @owner — detect: <monitor / alert that would have caught this earlier>
- [ ] @owner — recover: <runbook / tooling that would have shortened MTTR>
Decision Log Conventions
Every SEV1/2 mitigation that bypassed normal change control needs a
follow-up entry in `docs/adr/` within 5 business days. The entry must
reference the postmortem.
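The postmortem deadlines (5 business days for SEV1, 10 for SEV2) and the ADR follow-up window are all counted in business days, which is easy to miscount across a weekend. A minimal sketch follows, assuming GNU `date`, a Monday to Friday working week, and no holiday calendar. `add_business_days` is a hypothetical helper, and starting the count on the day after the given date is an assumed convention, not something this runbook specifies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add N business days to a YYYY-MM-DD date.
# Assumptions: GNU date; weekends are Saturday and Sunday; holidays are ignored.

add_business_days() {
  local day="$1" remaining="$2"
  while [ "$remaining" -gt 0 ]; do
    # Step forward one calendar day.
    day="$(date -u -d "$day + 1 day" '+%Y-%m-%d')"
    # %u is the ISO weekday number: 6 = Saturday, 7 = Sunday.
    if [ "$(date -u -d "$day" '+%u')" -lt 6 ]; then
      remaining=$((remaining - 1))
    fi
  done
  echo "$day"
}
```

For an incident resolved on 2026-05-06 (a Wednesday), `add_business_days 2026-05-06 5` prints `2026-05-13` and `add_business_days 2026-05-06 10` prints `2026-05-20`.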