Incident Response Runbook

Status

Active

Date

2026-05-06

Owners

Platform Backend (primary)
On-call rotation (acknowledge within 5 min for SEV1, 15 min for SEV2)

Severity Definitions

SEV	Examples	Time to ack	Time to fix
SEV1	Money loss / wallet writer wedged / wallet API down / admin lockout	5 min	1 hour
SEV2	Player-facing outage on a major flow (login, deposit, bet, withdraw)	15 min	4 hours
SEV3	Single non-critical surface degraded (records page slow, recon delay > 1h)	1 hour	24 hours
SEV4	Cosmetic, internal-only, or fully mitigated	next business day	as scheduled
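
For tooling that needs these windows (for example, a bot that flags a late acknowledgement), the table can be encoded as a small lookup. This is only a sketch: the function names sev_ack_minutes and ack_within_window are hypothetical and not part of the repo.

```shell
# Hypothetical helper: map a severity label to its time-to-ack window in minutes,
# using the values from the Severity Definitions table.
sev_ack_minutes() {
  case "$1" in
    SEV1) echo 5 ;;
    SEV2) echo 15 ;;
    SEV3) echo 60 ;;
    # SEV4 is "next business day", which is not a fixed number of minutes.
    *) return 1 ;;
  esac
}

# Succeeds (exit 0) when the acknowledgement landed inside the window.
# Arguments: severity, declared-at epoch seconds, acknowledged-at epoch seconds.
ack_within_window() {
  limit=$(sev_ack_minutes "$1") || return 2
  [ $(( $3 - $2 )) -le $(( limit * 60 )) ]
}
```

Both timestamps are plain epoch seconds, so the check works with whatever the alerting system records, as long as the two values share a clock.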

Standard Response Loop

1. Acknowledge in the on-call channel within the time-to-ack window for the severity.
2. Snapshot context: current commit (git log -1), running env (RGB_ENV, set in servers_v2/docker-compose.yml), service health (curl <gw>/health), error rate (Prometheus), recent deploys.
3. Isolate blast radius. There is no gateway "drain wallet" switch, so pick the narrowest lever that contains the incident:

3a. Platform-wide freeze (broadest, super_admin only). Activates the global maintenance record every brand observes:
```
# admin_service is reachable directly (not via gateway). $ADMIN = admin base URL, $TOK = super_admin bearer.
curl -sX PUT "$ADMIN/api/v1/maintenance" -H "Authorization: Bearer $TOK" \
  -H 'Content-Type: application/json' \
  -d '{"is_active": true, "message": "Emergency maintenance — INC-<id>"}'
# Verify:
curl -sX GET "$ADMIN/api/v1/maintenance" -H "Authorization: Bearer $TOK"
```
3b. Halt deposits only (platform-wide global_var). DEPOSIT_MAINTENANCE_STATUS lives in the global_var table; when it is 1, wallet_service rejects deposits with the message "Finance maintenance":
```
curl -sX PUT "$ADMIN/api/v1/global-vars/DEPOSIT_MAINTENANCE_STATUS" \
  -H "Authorization: Bearer $TOK" -H 'Content-Type: application/json' -d '{"value": "1"}'
# SQL fallback if admin_service is down (run on the platform Postgres):
#   UPDATE global_var SET value='1', updated_at=NOW() WHERE code='DEPOSIT_MAINTENANCE_STATUS';
# Verify a deposit attempt now returns {"status":1,"msg":"Finance maintenance"}.
```
There is no WITHDRAW_MAINTENANCE_STATUS global. To stop withdrawals (or to scope a halt to one brand), use the per-brand policy flags below.
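
The verification in 3b can be scripted so the check is repeatable under pressure. A minimal sketch: the function name is hypothetical, and it assumes the rejection body has the shape quoted in 3b ({"status":1,"msg":"Finance maintenance"}). It uses grep rather than a JSON parser so that it needs nothing beyond a POSIX shell.

```shell
# Hypothetical helper: succeed (exit 0) when a wallet response body carries the
# "Finance maintenance" rejection message, tolerating whitespace around the colon.
is_finance_maintenance() {
  printf '%s' "$1" | grep -Eq '"msg"[[:space:]]*:[[:space:]]*"Finance maintenance"'
}

# Usage sketch: capture the body of a deposit attempt made however you normally
# exercise the deposit flow, then assert on it, e.g.
#   is_finance_maintenance "$body" && echo "deposits halted" || echo "deposits still live"
```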

3c. Per-brand deposit/withdraw disable. feature_policy / payment_policy live in the brand_config table keyed by (brand_id, key); wallet_service reads deposit_enabled / withdraw_enabled (wallet_service/app/api/routes/wallet.py, queries.py). Flip the relevant flag to false for the affected brand_id via the brand-config admin surface (or with SQL on brand_config for that brand). Verify that a deposit or withdrawal for that brand returns "Finance maintenance" and that other brands stay live.
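
If the admin surface is unavailable and SQL is the only option, generating the statement from validated inputs is safer than typing it by hand. The sketch below only prints the statement for review; it does not run it. It rests on assumptions this runbook does not confirm: that each policy is stored as a jsonb document in a column named value, and that the flag sits at the top level of that document. The function name brand_flag_sql is hypothetical, and which policy key holds each flag has to be checked against the real rows.

```shell
# Hypothetical helper: print (do not run) an UPDATE that sets one policy flag for one brand.
# ASSUMPTIONS, not confirmed by this runbook: brand_config stores each policy as a
# jsonb document in a column named "value". Check the real schema before running the output.
# Arguments: brand_id, policy key (feature_policy|payment_policy),
#            flag (deposit_enabled|withdraw_enabled), state (true|false).
brand_flag_sql() {
  case "$1" in ''|*[!A-Za-z0-9_-]*) echo "bad brand_id" >&2; return 1 ;; esac
  case "$2" in feature_policy|payment_policy) ;; *) echo "bad policy key" >&2; return 1 ;; esac
  case "$3" in deposit_enabled|withdraw_enabled) ;; *) echo "bad flag" >&2; return 1 ;; esac
  case "$4" in true|false) ;; *) echo "bad state" >&2; return 1 ;; esac
  printf "UPDATE brand_config SET value = jsonb_set(value, '{%s}', '%s'::jsonb) WHERE brand_id = '%s' AND key = '%s';\n" \
    "$3" "$4" "$1" "$2"
}
```

Every argument is checked against an allow-list before it reaches the statement, so a mistyped or hostile value is rejected instead of being interpolated into SQL.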

3d. Provider/bet traffic. Provider callbacks proxy through the gateway and wallet_service is the single writer, so blocking provider ingress at the gateway halts new authorizations without a wallet-side switch.

Recovery (reverse the levers in order, and only after the root cause is contained): set is_active back to false, value back to "0", and the brand flags back to true. Then confirm that a real deposit, a real withdrawal, and one provider settle round-trip all succeed, and that reconciliation (ledger vs. bucket for affected players) is clean, before declaring all-clear.

4. Mitigate first, fix second: prefer rollback over forward-fix when the bad commit is less than 24 hours old. This repo builds from source rather than pulling tagged images, so rollback means revert and rebuild:

```
# Find the bad commit and revert
git log --oneline -10
git revert <bad_sha>
git push origin main

# Rebuild and redeploy on the host
cd servers_v2
docker compose -f docker-compose.yml --env-file .env.compose.local up -d --build
```
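
The 24-hour rule in this step can be checked mechanically instead of by eye. A minimal sketch: the function name prefer_rollback is hypothetical, and it treats the commit timestamp as a stand-in for when the change shipped, which is an assumption if deploys lag behind merges.

```shell
# Hypothetical helper: succeed (exit 0) when a commit is younger than 24 hours,
# i.e. when this runbook prefers rollback over forward-fix.
# Arguments: commit time and current time, both as Unix epoch seconds.
prefer_rollback() {
  [ $(( $2 - $1 )) -lt 86400 ]
}

# Usage sketch inside the repo (%ct is the committer timestamp in epoch seconds):
#   if prefer_rollback "$(git log -1 --format=%ct <bad_sha>)" "$(date +%s)"; then
#     echo "commit is < 24h old: revert and rebuild"
#   fi
```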

5. Communicate: post a status update every 30 min for SEV1/2, even when nothing has changed.
6. Resolve and write the postmortem. Deadline: 5 business days for SEV1, 10 for SEV2.

Diagnostic Playbooks (existing)

Multi-brand: docs/runbooks/multi-brand/diagnosis-playbook.md
Worker health: docs/runbooks/worker-health/worker-health-and-settlement-validation.md
Local Docker stack: docs/runbooks/local-docker/local-docker-full-stack.md
Wallet topology rollout: docs/runbooks/wallet/ruby-wallet-topology-rollout.md

Postmortem Template

```
# Postmortem: <SEV> <one-line title>

## Date
YYYY-MM-DD

## Authors
- <on-call lead>
- <fix author>

## Summary
<1-3 sentences: what broke, blast radius, how we recovered.>

## Timeline (UTC)
- HH:MM — first symptom
- HH:MM — alert fired / first human noticed
- HH:MM — incident declared (SEV)
- HH:MM — mitigation applied
- HH:MM — fully resolved

## Impact
- users affected:
- revenue / money at risk:
- duration of degraded service:

## Root Cause
<the actual mechanism, not "a bug". Reference commit / config / external dependency.>

## Detection
<what alerted us; if a human noticed first, why did monitoring miss it?>

## Resolution
<what action ended the incident.>

## Action Items (must have owners + due dates)
- [ ] @owner — prevent: <change so this class of failure can't recur>
- [ ] @owner — detect: <monitor / alert that would have caught this earlier>
- [ ] @owner — recover: <runbook / tooling that would have shortened MTTR>
```

Decision Log Conventions

Every SEV1/2 mitigation that bypassed normal change control needs a follow-up entry in docs/adr/ within 5 business days. Reference the postmortem.
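
The business-day deadlines in this runbook (postmortems and decision-log entries) are easy to miscount across a weekend. The sketch below computes a due date under stated assumptions: GNU date is available, the working week is Monday to Friday, there is no holiday calendar, and the start day itself is not counted. The function name add_business_days is hypothetical.

```shell
# Hypothetical helper: print the date N business days after a start date (YYYY-MM-DD).
# ASSUMPTIONS: GNU date (the -d option), Monday-Friday working week, no holiday
# calendar, and the start day itself is not counted.
add_business_days() {
  day="$1"
  remaining="$2"
  while [ "$remaining" -gt 0 ]; do
    day=$(date -u -d "$day +1 day" +%F)
    # %u is the ISO weekday: 1 = Monday ... 7 = Sunday.
    if [ "$(date -u -d "$day" +%u)" -le 5 ]; then
      remaining=$(( remaining - 1 ))
    fi
  done
  printf '%s\n' "$day"
}

# Usage sketch: due date for a SEV1 postmortem on an incident resolved today.
#   add_business_days "$(date -u +%F)" 5
```

Public holidays would need a separate exclusion list, which this sketch deliberately leaves out.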
