Incident Response Runbook

Status

Active

Date

2026-05-06

Owners

  • Platform Backend (primary)
  • On-call rotation (acknowledge within 5 min for SEV1, 15 min for SEV2)

Severity Definitions

| SEV  | Examples | Time to ack | Time to fix |
|------|----------|-------------|-------------|
| SEV1 | Money loss / wallet writer wedged / wallet API down / admin lockout | 5 min | 1 hour |
| SEV2 | Player-facing outage on a major flow (login, deposit, bet, withdraw) | 15 min | 4 hours |
| SEV3 | Single non-critical surface degraded (records page slow, recon delay > 1h) | 1 hour | 24 hours |
| SEV4 | Cosmetic, internal-only, or fully mitigated | next business day | as scheduled |
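
The severity windows can be encoded as a small lookup so that scripts or chat tooling can compute an acknowledgement deadline. This is a minimal sketch, not existing tooling: the function names sev_windows and ack_deadline are hypothetical, times are expressed in minutes, and SEV4 is reported as n/a because its windows are not fixed durations.

```shell
# sev_windows SEV
# Hypothetical helper: prints "<ack_minutes> <fix_minutes>" for a severity,
# using the values from the severity definitions table.
sev_windows() {
  case "$1" in
    SEV1) echo "5 60" ;;
    SEV2) echo "15 240" ;;
    SEV3) echo "60 1440" ;;
    SEV4) echo "n/a n/a" ;;   # next business day / as scheduled
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# ack_deadline SEV DECLARED_EPOCH
# Prints the epoch second by which the incident must be acknowledged,
# or "n/a" when the severity has no fixed ack window.
ack_deadline() {
  windows=$(sev_windows "$1") || return 1
  ack_min=${windows%% *}      # first field: ack window in minutes
  case "$ack_min" in
    n/a) echo "n/a" ;;
    *)   echo $(( $2 + $ack_min * 60 )) ;;
  esac
}
```

For example, ack_deadline SEV1 "$(date +%s)" prints the epoch second five minutes from now.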

Standard Response Loop

  1. Acknowledge in the on-call channel within the SEV-defined window.

  2. Snapshot context — current commit (git log -1), running env (RGB_ENV, set in servers_v2/docker-compose.yml), service health (curl <gw>/health), error rate (Prometheus), recent deploys.

  3. Isolate blast radius — there is no gateway "drain wallet" switch. Pick the narrowest lever that contains the incident:

    3a. Platform-wide freeze (broadest, super_admin only). Activates the global maintenance record every brand observes:

    # admin_service is reachable directly (not via gateway). $ADMIN = admin base URL, $TOK = super_admin bearer.
    curl -sX PUT "$ADMIN/api/v1/maintenance" -H "Authorization: Bearer $TOK" \
    -H 'Content-Type: application/json' \
    -d '{"is_active": true, "message": "Emergency maintenance — INC-<id>"}'
    # Verify:
    curl -sX GET "$ADMIN/api/v1/maintenance" -H "Authorization: Bearer $TOK"

    3b. Halt deposits only (platform-wide). DEPOSIT_MAINTENANCE_STATUS lives in the global_var table; when its value is 1, wallet_service rejects deposits with the message "Finance maintenance":

    curl -sX PUT "$ADMIN/api/v1/global-vars/DEPOSIT_MAINTENANCE_STATUS" \
    -H "Authorization: Bearer $TOK" -H 'Content-Type: application/json' -d '{"value": "1"}'
    # SQL fallback if admin_service is down (run on the platform Postgres):
    # UPDATE global_var SET value='1', updated_at=NOW() WHERE code='DEPOSIT_MAINTENANCE_STATUS';
    # Verify a deposit attempt now returns {"status":1,"msg":"Finance maintenance"}.

    There is no WITHDRAW_MAINTENANCE_STATUS global. To stop withdrawals (or to scope a halt to one brand), use the per-brand policy flags below.

    3c. Per-brand deposit/withdraw disable. feature_policy / payment_policy live in the brand_config table keyed by (brand_id, key); wallet reads deposit_enabled / withdraw_enabled (wallet_service/app/api/routes/wallet.py, queries.py). Flip the relevant flag to false for the affected brand_id via the brand-config admin surface (or SQL on brand_config for that brand). Verify that a deposit/withdraw for that brand returns "Finance maintenance" and that other brands stay live.

    3d. Provider/bet traffic. Provider callbacks proxy through the gateway and wallet_service is the single writer, so blocking provider ingress at the gateway halts new authorizations without a wallet-side switch.

    Recovery (reverse each lever, in order, only after the root cause is contained): set is_active:false / value:"0" / flags back to true. Then re-verify that a real deposit, a withdraw, and one provider settle round-trip all succeed, and that reconciliation (ledger vs. bucket for affected players) is clean, before declaring all-clear.

  4. Mitigate first, fix second — rollback over forward-fix when the bad commit is < 24h old. This repo builds from source rather than pulling tagged images, so rollback = revert + rebuild:

    # Find the bad commit and revert
    git log --oneline -10
    git revert <bad_sha>
    git push origin main

    # Rebuild and redeploy on the host
    cd servers_v2
    docker compose -f docker-compose.yml --env-file .env.compose.local up -d --build

  5. Communicate — status update every 30 min for SEV1/2 even when nothing changes.

  6. Resolve + write postmortem — postmortem deadline: 5 business days for SEV1, 10 for SEV2.
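
Step 2 (snapshot context) can be collected in one pass so the output can be pasted into the on-call channel. The following is a sketch under stated assumptions: snapshot_context is a hypothetical function name, its argument stands in for the <gw> gateway base URL, and it is meant to be run from a checkout of the repo. Error rate and recent deploys still have to be read from Prometheus and the deploy history by hand.

```shell
# snapshot_context GATEWAY_URL
# Prints the shell-collectable facts from step 2: timestamp, current commit,
# running env (RGB_ENV), and the HTTP status of the gateway health endpoint.
snapshot_context() {
  gw=$1
  echo "time_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "commit:   $(git log -1 --format='%h %s' 2>/dev/null || echo unknown)"
  echo "env:      ${RGB_ENV:-unset}"
  # -w '%{http_code}' prints only the status code; 000 means no HTTP response.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$gw/health") || code=000
  echo "health:   ${code:-000}"
}
```

A health value other than 200 is a prompt to check the service directly; it does not by itself identify which service behind the gateway is failing.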

Diagnostic Playbooks (existing)

  • Multi-brand: docs/runbooks/multi-brand/diagnosis-playbook.md
  • Worker health: docs/runbooks/worker-health/worker-health-and-settlement-validation.md
  • Local Docker stack: docs/runbooks/local-docker/local-docker-full-stack.md
  • Wallet topology rollout: docs/runbooks/wallet/ruby-wallet-topology-rollout.md

Postmortem Template

# Postmortem: <SEV> <one-line title>

## Date
YYYY-MM-DD

## Authors
- <on-call lead>
- <fix author>

## Summary
<1-3 sentences: what broke, blast radius, how we recovered.>

## Timeline (UTC)
- HH:MM — first symptom
- HH:MM — alert fired / first human noticed
- HH:MM — incident declared (SEV)
- HH:MM — mitigation applied
- HH:MM — fully resolved

## Impact
- users affected:
- revenue / money at risk:
- duration of degraded service:

## Root Cause
<the actual mechanism, not "a bug". Reference commit / config / external dependency.>

## Detection
<what alerted us; if a human noticed first, why did monitoring miss it?>

## Resolution
<what action ended the incident.>

## Action Items (must have owners + due dates)
- [ ] @owner — prevent: <change so this class of failure can't recur>
- [ ] @owner — detect: <monitor / alert that would have caught this earlier>
- [ ] @owner — recover: <runbook / tooling that would have shortened MTTR>

Decision Log Conventions

Every SEV1/2 mitigation that bypassed normal change control needs a follow-up entry in docs/adr/ within 5 business days. Reference the postmortem.
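
The business-day deadlines in this runbook (postmortem: 5 days for SEV1, 10 for SEV2; ADR follow-up: 5 days) can be computed rather than counted by hand. The sketch below is an illustration, not existing tooling: add_business_days and postmortem_deadline are hypothetical names, weekends are assumed to be Saturday and Sunday in UTC, and public holidays are ignored.

```shell
# add_business_days START_EPOCH N
# Prints the epoch second N business days after START_EPOCH (UTC),
# skipping Saturdays and Sundays. Public holidays are NOT handled.
add_business_days() {
  secs=$1
  remaining=$2
  while [ "$remaining" -gt 0 ]; do
    secs=$((secs + 86400))
    # 1970-01-01 was a Thursday, so (days + 4) % 7 gives 0=Sun .. 6=Sat.
    dow=$(( (secs / 86400 + 4) % 7 ))
    if [ "$dow" -ne 0 ] && [ "$dow" -ne 6 ]; then
      remaining=$((remaining - 1))
    fi
  done
  echo "$secs"
}

# postmortem_deadline SEV DECLARED_EPOCH
# Applies the postmortem windows from step 6 of the response loop.
postmortem_deadline() {
  case "$1" in
    SEV1) add_business_days "$2" 5 ;;
    SEV2) add_business_days "$2" 10 ;;
    *)    echo "no postmortem deadline defined for $1" >&2; return 1 ;;
  esac
}
```

For the ADR follow-up, add_business_days "$(date +%s)" 5 gives the corresponding deadline. With GNU date, the result can be rendered as a calendar date via date -u -d @<epoch>.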