Ruby Wallet Topology Rollout Runbook

Status

Approved

Last Approved: 2026-05-15 (HEAD=3f36e7a8)

Date

2026-04-23

Owners

  • Platform Backend
  • Wallet Domain

Affected Services

  • wallet_service
  • wallet_worker
  • rolling_service
  • rolling_worker
  • promotion_service
  • promotion_worker
  • game_service
  • gateway
  • admin_service

Related Documents

  • docs/adr/ADR-005-wallet-topology-bucket-ledger-model.md
  • docs/specs/wallet/2026-04-23-ruby-wallet-split-structure.md
  • docs/plans/wallet/2026-04-23-ruby-wallet-split-implementation.md
  • docs/runbooks/local-docker/local-docker-full-stack.md

Goal

Validate and roll out the topology-driven wallet model safely after the implementation is complete.

This runbook is not approved for production use until the related spec, implementation plan, tests, and migrations are complete.

Preconditions

  • ADR-005 remains accepted.
  • Wallet spec status is Approved.
  • The implementation plan's definition of done is satisfied.
  • Database migrations are reviewed and reversible for non-production environments.
  • Fresh local or staging data exists, or a signed data migration plan exists.
  • No service writes legacy flat wallet fields as canonical money state.
  • wallet_service owns all wallet bucket, coupon grant, policy, authorization, transfer, and ledger writes.
  • Worker health checks reflect forward progress according to ADR-003.
  • Promotion settlement failures surface according to ADR-004.
  • Alerting exists for wallet ledger failures, idempotency mismatches, missing bet authorizations, negative-balance prevention, and rolling creation failures after wallet credit.

Local Docker Validation

  1. Start the full local stack using the local Docker runbook.
  2. Apply wallet topology migrations.
  3. Load exactly one topology and policy seed for the target product mode: RUBY_SPLIT_V1 for separate sports and casino wallets (normal and bonus each), or RUBY_UNIFIED_V1 for a single normal wallet and a single bonus wallet shared across all provider types.
  4. Confirm wallet_service, rolling_service, promotion_service, game_service, gateway, and admin_service health is green.
  5. Confirm wallet workers report forward-progress readiness.
  6. Run unit and contract tests for:
    • wallet topology validation
    • wallet policy engine
    • wallet API commands
    • rolling attribution
    • promotion points credit
    • coupon grant saga
    • gateway split-balance routes
    • admin topology and policy validation
  7. Run local end-to-end flows:
    • sports normal deposit, sports bet, settlement, rolling, withdrawable
    • casino normal deposit, live bet, settlement, rolling, withdrawable
    • slots bet using casino policy
    • sports bonus and casino bonus active at the same time
    • sports-only coupon accepted for sports and rejected for live/slots
    • casino-only coupon accepted for live/slots and rejected for sports
    • provider-only coupon accepted only for configured provider IDs
    • points credit, points transfer to every configured normal-wallet target, generated rolling
    • split topology only: sports normal to casino normal transfer with rolling inheritance
    • split topology only: casino normal to sports normal transfer with rolling inheritance
    • unified topology only: legacy normal-wallet transfer aliases return a no-op response because there is no cross-normal transfer to perform
    • withdrawable-funded bet under every configured withdrawable policy
    • bet rollback restores exact authorization funding breakdown
  8. Confirm every money movement has immutable wallet_ledger rows.
  9. Confirm every settled or rolled-back bet has a stored wallet_bet_authorization.
  10. Confirm policy and topology snapshots exist on transactional records.
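The "bet rollback restores exact authorization funding breakdown" flow in step 7 can be illustrated with a small sketch. It assumes the stored wallet_bet_authorization records how much each bucket contributed at authorization time. The BetAuthorization class, the rollback_bet function, and the bucket codes are hypothetical names for illustration only, not the wallet_service API.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BetAuthorization:
    """Hypothetical stand-in for a stored wallet_bet_authorization row."""

    bet_id: str
    # Amount taken from each bucket when the bet was authorized.
    funding: dict[str, Decimal]


def rollback_bet(balances: dict[str, Decimal], auth: BetAuthorization) -> dict[str, Decimal]:
    """Return new balances with the stake restored to its original source buckets.

    The breakdown comes only from the stored authorization; current balances
    are never used to decide where the money goes back.
    """
    restored = dict(balances)
    for bucket, amount in auth.funding.items():
        restored[bucket] = restored.get(bucket, Decimal("0")) + amount
    return restored


# Illustrative data: a 15.00 stake funded 10.00 from normal and 5.00 from bonus.
balances = {"SPORTS_NORMAL": Decimal("40.00"), "SPORTS_BONUS": Decimal("0.00")}
auth = BetAuthorization(
    bet_id="bet-1",
    funding={"SPORTS_NORMAL": Decimal("10.00"), "SPORTS_BONUS": Decimal("5.00")},
)
after = rollback_bet(balances, auth)
```

Because the split is read from the authorization record, the result is the same no matter how balances changed between authorization and rollback, which is the property the release gate on recomputing funding from current balances protects.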

Staging Validation

  1. Deploy services and migrations to staging.
  2. Activate the selected topology only after seed validation succeeds.
  3. Run the end-to-end flows from Local Docker Validation against staging.
  4. Compare wallet bucket totals with ledger-derived totals.
  5. Confirm admin policy CRUD creates immutable policy versions.
  6. Confirm topology activation is blocked when required policy or bucket definitions are missing.
  7. Confirm game provider callbacks cannot override server-side funding policy.
  8. Confirm operational dashboards show wallet command success/failure rates, ledger latency, coupon rejection reasons, transfer rejection reasons, and rolling creation failures.
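Step 4 (comparing wallet bucket totals with ledger-derived totals) could be scripted roughly as follows. This is a minimal sketch under stated assumptions: rows are already loaded into memory, each ledger row carries a signed delta, and the key mirrors the (brand_id, player_id, topology_code, bucket_type_code) key named in the production rollback procedure. The function names, bucket codes, and sample data are hypothetical, and the real reconciliation task may compare against a ledger tip rather than summing deltas.

```python
from collections import defaultdict
from decimal import Decimal

# (brand_id, player_id, topology_code, bucket_type_code)
Key = tuple[str, str, str, str]


def ledger_totals(ledger_rows):
    """Sum signed ledger deltas per key (assumed ledger shape)."""
    totals = defaultdict(Decimal)  # Decimal() is Decimal("0")
    for key, delta in ledger_rows:
        totals[key] += delta
    return dict(totals)


def find_drift(bucket_balances, ledger_rows):
    """Return {key: (stored_balance, ledger_derived_balance)} for every mismatch."""
    derived = ledger_totals(ledger_rows)
    drift = {}
    # Check keys from both sides so a bucket with no ledger rows is also caught.
    for key in bucket_balances.keys() | derived.keys():
        stored = bucket_balances.get(key, Decimal("0"))
        expected = derived.get(key, Decimal("0"))
        if stored != expected:
            drift[key] = (stored, expected)
    return drift


# Illustrative data only.
k1 = ("brand-a", "player-1", "RUBY_SPLIT_V1", "SPORTS_NORMAL")
k2 = ("brand-a", "player-1", "RUBY_SPLIT_V1", "CASINO_NORMAL")
ledger = [(k1, Decimal("100.00")), (k1, Decimal("-10.00")), (k2, Decimal("25.00"))]
buckets = {k1: Decimal("90.00"), k2: Decimal("20.00")}
drift = find_drift(buckets, ledger)
```

An empty result means the two views agree. Any entry should be investigated rather than corrected in place.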

Release Gate

Do not enable the new wallet model in any shared or production-like environment if any of the following are true:

  • wallet topology or policy tests are missing or failing
  • local Docker end-to-end flows are failing
  • staging ledger totals do not match bucket totals
  • any money movement bypasses wallet_service
  • any settlement or rollback path recomputes funding from current balances
  • coupon grants are collapsed into a generic coupon balance
  • points can be bet or withdrawn directly
  • topology activation can leave active balances or records unreachable
  • alerting for critical wallet failures is missing

Rollback Trigger

Pause rollout or roll back immediately if:

  • ledger writes fail or are delayed beyond the release threshold
  • bucket totals diverge from ledger-derived totals
  • accepted bets cannot be settled because authorization records are missing
  • rollback restores a different source than the original authorization
  • topology or policy activation changes current-balance rows unexpectedly
  • provider callbacks start failing because wallet policy cannot be resolved
  • workers report healthy while the wallet outbox or rolling creation is stalled

Rollback Steps

For non-production environments:

  1. Stop provider-facing traffic to the new wallet command paths.
  2. Stop workers that consume wallet events.
  3. Export wallet topology, policy, bucket, authorization, transfer, coupon grant, and ledger tables for investigation.
  4. Deactivate the candidate topology if no accepted bets or unresolved wallet records depend on it.
  5. Restore from the pre-rollout database snapshot if destructive schema or seed changes were applied.
  6. Restart services and workers.
  7. Re-run health checks and local/staging smoke checks.

Production Rollback Procedure

Production rollback is gated by the incident commander and is tiered: pick the lowest tier that resolves the incident. Tiers 1 and 2 are non-destructive and preferred; Tier 3 (snapshot restore) is money-lossy and is the last resort. In every tier, money is frozen before any state change.

Step 0 — Freeze money (all tiers). Apply the narrowest containment lever from incident-response.md §3 (platform PUT /api/v1/maintenance, or DEPOSIT_MAINTENANCE_STATUS via PUT /api/v1/global-vars/{key}, or per-brand feature_policy/payment_policy). Pause provider ingress at the gateway so no new /v2/bets/authorize lands. Record the freeze timestamp.

Step 1: Drain, don't drop, in-flight money events. Do not kill wallet_worker immediately. Query wallet_outbox and wait until it has no rows with published_at IS NULL, so that no settled-bet or deposit event is lost. Then stop the wallet-event consumers (rolling_service and promotion_service workers). The outbox and inbox idempotency design makes a later resume replay-safe, but it cannot recover an unpublished row that was dropped.
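The drain wait in Step 1 could be scripted along these lines. This is a sketch, not the production tooling: it uses an in-memory SQLite database as a stand-in for the real wallet database, and the function names, the id column, the timeout, and the polling interval are assumptions. Only the wallet_outbox table name and the published_at IS NULL condition come from this runbook.

```python
import sqlite3
import time


def unpublished_count(conn):
    """Count outbox rows that have not been published yet."""
    row = conn.execute(
        "SELECT COUNT(*) FROM wallet_outbox WHERE published_at IS NULL"
    ).fetchone()
    return row[0]


def wait_for_drain(conn, timeout_s=300.0, poll_s=2.0, sleep=time.sleep):
    """Poll until the outbox is empty. Return True if drained, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if unpublished_count(conn) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Stand-in database with one published and one unpublished row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wallet_outbox (id INTEGER PRIMARY KEY, published_at TEXT)")
conn.execute(
    "INSERT INTO wallet_outbox (published_at) VALUES ('2026-05-15T10:00:00Z'), (NULL)"
)
drained_before = wait_for_drain(conn, timeout_s=0)  # one row still pending

# Simulate wallet_worker publishing the remaining row.
conn.execute(
    "UPDATE wallet_outbox SET published_at = '2026-05-15T10:00:05Z' "
    "WHERE published_at IS NULL"
)
drained_after = wait_for_drain(conn, timeout_s=0)
```

A False result means the consumers should stay running and the stall should be escalated; stopping them at that point risks losing the pending rows.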

Tier 1 — Code rollback (bad commit < 24h, no schema/seed change). This is the common case. No data rollback needed — the ledger is authoritative and idempotency keys make reprocessing safe.

  1. git revert <bad_sha> && git push origin main (repo builds from source — see incident-response.md Step 4).
  2. Redeploy affected services + workers from the reverted commit.
  3. Go to Step 4 (Verify).

Tier 2 — Topology/policy activation rollback (no destructive schema). Only the active wallet_topology / wallet_policy pointer changed.

  1. Export wallet_topology, wallet_policy, wallet_bucket, wallet_bet_authorization, wallet_transfer, wallet_coupon_grant, wallet_ledger for forensics.
  2. Re-activate the previous topology only if no accepted bet or unresolved authorization references the candidate topology (check wallet_bet_authorization for status='AUTHORIZED' rows on the candidate topology — if any exist, resolve/settle/rollback them first; activating away from a topology with live authorizations strands those stakes).
  3. Confirm reconciliation is clean after re-activation: force a cycle (or wait one hourly tick) and verify wallet_ledger_reconciliation_drift_buckets == 0 — the wallet_service/app/tasks/ledger_reconciliation.py check joins buckets and coupon grants to their ledger tip on the full (brand_id, player_id, topology_code, bucket_type_code) key, so a topology re-point that strands or mismatches any balance shows up here.
  4. Go to Step 4 (Verify).
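The authorization check in step 2 of Tier 2 might look like the sketch below. It runs against an in-memory SQLite stand-in. The status = 'AUTHORIZED' filter comes from this runbook; the bet_id and topology_code column names on wallet_bet_authorization, the 'SETTLED' status value, the function names, and the sample rows are assumptions for illustration.

```python
import sqlite3


def live_authorizations(conn, topology_code):
    """Return bet IDs still in AUTHORIZED state on the given topology."""
    rows = conn.execute(
        "SELECT bet_id FROM wallet_bet_authorization "
        "WHERE status = 'AUTHORIZED' AND topology_code = ? "
        "ORDER BY bet_id",
        (topology_code,),
    ).fetchall()
    return [row[0] for row in rows]


def safe_to_reactivate_previous(conn, candidate_topology):
    """True only when no live authorization would be stranded."""
    return not live_authorizations(conn, candidate_topology)


# Stand-in data: one live and one settled bet on the candidate topology.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wallet_bet_authorization "
    "(bet_id TEXT PRIMARY KEY, topology_code TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO wallet_bet_authorization VALUES (?, ?, ?)",
    [
        ("bet-1", "RUBY_SPLIT_V1", "AUTHORIZED"),
        ("bet-2", "RUBY_SPLIT_V1", "SETTLED"),
        ("bet-3", "RUBY_UNIFIED_V1", "AUTHORIZED"),
    ],
)
blocking = live_authorizations(conn, "RUBY_SPLIT_V1")
can_reactivate = safe_to_reactivate_previous(conn, "RUBY_SPLIT_V1")
```

Any bet ID returned here has to be resolved, settled, or rolled back before the previous topology is re-activated.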

Tier 3 — Snapshot restore (LAST RESORT — money-lossy). Only if destructive schema/seed changes were applied and Tiers 1–2 cannot recover. Requires incident-commander + finance sign-off because every money mutation after the snapshot timestamp is lost and must be reconciled manually against provider records.

  1. Confirm a verified pre-rollout snapshot exists and its timestamp.
  2. Capture a current full dump first (post-incident forensics + manual replay source).
  3. Restore the snapshot.
  4. Manually reconcile provider settlements and deposits that occurred between the snapshot and the freeze against provider statements before unfreezing.
  5. Go to Step 4 (Verify).

Step 4 — Verify before all-clear (all tiers).

  1. Restart services and workers; confirm GET /health is green for every service and that wallet_worker and player_worker heartbeats are fresh.
  2. Run the Smoke Checks below.
  3. Force a reconciliation cycle (or wait one hourly tick) and confirm wallet_ledger_reconciliation_drift_buckets == 0 (this single gauge covers both the bucket and coupon-grant segments).
  4. Execute one real deposit → bet authorize → settle → withdraw round-trip on a canary account; confirm balances and ledger tip agree.
  5. Only then reverse Step 0 (lift maintenance/flags, restore provider ingress) and declare all-clear.

Never auto-correct a reconciliation drift as part of rollback — investigate root cause first. Auto-correcting an unexplained drift converts a detectable accounting bug into silent money loss.

Smoke Checks

  • GET /health succeeds for every affected service.
  • wallet topology read endpoint returns the active topology and version.
  • wallet policy read endpoint returns the active policy versions.
  • player wallet snapshot includes topology code and version.
  • a small sports test bet authorizes, settles, and produces ledger rows.
  • a small casino test bet authorizes, settles, and produces ledger rows.
  • a duplicate request ID with a matching payload is handled idempotently.
  • a duplicate request ID with a different payload is rejected.
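
The two duplicate-request checks above describe behavior that can be sketched as follows. This is an illustrative in-memory model, not the wallet_service implementation: the IdempotentCommands class, the IdempotencyMismatch exception, the SHA-256 payload fingerprint, and the choice to return the stored result on a matching replay are all assumptions.

```python
import hashlib
import json


class IdempotencyMismatch(Exception):
    """Raised when a request ID is reused with a different payload."""


class IdempotentCommands:
    def __init__(self):
        self._seen = {}  # request_id -> (payload fingerprint, stored result)

    @staticmethod
    def _fingerprint(payload):
        # Canonical JSON so key order does not change the fingerprint.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, request_id, payload, handler):
        fingerprint = self._fingerprint(payload)
        if request_id in self._seen:
            stored_fingerprint, result = self._seen[request_id]
            if stored_fingerprint != fingerprint:
                raise IdempotencyMismatch(request_id)
            return result  # replay: the handler is not run again
        result = handler(payload)
        self._seen[request_id] = (fingerprint, result)
        return result


calls = []


def authorize(payload):
    """Toy handler that records each real execution."""
    calls.append(payload)
    return {"status": "AUTHORIZED", "amount": payload["amount"]}


store = IdempotentCommands()
first = store.execute("req-1", {"amount": "5.00", "bet_id": "bet-1"}, authorize)
second = store.execute("req-1", {"bet_id": "bet-1", "amount": "5.00"}, authorize)
```

A smoke check built on this idea sends the same request twice and expects one execution, then sends the same request ID with a changed amount and expects a rejection.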