Recon Service Cutover Runbook
Status
Ready
Goal
Roll out recon_service and recon_worker into servers_v2 without breaking existing back-office shooter and Pushbullet workflows.
Preconditions
- compatibility contract tests are green
- worker freshness health tests are green
- compose contract tests include recon service and worker
- admin top-info aggregation has been updated and tested
- /shooter/device/* has been explicitly classified
- the implemented v2 surface has been frozen in docs/reference/recon/recon-service-v2-surface.md
Environment Requirements
Required configuration classes:
- database access for shooter-related tables
- Redis access
- Pushbullet credentials
- Telegram bot token if Telegram ingestion is enabled
- OpenRouter configuration if AI regex generation is enabled
- per-caller internal tokens configured for the recon/admin path:
  - RGB_INTERNAL_SERVICE_TOKEN_ADMIN on consumers that receive calls from admin_service
  - RGB_INTERNAL_SERVICE_TOKEN_RECON on consumers that receive calls from recon_worker
  - RGB_PER_CALLER_TOKEN_REQUIRED=on once the Stage C hard flip is active
- DingTalk or equivalent alert channel
- RECON_SERVICE_URL configured in admin_service
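A preflight check for the settings above can be sketched as follows. This is a minimal illustration, not the services' actual startup validation: the helper name missing_settings and the ADMIN_SERVICE_REQUIRED grouping are assumptions made for this example, and only the variable names themselves come from this runbook.

```python
from typing import Iterable, Mapping


def missing_settings(env: Mapping[str, str], required: Iterable[str]) -> list[str]:
    """Return the required names that are unset or blank in env, in input order."""
    return [name for name in required if not env.get(name, "").strip()]


# Hypothetical grouping for illustration only; adjust it to what each
# service actually reads at startup.
ADMIN_SERVICE_REQUIRED = [
    "RECON_SERVICE_URL",
    "RGB_INTERNAL_SERVICE_TOKEN_ADMIN",
    "RGB_INTERNAL_SERVICE_TOKEN_RECON",
]
```

Calling missing_settings(os.environ, ADMIN_SERVICE_REQUIRED) before deploy returns an empty list when everything is present, and otherwise names the settings to fix without printing their values.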
Local Verification
- Start local dependencies and services.
    cd servers_v2
    docker compose -f docker-compose.yml -f docker-compose.dev.yml --env-file .env.compose.local up -d --build
- Confirm admin_service health is green.
- Confirm recon_service health is green.
- Confirm recon_worker health is green.
- Exercise these back-office routes through admin_service:
  - template list/add/delete
  - phone whitelist list/add/delete
  - SMS list/get/reset/check
  - Pushbullet device list/add/delete
  - shooter device list/trust/delete
- Confirm Authorization header refresh still appears on successful legacy responses.
- Confirm admin top-info shows sms_need_check_cnt.
- Run verification test suites, starting each command from the repository root:
    cd servers_v2/admin_service && uv run pytest
    cd servers_v2/recon_service && uv run pytest
    cd servers_v2/rolling_service && uv run pytest
    cd servers_v2/wallet_service && uv run pytest
    cd servers_v2 && uv run pytest tests/test_compose_contract.py
Staging Verification
Before production cutover:
- compare old and new template counts
- compare pending SMS counts
- compare parse-failure and unmatched counts
- compare top-info values over the same time window
- verify that forced manual review updates the same logical fields
- verify that final deposit approval still flows through wallet-owned paths only
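The count comparisons above can be sketched as a small parity diff. This is an illustration under assumptions: the function name parity_diff, the metric keys, and the tolerance parameter are hypothetical, and how the old and new counts are collected is left to the operator.

```python
from typing import Mapping, Optional


def parity_diff(
    old: Mapping[str, int],
    new: Mapping[str, int],
    tolerance: int = 0,
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return {metric: (old, new)} for every metric that is missing on one
    side or whose values differ by more than tolerance."""
    diffs: dict[str, tuple[Optional[int], Optional[int]]] = {}
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name), new.get(name)
        if a is None or b is None or abs(a - b) > tolerance:
            diffs[name] = (a, b)
    return diffs


# Hypothetical metric names, sampled over the same time window on both sides.
old_counts = {"template_count": 12, "pending_sms": 3, "parse_failures": 1}
new_counts = {"template_count": 12, "pending_sms": 5}
```

An empty result means parity for the sampled window. A metric missing from one side is reported with None rather than treated as zero, so an unimplemented counter is not mistaken for a clean one.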
Production Rollout Order
Current implementation note:
- admin_service compatibility routes already proxy to recon_service
- there is no runtime feature flag today that can enable or disable the proxy path independently
Deploy and verify the recon path as one release unit:
- Deploy the target admin_service, recon_service, and recon_worker builds together.
- Verify internal auth:
  - admin_service calls recon_service with X-Caller-Service: admin and RGB_INTERNAL_SERVICE_TOKEN_ADMIN
  - recon_worker calls admin_service /internal/meta/ws/sync with X-Caller-Service: recon and RGB_INTERNAL_SERVICE_TOKEN_RECON
  - in Stage C, the bare RGB_INTERNAL_SERVICE_TOKEN is absent or empty and RGB_PER_CALLER_TOKEN_REQUIRED=on is set on every internal-token consumer
- Verify both health endpoints and worker freshness metrics.
- Monitor:
  - compatibility route errors
  - worker freshness
  - pending SMS backlog
  - parse failure count
  - match failure count
  - top-info update lag
- Keep the previous known-good deployment set ready for full service rollback until parity is confirmed.
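The internal-auth conditions in the rollout steps can be checked mechanically against a consumer's environment. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it only inspects configuration. It does not show how the token is transmitted on the wire, because this runbook does not specify that.

```python
from typing import Mapping

# Caller identity (the X-Caller-Service value) mapped to its token variable,
# as listed in the rollout steps.
CALLER_TOKEN_VARS = {
    "admin": "RGB_INTERNAL_SERVICE_TOKEN_ADMIN",
    "recon": "RGB_INTERNAL_SERVICE_TOKEN_RECON",
}


def stage_c_violations(env: Mapping[str, str]) -> list[str]:
    """List the Stage C conditions that this environment does not satisfy."""
    problems = []
    if env.get("RGB_INTERNAL_SERVICE_TOKEN", "").strip():
        problems.append("RGB_INTERNAL_SERVICE_TOKEN must be absent or empty")
    if env.get("RGB_PER_CALLER_TOKEN_REQUIRED", "") != "on":
        problems.append("RGB_PER_CALLER_TOKEN_REQUIRED must be on")
    return problems


def token_var_for(caller: str) -> str:
    """Return the token variable name for a caller; raises KeyError if unknown."""
    return CALLER_TOKEN_VARS[caller]
```

Running stage_c_violations against each internal-token consumer gives a per-service list of what still blocks the hard flip; an empty list means the two Stage C conditions hold for that consumer.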
Rollback
Rollback triggers:
- compatibility route failure spikes
- worker health stale or unhealthy
- pending SMS backlog grows unexpectedly
- top-info mismatch or absence of sms_need_check_cnt
- approval flow bypasses expected wallet path
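The triggers above could be evaluated from a metrics snapshot roughly as follows. Everything here beyond the trigger list itself is an assumption for illustration: the Snapshot fields, the threshold defaults, and the function name are hypothetical, and real thresholds should come from the team's alerting rules. Top-info value mismatch is not covered; only the absence of sms_need_check_cnt is checked.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical metrics snapshot; field names are illustrative only."""
    compat_error_rate: float          # fraction of failed compatibility-route calls
    worker_heartbeat_age_s: float     # seconds since the worker last reported
    pending_sms: int                  # current pending SMS backlog
    pending_sms_baseline: int         # backlog level considered normal
    top_info: dict = field(default_factory=dict)
    approvals_outside_wallet: int = 0


def rollback_triggers(
    s: Snapshot,
    max_error_rate: float = 0.05,       # assumed threshold
    max_heartbeat_age_s: float = 120.0,  # assumed threshold
    backlog_factor: float = 2.0,         # assumed threshold
) -> list[str]:
    """Return the names of the rollback triggers that fire for this snapshot."""
    fired = []
    if s.compat_error_rate > max_error_rate:
        fired.append("compatibility route failure spike")
    if s.worker_heartbeat_age_s > max_heartbeat_age_s:
        fired.append("worker health stale")
    if s.pending_sms > s.pending_sms_baseline * backlog_factor:
        fired.append("pending SMS backlog growth")
    if "sms_need_check_cnt" not in s.top_info:
        fired.append("sms_need_check_cnt missing from top-info")
    if s.approvals_outside_wallet > 0:
        fired.append("approval bypassed wallet path")
    return fired
```

Any non-empty result is a reason to start the rollback steps rather than to debug in place.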
Rollback steps:
- Roll back admin_service, recon_service, and recon_worker to the previous known-good release set.
- Do not assume a runtime repoint or cutover flag exists unless it has been implemented and documented separately.
- If immediate full rollback is not possible, isolate the faulty recon release from traffic by reverting the deployment or service image, not by editing undocumented runtime routing.
- Preserve logs, queue state, and DB snapshots needed for analysis.
- Open a follow-up incident doc before retrying cutover.
Post-Cutover Checks
- no unresolved front-end calls to old middle_server
- no unserved compatibility dependency such as /shooter/device/*
/shooter/device/* - recon worker loops are making fresh forward progress
- admin websocket top-info remains stable over multiple refresh cycles
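The last check can be sketched as a test over several captured top-info payloads. This is one possible reading of "stable", stated as an assumption: every sampled refresh cycle carries sms_need_check_cnt as a non-negative integer. The function name, the min_cycles default, and the payload shape are hypothetical.

```python
from typing import Mapping, Sequence


def _is_count(value: object) -> bool:
    """True for a non-negative int; bool is excluded even though it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def top_info_stable(
    samples: Sequence[Mapping[str, object]],
    key: str = "sms_need_check_cnt",
    min_cycles: int = 3,  # assumed minimum number of refresh cycles to sample
) -> bool:
    """Return True if at least min_cycles samples exist and each carries a valid count."""
    if len(samples) < min_cycles:
        return False
    return all(_is_count(sample.get(key)) for sample in samples)
```

Too few samples returns False on purpose, so a single good payload is not read as stability.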