Recon Service Cutover Runbook

Status

Ready

Goal

Roll out recon_service and recon_worker into servers_v2 without breaking the existing back-office shooter and Pushbullet workflows.

Preconditions

  • compatibility contract tests are green
  • worker freshness health tests are green
  • compose contract tests include recon service and worker
  • admin top-info aggregation has been updated and tested
  • /shooter/device/* has been explicitly classified
  • the implemented v2 surface has been frozen in docs/reference/recon/recon-service-v2-surface.md

Environment Requirements

Required configuration classes:

  • database access for shooter-related tables
  • Redis access
  • Pushbullet credentials
  • Telegram bot token if Telegram ingestion is enabled
  • OpenRouter configuration if AI regex generation is enabled
  • per-caller internal tokens configured for the recon/admin path:
    • RGB_INTERNAL_SERVICE_TOKEN_ADMIN on consumers that receive calls from admin_service
    • RGB_INTERNAL_SERVICE_TOKEN_RECON on consumers that receive calls from recon_worker
    • RGB_PER_CALLER_TOKEN_REQUIRED=on once the Stage C hard flip is active
  • DingTalk or equivalent alert channel
  • RECON_SERVICE_URL configured in admin_service
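
A preflight check can catch a missing variable before any service starts. The sketch below is illustrative only: the variable names come from the list above, but the grouping by service is an assumption inferred from this runbook (recon_service receives calls from admin_service, admin_service receives calls from recon_worker), and missing_vars is a hypothetical helper, not part of the codebase. It covers only the consumer-side tokens and RECON_SERVICE_URL.

```python
import os
from typing import Mapping

# Assumed grouping: which variables each service needs as a token consumer.
# Adjust to match the real deployment before relying on it.
REQUIRED_VARS = {
    "admin_service": ("RECON_SERVICE_URL", "RGB_INTERNAL_SERVICE_TOKEN_RECON"),
    "recon_service": ("RGB_INTERNAL_SERVICE_TOKEN_ADMIN",),
}


def missing_vars(service: str, env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are unset or blank for a service."""
    return [
        name
        for name in REQUIRED_VARS[service]
        if not env.get(name, "").strip()
    ]


if __name__ == "__main__":
    # Report per service against the current process environment.
    for service in REQUIRED_VARS:
        missing = missing_vars(service, os.environ)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{service}: {status}")
```

Run it inside each container or with the same env file the service uses, so the check sees the environment the service will actually receive.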

Local Verification

  1. Start local dependencies and services.
    • cd servers_v2
    • docker compose -f docker-compose.yml -f docker-compose.dev.yml --env-file .env.compose.local up -d --build
  2. Confirm admin_service health is green.
  3. Confirm recon_service health is green.
  4. Confirm recon_worker health is green.
  5. Exercise these back-office routes through admin_service:
    • template list/add/delete
    • phone whitelist list/add/delete
    • SMS list/get/reset/check
    • Pushbullet device list/add/delete
    • shooter device list/trust/delete
  6. Confirm that successful legacy responses still carry the refreshed Authorization header.
  7. Confirm admin top-info shows sms_need_check_cnt.
  8. Run verification test suites:
    • cd servers_v2/admin_service && uv run pytest
    • cd servers_v2/recon_service && uv run pytest
    • cd servers_v2/rolling_service && uv run pytest
    • cd servers_v2/wallet_service && uv run pytest
    • cd servers_v2 && uv run pytest tests/test_compose_contract.py
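
The suites in step 8 can be run in one pass so that a failure in one directory does not hide failures in the others. This is a minimal sketch, assuming it is invoked from the repository root that contains servers_v2; run_suites and its runner parameter are hypothetical names introduced for illustration. The directories and commands are copied from step 8.

```python
import subprocess
from pathlib import Path

# (working directory, command) pairs copied from step 8 of Local Verification.
SUITES = [
    ("servers_v2/admin_service", ["uv", "run", "pytest"]),
    ("servers_v2/recon_service", ["uv", "run", "pytest"]),
    ("servers_v2/rolling_service", ["uv", "run", "pytest"]),
    ("servers_v2/wallet_service", ["uv", "run", "pytest"]),
    ("servers_v2", ["uv", "run", "pytest", "tests/test_compose_contract.py"]),
]


def run_suites(repo_root: Path, runner=subprocess.run) -> list[str]:
    """Run every suite in order and return the directories that failed.

    `runner` defaults to subprocess.run; it is injectable so the loop can be
    exercised without actually invoking uv.
    """
    failed = []
    for rel_dir, cmd in SUITES:
        result = runner(cmd, cwd=repo_root / rel_dir)
        if result.returncode != 0:
            failed.append(rel_dir)
    return failed
```

Calling run_suites(Path.cwd()) from the repository root returns an empty list when every suite passes.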

Staging Verification

Before production cutover:

  • compare old and new template counts
  • compare pending SMS counts
  • compare parse-failure and unmatched counts
  • compare top-info values over the same time window
  • verify that forced manual review updates the same logical fields
  • verify that final deposit approval still flows through wallet-owned paths only
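
The count comparisons above amount to diffing two metric snapshots taken over the same time window. The sketch below shows one way to do that once the numbers have been collected. The metric keys in the usage note are hypothetical labels, how the counts are gathered from the old and new systems is outside its scope, and parity_report is an illustrative helper rather than existing tooling.

```python
from typing import Optional


def parity_report(
    old: dict[str, int],
    new: dict[str, int],
    tolerance: int = 0,
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return the metrics that differ between the old and new systems.

    A metric is reported when it is missing on either side or when the
    absolute difference exceeds `tolerance`. Values are (old, new) pairs.
    """
    diffs = {}
    for key in sorted(old.keys() | new.keys()):
        old_value, new_value = old.get(key), new.get(key)
        if old_value is None or new_value is None:
            diffs[key] = (old_value, new_value)
        elif abs(old_value - new_value) > tolerance:
            diffs[key] = (old_value, new_value)
    return diffs
```

For example, comparing {"template_count": 12, "pending_sms": 3} against {"template_count": 12, "pending_sms": 5} reports only pending_sms. A nonzero tolerance is useful for counters that keep moving while the two snapshots are taken.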

Production Rollout Order

Current implementation note:

  • admin_service compatibility routes already proxy to recon_service
  • there is no runtime feature flag today that can enable or disable the proxy path independently

Deploy and verify the recon path as one release unit:

  1. Deploy the target admin_service, recon_service, and recon_worker builds together.
  2. Verify internal auth:
    • admin_service calls recon_service with X-Caller-Service: admin and RGB_INTERNAL_SERVICE_TOKEN_ADMIN
    • recon_worker calls admin_service /internal/meta/ws/sync with X-Caller-Service: recon and RGB_INTERNAL_SERVICE_TOKEN_RECON
    • in Stage C, the bare RGB_INTERNAL_SERVICE_TOKEN is absent or empty and RGB_PER_CALLER_TOKEN_REQUIRED=on is set on every internal-token consumer
  3. Verify both health endpoints and worker freshness metrics.
  4. Monitor:
    • compatibility route errors
    • worker freshness
    • pending SMS backlog
    • parse failure count
    • match failure count
    • top-info update lag
  5. Keep the previous known-good deployment set ready for full service rollback until parity is confirmed.
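
The Stage C condition in step 2 can be checked mechanically on each internal-token consumer. The following is a sketch under stated assumptions: stage_c_violations is a hypothetical helper, and it compares the flag against the literal value on exactly as written in this runbook. Whether the services accept other spellings is not documented here.

```python
from typing import Mapping


def stage_c_violations(env: Mapping[str, str]) -> list[str]:
    """List the ways an environment deviates from the Stage C requirements."""
    problems = []
    # The bare shared token must be absent or empty in Stage C.
    if env.get("RGB_INTERNAL_SERVICE_TOKEN", ""):
        problems.append("bare RGB_INTERNAL_SERVICE_TOKEN is still set")
    # Per-caller tokens must be enforced (literal "on" is an assumption).
    if env.get("RGB_PER_CALLER_TOKEN_REQUIRED") != "on":
        problems.append("RGB_PER_CALLER_TOKEN_REQUIRED is not 'on'")
    return problems
```

An empty result means the environment matches the Stage C description. It does not prove that the per-caller tokens themselves are correct, so the live calls in step 2 still need to be verified.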

Rollback

Rollback triggers:

  • compatibility route failure spikes
  • worker health is stale or unhealthy
  • pending SMS backlog grows unexpectedly
  • top-info mismatch or absence of sms_need_check_cnt
  • approval flow bypasses the expected wallet-owned path
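
Several of these triggers can be evaluated from a metrics snapshot, which keeps the rollback decision from depending on judgment alone during an incident. The sketch below is illustrative only: the Snapshot fields, the rollback_triggers helper, and every threshold are assumptions, not values defined by this runbook. Only sms_need_check_cnt is a name taken from the source. The wallet-path trigger is left out because it requires inspecting the approval flow rather than a metric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    compat_error_rate: float            # fraction of failed compatibility-route calls
    worker_heartbeat_age_s: float       # seconds since the worker last reported progress
    pending_sms: int                    # current pending SMS backlog
    pending_sms_baseline: int           # backlog observed before the rollout
    sms_need_check_cnt: Optional[int]   # None when top-info omits the field


def rollback_triggers(
    snap: Snapshot,
    *,
    max_error_rate: float = 0.05,        # assumed threshold
    max_heartbeat_age_s: float = 120.0,  # assumed threshold
    max_backlog_growth: int = 100,       # assumed threshold
) -> list[str]:
    """Return the names of the rollback triggers that currently fire."""
    fired = []
    if snap.compat_error_rate > max_error_rate:
        fired.append("compatibility route failures")
    if snap.worker_heartbeat_age_s > max_heartbeat_age_s:
        fired.append("worker stale")
    if snap.pending_sms - snap.pending_sms_baseline > max_backlog_growth:
        fired.append("pending SMS backlog growth")
    if snap.sms_need_check_cnt is None:
        fired.append("sms_need_check_cnt missing")
    return fired
```

Any non-empty result is a reason to start the rollback steps. The thresholds should be replaced with values agreed for the environment before the cutover.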

Rollback steps:

  1. Roll back admin_service, recon_service, and recon_worker to the previous known-good release set.
  2. Do not assume a runtime repoint or cutover flag exists unless it has been implemented and documented separately.
  3. If immediate full rollback is not possible, isolate the faulty recon release from traffic by reverting the deployment or service image, not by editing undocumented runtime routing.
  4. Preserve logs, queue state, and DB snapshots needed for analysis.
  5. Open a follow-up incident doc before retrying cutover.

Post-Cutover Checks

  • no unresolved front-end calls to the old middle_server
  • no unserved compatibility dependency such as /shooter/device/*
  • recon worker loops are making fresh forward progress
  • admin websocket top-info remains stable over multiple refresh cycles