Plan: Make Multi-Brand Event Scoping Explicit & Enforced

Status

done — delivered 2026-05-06

Date

2026-05-05 (proposed) / 2026-05-06 (delivered)

Delivered

  1. Schema tightening (Phase 1). DomainEvent.brand_id in servers_v2/shared/contracts/src/rgb_contracts/events/base.py is now a required int (no | None, no default). Pydantic refuses to construct an envelope without one, which trips at producer-write time rather than at consumer-deserialize time.
  2. Producer cleanup. Two stale producer call sites that were not passing brand_id were fixed:
    • servers_v2/player_service/app/api/routes/player.py (level_change VipLevelChanged event) sources brand_id from the locked player row.
    • servers_v2/rolling_service/app/tasks/expiry_scheduler.py (RollingCanceled for expired rolling) sources brand_id from the rolling row already validated above the wallet cancel call.
  3. Consumer fail-closed regardless of mode (Phase 2). _resolve_envelope_brand_id in both servers_v2/rolling_service/app/services/event_consumer.py and servers_v2/promotion_service/app/services/event_consumer.py no longer falls back to DEFAULT_BRAND_ID for legacy schema_version=1 events under MULTI_BRAND_ENFORCEMENT=observe. Any envelope without a numeric brand_id is refused (message stays pending for supervision), independent of enforcement mode.
  4. Regression tests (Phase 3). The existing test_rolling_event_consumer_brand.py and test_event_consumer_brand.py were updated to assert the new always-fail behavior and to drop the observe-mode-fallback case.

The event_brand_missing_total counter remains in place for post-deploy verification (Phase 4 dashboards).

Owners

  • Platform Backend
  • Wallet Domain
  • Promotion Domain
  • Rolling Domain

Problem

The header comment on servers_v2/wallet_service/app/api/routes/wallet.py:12-16 calls out that some outbox writes are not yet brand-scoped at the event layer:

Multi-brand isolation is incomplete in the event payload itself — consumers rely on implicit filtering inside settlement / rolling logic.

Today the producer-side check (_request_brand_id) does require a brand context on the wallet command path. The brand_id then lands on the outbox row. But:

  1. Consumers do not assert brand_id is present before deciding what to do with the event. The event consumers in servers_v2/rolling_service/app/services/event_consumer.py and servers_v2/promotion_service/app/services/event_consumer.py read the wallet stream and then dispatch to settlement / FIFO logic that filters implicitly through SQL WHERE brand_id = :brand_id.
  2. The contract does not require brand_id on every event payload. servers_v2/shared/contracts/src/rgb_contracts/streams.py defines the stream and message shapes. If a future producer omits brand_id — or sets it to None for an internal-only event — the consumer silently treats the event as cross-brand by default.
  3. No regression test pins the invariant. A schema drift that makes brand_id optional, or a consumer that forgets to forward it into a downstream SQL filter, would not fail any existing test.

Result: the multi-brand isolation guarantee from ADR-009 is enforced only by current implementation discipline, not by code that fails closed when the discipline lapses.

Goal

Make brand_id a structural invariant of every multi-brand-relevant event:

  • Required on the producer side (already done for wallet).
  • Required by the event payload schema in rgb_contracts.
  • Asserted by every consumer immediately on dequeue, before the payload reaches business logic.
  • Pinned by regression tests for each producer/consumer pair.

Scope

In scope (Phase 17 post-cleanup):

  • wallet:events (producer: wallet_service; consumers: rolling_service_group, promotion_service_group).
  • rolling:events (producer: rolling_service; no active consumer — producer-only after Phase 17 cleanup; the previous notification_service_group declaration was removed because it had no consumer implementation and was silently losing events past STREAM_MAX_LEN).
  • player:events (producer: player_service; no active consumer — same Phase 17 cleanup rationale; wallet_service_group / risk_service_group / notification_service_group were never wired).

Phase 17 removed:

  • promotion:events — there was no producer at all (declared in streams.py but no code emitted to it). The stream + the lone reserved notification_service_group were removed together. If promotion grows a notification surface later, re-declare the stream with at least one active consumer in the same PR.

Out of scope:

  • Internal recon events that do not cross brand boundaries (e.g. a worker heartbeat). These can stay brand-less.

Phased Plan

Phase 1 — Schema Tightening

  1. In servers_v2/shared/contracts/src/rgb_contracts/, change the relevant event payload models so brand_id: int is required (Optional → required, no default). Each producer must set it.
  2. Wire the schema into the outbox row writers so a payload without brand_id fails at construction, not at deserialize.
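The effect of a required field can be sketched as follows. This is a simplified stand-in for DomainEvent, not the actual contents of base.py: the event_id and event_type fields and the try_build helper are assumptions made for illustration. Only the required brand_id: int mirrors the plan.

```python
from pydantic import BaseModel, ValidationError


class DomainEvent(BaseModel):
    """Simplified stand-in for the real envelope model."""

    event_id: str    # hypothetical field, for illustration only
    event_type: str  # hypothetical field, for illustration only
    brand_id: int    # required: no `| None`, no default


def try_build(**fields) -> bool:
    """Return True if the envelope constructs, False if validation rejects it."""
    try:
        DomainEvent(**fields)
    except ValidationError:
        return False
    return True


# A producer that passes brand_id succeeds.
ok = try_build(event_id="evt-1", event_type="RollingCanceled", brand_id=3)

# A producer that forgets brand_id fails at construction time.
missing = try_build(event_id="evt-2", event_type="RollingCanceled")

# An explicit None is rejected as well, since the type is plain int.
explicit_none = try_build(
    event_id="evt-3", event_type="RollingCanceled", brand_id=None
)
```

Because the failure happens when the producer builds the envelope, a missing brand_id surfaces in the producer's own request or task rather than later in a consumer.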

Phase 2 — Consumer-Side Fail-Closed Assertion

For every active consumer:

  1. Immediately after parsing the event payload, assert payload.brand_id is not None and payload.brand_id > 0.
  2. On violation: log event_brand_missing with the event id and do not ack the message, so that it stays pending and the retry loop re-attempts it. The Phase 1 schema change should already prevent newly produced events from reaching this branch.
  3. Forward brand_id into every SQL filter used by the downstream handler. Stop relying on the handler's own SQL to "happen to" include brand_id via the player join.
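Steps 1 and 2 above might look roughly like the sketch below. It is not the code in event_consumer.py: resolve_brand_id, handle_message, the dict-shaped envelope, and the in-process metrics counter are all hypothetical stand-ins, and the real consumer's parsing and ack calls may differ.

```python
import logging
from collections import Counter
from typing import Any, Callable, Mapping, Optional

logger = logging.getLogger("event_consumer")
metrics: Counter = Counter()  # stand-in for the real metrics client


def resolve_brand_id(envelope: Mapping[str, Any]) -> Optional[int]:
    """Return a positive integer brand_id, or None if it is absent or invalid."""
    value = envelope.get("brand_id")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return None
    return value if value > 0 else None


def handle_message(
    message_id: str,
    envelope: Mapping[str, Any],
    process: Callable[[int, Mapping[str, Any]], None],
    ack: Callable[[str], None],
) -> bool:
    """Assert brand scope first; only brand-scoped events reach business logic."""
    brand_id = resolve_brand_id(envelope)
    if brand_id is None:
        metrics["event_brand_missing"] += 1
        logger.error("event_brand_missing event_id=%s", envelope.get("event_id"))
        return False  # no ack: the message stays pending
    process(brand_id, envelope)
    ack(message_id)
    return True


acked: list = []
processed: list = []

# Brand-scoped event: processed and acked.
handle_message(
    "1-0",
    {"event_id": "evt-1", "brand_id": 2},
    lambda brand_id, _env: processed.append(brand_id),
    acked.append,
)

# Event without brand_id: refused, counted, and left un-acked.
handle_message(
    "2-0",
    {"event_id": "evt-2"},
    lambda brand_id, _env: processed.append(brand_id),
    acked.append,
)
```

The check runs before process is called, so no enforcement-mode flag or downstream handler can let a brand-less event through.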

Consumers to update:

  • servers_v2/rolling_service/app/services/event_consumer.py
  • servers_v2/promotion_service/app/services/event_consumer.py
  • Any future wallet_service-side consumers of player:events.
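Step 3 of this phase (forwarding brand_id into every SQL filter) can be illustrated with an in-memory SQLite table. The rolling table, its columns, and the open_rollings helper are hypothetical and do not reflect the real schema. The point is only that brand_id appears as an explicit predicate rather than being implied by a join.

```python
import sqlite3


def open_rollings(conn: sqlite3.Connection, *, brand_id: int, player_id: int) -> list:
    """Fetch rolling rows with brand_id as an explicit WHERE predicate."""
    if brand_id <= 0:
        raise ValueError("brand_id must be a positive integer")
    cursor = conn.execute(
        "SELECT id, amount FROM rolling "
        "WHERE brand_id = :brand_id AND player_id = :player_id "
        "ORDER BY id",
        {"brand_id": brand_id, "player_id": player_id},
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rolling ("
    "id INTEGER PRIMARY KEY, "
    "brand_id INTEGER NOT NULL, "
    "player_id INTEGER NOT NULL, "
    "amount INTEGER NOT NULL)"
)
# The same player_id exists under two brands in this toy data set.
conn.executemany(
    "INSERT INTO rolling (brand_id, player_id, amount) VALUES (?, ?, ?)",
    [(1, 7, 100), (2, 7, 250), (1, 8, 40)],
)

brand_one_rows = open_rollings(conn, brand_id=1, player_id=7)
brand_two_rows = open_rollings(conn, brand_id=2, player_id=7)
```

Making brand_id a keyword-only argument means a handler cannot call the query without deciding which brand it is acting for.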

Phase 3 — Regression Tests

Add a contract-level test per producer that asserts every emitted event carries a non-null brand_id. Add a consumer-level test per consumer that asserts the consumer rejects (does not ack) an event whose payload arrives with brand_id=None.
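The two kinds of test described above could take roughly the following shape. emit_events and consume are hypothetical stand-ins for a real producer and consumer, and the real suites (test_rolling_event_consumer_brand.py and test_event_consumer_brand.py) may be structured differently.

```python
import unittest


def emit_events() -> list:
    """Hypothetical producer stand-in: the envelopes it would write to the outbox."""
    return [
        {"event_id": "evt-1", "event_type": "VipLevelChanged", "brand_id": 1},
        {"event_id": "evt-2", "event_type": "RollingCanceled", "brand_id": 2},
    ]


def consume(envelope: dict, ack) -> None:
    """Hypothetical consumer stand-in: acks only brand-scoped envelopes."""
    brand_id = envelope.get("brand_id")
    if isinstance(brand_id, bool) or not isinstance(brand_id, int) or brand_id <= 0:
        return  # refused: no ack
    ack(envelope["event_id"])


class BrandScopingTests(unittest.TestCase):
    def test_every_emitted_event_carries_brand_id(self):
        # Contract-level: each event from the producer has a non-null brand_id.
        for envelope in emit_events():
            with self.subTest(event_id=envelope["event_id"]):
                self.assertIsInstance(envelope.get("brand_id"), int)

    def test_consumer_does_not_ack_without_brand_id(self):
        # Consumer-level: brand_id=None must not be acked.
        acked = []
        consume({"event_id": "evt-3", "brand_id": None}, acked.append)
        self.assertEqual(acked, [])

    def test_consumer_acks_brand_scoped_event(self):
        acked = []
        consume({"event_id": "evt-4", "brand_id": 5}, acked.append)
        self.assertEqual(acked, ["evt-4"])


def run_suite() -> unittest.TestResult:
    """Load and run the tests above, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BrandScopingTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Asserting on the recorded acks, rather than on a log line, pins the behavior that matters for isolation: a brand-less message is never marked as handled.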

Phase 4 — Dashboards & Cleanup

  1. Add an event_brand_missing counter to the existing observability stack so that a sustained zero rate can be confirmed on a dashboard.
  2. Remove the "implicit filtering" caveat from the wallet_service/app/api/routes/wallet.py header comment once the counter has been at zero in production for 30+ days.
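The 30-day condition in step 2 can be expressed as a small check over daily counter totals. This is a sketch only: the function names, the dict-of-dates input, and the dates used are hypothetical, and the real dashboards may derive the same answer directly from the metrics backend.

```python
from datetime import date, timedelta


def trailing_zero_days(daily_counts: dict, today: date) -> int:
    """Count consecutive days, ending at `today`, whose recorded count is zero.

    A day with no sample breaks the streak, so missing data never counts as zero.
    """
    streak = 0
    day = today
    while daily_counts.get(day) == 0:
        streak += 1
        day -= timedelta(days=1)
    return streak


def caveat_can_be_removed(
    daily_counts: dict, today: date, required_days: int = 30
) -> bool:
    """True once the counter has been at zero for the required number of days."""
    return trailing_zero_days(daily_counts, today) >= required_days


today = date(2026, 6, 10)  # hypothetical evaluation date

# Thirty clean days in a row.
clean = {today - timedelta(days=i): 0 for i in range(30)}

# Same window, but one brand-less event was refused ten days ago.
with_spike = dict(clean)
with_spike[today - timedelta(days=10)] = 1

clean_ready = caveat_can_be_removed(clean, today)
spike_ready = caveat_can_be_removed(with_spike, today)
```

Treating a missing sample as a break in the streak keeps a metrics outage from being mistaken for thirty quiet days.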

Risks

  • Migration ordering. Phase 1 makes the field required in the contract; if any producer is rolled forward before its consumer is ready, consumers without the assertion will accept the field but fail later at SQL filter time. Mitigation: roll producers and consumers together, and gate Phase 1 behind a deploy-coupled release.
  • Inactive consumers. Historically risk_service_group and notification_service_group were declared in streams.py as placeholders with no live workers. Phase 17 removed those declarations because an unconsumed group accumulates pending messages that get silently trimmed at STREAM_MAX_LEN — a future worker turning on would inherit a gap rather than catch up. When a real worker is built, the matching consumer-group name must be added back via a PR that ships the worker in the same change. Until then, the producer-side schema invariant is the only enforcement.

Acceptance

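  • rg "brand_id.*Optional" servers_v2/shared/contracts/src returns no matches in event payload models. Because the field was previously declared with | None, the same check should also return no matches for that spelling, e.g. rg "brand_id.*\| *None".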
  • Every active consumer's first parsed-event branch is a brand_id assertion.
  • The event_brand_missing counter has been at zero in production for 30+ consecutive days.
  • The "Multi-brand isolation is incomplete" comment in wallet_service/app/api/routes/wallet.py is removed.

References

  • docs/adr/ADR-009-multi-brand-domain-routed-isolation.md
  • docs/architecture/event-catalog.md
  • servers_v2/shared/contracts/src/rgb_contracts/streams.py
  • servers_v2/wallet_service/app/api/routes/wallet.py — header comment
  • servers_v2/rolling_service/app/services/event_consumer.py
  • servers_v2/promotion_service/app/services/event_consumer.py