Plan: Make Multi-Brand Event Scoping Explicit & Enforced
Status
done — delivered 2026-05-06
Date
2026-05-05 (proposed) / 2026-05-06 (delivered)
Delivered
- Schema tightening (Phase 1). DomainEvent.brand_id in servers_v2/shared/contracts/src/rgb_contracts/events/base.py is now a required int (no | None, no default). Pydantic refuses to construct an envelope without one, so the failure trips at producer-write time rather than at consumer-deserialize time.
- Producer cleanup. Two stale producer call sites that were not passing brand_id were fixed:
  - servers_v2/player_service/app/api/routes/player.py (level_change VipLevelChanged event) sources brand_id from the locked player row.
  - servers_v2/rolling_service/app/tasks/expiry_scheduler.py (RollingCanceled for expired rolling) sources brand_id from the rolling row already validated above the wallet cancel call.
- Consumer fail-closed regardless of mode (Phase 2). _resolve_envelope_brand_id in both servers_v2/rolling_service/app/services/event_consumer.py and servers_v2/promotion_service/app/services/event_consumer.py no longer falls back to DEFAULT_BRAND_ID for legacy schema_version=1 events under MULTI_BRAND_ENFORCEMENT=observe. Any envelope without a numeric brand_id is refused (the message stays pending for supervision), independent of enforcement mode.
- Regression tests (Phase 3). The existing test_rolling_event_consumer_brand.py / test_event_consumer_brand.py were updated to assert the new always-fail behavior and to drop the observe-mode-fallback case.
The event_brand_missing_total counter remains in place for
post-deploy verification (Phase 4 dashboards).
Owners
- Platform Backend
- Wallet Domain
- Promotion Domain
- Rolling Domain
Problem
The header comment on
servers_v2/wallet_service/app/api/routes/wallet.py:12-16 calls out
that some outbox writes are not yet brand-scoped at the event layer:
Multi-brand isolation is incomplete in the event payload itself — consumers rely on implicit filtering inside settlement / rolling logic.
Today the producer-side check (_request_brand_id) does require a
brand context on the wallet command path. The brand_id then lands on
the outbox row. But:
- Consumers do not assert brand_id is present before deciding what to do with the event. The event consumers in servers_v2/rolling_service/app/services/event_consumer.py and servers_v2/promotion_service/app/services/event_consumer.py read the wallet stream and then dispatch to settlement / FIFO logic that filters implicitly through SQL WHERE brand_id = :brand_id.
- The contract does not require brand_id on every event payload. servers_v2/shared/contracts/src/rgb_contracts/streams.py defines the stream and message shapes. If a future producer omits brand_id, or sets it to None for an internal-only event, the consumer silently treats the event as cross-brand by default.
- No regression test pins the invariant. A schema drift that makes brand_id optional, or a consumer that forgets to forward it into a downstream SQL filter, would not fail any existing test.
Result: the multi-brand isolation guarantee from ADR-009 is enforced only by current implementation discipline, not by code that fails closed when the discipline lapses.
Goal
Make brand_id a structural invariant of every multi-brand-relevant event:
- Required on the producer side (already done for wallet).
- Required by the event payload schema in rgb_contracts.
- Asserted by every consumer immediately on dequeue, before the payload reaches business logic.
- Pinned by regression tests for each producer/consumer pair.
Scope
In scope (Phase 17 post-cleanup):
- wallet:events (producer: wallet_service; consumers: rolling_service_group, promotion_service_group).
- rolling:events (producer: rolling_service; no active consumer, so producer-only after Phase 17 cleanup). The previous notification_service_group declaration was removed because it had no consumer implementation and was silently losing events past STREAM_MAX_LEN.
- player:events (producer: player_service; no active consumer, same Phase 17 cleanup rationale). wallet_service_group / risk_service_group / notification_service_group were never wired.
Phase 17 removed:
- promotion:events. There was no producer at all (the stream was declared in streams.py but no code emitted to it). The stream and the lone reserved notification_service_group were removed together. If promotion grows a notification surface later, re-declare the stream with at least one active consumer in the same PR.
Out of scope:
- Internal recon events that do not cross brand boundaries (e.g. a worker heartbeat). These can stay brand-less.
Phased Plan
Phase 1 — Schema Tightening
- In servers_v2/shared/contracts/src/rgb_contracts/, change the relevant event payload models so brand_id: int is required (Optional → required, no default). Each producer must set it.
- Wire the schema into the outbox row writers so a payload without brand_id fails at construction, not at deserialize time.
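A minimal sketch of what "fails at construction" could look like. It uses a stdlib dataclass as a dependency-free stand-in for the Pydantic model; DomainEventSketch, build_outbox_payload, and every field other than brand_id are hypothetical names for illustration, and the real envelope in rgb_contracts may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainEventSketch:
    """Hypothetical stand-in for the real event envelope."""

    event_id: str
    event_type: str
    brand_id: int  # required: no Optional, no default

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations, so check the type here.
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(self.brand_id, bool) or not isinstance(self.brand_id, int):
            raise TypeError(f"brand_id must be an int, got {self.brand_id!r}")


def build_outbox_payload(event_id: str, event_type: str, brand_id: int) -> dict:
    """Hypothetical outbox helper: validate before anything is persisted."""
    event = DomainEventSketch(
        event_id=event_id, event_type=event_type, brand_id=brand_id
    )
    return {
        "event_id": event.event_id,
        "event_type": event.event_type,
        "brand_id": event.brand_id,
    }
```

With this shape, both omitting brand_id and passing None raise TypeError inside the producer, before an outbox row exists.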
Phase 2 — Consumer-Side Fail-Closed Assertion
For every active consumer:
- Immediately after parsing the event payload, assert payload.brand_id is not None and payload.brand_id > 0.
- On violation: log event_brand_missing with the event id and do not ack the message. The retry loop will re-attempt; the schema fix from Phase 1 should already have prevented the case.
- Forward brand_id into every SQL filter used by the downstream handler. Stop relying on the handler's own SQL to "happen to" include brand_id via the player join.
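The assert-then-refuse steps above can be sketched roughly as follows. This is not the actual _resolve_envelope_brand_id; resolve_brand_id, handle_message, and the handler/ack callables are hypothetical, and the sketch assumes the envelope has already been parsed so that brand_id is an int rather than a raw stream string.

```python
import logging
from typing import Any, Callable, Mapping, Optional

logger = logging.getLogger("event_consumer_sketch")


def resolve_brand_id(envelope: Mapping[str, Any]) -> Optional[int]:
    """Return a positive int brand_id, or None if the envelope lacks one."""
    raw = envelope.get("brand_id")
    if isinstance(raw, bool) or not isinstance(raw, int):
        return None
    return raw if raw > 0 else None


def handle_message(
    envelope: Mapping[str, Any],
    handler: Callable[[Mapping[str, Any], int], None],
    ack: Callable[[], None],
) -> bool:
    """Process one message; return True only if it was acked."""
    brand_id = resolve_brand_id(envelope)
    if brand_id is None:
        # Fail closed: no handler call and no ack, so the message stays pending.
        logger.error("event_brand_missing event_id=%s", envelope.get("event_id"))
        return False
    # brand_id is passed explicitly so the handler cannot forget it.
    handler(envelope, brand_id)
    ack()
    return True
```

The check runs before the handler is touched, so a brand-less event never reaches business logic regardless of any enforcement-mode setting.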
Consumers to update:
- servers_v2/rolling_service/app/services/event_consumer.py
- servers_v2/promotion_service/app/services/event_consumer.py
- Any future wallet_service-side consumers of player:events.
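For the "forward brand_id into every SQL filter" step, a small illustration using stdlib sqlite3. The rollings table, its columns, and open_rollings_for_player are invented for the example and are not the real schema; the point is only that brand_id is bound as an explicit parameter taken from the event, not inferred through a join.

```python
import sqlite3


def open_rollings_for_player(
    conn: sqlite3.Connection, brand_id: int, player_id: int
) -> list[int]:
    """Select open rolling ids with brand_id bound explicitly from the event."""
    rows = conn.execute(
        "SELECT id FROM rollings "
        "WHERE brand_id = :brand_id AND player_id = :player_id "
        "AND status = 'open' ORDER BY id",
        {"brand_id": brand_id, "player_id": player_id},
    ).fetchall()
    return [row[0] for row in rows]


def demo() -> list[int]:
    """Build a throwaway in-memory table and query it for one brand."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rollings ("
        "id INTEGER PRIMARY KEY, brand_id INTEGER, player_id INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO rollings (id, brand_id, player_id, status) VALUES (?, ?, ?, ?)",
        [(1, 1, 42, "open"), (2, 2, 42, "open"), (3, 1, 42, "closed")],
    )
    return open_rollings_for_player(conn, brand_id=1, player_id=42)
```

In the demo data the same player id exists under two brands; only the row for brand 1 that is still open is returned, which is the isolation the explicit filter is meant to guarantee.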
Phase 3 — Regression Tests
Add a contract-level test per producer that asserts every emitted
event carries a non-null brand_id. Add a consumer-level test per
consumer that asserts the consumer rejects (does not ack) an event
whose payload arrives with brand_id=None.
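A consumer-level test of this kind might look like the sketch below. FakeStream and consume are hypothetical stand-ins defined inline so the example runs on its own; the real tests would exercise the actual consumers and their stream client.

```python
import unittest
from typing import Any, Mapping


class FakeStream:
    """Records acks so the test can assert on them."""

    def __init__(self) -> None:
        self.acked: list[str] = []

    def ack(self, message_id: str) -> None:
        self.acked.append(message_id)


def consume(stream: FakeStream, message_id: str, payload: Mapping[str, Any]) -> None:
    """Hypothetical consumer under test: ack only brand-scoped events."""
    brand_id = payload.get("brand_id")
    if isinstance(brand_id, bool) or not isinstance(brand_id, int) or brand_id <= 0:
        return  # fail closed: leave the message pending
    stream.ack(message_id)


class BrandInvariantTest(unittest.TestCase):
    def test_rejects_missing_brand_id(self) -> None:
        stream = FakeStream()
        consume(stream, "1-0", {"event_id": "e1", "brand_id": None})
        self.assertEqual(stream.acked, [])

    def test_acks_brand_scoped_event(self) -> None:
        stream = FakeStream()
        consume(stream, "2-0", {"event_id": "e2", "brand_id": 5})
        self.assertEqual(stream.acked, ["2-0"])


def run_suite() -> unittest.TestResult:
    """Run the two tests programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BrandInvariantTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Pairing the rejection case with a positive case keeps the test from passing trivially if the consumer stops acking altogether.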
Phase 4 — Dashboards & Cleanup
- Add an event_brand_missing counter to the existing observability stack so the rate is visible at zero.
- Remove the "implicit filtering" caveat from the wallet_service/app/api/routes/wallet.py header comment once the counter has been at zero in production for 30+ days.
Risks
- Migration ordering. Phase 1 makes the field required in the contract; if any producer is rolled forward before its consumer is ready, consumers without the assertion will accept the field but fail later at SQL filter time. Mitigation: roll producers and consumers together, gate Phase 1 behind a deploy-coupled release.
- Inactive consumers. Historically risk_service_group and notification_service_group were declared in streams.py as placeholders with no live workers. Phase 17 removed those declarations because an unconsumed group accumulates pending messages that get silently trimmed at STREAM_MAX_LEN, so a future worker turning on would inherit a gap rather than catch up. When a real worker is built, the matching consumer-group name must be added back via a PR that ships the worker in the same change. Until then, the producer-side schema invariant is the only enforcement.
Acceptance
- rg "brand_id.*Optional" servers_v2/shared/contracts/src returns no matches in event payload models.
- Every active consumer's first parsed-event branch is a brand_id assertion.
- The event_brand_missing counter has been at zero in production for 30+ consecutive days.
- The "Multi-brand isolation is incomplete" comment in wallet_service/app/api/routes/wallet.py is removed.
References
- docs/adr/ADR-009-multi-brand-domain-routed-isolation.md
- docs/architecture/event-catalog.md
- servers_v2/shared/contracts/src/rgb_contracts/streams.py
- servers_v2/wallet_service/app/api/routes/wallet.py (header comment)
- servers_v2/rolling_service/app/services/event_consumer.py
- servers_v2/promotion_service/app/services/event_consumer.py