ADR-021: Messaging & Event Architecture
Status
Under Review — Kafka vs RabbitMQ strategic analysis in progress
| Version | Status | Date | Notes |
|---|---|---|---|
| v1 | Proposed | 2026-02-01 | Enhance RabbitMQ with reliability patterns |
| v2 | Under Review | 2026-02-01 | Strategic evaluation: Kafka vs RabbitMQ |
Context
RabbitMQ is the platform’s nervous system — 75+ message types, 20+ publishing services, 15+ consuming services, 92 Java classes implementing messaging. The Knowledge Graph initiative (ADR-019) introduces new requirements for event streaming, replay, and analytics that warrant re-evaluation of the messaging strategy.
Current State
| Component | Current | Assessment |
|---|---|---|
| Message broker | RabbitMQ 3.12 (3-node cluster per tenant) | Working, but per-tenant clusters add ops overhead |
| Message types | 75+ across 7 domains | Significant investment in contracts |
| Publishers | 20+ services, 40+ sender classes | Consistent core-lib patterns |
| Consumers | 15+ services, 20+ receiver classes | @RabbitListener + core-lib Transport |
| Idempotency | None | Duplicate processing risk |
| Dead letter queues | Partial (HMS only) | Most services lack DLQ |
| Message versioning | None | Schema changes can break consumers |
| Event replay | Not possible | Messages consumed and deleted |
| Audit trail | None | No event history retention |
New Requirements (Knowledge Graph Era)
| Requirement | Source | RabbitMQ | Kafka |
|---|---|---|---|
| Real-time graph ingestion | Epic #53 | ✓ Supported | ✓ Supported |
| Event replay for reprocessing | Migration, recovery | ✗ Not supported | ✓ Native (log retention) |
| Event sourcing for audit | Compliance, CDD | ✗ Not designed for this | ✓ Native pattern |
| Consumer lag monitoring | Operations | ✗ Limited | ✓ Built-in |
| Multi-tenant isolation | ADR-002 | ✓ vhosts | ✓ Topic prefixes/ACLs |
| Exactly-once semantics | Reliability | ✗ At-least-once with acks; dedup is manual | ✓ With transactions |
| High throughput streaming | Future scale | ~50K msg/sec | ~1M+ msg/sec |
| Schema evolution | Contract management | Manual | Schema Registry |
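The multi-tenant isolation row above notes that Kafka isolates tenants with topic prefixes plus ACLs, analogous to RabbitMQ vhosts. A minimal sketch of that convention, assuming an illustrative `tenant-<id>.<domain>.<event>` naming scheme (not an established platform standard):

```java
/**
 * Sketch of Kafka multi-tenant isolation via topic prefixes: every topic a
 * tenant owns shares a prefix that a prefixed ACL can match, playing the
 * role RabbitMQ vhosts play today. The naming scheme is illustrative.
 */
public class TenantTopics {

    /** Builds "tenant-<id>.<domain>.<event>" — the unit a prefixed ACL targets. */
    public static String topicFor(String tenantId, String domain, String event) {
        return "tenant-" + tenantId + "." + domain + "." + event;
    }

    /** ACL check: a tenant principal may only touch topics under its own prefix. */
    public static boolean allowed(String tenantId, String topic) {
        return topic.startsWith("tenant-" + tenantId + ".");
    }
}
```

In a real deployment the `allowed` check would live in the broker as a `PREFIXED`-pattern ACL rather than in application code; the point is that isolation reduces to a naming convention plus enforcement.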
Strategic Analysis: Kafka vs RabbitMQ
Architectural Comparison
┌─────────────────────────────────────────────────────────────────────────────┐
│ RabbitMQ Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Producer ──► Exchange ──► Queue ──► Consumer │
│ │ │
│ │ (routing) │
│ ▼ │
│ [Queue 1] ──► Consumer A │
│ [Queue 2] ──► Consumer B (competing consumers) │
│ │
│ • Message deleted after acknowledgment │
│ • No replay capability │
│ • Push-based delivery │
│ • Per-message routing flexibility │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ Kafka Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Producer ──► Topic (Partition 0) ──► Consumer Group A │
│ │ │ │
│ (append-only log) (offset tracking) │
│ │ │ │
│ [Log: 0,1,2,3,4...] Consumer A reads at offset 3 │
│ │ Consumer B reads at offset 5 │
│ ▼ │
│ Retention: 7 days (configurable) │
│ │
│ • Messages retained after consumption │
│ • Full replay from any offset │
│ • Pull-based delivery │
│ • Partition-based ordering │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
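The key architectural difference in the diagrams — messages retained in an append-only log with per-group offsets, versus deleted on acknowledgment — can be made concrete with a small in-memory simulation. This is a sketch of the mechanics only, not a Kafka client:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of Kafka's append-only log with per-consumer-group offsets. */
public class LogReplayDemo {

    /** One partition: records survive consumption; only offsets move. */
    static class Partition {
        private final List<String> log = new ArrayList<>();
        // Each consumer group tracks its own committed offset independently.
        private final Map<String, Integer> offsets = new HashMap<>();

        void append(String record) {
            log.add(record);
        }

        /** Reads the next record for a group, advancing only that group's offset. */
        String poll(String group) {
            int offset = offsets.getOrDefault(group, 0);
            if (offset >= log.size()) return null;
            offsets.put(group, offset + 1);
            return log.get(offset);
        }

        /** Replay: rewind a group's offset; the records are still in the log. */
        void seek(String group, int offset) {
            offsets.put(group, offset);
        }
    }

    public static void main(String[] args) {
        Partition p = new Partition();
        p.append("evt-0");
        p.append("evt-1");
        p.append("evt-2");

        // Consumer group A drains the partition once.
        while (p.poll("group-A") != null) { /* process */ }

        // Full replay from offset 0 — possible because nothing was deleted.
        p.seek("group-A", 0);
        System.out.println(p.poll("group-A")); // evt-0 again
    }
}
```

In RabbitMQ's model the equivalent of `seek` does not exist: once `group-A` acknowledges `evt-0`, the broker deletes it, which is exactly why the "Event replay" row above marks RabbitMQ as not supported.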
Feature Comparison Matrix
| Capability | RabbitMQ | Kafka | Winner | Notes |
|---|---|---|---|---|
| Message Routing | Excellent (exchanges, bindings) | Basic (topics, partitions) | RabbitMQ | RabbitMQ has flexible routing patterns |
| Throughput | ~50K msg/sec | ~1M+ msg/sec | Kafka | Kafka designed for high throughput |
| Latency | Low (~1ms) | Low-Medium (~5ms) | RabbitMQ | RabbitMQ slightly faster for small messages |
| Message Retention | Until consumed | Configurable (days/size) | Kafka | Kafka retains messages in log |
| Replay Capability | None | Full replay from any offset | Kafka | Critical for event sourcing, recovery |
| Ordering Guarantee | Per-queue | Per-partition | Tie | Both can guarantee ordering |
| Exactly-Once | Not native | With transactions | Kafka | Kafka has idempotent producers |
| Consumer Groups | Competing consumers | Consumer groups with offsets | Kafka | Kafka tracks progress automatically |
| Dead Letter Handling | Native DLQ | Manual (separate topic) | RabbitMQ | RabbitMQ has built-in DLX |
| Schema Registry | None (manual) | Confluent Schema Registry | Kafka | Avro/Protobuf with evolution |
| Multi-Tenancy | vhosts | Topic naming + ACLs | Tie | Both support isolation |
| Operational Complexity | Medium | High | RabbitMQ | Kafka requires more tuning |
| Cloud Managed Options | CloudAMQP, Amazon MQ | Confluent, Amazon MSK, Aiven | Tie | Both have managed services |
Use Case Fit Analysis
| Use Case | RabbitMQ Fit | Kafka Fit | Platform Need |
|---|---|---|---|
| Task queues (shoutout processing) | ✓✓✓ Excellent | ✓ Good | High |
| Request/reply (sync over async) | ✓✓✓ Native | ✗ Anti-pattern | Medium |
| Pub/sub fanout (SSE broadcasts) | ✓✓ Good | ✓✓ Good | High |
| Event sourcing (audit trail) | ✗ Not designed | ✓✓✓ Native | Growing |
| Stream processing (analytics) | ✗ Not designed | ✓✓✓ Kafka Streams | Growing |
| Log aggregation | ✗ Not designed | ✓✓✓ Native | Low (using Loki) |
| CQRS projections | ✓ Supported | ✓✓✓ Ideal | Growing |
| Microservice events | ✓✓ Good | ✓✓ Good | High |
Cost Analysis
| Factor | RabbitMQ (Current) | RabbitMQ (Enhanced) | Kafka (Migration) |
|---|---|---|---|
| Infrastructure | 4 clusters (per-tenant) | 1 cluster (vhosts) | 1 cluster |
| Monthly cost (est) | $800/mo | $400/mo | $600-1000/mo |
| Operational overhead | Medium | Medium | High (initially) |
| Migration effort | N/A | 2-4 weeks | 3-6 months |
| Code changes | N/A | core-lib only | 92 classes + contracts |
| Team training | N/A | Minimal | Significant |
| Risk | Low | Low | Medium-High |
Migration Effort Breakdown (If Kafka)
| Component | Effort | Files Affected |
|---|---|---|
| Replace core-lib Transport | 2 weeks | 5 files |
| Rewrite senders | 2 weeks | 40+ classes |
| Rewrite receivers | 2 weeks | 20+ classes |
| Update configs | 1 week | 50+ application.yml |
| Schema Registry setup | 1 week | New infrastructure |
| Testing & validation | 4 weeks | All services |
| Total | 12-16 weeks | 100+ files |
Decision Options
Option A: Enhance RabbitMQ (Original v1 Recommendation)
Add reliability patterns to the existing RabbitMQ infrastructure:
- Idempotency via Redis-backed store
- DLQ configuration with 3-retry policy
- Circuit breakers for synchronous calls
- Lightweight message versioning (headers)
- Consolidation to a single cluster with vhosts
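The Redis-backed idempotency store is the highest-leverage item in this list, since the current state table flags duplicate processing as an open risk. A minimal sketch, with an in-memory map standing in for Redis (`SET key NX EX <ttl>` in the real thing); the class and method names are illustrative, not the actual core-lib API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/**
 * Sketch of an idempotent message handler: the delegate runs at most once
 * per messageId, so broker redeliveries become no-ops. A ConcurrentHashMap
 * stands in for Redis here; production would use SETNX with a TTL so the
 * dedup window eventually expires.
 */
public class IdempotentHandler {
    // messageId -> processed marker.
    private final Map<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true if the delegate ran, false if the message was a duplicate. */
    public boolean handle(String messageId, String payload, Consumer<String> delegate) {
        // putIfAbsent is atomic, mirroring Redis SETNX semantics.
        if (seen.putIfAbsent(messageId, Boolean.TRUE) != null) {
            return false; // duplicate: already processed (or in flight)
        }
        delegate.accept(payload);
        return true;
    }
}
```

Wrapping the core-lib MessageHandler this way keeps the change confined to core-lib, which is what makes Option A a 2-4 week effort rather than a per-service rewrite.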
Pros:
- Low risk, incremental improvement
- Minimal code changes (core-lib only)
- Team already knows RabbitMQ
- 2-4 weeks effort

Cons:
- No event replay capability
- No event sourcing for audit
- Limited stream processing options
- Per-tenant cluster consolidation still needed
Option B: Hybrid Architecture (RabbitMQ + Kafka)
Keep RabbitMQ for transactional messaging, add Kafka for event streaming:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Hybrid Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ RabbitMQ │ Task Queues │ Kafka │ Event Streams │
│ │ │ Request/Reply │ │ Event Sourcing │
│ │ │ Shoutout Processing │ │ Audit Trail │
│ │ │ Payment Workflows │ │ Analytics │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ [Transactional Services] [Knowledge Graph Service] │
│ [Payment Service] [Analytics Service] │
│ [Notification Service] [Audit Service] │
│ │
│ Bridge: RabbitMQ events → Kafka (for retention/replay) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
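The bridge in the diagram — a RabbitMQ consumer that forwards a selected subset of events into Kafka for retention and replay — can be sketched as follows. The broker clients are abstracted behind functional interfaces; real code would wire a `@RabbitListener` to a `KafkaProducer`, and every name here (including the `bridged.` topic prefix) is illustrative:

```java
import java.util.List;
import java.util.function.BiConsumer;

/**
 * Sketch of the RabbitMQ -> Kafka bridge from the hybrid diagram: messages
 * of bridged types are forwarded to Kafka (one topic per message type) so
 * they gain retention and replay; everything else stays RabbitMQ-only.
 */
public class EventBridge {
    private final BiConsumer<String, String> kafkaProducer; // (topic, payload)
    private final List<String> bridgedTypes;                // allowlist of types to forward

    public EventBridge(BiConsumer<String, String> kafkaProducer, List<String> bridgedTypes) {
        this.kafkaProducer = kafkaProducer;
        this.bridgedTypes = bridgedTypes;
    }

    /** Called for every RabbitMQ message; returns true if it was forwarded. */
    public boolean onRabbitMessage(String messageType, String payload) {
        if (!bridgedTypes.contains(messageType)) {
            return false; // not a bridged type; RabbitMQ handles it alone
        }
        // Illustrative topic naming: one Kafka topic per bridged message type.
        kafkaProducer.accept("bridged." + messageType, payload);
        return true;
    }
}
```

Keeping the bridge as an explicit allowlist is a deliberate choice: it bounds the "bridge logic between systems" complexity listed in the cons, because only events that actually need retention cross the boundary.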
Pros:
- Best of both worlds
- RabbitMQ for existing patterns (low migration)
- Kafka for new streaming use cases
- Event replay for Knowledge Graph ingestion
- Incremental adoption

Cons:
- Two messaging systems to operate
- Increased infrastructure complexity
- Bridge logic between systems
- Team needs to learn Kafka
Option C: Full Kafka Migration
Replace RabbitMQ entirely with Kafka:
Pros:
- Single messaging platform
- Full event sourcing capability
- Schema Registry for contract management
- Future-proof for streaming analytics
- Better consumer lag monitoring

Cons:
- High migration risk (92 classes)
- 3-6 months effort
- Team training required
- Some patterns less natural (request/reply)
- Higher operational complexity initially
Hypothesis-Driven Evaluation
Primary Hypothesis (Option B - Hybrid)
A hybrid RabbitMQ + Kafka architecture provides the optimal balance of reliability for existing patterns and capability for new streaming requirements, while minimizing migration risk.
Evidence Chain
| Claim | Evidence | Assurance |
|---|---|---|
| RabbitMQ working for current patterns | 75+ message types operational | L2 |
| Knowledge Graph needs event replay | Migration recovery, reprocessing | L1 |
| Full Kafka migration high risk | 92 classes, 3-6 month estimate | L1 |
| Hybrid reduces migration scope | New services use Kafka, existing stay | L1 |
| Team can learn Kafka incrementally | Knowledge Graph team pilots | L0 |
| Event sourcing needed for audit | CDD compliance requirements | L1 |
Overall Confidence: L1 (WLNK capped by unverified team learning curve)
Alternative Hypothesis (Option A - Enhance RabbitMQ)
Enhancing RabbitMQ with reliability patterns is sufficient for all current and near-term requirements.
- Evidence: Current workload is command/event, not event sourcing (L2)
- Counter-evidence: Knowledge Graph needs replay for bulk migration (L1)
- Counter-evidence: CDD audit trail requirements emerging (L1)
Assessment: Option A is viable short-term but may require revisiting within 12 months.
Falsifiability Criteria
Option B (Hybrid) is FALSE if:
- Kafka operational overhead exceeds benefit (>40 hours/month ops)
- Event replay is never actually used in production
- Team struggles to adopt Kafka after 6 months
- Two-system complexity causes more incidents than it prevents

Option A (Enhanced RabbitMQ) is FALSE if:
- Event replay becomes a critical requirement within 6 months
- A compliance audit requires full event history
- Stream processing becomes a core capability need
Recommendation
Short-Term (0-3 months): Enhanced RabbitMQ
Implement the v1 reliability enhancements:
1. Add idempotency to core-lib MessageHandler
2. Configure DLQ with retry policy
3. Add circuit breakers for sync calls
4. Consolidate to a single cluster with vhosts
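The DLQ-with-retry step deserves a concrete shape, since today most services have no DLQ at all. In real RabbitMQ the retry count lives in the `x-death` header and routing goes through a DLX binding; this sketch inlines those mechanics to show the policy itself (retry up to N, then dead-letter instead of redelivering forever):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the 3-retry DLQ policy: a failing message is retried up to
 * maxRetries, then routed to a dead-letter sink for manual inspection.
 * The broker's x-death counting and DLX binding are simulated in-process.
 */
public class RetryPolicy {
    private final int maxRetries;
    private final List<String> deadLetters = new ArrayList<>();

    public RetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Attempts delivery; after maxRetries failures, dead-letters the payload. */
    public void deliver(String payload, Consumer<String> handler) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                handler.accept(payload);
                return; // success: ack and stop
            } catch (RuntimeException e) {
                // nack; the broker would redeliver with an incremented x-death count
            }
        }
        deadLetters.add(payload); // retries exhausted: route to DLQ
    }

    public List<String> deadLetters() {
        return deadLetters;
    }
}
```

The important property is the bound: a poison message costs exactly `maxRetries` handler invocations before it lands in the DLQ, rather than looping on the queue indefinitely.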
Medium-Term (3-12 months): Hybrid Architecture Pilot
- Deploy Kafka alongside RabbitMQ
- Knowledge Graph service uses Kafka for:
  - Event ingestion (replay capability)
  - Analytics event streaming
  - Audit trail retention
- Bridge: Forward critical RabbitMQ events to Kafka for retention
- Evaluate operational experience
Long-Term (12+ months): Strategic Decision
Based on the hybrid pilot:
- If Kafka proves valuable → gradual migration of more services
- If Kafka overhead outweighs benefit → remain on enhanced RabbitMQ
- Decision informed by actual operational data
Implementation Plan (Hybrid Approach)
Phase 1: RabbitMQ Enhancements (Weeks 1-4)
| Task | Owner | Effort |
|---|---|---|
| Idempotency store (Redis) | Platform | 1 week |
| DLQ configuration | Platform | 3 days |
| Circuit breakers | Platform | 1 week |
| Cluster consolidation | DevOps | 1 week |
Phase 2: Kafka Infrastructure (Weeks 5-8)
| Task | Owner | Effort |
|---|---|---|
| Kafka cluster setup (Confluent Cloud or self-hosted) | DevOps | 1 week |
| Schema Registry setup | DevOps | 3 days |
| Monitoring & alerting | DevOps | 1 week |
| Team training | Platform | 1 week |
Phase 3: Knowledge Graph Kafka Integration (Weeks 9-12)
| Task | Owner | Effort |
|---|---|---|
| Kafka ingestion listeners | KG Team | 2 weeks |
| Event replay for migration | KG Team | 1 week |
| Analytics streaming | KG Team | 1 week |
Phase 4: Bridge & Evaluation (Weeks 13-16)
| Task | Owner | Effort |
|---|---|---|
| RabbitMQ → Kafka bridge | Platform | 2 weeks |
| Operational runbook | DevOps | 1 week |
| 30-day evaluation | All | Ongoing |
Bounded Validity
Scope
- Applies to: All inter-service messaging, event streaming, audit requirements
- Does not apply to: In-process events, database triggers, webhooks
Expiry Conditions
- Time-based: Re-evaluate after 6-month hybrid pilot
- Scale-based: If message volume exceeds 1M/day
- Complexity-based: If operational overhead of hybrid exceeds 40 hours/month
Review Triggers
- Kafka adoption rate among teams
- Incident rate comparison (RabbitMQ vs Kafka)
- Event replay actual usage frequency
- Cost variance from estimates
Monitoring
- Message throughput per system
- Consumer lag (Kafka)
- DLQ depth (RabbitMQ)
- Cross-system latency (bridge)
Consequences
Positive (Hybrid Approach)
- Reliability improvements on existing RabbitMQ
- Event replay capability for Knowledge Graph
- Incremental Kafka adoption reduces risk
- Future-proof for streaming analytics
- CDD audit trail capability
Negative (Hybrid Approach)
- Two messaging systems to operate
- Bridge adds complexity
- Team learning curve for Kafka
- Higher initial infrastructure cost
Neutral
- Core-lib abstraction can support both backends
- Existing services unchanged initially
Related Decisions
- ADR-002: Multi-tenant compute (cluster consolidation)
- ADR-019: Knowledge Graph (event ingestion requirements)
- ADR-014: Observability (message tracing)
Original decision date: 2026-02-01
Revision date: 2026-02-01
Review by: 2026-08-01