ADR-021: Messaging & Event Architecture
Status
Under Review — Kafka vs RabbitMQ strategic analysis in progress
| Version | Status | Date | Notes |
|---|---|---|---|
| v1 | Proposed | 2026-02-01 | Enhance RabbitMQ with reliability patterns |
| v2 | Under Review | 2026-02-01 | Strategic evaluation: Kafka vs RabbitMQ |
Context
RabbitMQ is the platform’s nervous system — 75+ message types, 20+ publishing services, 15+ consuming services, 92 Java classes implementing messaging. The Knowledge Graph initiative (ADR-019) introduces new requirements for event streaming, replay, and analytics that warrant re-evaluation of the messaging strategy.
Current State
| Component | Current | Assessment |
|---|---|---|
| Message broker | RabbitMQ 3.12 (3-node cluster per tenant) | Working, but per-tenant clusters add ops overhead |
| Message types | 75+ across 7 domains | Significant investment in contracts |
| Publishers | 20+ services, 40+ sender classes | Consistent core-lib patterns |
| Consumers | 15+ services, 20+ receiver classes | @RabbitListener + core-lib Transport |
| Idempotency | None | Duplicate processing risk |
| Dead letter queues | Partial (HMS only) | Most services lack DLQ |
| Message versioning | None | Schema changes can break consumers |
| Event replay | Not possible | Messages consumed and deleted |
| Audit trail | None | No event history retention |
New Requirements (Knowledge Graph Era)
| Requirement | Source | RabbitMQ | Kafka |
|---|---|---|---|
| Real-time graph ingestion | Epic #53 | ✓ Supported | ✓ Supported |
| Event replay for reprocessing | Migration, recovery | ✗ Not supported | ✓ Native (log retention) |
| Event sourcing for audit | Compliance, CDD | ✗ Not designed for this | ✓ Native pattern |
| Consumer lag monitoring | Operations | ✗ Limited | ✓ Built-in |
| Multi-tenant isolation | ADR-002 | ✓ vhosts | ✓ Topic prefixes/ACLs |
| Exactly-once semantics | Reliability | ✗ At-least-once with acks; dedup is manual | ✓ With transactions |
| High throughput streaming | Future scale | ~50K msg/sec | ~1M+ msg/sec |
| Schema evolution | Contract management | Manual | Schema Registry |
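The multi-tenant isolation row above notes that Kafka isolates tenants with topic prefixes plus ACLs, analogous to RabbitMQ vhosts. A minimal sketch of that convention, assuming an illustrative `tenant-<id>.<domain>.<event>` naming scheme (not an established platform standard):

```java
/**
 * Sketch of Kafka multi-tenant isolation via topic prefixes: every topic a
 * tenant owns shares a prefix that a prefixed ACL can match, playing the
 * role RabbitMQ vhosts play today. The naming scheme is illustrative.
 */
public class TenantTopics {

    /** Builds "tenant-<id>.<domain>.<event>" — the unit a prefixed ACL targets. */
    public static String topicFor(String tenantId, String domain, String event) {
        return "tenant-" + tenantId + "." + domain + "." + event;
    }

    /** ACL check: a tenant principal may only touch topics under its own prefix. */
    public static boolean allowed(String tenantId, String topic) {
        return topic.startsWith("tenant-" + tenantId + ".");
    }
}
```

In a real deployment the `allowed` check would live in the broker as a `PREFIXED`-pattern ACL rather than in application code; the point is that isolation reduces to a naming convention plus enforcement.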
Strategic Analysis: Kafka vs RabbitMQ
Architectural Comparison
┌─────────────────────────────────────────────────────────────────────────────┐
│ RabbitMQ Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Producer ──► Exchange ──► Queue ──► Consumer │
│ │ │
│ │ (routing) │
│ ▼ │
│ [Queue 1] ──► Consumer A │
│ [Queue 2] ──► Consumer B (competing consumers) │
│ │
│ • Message deleted after acknowledgment │
│ • No replay capability │
│ • Push-based delivery │
│ • Per-message routing flexibility │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ Kafka Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Producer ──► Topic (Partition 0) ──► Consumer Group A │
│ │ │ │
│ (append-only log) (offset tracking) │
│ │ │ │
│ [Log: 0,1,2,3,4...] Consumer A reads at offset 3 │
│ │ Consumer B reads at offset 5 │
│ ▼ │
│ Retention: 7 days (configurable) │
│ │
│ • Messages retained after consumption │
│ • Full replay from any offset │
│ • Pull-based delivery │
│ • Partition-based ordering │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
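The key architectural difference in the diagrams — messages retained in an append-only log with per-group offsets, versus deleted on acknowledgment — can be made concrete with a small in-memory simulation. This is a sketch of the mechanics only, not a Kafka client:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of Kafka's append-only log with per-consumer-group offsets. */
public class LogReplayDemo {

    /** One partition: records survive consumption; only offsets move. */
    static class Partition {
        private final List<String> log = new ArrayList<>();
        // Each consumer group tracks its own committed offset independently.
        private final Map<String, Integer> offsets = new HashMap<>();

        void append(String record) {
            log.add(record);
        }

        /** Reads the next record for a group, advancing only that group's offset. */
        String poll(String group) {
            int offset = offsets.getOrDefault(group, 0);
            if (offset >= log.size()) return null;
            offsets.put(group, offset + 1);
            return log.get(offset);
        }

        /** Replay: rewind a group's offset; the records are still in the log. */
        void seek(String group, int offset) {
            offsets.put(group, offset);
        }
    }

    public static void main(String[] args) {
        Partition p = new Partition();
        p.append("evt-0");
        p.append("evt-1");
        p.append("evt-2");

        // Consumer group A drains the partition once.
        while (p.poll("group-A") != null) { /* process */ }

        // Full replay from offset 0 — possible because nothing was deleted.
        p.seek("group-A", 0);
        System.out.println(p.poll("group-A")); // evt-0 again
    }
}
```

In RabbitMQ's model the equivalent of `seek` does not exist: once `group-A` acknowledges `evt-0`, the broker deletes it, which is exactly why the "Event replay" row above marks RabbitMQ as not supported.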
Feature Comparison Matrix
| Capability | RabbitMQ | Kafka | Winner | Notes |
|---|---|---|---|---|
| Message Routing | Excellent (exchanges, bindings) | Basic (topics, partitions) | RabbitMQ | RabbitMQ has flexible routing patterns |
| Throughput | ~50K msg/sec | ~1M+ msg/sec | Kafka | Kafka designed for high throughput |
| Latency | Low (~1ms) | Low-Medium (~5ms) | RabbitMQ | RabbitMQ slightly faster for small messages |
| Message Retention | Until consumed | Configurable (days/size) | Kafka | Kafka retains messages in log |
| Replay Capability | None | Full replay from any offset | Kafka | Critical for event sourcing, recovery |
| Ordering Guarantee | Per-queue | Per-partition | Tie | Both can guarantee ordering |
| Exactly-Once | Not native | With transactions | Kafka | Kafka has idempotent producers |
| Consumer Groups | Competing consumers | Consumer groups with offsets | Kafka | Kafka tracks progress automatically |
| Dead Letter Handling | Native DLQ | Manual (separate topic) | RabbitMQ | RabbitMQ has built-in DLX |
| Schema Registry | None (manual) | Confluent Schema Registry | Kafka | Avro/Protobuf with evolution |
| Multi-Tenancy | vhosts | Topic naming + ACLs | Tie | Both support isolation |
| Operational Complexity | Medium | High | RabbitMQ | Kafka requires more tuning |
| Cloud Managed Options | CloudAMQP, Amazon MQ | Confluent, Amazon MSK, Aiven | Tie | Both have managed services |
Use Case Fit Analysis
| Use Case | RabbitMQ Fit | Kafka Fit | Platform Need |
|---|---|---|---|
| Task queues (shoutout processing) | ✓✓✓ Excellent | ✓ Good | High |
| Request/reply (sync over async) | ✓✓✓ Native | ✗ Anti-pattern | Medium |
| Pub/sub fanout (SSE broadcasts) | ✓✓ Good | ✓✓ Good | High |
| Event sourcing (audit trail) | ✗ Not designed | ✓✓✓ Native | Growing |
| Stream processing (analytics) | ✗ Not designed | ✓✓✓ Kafka Streams | Growing |
| Log aggregation | ✗ Not designed | ✓✓✓ Native | Low (using Loki) |
| CQRS projections | ✓ Supported | ✓✓✓ Ideal | Growing |
| Microservice events | ✓✓ Good | ✓✓ Good | High |
Cost Analysis
| Factor | RabbitMQ (Current) | RabbitMQ (Enhanced) | Kafka (Migration) |
|---|---|---|---|
| Infrastructure | 4 clusters (per-tenant) | 1 cluster (vhosts) | 1 cluster |
| Monthly cost (est) | $800/mo | $400/mo | $600-1000/mo |
| Operational overhead | Medium | Medium | High (initially) |
| Migration effort | N/A | 2-4 weeks | 3-6 months |
| Code changes | N/A | core-lib only | 92 classes + contracts |
| Team training | N/A | Minimal | Significant |
| Risk | Low | Low | Medium-High |
Migration Effort Breakdown (If Kafka)
| Component | Effort | Files Affected |
|---|---|---|
| Replace core-lib Transport | 2 weeks | 5 files |
| Rewrite senders | 2 weeks | 40+ classes |
| Rewrite receivers | 2 weeks | 20+ classes |
| Update configs | 1 week | 50+ application.yml |
| Schema Registry setup | 1 week | New infrastructure |
| Testing & validation | 4 weeks | All services |
| Total | 12-16 weeks | 100+ files |
Decision Options
Option A: Enhance RabbitMQ (Original v1 Recommendation)
Add reliability patterns to the existing RabbitMQ infrastructure:
- Idempotency via Redis-backed store
- DLQ configuration with 3-retry policy
- Circuit breakers for synchronous calls
- Lightweight message versioning (headers)
- Consolidation to a single cluster with vhosts
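The Redis-backed idempotency store is the highest-leverage item in this list, since the current state table flags duplicate processing as an open risk. A minimal sketch, with an in-memory map standing in for Redis (`SET key NX EX <ttl>` in the real thing); the class and method names are illustrative, not the actual core-lib API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/**
 * Sketch of an idempotent message handler: the delegate runs at most once
 * per messageId, so broker redeliveries become no-ops. A ConcurrentHashMap
 * stands in for Redis here; production would use SETNX with a TTL so the
 * dedup window eventually expires.
 */
public class IdempotentHandler {
    // messageId -> processed marker.
    private final Map<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true if the delegate ran, false if the message was a duplicate. */
    public boolean handle(String messageId, String payload, Consumer<String> delegate) {
        // putIfAbsent is atomic, mirroring Redis SETNX semantics.
        if (seen.putIfAbsent(messageId, Boolean.TRUE) != null) {
            return false; // duplicate: already processed (or in flight)
        }
        delegate.accept(payload);
        return true;
    }
}
```

Wrapping the core-lib MessageHandler this way keeps the change confined to core-lib, which is what makes Option A a 2-4 week effort rather than a per-service rewrite.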
Pros:
- Low risk, incremental improvement
- Minimal code changes (core-lib only)
- Team already knows RabbitMQ
- 2-4 weeks effort

Cons:
- No event replay capability
- No event sourcing for audit
- Limited stream processing options
- Per-tenant cluster consolidation still needed
Option B: Hybrid Architecture (RabbitMQ + Kafka)
Keep RabbitMQ for transactional messaging, add Kafka for event streaming:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Hybrid Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ RabbitMQ │ Task Queues │ Kafka │ Event Streams │
│ │ │ Request/Reply │ │ Event Sourcing │
│ │ │ Shoutout Processing │ │ Audit Trail │
│ │ │ Payment Workflows │ │ Analytics │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ [Transactional Services] [Knowledge Graph Service] │
│ [Payment Service] [Analytics Service] │
│ [Notification Service] [Audit Service] │
│ │
│ Bridge: RabbitMQ events → Kafka (for retention/replay) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
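The bridge in the diagram — a RabbitMQ consumer that forwards a selected subset of events into Kafka for retention and replay — can be sketched as follows. The broker clients are abstracted behind functional interfaces; real code would wire a `@RabbitListener` to a `KafkaProducer`, and every name here (including the `bridged.` topic prefix) is illustrative:

```java
import java.util.List;
import java.util.function.BiConsumer;

/**
 * Sketch of the RabbitMQ -> Kafka bridge from the hybrid diagram: messages
 * of bridged types are forwarded to Kafka (one topic per message type) so
 * they gain retention and replay; everything else stays RabbitMQ-only.
 */
public class EventBridge {
    private final BiConsumer<String, String> kafkaProducer; // (topic, payload)
    private final List<String> bridgedTypes;                // allowlist of types to forward

    public EventBridge(BiConsumer<String, String> kafkaProducer, List<String> bridgedTypes) {
        this.kafkaProducer = kafkaProducer;
        this.bridgedTypes = bridgedTypes;
    }

    /** Called for every RabbitMQ message; returns true if it was forwarded. */
    public boolean onRabbitMessage(String messageType, String payload) {
        if (!bridgedTypes.contains(messageType)) {
            return false; // not a bridged type; RabbitMQ handles it alone
        }
        // Illustrative topic naming: one Kafka topic per bridged message type.
        kafkaProducer.accept("bridged." + messageType, payload);
        return true;
    }
}
```

Keeping the bridge as an explicit allowlist is a deliberate choice: it bounds the "bridge logic between systems" complexity listed in the cons, because only events that actually need retention cross the boundary.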
Pros:
- Best of both worlds
- RabbitMQ for existing patterns (low migration)
- Kafka for new streaming use cases
- Event replay for Knowledge Graph ingestion
- Incremental adoption

Cons:
- Two messaging systems to operate
- Increased infrastructure complexity
- Bridge logic between systems
- Team needs to learn Kafka
Option C: Full Kafka Migration
Replace RabbitMQ entirely with Kafka:
Pros:
- Single messaging platform
- Full event sourcing capability
- Schema Registry for contract management
- Future-proof for streaming analytics
- Better consumer lag monitoring

Cons:
- High migration risk (92 classes)
- 3-6 months effort
- Team training required
- Some patterns less natural (request/reply)
- Higher operational complexity initially
Hypothesis-Driven Evaluation
Primary Hypothesis (Option B - Hybrid)
A hybrid RabbitMQ + Kafka architecture provides the optimal balance of reliability for existing patterns and capability for new streaming requirements, while minimizing migration risk.
Evidence Chain
| Claim | Evidence | Assurance |
|---|---|---|
| RabbitMQ working for current patterns | 75+ message types operational | L2 |
| Knowledge Graph needs event replay | Migration recovery, reprocessing | L1 |
| Full Kafka migration high risk | 92 classes, 3-6 month estimate | L1 |
| Hybrid reduces migration scope | New services use Kafka, existing stay | L1 |
| Team can learn Kafka incrementally | Knowledge Graph team pilots | L0 |
| Event sourcing needed for audit | CDD compliance requirements | L1 |
Overall Confidence: L1 (WLNK capped by unverified team learning curve)
Alternative Hypothesis (Option A - Enhance RabbitMQ)
Enhancing RabbitMQ with reliability patterns is sufficient for all current and near-term requirements.
- Evidence: Current workload is command/event, not event sourcing (L2)
- Counter-evidence: Knowledge Graph needs replay for bulk migration (L1)
- Counter-evidence: CDD audit trail requirements emerging (L1)
Assessment: Option A is viable short-term but may require revisiting within 12 months.
Falsifiability Criteria
Option B (Hybrid) is FALSE if:
- Kafka operational overhead exceeds benefit (>40 hours/month ops)
- Event replay is never actually used in production
- Team struggles to adopt Kafka after 6 months
- Two-system complexity causes more incidents than it prevents

Option A (Enhanced RabbitMQ) is FALSE if:
- Event replay becomes a critical requirement within 6 months
- A compliance audit requires full event history
- Stream processing becomes a core capability need
Recommendation
Short-Term (0-3 months): Enhanced RabbitMQ
Implement the v1 reliability enhancements:
1. Add idempotency to core-lib MessageHandler
2. Configure DLQ with retry policy
3. Add circuit breakers for sync calls
4. Consolidate to a single cluster with vhosts
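The DLQ-with-retry step deserves a concrete shape, since today most services have no DLQ at all. In real RabbitMQ the retry count lives in the `x-death` header and routing goes through a DLX binding; this sketch inlines those mechanics to show the policy itself (retry up to N, then dead-letter instead of redelivering forever):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the 3-retry DLQ policy: a failing message is retried up to
 * maxRetries, then routed to a dead-letter sink for manual inspection.
 * The broker's x-death counting and DLX binding are simulated in-process.
 */
public class RetryPolicy {
    private final int maxRetries;
    private final List<String> deadLetters = new ArrayList<>();

    public RetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Attempts delivery; after maxRetries failures, dead-letters the payload. */
    public void deliver(String payload, Consumer<String> handler) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                handler.accept(payload);
                return; // success: ack and stop
            } catch (RuntimeException e) {
                // nack; the broker would redeliver with an incremented x-death count
            }
        }
        deadLetters.add(payload); // retries exhausted: route to DLQ
    }

    public List<String> deadLetters() {
        return deadLetters;
    }
}
```

The important property is the bound: a poison message costs exactly `maxRetries` handler invocations before it lands in the DLQ, rather than looping on the queue indefinitely.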
Medium-Term (3-12 months): Hybrid Architecture Pilot
- Deploy Kafka alongside RabbitMQ
- Knowledge Graph service uses Kafka for:
  - Event ingestion (replay capability)
  - Analytics event streaming
  - Audit trail retention
- Bridge: Forward critical RabbitMQ events to Kafka for retention
- Evaluate operational experience
Long-Term (12+ months): Strategic Decision
Based on the hybrid pilot:
- If Kafka proves valuable → gradual migration of more services
- If Kafka overhead outweighs benefit → remain on enhanced RabbitMQ
- Decision informed by actual operational data
Implementation Plan (Hybrid Approach)
Phase 1: RabbitMQ Enhancements (Weeks 1-4)
| Task | Owner | Effort |
|---|---|---|
| Idempotency store (Redis) | Platform | 1 week |
| DLQ configuration | Platform | 3 days |
| Circuit breakers | Platform | 1 week |
| Cluster consolidation | DevOps | 1 week |
Phase 2: Kafka Infrastructure (Weeks 5-8)
| Task | Owner | Effort |
|---|---|---|
| Kafka cluster setup (Confluent Cloud or self-hosted) | DevOps | 1 week |
| Schema Registry setup | DevOps | 3 days |
| Monitoring & alerting | DevOps | 1 week |
| Team training | Platform | 1 week |
Phase 3: Knowledge Graph Kafka Integration (Weeks 9-12)
| Task | Owner | Effort |
|---|---|---|
| Kafka ingestion listeners | KG Team | 2 weeks |
| Event replay for migration | KG Team | 1 week |
| Analytics streaming | KG Team | 1 week |
Phase 4: Bridge & Evaluation (Weeks 13-16)
| Task | Owner | Effort |
|---|---|---|
| RabbitMQ → Kafka bridge | Platform | 2 weeks |
| Operational runbook | DevOps | 1 week |
| 30-day evaluation | All | Ongoing |
Bounded Validity
Scope
- Applies to: All inter-service messaging, event streaming, audit requirements
- Does not apply to: In-process events, database triggers, webhooks
Expiry Conditions
- Time-based: Re-evaluate after 6-month hybrid pilot
- Scale-based: If message volume exceeds 1M/day
- Complexity-based: If operational overhead of hybrid exceeds 40 hours/month
Review Triggers
- Kafka adoption rate among teams
- Incident rate comparison (RabbitMQ vs Kafka)
- Event replay actual usage frequency
- Cost variance from estimates
Monitoring
- Message throughput per system
- Consumer lag (Kafka)
- DLQ depth (RabbitMQ)
- Cross-system latency (bridge)
Consequences
Positive (Hybrid Approach)
- Reliability improvements on existing RabbitMQ
- Event replay capability for Knowledge Graph
- Incremental Kafka adoption reduces risk
- Future-proof for streaming analytics
- CDD audit trail capability
Negative (Hybrid Approach)
- Two messaging systems to operate
- Bridge adds complexity
- Team learning curve for Kafka
- Higher initial infrastructure cost
Neutral
- Core-lib abstraction can support both backends
- Existing services unchanged initially
Related Decisions
- ADR-002: Multi-tenant compute (cluster consolidation)
- ADR-019: Knowledge Graph (event ingestion requirements)
- ADR-014: Observability (message tracing)
Original decision date: 2026-02-01
Revision date: 2026-02-01
Review by: 2026-08-01