ADR-021: Messaging & Event Architecture

Last updated: 2026-02-01

Status

Under Review — Kafka vs RabbitMQ strategic analysis in progress

| Version | Status | Date | Notes |
|---|---|---|---|
| v1 | Proposed | 2026-02-01 | Enhance RabbitMQ with reliability patterns |
| v2 | Under Review | 2026-02-01 | Strategic evaluation: Kafka vs RabbitMQ |

Context

RabbitMQ is the platform’s nervous system — 75+ message types, 20+ publishing services, 15+ consuming services, 92 Java classes implementing messaging. The Knowledge Graph initiative (ADR-019) introduces new requirements for event streaming, replay, and analytics that warrant re-evaluation of the messaging strategy.

Current State

| Component | Current | Assessment |
|---|---|---|
| Message broker | RabbitMQ 3.12 (3-node cluster per tenant) | Working, but per-tenant clusters add ops overhead |
| Message types | 75+ across 7 domains | Significant investment in contracts |
| Publishers | 20+ services, 40+ sender classes | Consistent core-lib patterns |
| Consumers | 15+ services, 20+ receiver classes | @RabbitListener + core-lib Transport |
| Idempotency | None | Duplicate processing risk |
| Dead letter queues | Partial (HMS only) | Most services lack DLQs |
| Message versioning | None | Schema changes can break consumers |
| Event replay | Not possible | Messages are consumed and deleted |
| Audit trail | None | No event history retention |

New Requirements (Knowledge Graph Era)

| Requirement | Source | RabbitMQ | Kafka |
|---|---|---|---|
| Real-time graph ingestion | Epic #53 | ✓ Supported | ✓ Supported |
| Event replay for reprocessing | Migration, recovery | ✗ Not supported | ✓ Native (log retention) |
| Event sourcing for audit | Compliance, CDD | ✗ Not designed for this | ✓ Native pattern |
| Consumer lag monitoring | Operations | ✗ Limited | ✓ Built-in |
| Multi-tenant isolation | ADR-002 | ✓ vhosts | ✓ Topic prefixes/ACLs |
| Exactly-once semantics | Reliability | ✗ At-least-once at best (duplicates possible) | ✓ With transactions |
| High-throughput streaming | Future scale | ~50K msg/sec | ~1M+ msg/sec |
| Schema evolution | Contract management | Manual | Schema Registry |
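To make the multi-tenant isolation row concrete: a minimal sketch of topic-prefix ACLs using Kafka's admin client, assuming a `tenant-a.` topic-prefix naming scheme and a hypothetical `User:tenant-a-consumer` principal (neither is defined elsewhere in this ADR).

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class TenantAclSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the hypothetical tenant-a consumer principal to read only topics
            // whose names start with the tenant prefix.
            ResourcePattern tenantTopics =
                    new ResourcePattern(ResourceType.TOPIC, "tenant-a.", PatternType.PREFIXED);
            AccessControlEntry allowRead =
                    new AccessControlEntry("User:tenant-a-consumer", "*",
                            AclOperation.READ, AclPermissionType.ALLOW);

            admin.createAcls(List.of(new AclBinding(tenantTopics, allowRead))).all().get();
        }
    }
}
```

Prefixed resource patterns are the closest Kafka analogue to the per-tenant vhost isolation required by ADR-002.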

Strategic Analysis: Kafka vs RabbitMQ

Architectural Comparison

┌─────────────────────────────────────────────────────────────────────────────┐
│                           RabbitMQ Architecture                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Producer ──► Exchange ──► Queue ──► Consumer                              │
│                  │                                                          │
│                  │ (routing)                                                │
│                  ▼                                                          │
│              [Queue 1] ──► Consumer A                                       │
│              [Queue 2] ──► Consumer B (competing consumers)                 │
│                                                                             │
│   • Message deleted after acknowledgment                                    │
│   • No replay capability                                                    │
│   • Push-based delivery                                                     │
│   • Per-message routing flexibility                                         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
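For orientation, a minimal sketch of the push-based consumer pattern described above, assuming Spring AMQP with a JSON message converter; the core-lib Transport wrapper noted in the Current State table is omitted, and ShoutoutCreatedEvent is a hypothetical payload type.

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class ShoutoutCreatedListener {

    // Hypothetical payload; real contracts live in the platform's message-type catalogue.
    // Assumes a JSON message converter (e.g., Jackson2JsonMessageConverter) is configured.
    public record ShoutoutCreatedEvent(String shoutoutId, String tenantId) {}

    // RabbitMQ pushes messages from the queue; once this method returns without
    // throwing, the message is acknowledged and deleted from the broker.
    @RabbitListener(queues = "shoutout.created")
    public void onShoutoutCreated(ShoutoutCreatedEvent event) {
        // process the event; after acknowledgment there is no way to replay it
    }
}
```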

┌─────────────────────────────────────────────────────────────────────────────┐
│                            Kafka Architecture                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Producer ──► Topic (Partition 0) ──► Consumer Group A                     │
│                       │                     │                               │
│               (append-only log)       (offset tracking)                     │
│                       │                     │                               │
│               [Log: 0,1,2,3,4...]    Consumer A reads at offset 3           │
│                       │              Consumer B reads at offset 5           │
│                       ▼                                                     │
│               Retention: 7 days (configurable)                              │
│                                                                             │
│   • Messages retained after consumption                                     │
│   • Full replay from any offset                                             │
│   • Pull-based delivery                                                     │
│   • Partition-based ordering                                                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
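A minimal sketch of the pull-based, offset-driven consumption shown above, using the plain Kafka client against a hypothetical kg.entity-events topic; seeking back to an earlier offset is what gives Kafka its replay capability.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplaySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");  // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "kg-replay");            // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("kg.entity-events", 0);
            consumer.assign(List.of(partition));
            // Replay: rewind to the start of the retained log (or seek to any offset).
            consumer.seekToBeginning(List.of(partition));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```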

Feature Comparison Matrix

| Capability | RabbitMQ | Kafka | Winner | Notes |
|---|---|---|---|---|
| Message Routing | Excellent (exchanges, bindings) | Basic (topics, partitions) | RabbitMQ | RabbitMQ has flexible routing patterns |
| Throughput | ~50K msg/sec | ~1M+ msg/sec | Kafka | Kafka is designed for high throughput |
| Latency | Low (~1ms) | Low-medium (~5ms) | RabbitMQ | RabbitMQ is slightly faster for small messages |
| Message Retention | Until consumed | Configurable (days/size) | Kafka | Kafka retains messages in the log |
| Replay Capability | None | Full replay from any offset | Kafka | Critical for event sourcing, recovery |
| Ordering Guarantee | Per-queue | Per-partition | Tie | Both can guarantee ordering |
| Exactly-Once | Not native | With transactions | Kafka | Kafka has idempotent producers |
| Consumer Groups | Competing consumers | Consumer groups with offsets | Kafka | Kafka tracks progress automatically |
| Dead Letter Handling | Native DLQ | Manual (separate topic) | RabbitMQ | RabbitMQ has built-in DLX |
| Schema Registry | None (manual) | Confluent Schema Registry | Kafka | Avro/Protobuf with evolution |
| Multi-Tenancy | vhosts | Topic naming + ACLs | Tie | Both support isolation |
| Operational Complexity | Medium | High | RabbitMQ | Kafka requires more tuning |
| Cloud Managed Options | CloudAMQP, Amazon MQ | Confluent, Amazon MSK, Aiven | Tie | Both have managed services |
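To illustrate the exactly-once row: a minimal sketch of an idempotent, transactional Kafka producer, with the broker address, transactional id, and topic name invented for illustration; consumers would pair this with isolation.level=read_committed.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");           // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");                // broker dedupes producer retries
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "kg-ingestion-producer"); // hypothetical id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("kg.entity-events", "entity-42", "{\"type\":\"EntityUpdated\"}"));
            // All-or-nothing visibility for read_committed consumers.
            producer.commitTransaction();
        }
    }
}
```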

Use Case Fit Analysis

| Use Case | RabbitMQ Fit | Kafka Fit | Platform Need |
|---|---|---|---|
| Task queues (shoutout processing) | ✓✓✓ Excellent | ✓ Good | High |
| Request/reply (sync over async) | ✓✓✓ Native | ✗ Anti-pattern | Medium |
| Pub/sub fanout (SSE broadcasts) | ✓✓ Good | ✓✓ Good | High |
| Event sourcing (audit trail) | ✗ Not designed for this | ✓✓✓ Native | Growing |
| Stream processing (analytics) | ✗ Not designed for this | ✓✓✓ Kafka Streams | Growing |
| Log aggregation | ✗ Not designed for this | ✓✓✓ Native | Low (using Loki) |
| CQRS projections | ✓ Supported | ✓✓✓ Ideal | Growing |
| Microservice events | ✓✓ Good | ✓✓ Good | High |
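The request/reply row reflects a pattern the platform already uses heavily; as a reference point, a minimal Spring AMQP sketch, with the exchange and routing-key names invented for illustration.

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Service;

@Service
public class ProfileLookupClient {

    private final RabbitTemplate rabbitTemplate;

    public ProfileLookupClient(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // Synchronous request/reply over AMQP: the template sets a correlation id and
    // reply queue, then blocks until the reply arrives or the timeout expires.
    public Object lookupProfile(String userId) {
        return rabbitTemplate.convertSendAndReceive("rpc.exchange", "profile.lookup", userId);
    }
}
```

Emulating this in Kafka requires correlating requests and replies across topics by hand, which is why the matrix marks it as an anti-pattern there.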

Cost Analysis

| Factor | RabbitMQ (Current) | RabbitMQ (Enhanced) | Kafka (Migration) |
|---|---|---|---|
| Infrastructure | 4 clusters (per-tenant) | 1 cluster (vhosts) | 1 cluster |
| Monthly cost (est.) | $800/mo | $400/mo | $600-1,000/mo |
| Operational overhead | Medium | Medium | High (initially) |
| Migration effort | N/A | 2-4 weeks | 3-6 months |
| Code changes | N/A | core-lib only | 92 classes + contracts |
| Team training | N/A | Minimal | Significant |
| Risk | Low | Low | Medium-high |

Migration Effort Breakdown (If Kafka)

| Component | Effort | Files Affected |
|---|---|---|
| Replace core-lib Transport | 2 weeks | 5 files |
| Rewrite senders | 2 weeks | 40+ classes |
| Rewrite receivers | 2 weeks | 20+ classes |
| Update configs | 1 week | 50+ application.yml files |
| Schema Registry setup | 1 week | New infrastructure |
| Testing & validation | 4 weeks | All services |
| Total | 12-16 weeks | 100+ files |

Decision Options

Option A: Enhance RabbitMQ (Original v1 Recommendation)

Add reliability patterns to the existing RabbitMQ infrastructure (an idempotency sketch follows this list):

- Idempotency via a Redis-backed store
- DLQ configuration with a 3-retry policy
- Circuit breakers for synchronous calls
- Lightweight message versioning (headers)
- Consolidation to a single cluster with vhosts
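A minimal sketch of the Redis-backed idempotency item, assuming Spring Data Redis; the key format and TTL are illustrative, and the hook into the core-lib MessageHandler is not shown.

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class IdempotencyGuard {

    private final StringRedisTemplate redis;

    public IdempotencyGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Returns true only the first time a message id is seen: SET-if-absent with a TTL
    // keeps the store bounded while covering the realistic redelivery window.
    public boolean firstDelivery(String messageId) {
        Boolean claimed = redis.opsForValue()
                .setIfAbsent("msg:processed:" + messageId, "1", Duration.ofHours(24));
        return Boolean.TRUE.equals(claimed);
    }
}
```

A consumer would call firstDelivery() with the message id header before processing and skip (but still acknowledge) anything already seen.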

Pros:

- Low risk, incremental improvement
- Minimal code changes (core-lib only)
- Team already knows RabbitMQ
- 2-4 weeks of effort

Cons:

- No event replay capability
- No event sourcing for audit
- Limited stream processing options
- Per-tenant cluster consolidation still needed

Option B: Hybrid Architecture (RabbitMQ + Kafka)

Keep RabbitMQ for transactional messaging, add Kafka for event streaming:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Hybrid Architecture                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────┐                          ┌─────────────┐                  │
│   │  RabbitMQ   │  Task Queues             │   Kafka     │  Event Streams   │
│   │             │  Request/Reply           │             │  Event Sourcing  │
│   │             │  Shoutout Processing     │             │  Audit Trail     │
│   │             │  Payment Workflows       │             │  Analytics       │
│   └──────┬──────┘                          └──────┬──────┘                  │
│          │                                        │                         │
│          ▼                                        ▼                         │
│   [Transactional Services]              [Knowledge Graph Service]           │
│   [Payment Service]                     [Analytics Service]                 │
│   [Notification Service]                [Audit Service]                     │
│                                                                             │
│   Bridge: RabbitMQ events → Kafka (for retention/replay)                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
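The bridge in the diagram could start as a single listener that re-publishes selected RabbitMQ events onto a retained Kafka topic. A minimal sketch, assuming Spring AMQP and Spring Kafka, with the queue and topic names invented for illustration.

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class RabbitToKafkaBridge {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public RabbitToKafkaBridge(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Consume from the existing RabbitMQ topology and append the raw payload to a
    // retained Kafka topic so it can later be replayed or audited.
    @RabbitListener(queues = "bridge.domain-events")
    public void forward(String payload) {
        kafkaTemplate.send("platform.domain-events", payload);
    }
}
```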

Pros:

- Best of both worlds
- RabbitMQ retained for existing patterns (low migration cost)
- Kafka for new streaming use cases
- Event replay for Knowledge Graph ingestion
- Incremental adoption

Cons:

- Two messaging systems to operate
- Increased infrastructure complexity
- Bridge logic between the two systems
- Team needs to learn Kafka

Option C: Full Kafka Migration

Replace RabbitMQ entirely with Kafka:

Pros:

- Single messaging platform
- Full event sourcing capability
- Schema Registry for contract management
- Future-proof for streaming analytics
- Better consumer lag monitoring

Cons:

- High migration risk (92 classes)
- 3-6 months of effort
- Team training required
- Some patterns are less natural (request/reply)
- Higher operational complexity initially


Hypothesis-Driven Evaluation

Primary Hypothesis (Option B - Hybrid)

A hybrid RabbitMQ + Kafka architecture provides the optimal balance of reliability for existing patterns and capability for new streaming requirements, while minimizing migration risk.

Evidence Chain

| Claim | Evidence | Assurance |
|---|---|---|
| RabbitMQ working for current patterns | 75+ message types operational | L2 |
| Knowledge Graph needs event replay | Migration recovery, reprocessing | L1 |
| Full Kafka migration is high risk | 92 classes, 3-6 month estimate | L1 |
| Hybrid reduces migration scope | New services use Kafka; existing stay on RabbitMQ | L1 |
| Team can learn Kafka incrementally | Knowledge Graph team pilots it | L0 |
| Event sourcing needed for audit | CDD compliance requirements | L1 |

Overall Confidence: L1 (WLNK capped by unverified team learning curve)

Alternative Hypothesis (Option A - Enhance RabbitMQ)

Enhancing RabbitMQ with reliability patterns is sufficient for all current and near-term requirements.

Assessment: Option A is viable short-term but may require revisiting within 12 months.

Falsifiability Criteria

Option B (Hybrid) is FALSE if:

- Kafka operational overhead exceeds its benefit (>40 hours/month of ops work)
- Event replay is never actually used in production
- The team still struggles to operate Kafka after 6 months
- Two-system complexity causes more incidents than it prevents

Option A (Enhanced RabbitMQ) is FALSE if:

- Event replay becomes a critical requirement within 6 months
- A compliance audit requires full event history
- Stream processing becomes a core capability need


Recommendation

Short-Term (0-3 months): Enhanced RabbitMQ

Implement the v1 reliability enhancements (a DLQ configuration sketch follows this list):

1. Add idempotency to the core-lib MessageHandler
2. Configure DLQs with a retry policy
3. Add circuit breakers for synchronous calls
4. Consolidate to a single cluster with vhosts
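A minimal sketch of the DLQ item, assuming Spring AMQP and Spring Retry; the queue names are illustrative, and the retry advice still has to be attached to the listener container factory's advice chain.

```java
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.amqp.rabbit.config.RetryInterceptorBuilder;
import org.springframework.amqp.rabbit.retry.RejectAndDontRequeueRecoverer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.interceptor.RetryOperationsInterceptor;

@Configuration
public class DeadLetterConfig {

    // Main queue routes rejected messages to a dead-letter exchange and routing key.
    @Bean
    public Queue shoutoutQueue() {
        return QueueBuilder.durable("shoutout.created")
                .withArgument("x-dead-letter-exchange", "dlx")
                .withArgument("x-dead-letter-routing-key", "shoutout.created.dlq")
                .build();
    }

    @Bean
    public Queue shoutoutDlq() {
        return QueueBuilder.durable("shoutout.created.dlq").build();
    }

    // Three delivery attempts, then reject without requeue so the broker dead-letters
    // the message instead of redelivering it forever.
    @Bean
    public RetryOperationsInterceptor retryAdvice() {
        return RetryInterceptorBuilder.stateless()
                .maxAttempts(3)
                .recoverer(new RejectAndDontRequeueRecoverer())
                .build();
    }
}
```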

Medium-Term (3-12 months): Hybrid Architecture Pilot

  1. Deploy Kafka alongside RabbitMQ
  2. Knowledge Graph service uses Kafka for:
    • Event ingestion (replay capability)
    • Analytics event streaming
    • Audit trail retention
  3. Bridge: Forward critical RabbitMQ events to Kafka for retention
  4. Evaluate operational experience

Long-Term (12+ months): Strategic Decision

Based on the hybrid pilot:

- If Kafka proves valuable → gradually migrate more services
- If Kafka overhead outweighs the benefit → remain on enhanced RabbitMQ
- Either way, the decision is informed by actual operational data


Implementation Plan (Hybrid Approach)

Phase 1: RabbitMQ Enhancements (Weeks 1-4)

| Task | Owner | Effort |
|---|---|---|
| Idempotency store (Redis) | Platform | 1 week |
| DLQ configuration | Platform | 3 days |
| Circuit breakers | Platform | 1 week |
| Cluster consolidation | DevOps | 1 week |
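A minimal sketch of the circuit-breaker task, assuming Resilience4j; the protected call is represented by a generic Supplier rather than any real platform client.

```java
import java.time.Duration;
import java.util.function.Supplier;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

public class SyncCallProtection {

    private final CircuitBreaker breaker;

    public SyncCallProtection() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open after 50% of calls fail
                .waitDurationInOpenState(Duration.ofSeconds(30)) // probe the dependency again after 30s
                .build();
        this.breaker = CircuitBreakerRegistry.of(config).circuitBreaker("downstream-service");
    }

    // Wrap the synchronous call; while the breaker is open the supplier is not invoked
    // and CallNotPermittedException is thrown instead, so the caller fails fast.
    public String call(Supplier<String> downstreamCall) {
        return CircuitBreaker.decorateSupplier(breaker, downstreamCall).get();
    }
}
```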

Phase 2: Kafka Infrastructure (Weeks 5-8)

| Task | Owner | Effort |
|---|---|---|
| Kafka cluster setup (Confluent Cloud or self-hosted) | DevOps | 1 week |
| Schema Registry setup | DevOps | 3 days |
| Monitoring & alerting | DevOps | 1 week |
| Team training | Platform | 1 week |
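A minimal sketch of what the Schema Registry task enables on the producer side, assuming Confluent's Avro serializer; the broker and registry addresses are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class SchemaRegistryProducerProps {

    // Producer properties for schema-validated publishing: values are serialized as Avro
    // and their schemas are registered and compatibility-checked against the registry,
    // which is what enforces safe schema evolution across services.
    public static Properties avroProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");      // Confluent Avro serializer
        props.put("schema.registry.url", "http://schema-registry:8081");    // assumed registry address
        return props;
    }
}
```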

Phase 3: Knowledge Graph Kafka Integration (Weeks 9-12)

| Task | Owner | Effort |
|---|---|---|
| Kafka ingestion listeners | KG Team | 2 weeks |
| Event replay for migration | KG Team | 1 week |
| Analytics streaming | KG Team | 1 week |
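A minimal sketch of a Kafka ingestion listener for the Knowledge Graph service, assuming Spring Kafka; the topic and consumer-group names are illustrative.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class GraphIngestionListener {

    // Consumer-group offsets are tracked by Kafka, so ingestion can resume where it
    // left off or be replayed by resetting the group's offsets, without re-publishing.
    @KafkaListener(topics = "kg.entity-events", groupId = "knowledge-graph-ingestion")
    public void onEntityEvent(ConsumerRecord<String, String> record) {
        // upsert the entity or edge described by record.value() into the graph store
    }
}
```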

Phase 4: Bridge & Evaluation (Weeks 13-16)

| Task | Owner | Effort |
|---|---|---|
| RabbitMQ → Kafka bridge | Platform | 2 weeks |
| Operational runbook | DevOps | 1 week |
| 30-day evaluation | All | Ongoing |

Bounded Validity

Scope

Expiry Conditions

Review Triggers

Monitoring


Consequences

Positive (Hybrid Approach)

Negative (Hybrid Approach)

Neutral



Original decision date: 2026-02-01 | Revision date: 2026-02-01 | Review by: 2026-08-01