Target Architecture — Gen 3 Proposal

Last updated: 2026-02-01 | Modernization

Key Takeaways

  1. H14 evaluation: Incremental upgrade, not full rewrite — Evidence from 10 sessions supports upgrading Gen 2 in place rather than building Gen 3 from scratch. Gen 2 services are already Java 21/Spring Boot 3.5.4, use consistent patterns (core-lib, GraphQL, RabbitMQ), and have clean service boundaries. The debt is in infrastructure, testing, and BPM — not application architecture.
  2. Consolidate from ~35 application services to ~18 — Group by domain-driven boundaries. Merge over-decomposed services (wallet+transaction, email+sms+notifications, shoutout+shoutout-bpm). Keep services that serve as infrastructure (SSE, Keycloak).
  3. Unified frontend monorepo — Merge peeq-mono and frontends into a single Angular 18/Nx monorepo with shared component library. Standardize on Tailwind CSS. Component-by-component migration, not big-bang.
  4. Shared cluster with namespace isolation — Replace cluster-per-tenant with namespace-per-tenant in a single regional GKE cluster. NetworkPolicies + Istio AuthorizationPolicy for isolation. H11 L2 confirms no code-level branching.
  5. Replace Camunda 7.17.0 with Operaton — Adopt Operaton (community-owned Camunda 7 fork) as strategic BPM platform. Near-zero migration preserves existing BPMN files and DB schema. BPM is a platform investment for future process-based capabilities, not just a legacy dependency (see ADR-013).

Migration Decision Question

What does Gen 3 look like?


H14 Evaluation: Gen 3 Rewrite vs Gen 2 Upgrade

Hypothesis

H14: A Gen 3 rewrite is justified over upgrading Gen 2 in place.

Evidence Assessment

| Factor | Evidence | Favors |
|---|---|---|
| Tech stack currency | Java 21, Spring Boot 3.5.4, Angular 18.2 (all current) | Upgrade |
| Pattern consistency | All services use core-lib, GraphQL, RabbitMQ, Keycloak OAuth2 (H13 L1) | Upgrade |
| Service boundaries | Database-per-service, no shared DB backdoors (H6 L1) | Upgrade |
| Multi-brand architecture | Config-only differentiation, no code branching (H11 L2) | Upgrade |
| BPM engine EOL | Camunda 7.17.0 CE must be replaced regardless | Neutral |
| Test coverage | Near-zero (H7 falsified), but a rewrite wouldn't have tests either | Neutral |
| Infrastructure debt | Zonal clusters, no alerting: infrastructure changes, not app rewrites | Upgrade |
| Frontend split | CSS framework mismatch: restyling, not logic rewrite (H4 L1) | Upgrade |
| Service count | Over-decomposed: consolidation needed, not rewrite | Upgrade |
| External integrations | 11 active APIs, all of which must be preserved regardless of approach | Neutral |
| Schema complexity | ~280 Flyway migrations; data migration cost is the same either way | Neutral |
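
The tally behind the verdict can be reproduced with a few lines. This is a minimal sketch that only transcribes the Favors column of the evidence table; the names `EVIDENCE` and `tally` are illustrative and not part of any project tooling.

```python
from collections import Counter

# Factor -> option favored, transcribed from the evidence table.
EVIDENCE = {
    "Tech stack currency": "Upgrade",
    "Pattern consistency": "Upgrade",
    "Service boundaries": "Upgrade",
    "Multi-brand architecture": "Upgrade",
    "BPM engine EOL": "Neutral",
    "Test coverage": "Neutral",
    "Infrastructure debt": "Upgrade",
    "Frontend split": "Upgrade",
    "Service count": "Upgrade",
    "External integrations": "Neutral",
    "Schema complexity": "Neutral",
}


def tally(evidence):
    """Count how many factors favor each option."""
    return Counter(evidence.values())


counts = tally(EVIDENCE)
print(dict(counts))
```

Seven factors favor the upgrade, four are neutral, and none favor a rewrite.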

Verdict

H14: FALSIFIED — The evidence does not support a full Gen 3 rewrite.

Recommendation: Incremental upgrade with targeted consolidation.

The platform’s application layer is architecturally sound: modern tech stack, consistent patterns, clean boundaries. The debt is concentrated in infrastructure (zonal HA, observability, security enforcement), BPM engine replacement, test coverage, and frontend unification — none of which require rewriting business logic.

A full rewrite would:

  - Duplicate ~280 Flyway migrations worth of schema knowledge
  - Re-implement ~75 RabbitMQ message contracts
  - Re-integrate 11 external APIs
  - Re-create 24+ GraphQL gateway endpoints
  - Introduce regression risk with zero test coverage as starting point

An incremental upgrade preserves all existing integration work and focuses effort on actual problems.

WLNK Analysis

Overall confidence: L1 (Validated)

Weakest link: H8 (data volumes L0). If production data volumes are unexpectedly large, some services might need more fundamental data model changes. However, this affects migration strategy, not the rewrite-vs-upgrade decision.

Bounded Validity


Target Service Architecture

Service Consolidation Map

graph TD
    subgraph "Gen 2 Current (~35 services)"
        direction TB
        C1[celebrity]
        C2[fan]
        C3[users]
        C4[content]
        C5[media]
        C6[webinar]
        C7[stripe]
        C8[subscriptions]
        C9[purchase-request-bpm]
        C10[wallet]
        C11[transaction]
        C12[shoutout]
        C13[shoutout-bpm]
        C14[inventory]
        C15[class-catalog]
        C16[onsite-event]
        C17[email]
        C18[sms]
        C19[notifications]
        C20[chat]
        C21[message-board]
        C22[sse]
        C23[search]
        C24[tags]
        C25[tracking]
        C26[group-profile]
        C27[org-manager]
        C28[journey]
    end

    subgraph "Target (~18 services)"
        direction TB
        T1[identity-service<br/>celebrity + fan + users]
        T2[content-service<br/>content + media]
        T3[webinar-service<br/>webinar]
        T4[payment-service<br/>stripe + subscriptions + wallet + transaction]
        T5[purchase-workflow<br/>purchase-request BPM on Operaton]
        T6[shoutout-service<br/>shoutout + shoutout-bpm merged]
        T7[inventory-service<br/>inventory]
        T8[class-catalog-service<br/>class-catalog + journey]
        T9[event-service<br/>onsite-event]
        T10[notification-service<br/>email + sms + notifications]
        T11[chat-service<br/>chat]
        T12[message-board-service<br/>message-board]
        T13[sse-service<br/>sse]
        T14[search-service<br/>search]
        T15[platform-services<br/>tags + tracking + group-profile + org-manager]
        T16[keycloak<br/>identity provider]
        T17[admin-frontend<br/>unified Angular monorepo]
        T18[fan-frontend<br/>unified Angular monorepo]
    end

    C1 --> T1
    C2 --> T1
    C3 --> T1
    C4 --> T2
    C5 --> T2
    C6 --> T3
    C7 --> T4
    C8 --> T4
    C10 --> T4
    C11 --> T4
    C9 --> T5
    C12 --> T6
    C13 --> T6
    C14 --> T7
    C15 --> T8
    C28 --> T8
    C16 --> T9
    C17 --> T10
    C18 --> T10
    C19 --> T10
    C20 --> T11
    C21 --> T12
    C22 --> T13
    C23 --> T14
    C24 --> T15
    C25 --> T15
    C26 --> T15
    C27 --> T15

Consolidation Rationale

| Target Service | Source Services | Rationale |
|---|---|---|
| identity-service | celebrity, fan, users | Same domain (user profiles), shared Keycloak dependency, small API surfaces |
| content-service | content, media | Shared Mux integration, overlapping video handling, same storage patterns |
| payment-service | stripe, subscriptions, wallet, transaction | Same financial domain; wallet (3 tables) and transaction (1 table) are too small for standalone |
| purchase-workflow | purchase-request-bpm | Migrate from Camunda 7 to Operaton; keep separate due to different runtime characteristics |
| shoutout-service | shoutout, shoutout-bpm | Merge shoutout + shoutout-bpm into single service on Operaton; single deployment unit |
| class-catalog-service | class-catalog, journey | Journey is learning path management, the same domain as class catalog |
| notification-service | email, sms, notifications | Already share a database; natural delivery pipeline (notifications → email/sms) |
| platform-services | tags, tracking, group-profile, org-manager | Small supporting services with minimal traffic; consolidate to reduce operational overhead |

Services Kept Separate

| Service | Reason |
|---|---|
| webinar-service | Zoom integration with distinct lifecycle (registrations, recordings, calendar) |
| inventory-service | Cross-cutting hub (5 domain dependencies); too risky to merge |
| chat-service | Stream Chat SaaS wrapper; distinct real-time protocol |
| message-board-service | Redis SSE fanout; distinct from notification pipeline |
| sse-service | Platform infrastructure (8 handlers, 7+ publishers); cross-cutting |
| search-service | Elasticsearch integration; distinct query patterns |
| event-service | Simple but distinct domain (in-person events) |
| keycloak | Infrastructure service; managed separately |
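
The consolidation map can also be kept as data, so that a check catches a source service accidentally assigned to two targets. The sketch below transcribes the 28 Gen 2 services shown in the map; `CONSOLIDATION` and `invert` are illustrative names, and Keycloak and the two frontends are omitted because they have no Gen 2 source service in the map.

```python
# Target service -> Gen 2 source services, transcribed from the consolidation map.
CONSOLIDATION = {
    "identity-service": ["celebrity", "fan", "users"],
    "content-service": ["content", "media"],
    "webinar-service": ["webinar"],
    "payment-service": ["stripe", "subscriptions", "wallet", "transaction"],
    "purchase-workflow": ["purchase-request-bpm"],
    "shoutout-service": ["shoutout", "shoutout-bpm"],
    "inventory-service": ["inventory"],
    "class-catalog-service": ["class-catalog", "journey"],
    "event-service": ["onsite-event"],
    "notification-service": ["email", "sms", "notifications"],
    "chat-service": ["chat"],
    "message-board-service": ["message-board"],
    "sse-service": ["sse"],
    "search-service": ["search"],
    "platform-services": ["tags", "tracking", "group-profile", "org-manager"],
}


def invert(mapping):
    """Return source -> target, raising if a source is claimed by two targets."""
    owner = {}
    for target, sources in mapping.items():
        for source in sources:
            if source in owner:
                raise ValueError(
                    f"{source} mapped to both {owner[source]} and {target}"
                )
            owner[source] = target
    return owner


OWNER = invert(CONSOLIDATION)
# Targets that actually merge more than one source service.
MERGED = {t: s for t, s in CONSOLIDATION.items() if len(s) > 1}
print(len(OWNER), "sources ->", len(CONSOLIDATION), "backend targets")
```

With Keycloak and the two frontend targets added, the 15 backend targets give the ~18 services quoted in the takeaways.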

Target Tech Stack

Application Layer

| Component | Current (Gen 2) | Target | Change |
|---|---|---|---|
| Language | Java 21 (content: Java 24) | Java 21 LTS | Standardize on LTS; content downgrades from 24 to 21 |
| Framework | Spring Boot 3.5.4 | Spring Boot 3.x (latest stable) | Keep current; upgrade as patch releases |
| Build | Maven + Jib | Maven + Jib | No change |
| API | Spring GraphQL | Spring GraphQL | No change |
| Messaging | RabbitMQ + core-lib MessageSender/Handler | RabbitMQ + core-lib | No change; add idempotency to MessageHandler |
| Auth | Keycloak 26.x + Spring Security OAuth2 | Keycloak 26.x + Spring Security OAuth2 | No change |
| Database | PostgreSQL 16 + Flyway | PostgreSQL 16 + Flyway | No change |
| BPM | Camunda 7.17.0 CE (EOL) | Operaton (community-owned fork) | Replace: Camunda 7 EOL; BPM as strategic platform capability (ADR-013) |
| Shared libs | core-lib 0.0.67-69, messages 0.0.48-73 | core-lib + messages (aligned versions) | Align versions, add idempotency, circuit breakers |
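
The idempotency change planned for MessageHandler follows the usual idempotent-consumer pattern: record the ID of each processed message and skip redeliveries. The sketch below shows the idea in a language-neutral way. It is not the core-lib API; `Message`, `IdempotentHandler`, and the `message_id` field are hypothetical, and the in-memory set stands in for whatever durable store the real implementation would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class Message:
    message_id: str  # assumed to be unique per published message
    payload: dict


@dataclass
class IdempotentHandler:
    """Wraps a handler so a redelivered message is processed at most once."""

    handle: Callable[[dict], None]
    # In-memory stand-in for a durable store of processed message IDs.
    seen: Set[str] = field(default_factory=set)

    def on_message(self, msg: Message) -> bool:
        if msg.message_id in self.seen:
            return False  # duplicate delivery: skip the side effect
        self.handle(msg.payload)  # may raise; the ID is recorded only on success
        self.seen.add(msg.message_id)
        return True


credits = []
handler = IdempotentHandler(handle=lambda p: credits.append(p["amount"]))
first = handler.on_message(Message("m-1", {"amount": 5}))
second = handler.on_message(Message("m-1", {"amount": 5}))  # redelivery
print(first, second, credits)
```

Recording the ID only after the handler succeeds means a failed attempt can still be retried, at the cost of requiring the handler's side effect and the ID write to be atomic in a real store.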

Frontend Layer

| Component | Current | Target | Change |
|---|---|---|---|
| Framework | Angular 18.2 (both repos) | Angular 18.x / Nx | No change |
| Monorepo | 2 repos (peeq-mono, frontends) | 1 unified monorepo | Consolidate |
| CSS | Tailwind/DaisyUI (peeq-mono) + Bootstrap/Angular Material (frontends) | Tailwind/DaisyUI | Standardize |
| Component lib | @vzlabs/ui + @vzlabs/peeq-ui (duplicate) | Single shared library | Consolidate |
| Mobile | Ionic 6 (in peeq-mono) | Ionic 6 or Capacitor | Evaluate; keep if working |
| State | Angular services + Apollo cache | Same | No change |
| API client | Apollo Client (inline gql) | Apollo Client | No change; add schema validation |

Infrastructure Layer

| Component | Current | Target | Change |
|---|---|---|---|
| Compute | GKE 1.30 (zonal, cluster-per-tenant) | GKE (regional, shared cluster) | Consolidate + HA |
| Isolation | Cluster isolation | Namespace + NetworkPolicy + Istio AuthPolicy | Change model |
| Database | Cloud SQL (zonal, 35 DBs per tenant) | Cloud SQL (regional HA) | HA upgrade |
| Routing | Istio IngressGateway (path-based) | Istio IngressGateway | No change |
| DNS | Cloud DNS + external-dns + cert-manager | Same | No change |
| GitOps | ArgoCD + Helm common chart | Same | No change |
| CI/CD | GitHub Actions (28 workflows) | Same + enforce security gates | Add gates |
| IaC | Terraform 1.9.5 + Atlantis | Same + regional modules | Upgrade modules |
| Secrets | GCP Secret Manager → AVP → K8s Secrets | Same | No change |
| Cost | CastAI (spot instances) | CastAI + PDB review | Optimize overrides |

Observability Layer

| Component | Current | Target | Change |
|---|---|---|---|
| Metrics | Prometheus + Grafana (no alerts) | Prometheus + Grafana + AlertManager | Add alerting |
| Tracing | Istio Stackdriver (not adopted) | OpenTelemetry auto-instrumentation | Add tracing |
| Logging | peeq-logging → Elasticsearch 7.x | GCP Cloud Logging + Elastic Cloud 8.x | Replace pipeline |
| APM | Elastic APM (disabled) | OpenTelemetry + Grafana Tempo | Replace with OTel |
| SLOs | None | Define per-service SLIs/SLOs | New |
| Error tracking | None | Sentry or equivalent | New |
| Session replay | LogRocket | Keep LogRocket | No change |

Database Strategy

Consolidation with Service Merges

When services merge, their databases merge:

| Target Service | Source Databases | Strategy |
|---|---|---|
| identity-service | celebrity-db, fan-db, users-db | Merge into single identity DB |
| content-service | content-db, media-db | Merge into single content DB |
| payment-service | stripe-db, subscriptions-db, wallet-db, transaction-db | Merge into single payment DB |
| notification-service | shared notification-db (already shared) | Keep as-is |
| shoutout-service | shoutout-db, shoutout-bpm-db | Merge into single shoutout DB |
| class-catalog-service | class-catalog-db, journey-db | Merge into single learning DB |
| platform-services | tags-db, tracking-db, group-profile-db, org-manager-db | Merge into single platform DB |

Database Count Reduction

| Metric | Current | Target |
|---|---|---|
| Databases per tenant | 35 | ~18 |
| Total production databases | 140 (35 × 4 tenants) | ~18 (shared cluster) |
| PgBouncer complexity | 41 routing entries | ~18 routing entries |
| Cloud SQL instances | 4 (one per tenant) | 1 regional HA instance |

Migration Approach


Frontend Unification Strategy

Phase 1: Shared Component Library

  1. Create unified @vzlabs/components library in a new Nx workspace
  2. Port high-usage components from both repos (start with auth, navigation, layout)
  3. Both repos import from shared library (dual-publish during transition)

Phase 2: CSS Standardization

  1. Choose Tailwind as target (more flexible, better DX, already in peeq-mono)
  2. Create Tailwind equivalents for Bootstrap/Angular Material components in frontends
  3. Component-by-component migration (not big-bang)

Phase 3: Repo Merge

  1. Move frontends apps (admin-fe, celeb-fe, org-dashboard-fe) into peeq-mono Nx workspace
  2. Remove duplicate services and gateways
  3. Single CI/CD pipeline for all frontend apps

Phase 4: Dead Code Removal

  1. Remove 5 dead API gateways (broadcast, conference, stream, dwolla, logging)
  2. Remove unused components and services
  3. Target: reduce frontend codebase by ~17%

Multi-Tenant Architecture (Shared Cluster)

Current: Cluster-Per-Tenant

Tenant A: GKE Cluster → 28+ pods → Cloud SQL (35 DBs) → RabbitMQ → Redis → Keycloak
Tenant B: GKE Cluster → 28+ pods → Cloud SQL (35 DBs) → RabbitMQ → Redis → Keycloak
Tenant C: GKE Cluster → 28+ pods → Cloud SQL (35 DBs) → RabbitMQ → Redis → Keycloak
Tenant D: GKE Cluster → 28+ pods → Cloud SQL (35 DBs) → RabbitMQ → Redis → Keycloak

Target: Shared Cluster with Namespace Isolation

Shared Regional GKE Cluster
├── namespace: tenant-a → 18 pods → schemas in shared Cloud SQL → vhost in shared RabbitMQ
├── namespace: tenant-b → 18 pods → schemas in shared Cloud SQL → vhost in shared RabbitMQ
├── namespace: tenant-c → 18 pods → schemas in shared Cloud SQL → vhost in shared RabbitMQ
├── namespace: tenant-d → 18 pods → schemas in shared Cloud SQL → vhost in shared RabbitMQ
├── namespace: platform → Keycloak, Istio, ArgoCD, monitoring
└── NetworkPolicies + Istio AuthorizationPolicy for isolation

Isolation Mechanisms

| Layer | Mechanism |
|---|---|
| Network | NetworkPolicies: default-deny, explicit allow per service pair |
| Service mesh | Istio AuthorizationPolicy: namespace-scoped access control |
| Database | Separate schemas or databases within shared Cloud SQL instance |
| Messaging | RabbitMQ vhosts per tenant |
| Cache | Redis key prefixing or separate Redis instances |
| Secrets | Namespace-scoped Kubernetes secrets (AVP per namespace) |
| Compute | ResourceQuotas per namespace |
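
Several of these mechanisms depend on deriving per-tenant resource names consistently. The sketch below shows one way to do that from a tenant ID and a service name. The naming scheme is an assumption for illustration: only the `tenant-a` style namespace appears in this document, while the schema, vhost, and Redis prefix formats are hypothetical.

```python
import re
from dataclasses import dataclass

# Lowercase letters, digits, and hyphens; must start with a letter.
_NAME = re.compile(r"^[a-z][a-z0-9-]*$")


@dataclass(frozen=True)
class TenantResources:
    namespace: str
    db_schema: str
    rabbitmq_vhost: str
    redis_prefix: str


def resources_for(tenant: str, service: str) -> TenantResources:
    """Derive per-tenant resource names (hypothetical naming scheme)."""
    for value in (tenant, service):
        if not _NAME.match(value):
            raise ValueError(f"invalid name: {value!r}")
    return TenantResources(
        namespace=f"tenant-{tenant}",
        # PostgreSQL identifiers are simpler to handle without hyphens.
        db_schema=f"{tenant}_{service}".replace("-", "_"),
        rabbitmq_vhost=f"tenant-{tenant}",
        redis_prefix=f"{tenant}:{service}:",
    )


res = resources_for("a", "payment-service")
print(res)
```

Validating names up front keeps a malformed tenant ID from producing a schema or key prefix that overlaps with another tenant's.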

Prerequisites (Before Consolidation)


BPM Replacement Strategy

Current: Camunda 7.17.0 CE (EOL October 2025)

Two active workflows:

  1. Purchase-Request BPM: Payment → entitlements → wallet debit/credit → notifications → refunds (~10 states)
  2. Shoutout BPM: Offer → purchase → celebrity recording → FFmpeg/Mux → admin review → delivery (~12 states)

Target: Operaton (Community-Owned Camunda 7 Fork)

Updated per ADR-013: The original recommendation was Spring State Machine. It was reversed based on two inputs: (1) the Meet & Greet state machine (SM) is completely retired, so there is no in-house production experience with state machines, and (2) there is strategic intent to expand BPM as a platform capability for future process-based features (expert onboarding, content approval, event lifecycle, dispute resolution, etc.).

Operaton is the same engine the platform already runs — near-zero migration. Preserves existing BPMN files, database schema, and operational knowledge. Community-owned governance prevents single-vendor EOL risk.

Migration approach:

  1. Update Maven dependencies: replace org.camunda.bpm with Operaton equivalents
  2. Validate BPMN files work without modification
  3. Test database compatibility (7.17 → 7.24 schema may need migration scripts)
  4. Upgrade BPM services to Java 21 / Spring Boot 3.x (ADR-003)
  5. Run in parallel with Camunda 7 during validation (dual-write)
  6. Drain Camunda 7 instances
  7. Switch traffic to Operaton
  8. Remove Camunda 7 dependencies and Keycloak identity sync plugin
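
For the dependency update in step 1, an inventory of the Camunda coordinates still present in each pom.xml is a useful starting point. The sketch below is a minimal regex scan, not a full Maven parser: it assumes `<artifactId>` directly follows `<groupId>`, and `SAMPLE_POM` is an illustrative fragment rather than a file from the platform. It deliberately does not guess the Operaton replacement coordinates, which should be taken from the Operaton migration documentation.

```python
import re

# Matches a Camunda <groupId> immediately followed by its <artifactId>.
DEPENDENCY = re.compile(
    r"<groupId>\s*(org\.camunda\.bpm[\w.]*)\s*</groupId>\s*"
    r"<artifactId>\s*([\w.-]+)\s*</artifactId>"
)


def camunda_dependencies(pom_xml: str):
    """Return (groupId, artifactId) pairs that still point at Camunda."""
    return DEPENDENCY.findall(pom_xml)


# Illustrative input only; not taken from the real services.
SAMPLE_POM = """
<dependencies>
  <dependency>
    <groupId>org.camunda.bpm.springboot</groupId>
    <artifactId>camunda-bpm-spring-boot-starter</artifactId>
  </dependency>
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
  </dependency>
</dependencies>
"""

found = camunda_dependencies(SAMPLE_POM)
for group_id, artifact_id in found:
    print(f"{group_id}:{artifact_id}")
```

An empty result for every BPM service would be one way to confirm that step 8 (removing Camunda 7 dependencies) is complete.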


Observability Stack

SLI/SLO Framework

| Service Category | SLI | SLO Target |
|---|---|---|
| API services | Request success rate (non-5xx) | 99.9% |
| API services | p99 latency | <500ms |
| Payment services | Transaction success rate | 99.95% |
| Real-time (SSE) | Connection success rate | 99.5% |
| Background (RabbitMQ) | Message processing success | 99.9% |
| Background (RabbitMQ) | Message processing latency | <30s p99 |
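
To give the percentage targets some scale, each can be turned into an error budget. The sketch below expresses that budget as minutes of total failure per window. Two assumptions are made here that the document does not state: a 30-day window, and treating a success-rate SLO as if it were time-based, which is a simplification.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed full failure in the window for a given SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# The three success-rate targets from the SLO table.
budgets = {slo: round(error_budget_minutes(slo), 1) for slo in (99.5, 99.9, 99.95)}
for slo, minutes in budgets.items():
    print(f"{slo}% -> {minutes} min per 30 days")
```

Under these assumptions the payment target of 99.95% leaves about 21.6 minutes per 30 days, half the 43.2 minutes available to general API services.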

Alerting Rules (Critical)

| Alert | Condition | Severity |
|---|---|---|
| Service down | 0 ready pods for >2min | P1 |
| High error rate | >1% 5xx rate for >5min | P1 |
| Payment failure | >0.1% payment errors for >2min | P0 |
| Database connection exhaustion | >80% connection pool for >5min | P2 |
| RabbitMQ queue backlog | >1000 unacked messages for >10min | P2 |
| Certificate expiry | <7 days until expiry | P2 |
| Pod crash loop | >3 restarts in 5min | P1 |
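
Most of these rules share one shape: a threshold that must be breached continuously for a hold period before the alert fires. The sketch below illustrates that logic on a list of samples. It is a plain-Python illustration of the semantics, not AlertManager configuration, and it reads "for >5min" as "for at least the hold period", which is an interpretation.

```python
from typing import Iterable, Tuple


def fires(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> bool:
    """Return True if value > threshold continuously for at least hold_seconds.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # condition cleared; restart the clock
    return False


# 5xx rate sampled once a minute: 2% throughout, versus one dip below 1%.
sustained = [(t, 0.02) for t in range(0, 301, 60)]
flapping = [(t, 0.005 if t == 180 else 0.02) for t in range(0, 301, 60)]
print(fires(sustained, 0.01, 300), fires(flapping, 0.01, 300))
```

The hold period is what keeps a single bad scrape from paging anyone: the flapping series never stays above 1% long enough to fire.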

Security Improvements

| Area | Current | Target |
|---|---|---|
| CI scanning | Trivy + Qwiet (non-blocking) | Block on high/critical + Binary Authorization |
| Network | No NetworkPolicies | Default-deny + explicit allow |
| CORS | Allow all origins | Restrict to tenant domains |
| WAF | None | Cloud Armor on GCP LB |
| Supply chain | No signing | Container image signing + SBOM |
| Secrets | AVP (working well) | Keep AVP + add secret rotation |
| Access | No namespace isolation | RBAC + Istio AuthorizationPolicy |
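
Restricting CORS to tenant domains comes down to an allowlist check on the request's Origin header. The sketch below shows one way to write that check. The domains in `ALLOWED` are placeholders, and accepting subdomains of a tenant domain is an assumption; the real policy may require exact matches only.

```python
from urllib.parse import urlsplit


def is_origin_allowed(origin: str, tenant_domains: set) -> bool:
    """Allow only https origins whose host is a tenant domain or a subdomain of one."""
    parts = urlsplit(origin)
    if parts.scheme != "https" or not parts.hostname:
        return False
    host = parts.hostname.lower()
    # The leading dot stops look-alike hosts such as evil-tenant-a.example.
    return any(host == d or host.endswith("." + d) for d in tenant_domains)


# Placeholder domains for illustration only.
ALLOWED = {"tenant-a.example", "tenant-b.example"}

print(is_origin_allowed("https://app.tenant-a.example", ALLOWED))
print(is_origin_allowed("https://evil-tenant-a.example", ALLOWED))
```

Matching on the parsed hostname rather than the raw string avoids the common mistake of accepting any origin that merely contains or ends with a tenant domain.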

Target Architecture Diagram

graph TB
    subgraph "Clients"
        FAN[Fan App<br/>Angular/Ionic]
        ADMIN[Admin/Expert<br/>Angular]
    end

    subgraph "Edge"
        DNS[Cloud DNS]
        LB[GCP Load Balancer<br/>+ Cloud Armor WAF]
        CERT[cert-manager<br/>Let's Encrypt]
    end

    subgraph "Service Mesh"
        ISTIO[Istio IngressGateway<br/>TLS + Path Routing + mTLS]
    end

    subgraph "Identity"
        KC[Keycloak 26.x<br/>Magic Link + SSO]
        ID[identity-service<br/>profiles + follows]
    end

    subgraph "Content"
        CONT[content-service<br/>articles + video + media]
        WEB[webinar-service<br/>Zoom integration]
    end

    subgraph "Commerce"
        PAY[payment-service<br/>Stripe + wallet + txn]
        PW[purchase-workflow<br/>Operaton BPM]
        INV[inventory-service<br/>product catalog hub]
        SUB_S[shoutout-service<br/>offers + fulfillment]
    end

    subgraph "Learning"
        CLS[class-catalog-service<br/>courses + journeys]
        EVT[event-service<br/>onsite check-in]
    end

    subgraph "Communication"
        NOT[notification-service<br/>email + SMS + push]
        CHAT[chat-service<br/>Stream Chat]
        MB[message-board-service]
        SSE_S[sse-service<br/>real-time events]
    end

    subgraph "Platform"
        PLT[platform-services<br/>tags + tracking + org]
        SRCH[search-service<br/>Elasticsearch]
    end

    subgraph "Infrastructure"
        PG[(Cloud SQL<br/>Regional HA)]
        RMQ[RabbitMQ<br/>Shared + vhosts]
        REDIS[Redis]
        GCS[Cloud Storage]
    end

    subgraph "External APIs"
        STRIPE_API[Stripe]
        MUX_API[Mux]
        ZOOM_API[Zoom]
        STREAM_API[Stream Chat]
        TWILIO_API[Twilio]
        MANDRILL_API[Mandrill]
    end

    subgraph "Observability"
        PROM[Prometheus + Grafana]
        OTEL[OpenTelemetry]
        SENTRY[Error Tracking]
        LOG[Cloud Logging]
    end

    FAN --> DNS --> LB --> ISTIO
    ADMIN --> DNS
    ISTIO --> KC
    ISTIO --> ID
    ISTIO --> CONT
    ISTIO --> WEB
    ISTIO --> PAY
    ISTIO --> INV
    ISTIO --> CLS
    ISTIO --> NOT
    ISTIO --> CHAT
    ISTIO --> SSE_S

    PAY --> STRIPE_API
    CONT --> MUX_API
    WEB --> ZOOM_API
    CHAT --> STREAM_API
    NOT --> TWILIO_API
    NOT --> MANDRILL_API

    ID --> PG
    CONT --> PG
    CONT --> GCS
    PAY --> PG
    INV --> PG
    CLS --> PG
    NOT --> PG
    SSE_S --> REDIS
    PW --> RMQ
    NOT --> RMQ

Hypotheses Final Status (Session 11)

| # | Hypothesis | Final Assurance | Session 11 Update |
|---|---|---|---|
| H1 | Broadcast not in production | L2 | Confirmed; no target architecture impact |
| H2 | Dwolla inactive | L2 | Confirmed; archive repos |
| H3 | Gen 1 fully replaced | L1 | Confirmed; retire remaining Gen 1 infra |
| H4 | Frontend unification feasible | L1 | Validated; CSS restyling, not logic rewrite |
| H5 | >50% repos archivable | L1 | ~110 of 191 (58%) archivable |
| H6 | No shared DB backdoors | L1 | Clean boundaries enable consolidation |
| H7 | >60% test coverage | L0 Falsified | Major gap; test investment required |
| H8 | Data volumes manageable | L0 Partial | Still need actual data; shared cluster design assumes moderate volume |
| H9 | No compliance blockers | L0 | PCI scope unconfirmed but likely SAQ-A |
| H10 | APIs backward-compatible | L0 | GraphQL additive evolution supports strangler fig |
| H11 | Multi-brand is config-only | L2 | Enables shared cluster consolidation |
| H12 | RabbitMQ contracts discoverable | L2 | Complete inventory enables safe consolidation |
| H13 | core-lib stable for Gen 3 | L1 | Foundation preserved; add idempotency + circuit breakers |
| H14 | Gen 3 rewrite justified | L1 Falsified | Incremental upgrade recommended |

Last updated: 2026-01-30 (Session 11)
Review by: 2026-04-30
Staleness risk: Medium; target architecture evolves with implementation decisions