Target Architecture — Gen 3 Proposal
Key Takeaways
- H14 evaluation: Incremental upgrade, not full rewrite — Evidence from 10 sessions supports upgrading Gen 2 in place rather than building Gen 3 from scratch. Gen 2 services are already Java 21/Spring Boot 3.5.4, use consistent patterns (core-lib, GraphQL, RabbitMQ), and have clean service boundaries. The debt is in infrastructure, testing, and BPM — not application architecture.
- Consolidate from ~35 application services to ~18 — Group by domain-driven boundaries. Merge over-decomposed services (wallet+transaction, email+sms+notifications, shoutout+shoutout-bpm). Keep services that serve as infrastructure (SSE, Keycloak).
- Unified frontend monorepo — Merge peeq-mono and frontends into a single Angular 18/Nx monorepo with shared component library. Standardize on Tailwind CSS. Component-by-component migration, not big-bang.
- Shared cluster with namespace isolation — Replace cluster-per-tenant with namespace-per-tenant in a single regional GKE cluster. NetworkPolicies + Istio AuthorizationPolicy for isolation. H11 L2 confirms no code-level branching.
- Replace Camunda 7.17.0 with Operaton — Adopt Operaton (community-owned Camunda 7 fork) as strategic BPM platform. Near-zero migration preserves existing BPMN files and DB schema. BPM is a platform investment for future process-based capabilities, not just a legacy dependency (see ADR-013).
Migration Decision Question
What does Gen 3 look like?
H14 Evaluation: Gen 3 Rewrite vs Gen 2 Upgrade
Hypothesis
H14: A Gen 3 rewrite is justified over upgrading Gen 2 in place.
Evidence Assessment
| Factor | Evidence | Favors |
|---|---|---|
| Tech stack currency | Java 21, Spring Boot 3.5.4, Angular 18.2 — all current | Upgrade |
| Pattern consistency | All services use core-lib, GraphQL, RabbitMQ, Keycloak OAuth2 (H13 L1) | Upgrade |
| Service boundaries | Database-per-service, no shared DB backdoors (H6 L1) | Upgrade |
| Multi-brand architecture | Config-only differentiation, no code branching (H11 L2) | Upgrade |
| BPM engine EOL | Camunda 7.17.0 CE is EOL and must be replaced regardless | Neutral |
| Test coverage | Near-zero (H7 falsified) — but rewrite wouldn’t have tests either | Neutral |
| Infrastructure debt | Zonal clusters, no alerting — infrastructure changes, not app rewrites | Upgrade |
| Frontend split | CSS framework mismatch — restyling, not logic rewrite (H4 L1) | Upgrade |
| Service count | Over-decomposed — consolidation needed, not rewrite | Upgrade |
| External integrations | 11 active APIs — all must be preserved regardless of approach | Neutral |
| Schema complexity | ~280 Flyway migrations — data migration cost same either way | Neutral |
Verdict
H14: FALSIFIED — The evidence does not support a full Gen 3 rewrite.
Recommendation: Incremental upgrade with targeted consolidation.
The platform’s application layer is architecturally sound: modern tech stack, consistent patterns, clean boundaries. The debt is concentrated in infrastructure (zonal HA, observability, security enforcement), BPM engine replacement, test coverage, and frontend unification — none of which require rewriting business logic.
A full rewrite would:
- Duplicate the schema knowledge embedded in ~280 Flyway migrations
- Re-implement ~75 RabbitMQ message contracts
- Re-integrate 11 external APIs
- Re-create 24+ GraphQL gateway endpoints
- Start from zero test coverage, turning every re-implementation into unverifiable regression risk
An incremental upgrade preserves all existing integration work and focuses effort on actual problems.
WLNK Analysis
Overall confidence: L1 (Validated)
Weakest link: H8 (data volumes, still L0) — if production data volumes are unexpectedly large, some services might need more fundamental data model changes. However, this affects migration strategy, not the rewrite-vs-upgrade decision.
Bounded Validity
- Scope: Applies to all application services and infrastructure
- Expiry: Re-evaluate if Gen 2 tech stack falls behind (e.g., Spring Boot 3.x EOL, Java 21 EOL)
- Review trigger: If >50% of services require schema redesign during upgrade, reconsider targeted rewrite for those domains
Target Service Architecture
Service Consolidation Map
graph TD
subgraph "Gen 2 Current (~35 services, 28 shown)"
direction TB
C1[celebrity]
C2[fan]
C3[users]
C4[content]
C5[media]
C6[webinar]
C7[stripe]
C8[subscriptions]
C9[purchase-request-bpm]
C10[wallet]
C11[transaction]
C12[shoutout]
C13[shoutout-bpm]
C14[inventory]
C15[class-catalog]
C16[onsite-event]
C17[email]
C18[sms]
C19[notifications]
C20[chat]
C21[message-board]
C22[sse]
C23[search]
C24[tags]
C25[tracking]
C26[group-profile]
C27[org-manager]
C28[journey]
end
subgraph "Target (~18 services)"
direction TB
T1[identity-service<br/>celebrity + fan + users]
T2[content-service<br/>content + media]
T3[webinar-service<br/>webinar]
T4[payment-service<br/>stripe + subscriptions + wallet + transaction]
T5[purchase-workflow<br/>purchase-request BPM on Operaton]
T6[shoutout-service<br/>shoutout + shoutout-bpm merged]
T7[inventory-service<br/>inventory]
T8[class-catalog-service<br/>class-catalog + journey]
T9[event-service<br/>onsite-event]
T10[notification-service<br/>email + sms + notifications]
T11[chat-service<br/>chat]
T12[message-board-service<br/>message-board]
T13[sse-service<br/>sse]
T14[search-service<br/>search]
T15[platform-services<br/>tags + tracking + group-profile + org-manager]
T16[keycloak<br/>identity provider]
T17[admin-frontend<br/>unified Angular monorepo]
T18[fan-frontend<br/>unified Angular monorepo]
end
C1 --> T1
C2 --> T1
C3 --> T1
C4 --> T2
C5 --> T2
C6 --> T3
C7 --> T4
C8 --> T4
C10 --> T4
C11 --> T4
C9 --> T5
C12 --> T6
C13 --> T6
C14 --> T7
C15 --> T8
C28 --> T8
C16 --> T9
C17 --> T10
C18 --> T10
C19 --> T10
C20 --> T11
C21 --> T12
C22 --> T13
C23 --> T14
C24 --> T15
C25 --> T15
C26 --> T15
C27 --> T15
Consolidation Rationale
| Target Service | Source Services | Rationale |
|---|---|---|
| identity-service | celebrity, fan, users | Same domain (user profiles), shared Keycloak dependency, small API surfaces |
| content-service | content, media | Shared Mux integration, overlapping video handling, same storage patterns |
| payment-service | stripe, subscriptions, wallet, transaction | Same financial domain; wallet (3 tables) and transaction (1 table) are too small for standalone |
| purchase-workflow | purchase-request-bpm | Migrate from Camunda 7 to Operaton; keep separate due to different runtime characteristics |
| shoutout-service | shoutout, shoutout-bpm | Merge shoutout + shoutout-bpm into single service on Operaton; single deployment unit |
| class-catalog-service | class-catalog, journey | Journey is learning path management — same domain as class catalog |
| notification-service | email, sms, notifications | Already share a database; natural delivery pipeline (notifications → email/sms) |
| platform-services | tags, tracking, group-profile, org-manager | Small supporting services with minimal traffic; consolidate to reduce operational overhead |
Services Kept Separate
| Service | Reason |
|---|---|
| webinar-service | Zoom integration with distinct lifecycle (registrations, recordings, calendar) |
| inventory-service | Cross-cutting hub (5 domain dependencies); too risky to merge |
| chat-service | Stream Chat SaaS wrapper; distinct real-time protocol |
| message-board-service | Redis SSE fanout; distinct from notification pipeline |
| sse-service | Platform infrastructure (8 handlers, 7+ publishers); cross-cutting |
| search-service | Elasticsearch integration; distinct query patterns |
| event-service | Simple but distinct domain (in-person events) |
| keycloak | Infrastructure service; managed separately |
Target Tech Stack
Application Layer
| Component | Current (Gen 2) | Target | Change |
|---|---|---|---|
| Language | Java 21 (content: Java 24) | Java 21 LTS | Standardize on LTS; content downgrades from 24 to 21 |
| Framework | Spring Boot 3.5.4 | Spring Boot 3.x (latest stable) | Keep current; adopt patch releases as they ship |
| Build | Maven + Jib | Maven + Jib | No change |
| API | Spring GraphQL | Spring GraphQL | No change |
| Messaging | RabbitMQ + core-lib MessageSender/Handler | RabbitMQ + core-lib | No change; add idempotency to MessageHandler |
| Auth | Keycloak 26.x + Spring Security OAuth2 | Keycloak 26.x + Spring Security OAuth2 | No change |
| Database | PostgreSQL 16 + Flyway | PostgreSQL 16 + Flyway | No change |
| BPM | Camunda 7.17.0 CE (EOL) | Operaton (community-owned fork) | Replace — Camunda 7 EOL; BPM as strategic platform capability (ADR-013) |
| Shared libs | core-lib 0.0.67-69, messages 0.0.48-73 | core-lib + messages (aligned versions) | Align versions, add idempotency, circuit breakers |
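Idempotency is worth sketching because RabbitMQ delivers at-least-once, so consolidated consumers will see duplicate deliveries during failover or requeue. Below is a minimal sketch of a consumer-side dedup wrapper; the MessageHandler and ProcessedMessageRepository interfaces are hypothetical stand-ins for whatever core-lib actually exposes. Circuit breakers would follow the same wrapping pattern (e.g., via Resilience4j) around outbound calls.

```java
// Sketch of consumer-side idempotency around core-lib's message handling.
// MessageHandler and ProcessedMessageRepository are illustrative names —
// core-lib's actual interfaces may differ.
import java.time.Instant;

public final class IdempotentHandler<T> {

    public interface MessageHandler<T> {
        void handle(T payload);
    }

    public interface ProcessedMessageRepository {
        // Returns true if the id was newly recorded, false if already seen.
        boolean recordIfAbsent(String messageId, Instant processedAt);
    }

    private final MessageHandler<T> delegate;
    private final ProcessedMessageRepository processed;

    public IdempotentHandler(MessageHandler<T> delegate, ProcessedMessageRepository processed) {
        this.delegate = delegate;
        this.processed = processed;
    }

    // Drop redeliveries: RabbitMQ guarantees at-least-once, so the same
    // message id can arrive twice after a consumer crash or requeue.
    public void handle(String messageId, T payload) {
        if (!processed.recordIfAbsent(messageId, Instant.now())) {
            return; // duplicate delivery — already handled
        }
        delegate.handle(payload);
    }
}
```

For this to be safe, recordIfAbsent must be atomic — e.g., an INSERT against a unique-constrained table or a Redis SETNX — so two concurrent deliveries cannot both pass the check.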
Frontend Layer
| Component | Current | Target | Change |
|---|---|---|---|
| Framework | Angular 18.2 (both repos) | Angular 18.x / Nx | No change |
| Monorepo | 2 repos (peeq-mono, frontends) | 1 unified monorepo | Consolidate |
| CSS | Tailwind/DaisyUI (peeq-mono) + Bootstrap/Angular Material (frontends) | Tailwind/DaisyUI | Standardize |
| Component lib | @vzlabs/ui + @vzlabs/peeq-ui (duplicate) | Single shared library | Consolidate |
| Mobile | Ionic 6 (in peeq-mono) | Ionic 6 or Capacitor | Evaluate; keep if working |
| State | Angular services + Apollo cache | Same | No change |
| API client | Apollo Client (inline gql) | Apollo Client | No change; add schema validation |
Infrastructure Layer
| Component | Current | Target | Change |
|---|---|---|---|
| Compute | GKE 1.30 (zonal, cluster-per-tenant) | GKE (regional, shared cluster) | Consolidate + HA |
| Isolation | Cluster isolation | Namespace + NetworkPolicy + Istio AuthPolicy | Change model |
| Database | Cloud SQL (zonal, 35 DBs per tenant) | Cloud SQL (regional HA) | HA upgrade |
| Routing | Istio IngressGateway (path-based) | Istio IngressGateway | No change |
| DNS | Cloud DNS + external-dns + cert-manager | Same | No change |
| GitOps | ArgoCD + Helm common chart | Same | No change |
| CI/CD | GitHub Actions (28 workflows) | Same + enforce security gates | Add gates |
| IaC | Terraform 1.9.5 + Atlantis | Same + regional modules | Upgrade modules |
| Secrets | GCP Secret Manager → AVP → K8s Secrets | Same | No change |
| Cost | CastAI (spot instances) | CastAI + PDB review | Optimize overrides |
Observability Layer
| Component | Current | Target | Change |
|---|---|---|---|
| Metrics | Prometheus + Grafana (no alerts) | Prometheus + Grafana + AlertManager | Add alerting |
| Tracing | Istio Stackdriver (not adopted) | OpenTelemetry auto-instrumentation | Add tracing |
| Logging | peeq-logging → Elasticsearch 7.x | GCP Cloud Logging + Elastic Cloud 8.x | Replace pipeline |
| APM | Elastic APM (disabled) | OpenTelemetry + Grafana Tempo | Replace with OTel |
| SLOs | None | Define per-service SLIs/SLOs | New |
| Error tracking | None | Sentry or equivalent | New |
| Session replay | LogRocket | Keep LogRocket | No change |
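Auto-instrumentation via the OpenTelemetry Java agent requires no code changes, but custom business spans still need the API. A minimal sketch, assuming the standard io.opentelemetry API; the service and span names are illustrative:

```java
// Manual span creation with the OpenTelemetry Java API, for the few
// code paths the auto-instrumentation agent won't cover.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class PaymentTracing {

    private static final Tracer TRACER =
            GlobalOpenTelemetry.getTracer("payment-service");

    public void captureCharge(Runnable charge) {
        Span span = TRACER.spanBuilder("payment.capture-charge").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            charge.run(); // downstream HTTP/JDBC calls join this trace
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
```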
Database Strategy
Consolidation with Service Merges
When services merge, their databases merge:
| Target Service | Source Databases | Strategy |
|---|---|---|
| identity-service | celebrity-db, fan-db, users-db | Merge into single identity DB |
| content-service | content-db, media-db | Merge into single content DB |
| payment-service | stripe-db, subscriptions-db, wallet-db, transaction-db | Merge into single payment DB |
| notification-service | shared notification-db (already shared) | Keep as-is |
| shoutout-service | shoutout-db, shoutout-bpm-db | Merge into single shoutout DB |
| class-catalog-service | class-catalog-db, journey-db | Merge into single learning DB |
| platform-services | tags-db, tracking-db, group-profile-db, org-manager-db | Merge into single platform DB |
Database Count Reduction
| Metric | Current | Target |
|---|---|---|
| Databases per tenant | 35 | ~18 |
| Total production databases | 140 (35 × 4 tenants) | ~18 (shared cluster) |
| PgBouncer complexity | 41 routing entries | ~18 routing entries |
| Cloud SQL instances | 4 (one per tenant) | 1 regional HA instance |
Migration Approach
- No schema rewrite — merged databases use schema prefixes or separate schemas within the same instance
- Flyway continues — each service module maintains its own migration folder
- Foreign keys within merged service only — no cross-service FK even within the same database instance
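A sketch of what "Flyway continues" could look like for a merged database, assuming the standard Flyway Java API; module names and the dataSource wiring are illustrative:

```java
// Run each merged module's Flyway history against its own schema inside
// the consolidated payment database.
import javax.sql.DataSource;
import org.flywaydb.core.Flyway;

public final class PaymentDbMigrator {

    private static final String[] MODULES = {"stripe", "subscriptions", "wallet", "transaction"};

    public static void migrate(DataSource paymentDb) {
        for (String module : MODULES) {
            Flyway.configure()
                    .dataSource(paymentDb)
                    .schemas(module)                               // one schema per source service
                    .defaultSchema(module)
                    .locations("classpath:db/migration/" + module) // module keeps its own folder
                    .load()
                    .migrate();
        }
        // No cross-schema foreign keys — service boundaries stay logical
        // even though the schemas share one database instance.
    }
}
```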
Frontend Unification Strategy
Phase 1: Shared Component Library
- Create a unified @vzlabs/components library in a new Nx workspace
- Port high-usage components from both repos (start with auth, navigation, layout)
- Both repos import from shared library (dual-publish during transition)
Phase 2: CSS Standardization
- Choose Tailwind as target (more flexible, better DX, already in peeq-mono)
- Create Tailwind equivalents for Bootstrap/Angular Material components in frontends
- Component-by-component migration (not big-bang)
Phase 3: Repo Merge
- Move frontends apps (admin-fe, celeb-fe, org-dashboard-fe) into peeq-mono Nx workspace
- Remove duplicate services and gateways
- Single CI/CD pipeline for all frontend apps
Phase 4: Dead Code Removal
- Remove 5 dead API gateways (broadcast, conference, stream, dwolla, logging)
- Remove unused components and services
- Target: reduce frontend codebase by ~17%
Multi-Tenant Architecture (Shared Cluster)
Current: Cluster-Per-Tenant
Tenant A: GKE Cluster → 28+ pods → Cloud SQL (35 DBs) → RabbitMQ → Redis → Keycloak
Tenant B: GKE Cluster → 28+ pods → Cloud SQL (35 DBs) → RabbitMQ → Redis → Keycloak
Tenant C: GKE Cluster → 28+ pods → Cloud SQL (35 DBs) → RabbitMQ → Redis → Keycloak
Tenant D: GKE Cluster → 28+ pods → Cloud SQL (35 DBs) → RabbitMQ → Redis → Keycloak
Target: Shared Cluster with Namespace Isolation
Shared Regional GKE Cluster
├── namespace: tenant-a → 18 pods → schemas in shared Cloud SQL → vhost in shared RabbitMQ
├── namespace: tenant-b → 18 pods → schemas in shared Cloud SQL → vhost in shared RabbitMQ
├── namespace: tenant-c → 18 pods → schemas in shared Cloud SQL → vhost in shared RabbitMQ
├── namespace: tenant-d → 18 pods → schemas in shared Cloud SQL → vhost in shared RabbitMQ
├── namespace: platform → Keycloak, Istio, ArgoCD, monitoring
└── NetworkPolicies + Istio AuthorizationPolicy for isolation
Isolation Mechanisms
| Layer | Mechanism |
|---|---|
| Network | NetworkPolicies: default-deny, explicit allow per service pair |
| Service mesh | Istio AuthorizationPolicy: namespace-scoped access control |
| Database | Separate schemas or databases within shared Cloud SQL instance |
| Messaging | RabbitMQ vhosts per tenant |
| Cache | Redis key prefixing or separate Redis instances |
| Secrets | Namespace-scoped Kubernetes secrets (AVP per namespace) |
| Compute | ResourceQuotas per namespace |
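For the cache layer, key prefixing is the lighter-weight of the two options. A minimal sketch with Spring Data Redis; the class and the tenant-id wiring are hypothetical:

```java
// Tenant-prefixed Redis keys: every key is namespaced by tenant so a
// shared Redis cannot leak data across tenants.
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

public final class TenantScopedCache {

    private final StringRedisTemplate redis;
    private final String tenantId; // e.g., injected from a namespace-scoped env var

    public TenantScopedCache(StringRedisTemplate redis, String tenantId) {
        this.redis = redis;
        this.tenantId = tenantId;
    }

    private String key(String suffix) {
        return tenantId + ":" + suffix; // e.g., "tenant-a:session:42"
    }

    public void put(String suffix, String value, Duration ttl) {
        redis.opsForValue().set(key(suffix), value, ttl);
    }

    public String get(String suffix) {
        return redis.opsForValue().get(key(suffix));
    }
}
```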
Prerequisites (Before Consolidation)
BPM Replacement Strategy
Current: Camunda 7.17.0 CE (EOL October 2025)
Two active workflows:
1. Purchase-Request BPM: Payment → entitlements → wallet debit/credit → notifications → refunds (~10 states)
2. Shoutout BPM: Offer → purchase → celebrity recording → FFmpeg/Mux → admin review → delivery (~12 states)
Target: Operaton (Community-Owned Camunda 7 Fork)
Updated per ADR-013: The original recommendation was Spring State Machine. This was reversed based on two inputs: (1) the Meet & Greet state machine is completely retired, leaving no in-house Spring State Machine production experience, and (2) strategic intent to expand BPM as a platform capability for future process-based features (expert onboarding, content approval, event lifecycle, dispute resolution, etc.).
Operaton is the same engine the platform already runs — near-zero migration. Preserves existing BPMN files, database schema, and operational knowledge. Community-owned governance prevents single-vendor EOL risk.
Migration approach:
1. Update Maven dependencies — replace org.camunda.bpm artifacts with their Operaton equivalents
2. Validate that existing BPMN files work without modification
3. Test database compatibility (the 7.17 → 7.24 schema gap may need migration scripts)
4. Upgrade BPM services to Java 21 / Spring Boot 3.x (ADR-003)
5. Run in parallel with Camunda 7 during validation (dual-write)
6. Drain Camunda 7 instances
7. Switch traffic to Operaton
8. Remove Camunda 7 dependencies and the Keycloak identity sync plugin
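Step 2 can be de-risked with an in-memory smoke test before any environment work. A sketch, assuming Operaton's package rename (org.camunda.bpm → org.operaton.bpm) leaves the Camunda 7 API surface intact — verify against the Operaton release notes:

```java
// Deploy an existing BPMN file against an in-memory Operaton engine and
// start an instance, before touching any shared environment.
import org.operaton.bpm.engine.ProcessEngine;
import org.operaton.bpm.engine.ProcessEngineConfiguration;

public final class BpmnSmokeTest {

    public static void main(String[] args) {
        ProcessEngine engine = ProcessEngineConfiguration
                .createStandaloneInMemProcessEngineConfiguration()
                .buildProcessEngine();

        // Existing Camunda 7 BPMN files should deploy without modification.
        engine.getRepositoryService()
                .createDeployment()
                .addClasspathResource("purchase-request.bpmn") // illustrative file name
                .deploy();

        // Starting an instance exercises parsing and engine wiring end to end.
        engine.getRuntimeService().startProcessInstanceByKey("purchase-request");

        engine.close();
    }
}
```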
Observability Stack
SLI/SLO Framework
| Service Category | SLI | SLO Target |
|---|---|---|
| API services | Request success rate (non-5xx) | 99.9% |
| API services | p99 latency | <500ms |
| Payment services | Transaction success rate | 99.95% |
| Real-time (SSE) | Connection success rate | 99.5% |
| Background (RabbitMQ) | Message processing success | 99.9% |
| Background (RabbitMQ) | Message processing latency | <30s p99 |
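SLIs like the payment success rate can be emitted from the services as Micrometer counters, which Prometheus then rolls into a ratio. A minimal sketch; the metric and tag names are illustrative, not existing conventions:

```java
// Payment SLI (transaction success rate) as a pair of Micrometer counters.
// Prometheus computes the ratio, e.g.:
//   sum(rate(payment_transactions_total{outcome="success"}[5m]))
//     / sum(rate(payment_transactions_total[5m]))
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public final class PaymentSli {

    private final Counter success;
    private final Counter failure;

    public PaymentSli(MeterRegistry registry) {
        this.success = Counter.builder("payment.transactions")
                .tag("outcome", "success")
                .register(registry);
        this.failure = Counter.builder("payment.transactions")
                .tag("outcome", "failure")
                .register(registry);
    }

    public void record(boolean succeeded) {
        (succeeded ? success : failure).increment();
    }
}
```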
Alerting Rules (Critical)
| Alert | Condition | Severity |
|---|---|---|
| Service down | 0 ready pods for >2min | P1 |
| High error rate | >1% 5xx rate for >5min | P1 |
| Payment failure | >0.1% payment errors for >2min | P0 |
| Database connection exhaustion | >80% connection pool for >5min | P2 |
| RabbitMQ queue backlog | >1000 unacked messages for >10min | P2 |
| Certificate expiry | <7 days until expiry | P2 |
| Pod crash loop | >3 restarts in 5min | P1 |
Security Improvements
| Area | Current | Target |
|---|---|---|
| CI scanning | Trivy + Qwiet (non-blocking) | Block on high/critical + Binary Authorization |
| Network | No NetworkPolicies | Default-deny + explicit allow |
| CORS | Allow all origins | Restrict to tenant domains |
| WAF | None | Cloud Armor on GCP LB |
| Supply chain | No signing | Container image signing + SBOM |
| Secrets | AVP (working well) | Keep AVP + add secret rotation |
| Access | No namespace isolation | RBAC + Istio AuthorizationPolicy |
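The CORS change is a small, self-contained code fix. A sketch using Spring's CorsFilter, with an illustrative origin list that would in practice come from tenant configuration:

```java
// Replace the allow-all CORS default with an explicit per-tenant allow-list.
import java.util.List;
import org.springframework.web.cors.CorsConfiguration;
import org.springframework.web.cors.UrlBasedCorsConfigurationSource;
import org.springframework.web.filter.CorsFilter;

public final class CorsConfig {

    public static CorsFilter tenantCorsFilter(List<String> tenantOrigins) {
        CorsConfiguration config = new CorsConfiguration();
        config.setAllowedOrigins(tenantOrigins);        // e.g., ["https://app.tenant-a.example"]
        config.setAllowedMethods(List.of("GET", "POST", "OPTIONS"));
        config.setAllowedHeaders(List.of("Authorization", "Content-Type"));
        config.setAllowCredentials(true);               // required for cookie/OAuth2 flows

        UrlBasedCorsConfigurationSource source = new UrlBasedCorsConfigurationSource();
        source.registerCorsConfiguration("/**", config);
        return new CorsFilter(source);
    }
}
```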
Target Architecture Diagram
graph TB
subgraph "Clients"
FAN[Fan App<br/>Angular/Ionic]
ADMIN[Admin/Expert<br/>Angular]
end
subgraph "Edge"
DNS[Cloud DNS]
LB[GCP Load Balancer<br/>+ Cloud Armor WAF]
CERT[cert-manager<br/>Let's Encrypt]
end
subgraph "Service Mesh"
ISTIO[Istio IngressGateway<br/>TLS + Path Routing + mTLS]
end
subgraph "Identity"
KC[Keycloak 26.x<br/>Magic Link + SSO]
ID[identity-service<br/>profiles + follows]
end
subgraph "Content"
CONT[content-service<br/>articles + video + media]
WEB[webinar-service<br/>Zoom integration]
end
subgraph "Commerce"
PAY[payment-service<br/>Stripe + wallet + txn]
PW[purchase-workflow<br/>Operaton BPM]
INV[inventory-service<br/>product catalog hub]
SUB_S[shoutout-service<br/>offers + fulfillment]
end
subgraph "Learning"
CLS[class-catalog-service<br/>courses + journeys]
EVT[event-service<br/>onsite check-in]
end
subgraph "Communication"
NOT[notification-service<br/>email + SMS + push]
CHAT[chat-service<br/>Stream Chat]
MB[message-board-service]
SSE_S[sse-service<br/>real-time events]
end
subgraph "Platform"
PLT[platform-services<br/>tags + tracking + org]
SRCH[search-service<br/>Elasticsearch]
end
subgraph "Infrastructure"
PG[(Cloud SQL<br/>Regional HA)]
RMQ[RabbitMQ<br/>Shared + vhosts]
REDIS[Redis]
GCS[Cloud Storage]
end
subgraph "External APIs"
STRIPE_API[Stripe]
MUX_API[Mux]
ZOOM_API[Zoom]
STREAM_API[Stream Chat]
TWILIO_API[Twilio]
MANDRILL_API[Mandrill]
end
subgraph "Observability"
PROM[Prometheus + Grafana]
OTEL[OpenTelemetry]
SENTRY[Error Tracking]
LOG[Cloud Logging]
end
FAN --> DNS --> LB --> ISTIO
ADMIN --> DNS
ISTIO --> KC
ISTIO --> ID
ISTIO --> CONT
ISTIO --> WEB
ISTIO --> PAY
ISTIO --> INV
ISTIO --> CLS
ISTIO --> NOT
ISTIO --> CHAT
ISTIO --> SSE_S
PAY --> STRIPE_API
CONT --> MUX_API
WEB --> ZOOM_API
CHAT --> STREAM_API
NOT --> TWILIO_API
NOT --> MANDRILL_API
ID --> PG
CONT --> PG
CONT --> GCS
PAY --> PG
INV --> PG
CLS --> PG
NOT --> PG
SSE_S --> REDIS
PW --> RMQ
NOT --> RMQ
Hypotheses Final Status (Session 11)
| # | Hypothesis | Final Assurance | Session 11 Update |
|---|---|---|---|
| H1 | Broadcast not in production | L2 | Confirmed — no target architecture impact |
| H2 | Dwolla inactive | L2 | Confirmed — archive repos |
| H3 | Gen 1 fully replaced | L1 | Confirmed — retire remaining Gen 1 infra |
| H4 | Frontend unification feasible | L1 | Validated — CSS restyling, not logic rewrite |
| H5 | >50% repos archivable | L1 | ~110 of 191 (58%) archivable |
| H6 | No shared DB backdoors | L1 | Clean boundaries enable consolidation |
| H7 | >60% test coverage | L0 Falsified | Major gap — test investment required |
| H8 | Data volumes manageable | L0 Partial | Still need actual data — shared cluster design assumes moderate volume |
| H9 | No compliance blockers | L0 | PCI scope unconfirmed but likely SAQ-A |
| H10 | APIs backward-compatible | L0 | GraphQL additive evolution supports strangler fig |
| H11 | Multi-brand is config-only | L2 | Enables shared cluster consolidation |
| H12 | RabbitMQ contracts discoverable | L2 | Complete inventory enables safe consolidation |
| H13 | core-lib stable for Gen 3 | L1 | Foundation preserved; add idempotency + circuit breakers |
| H14 | Gen 3 rewrite justified | L1 Falsified | Incremental upgrade recommended |
Last updated: 2026-01-30 (Session 11)
Review by: 2026-04-30
Staleness risk: Medium — target architecture evolves with implementation decisions