ADR-007: Architecture Simplification & Cloud Resource Optimization
Status
Proposed — Pending engineering team review
Context
The platform accumulated significant infrastructure complexity across 191 repos, 35 services, 4 dedicated clusters, 140 databases, 48 Helm charts, 121 Terraform files, and 28 CI/CD workflows. ADRs 001 and 004 address service and cluster consolidation. ADR-005 addresses database consolidation. This ADR takes a holistic view of architecture simplification and cloud resource optimization to improve supportability, maintainability, and cost efficiency.
Current Complexity Metrics
| Dimension | Current | Problem |
|---|---|---|
| Active repos | ~50 (of 191 total) | Hard to navigate and maintain; slows onboarding |
| Services per tenant | ~28 pods | Over-provisioned for workload complexity |
| Helm charts | 48 | Each requires version management |
| Terraform files | 121 across environments | Duplicated patterns, drift risk |
| CI/CD workflows | 28 reusable | Complex matrix builds |
| PgBouncer entries | 41 per tenant | Connection management overhead |
| Monitoring gaps | No alerting, no SLOs, APM disabled | Blind to production issues |
| Dead code | ~17% frontend, 12+ dead services | Confusion, wasted CI time |
| Libraries | 3 date libraries, 2 CSS frameworks, 2 component libs | Unnecessary diversity |
Decision
Implement a five-pillar simplification strategy that reduces operational surface area, eliminates redundancy, and right-sizes cloud resources.
Pillar 1: Repository Consolidation
| Action | Current | Target | Reduction |
|---|---|---|---|
| Archive dead/Gen 1 repos | 191 repos | ~50 active | 74% fewer repos |
| Merge service + DB repos | Service repo + DB repo separate | DB migrations inside service repo | 25 fewer repos |
| Mono-repo for shared libs | core-lib, messages, common-helm separate | Single platform-libs repo | 2 fewer repos (3 → 1) |
Result: From 191 repos → ~40 actively maintained repos.
Pillar 2: Infrastructure Right-Sizing
| Resource | Current | Optimized | Savings |
|---|---|---|---|
| GKE clusters | 4 prod + 2 dev (6 total) | 1 prod regional + 1 dev (2 total) | ~67% cluster overhead |
| Cloud SQL instances | 4 prod (1 per tenant) | 1 regional HA instance | ~75% DB instance cost |
| Databases per tenant | 35 | 6 (ADR-005) | 83% fewer databases |
| RabbitMQ clusters | 4 × 3-node (12 nodes) | 1 × 3-node shared with vhosts | ~75% messaging infra |
| Redis instances | 4 (one per tenant) | 1 shared with key prefixing (see the sketch after this table) | ~75% cache infra |
| NFS PVCs | 16 (4 × 50Gi × 4 tenants) | Migrate to GCS | 100% NFS elimination |
| PgBouncer replicas | 12 (3 per tenant) | 3 (shared cluster) | 75% fewer replicas |
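To make the shared-Redis row concrete: a minimal sketch of per-tenant key prefixing, assuming ioredis as the client library. The host name and tenant IDs are placeholders, not values from this platform.

```typescript
import Redis from "ioredis";

// One shared Redis instance; tenant isolation comes from a per-tenant
// key prefix. ioredis applies `keyPrefix` to every key argument.
function redisForTenant(tenantId: string): Redis {
  return new Redis({
    host: process.env.REDIS_HOST ?? "redis.shared.internal", // placeholder host
    keyPrefix: `${tenantId}:`,
  });
}

async function demo(): Promise<void> {
  const brandA = redisForTenant("brand-a");
  const brandB = redisForTenant("brand-b");

  // Same logical key, physically distinct entries
  // ("brand-a:session:123" vs "brand-b:session:123").
  await brandA.set("session:123", "alice");
  await brandB.set("session:123", "bob");

  // Caveat: keyPrefix is not applied to SCAN/KEYS match patterns,
  // so any key-iteration code must add the prefix explicitly.
  console.log(await brandA.get("session:123")); // "alice"
}

demo().catch(console.error);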
Pillar 3: Observability Consolidation
Replace the fragmented observability stack with a unified approach:
| Layer | Current | Target | Change |
|---|---|---|---|
| Metrics | Prometheus + Grafana (no alerts) | Prometheus + Grafana + AlertManager | Add alerting |
| Tracing | Istio Stackdriver (unused) | OpenTelemetry auto-instrumentation | Replace |
| Logging | peeq-logging (Gen 1 Node.js) → ES 7.x | GCP Cloud Logging (native) | Replace |
| APM | Elastic APM (disabled) | OpenTelemetry → Grafana Tempo (see the sketch below) | Replace |
| Search | Self-hosted ES 7.x (EOL) | Elastic Cloud 8.x or built-in Cloud Logging search | Upgrade or replace |
| Session replay | LogRocket | Keep LogRocket | No change |
| SLOs | None | Define per-service SLIs/SLOs in Grafana | New |
| Error tracking | None | Sentry or GCP Error Reporting | New |
Consolidation benefit: Eliminate peeq-logging service, eliminate self-managed Elasticsearch/Kibana, reduce monitoring services from 5+ to 2 (Prometheus stack + Cloud Logging).
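The tracing/APM replacement amounts to a small bootstrap per Node service. A hedged sketch follows; the service name and Tempo endpoint are placeholders. For the Java services, the equivalent would be attaching the OpenTelemetry Java agent via `-javaagent`, again with no code changes.

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Auto-instruments HTTP, Express, pg, ioredis, amqplib, etc. with no
// application code changes; spans are exported over OTLP to Tempo,
// which accepts OTLP natively on port 4318.
const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME ?? "example-service",
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://tempo:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush remaining spans on pod shutdown.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```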
Pillar 4: CI/CD Simplification
| Area | Current | Target |
|---|---|---|
| Build workflows | Per-service GitHub Actions | Consolidated with service-type detection (see the sketch after this table) |
| Preview environments | Full stack per PR | Selective — only changed service + dependencies |
| Security scanning | Non-enforcing (Trivy, Qwiet) | Enforcing with build failure on high/critical |
| Helm chart management | 48 charts, each versioned | Fewer charts after service consolidation (~18) |
| Image variants | Per-service Dockerfile | Standardized base images (Java 21, Node 20) |
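As a sketch of what "service-type detection" could look like in the consolidated workflow (the marker files and the mapping to base images are assumptions, not an existing script):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

type ServiceType = "java" | "node" | "unknown";

// Infer the build pipeline from marker files at the repo root instead of
// maintaining a hand-written workflow per service.
function detectServiceType(repoRoot: string): ServiceType {
  if (existsSync(join(repoRoot, "pom.xml")) || existsSync(join(repoRoot, "build.gradle"))) {
    return "java"; // build on the standardized Java 21 base image
  }
  if (existsSync(join(repoRoot, "package.json"))) {
    return "node"; // build on the standardized Node 20 base image
  }
  return "unknown"; // fail the build loudly rather than guessing
}

// A consolidated workflow would call this once and fan out to the
// matching reusable build job.
console.log(detectServiceType(process.cwd()));
```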
Pillar 5: Dead Code & Redundancy Elimination
| Category | Items | Action |
|---|---|---|
| Dead API gateways (frontend) | 5 gateways (broadcast, conference, stream, dwolla, logging) | Remove |
| Dead backend services | 12+ confirmed dead | Archive repos |
| Deprecated GraphQL queries | 3 in celebrity, email verification APIs | Remove after frontend audit |
| Duplicate libraries | 3 date libs (luxon, moment, date-fns), 2 CSS frameworks | Standardize on 1 each (see the sketch after this table) |
| Deprecated Arlo LMS | class-catalog | Remove |
| Dual Keycloak instances | identityx-25 + identityx-26 on agilenetwork | Complete migration, retire identityx-25 |
| Gen 1 DB repos | ~25 peeq-*-db repos | Archive with README pointing to Gen 2 |
| CastAI on-demand overrides | 25+ services forced on-demand | Audit — most stateless services can use spot |
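Standardizing the date libraries is mostly mechanical call-site rewriting. A sketch, assuming Luxon is the survivor (the actual pick is still open):

```typescript
import { DateTime } from "luxon";

// moment().add(7, "days").format("YYYY-MM-DD") becomes:
const dueDate = DateTime.now().plus({ days: 7 }).toFormat("yyyy-LL-dd");

// moment.utc(iso).fromNow() becomes:
const age = DateTime.fromISO("2026-01-01T00:00:00Z").toRelative();

// date-fns' addDays(parseISO(iso), 7) maps to the same plus() call above.
console.log(dueDate, age);
```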
Combined Cost Impact Estimate
| Optimization | Estimated Savings |
|---|---|
| Cluster consolidation (ADR-004) | 60-75% cluster costs |
| Database consolidation (ADR-005) | 50-70% database instance costs |
| Scale-to-zero / Knative (ADR-006) | 40-60% compute for Tier 2/3 services |
| NFS → GCS migration | 100% NFS costs (Filestore) |
| RabbitMQ consolidation | 75% messaging costs |
| Redis consolidation | 75% cache costs |
| CastAI on-demand audit | 15-25% additional spot savings |
| Eliminate self-hosted ES/Kibana | 100% ES hosting costs |
| Overall infrastructure | Estimated 50-70% total cloud cost reduction |
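As a back-of-envelope check on the 50-70% headline, the per-category estimates above can be combined with assumed spend weights. The weights below are placeholders; real ones require the GCP billing data flagged as L0 under Evidence Quality.

```typescript
// Hypothetical monthly spend shares per category (summing to 1.0); real
// weights require the GCP billing data that is currently at L0 assurance.
const categories = [
  { name: "clusters (ADR-004)", weight: 0.35, savings: [0.6, 0.75] },
  { name: "databases (ADR-005)", weight: 0.25, savings: [0.5, 0.7] },
  { name: "Tier 2/3 compute (ADR-006)", weight: 0.15, savings: [0.4, 0.6] },
  { name: "messaging + cache", weight: 0.1, savings: [0.75, 0.75] },
  { name: "NFS + self-hosted ES", weight: 0.1, savings: [1.0, 1.0] },
  { name: "everything else", weight: 0.05, savings: [0.0, 0.0] },
];

// Weighted low/high bounds on the total reduction.
const [low, high] = categories.reduce(
  ([lo, hi], c) => [lo + c.weight * c.savings[0], hi + c.weight * c.savings[1]],
  [0, 0],
);

// Prints "Estimated total reduction: 57-70%" with these illustrative
// weights, consistent with the 50-70% claim. CastAI spot savings are
// omitted because they overlap with the compute categories above.
console.log(`Estimated total reduction: ${(low * 100).toFixed(0)}-${(high * 100).toFixed(0)}%`);
```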
Hypothesis Background
Primary: Systematic simplification across infrastructure, code, and operations reduces total cost of ownership by 50-70% while improving system reliability and developer experience.
- Evidence: The platform was built for scale that hasn’t materialized — 4 separate clusters for 4 brands with identical code (H11 L2). The infrastructure was designed for a future that assumed many more tenants needing hard isolation.
- Evidence: Over-decomposition is documented — 6 services with <5 endpoints (tech-debt-inventory). Consolidation is straightforward because all services follow identical patterns (H13 L1).
- Evidence: Observability gaps (no alerting, APM disabled, no SLOs) mean the team is paying for infrastructure they can’t effectively monitor.
Alternative 1: Optimize incrementally without a unified strategy.
- Rejected: Individual optimizations (just consolidate clusters, or just fix observability) miss compounding benefits. Database consolidation enables simpler PgBouncer, which enables simpler Cloud SQL, which enables simpler Terraform. The optimizations reinforce each other.
Alternative 2: Rewrite on a simpler platform (e.g., monolith on Cloud Run).
- Rejected: H14 falsified — the application code is sound. The complexity is in infrastructure and operational overhead, not in business logic. A rewrite would duplicate proven integrations.
Falsifiability Criteria
- If combined optimizations don’t achieve at least 30% cloud cost reduction → reassess individual pillar assumptions
- If simplification causes more than 2 production incidents in the first quarter → pause and stabilize before continuing
- If developer experience surveys show increased cognitive load despite fewer components → investigate and adjust
- If GCP billing shows unexpected cost increases from any pillar → immediately investigate and revert if needed
Evidence Quality
| Evidence | Assurance |
|---|---|
| Config-only multi-tenancy (H11) | L2 (verified) |
| Identical service patterns (H13) | L1 (validated) |
| Over-decomposed services identified | L1 (tech debt inventory) |
| Dead code and repos quantified | L1 (gap analysis) |
| Observability gaps documented | L1 (infrastructure analysis) |
| CastAI override impact | L1 (25+ services on-demand) |
| Actual cloud billing data | L0 (not available) |
| Production traffic patterns | L0 (not available) |
Overall: L1 (weakest link; capped by billing data and traffic patterns, both L0)
Bounded Validity
- Scope: All application services, infrastructure, CI/CD, and observability. Excludes external SaaS integrations (Stripe, Mux, Zoom, etc.), which are already optimized.
- Expiry: Re-evaluate cost targets annually. Re-evaluate architecture if tenant count exceeds 10 or if a fundamentally different workload type is added (e.g., ML training, large-scale data processing).
- Review trigger: If any single optimization causes a regression in availability (SLOs breached) or developer productivity.
- Monitoring: Monthly cloud cost reports by category. Quarterly developer experience survey. SLO dashboards for all services.
Consequences
Positive:
- Estimated 50-70% cloud cost reduction
- Dramatically simpler operational model (~40 repos, 1 cluster, 6 databases per tenant, unified observability)
- Faster onboarding for new team members (less to learn)
- Better incident response (alerting, tracing, and SLOs exist)
- Easier to add new brands (a namespace + values file, not an entire cluster stack)
Negative:
- Large coordinated effort across infrastructure, application, and CI/CD
- Shared infrastructure increases blast radius (mitigated by regional HA and namespace isolation)
- Migration period has dual-state complexity (old and new running in parallel)
- Some optimizations depend on others (database consolidation depends on service consolidation)
Mitigated by: Phased rollout following dependency order (ADR-001 services → ADR-005 databases → ADR-004 cluster → ADR-006 scale-to-zero). Each phase is independently valuable and reversible.
Execution Order
```mermaid
graph TD
    A[Phase 1: Dead Code Cleanup<br/>Archive repos, remove dead FE code] --> B[Phase 2: Service Consolidation<br/>ADR-001: 35 → 18 services]
    B --> C[Phase 3: Database Consolidation<br/>ADR-005: 35 → 6 databases]
    C --> D[Phase 4: Cluster Consolidation<br/>ADR-004: 4 → 1 cluster]
    D --> E[Phase 5: Compute Optimization<br/>ADR-006: Knative + scale-to-zero]
    F[Parallel: Observability<br/>Alerting, tracing, SLOs] --> D
    G[Parallel: CI/CD Simplification<br/>Enforce security gates] --> B
    H[Parallel: CastAI Audit<br/>Reduce on-demand overrides] --> D
```
Decision date: 2026-01-31
Review by: 2026-07-31