ADR-007: Architecture Simplification & Cloud Resource Optimization

Last updated: 2026-02-01

Status

Proposed — Pending engineering team review

Context

The platform has accumulated significant infrastructure complexity: 191 repos, 35 services, 4 dedicated clusters, 140 databases, 48 Helm charts, 121 Terraform files, and 28 CI/CD workflows. ADR-001 and ADR-004 address service and cluster consolidation, respectively, and ADR-005 addresses database consolidation. This ADR takes a holistic view of architecture simplification and cloud resource optimization to improve supportability, maintainability, and cost efficiency.

Current Complexity Metrics

| Dimension | Current | Problem |
| --- | --- | --- |
| Active repos | ~50 | Difficult to navigate, maintain, and onboard |
| Services per tenant | ~28 pods | Over-provisioned for workload complexity |
| Helm charts | 48 | Each requires version management |
| Terraform files | 121 across environments | Duplicated patterns, drift risk |
| CI/CD workflows | 28 reusable | Complex matrix builds |
| PgBouncer entries | 41 per tenant | Connection management overhead |
| Monitoring gaps | No alerting, no SLOs, APM disabled | Blind to production issues |
| Dead code | ~17% frontend, 12+ dead services | Confusion, wasted CI time |
| Libraries | 3 date libraries, 2 CSS frameworks, 2 component libs | Unnecessary diversity |

Decision

Implement a five-pillar simplification strategy that reduces operational surface area, eliminates redundancy, and right-sizes cloud resources.

Pillar 1: Repository Consolidation

| Action | Current | Target | Reduction |
| --- | --- | --- | --- |
| Archive dead/Gen 1 repos | 191 repos | ~50 active | 74% fewer repos |
| Merge service + DB repos | Service repo + DB repo separate | DB migrations inside service repo | 25 fewer repos |
| Mono-repo for shared libs | core-lib, messages, common-helm separate | Single platform-libs repo | 2 fewer repos (3 → 1) |

Result: From 191 repos → ~40 actively maintained repos.

Pillar 2: Infrastructure Right-Sizing

| Resource | Current | Optimized | Savings |
| --- | --- | --- | --- |
| GKE clusters | 4 prod + 2 dev (6 total) | 1 prod regional + 1 dev (2 total) | ~67% cluster overhead |
| Cloud SQL instances | 4 prod (1 per tenant) | 1 regional HA instance | ~75% DB instance cost |
| Databases per instance | 35 per tenant | 6 per tenant (ADR-005) | 83% fewer databases |
| RabbitMQ clusters | 4 × 3-node (12 nodes) | 1 × 3-node shared with vhosts | ~75% messaging infra |
| Redis instances | 4 (one per tenant) | 1 shared with key prefixing | ~75% cache infra |
| NFS PVCs | 16 (4 × 50Gi × 4 tenants) | Migrate to GCS | 100% NFS elimination |
| PgBouncer replicas | 12 (3 per tenant) | 3 (shared cluster) | 75% fewer replicas |
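The percentages in the Savings column can be re-derived from the unit counts in the table. The following is a minimal sketch of that arithmetic; the function and dictionary names are illustrative and not part of any existing tooling.

```python
# Current and optimized unit counts copied from the Pillar 2 table.
RESOURCES = {
    "GKE clusters": (6, 2),
    "Cloud SQL instances": (4, 1),
    "Databases per tenant": (35, 6),
    "RabbitMQ nodes": (12, 3),
    "Redis instances": (4, 1),
    "PgBouncer replicas": (12, 3),
}


def reduction_pct(current: int, target: int) -> float:
    """Return the percentage reduction in unit count from current to target."""
    if current <= 0:
        raise ValueError("current count must be positive")
    return (current - target) / current * 100


def reduction_table(resources: dict) -> dict:
    """Map each resource name to its rounded percentage reduction."""
    return {
        name: round(reduction_pct(current, target))
        for name, (current, target) in resources.items()
    }


if __name__ == "__main__":
    for name, pct in reduction_table(RESOURCES).items():
        print(f"{name}: {pct}% fewer units")
```

These figures are reductions in unit count only. Actual cost savings depend on how the consolidated instances are sized and on billing data, which the Evidence Quality section rates L0.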

Pillar 3: Observability Consolidation

Replace the fragmented observability stack with a unified approach:

| Layer | Current | Target | Change |
| --- | --- | --- | --- |
| Metrics | Prometheus + Grafana (no alerts) | Prometheus + Grafana + AlertManager | Add alerting |
| Tracing | Istio Stackdriver (unused) | OpenTelemetry auto-instrumentation | Replace |
| Logging | peeq-logging (Gen 1 Node.js) → ES 7.x | GCP Cloud Logging (native) | Replace |
| APM | Elastic APM (disabled) | OpenTelemetry → Grafana Tempo | Replace |
| Search | Self-hosted ES 7.x (EOL) | Elastic Cloud 8.x or built-in Cloud Logging search | Upgrade or replace |
| Session replay | LogRocket | Keep LogRocket | No change |
| SLOs | None | Define per-service SLIs/SLOs in Grafana | New |
| Error tracking | None | Sentry or GCP Error Reporting | New |

Consolidation benefit: Eliminate peeq-logging service, eliminate self-managed Elasticsearch/Kibana, reduce monitoring services from 5+ to 2 (Prometheus stack + Cloud Logging).
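To make the new SLO row concrete, the sketch below shows one common way to express an availability SLI and the error budget left against a target. The service name, the 99.9% target, and the request counts are hypothetical; this ADR does not define actual SLO targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical availability SLO for one service."""
    service: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing violated the objective
    return good_requests / total_requests


def error_budget_remaining(slo: Slo, good_requests: int, total_requests: int) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A zero budget is intact only if there were no failures at all.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    slo = Slo(service="example-service", target=0.999)  # placeholder values
    print(availability_sli(999_500, 1_000_000))
    print(error_budget_remaining(slo, 999_500, 1_000_000))
```

An AlertManager rule could then fire when the remaining budget drops below an agreed threshold, which would close the "no alerting" gap listed in the complexity metrics.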

Pillar 4: CI/CD Simplification

| Area | Current | Target |
| --- | --- | --- |
| Build workflows | Per-service GitHub Actions | Consolidated with service-type detection |
| Preview environments | Full stack per PR | Selective: only changed service + dependencies |
| Security scanning | Non-enforcing (Trivy, Qwiet) | Enforcing with build failure on high/critical |
| Helm chart management | 48 charts, each versioned | Fewer charts after service consolidation (~18) |
| Image variants | Per-service Dockerfile | Standardized base images (Java 21, Node 20) |
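Service-type detection could be as simple as checking for build-tool marker files at the repo root and mapping the result to a standardized base image. The sketch below assumes that approach; the marker file list and the image names are placeholders, and only the runtime versions (Java 21, Node 20) come from the table.

```python
from pathlib import Path
from typing import Optional

# Assumption: a service type can be recognized by a build file at the repo root.
# Java is checked first so that a Java service bundling frontend assets
# (and therefore a package.json) is still classified as Java.
MARKERS = {
    "java": ("pom.xml", "build.gradle", "build.gradle.kts"),
    "node": ("package.json",),
}

# Placeholder image references; the registry and image names are hypothetical.
BASE_IMAGES = {
    "java": "example-registry/base-java:21",
    "node": "example-registry/base-node:20",
}


def detect_service_type(repo_root: Path) -> Optional[str]:
    """Return 'java' or 'node' based on marker files at the repo root, else None."""
    for service_type, marker_files in MARKERS.items():
        if any((repo_root / name).is_file() for name in marker_files):
            return service_type
    return None


def base_image_for(repo_root: Path) -> str:
    """Return the standardized base image for the detected service type."""
    service_type = detect_service_type(repo_root)
    if service_type is None:
        raise ValueError(f"cannot determine service type for {repo_root}")
    return BASE_IMAGES[service_type]
```

A consolidated workflow would call something like `base_image_for(Path("."))` once per repo instead of maintaining a Dockerfile per service.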

Pillar 5: Dead Code & Redundancy Elimination

| Category | Items | Action |
| --- | --- | --- |
| Dead API gateways (frontend) | 5 gateways (broadcast, conference, stream, dwolla, logging) | Remove |
| Dead backend services | 12+ confirmed dead | Archive repos |
| Deprecated GraphQL queries | 3 in celebrity, email verification APIs | Remove after frontend audit |
| Duplicate libraries | 3 date libs (luxon, moment, date-fns), 2 CSS frameworks | Standardize on 1 each |
| Deprecated Arlo LMS | class-catalog | Remove |
| Dual Keycloak instances | identityx-25 + identityx-26 on agilenetwork | Complete migration, retire 25 |
| Gen 1 DB repos | ~25 peeq-*-db repos | Archive with README pointing to Gen 2 |
| CastAI on-demand overrides | 25+ services forced on-demand | Audit; most stateless services can use spot |

Combined Cost Impact Estimate

| Optimization | Estimated Savings |
| --- | --- |
| Cluster consolidation (ADR-004) | 60-75% cluster costs |
| Database consolidation (ADR-005) | 50-70% database instance costs |
| Scale-to-zero / Knative (ADR-006) | 40-60% compute for Tier 2/3 services |
| NFS → GCS migration | 100% NFS costs (Filestore) |
| RabbitMQ consolidation | 75% messaging costs |
| Redis consolidation | 75% cache costs |
| CastAI on-demand audit | 15-25% additional spot savings |
| Eliminate self-hosted ES/Kibana | 100% ES hosting costs |
| Overall infrastructure | Estimated 50-70% total cloud cost reduction |

Hypothesis Background

Primary: Systematic simplification across infrastructure, code, and operations reduces total cost of ownership by 50-70% while improving system reliability and developer experience.

Alternative 1: Optimize incrementally without a unified strategy.
- Rejected: Individual optimizations (consolidating only clusters, or fixing only observability) miss compounding benefits. Database consolidation enables a simpler PgBouncer setup, which enables a simpler Cloud SQL setup, which in turn enables simpler Terraform. The optimizations reinforce each other.

Alternative 2: Rewrite on a simpler platform (e.g., monolith on Cloud Run).
- Rejected: H14 was falsified; the application code is sound. The complexity is in infrastructure and operational overhead, not in business logic. A rewrite would duplicate proven integrations.

Falsifiability Criteria

Evidence Quality

| Evidence | Assurance |
| --- | --- |
| Config-only multi-tenancy (H11) | L2 (verified) |
| Identical service patterns (H13) | L1 (validated) |
| Over-decomposed services identified | L1 (tech debt inventory) |
| Dead code and repos quantified | L1 (gap analysis) |
| Observability gaps documented | L1 (infrastructure analysis) |
| CastAI override impact | L1 (25+ services on-demand) |
| Actual cloud billing data | L0 (not available) |
| Production traffic patterns | L0 (not available) |

Overall: L1 (WLNK capped by billing data and traffic patterns L0)

Bounded Validity

Consequences

Positive:
- Estimated 50-70% cloud cost reduction
- Dramatically simpler operational model (~40 repos, 1 cluster, 6 databases, unified observability)
- Faster onboarding for new team members (less to learn)
- Better incident response (alerting, tracing, SLOs exist)
- Easier to add new brands (namespace + values file, not entire cluster stack)

Negative:
- Large coordinated effort across infrastructure, application, and CI/CD
- Shared infrastructure increases blast radius (mitigated by regional HA, namespace isolation)
- Migration period has dual-state complexity (old + new running in parallel)
- Some optimizations depend on others (database consolidation depends on service consolidation)

Mitigated by: Phased rollout following dependency order (ADR-001 services → ADR-005 databases → ADR-004 cluster → ADR-006 scale-to-zero). Each phase is independently valuable and reversible.

Execution Order

```mermaid
graph TD
    A[Phase 1: Dead Code Cleanup<br/>Archive repos, remove dead FE code] --> B[Phase 2: Service Consolidation<br/>ADR-001: 35 → 18 services]
    B --> C[Phase 3: Database Consolidation<br/>ADR-005: 35 → 6 databases]
    C --> D[Phase 4: Cluster Consolidation<br/>ADR-004: 4 → 1 cluster]
    D --> E[Phase 5: Compute Optimization<br/>ADR-006: Knative + scale-to-zero]

    F[Parallel: Observability<br/>Alerting, tracing, SLOs] --> D
    G[Parallel: CI/CD Simplification<br/>Enforce security gates] --> B
    H[Parallel: CastAI Audit<br/>Reduce on-demand overrides] --> D
```
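Reading each arrow in the graph as "must finish before", the rollout can be grouped into waves of work that may run in parallel. The sketch below does this with Python's standard `graphlib` module; the short work-item identifiers are illustrative labels for the graph nodes, not names used elsewhere in this ADR.

```python
from graphlib import TopologicalSorter

# Each key maps a work item to the items that must finish before it starts,
# mirroring the arrows in the execution-order graph.
DEPENDS_ON = {
    "dead-code-cleanup": set(),
    "cicd-simplification": set(),
    "observability": set(),
    "castai-audit": set(),
    "service-consolidation": {"dead-code-cleanup", "cicd-simplification"},
    "database-consolidation": {"service-consolidation"},
    "cluster-consolidation": {"database-consolidation", "observability", "castai-audit"},
    "compute-optimization": {"cluster-consolidation"},
}


def rollout_waves(depends_on: dict) -> list:
    """Group work items into waves; items within a wave have no mutual dependencies."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(rollout_waves(DEPENDS_ON), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

Under this reading, the three parallel tracks can start alongside Phase 1, while Phases 2 through 5 remain strictly sequential.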

Decision date: 2026-01-31 | Review by: 2026-07-31