ADR-007: Architecture Simplification & Cloud Resource Optimization
Status
Proposed — Pending engineering team review
Context
The platform accumulated significant infrastructure complexity across 191 repos, 35 services, 4 dedicated clusters, 140 databases, 48 Helm charts, 121 Terraform files, and 28 CI/CD workflows. ADRs 001 and 004 address service and cluster consolidation. ADR-005 addresses database consolidation. This ADR takes a holistic view of architecture simplification and cloud resource optimization to improve supportability, maintainability, and cost efficiency.
Current Complexity Metrics
| Dimension | Current | Problem |
|---|---|---|
| Active repos | ~50 (of 191 total) | Hard to navigate and maintain; slows onboarding |
| Services per tenant | ~28 pods | Over-provisioned for workload complexity |
| Helm charts | 48 | Each requires version management |
| Terraform files | 121 across environments | Duplicated patterns, drift risk |
| CI/CD workflows | 28 reusable | Complex matrix builds |
| PgBouncer entries | 41 per tenant | Connection management overhead |
| Monitoring gaps | No alerting, no SLOs, APM disabled | Blind to production issues |
| Dead code | ~17% frontend, 12+ dead services | Confusion, wasted CI time |
| Libraries | 3 date libraries, 2 CSS frameworks, 2 component libs | Unnecessary diversity |
Decision
Implement a five-pillar simplification strategy that reduces operational surface area, eliminates redundancy, and right-sizes cloud resources.
Pillar 1: Repository Consolidation
| Action | Current | Target | Reduction |
|---|---|---|---|
| Archive dead/Gen 1 repos | 191 repos | ~50 active | 74% fewer repos |
| Merge service + DB repos | Service repo + DB repo separate | DB migrations inside service repo | 25 fewer repos |
| Mono-repo for shared libs | core-lib, messages, common-helm separate | Single platform-libs repo | 2 fewer repos (3 → 1) |
Result: From 191 repos → ~40 actively maintained repos.
Pillar 2: Infrastructure Right-Sizing
| Resource | Current | Optimized | Savings |
|---|---|---|---|
| GKE clusters | 4 prod + 2 dev (6 total) | 1 prod regional + 1 dev (2 total) | ~67% cluster overhead |
| Cloud SQL instances | 4 prod (1 per tenant) | 1 regional HA instance | ~75% DB instance cost |
| Databases per tenant | 35 | 6 (ADR-005) | 83% fewer databases |
| RabbitMQ clusters | 4 × 3-node (12 nodes) | 1 × 3-node shared with vhosts | ~75% messaging infra |
| Redis instances | 4 (one per tenant) | 1 shared with key prefixing (see the sketch after this table) | ~75% cache infra |
| NFS PVCs | 16 (4 × 50Gi × 4 tenants) | Migrate to GCS | 100% NFS elimination |
| PgBouncer replicas | 12 (3 per tenant) | 3 (shared cluster) | 75% fewer replicas |
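To make the shared-Redis row concrete: a minimal sketch of per-tenant key prefixing, assuming ioredis as the client library. The host name and tenant IDs are placeholders, not values from this platform.

```typescript
import Redis from "ioredis";

// One shared Redis instance; tenant isolation comes from a per-tenant
// key prefix. ioredis applies `keyPrefix` to every key argument.
function redisForTenant(tenantId: string): Redis {
  return new Redis({
    host: process.env.REDIS_HOST ?? "redis.shared.internal", // placeholder host
    keyPrefix: `${tenantId}:`,
  });
}

async function demo(): Promise<void> {
  const brandA = redisForTenant("brand-a");
  const brandB = redisForTenant("brand-b");

  // Same logical key, physically distinct entries
  // ("brand-a:session:123" vs "brand-b:session:123").
  await brandA.set("session:123", "alice");
  await brandB.set("session:123", "bob");

  // Caveat: keyPrefix is not applied to SCAN/KEYS match patterns,
  // so any key-iteration code must add the prefix explicitly.
  console.log(await brandA.get("session:123")); // "alice"
}

demo().catch(console.error);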
Pillar 3: Observability Consolidation
Replace the fragmented observability stack with a unified approach:
| Layer | Current | Target | Change |
|---|---|---|---|
| Metrics | Prometheus + Grafana (no alerts) | Prometheus + Grafana + AlertManager | Add alerting |
| Tracing | Istio Stackdriver (unused) | OpenTelemetry auto-instrumentation | Replace |
| Logging | peeq-logging (Gen 1 Node.js) → ES 7.x | GCP Cloud Logging (native) | Replace |
| APM | Elastic APM (disabled) | OpenTelemetry → Grafana Tempo (see the sketch below) | Replace |
| Search | Self-hosted ES 7.x (EOL) | Elastic Cloud 8.x or built-in Cloud Logging search | Upgrade or replace |
| Session replay | LogRocket | Keep LogRocket | No change |
| SLOs | None | Define per-service SLIs/SLOs in Grafana | New |
| Error tracking | None | Sentry or GCP Error Reporting | New |
Consolidation benefit: Eliminate peeq-logging service, eliminate self-managed Elasticsearch/Kibana, reduce monitoring services from 5+ to 2 (Prometheus stack + Cloud Logging).
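The tracing/APM replacement amounts to a small bootstrap per Node service. A hedged sketch follows; the service name and Tempo endpoint are placeholders. For the Java services, the equivalent would be attaching the OpenTelemetry Java agent via `-javaagent`, again with no code changes.

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Auto-instruments HTTP, Express, pg, ioredis, amqplib, etc. with no
// application code changes; spans are exported over OTLP to Tempo,
// which accepts OTLP natively on port 4318.
const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME ?? "example-service",
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://tempo:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush remaining spans on pod shutdown.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```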
Pillar 4: CI/CD Simplification
| Area | Current | Target |
|---|---|---|
| Build workflows | Per-service GitHub Actions | Consolidated with service-type detection (see the sketch after this table) |
| Preview environments | Full stack per PR | Selective — only changed service + dependencies |
| Security scanning | Non-enforcing (Trivy, Qwiet) | Enforcing with build failure on high/critical |
| Helm chart management | 48 charts, each versioned | Fewer charts after service consolidation (~18) |
| Image variants | Per-service Dockerfile | Standardized base images (Java 21, Node 20) |
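As a sketch of what "service-type detection" could look like in the consolidated workflow (the marker files and the mapping to base images are assumptions, not an existing script):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

type ServiceType = "java" | "node" | "unknown";

// Infer the build pipeline from marker files at the repo root instead of
// maintaining a hand-written workflow per service.
function detectServiceType(repoRoot: string): ServiceType {
  if (existsSync(join(repoRoot, "pom.xml")) || existsSync(join(repoRoot, "build.gradle"))) {
    return "java"; // build on the standardized Java 21 base image
  }
  if (existsSync(join(repoRoot, "package.json"))) {
    return "node"; // build on the standardized Node 20 base image
  }
  return "unknown"; // fail the build loudly rather than guessing
}

// A consolidated workflow would call this once and fan out to the
// matching reusable build job.
console.log(detectServiceType(process.cwd()));
```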
Pillar 5: Dead Code & Redundancy Elimination
| Category | Items | Action |
|---|---|---|
| Dead API gateways (frontend) | 5 gateways (broadcast, conference, stream, dwolla, logging) | Remove |
| Dead backend services | 12+ confirmed dead | Archive repos |
| Deprecated GraphQL queries | 3 in celebrity, email verification APIs | Remove after frontend audit |
| Duplicate libraries | 3 date libs (luxon, moment, date-fns), 2 CSS frameworks | Standardize on 1 each (see the sketch after this table) |
| Deprecated Arlo LMS | class-catalog | Remove |
| Dual Keycloak instances | identityx-25 + identityx-26 on agilenetwork | Complete migration, retire identityx-25 |
| Gen 1 DB repos | ~25 peeq-*-db repos | Archive with README pointing to Gen 2 |
| CastAI on-demand overrides | 25+ services forced on-demand | Audit — most stateless services can use spot |
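Standardizing the date libraries is mostly mechanical call-site rewriting. A sketch, assuming Luxon is the survivor (the actual pick is still open):

```typescript
import { DateTime } from "luxon";

// moment().add(7, "days").format("YYYY-MM-DD") becomes:
const dueDate = DateTime.now().plus({ days: 7 }).toFormat("yyyy-LL-dd");

// moment.utc(iso).fromNow() becomes:
const age = DateTime.fromISO("2026-01-01T00:00:00Z").toRelative();

// date-fns' addDays(parseISO(iso), 7) maps to the same plus() call above.
console.log(dueDate, age);
```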
Combined Cost Impact Estimate
| Optimization | Estimated Savings |
|---|---|
| Cluster consolidation (ADR-004) | 60-75% cluster costs |
| Database consolidation (ADR-005) | 50-70% database instance costs |
| Scale-to-zero / Knative (ADR-006) | 40-60% compute for Tier 2/3 services |
| NFS → GCS migration | 100% NFS costs (Filestore) |
| RabbitMQ consolidation | 75% messaging costs |
| Redis consolidation | 75% cache costs |
| CastAI on-demand audit | 15-25% additional spot savings |
| Eliminate self-hosted ES/Kibana | 100% ES hosting costs |
| Overall infrastructure | Estimated 50-70% total cloud cost reduction |
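As a back-of-envelope check on the 50-70% headline, the per-category estimates above can be combined with assumed spend weights. The weights below are placeholders; real ones require the GCP billing data flagged as L0 under Evidence Quality.

```typescript
// Hypothetical monthly spend shares per category (summing to 1.0); real
// weights require the GCP billing data that is currently at L0 assurance.
const categories = [
  { name: "clusters (ADR-004)", weight: 0.35, savings: [0.6, 0.75] },
  { name: "databases (ADR-005)", weight: 0.25, savings: [0.5, 0.7] },
  { name: "Tier 2/3 compute (ADR-006)", weight: 0.15, savings: [0.4, 0.6] },
  { name: "messaging + cache", weight: 0.1, savings: [0.75, 0.75] },
  { name: "NFS + self-hosted ES", weight: 0.1, savings: [1.0, 1.0] },
  { name: "everything else", weight: 0.05, savings: [0.0, 0.0] },
];

// Weighted low/high bounds on the total reduction.
const [low, high] = categories.reduce(
  ([lo, hi], c) => [lo + c.weight * c.savings[0], hi + c.weight * c.savings[1]],
  [0, 0],
);

// Prints "Estimated total reduction: 57-70%" with these illustrative
// weights, consistent with the 50-70% claim. CastAI spot savings are
// omitted because they overlap with the compute categories above.
console.log(`Estimated total reduction: ${(low * 100).toFixed(0)}-${(high * 100).toFixed(0)}%`);
```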
Hypothesis Background
Primary: Systematic simplification across infrastructure, code, and operations reduces total cost of ownership by 50-70% while improving system reliability and developer experience.
- Evidence: The platform was built for scale that hasn’t materialized — 4 separate clusters for 4 brands with identical code (H11 L2). The infrastructure was designed for a future that assumed many more tenants needing hard isolation.
- Evidence: Over-decomposition is documented — 6 services with <5 endpoints (tech-debt-inventory). Consolidation is straightforward because all services follow identical patterns (H13 L1).
- Evidence: Observability gaps (no alerting, APM disabled, no SLOs) mean the team is paying for infrastructure they can’t effectively monitor.
Alternative 1: Optimize incrementally without a unified strategy.
- Rejected: Individual optimizations (just consolidate clusters, or just fix observability) miss compounding benefits. Database consolidation enables simpler PgBouncer, which enables simpler Cloud SQL, which enables simpler Terraform. The optimizations reinforce each other.
Alternative 2: Rewrite on a simpler platform (e.g., monolith on Cloud Run).
- Rejected: H14 falsified — the application code is sound. The complexity is in infrastructure and operational overhead, not in business logic. A rewrite would duplicate proven integrations.
Falsifiability Criteria
- If combined optimizations don’t achieve at least 30% cloud cost reduction → reassess individual pillar assumptions
- If simplification causes more than 2 production incidents in the first quarter → pause and stabilize before continuing
- If developer experience surveys show increased cognitive load despite fewer components → investigate and adjust
- If GCP billing shows unexpected cost increases from any pillar → immediately investigate and revert if needed
Evidence Quality
| Evidence | Assurance |
|---|---|
| Config-only multi-tenancy (H11) | L2 (verified) |
| Identical service patterns (H13) | L1 (validated) |
| Over-decomposed services identified | L1 (tech debt inventory) |
| Dead code and repos quantified | L1 (gap analysis) |
| Observability gaps documented | L1 (infrastructure analysis) |
| CastAI override impact | L1 (25+ services on-demand) |
| Actual cloud billing data | L0 (not available) |
| Production traffic patterns | L0 (not available) |
Overall: L1 (weakest link; capped by billing data and traffic patterns, both L0)
Bounded Validity
- Scope: All application services, infrastructure, CI/CD, and observability. Excludes external SaaS integrations (Stripe, Mux, Zoom, etc.), which are already optimized.
- Expiry: Re-evaluate cost targets annually. Re-evaluate architecture if tenant count exceeds 10 or if a fundamentally different workload type is added (e.g., ML training, large-scale data processing).
- Review trigger: If any single optimization causes a regression in availability (SLOs breached) or developer productivity.
- Monitoring: Monthly cloud cost reports by category. Quarterly developer experience survey. SLO dashboards for all services.
Consequences
Positive:
- Estimated 50-70% cloud cost reduction
- Dramatically simpler operational model (~40 repos, 1 cluster, 6 databases per tenant, unified observability)
- Faster onboarding for new team members (less to learn)
- Better incident response (alerting, tracing, and SLOs exist)
- Easier to add new brands (a namespace + values file, not an entire cluster stack)
Negative:
- Large coordinated effort across infrastructure, application, and CI/CD
- Shared infrastructure increases blast radius (mitigated by regional HA and namespace isolation)
- Migration period has dual-state complexity (old and new running in parallel)
- Some optimizations depend on others (database consolidation depends on service consolidation)
Mitigated by: Phased rollout following dependency order (ADR-001 services → ADR-005 databases → ADR-004 cluster → ADR-006 scale-to-zero). Each phase is independently valuable and reversible.
Execution Order
```mermaid
graph TD
    A[Phase 1: Dead Code Cleanup<br/>Archive repos, remove dead FE code] --> B[Phase 2: Service Consolidation<br/>ADR-001: 35 → 18 services]
    B --> C[Phase 3: Database Consolidation<br/>ADR-005: 35 → 6 databases]
    C --> D[Phase 4: Cluster Consolidation<br/>ADR-004: 4 → 1 cluster]
    D --> E[Phase 5: Compute Optimization<br/>ADR-006: Knative + scale-to-zero]
    F[Parallel: Observability<br/>Alerting, tracing, SLOs] --> D
    G[Parallel: CI/CD Simplification<br/>Enforce security gates] --> B
    H[Parallel: CastAI Audit<br/>Reduce on-demand overrides] --> D
```
Decision date: 2026-01-31
Review by: 2026-07-31