Infrastructure & DevOps Architecture
Key Takeaways
- Fully GCP-native infrastructure — GKE clusters, Cloud SQL (PostgreSQL 16), Cloud DNS, GCS, Secret Manager, Artifact Registry. All provisioned via Terraform with Atlantis PR-based workflow.
- Dedicated cluster per tenant — 3 production brands (The Agile Network, NIL Game Plan, VT NIL) each get a separate GKE cluster, PostgreSQL instance (35 databases), RabbitMQ cluster (3 nodes), Redis, and Keycloak realm. A 4th brand (Speed of AI) also has a production cluster.
- Mature GitOps pipeline — ArgoCD manages all deployments. 28 reusable GitHub Actions workflows. Common Helm library chart (v0.0.179, 33 templates) standardizes all 48 service charts. Preview environments auto-created per PR with Istio subdomain routing.
- Multi-brand routing via Istio — Istio IngressGateway handles TLS termination, path-based routing (/api/{service}), mTLS between services, and per-tenant Gateway hosts. External-DNS and cert-manager automate DNS records and Let’s Encrypt certificates.
- No backend brand-specific logic in infrastructure — All tenant differentiation is via environment variables and values-globals.yaml. Same Helm charts, same Docker images, same service code deployed to all tenants. Confirms H11 across all 7 domains + infrastructure.
Migration Decision Question
What infrastructure changes are needed, and what’s the multi-brand routing mechanism?
Migration Verdict
- Upgrade Complexity: L
- Key Constraint: Cluster-per-tenant model is expensive but provides isolation; consolidation requires a namespace-level isolation strategy (NetworkPolicies, ResourceQuotas, Istio authorization).
- Dependencies: All application services depend on this infrastructure layer. Changes here affect all domains.
Infrastructure Inventory
Cloud Platform
| Component | Technology | Version/Config |
|---|---|---|
| Cloud Provider | Google Cloud Platform | 4 GCP projects |
| Container Orchestration | GKE | Kubernetes 1.30.4 |
| Service Mesh | Istio | IstioOperator CRD, Stackdriver tracing |
| Database | Cloud SQL PostgreSQL | PostgreSQL 16, db-custom-2-6656 |
| Message Queue | RabbitMQ | 3.13.7 (Bitnami), 3-node HA cluster |
| Cache | Redis | Master + replicas per tenant |
| Connection Pooling | PgBouncer | 3 replicas, 41 databases routed |
| DNS | Cloud DNS + external-dns | Automated record management |
| TLS | cert-manager + Let’s Encrypt | Automatic certificate provisioning |
| Secrets | GCP Secret Manager + AVP | ArgoCD Vault Plugin injection |
| Image Registry | Google Artifact Registry | Docker, Maven, Helm (OCI) |
| Cost Optimization | CastAI | Spot instances default, on-demand for critical services |
| IaC | Terraform 1.9.5 + Atlantis | PR-based infrastructure changes |
| GitOps | ArgoCD | Declarative Helm-based deployments |
| CI/CD | GitHub Actions | 28 reusable workflows, self-hosted runners |
| Analytics ETL | Airbyte | CDC replication for 20 databases |
| Data Warehouse | Snowflake | Per-tenant databases |
| Monitoring | kube-prometheus-stack | Prometheus + Grafana |
| Logging | Elasticsearch + Kibana 7.15.2 | peeq-logging (Node.js) aggregator |
| APM | Elastic APM | Disabled by default, available per-service |
| Session Replay | LogRocket | Frontend-only (admin, celeb, fan) |
| Security Scanning | Trivy + Qwiet (ShiftLeft) | Container + SAST scanning |
GCP Projects
| Project | Purpose |
|---|---|
| core-services-370815 | Centralized services: ArgoCD, Terraform state, shared secrets |
| production-370815 | Production clusters and databases |
| vz-development-381618 | Development environment |
| vz-staging-381618 | Staging environment |
| favedom-dev | Artifact Registry (Docker, Maven, Helm) |
Production Tenants
| Brand | Cluster | Domain | Namespace | Apps |
|---|---|---|---|---|
| The Agile Network | agilenetwork | theagilenetwork.com | agilenetwork | 52 |
| NIL Game Plan | nilgameplan | nilgameplan.com | nilgameplan | 52 |
| VT NIL | vtnil | vt.triumphnil.com | vt | 49 |
| Speed of AI | speedofai | (AI training vertical) | speedofai | 48 |
Development Tenants
| Brand | Cluster | Domain | Apps |
|---|---|---|---|
| FanFuze NIL | fanfuzenil | dev.fanfuzenil.com | 70 |
| Temp FanFuze | tmp-fanfuze | (temporary) | 47 |
Multi-Brand Routing Architecture
DNS-to-Backend Flow
graph TD
subgraph "Internet"
U1[User: theagilenetwork.com]
U2[User: nilgameplan.com]
U3[User: vt.triumphnil.com]
end
subgraph "GCP Cloud DNS"
DNS1[theagilenetwork.com → LB IP]
DNS2[nilgameplan.com → LB IP]
DNS3[vt.triumphnil.com → LB IP]
end
subgraph "GCP Load Balancer"
LB1[agilenetwork cluster LB]
LB2[nilgameplan cluster LB]
LB3[vtnil cluster LB]
end
subgraph "Istio per Cluster"
IG1[Istio IngressGateway]
IG2[Istio IngressGateway]
IG3[Istio IngressGateway]
end
subgraph "Kubernetes Namespaces"
NS1[agilenetwork namespace<br/>28 services + infra]
NS2[nilgameplan namespace<br/>28 services + infra]
NS3[vt namespace<br/>28 services + infra]
end
U1 --> DNS1 --> LB1 --> IG1 --> NS1
U2 --> DNS2 --> LB2 --> IG2 --> NS2
U3 --> DNS3 --> LB3 --> IG3 --> NS3
Istio Gateway Configuration
Each tenant has an Istio Gateway accepting traffic for its domains:
# Simplified from prod/agilenetwork/istio-gateway
servers:
- hosts: ['*.theagilenetwork.com', 'theagilenetwork.com']
port: { number: 443, protocol: HTTPS }
tls: { mode: SIMPLE, minProtocolVersion: TLSV1_2 }
- hosts: ['*.theagilenetwork.com', 'theagilenetwork.com']
port: { number: 80, protocol: HTTP }
tls: { httpsRedirect: true }
Per-Service Routing (VirtualService)
Every service gets path-based routing through the Istio gateway:
# Pattern for all services
spec:
hosts: ['theagilenetwork.com']
gateways: ['istio-system/istio-gateway']
http:
- match: [{ uri: { prefix: /api/celebrity } }]
route: [{ destination: { host: celebrity, port: { number: 8080 } } }]
Frontend Domain Mapping
| Subdomain | Purpose | Application |
|---|---|---|
| theagilenetwork.com | Fan-facing app | mono-web |
| instructor.theagilenetwork.com | Expert portal | celeb-fe |
| admin.theagilenetwork.com | Admin portal | admin-fe |
| identity.theagilenetwork.com | Keycloak login | identityx-26 |
| app.theagilenetwork.com | API gateway | All backend services |
Tenant Configuration (values-globals.yaml)
Each tenant has a values-globals.yaml with all tenant-specific config:
global:
cluster: agilenetwork
domain: theagilenetwork.com
env: prod
gcpProject: production-370815
secretManagerId: "564934788583"
tenant:
alias: "The Agile Network"
name: "agilenetwork"
namespace: "agilenetwork"
database:
hostname: pgbouncer
port: 5432
keycloak:
authServerUrl: "https://identity.theagilenetwork.com"
realm: agilenetwork
rabbitmq:
host: rabbitmq
redis:
master: { host: redis-master, port: 6379 }
All services read from global.* — no brand-specific business logic anywhere.
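For illustration, a common-chart environment template could map these global.* values directly to container environment variables — the snippet below is a hypothetical sketch in the style of the common chart’s _env.tpl, not the actual template:
# Hypothetical excerpt (names illustrative, not copied from the common chart)
env:
  - name: TENANT_NAME
    value: {{ .Values.global.tenant.name | quote }}
  - name: KEYCLOAK_AUTH_SERVER_URL
    value: {{ .Values.global.keycloak.authServerUrl | quote }}
  - name: SPRING_DATASOURCE_URL
    value: "jdbc:postgresql://{{ .Values.global.database.hostname }}:{{ .Values.global.database.port }}/{{ .Chart.Name }}"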
GKE Cluster Configuration
Node Pools (Production)
| Setting | Value |
|---|---|
| Machine Type | n1-standard-8 (8 vCPU, 30 GB RAM) |
| Disk | 100 GB standard persistent |
| Image | COS_CONTAINERD |
| Autoscaling | 0-3 nodes per pool |
| Preemptible | Yes (CastAI manages spot) |
| cgroup | CGROUP_MODE_V2 |
| Max Pods/Node | 110 |
| IP Aliasing | Pod range /16, Service range /22 |
| Workload Identity | Enabled |
| Maintenance | Weekends, 3-9 AM UTC |
| Location | us-central1-a (zonal) |
Cluster Features
- Workload Identity: All GCP API calls use Kubernetes SA → GCP SA binding (no JSON keys)
- Network Policy: Disabled (relying on Istio for service-to-service authorization)
- Release Channel: UNSPECIFIED (manual version control)
- Monitoring: Cloud Monitoring + Cloud Logging enabled
- Filestore CSI: Available for NFS volumes
CastAI Cost Optimization
- Default: Spot instances for all workloads
- On-demand exceptions: admin-fe, celeb-fe, celebrity, class-catalog, content, email, fan, group-profile, identityx, inventory, media, message-board, mono-web, notifications, rabbitmq, shoutout, shoutout-bpm, sms, sse, stripe, subscriptions, tags, transaction, users, wallet
- Node constraints: 2-8 CPU cores, min 30GB disk
- Evictor: Aggressive mode in dev, conservative in prod
- Impact: Most critical services forced to on-demand nodes, reducing cost savings
Database Infrastructure
Cloud SQL Configuration
| Setting | Value |
|---|---|
| Engine | PostgreSQL 16 |
| Tier | db-custom-2-6656 (2 vCPU, 6.5 GB) |
| Availability | ZONAL (single zone) |
| Backup | Enabled, point-in-time recovery |
| Private IP | Via VPC peering |
| Public IP | Only when Airbyte enabled |
| Max Connections | 750 |
| Deletion Protection | Enabled |
| IAM Auth | Enabled |
Databases per Tenant (35 PostgreSQL)
celebrity, chat, class-catalog, content, email, fan, group-profile, identityx_26, inventory, journey, media, message_board, notification_service, org_manager, purchase_request_bpm, reporting, search, shoutout, shoutout_bpm, sms, sse, stream, stripe, subscriptions, tags, tracking, transaction, wallet, webinar, superset
In addition, VT NIL has a separate Cloud SQL MySQL instance for its tracking database.
PgBouncer (Connection Pooling)
- Replicas: 3 per tenant
- Databases routed: 41 (all services connect via PgBouncer)
- Resources: 200m-500m CPU, 192Mi memory
- Pattern: All services connect to pgbouncer:5432, not Cloud SQL directly
- Init containers: Every service waits for PgBouncer before starting
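A minimal sketch of such a wait-for-PgBouncer init container, assuming a busybox image (the real logic lives in the common chart’s _postgres.tpl and is not reproduced here):
# Hypothetical init container definition
initContainers:
  - name: wait-for-pgbouncer
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z pgbouncer 5432; do echo waiting for pgbouncer; sleep 2; done']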
Airbyte CDC Replication
20 databases have Change Data Capture replication to Snowflake:
- Sources: PostgreSQL Cloud SQL (public IP required)
- Destination: Snowflake data warehouse (org: TSGCLBT, account: PM66380)
- Per-tenant Snowflake databases (e.g., AGILENETWORK_DB)
CI/CD Pipeline
GitHub Actions (28 Reusable Workflows)
graph TD
subgraph "Developer Workflow"
PR[Open PR] --> BUILD
end
subgraph "Build Phase"
BUILD[GitHub Actions] --> MVN[Maven Build + Tests]
BUILD --> DOCKER[Docker Build<br/>Multi-arch: amd64 + arm64]
end
subgraph "Security Phase"
DOCKER --> TRIVY[Trivy Container Scan<br/>CRITICAL + HIGH]
DOCKER --> QWIET[Qwiet SAST<br/>Code analysis]
end
subgraph "Artifact Phase"
MVN --> GAR_MVN[Push to GAR Maven]
DOCKER --> GAR_DOCKER[Push to GAR Docker]
DOCKER --> HELM[Generate Helm Chart]
HELM --> GAR_HELM[Push to GAR Helm OCI]
end
subgraph "Deploy Phase"
GAR_DOCKER --> |PR| PREVIEW[Preview Environment<br/>ArgoCD Previews]
GAR_DOCKER --> |Master| PROD_UPDATE[Update argocd-deployments<br/>image.tag in values.yaml]
PROD_UPDATE --> ARGOCD[ArgoCD Sync<br/>Helm Release Update]
end
subgraph "Preview Lifecycle"
PREVIEW --> PR_COMMENT[Bot: Preview URL<br/>on PR comment]
PR_COMMENT --> CLEANUP[PR Closed → Delete<br/>Preview Namespace]
end
Key Workflow Files
| Workflow | Purpose |
|---|---|
| build-maven-docker.yaml | Spring Boot service build + Docker |
| build-node-docker.yaml | Node.js/Angular build + Docker |
| build-pnpm-docker.yaml | pnpm frontend builds |
| deploy-argocd-env.yaml | Update ArgoCD env with new version |
| deploy-helm-preview.yaml | Create PR preview environment |
| cleanup-preview-env.yaml | Delete preview namespace on PR close |
| security-trivy.yaml | Container vulnerability scanning |
| security-qwiet.yaml | Static application security testing |
| cloud-run-job-flyway.yaml | Database migration via Cloud Run |
| lint-sql.yaml | SQL linting for Flyway migrations |
Self-Hosted Runner
- Base: ghcr.io/actions/actions-runner:2.322.0
- Mode: Docker-in-Docker (DinD) on GKE
- Pre-installed: Java 21 (Corretto), Maven 3.9.9, Node 20, Helm 3.17.1, GitHub CLI
- Scaling: ARC (Actions Runner Controller), 0-5 runners
Preview Environment Flow
- PR opened → GitHub Actions builds Docker image with PR-specific tag
- Helm chart generated with preview values (namespace: {service}-pr-{number})
- Application YAML pushed to argocd-previews repo
- ArgoCD syncs preview application
- Secrets copied from fanfuze namespace to preview namespace
- Bot comments preview URL: {service}-pr-{number}.dev.fanfuzenil.com
- PR closed → Cleanup workflow deletes namespace
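The Application YAML pushed to argocd-previews plausibly follows the standard Argo CD shape; the manifest below is a sketch with assumed project, registry registration, and naming — not the generated file:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: celebrity-pr-123               # assumed {service}-pr-{number} naming
  namespace: argocd
spec:
  project: default                     # assumption
  source:
    repoURL: us-central1-docker.pkg.dev/favedom-dev/helm   # GAR Helm OCI registry (assumed to be registered in Argo CD)
    chart: celebrity
    targetRevision: 1.2.3-pr.123.4.1   # PR version pattern from Version Management below
    helm:
      parameters:
        - name: image.tag
          value: 1.2.3-pr.123.4.1
  destination:
    server: https://kubernetes.default.svc
    namespace: celebrity-pr-123
  syncPolicy:
    automated:
      prune: true
      selfHeal: true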
Version Management
- Semantic versioning: {major}.{minor}.{patch} from git tags
- PR versions: {version}-pr.{pr_number}.{run_number}.{attempt}
- Chart versions: Independent of application versions (common chart at 0.0.179)
Helm Chart Architecture
Common Library Chart (v0.0.179)
33 reusable templates standardizing all service deployments:
| Template | Purpose |
|---|---|
| _deployment-java.tpl | Java/Spring Boot deployments |
| _deployment-node.tpl | Node.js deployments |
| _deployment-fe.tpl | Frontend (Angular/Ionic) deployments |
| _istio.tpl | VirtualService + DestinationRule |
| _hpa.tpl | Horizontal Pod Autoscaler |
| _postgres.tpl | Database connection env vars + init container |
| _keycloak.tpl | Keycloak integration env vars |
| _env.tpl | Common environment variables |
| _secret-postgres.yaml | PostgreSQL secret template |
| _pdb.yaml | PodDisruptionBudget |
| _keda.yaml | KEDA event-driven autoscaling |
| _servicemonitor.tpl | Prometheus ServiceMonitor |
| _canary.tpl | Flagger canary deployments |
| _apm.tpl | APM sidecar injection |
| _kubefledged.tpl | Image pre-loading |
Service Chart Pattern
Each service chart is minimal — delegates to common chart:
# Chart.yaml
dependencies:
- name: common
version: 0.0.179
repository: "oci://us-central1-docker.pkg.dev/favedom-dev/helm"
# templates/deployment.yaml
{{ include "common.deployment.java" . }}
# templates/istio.yaml
{{ include "common.istio" . }}
Feature flags control what each service gets:
# values.yaml (per service)
keycloak: { enabled: true }
postgres: { enabled: true }
rabbitmq: { enabled: true }
prometheus: { enabled: true }
hpa: { enabled: false } # overridden per tenant
All Helm Charts (48)
Application Services (30): celebrity, fan, users, content, media, shoutout, shoutout-bpm, webinar, chat, message-board, notifications, email, sms, sse, inventory, journey, class-catalog, purchase-request-bpm, transaction, wallet, subscriptions, stripe, search, tags, tracking, reporting, org-manager, group-profile, onsite-event, athlete-manager
Frontends (5): mono-web, admin-fe, celeb-fe, org-dashboard-fe, nilgp-partnerportal-fe
Infrastructure (13): common (library), pgbouncer, rabbitmq-queue-monitor, flyway, shared-secrets, stackhawk, argocd-reports, site-maintenance, node-tracking, nilgp-partnerportal-be, test-spring-boot-app, plus preview configs
Secret Management
Architecture (3-Tier)
graph LR
subgraph "Source of Truth"
GSM[GCP Secret Manager<br/>100+ secrets per tenant]
end
subgraph "Injection Layer"
AVP[ArgoCD Vault Plugin<br/>Fetches at sync time]
end
subgraph "Runtime"
K8S[Kubernetes Secrets<br/>Mounted in pods]
end
subgraph "Provisioning"
TF[Terraform<br/>Creates secret shells]
end
TF --> GSM
GSM --> AVP
AVP --> K8S
Secret Naming Convention
{tenant}_{VENDOR}_{APP}_{SECRET_NAME}
Examples:
- agilenetwork_STRIPE_PAYMENT_KEY
- agilenetwork_KEYCLOAK_FAN_CLIENTID
- agilenetwork_RABBITMQ_PASSWORD
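With the ArgoCD Vault Plugin’s GCP Secret Manager backend, charts reference these secrets through path placeholders resolved at sync time. A hedged example — the secret name comes from the list above, the project number from values-globals.yaml, and the manifest shape is assumed:
# Kubernetes Secret template with an AVP placeholder (illustrative)
apiVersion: v1
kind: Secret
metadata:
  name: stripe
type: Opaque
stringData:
  STRIPE_PAYMENT_KEY: <path:projects/564934788583/secrets/agilenetwork_STRIPE_PAYMENT_KEY#latest>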
Integrated Services (20+ secret modules)
| Service | Secrets |
|---|---|
| Stripe | Payment key, webhook signing secret |
| Keycloak | Multiple realm configs, DB credentials |
| RabbitMQ | User, password, Erlang cookie |
| Redis | Password |
| Twilio | Account SID, auth token |
| Mandrill | API key |
| Mux | Token ID, secret |
| Stream Chat | API key, secret |
| Mailchimp | API key |
| Dwolla | API key, secret, funding source, webhook secret |
| Zoom | API key, secret |
| Intercom | API token |
| Snowflake | Account, user, password, warehouse |
| Auth0 | Client ID, secret, domain |
| Plaid | Client ID, secret |
| Grafana | Admin credentials |
| Elasticsearch/APM | Connection credentials |
CI/CD Secrets
- Workload Identity Federation for GCP authentication (no service account keys)
- CI/CD secrets stored with cicd_ prefix in the core-services-370815 project
- Fetched at runtime by GitHub Actions, never stored in GitHub
Terraform Infrastructure-as-Code
Module Inventory (19 modules)
| Module | Purpose |
|---|---|
| gke-cluster | GKE cluster provisioning |
| postgres-instance | Cloud SQL PostgreSQL instance |
| postgres-db | Individual database creation |
| mysql-instance | Cloud SQL MySQL instance |
| mysql-db | MySQL database creation |
| sql-user | Database user creation |
| dns | Cloud DNS + external-dns + cert-manager |
| secrets | 20+ integration secret shells |
| secrets-powervz | PowerVZ-specific secrets |
| castai | CastAI cost optimization |
| argocd | Cross-project ArgoCD IAM |
| airbyte | Airbyte IP allowlisting |
| airbyte-connection | ETL connection configs |
| airbyte-source | PostgreSQL CDC sources |
| airbyte-destination | Snowflake destinations |
| snowflake | Snowflake user/database |
| filestore | GCP Filestore (NFS) |
| logging-exclusions | Cost-saving log filters |
| keycloak-google-idp | Google IDP for Keycloak |
State Management
- Backend: GCS bucket terraform-state-370815
- State files: Per-environment (core-services/, development/, production/)
- Locking: GCS-native state locking
- CI/CD: Atlantis (PR-based plan/apply, auto-merge on success)
Environment Setup
Secrets injected via tf_setup.sh:
export TF_VAR_castai_api_token=$(gcloud secrets versions access latest \
--secret="core_CASTAI_API_TOKEN")
Monitoring & Observability
Current Stack
| Layer | Tool | Status |
|---|---|---|
| Metrics | kube-prometheus-stack | Deployed to all tenants |
| Service Monitors | Prometheus ServiceMonitor | All Java services expose /actuator/prometheus |
| Dashboards | Grafana (via prometheus-stack) | Deployed |
| Logging (collection) | peeq-logging (Node.js) | Gen 1 (Node.js/Express → Elasticsearch) |
| Logging (storage) | Elasticsearch | Cloud-hosted (port 9243, HTTPS) |
| Logging (visualization) | Kibana 7.15.2 | Deployed |
| Tracing | Istio → Stackdriver | Zipkin integration configured |
| APM | Elastic APM 1.25.2 | Available but disabled by default |
| Session Replay | LogRocket | Frontend only (admin, celeb, fan) |
| Analytics | Superset | Connected to PostgreSQL reporting DB |
| Cloud Logging | GCP Cloud Logging | GKE-native integration |
| Cloud Monitoring | GCP Cloud Monitoring | GKE-native integration |
Observability Gaps
- No centralized alerting visible in infrastructure repos (no PagerDuty, Opsgenie, or Slack alert configs in Terraform/Helm)
- APM disabled by default — transaction sampling at 50% when enabled, but not turned on
- Logging pipeline is Gen 1 — peeq-logging is Node.js/Express, and not merely by its peeq-* naming: Gen 1 patterns include a Node 9 Dockerfile, Express, and no TypeScript build in the image
- No SLO/SLI definitions in Helm charts or monitoring config
- Kibana dashboards not defined in code — likely manual creation
- No distributed tracing adoption — Zipkin address configured in Istio but no evidence of service-level trace instrumentation
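Since kube-prometheus-stack is already deployed, codified alerting could start as PrometheusRule resources shipped with the charts. A minimal sketch — the metric, threshold, and labels are assumptions, and nothing like this exists in the repos today:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-availability
  labels:
    release: kube-prometheus-stack     # assumed rule-selector label
spec:
  groups:
    - name: availability
      rules:
        - alert: HighHttp5xxRate
          expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (service) > 1
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "{{ $labels.service }} is returning 5xx responses"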
Storage Infrastructure
NFS (ReadWriteMany)
4 Persistent Volume Claims per tenant via NFS provisioner:
| PVC | Size | Purpose |
|---|---|---|
| pvc-content | 50Gi | Content service file storage |
| pvc-media | 50Gi | Media service file storage |
| pvc-shoutout | 50Gi | Shoutout video storage |
| pvc-stream | 50Gi | Streaming assets |
All use the nfs-client StorageClass via a per-tenant NFS provisioner.
GCS Buckets
2 shared buckets with tenant-specific directories:
- public-assets-{hex}: Public-facing assets (images, videos). Public read access (allUsers).
- backend-assets-{hex}: Backend-generated assets. Public read access.
Both have versioning enabled and deletion protection.
Operational Tooling
ArgoCD Scripts (24 utilities)
| Script | Purpose |
|---|---|
| check-app-version.sh | Compare deployed versions across tenants |
| check-secrets.sh | Validate secret availability |
| set-replicas.sh | Scale services up/down |
| rabbitmq-reports.sh | RabbitMQ health and metrics |
| pgbouncer/reload.sh | Hot-reload PgBouncer configuration |
| restart-pods/ | Rolling restart utilities |
| diff_promote/ | Diff configs between environments |
| update-tags.sh | Batch update image tags |
| create-changelog.sh | Generate deployment changelogs |
DevOps Utilities (devops-utlities repo)
| Subsystem | Purpose |
|---|---|
| db/ | Database dumps, restores, BPM exports, vacuum |
| github/ | Mass PR merging, repo migration (multi-gitter) |
| jdk-update/ | Platform-wide JDK upgrades via multi-gitter |
| update-mvn-deps/ | Maven dependency propagation |
| graphql-migration/ | Spring Boot 3.x + GraphQL upgrade scripts |
| google-artifact-repo/ | Container registry management (gcrane) |
| vpa/ | Vertical Pod Autoscaler analysis for cost optimization |
| env-start-stop/ | Multi-tenant cluster startup/shutdown |
| keycloak/ | Realm export utilities |
RabbitMQ Queue Monitor
- Stack: Bash + curl + jq + kubectl on Debian slim
- Monitors: 27 services mapped to Kubernetes deployments
- Purpose: Message-driven autoscaling (precursor to KEDA)
- Maps: Queue names → deployment names for scaling decisions
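The KEDA replacement suggested under Modernization Implications would express the same queue-to-deployment mapping declaratively. A sketch of one ScaledObject — queue name, thresholds, and auth wiring are assumptions:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: shoutout-bpm
spec:
  scaleTargetRef:
    name: shoutout-bpm                  # target Deployment (illustrative)
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: rabbitmq
      metadata:
        mode: QueueLength
        value: "50"                     # assumed backlog threshold
        queueName: shoutout-bpm-queue   # assumed queue name
      authenticationRef:
        name: rabbitmq-trigger-auth     # TriggerAuthentication holding the AMQP connection string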
Security Architecture
Authentication & Authorization
graph TD
subgraph "External"
USER[User Browser]
end
subgraph "GCP Edge"
LB[GCP Load Balancer<br/>TLS Termination]
end
subgraph "Service Mesh"
IG[Istio IngressGateway<br/>Host-based routing]
MTLS[Istio mTLS<br/>Pod-to-pod encryption]
end
subgraph "Identity"
KC[Keycloak 26.3<br/>OAuth2/OIDC]
end
subgraph "Services"
SVC[Spring Boot Service<br/>JWT validation via issuer URI]
end
USER -->|HTTPS| LB --> IG
IG -->|/api/*| SVC
IG -->|/auth/*| KC
USER -->|OAuth2 flow| KC
KC -->|JWT| USER
USER -->|Bearer token| IG
SVC -.->|mTLS| MTLS -.-> SVC
Security Controls
| Control | Implementation |
|---|---|
| TLS in transit | Let’s Encrypt certs via cert-manager, Istio mTLS |
| Workload Identity | Kubernetes SA → GCP SA (no JSON keys) |
| Secret storage | GCP Secret Manager (encrypted at rest, IAM-controlled) |
| Container scanning | Trivy (CRITICAL + HIGH severity) |
| SAST | Qwiet/ShiftLeft (Java + JavaScript) |
| Database access | Private IP via VPC peering, Cloud SQL IAM auth |
| Deletion protection | GKE clusters + Cloud SQL instances |
| CORS | Istio VirtualService (allows all origins — broad) |
Security Gaps
- Trivy scans don’t fail builds — exit code 0, vulnerabilities tracked but not enforced
- No Binary Authorization — unsigned container images can deploy
- No GKE Shielded Nodes — secure boot not enabled
- No NetworkPolicies — relying on Istio alone for pod-level isolation
- Public GCS buckets — allUsers read access on asset buckets
- CORS allows all origins — allowOrigins: regex: .* in Istio VirtualService
- Developer IPs in authorized networks — direct DB access via individual IPs (dev/staging)
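Closing the NetworkPolicy gap would first require enabling enforcement on GKE (it is currently disabled per Cluster Features above); a baseline default-deny policy per tenant namespace might then look like this sketch:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: agilenetwork
spec:
  podSelector: {}                       # applies to every pod in the namespace
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system   # allow traffic arriving via the IngressGateway
        - podSelector: {}                                 # allow intra-namespace (mesh) traffic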
Gen 1 Infrastructure Services
| Service | Stack | Status | Replacement |
|---|---|---|---|
| peeq-logging | Node.js/Express → Elasticsearch | Active (Gen 1 patterns) | Upgrade or replace with GCP Cloud Logging |
| peeq-kibana-deploy | Kibana 7.15.2 wrapper | Active | Upgrade Kibana or migrate to Elastic Cloud |
| peeq-shared-secret | Java 11/SB 2.4.3 → GCP Secret Manager | Unclear if active | ArgoCD Vault Plugin may have replaced this |
| devops-utlities | Bash scripts + multi-gitter | Active (operational) | Keep as tooling |
Infrastructure Risk Assessment
High Risk
- Zonal GKE clusters — Single zone (us-central1-a). Zone failure = complete tenant outage. No regional cluster configuration. RPO depends on Cloud SQL backup frequency.
- Zonal Cloud SQL — ZONAL availability type. No automatic failover. Manual recovery required from point-in-time backup.
- Single region — All infrastructure in us-central1. No cross-region DR.
Medium Risk
- Cluster-per-tenant cost — 4 production clusters each with 0-3 n1-standard-8 nodes + Cloud SQL + RabbitMQ + Redis. Infrastructure cost scales linearly with tenants.
- CastAI on-demand overrides — 25+ services forced to on-demand nodes, reducing spot savings significantly.
- PgBouncer as SPOF — If PgBouncer pods fail, all database connections fail. 3 replicas provide some redundancy but no circuit breaking.
Low Risk
- Common chart coupling — All 48 charts depend on common v0.0.179. Breaking change affects everything. Mitigated by chart versioning.
- ArgoCD centralization — Single ArgoCD instance manages all clusters from core-services project. ArgoCD failure blocks all deployments.
- Helm OCI registry — All charts stored in single Artifact Registry. Registry outage blocks deployments but not running services.
Inter-Service Communication (Infrastructure View)
Synchronous
All service-to-service calls route through the Istio mesh:
- Path-based routing: http://{service}:8080/api/{service}/...
- mTLS encryption between all pods
- CORS configured at Istio level (broad: all origins)
Asynchronous
RabbitMQ 3.13.7 cluster per tenant:
- 3 nodes, persistent storage (20Gi)
- Memory high watermark: 3276Mi
- On-demand nodes (no spot eviction)
- Safe-to-evict: false annotation
- Definitions loaded from Kubernetes secrets
Database Connections
All services → PgBouncer (3 replicas) → Cloud SQL (private IP):
- JDBC URL: jdbc:postgresql://pgbouncer:5432/{service-name}
- Credentials: Kubernetes Secret (from GCP Secret Manager via AVP)
- Init container waits for PgBouncer availability
Data Model (Infrastructure)
Resource Allocation per Tenant
| Resource | Per Tenant | Total (4 prod) |
|---|---|---|
| GKE Cluster | 1 | 4 |
| GKE Nodes (max) | 3 × n1-standard-8 | 12 nodes, 96 vCPU, 360 GB RAM |
| Cloud SQL Instance | 1 (2 vCPU, 6.5 GB) | 4 instances |
| PostgreSQL Databases | 35 | 140 |
| Max DB Connections | 750 | 3,000 |
| RabbitMQ Nodes | 3 | 12 |
| Redis Instances | 1 (master + replicas) | 4 |
| PgBouncer Replicas | 3 | 12 |
| NFS PVCs | 4 × 50Gi = 200Gi | 800Gi |
| GCS Buckets | 2 (shared) | 2 (shared) |
Scaling Boundaries
| Component | Current Limit | Concern |
|---|---|---|
| GKE nodes | 3 per cluster | Low headroom for traffic spikes |
| Cloud SQL | 750 connections | Shared across 35 databases |
| RabbitMQ | 3 nodes, 3276Mi memory | Adequate for current load |
| PgBouncer | 3 replicas | Bottleneck if overwhelmed |
| NFS | 200Gi per tenant | Fixed allocation, may waste or exhaust |
Modernization Implications
Consolidation Opportunity: Shared Multi-Tenant Cluster
Current: 4 production GKE clusters (1 per tenant)
Option: 1 regional GKE cluster with namespace-per-tenant isolation

Benefits:
- ~60-70% infrastructure cost reduction (shared control plane, better node utilization)
- Simpler management (1 cluster to upgrade, 1 Istio mesh)
- Faster new-tenant onboarding

Requirements:
- Kubernetes ResourceQuota per namespace
- Istio AuthorizationPolicy for service isolation
- NetworkPolicy for pod-level segmentation
- Separate Cloud SQL instances per tenant (keep for data isolation)
Risk: Noisy neighbor issues, shared blast radius. Mitigated by resource quotas and Istio rate limiting.
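A hedged sketch of the namespace-level isolation those requirements imply, applied per tenant namespace (quota figures and namespace lists are illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: agilenetwork
spec:
  hard:
    requests.cpu: "24"        # illustrative ceilings
    requests.memory: 96Gi
    limits.cpu: "48"
    limits.memory: 180Gi
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: same-namespace-only
  namespace: agilenetwork
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["agilenetwork", "istio-system"]   # cross-tenant calls are denied by omission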
Infrastructure Upgrades Needed
- Regional GKE — Move from zonal to regional clusters for HA
- Regional Cloud SQL — Enable automatic failover
- Security hardening — Binary Authorization, Shielded Nodes, NetworkPolicies, tighter CORS
- Trivy enforcement — Fail builds on CRITICAL vulnerabilities
- Observability gaps — Alerting, SLOs, distributed tracing, APM enablement
- Logging modernization — Replace peeq-logging (Node.js Gen 1) with GCP Cloud Logging or upgrade to Elastic Cloud
- KEDA adoption — Replace custom rabbitmq-queue-monitor with KEDA ScaledObjects
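For the Trivy enforcement item, the change is mostly a matter of failing the job on findings; a hedged sketch using the public trivy-action (image path and step context are assumptions):
# Illustrative step — the existing security-trivy.yaml workflow is not reproduced here
- name: Scan image and fail on criticals
  uses: aquasecurity/trivy-action@master   # pin to a released version in practice
  with:
    image-ref: us-central1-docker.pkg.dev/favedom-dev/docker/celebrity:1.2.3   # assumed image path
    severity: CRITICAL
    exit-code: '1'                         # non-zero exit fails the build (scans currently exit 0)
    ignore-unfixed: true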
What Can Stay As-Is
- GitOps pipeline (ArgoCD + GitHub Actions) — mature and well-structured
- Terraform modules — well-factored, 19 reusable modules
- Common Helm chart — 33 templates, battle-tested at v0.0.179
- Secret management (GCP Secret Manager + AVP) — secure pattern
- CastAI — effective cost optimization
- Istio service mesh — provides routing, mTLS, observability
Last updated: 2026-01-30 — Session 8
Review by: 2026-04-30
Staleness risk: Medium — infrastructure configs and versions evolve frequently