Infrastructure & DevOps Architecture
Key Takeaways
- Fully GCP-native infrastructure — GKE clusters, Cloud SQL (PostgreSQL 16), Cloud DNS, GCS, Secret Manager, Artifact Registry. All provisioned via Terraform with Atlantis PR-based workflow.
- Dedicated cluster per tenant — 3 production brands (The Agile Network, NIL Game Plan, VT NIL) each get a separate GKE cluster, PostgreSQL instance (35 databases), RabbitMQ cluster (3 nodes), Redis, and Keycloak realm. A 4th brand (Speed of AI) also has a production cluster.
- Mature GitOps pipeline — ArgoCD manages all deployments. 28 reusable GitHub Actions workflows. Common Helm library chart (v0.0.179, 33 templates) standardizes all 48 service charts. Preview environments auto-created per PR with Istio subdomain routing.
- Multi-brand routing via Istio — Istio IngressGateway handles TLS termination, path-based routing (/api/{service}), mTLS between services, and per-tenant Gateway hosts. External-DNS and cert-manager automate DNS records and Let’s Encrypt certificates.
- No backend brand-specific logic in infrastructure — All tenant differentiation is via environment variables and values-globals.yaml. Same Helm charts, same Docker images, same service code deployed to all tenants. Confirms H11 across all 7 domains + infrastructure.
Migration Decision Question
What infrastructure changes are needed, and what’s the multi-brand routing mechanism?
Migration Verdict
- Upgrade Complexity: L
- Key Constraint: Cluster-per-tenant model is expensive but provides isolation; consolidation requires a namespace-level isolation strategy (NetworkPolicies, ResourceQuotas, Istio authorization).
- Dependencies: All application services depend on this infrastructure layer. Changes here affect all domains.
Infrastructure Inventory
Cloud Platform
| Component | Technology | Version/Config |
|---|---|---|
| Cloud Provider | Google Cloud Platform | 4 GCP projects |
| Container Orchestration | GKE | Kubernetes 1.30.4 |
| Service Mesh | Istio | IstioOperator CRD, Stackdriver tracing |
| Database | Cloud SQL PostgreSQL | PostgreSQL 16, db-custom-2-6656 |
| Message Queue | RabbitMQ | 3.13.7 (Bitnami), 3-node HA cluster |
| Cache | Redis | Master + replicas per tenant |
| Connection Pooling | PgBouncer | 3 replicas, 41 databases routed |
| DNS | Cloud DNS + external-dns | Automated record management |
| TLS | cert-manager + Let’s Encrypt | Automatic certificate provisioning |
| Secrets | GCP Secret Manager + AVP | ArgoCD Vault Plugin injection |
| Image Registry | Google Artifact Registry | Docker, Maven, Helm (OCI) |
| Cost Optimization | CastAI | Spot instances default, on-demand for critical services |
| IaC | Terraform 1.9.5 + Atlantis | PR-based infrastructure changes |
| GitOps | ArgoCD | Declarative Helm-based deployments |
| CI/CD | GitHub Actions | 28 reusable workflows, self-hosted runners |
| Analytics ETL | Airbyte | CDC replication for 20 databases |
| Data Warehouse | Snowflake | Per-tenant databases |
| Monitoring | kube-prometheus-stack | Prometheus + Grafana |
| Logging | Elasticsearch + Kibana 7.15.2 | peeq-logging (Node.js) aggregator |
| APM | Elastic APM | Disabled by default, available per-service |
| Session Replay | LogRocket | Frontend-only (admin, celeb, fan) |
| Security Scanning | Trivy + Qwiet (ShiftLeft) | Container + SAST scanning |
GCP Projects
| Project | Purpose |
|---|---|
| core-services-370815 | Centralized services: ArgoCD, Terraform state, shared secrets |
| production-370815 | Production clusters and databases |
| vz-development-381618 | Development environment |
| vz-staging-381618 | Staging environment |
| favedom-dev | Artifact Registry (Docker, Maven, Helm) |
Production Tenants
| Brand | Cluster | Domain | Namespace | Apps |
|---|---|---|---|---|
| The Agile Network | agilenetwork | theagilenetwork.com | agilenetwork | 52 |
| NIL Game Plan | nilgameplan | nilgameplan.com | nilgameplan | 52 |
| VT NIL | vtnil | vt.triumphnil.com | vt | 49 |
| Speed of AI | speedofai | (AI training vertical) | speedofai | 48 |
Development Tenants
| Brand | Cluster | Domain | Apps |
|---|---|---|---|
| FanFuze NIL | fanfuzenil | dev.fanfuzenil.com | 70 |
| Temp FanFuze | tmp-fanfuze | (temporary) | 47 |
Multi-Brand Routing Architecture
DNS-to-Backend Flow
graph TD
subgraph "Internet"
U1[User: theagilenetwork.com]
U2[User: nilgameplan.com]
U3[User: vt.triumphnil.com]
end
subgraph "GCP Cloud DNS"
DNS1[theagilenetwork.com → LB IP]
DNS2[nilgameplan.com → LB IP]
DNS3[vt.triumphnil.com → LB IP]
end
subgraph "GCP Load Balancer"
LB1[agilenetwork cluster LB]
LB2[nilgameplan cluster LB]
LB3[vtnil cluster LB]
end
subgraph "Istio per Cluster"
IG1[Istio IngressGateway]
IG2[Istio IngressGateway]
IG3[Istio IngressGateway]
end
subgraph "Kubernetes Namespaces"
NS1[agilenetwork namespace<br/>28 services + infra]
NS2[nilgameplan namespace<br/>28 services + infra]
NS3[vt namespace<br/>28 services + infra]
end
U1 --> DNS1 --> LB1 --> IG1 --> NS1
U2 --> DNS2 --> LB2 --> IG2 --> NS2
U3 --> DNS3 --> LB3 --> IG3 --> NS3
Istio Gateway Configuration
Each tenant has an Istio Gateway accepting traffic for its domains:
# Simplified from prod/agilenetwork/istio-gateway
servers:
- hosts: ['*.theagilenetwork.com', 'theagilenetwork.com']
port: { number: 443, protocol: HTTPS }
tls: { mode: SIMPLE, minProtocolVersion: TLSV1_2 }
- hosts: ['*.theagilenetwork.com', 'theagilenetwork.com']
port: { number: 80, protocol: HTTP }
tls: { httpsRedirect: true }
Per-Service Routing (VirtualService)
Every service gets path-based routing through the Istio gateway:
# Pattern for all services
spec:
hosts: ['theagilenetwork.com']
gateways: ['istio-system/istio-gateway']
http:
- match: [{ uri: { prefix: /api/celebrity } }]
route: [{ destination: { host: celebrity, port: { number: 8080 } } }]
Frontend Domain Mapping
| Subdomain | Purpose | Application |
|---|---|---|
| theagilenetwork.com | Fan-facing app | mono-web |
| instructor.theagilenetwork.com | Expert portal | celeb-fe |
| admin.theagilenetwork.com | Admin portal | admin-fe |
| identity.theagilenetwork.com | Keycloak login | identityx-26 |
| app.theagilenetwork.com | API gateway | All backend services |
Tenant Configuration (values-globals.yaml)
Each tenant has a values-globals.yaml with all tenant-specific config:
global:
cluster: agilenetwork
domain: theagilenetwork.com
env: prod
gcpProject: production-370815
secretManagerId: "564934788583"
tenant:
alias: "The Agile Network"
name: "agilenetwork"
namespace: "agilenetwork"
database:
hostname: pgbouncer
port: 5432
keycloak:
authServerUrl: "https://identity.theagilenetwork.com"
realm: agilenetwork
rabbitmq:
host: rabbitmq
redis:
master: { host: redis-master, port: 6379 }
All services read from global.* — no brand-specific business logic anywhere.
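For illustration, a common-chart environment template could map these global.* values directly to container environment variables — the snippet below is a hypothetical sketch in the style of the common chart’s _env.tpl, not the actual template:
# Hypothetical excerpt (names illustrative, not copied from the common chart)
env:
  - name: TENANT_NAME
    value: {{ .Values.global.tenant.name | quote }}
  - name: KEYCLOAK_AUTH_SERVER_URL
    value: {{ .Values.global.keycloak.authServerUrl | quote }}
  - name: SPRING_DATASOURCE_URL
    value: "jdbc:postgresql://{{ .Values.global.database.hostname }}:{{ .Values.global.database.port }}/{{ .Chart.Name }}"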
GKE Cluster Configuration
Node Pools (Production)
| Setting | Value |
|---|---|
| Machine Type | n1-standard-8 (8 vCPU, 30 GB RAM) |
| Disk | 100 GB standard persistent |
| Image | COS_CONTAINERD |
| Autoscaling | 0-3 nodes per pool |
| Preemptible | Yes (CastAI manages spot) |
| cgroup | CGROUP_MODE_V2 |
| Max Pods/Node | 110 |
| IP Aliasing | Pod range /16, Service range /22 |
| Workload Identity | Enabled |
| Maintenance | Weekends, 3-9 AM UTC |
| Location | us-central1-a (zonal) |
Cluster Features
- Workload Identity: All GCP API calls use Kubernetes SA → GCP SA binding (no JSON keys)
- Network Policy: Disabled (relying on Istio for service-to-service authorization)
- Release Channel: UNSPECIFIED (manual version control)
- Monitoring: Cloud Monitoring + Cloud Logging enabled
- Filestore CSI: Available for NFS volumes
CastAI Cost Optimization
- Default: Spot instances for all workloads
- On-demand exceptions: admin-fe, celeb-fe, celebrity, class-catalog, content, email, fan, group-profile, identityx, inventory, media, message-board, mono-web, notifications, rabbitmq, shoutout, shoutout-bpm, sms, sse, stripe, subscriptions, tags, transaction, users, wallet
- Node constraints: 2-8 CPU cores, min 30GB disk
- Evictor: Aggressive mode in dev, conservative in prod
- Impact: Most critical services forced to on-demand nodes, reducing cost savings
Database Infrastructure
Cloud SQL Configuration
| Setting | Value |
|---|---|
| Engine | PostgreSQL 16 |
| Tier | db-custom-2-6656 (2 vCPU, 6.5 GB) |
| Availability | ZONAL (single zone) |
| Backup | Enabled, point-in-time recovery |
| Private IP | Via VPC peering |
| Public IP | Only when Airbyte enabled |
| Max Connections | 750 |
| Deletion Protection | Enabled |
| IAM Auth | Enabled |
Databases per Tenant (35 PostgreSQL)
celebrity, chat, class-catalog, content, email, fan, group-profile, identityx_26, inventory, journey, media, message_board, notification_service, org_manager, purchase_request_bpm, reporting, search, shoutout, shoutout_bpm, sms, sse, stream, stripe, subscriptions, tags, tracking, transaction, wallet, webinar, superset
In addition, VT NIL has a separate Cloud SQL MySQL instance for its tracking database.
PgBouncer (Connection Pooling)
- Replicas: 3 per tenant
- Databases routed: 41 (all services connect via PgBouncer)
- Resources: 200m-500m CPU, 192Mi memory
- Pattern: All services connect to pgbouncer:5432, not Cloud SQL directly
- Init containers: Every service waits for PgBouncer before starting
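A minimal sketch of such a wait-for-PgBouncer init container, assuming a busybox image (the real logic lives in the common chart’s _postgres.tpl and is not reproduced here):
# Hypothetical init container definition
initContainers:
  - name: wait-for-pgbouncer
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z pgbouncer 5432; do echo waiting for pgbouncer; sleep 2; done']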
Airbyte CDC Replication
20 databases have Change Data Capture replication to Snowflake:
- Sources: PostgreSQL Cloud SQL (public IP required)
- Destination: Snowflake data warehouse (org: TSGCLBT, account: PM66380)
- Per-tenant Snowflake databases (e.g., AGILENETWORK_DB)
CI/CD Pipeline
GitHub Actions (28 Reusable Workflows)
graph TD
subgraph "Developer Workflow"
PR[Open PR] --> BUILD
end
subgraph "Build Phase"
BUILD[GitHub Actions] --> MVN[Maven Build + Tests]
BUILD --> DOCKER[Docker Build<br/>Multi-arch: amd64 + arm64]
end
subgraph "Security Phase"
DOCKER --> TRIVY[Trivy Container Scan<br/>CRITICAL + HIGH]
DOCKER --> QWIET[Qwiet SAST<br/>Code analysis]
end
subgraph "Artifact Phase"
MVN --> GAR_MVN[Push to GAR Maven]
DOCKER --> GAR_DOCKER[Push to GAR Docker]
DOCKER --> HELM[Generate Helm Chart]
HELM --> GAR_HELM[Push to GAR Helm OCI]
end
subgraph "Deploy Phase"
GAR_DOCKER --> |PR| PREVIEW[Preview Environment<br/>ArgoCD Previews]
GAR_DOCKER --> |Master| PROD_UPDATE[Update argocd-deployments<br/>image.tag in values.yaml]
PROD_UPDATE --> ARGOCD[ArgoCD Sync<br/>Helm Release Update]
end
subgraph "Preview Lifecycle"
PREVIEW --> PR_COMMENT[Bot: Preview URL<br/>on PR comment]
PR_COMMENT --> CLEANUP[PR Closed → Delete<br/>Preview Namespace]
end
Key Workflow Files
| Workflow | Purpose |
|---|---|
| build-maven-docker.yaml | Spring Boot service build + Docker |
| build-node-docker.yaml | Node.js/Angular build + Docker |
| build-pnpm-docker.yaml | pnpm frontend builds |
| deploy-argocd-env.yaml | Update ArgoCD env with new version |
| deploy-helm-preview.yaml | Create PR preview environment |
| cleanup-preview-env.yaml | Delete preview namespace on PR close |
| security-trivy.yaml | Container vulnerability scanning |
| security-qwiet.yaml | Static application security testing |
| cloud-run-job-flyway.yaml | Database migration via Cloud Run |
| lint-sql.yaml | SQL linting for Flyway migrations |
Self-Hosted Runner
- Base: ghcr.io/actions/actions-runner:2.322.0
- Mode: Docker-in-Docker (DinD) on GKE
- Pre-installed: Java 21 (Corretto), Maven 3.9.9, Node 20, Helm 3.17.1, GitHub CLI
- Scaling: ARC (Actions Runner Controller), 0-5 runners
Preview Environment Flow
- PR opened → GitHub Actions builds Docker image with PR-specific tag
- Helm chart generated with preview values (namespace: {service}-pr-{number})
- Application YAML pushed to argocd-previews repo
- ArgoCD syncs preview application
- Secrets copied from fanfuze namespace to preview namespace
- Bot comments preview URL: {service}-pr-{number}.dev.fanfuzenil.com
- PR closed → Cleanup workflow deletes namespace
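The Application YAML pushed to argocd-previews plausibly follows the standard Argo CD shape; the manifest below is a sketch with assumed project, registry registration, and naming — not the generated file:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: celebrity-pr-123               # assumed {service}-pr-{number} naming
  namespace: argocd
spec:
  project: default                     # assumption
  source:
    repoURL: us-central1-docker.pkg.dev/favedom-dev/helm   # GAR Helm OCI registry (assumed to be registered in Argo CD)
    chart: celebrity
    targetRevision: 1.2.3-pr.123.4.1   # PR version pattern from Version Management below
    helm:
      parameters:
        - name: image.tag
          value: 1.2.3-pr.123.4.1
  destination:
    server: https://kubernetes.default.svc
    namespace: celebrity-pr-123
  syncPolicy:
    automated:
      prune: true
      selfHeal: true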
Version Management
- Semantic versioning: {major}.{minor}.{patch} from git tags
- PR versions: {version}-pr.{pr_number}.{run_number}.{attempt}
- Chart versions: Independent of application versions (common chart at 0.0.179)
Helm Chart Architecture
Common Library Chart (v0.0.179)
33 reusable templates standardizing all service deployments:
| Template | Purpose |
|---|---|
| _deployment-java.tpl | Java/Spring Boot deployments |
| _deployment-node.tpl | Node.js deployments |
| _deployment-fe.tpl | Frontend (Angular/Ionic) deployments |
| _istio.tpl | VirtualService + DestinationRule |
| _hpa.tpl | Horizontal Pod Autoscaler |
| _postgres.tpl | Database connection env vars + init container |
| _keycloak.tpl | Keycloak integration env vars |
| _env.tpl | Common environment variables |
| _secret-postgres.yaml | PostgreSQL secret template |
| _pdb.yaml | PodDisruptionBudget |
| _keda.yaml | KEDA event-driven autoscaling |
| _servicemonitor.tpl | Prometheus ServiceMonitor |
| _canary.tpl | Flagger canary deployments |
| _apm.tpl | APM sidecar injection |
| _kubefledged.tpl | Image pre-loading |
Service Chart Pattern
Each service chart is minimal — delegates to common chart:
# Chart.yaml
dependencies:
- name: common
version: 0.0.179
repository: "oci://us-central1-docker.pkg.dev/favedom-dev/helm"
# templates/deployment.yaml
{{ include "common.deployment.java" . }}
# templates/istio.yaml
{{ include "common.istio" . }}
Feature flags control what each service gets:
# values.yaml (per service)
keycloak: { enabled: true }
postgres: { enabled: true }
rabbitmq: { enabled: true }
prometheus: { enabled: true }
hpa: { enabled: false } # overridden per tenant
All Helm Charts (48)
Application Services (30): celebrity, fan, users, content, media, shoutout, shoutout-bpm, webinar, chat, message-board, notifications, email, sms, sse, inventory, journey, class-catalog, purchase-request-bpm, transaction, wallet, subscriptions, stripe, search, tags, tracking, reporting, org-manager, group-profile, onsite-event, athlete-manager
Frontends (5): mono-web, admin-fe, celeb-fe, org-dashboard-fe, nilgp-partnerportal-fe
Infrastructure (13): common (library), pgbouncer, rabbitmq-queue-monitor, flyway, shared-secrets, stackhawk, argocd-reports, site-maintenance, node-tracking, nilgp-partnerportal-be, test-spring-boot-app, plus preview configs
Secret Management
Architecture (3-Tier)
graph LR
subgraph "Source of Truth"
GSM[GCP Secret Manager<br/>100+ secrets per tenant]
end
subgraph "Injection Layer"
AVP[ArgoCD Vault Plugin<br/>Fetches at sync time]
end
subgraph "Runtime"
K8S[Kubernetes Secrets<br/>Mounted in pods]
end
subgraph "Provisioning"
TF[Terraform<br/>Creates secret shells]
end
TF --> GSM
GSM --> AVP
AVP --> K8S
Secret Naming Convention
{tenant}_{VENDOR}_{APP}_{SECRET_NAME}
Examples:
- agilenetwork_STRIPE_PAYMENT_KEY
- agilenetwork_KEYCLOAK_FAN_CLIENTID
- agilenetwork_RABBITMQ_PASSWORD
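With the ArgoCD Vault Plugin’s GCP Secret Manager backend, charts reference these secrets through path placeholders resolved at sync time. A hedged example — the secret name comes from the list above, the project number from values-globals.yaml, and the manifest shape is assumed:
# Kubernetes Secret template with an AVP placeholder (illustrative)
apiVersion: v1
kind: Secret
metadata:
  name: stripe
type: Opaque
stringData:
  STRIPE_PAYMENT_KEY: <path:projects/564934788583/secrets/agilenetwork_STRIPE_PAYMENT_KEY#latest>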
Integrated Services (20+ secret modules)
| Service | Secrets |
|---|---|
| Stripe | Payment key, webhook signing secret |
| Keycloak | Multiple realm configs, DB credentials |
| RabbitMQ | User, password, Erlang cookie |
| Redis | Password |
| Twilio | Account SID, auth token |
| Mandrill | API key |
| Mux | Token ID, secret |
| Stream Chat | API key, secret |
| Mailchimp | API key |
| Dwolla | API key, secret, funding source, webhook secret |
| Zoom | API key, secret |
| Intercom | API token |
| Snowflake | Account, user, password, warehouse |
| Auth0 | Client ID, secret, domain |
| Plaid | Client ID, secret |
| Grafana | Admin credentials |
| Elasticsearch/APM | Connection credentials |
CI/CD Secrets
- Workload Identity Federation for GCP authentication (no service account keys)
- CI/CD secrets stored with cicd_ prefix in the core-services-370815 project
- Fetched at runtime by GitHub Actions, never stored in GitHub
Terraform Infrastructure-as-Code
Module Inventory (19 modules)
| Module | Purpose |
|---|---|
| gke-cluster | GKE cluster provisioning |
| postgres-instance | Cloud SQL PostgreSQL instance |
| postgres-db | Individual database creation |
| mysql-instance | Cloud SQL MySQL instance |
| mysql-db | MySQL database creation |
| sql-user | Database user creation |
| dns | Cloud DNS + external-dns + cert-manager |
| secrets | 20+ integration secret shells |
| secrets-powervz | PowerVZ-specific secrets |
| castai | CastAI cost optimization |
| argocd | Cross-project ArgoCD IAM |
| airbyte | Airbyte IP allowlisting |
| airbyte-connection | ETL connection configs |
| airbyte-source | PostgreSQL CDC sources |
| airbyte-destination | Snowflake destinations |
| snowflake | Snowflake user/database |
| filestore | GCP Filestore (NFS) |
| logging-exclusions | Cost-saving log filters |
| keycloak-google-idp | Google IDP for Keycloak |
State Management
- Backend: GCS bucket terraform-state-370815
- State files: Per-environment (core-services/, development/, production/)
- Locking: GCS-native state locking
- CI/CD: Atlantis (PR-based plan/apply, auto-merge on success)
Environment Setup
Secrets injected via tf_setup.sh:
export TF_VAR_castai_api_token=$(gcloud secrets versions access latest \
--secret="core_CASTAI_API_TOKEN")
Monitoring & Observability
Current Stack
| Layer | Tool | Status |
|---|---|---|
| Metrics | kube-prometheus-stack | Deployed to all tenants |
| Service Monitors | Prometheus ServiceMonitor | All Java services expose /actuator/prometheus |
| Dashboards | Grafana (via prometheus-stack) | Deployed |
| Logging (collection) | peeq-logging (Node.js) | Gen 1 (Node.js/Express → Elasticsearch) |
| Logging (storage) | Elasticsearch | Cloud-hosted (port 9243, HTTPS) |
| Logging (visualization) | Kibana 7.15.2 | Deployed |
| Tracing | Istio → Stackdriver | Zipkin integration configured |
| APM | Elastic APM 1.25.2 | Available but disabled by default |
| Session Replay | LogRocket | Frontend only (admin, celeb, fan) |
| Analytics | Superset | Connected to PostgreSQL reporting DB |
| Cloud Logging | GCP Cloud Logging | GKE-native integration |
| Cloud Monitoring | GCP Cloud Monitoring | GKE-native integration |
Observability Gaps
- No centralized alerting visible in infrastructure repos (no PagerDuty, Opsgenie, or Slack alert configs in Terraform/Helm)
- APM disabled by default — transaction sampling at 50% when enabled, but not turned on
- Logging pipeline is Gen 1 — peeq-logging is Node.js/Express, and not merely by its peeq-* naming: Gen 1 patterns include a Node 9 Dockerfile, Express, and no TypeScript build in the image
- No SLO/SLI definitions in Helm charts or monitoring config
- Kibana dashboards not defined in code — likely manual creation
- No distributed tracing adoption — Zipkin address configured in Istio but no evidence of service-level trace instrumentation
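Since kube-prometheus-stack is already deployed, codified alerting could start as PrometheusRule resources shipped with the charts. A minimal sketch — the metric, threshold, and labels are assumptions, and nothing like this exists in the repos today:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-availability
  labels:
    release: kube-prometheus-stack     # assumed rule-selector label
spec:
  groups:
    - name: availability
      rules:
        - alert: HighHttp5xxRate
          expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (service) > 1
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "{{ $labels.service }} is returning 5xx responses"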
Storage Infrastructure
NFS (ReadWriteMany)
4 Persistent Volume Claims per tenant via NFS provisioner:
| PVC | Size | Purpose |
|---|---|---|
| pvc-content | 50Gi | Content service file storage |
| pvc-media | 50Gi | Media service file storage |
| pvc-shoutout | 50Gi | Shoutout video storage |
| pvc-stream | 50Gi | Streaming assets |
All use the nfs-client StorageClass via a per-tenant NFS provisioner.
GCS Buckets
2 shared buckets with tenant-specific directories:
- public-assets-{hex}: Public-facing assets (images, videos). Public read access (allUsers).
- backend-assets-{hex}: Backend-generated assets. Public read access.
Both have versioning enabled and deletion protection.
Operational Tooling
ArgoCD Scripts (24 utilities)
| Script | Purpose |
|---|---|
| check-app-version.sh | Compare deployed versions across tenants |
| check-secrets.sh | Validate secret availability |
| set-replicas.sh | Scale services up/down |
| rabbitmq-reports.sh | RabbitMQ health and metrics |
| pgbouncer/reload.sh | Hot-reload PgBouncer configuration |
| restart-pods/ | Rolling restart utilities |
| diff_promote/ | Diff configs between environments |
| update-tags.sh | Batch update image tags |
| create-changelog.sh | Generate deployment changelogs |
DevOps Utilities (devops-utlities repo)
| Subsystem | Purpose |
|---|---|
| db/ | Database dumps, restores, BPM exports, vacuum |
| github/ | Mass PR merging, repo migration (multi-gitter) |
| jdk-update/ | Platform-wide JDK upgrades via multi-gitter |
| update-mvn-deps/ | Maven dependency propagation |
| graphql-migration/ | Spring Boot 3.x + GraphQL upgrade scripts |
| google-artifact-repo/ | Container registry management (gcrane) |
| vpa/ | Vertical Pod Autoscaler analysis for cost optimization |
| env-start-stop/ | Multi-tenant cluster startup/shutdown |
| keycloak/ | Realm export utilities |
RabbitMQ Queue Monitor
- Stack: Bash + curl + jq + kubectl on Debian slim
- Monitors: 27 services mapped to Kubernetes deployments
- Purpose: Message-driven autoscaling (precursor to KEDA)
- Maps: Queue names → deployment names for scaling decisions
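The KEDA replacement suggested under Modernization Implications would express the same queue-to-deployment mapping declaratively. A sketch of one ScaledObject — queue name, thresholds, and auth wiring are assumptions:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: shoutout-bpm
spec:
  scaleTargetRef:
    name: shoutout-bpm                  # target Deployment (illustrative)
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: rabbitmq
      metadata:
        mode: QueueLength
        value: "50"                     # assumed backlog threshold
        queueName: shoutout-bpm-queue   # assumed queue name
      authenticationRef:
        name: rabbitmq-trigger-auth     # TriggerAuthentication holding the AMQP connection string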
Security Architecture
Authentication & Authorization
graph TD
subgraph "External"
USER[User Browser]
end
subgraph "GCP Edge"
LB[GCP Load Balancer<br/>TLS Termination]
end
subgraph "Service Mesh"
IG[Istio IngressGateway<br/>Host-based routing]
MTLS[Istio mTLS<br/>Pod-to-pod encryption]
end
subgraph "Identity"
KC[Keycloak 26.3<br/>OAuth2/OIDC]
end
subgraph "Services"
SVC[Spring Boot Service<br/>JWT validation via issuer URI]
end
USER -->|HTTPS| LB --> IG
IG -->|/api/*| SVC
IG -->|/auth/*| KC
USER -->|OAuth2 flow| KC
KC -->|JWT| USER
USER -->|Bearer token| IG
SVC -.->|mTLS| MTLS -.-> SVC
Security Controls
| Control | Implementation |
|---|---|
| TLS in transit | Let’s Encrypt certs via cert-manager, Istio mTLS |
| Workload Identity | Kubernetes SA → GCP SA (no JSON keys) |
| Secret storage | GCP Secret Manager (encrypted at rest, IAM-controlled) |
| Container scanning | Trivy (CRITICAL + HIGH severity) |
| SAST | Qwiet/ShiftLeft (Java + JavaScript) |
| Database access | Private IP via VPC peering, Cloud SQL IAM auth |
| Deletion protection | GKE clusters + Cloud SQL instances |
| CORS | Istio VirtualService (allows all origins — broad) |
Security Gaps
- Trivy scans don’t fail builds — exit code 0, vulnerabilities tracked but not enforced
- No Binary Authorization — unsigned container images can deploy
- No GKE Shielded Nodes — secure boot not enabled
- No NetworkPolicies — relying on Istio alone for pod-level isolation
- Public GCS buckets — allUsers read access on asset buckets
- CORS allows all origins — allowOrigins: regex: .* in Istio VirtualService
- Developer IPs in authorized networks — direct DB access via individual IPs (dev/staging)
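Closing the NetworkPolicy gap would first require enabling enforcement on GKE (it is currently disabled per Cluster Features above); a baseline default-deny policy per tenant namespace might then look like this sketch:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: agilenetwork
spec:
  podSelector: {}                       # applies to every pod in the namespace
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system   # allow traffic arriving via the IngressGateway
        - podSelector: {}                                 # allow intra-namespace (mesh) traffic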
Gen 1 Infrastructure Services
| Service | Stack | Status | Replacement |
|---|---|---|---|
| peeq-logging | Node.js/Express → Elasticsearch | Active (Gen 1 patterns) | Upgrade or replace with GCP Cloud Logging |
| peeq-kibana-deploy | Kibana 7.15.2 wrapper | Active | Upgrade Kibana or migrate to Elastic Cloud |
| peeq-shared-secret | Java 11/SB 2.4.3 → GCP Secret Manager | Unclear if active | ArgoCD Vault Plugin may have replaced this |
| devops-utlities | Bash scripts + multi-gitter | Active (operational) | Keep as tooling |
Infrastructure Risk Assessment
High Risk
- Zonal GKE clusters — Single zone (us-central1-a). Zone failure = complete tenant outage. No regional cluster configuration. RPO depends on Cloud SQL backup frequency.
- Zonal Cloud SQL — ZONAL availability type. No automatic failover. Manual recovery required from point-in-time backup.
- Single region — All infrastructure in us-central1. No cross-region DR.
Medium Risk
- Cluster-per-tenant cost — 4 production clusters each with 0-3 n1-standard-8 nodes + Cloud SQL + RabbitMQ + Redis. Infrastructure cost scales linearly with tenants.
- CastAI on-demand overrides — 25+ services forced to on-demand nodes, reducing spot savings significantly.
- PgBouncer as SPOF — If PgBouncer pods fail, all database connections fail. 3 replicas provide some redundancy but no circuit breaking.
Low Risk
- Common chart coupling — All 48 charts depend on common v0.0.179. Breaking change affects everything. Mitigated by chart versioning.
- ArgoCD centralization — Single ArgoCD instance manages all clusters from core-services project. ArgoCD failure blocks all deployments.
- Helm OCI registry — All charts stored in single Artifact Registry. Registry outage blocks deployments but not running services.
Inter-Service Communication (Infrastructure View)
Synchronous
All service-to-service calls route through the Istio mesh:
- Path-based routing: http://{service}:8080/api/{service}/...
- mTLS encryption between all pods
- CORS configured at Istio level (broad: all origins)
Asynchronous
RabbitMQ 3.13.7 cluster per tenant:
- 3 nodes, persistent storage (20Gi)
- Memory high watermark: 3276Mi
- On-demand nodes (no spot eviction)
- Safe-to-evict: false annotation
- Definitions loaded from Kubernetes secrets
Database Connections
All services → PgBouncer (3 replicas) → Cloud SQL (private IP):
- JDBC URL: jdbc:postgresql://pgbouncer:5432/{service-name}
- Credentials: Kubernetes Secret (from GCP Secret Manager via AVP)
- Init container waits for PgBouncer availability
Data Model (Infrastructure)
Resource Allocation per Tenant
| Resource | Per Tenant | Total (4 prod) |
|---|---|---|
| GKE Cluster | 1 | 4 |
| GKE Nodes (max) | 3 × n1-standard-8 | 12 nodes, 96 vCPU, 360 GB RAM |
| Cloud SQL Instance | 1 (2 vCPU, 6.5 GB) | 4 instances |
| PostgreSQL Databases | 35 | 140 |
| Max DB Connections | 750 | 3,000 |
| RabbitMQ Nodes | 3 | 12 |
| Redis Instances | 1 (master + replicas) | 4 |
| PgBouncer Replicas | 3 | 12 |
| NFS PVCs | 4 × 50Gi = 200Gi | 800Gi |
| GCS Buckets | 2 (shared) | 2 (shared) |
Scaling Boundaries
| Component | Current Limit | Concern |
|---|---|---|
| GKE nodes | 3 per cluster | Low headroom for traffic spikes |
| Cloud SQL | 750 connections | Shared across 35 databases |
| RabbitMQ | 3 nodes, 3276Mi memory | Adequate for current load |
| PgBouncer | 3 replicas | Bottleneck if overwhelmed |
| NFS | 200Gi per tenant | Fixed allocation, may waste or exhaust |
Modernization Implications
Consolidation Opportunity: Shared Multi-Tenant Cluster
Current: 4 production GKE clusters (1 per tenant)
Option: 1 regional GKE cluster with namespace-per-tenant isolation

Benefits:
- ~60-70% infrastructure cost reduction (shared control plane, better node utilization)
- Simpler management (1 cluster to upgrade, 1 Istio mesh)
- Faster new-tenant onboarding

Requirements:
- Kubernetes ResourceQuota per namespace
- Istio AuthorizationPolicy for service isolation
- NetworkPolicy for pod-level segmentation
- Separate Cloud SQL instances per tenant (keep for data isolation)
Risk: Noisy neighbor issues, shared blast radius. Mitigated by resource quotas and Istio rate limiting.
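A hedged sketch of the namespace-level isolation those requirements imply, applied per tenant namespace (quota figures and namespace lists are illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: agilenetwork
spec:
  hard:
    requests.cpu: "24"        # illustrative ceilings
    requests.memory: 96Gi
    limits.cpu: "48"
    limits.memory: 180Gi
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: same-namespace-only
  namespace: agilenetwork
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["agilenetwork", "istio-system"]   # cross-tenant calls are denied by omission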
Infrastructure Upgrades Needed
- Regional GKE — Move from zonal to regional clusters for HA
- Regional Cloud SQL — Enable automatic failover
- Security hardening — Binary Authorization, Shielded Nodes, NetworkPolicies, tighter CORS
- Trivy enforcement — Fail builds on CRITICAL vulnerabilities
- Observability gaps — Alerting, SLOs, distributed tracing, APM enablement
- Logging modernization — Replace peeq-logging (Node.js Gen 1) with GCP Cloud Logging or upgrade to Elastic Cloud
- KEDA adoption — Replace custom rabbitmq-queue-monitor with KEDA ScaledObjects
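For the Trivy enforcement item, the change is mostly a matter of failing the job on findings; a hedged sketch using the public trivy-action (image path and step context are assumptions):
# Illustrative step — the existing security-trivy.yaml workflow is not reproduced here
- name: Scan image and fail on criticals
  uses: aquasecurity/trivy-action@master   # pin to a released version in practice
  with:
    image-ref: us-central1-docker.pkg.dev/favedom-dev/docker/celebrity:1.2.3   # assumed image path
    severity: CRITICAL
    exit-code: '1'                         # non-zero exit fails the build (scans currently exit 0)
    ignore-unfixed: true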
What Can Stay As-Is
- GitOps pipeline (ArgoCD + GitHub Actions) — mature and well-structured
- Terraform modules — well-factored, 19 reusable modules
- Common Helm chart — 33 templates, battle-tested at v0.0.179
- Secret management (GCP Secret Manager + AVP) — secure pattern
- CastAI — effective cost optimization
- Istio service mesh — provides routing, mTLS, observability
Last updated: 2026-01-30 — Session 8
Review by: 2026-04-30
Staleness risk: Medium — infrastructure configs and versions evolve frequently