Infrastructure & DevOps Architecture

Last updated: 2026-02-01
Key Takeaways

  1. Fully GCP-native infrastructure — GKE clusters, Cloud SQL (PostgreSQL 16), Cloud DNS, GCS, Secret Manager, Artifact Registry. All provisioned via Terraform with Atlantis PR-based workflow.
  2. Dedicated cluster per tenant — 3 production brands (The Agile Network, NIL Game Plan, VT NIL) each get a separate GKE cluster, PostgreSQL instance (35 databases), RabbitMQ cluster (3 nodes), Redis, and Keycloak realm. A 4th brand (Speed of AI) also has a production cluster.
  3. Mature GitOps pipeline — ArgoCD manages all deployments. 28 reusable GitHub Actions workflows. Common Helm library chart (v0.0.179, 33 templates) standardizes all 48 service charts. Preview environments auto-created per PR with Istio subdomain routing.
  4. Multi-brand routing via Istio — Istio IngressGateway handles TLS termination, path-based routing (/api/{service}), mTLS between services, and per-tenant Gateway hosts. External-DNS and cert-manager automate DNS records and Let’s Encrypt certificates.
  5. No backend brand-specific logic in infrastructure — All tenant differentiation is via environment variables and values-globals.yaml. Same Helm charts, same Docker images, same service code deployed to all tenants. Confirms H11 across all 7 domains + infrastructure.

Migration Decision Question

What infrastructure changes are needed, and what’s the multi-brand routing mechanism?

Migration Verdict

Upgrade Complexity: L
Key Constraint: The cluster-per-tenant model is expensive but provides isolation; consolidation requires a namespace-level isolation strategy (NetworkPolicies, ResourceQuotas, Istio authorization).
Dependencies: All application services depend on this infrastructure layer; changes here affect all domains.

Infrastructure Inventory

Cloud Platform

Component Technology Version/Config
Cloud Provider Google Cloud Platform 4 GCP projects
Container Orchestration GKE Kubernetes 1.30.4
Service Mesh Istio IstioOperator CRD, Stackdriver tracing
Database Cloud SQL PostgreSQL PostgreSQL 16, db-custom-2-6656
Message Queue RabbitMQ 3.13.7 (Bitnami), 3-node HA cluster
Cache Redis Master + replicas per tenant
Connection Pooling PgBouncer 3 replicas, 41 databases routed
DNS Cloud DNS + external-dns Automated record management
TLS cert-manager + Let’s Encrypt Automatic certificate provisioning
Secrets GCP Secret Manager + AVP ArgoCD Vault Plugin injection
Image Registry Google Artifact Registry Docker, Maven, Helm (OCI)
Cost Optimization CastAI Spot instances default, on-demand for critical services
IaC Terraform 1.9.5 + Atlantis PR-based infrastructure changes
GitOps ArgoCD Declarative Helm-based deployments
CI/CD GitHub Actions 28 reusable workflows, self-hosted runners
Analytics ETL Airbyte CDC replication for 20 databases
Data Warehouse Snowflake Per-tenant databases
Monitoring kube-prometheus-stack Prometheus + Grafana
Logging Elasticsearch + Kibana 7.15.2 peeq-logging (Node.js) aggregator
APM Elastic APM Disabled by default, available per-service
Session Replay LogRocket Frontend-only (admin, celeb, fan)
Security Scanning Trivy + Qwiet (ShiftLeft) Container + SAST scanning

GCP Projects

Project Purpose
core-services-370815 Centralized services: ArgoCD, Terraform state, shared secrets
production-370815 Production clusters and databases
vz-development-381618 Development environment
vz-staging-381618 Staging environment
favedom-dev Artifact Registry (Docker, Maven, Helm)

Production Tenants

Brand Cluster Domain Namespace Apps
The Agile Network agilenetwork theagilenetwork.com agilenetwork 52
NIL Game Plan nilgameplan nilgameplan.com nilgameplan 52
VT NIL vtnil vt.triumphnil.com vt 49
Speed of AI speedofai (AI training vertical) speedofai 48

Development Tenants

Brand Cluster Domain Apps
FanFuze NIL fanfuzenil dev.fanfuzenil.com 70
Temp FanFuze tmp-fanfuze (temporary) 47

Multi-Brand Routing Architecture

DNS-to-Backend Flow

graph TD
    subgraph "Internet"
        U1[User: theagilenetwork.com]
        U2[User: nilgameplan.com]
        U3[User: vt.triumphnil.com]
    end

    subgraph "GCP Cloud DNS"
        DNS1[theagilenetwork.com → LB IP]
        DNS2[nilgameplan.com → LB IP]
        DNS3[vt.triumphnil.com → LB IP]
    end

    subgraph "GCP Load Balancer"
        LB1[agilenetwork cluster LB]
        LB2[nilgameplan cluster LB]
        LB3[vtnil cluster LB]
    end

    subgraph "Istio per Cluster"
        IG1[Istio IngressGateway]
        IG2[Istio IngressGateway]
        IG3[Istio IngressGateway]
    end

    subgraph "Kubernetes Namespaces"
        NS1[agilenetwork namespace<br/>28 services + infra]
        NS2[nilgameplan namespace<br/>28 services + infra]
        NS3[vt namespace<br/>28 services + infra]
    end

    U1 --> DNS1 --> LB1 --> IG1 --> NS1
    U2 --> DNS2 --> LB2 --> IG2 --> NS2
    U3 --> DNS3 --> LB3 --> IG3 --> NS3

Istio Gateway Configuration

Each tenant has an Istio Gateway accepting traffic for its domains:

# Simplified from prod/agilenetwork/istio-gateway
servers:
  - hosts: ['*.theagilenetwork.com', 'theagilenetwork.com']
    port: { number: 443, protocol: HTTPS }
    tls: { mode: SIMPLE, minProtocolVersion: TLSV1_2 }
  - hosts: ['*.theagilenetwork.com', 'theagilenetwork.com']
    port: { number: 80, protocol: HTTP }
    tls: { httpsRedirect: true }

Per-Service Routing (VirtualService)

Every service gets path-based routing through the Istio gateway:

# Pattern for all services
spec:
  hosts: ['theagilenetwork.com']
  gateways: ['istio-system/istio-gateway']
  http:
    - match: [{ uri: { prefix: /api/celebrity } }]
      route: [{ destination: { host: celebrity, port: { number: 8080 } } }]

Frontend Domain Mapping

Subdomain Purpose Application
theagilenetwork.com Fan-facing app mono-web
instructor.theagilenetwork.com Expert portal celeb-fe
admin.theagilenetwork.com Admin portal admin-fe
identity.theagilenetwork.com Keycloak login identityx-26
app.theagilenetwork.com API gateway All backend services

Tenant Configuration (values-globals.yaml)

Each tenant has a values-globals.yaml with all tenant-specific config:

global:
  cluster: agilenetwork
  domain: theagilenetwork.com
  env: prod
  gcpProject: production-370815
  secretManagerId: "564934788583"
  tenant:
    alias: "The Agile Network"
    name: "agilenetwork"
    namespace: "agilenetwork"
  database:
    hostname: pgbouncer
    port: 5432
  keycloak:
    authServerUrl: "https://identity.theagilenetwork.com"
    realm: agilenetwork
  rabbitmq:
    host: rabbitmq
  redis:
    master: { host: redis-master, port: 6379 }

All services read from global.* — no brand-specific business logic anywhere.
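
For illustration, a minimal sketch of how a common-chart template might map these globals onto container environment variables (a hypothetical excerpt in the spirit of _env.tpl, not the actual template):

# Hypothetical Helm template fragment (assumption, modeled on _env.tpl)
env:
  - name: KEYCLOAK_AUTH_SERVER_URL
    value: {{ .Values.global.keycloak.authServerUrl | quote }}
  - name: KEYCLOAK_REALM
    value: {{ .Values.global.keycloak.realm | quote }}
  - name: DB_HOST
    value: {{ .Values.global.database.hostname | quote }}
  - name: RABBITMQ_HOST
    value: {{ .Values.global.rabbitmq.host | quote }}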

GKE Cluster Configuration

Node Pools (Production)

Setting Value
Machine Type n1-standard-8 (8 vCPU, 30 GB RAM)
Disk 100 GB standard persistent
Image COS_CONTAINERD
Autoscaling 0-3 nodes per pool
Preemptible Yes (CastAI manages spot)
cgroup CGROUP_MODE_V2
Max Pods/Node 110
IP Aliasing Pod range /16, Service range /22
Workload Identity Enabled
Maintenance Weekends, 3-9 AM UTC
Location us-central1-a (zonal)

Cluster Features

CastAI Cost Optimization

Database Infrastructure

Cloud SQL Configuration

Setting Value
Engine PostgreSQL 16
Tier db-custom-2-6656 (2 vCPU, 6.5 GB)
Availability ZONAL (single zone)
Backup Enabled, point-in-time recovery
Private IP Via VPC peering
Public IP Only when Airbyte enabled
Max Connections 750
Deletion Protection Enabled
IAM Auth Enabled

Databases per Tenant (35 PostgreSQL)

celebrity, chat, class-catalog, content, email, fan, group-profile, identityx_26, inventory, journey, media, message_board, notification_service, org_manager, purchase_request_bpm, reporting, search, shoutout, shoutout_bpm, sms, sse, stream, stripe, subscriptions, tags, tracking, transaction, wallet, webinar, superset

Plus an additional MySQL instance for VT NIL (tracking database).

PgBouncer (Connection Pooling)
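
As a hedged sketch, PgBouncer's per-database routing conventionally lives in a pgbouncer.ini rendered from a ConfigMap along these lines (all names, addresses, and settings hypothetical):

apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config            # hypothetical name
data:
  pgbouncer.ini: |
    [databases]
    ; one entry per routed database, pointing at the Cloud SQL private IP
    celebrity = host=10.0.0.5 port=5432 dbname=celebrity
    fan       = host=10.0.0.5 port=5432 dbname=fan

    [pgbouncer]
    listen_port = 5432
    pool_mode = transaction         ; illustrative settings
    max_client_conn = 1000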

Airbyte CDC Replication

20 databases have Change Data Capture replication to Snowflake:

  - Sources: PostgreSQL Cloud SQL (public IP required)
  - Destination: Snowflake data warehouse (org: TSGCLBT, account: PM66380)
  - Per-tenant Snowflake databases (e.g., AGILENETWORK_DB)

CI/CD Pipeline

GitHub Actions (28 Reusable Workflows)

graph TD
    subgraph "Developer Workflow"
        PR[Open PR] --> BUILD
    end

    subgraph "Build Phase"
        BUILD[GitHub Actions] --> MVN[Maven Build + Tests]
        BUILD --> DOCKER[Docker Build<br/>Multi-arch: amd64 + arm64]
    end

    subgraph "Security Phase"
        DOCKER --> TRIVY[Trivy Container Scan<br/>CRITICAL + HIGH]
        DOCKER --> QWIET[Qwiet SAST<br/>Code analysis]
    end

    subgraph "Artifact Phase"
        MVN --> GAR_MVN[Push to GAR Maven]
        DOCKER --> GAR_DOCKER[Push to GAR Docker]
        DOCKER --> HELM[Generate Helm Chart]
        HELM --> GAR_HELM[Push to GAR Helm OCI]
    end

    subgraph "Deploy Phase"
        GAR_DOCKER --> |PR| PREVIEW[Preview Environment<br/>ArgoCD Previews]
        GAR_DOCKER --> |Master| PROD_UPDATE[Update argocd-deployments<br/>image.tag in values.yaml]
        PROD_UPDATE --> ARGOCD[ArgoCD Sync<br/>Helm Release Update]
    end

    subgraph "Preview Lifecycle"
        PREVIEW --> PR_COMMENT[Bot: Preview URL<br/>on PR comment]
        PR_COMMENT --> CLEANUP[PR Closed → Delete<br/>Preview Namespace]
    end

Key Workflow Files

Workflow Purpose
build-maven-docker.yaml Spring Boot service build + Docker
build-node-docker.yaml Node.js/Angular build + Docker
build-pnpm-docker.yaml pnpm frontend builds
deploy-argocd-env.yaml Update ArgoCD env with new version
deploy-helm-preview.yaml Create PR preview environment
cleanup-preview-env.yaml Delete preview namespace on PR close
security-trivy.yaml Container vulnerability scanning
security-qwiet.yaml Static application security testing
cloud-run-job-flyway.yaml Database migration via Cloud Run
lint-sql.yaml SQL linting for Flyway migrations

Self-Hosted Runner

Preview Environment Flow

  1. PR opened → GitHub Actions builds Docker image with PR-specific tag
  2. Helm chart generated with preview values (namespace: {service}-pr-{number})
  3. Application YAML pushed to argocd-previews repo (sketched after this list)
  4. ArgoCD syncs preview application
  5. Secrets copied from fanfuze namespace to preview namespace
  6. Bot comments preview URL: {service}-pr-{number}.dev.fanfuzenil.com
  7. PR closed → Cleanup workflow deletes namespace
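
A hedged sketch of the Application manifest that step 3 might push, with every name, path, and tag hypothetical:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: celebrity-pr-123            # {service}-pr-{number}, hypothetical
  namespace: argocd
spec:
  project: previews                 # hypothetical project name
  source:
    repoURL: us-central1-docker.pkg.dev/favedom-dev/helm   # assumes the OCI registry is registered in ArgoCD
    chart: celebrity
    targetRevision: 0.0.0-pr-123    # PR-specific chart version, hypothetical
    helm:
      values: |
        image:
          tag: pr-123
  destination:
    server: https://kubernetes.default.svc
    namespace: celebrity-pr-123
  syncPolicy:
    automated:
      prune: true                   # namespace cleanup is still handled by the workflow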

Version Management

Helm Chart Architecture

Common Library Chart (v0.0.179)

33 reusable templates standardize all service deployments. Key templates:

Template Purpose
_deployment-java.tpl Java/Spring Boot deployments
_deployment-node.tpl Node.js deployments
_deployment-fe.tpl Frontend (Angular/Ionic) deployments
_istio.tpl VirtualService + DestinationRule
_hpa.tpl Horizontal Pod Autoscaler
_postgres.tpl Database connection env vars + init container
_keycloak.tpl Keycloak integration env vars
_env.tpl Common environment variables
_secret-postgres.yaml PostgreSQL secret template
_pdb.yaml PodDisruptionBudget
_keda.yaml KEDA event-driven autoscaling
_servicemonitor.tpl Prometheus ServiceMonitor
_canary.tpl Flagger canary deployments
_apm.tpl APM sidecar injection
_kubefledged.tpl Image pre-loading

Service Chart Pattern

Each service chart is minimal — delegates to common chart:

# Chart.yaml
dependencies:
  - name: common
    version: 0.0.179
    repository: "oci://us-central1-docker.pkg.dev/favedom-dev/helm"

# templates/deployment.yaml
{{ include "common.deployment.java" . }}

# templates/istio.yaml
{{ include "common.istio" . }}

Feature flags control what each service gets:

# values.yaml (per service)
keycloak: { enabled: true }
postgres: { enabled: true }
rabbitmq: { enabled: true }
prometheus: { enabled: true }
hpa: { enabled: false }  # overridden per tenant

All Helm Charts (48)

Application Services (30): celebrity, fan, users, content, media, shoutout, shoutout-bpm, webinar, chat, message-board, notifications, email, sms, sse, inventory, journey, class-catalog, purchase-request-bpm, transaction, wallet, subscriptions, stripe, search, tags, tracking, reporting, org-manager, group-profile, onsite-event, athlete-manager

Frontends (5): mono-web, admin-fe, celeb-fe, org-dashboard-fe, nilgp-partnerportal-fe

Infrastructure (13): common (library), pgbouncer, rabbitmq-queue-monitor, flyway, shared-secrets, stackhawk, argocd-reports, site-maintenance, node-tracking, nilgp-partnerportal-be, test-spring-boot-app, plus preview configs

Secret Management

Architecture (3-Tier)

graph LR
    subgraph "Source of Truth"
        GSM[GCP Secret Manager<br/>100+ secrets per tenant]
    end

    subgraph "Injection Layer"
        AVP[ArgoCD Vault Plugin<br/>Fetches at sync time]
    end

    subgraph "Runtime"
        K8S[Kubernetes Secrets<br/>Mounted in pods]
    end

    subgraph "Provisioning"
        TF[Terraform<br/>Creates secret shells]
    end

    TF --> GSM
    GSM --> AVP
    AVP --> K8S

Secret Naming Convention

{tenant}_{VENDOR}_{APP}_{SECRET_NAME}

Examples:

  - agilenetwork_STRIPE_PAYMENT_KEY
  - agilenetwork_KEYCLOAK_FAN_CLIENTID
  - agilenetwork_RABBITMQ_PASSWORD
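
As a hedged illustration of how these names meet the ArgoCD Vault Plugin, a chart-rendered Secret might carry an AVP placeholder like the following (the exact placeholder syntax is an assumption; verify against the AVP GCP Secret Manager backend docs). The project number is the secretManagerId from values-globals.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: stripe-secrets              # hypothetical
stringData:
  STRIPE_PAYMENT_KEY: <path:projects/564934788583/secrets/agilenetwork_STRIPE_PAYMENT_KEY/versions/latest>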

Integrated Services (20+ secret modules)

Service Secrets
Stripe Payment key, webhook signing secret
Keycloak Multiple realm configs, DB credentials
RabbitMQ User, password, Erlang cookie
Redis Password
Twilio Account SID, auth token
Mandrill API key
Mux Token ID, secret
Stream Chat API key, secret
Mailchimp API key
Dwolla API key, secret, funding source, webhook secret
Zoom API key, secret
Intercom API token
Snowflake Account, user, password, warehouse
Auth0 Client ID, secret, domain
Plaid Client ID, secret
Grafana Admin credentials
Elasticsearch/APM Connection credentials

CI/CD Secrets

Terraform Infrastructure-as-Code

Module Inventory (19 modules)

Module Purpose
gke-cluster GKE cluster provisioning
postgres-instance Cloud SQL PostgreSQL instance
postgres-db Individual database creation
mysql-instance Cloud SQL MySQL instance
mysql-db MySQL database creation
sql-user Database user creation
dns Cloud DNS + external-dns + cert-manager
secrets 20+ integration secret shells
secrets-powervz PowerVZ-specific secrets
castai CastAI cost optimization
argocd Cross-project ArgoCD IAM
airbyte Airbyte IP allowlisting
airbyte-connection ETL connection configs
airbyte-source PostgreSQL CDC sources
airbyte-destination Snowflake destinations
snowflake Snowflake user/database
filestore GCP Filestore (NFS)
logging-exclusions Cost-saving log filters
keycloak-google-idp Google IDP for Keycloak

State Management

Environment Setup

Secrets injected via tf_setup.sh:

export TF_VAR_castai_api_token=$(gcloud secrets versions access latest \
  --secret="core_CASTAI_API_TOKEN")

Monitoring & Observability

Current Stack

Layer Tool Status
Metrics kube-prometheus-stack Deployed all tenants
Service Monitors Prometheus ServiceMonitor All Java services expose /actuator/prometheus
Dashboards Grafana (via prometheus-stack) Deployed
Logging (collection) peeq-logging (Node.js) Gen 1 (Node.js/Express → Elasticsearch)
Logging (storage) Elasticsearch Cloud-hosted (port 9243, HTTPS)
Logging (visualization) Kibana 7.15.2 Deployed
Tracing Istio → Stackdriver Zipkin integration configured
APM Elastic APM 1.25.2 Available but disabled by default
Session Replay LogRocket Frontend only (admin, celeb, fan)
Analytics Superset Connected to PostgreSQL reporting DB
Cloud Logging GCP Cloud Logging GKE-native integration
Cloud Monitoring GCP Cloud Monitoring GKE-native integration

Observability Gaps

Storage Infrastructure

NFS (ReadWriteMany)

4 Persistent Volume Claims per tenant via NFS provisioner:

PVC Size Purpose
pvc-content 50Gi Content service file storage
pvc-media 50Gi Media service file storage
pvc-shoutout 50Gi Shoutout video storage
pvc-stream 50Gi Streaming assets

All use nfs-client StorageClass via per-tenant NFS provisioner.
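
A minimal sketch of one of these claims, reconstructed from the table (only the metadata is assumed):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-content
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany                 # NFS allows shared mounts across pods
  resources:
    requests:
      storage: 50Gi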

GCS Buckets

2 shared buckets with tenant-specific directories:

Both have versioning enabled and deletion protection.

Operational Tooling

ArgoCD Scripts (24 utilities)

Script Purpose
check-app-version.sh Compare deployed versions across tenants
check-secrets.sh Validate secret availability
set-replicas.sh Scale services up/down
rabbitmq-reports.sh RabbitMQ health and metrics
pgbouncer/reload.sh Hot-reload PgBouncer configuration
restart-pods/ Rolling restart utilities
diff_promote/ Diff configs between environments
update-tags.sh Batch update image tags
create-changelog.sh Generate deployment changelogs

DevOps Utilities (devops-utlities repo)

Subsystem Purpose
db/ Database dumps, restores, BPM exports, vacuum
github/ Mass PR merging, repo migration (multi-gitter)
jdk-update/ Platform-wide JDK upgrades via multi-gitter
update-mvn-deps/ Maven dependency propagation
graphql-migration/ Spring Boot 3.x + GraphQL upgrade scripts
google-artifact-repo/ Container registry management (gcrane)
vpa/ Vertical Pod Autoscaler analysis for cost optimization
env-start-stop/ Multi-tenant cluster startup/shutdown
keycloak/ Realm export utilities

RabbitMQ Queue Monitor

Security Architecture

Authentication & Authorization

graph TD
    subgraph "External"
        USER[User Browser]
    end

    subgraph "GCP Edge"
        LB[GCP Load Balancer<br/>TLS Termination]
    end

    subgraph "Service Mesh"
        IG[Istio IngressGateway<br/>Host-based routing]
        MTLS[Istio mTLS<br/>Pod-to-pod encryption]
    end

    subgraph "Identity"
        KC[Keycloak 26.3<br/>OAuth2/OIDC]
    end

    subgraph "Services"
        SVC[Spring Boot Service<br/>JWT validation via issuer URI]
    end

    USER -->|HTTPS| LB --> IG
    IG -->|/api/*| SVC
    IG -->|/auth/*| KC
    USER -->|OAuth2 flow| KC
    KC -->|JWT| USER
    USER -->|Bearer token| IG
    SVC -.->|mTLS| MTLS -.-> SVC
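
The "JWT validation via issuer URI" step maps to standard Spring Boot resource-server configuration; a hedged sketch composed from the tenant globals shown earlier (the /realms/{realm} path follows modern Keycloak convention and is an assumption):

# application.yaml fragment (illustrative)
spring:
  security:
    oauth2:
      resourceserver:
        jwt:
          issuer-uri: https://identity.theagilenetwork.com/realms/agilenetwork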

Security Controls

Control Implementation
TLS in transit Let’s Encrypt certs via cert-manager, Istio mTLS
Workload Identity Kubernetes SA → GCP SA (no JSON keys)
Secret storage GCP Secret Manager (encrypted at rest, IAM-controlled)
Container scanning Trivy (CRITICAL + HIGH severity)
SAST Qwiet/ShiftLeft (Java + JavaScript)
Database access Private IP via VPC peering, Cloud SQL IAM auth
Deletion protection GKE clusters + Cloud SQL instances
CORS Istio VirtualService (allows all origins — broad)
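
For the mTLS row, the standard Istio control is a namespace- or mesh-wide PeerAuthentication; a minimal sketch (whether the platform runs STRICT or Istio's default PERMISSIVE mode was not stated):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: agilenetwork
spec:
  mtls:
    mode: STRICT                    # assumption; enforces mTLS for all pod-to-pod traffic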

Security Gaps

Gen 1 Infrastructure Services

Service Stack Status Replacement
peeq-logging Node.js/Express → Elasticsearch Active (Gen 1 patterns) Upgrade or replace with GCP Cloud Logging
peeq-kibana-deploy Kibana 7.15.2 wrapper Active Upgrade Kibana or migrate to Elastic Cloud
peeq-shared-secret Java 11/SB 2.4.3 → GCP Secret Manager Unclear if active ArgoCD Vault Plugin may have replaced this
devops-utlities Bash scripts + multi-gitter Active (operational) Keep as tooling

Infrastructure Risk Assessment

High Risk

  1. Zonal GKE clusters — Single zone (us-central1-a). Zone failure = complete tenant outage. No regional cluster configuration. RPO depends on Cloud SQL backup frequency.
  2. Zonal Cloud SQL — ZONAL availability type. No automatic failover. Manual recovery required from point-in-time backup.
  3. Single region — All infrastructure in us-central1. No cross-region DR.

Medium Risk

  1. Cluster-per-tenant cost — 4 production clusters each with 0-3 n1-standard-8 nodes + Cloud SQL + RabbitMQ + Redis. Infrastructure cost scales linearly with tenants.
  2. CastAI on-demand overrides — 25+ services forced to on-demand nodes, reducing spot savings significantly.
  3. PgBouncer as SPOF — If PgBouncer pods fail, all database connections fail. 3 replicas provide some redundancy but no circuit breaking.

Low Risk

  1. Common chart coupling — All 48 charts depend on common v0.0.179. Breaking change affects everything. Mitigated by chart versioning.
  2. ArgoCD centralization — Single ArgoCD instance manages all clusters from core-services project. ArgoCD failure blocks all deployments.
  3. Helm OCI registry — All charts stored in single Artifact Registry. Registry outage blocks deployments but not running services.

Inter-Service Communication (Infrastructure View)

Synchronous

All service-to-service calls route through the Istio mesh:

  - Path-based routing: http://{service}:8080/api/{service}/...
  - mTLS encryption between all pods
  - CORS configured at Istio level (broad: all origins)

Asynchronous

RabbitMQ 3.13.7 cluster per tenant:

  - 3 nodes, persistent storage (20Gi)
  - Memory high watermark: 3276Mi
  - On-demand nodes (no spot eviction)
  - Safe-to-evict: false annotation
  - Definitions loaded from Kubernetes secrets

Database Connections

All services → PgBouncer (3 replicas) → Cloud SQL (private IP):

  - JDBC URL: jdbc:postgresql://pgbouncer:5432/{service-name}
  - Credentials: Kubernetes Secret (from GCP Secret Manager via AVP)
  - Init container waits for PgBouncer availability
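
In Spring Boot terms this is the ordinary datasource triple; a hedged sketch (environment variable names are assumptions):

# application.yaml fragment (illustrative)
spring:
  datasource:
    url: jdbc:postgresql://pgbouncer:5432/celebrity   # {service-name} database
    username: ${DB_USERNAME}        # injected from the AVP-sourced Kubernetes Secret
    password: ${DB_PASSWORD}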

Data Model (Infrastructure)

Resource Allocation per Tenant

Resource Per Tenant Total (4 prod)
GKE Cluster 1 4
GKE Nodes (max) 3 × n1-standard-8 12 nodes, 96 vCPU, 360 GB RAM
Cloud SQL Instance 1 (2 vCPU, 6.5 GB) 4 instances
PostgreSQL Databases 35 140
Max DB Connections 750 3,000
RabbitMQ Nodes 3 12
Redis Instances 1 (master + replicas) 4
PgBouncer Replicas 3 12
NFS PVCs 4 × 50Gi = 200Gi 800Gi
GCS Buckets 2 (shared) 2 (shared)

Scaling Boundaries

Component Current Limit Concern
GKE nodes 3 per cluster Low headroom for traffic spikes
Cloud SQL 750 connections Shared across 35 databases
RabbitMQ 3 nodes, 3276Mi memory Adequate for current load
PgBouncer 3 replicas Bottleneck if overwhelmed
NFS 200Gi per tenant Fixed allocation, may waste or exhaust

Modernization Implications

Consolidation Opportunity: Shared Multi-Tenant Cluster

Current: 4 production GKE clusters (1 per tenant)
Option: 1 regional GKE cluster with namespace-per-tenant isolation

Benefits:

  - ~60-70% infrastructure cost reduction (shared control plane, better node utilization)
  - Simpler management (1 cluster to upgrade, 1 Istio mesh)
  - Faster new-tenant onboarding

Requirements:

  - Kubernetes ResourceQuota per namespace
  - Istio AuthorizationPolicy for service isolation
  - NetworkPolicy for pod-level segmentation
  - Separate Cloud SQL instances per tenant (keep for data isolation)

Risk: Noisy neighbor issues, shared blast radius. Mitigated by resource quotas and Istio rate limiting.
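
A hedged sketch of the namespace-level guardrails consolidation would need (all values illustrative; mesh and ingress-gateway exceptions omitted):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: agilenetwork
spec:
  hard:
    requests.cpu: "24"              # roughly one tenant's 3-node footprint, illustrative
    requests.memory: 90Gi
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: agilenetwork
spec:
  podSelector: {}                   # applies to every pod in the namespace
  ingress:
    - from:
        - podSelector: {}           # allow ingress only from pods in this same namespace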

Infrastructure Upgrades Needed

  1. Regional GKE — Move from zonal to regional clusters for HA
  2. Regional Cloud SQL — Enable automatic failover
  3. Security hardening — Binary Authorization, Shielded Nodes, NetworkPolicies, tighter CORS
  4. Trivy enforcement — Fail builds on CRITICAL vulnerabilities
  5. Observability gaps — Alerting, SLOs, distributed tracing, APM enablement
  6. Logging modernization — Replace peeq-logging (Node.js Gen 1) with GCP Cloud Logging or upgrade to Elastic Cloud
  7. KEDA adoption — Replace custom rabbitmq-queue-monitor with KEDA ScaledObjects (see the sketch below)
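
A minimal ScaledObject sketch for item 7, with queue and connection names hypothetical (the common chart's _keda.yaml suggests the primitives already exist):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: shoutout-bpm-scaler         # hypothetical
spec:
  scaleTargetRef:
    name: shoutout-bpm              # deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: rabbitmq
      metadata:
        mode: QueueLength
        value: "20"                 # scale out beyond 20 ready messages
        queueName: shoutout.requests            # hypothetical queue
        hostFromEnv: RABBITMQ_CONNECTION_STRING # AMQP URI supplied via a Secret-backed env var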

What Can Stay As-Is


Last updated: 2026-01-30 — Session 8
Review by: 2026-04-30
Staleness risk: Medium — infrastructure configs and versions evolve frequently