ADR-004: Multi-Brand Architecture

Last updated: 2026-02-01

Status

Proposed — Pending engineering team review

Context

The platform serves 3 production brands (The Agile Network, NIL Game Plan, VT NIL) on a cluster-per-tenant model. Speed of AI is planned but has no ArgoCD or Terraform production configuration as of 2026-01-31. H11 (L2 Verified) confirmed all tenant differentiation is config-only — same Docker images, same Helm charts, same code. Infrastructure cost scales linearly with tenant count.

Decision

Consolidate to a shared GKE cluster with namespace-per-tenant isolation, preserving config-only tenant differentiation.

Architecture

Target State (current 3 production tenants)

Shared Regional GKE Cluster
├── namespace: tenant-agile-network → ~18 pods
├── namespace: tenant-nil-game-plan → ~18 pods
├── namespace: tenant-vt-nil → ~18 pods
├── namespace: platform → Keycloak, Istio, ArgoCD, monitoring
└── Isolation: NetworkPolicies + Istio AuthPolicy + RabbitMQ vhosts

Future State (when Speed of AI reaches production)

Shared Regional GKE Cluster
├── namespace: tenant-agile-network → ~18 pods
├── namespace: tenant-nil-game-plan → ~18 pods
├── namespace: tenant-vt-nil → ~18 pods
├── namespace: tenant-speed-of-ai → ~18 pods  ← Planned
├── namespace: platform → Keycloak, Istio, ArgoCD, monitoring
└── Isolation: NetworkPolicies + Istio AuthPolicy + RabbitMQ vhosts
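The two layouts above differ only in the list of tenants. A minimal Python sketch of that idea, assuming the `tenant-` prefix shown in the diagrams; the `tenant` label key and the function names are hypothetical and not taken from the platform's repositories:

```python
import json

# Tenant slugs mirror the namespace names in the diagrams above.
PRODUCTION_TENANTS = ["agile-network", "nil-game-plan", "vt-nil"]


def namespace_manifest(tenant: str) -> dict:
    """Build a Kubernetes Namespace manifest for one tenant."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": f"tenant-{tenant}",
            "labels": {"tenant": tenant},  # hypothetical label key
        },
    }


def cluster_namespaces(tenants: list[str]) -> list[dict]:
    """Return all tenant namespaces plus the shared platform namespace."""
    manifests = [namespace_manifest(t) for t in tenants]
    manifests.append(
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": "platform"}}
    )
    return manifests


if __name__ == "__main__":
    # Onboarding Speed of AI is one more list entry, not a new cluster.
    future = PRODUCTION_TENANTS + ["speed-of-ai"]
    print(json.dumps([m["metadata"]["name"] for m in cluster_namespaces(future)]))
```

In practice these manifests would more likely live in Helm or ArgoCD configuration; the sketch only illustrates that onboarding a brand is a data change.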

Isolation Mechanisms

Layer	Current	Target
Compute	Separate clusters	Namespace + ResourceQuota
Network	Physical isolation	NetworkPolicies (default-deny)
Service mesh	Separate Istio	Shared Istio + AuthorizationPolicy
Database	Separate Cloud SQL	Shared Cloud SQL + separate schemas
Messaging	Separate RabbitMQ	Shared RabbitMQ + vhosts
Cache	Separate Redis	Key prefixing or separate Redis
Secrets	Cluster-scoped	Namespace-scoped (AVP)
Config	values-globals.yaml per cluster	values-globals.yaml per namespace
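The Compute and Network rows of the table can be sketched as per-namespace manifests. The Python below builds a default-deny NetworkPolicy, a same-namespace allow rule, and a ResourceQuota as plain dictionaries. The policy names and quota figures are illustrative assumptions, and a real rollout would also need rules for DNS and for traffic to the `platform` namespace, which this sketch omits.

```python
def default_deny_policy(namespace: str) -> dict:
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],
            # No ingress/egress rules listed, so nothing is allowed.
        },
    }


def allow_same_namespace_policy(namespace: str) -> dict:
    """Allow traffic only between pods that share this namespace."""
    # A peer with only a podSelector is scoped to the policy's own namespace.
    same_namespace_peer = [{"podSelector": {}}]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": same_namespace_peer}],
            "egress": [{"to": same_namespace_peer}],
        },
    }


def resource_quota(namespace: str, cpu: str, memory: str, pods: int) -> dict:
    """Cap the aggregate requests and pod count of one tenant namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "pods": str(pods),
            }
        },
    }


if __name__ == "__main__":
    ns = "tenant-vt-nil"
    # Quota values are placeholders, sized loosely above ~18 pods per tenant.
    for manifest in (
        default_deny_policy(ns),
        allow_same_namespace_policy(ns),
        resource_quota(ns, cpu="16", memory="32Gi", pods=30),
    ):
        print(manifest["kind"], manifest["metadata"]["name"])
```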

Hypothesis Background

Primary: Config-only multi-brand (H11 L2) enables safe cluster consolidation.
- Same Docker images across all tenants; no code-level branching.
- values-globals.yaml provides all tenant-specific config.
- Namespace isolation provides equivalent security to cluster isolation for this workload.
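The config-only property can be illustrated with a small Python sketch: every tenant renders from the same base values, and tenant overrides are rejected if they touch the image. All keys, the image repository, and the domains are hypothetical placeholders standing in for the real values-globals.yaml contents, which are not shown here.

```python
import copy

# Shared chart defaults. Repository, tag, and keys are hypothetical placeholders.
BASE_VALUES = {
    "image": {"repository": "registry.example.com/platform/api", "tag": "1.0.0"},
    "brand": {"name": "", "domain": ""},
}

# Stand-ins for each tenant's values-globals.yaml (contents are illustrative).
TENANT_VALUES = {
    "agile-network": {"brand": {"name": "The Agile Network", "domain": "agile.example.com"}},
    "nil-game-plan": {"brand": {"name": "NIL Game Plan", "domain": "nilgp.example.com"}},
    "vt-nil": {"brand": {"name": "VT NIL", "domain": "vtnil.example.com"}},
}

# Top-level keys a tenant may not override, so every tenant runs the same code.
FORBIDDEN_OVERRIDES = {"image"}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with override recursively applied on top of base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_values(tenant: str) -> dict:
    """Render the effective values for one tenant, enforcing config-only overrides."""
    overrides = TENANT_VALUES[tenant]
    illegal = FORBIDDEN_OVERRIDES.intersection(overrides)
    if illegal:
        raise ValueError(f"{tenant} overrides non-config keys: {sorted(illegal)}")
    return deep_merge(BASE_VALUES, overrides)


if __name__ == "__main__":
    for tenant in TENANT_VALUES:
        values = render_values(tenant)
        print(tenant, values["image"]["tag"], values["brand"]["name"])
```

A check like `FORBIDDEN_OVERRIDES` could run in CI to keep the H11 property from eroding over time; whether the team wants such a guard is a separate decision.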

Alternative 1: Keep cluster-per-tenant model.
- Rejected: Cost scales linearly. Adding a 4th brand (Speed of AI) or beyond requires an entire new cluster + Cloud SQL + RabbitMQ + Redis. Current model is not economically scalable.

Alternative 2: Full multi-tenancy (single set of services, tenant ID in requests).
- Rejected: Requires code changes to add tenant context to every service, database query, and message handler. Risk is disproportionate to benefit. Namespace isolation achieves the cost savings with minimal code changes.

Falsifiability Criteria

If NetworkPolicies cannot prevent cross-namespace traffic in testing → revert to separate clusters
If shared Cloud SQL hits connection limits with 3+ tenants → use separate Cloud SQL instances
If noisy-neighbor effects cause SLO breaches → tighten per-namespace ResourceQuotas or revert to separate clusters
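The Cloud SQL criterion can be estimated before any migration with a back-of-the-envelope check. The sketch below uses the ~18 pods per tenant from the architecture diagrams; the pool size per pod, the instance connection limit, and the headroom fraction are assumed placeholders that would need to be replaced with measured values. It treats every pod as a database client, so it is an upper bound.

```python
def connection_demand(tenants: int, pods_per_tenant: int, pool_size_per_pod: int) -> int:
    """Worst-case open connections if every pod fills its connection pool."""
    return tenants * pods_per_tenant * pool_size_per_pod


def within_connection_limit(
    tenants: int,
    pods_per_tenant: int,
    pool_size_per_pod: int,
    max_connections: int,
    headroom: float = 0.2,
) -> bool:
    """True if worst-case demand fits the limit with a safety margin reserved."""
    budget = int(max_connections * (1 - headroom))
    return connection_demand(tenants, pods_per_tenant, pool_size_per_pod) <= budget


if __name__ == "__main__":
    # pool_size_per_pod and max_connections are hypothetical, not measured.
    for tenants in (3, 4):
        ok = within_connection_limit(
            tenants, pods_per_tenant=18, pool_size_per_pod=5, max_connections=500
        )
        print(f"{tenants} tenants: within limit = {ok}")
```

If the check fails for realistic inputs, the fallback named in the criterion (separate Cloud SQL instances) applies; a connection pooler is another option, though it is not part of this ADR.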

Evidence Quality

Evidence	Assurance
Config-only differentiation	L2 (H11, verified across all domains + infra)
Same Docker images	L2 (verified from ArgoCD + Helm charts)
NetworkPolicies supported by GKE	L2 (GCP documentation)
Cost scaling is linear	L1 (inferred from 3 production clusters)
Data volumes manageable per shared instance	L0 (H8 — need actual data)

Overall: L1 (WLNK capped by H8)

Bounded Validity

Scope: All 3 production tenants (agilenetwork, nilgameplan, vtnil) + dev/staging environments. Speed of AI will be included when it reaches production.
Expiry: Re-evaluate if tenant count exceeds 10 (shared cluster resource limits) or if tenants require regulatory isolation (separate jurisdictions)
Review trigger: If any tenant’s workload causes noisy-neighbor effects measurable in SLOs

Consequences

Positive: ~67% infrastructure cost reduction (3 clusters → 1), simplified operations (single cluster to manage), easier to add new brands like Speed of AI (new namespace, not new cluster).
Negative: Blast radius increases (cluster failure affects all tenants), noisy-neighbor risk, more complex RBAC/NetworkPolicy configuration.
Mitigated by: Regional GKE (multi-zone HA), ResourceQuotas per namespace, monitoring per tenant.
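The ~67% figure is consistent with simple arithmetic under the linear cost model, if the shared cluster is assumed to cost about as much as one of today's per-tenant clusters. That equal-cost assumption is not verified in this ADR. Here c is the cost of one per-tenant stack, n the number of tenants, and k a hypothetical factor for how much larger the shared stack is than a single per-tenant stack:

```latex
\[
C_{\text{current}} = n\,c, \qquad
C_{\text{target}} = k\,c, \qquad
\text{reduction} = 1 - \frac{C_{\text{target}}}{C_{\text{current}}} = 1 - \frac{k}{n}
\]
\[
n = 3,\; k = 1 \;\Rightarrow\; 1 - \tfrac{1}{3} \approx 0.67
\]
```

If the shared cluster needs larger node pools or database tiers to host all tenants, k rises above 1 and the saving shrinks; for example, k = 1.5 with n = 3 gives a 50% reduction.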

Decision date: 2026-01-30
Review by: 2026-07-30