ADR-004: Multi-Brand Architecture
Status
Proposed — Pending engineering team review
Context
The platform serves 3 production brands (The Agile Network, NIL Game Plan, VT NIL) on a cluster-per-tenant model. Speed of AI is planned but has no ArgoCD or Terraform production configuration as of 2026-01-31. H11 (L2 Verified) confirmed all tenant differentiation is config-only — same Docker images, same Helm charts, same code. Infrastructure cost scales linearly with tenant count.
Decision
Consolidate to a shared GKE cluster with namespace-per-tenant isolation, preserving config-only tenant differentiation.
Architecture
Target State (3 current production tenants)
Shared Regional GKE Cluster
├── namespace: tenant-agile-network → ~18 pods
├── namespace: tenant-nil-game-plan → ~18 pods
├── namespace: tenant-vt-nil → ~18 pods
├── namespace: platform → Keycloak, Istio, ArgoCD, monitoring
└── Isolation: NetworkPolicies + Istio AuthPolicy + RabbitMQ vhosts
Future State (when Speed of AI reaches production)
Shared Regional GKE Cluster
├── namespace: tenant-agile-network → ~18 pods
├── namespace: tenant-nil-game-plan → ~18 pods
├── namespace: tenant-vt-nil → ~18 pods
├── namespace: tenant-speed-of-ai → ~18 pods ← Planned
├── namespace: platform → Keycloak, Istio, ArgoCD, monitoring
└── Isolation: NetworkPolicies + Istio AuthPolicy + RabbitMQ vhosts
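Under this layout, onboarding a brand starts with a namespace rather than a cluster. A minimal sketch of what that namespace could look like follows. The `platform.example.com/tenant` label key is a hypothetical convention, not something taken from the existing configuration, and the `istio-injection` label assumes the shared mesh runs in sidecar mode.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-speed-of-ai
  labels:
    # Hypothetical label key for selecting tenant namespaces in policies and dashboards.
    platform.example.com/tenant: speed-of-ai
    # Standard Istio label for automatic sidecar injection (assumes sidecar mode).
    istio-injection: enabled
```

A consistent tenant label gives NetworkPolicies, quotas, and per-tenant monitoring a single selector to key on.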
Isolation Mechanisms
| Layer | Current | Target |
|---|---|---|
| Compute | Separate clusters | Namespace + ResourceQuota |
| Network | Physical isolation | NetworkPolicies (default-deny) |
| Service mesh | Separate Istio | Shared Istio + AuthorizationPolicy |
| Database | Separate Cloud SQL | Shared Cloud SQL + separate schemas |
| Messaging | Separate RabbitMQ | Shared RabbitMQ + vhosts |
| Cache | Separate Redis | Key prefixing or separate Redis |
| Secrets | Cluster-scoped | Namespace-scoped (AVP) |
| Config | values-globals.yaml per cluster | values-globals.yaml per namespace |
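The Network and Service mesh rows above could be realized roughly as follows for one tenant. This is a sketch under stated assumptions, not a verified production configuration: the policy names are illustrative, the cluster is assumed to have NetworkPolicy enforcement enabled, and the Istio policy assumes mutual TLS is on in the mesh (namespace-based source matching depends on it). Allowing the `platform` namespace as a source is also an assumption about where ingress traffic originates.

```yaml
# Default-deny: selects every pod in the namespace and allows no traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-agile-network
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow traffic between pods inside the same tenant namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-agile-network
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}
  egress:
    - to:
        - podSelector: {}
---
# Mesh layer: only workloads from the tenant's own namespace or from
# platform may call workloads here. Older Istio releases use
# security.istio.io/v1beta1 instead of v1.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-tenant-and-platform
  namespace: tenant-agile-network
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces:
              - tenant-agile-network
              - platform
```

A working setup would need further allow rules that this sketch omits, at minimum egress to cluster DNS and to the shared services in `platform` (Keycloak, the Istio control plane), plus whatever path is used to reach Cloud SQL, RabbitMQ, and Redis.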
Hypothesis Background
Primary: Config-only multi-brand (H11 L2) enables safe cluster consolidation.
- Same Docker images across all tenants, with no code-level branching.
- values-globals.yaml provides all tenant-specific config.
- Namespace isolation provides equivalent security to cluster isolation for this workload.
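Because tenants differ only in configuration, per-tenant deployment could reduce to one ArgoCD Application per namespace, each pointing at the same chart with a different values file. The sketch below illustrates that shape. The repository URL, chart path, and values file path are hypothetical placeholders, and placing the Application in `platform` assumes that is the namespace ArgoCD watches.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-nil-game-plan
  # Assumption: ArgoCD runs in, and reads Applications from, the platform namespace.
  namespace: platform
spec:
  project: default
  source:
    # Hypothetical repository and chart location.
    repoURL: https://git.example.com/platform/deploy.git
    targetRevision: main
    path: charts/platform
    helm:
      valueFiles:
        # Hypothetical path; the only tenant-specific input to the shared chart.
        - tenants/nil-game-plan/values-globals.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: tenant-nil-game-plan
```

Only `metadata.name`, the values file, and `destination.namespace` change between tenants; the chart and image references stay identical.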
Alternative 1: Keep cluster-per-tenant model.
- Rejected: Cost scales linearly. Each additional brand, starting with a 4th (Speed of AI), requires an entire new cluster plus its own Cloud SQL, RabbitMQ, and Redis. The current model is not economically scalable.
Alternative 2: Full multi-tenancy (single set of services, tenant ID in requests).
- Rejected: Requires code changes to add tenant context to every service, database query, and message handler. Risk is disproportionate to benefit. Namespace isolation achieves the cost savings with minimal code changes.
Falsifiability Criteria
- If NetworkPolicies cannot prevent cross-namespace traffic in testing → revert to separate clusters
- If shared Cloud SQL hits connection limits with 3+ tenants → use separate Cloud SQL instances
- If noisy-neighbor effects cause SLO breaches → tighten ResourceQuotas or revert to separate clusters
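For the noisy-neighbor criterion, the per-namespace guardrail could look like the sketch below. All numeric values are placeholders chosen for illustration, not sizing derived from measured workloads; real limits would need to come from observed usage of the ~18 pods per tenant.

```yaml
# Caps the total resources one tenant namespace can claim.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-compute-quota
  namespace: tenant-vt-nil
spec:
  hard:
    pods: "30"              # placeholder
    requests.cpu: "12"      # placeholder
    requests.memory: 24Gi   # placeholder
    limits.cpu: "24"        # placeholder
    limits.memory: 48Gi     # placeholder
---
# Supplies defaults so containers without explicit requests/limits
# are still admitted under the quota above.
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-container-defaults
  namespace: tenant-vt-nil
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 250m
        memory: 256Mi
      default:
        cpu: 500m
        memory: 512Mi
```

Once a quota covers CPU or memory, Kubernetes rejects pods in that namespace that do not declare the corresponding requests and limits, which is why the LimitRange is paired with it here.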
Evidence Quality
| Evidence | Assurance |
|---|---|
| Config-only differentiation | L2 (H11, verified across all domains + infra) |
| Same Docker images | L2 (verified from ArgoCD + Helm charts) |
| NetworkPolicies supported by GKE | L2 (GCP documentation) |
| Cost scaling is linear | L1 (inferred from 3 production clusters) |
| Data volumes manageable per shared instance | L0 (H8 — need actual data) |
Overall: L1 (WLNK capped by H8)
Bounded Validity
- Scope: All 3 production tenants (agilenetwork, nilgameplan, vtnil) + dev/staging environments. Speed of AI will be included when it reaches production.
- Expiry: Re-evaluate if tenant count exceeds 10 (shared cluster resource limits) or if tenants require regulatory isolation (separate jurisdictions)
- Review trigger: If any tenant’s workload causes noisy-neighbor effects measurable in SLOs
Consequences
Positive:
- ~67% infrastructure cost reduction (3 clusters → 1). This is an upper bound: it holds only if the shared cluster costs about as much as one current cluster.
- Simplified operations (single cluster to manage)
- Easier to add new brands like Speed of AI (new namespace, not new cluster)
Negative:
- Blast radius increases (cluster failure affects all tenants)
- Noisy-neighbor risk
- More complex RBAC/NetworkPolicy configuration
Mitigated by:
- Regional GKE (multi-zone HA)
- ResourceQuotas per namespace
- Monitoring per tenant
Decision date: 2026-01-30
Review by: 2026-07-30