ADR-020: Network Security Model

Last updated: 2026-02-01

Status

Proposed — Pending engineering team review

Context

The platform currently has no NetworkPolicies, a permissive CORS configuration, no WAF, and some publicly accessible GCS buckets. All pods can communicate freely within each GKE cluster. Moving to a shared cluster (ADR-002) makes this gap critical: without network isolation, tenant A’s pods could reach tenant B’s services.

Current State

Component	Current	Gap
NetworkPolicies	None	All pods communicate freely — lateral movement possible
CORS	Allow all origins	Cross-origin attacks possible on all 24+ GraphQL endpoints
WAF	None	No application-layer attack filtering (SQL injection, XSS)
GCS buckets	Some public access	Unauthorized content access; data exposure risk
Istio mTLS	`PERMISSIVE` mode	Plaintext connections still accepted; not all traffic is encrypted in transit
Namespace isolation	Single namespace per cluster	No tenant isolation within cluster
Secret rotation	No automated rotation	Secrets persist indefinitely once created

Impact

The shared cluster (ADR-002) is blocked until NetworkPolicies exist; tenants cannot be co-located safely without them
Public GCS buckets expose user-uploaded content to unauthorized access
Permissive CORS lets any website issue browser requests to backend services and read the responses
No WAF means OWASP Top 10 attacks (SQL injection, XSS, etc.) are not filtered at the edge

Decision

Implement defense-in-depth network security across four layers: edge (WAF), mesh (mTLS + AuthorizationPolicy), cluster (NetworkPolicies), and storage (signed URLs).

Layer 1: Edge Security (Cloud Armor WAF)

Rule	Purpose
OWASP Top 10 managed rule set	Block SQL injection, XSS, RCE
Rate limiting	1000 req/min per IP (adjustable per path)
Geo-restriction	Optional — restrict to operating regions
Bot management	Block known malicious user agents
Custom rules	Block requests >10MB (except file upload paths)

Applied at the GCP load balancer, in front of the Istio IngressGateway.
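A minimal sketch of how part of the rule table might be expressed as a Cloud Armor security policy. It assumes the policy is kept in the YAML form used by `gcloud compute security-policies export`/`import`. The policy name, the priorities, and the choice of CRS 3.3 preconfigured rules (`*-v33-stable`) are illustrative assumptions. Geo-restriction, bot management and the 10 MB size rule are omitted. Rules marked `preview: true` log matches without enforcing them, which is one way to run the tuning period described under Falsifiability Criteria.

```yaml
# Hypothetical policy "edge-waf", attached to the backend behind the GCP load balancer.
name: edge-waf
description: Edge WAF in front of the Istio IngressGateway (sketch)
rules:
  # OWASP Top 10 coverage via preconfigured ModSecurity CRS 3.3 rule sets.
  - priority: 1000
    action: deny(403)
    preview: true            # log-only while tuning; set to false to enforce
    match:
      expr:
        expression: "evaluatePreconfiguredExpr('sqli-v33-stable')"
    description: Block SQL injection
  - priority: 1001
    action: deny(403)
    preview: true
    match:
      expr:
        expression: "evaluatePreconfiguredExpr('xss-v33-stable')"
    description: Block cross-site scripting
  - priority: 1002
    action: deny(403)
    preview: true
    match:
      expr:
        expression: "evaluatePreconfiguredExpr('rce-v33-stable')"
    description: Block remote code execution
  # 1000 requests per minute per client IP; requests over the limit get HTTP 429.
  - priority: 2000
    action: throttle
    match:
      versionedExpr: SRC_IPS_V1
      config:
        srcIpRanges:
          - "*"
    rateLimitOptions:
      conformAction: allow
      exceedAction: deny(429)
      enforceOnKey: IP
      rateLimitThreshold:
        count: 1000
        intervalSec: 60
    description: Per-IP rate limit
  # Default rule (lowest possible priority): allow anything not matched above.
  - priority: 2147483647
    action: allow
    match:
      versionedExpr: SRC_IPS_V1
      config:
        srcIpRanges:
          - "*"
    description: Default allow
```

Per-path rate-limit adjustments would be additional `throttle` rules at higher priority, each matching on the request path.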

Layer 2: Service Mesh (Istio)

Policy	Scope	Effect
PeerAuthentication	Mesh-wide	`STRICT` mTLS — all service-to-service traffic encrypted
AuthorizationPolicy	Per namespace	Only named services can reach each other
RequestAuthentication	IngressGateway	Validate JWT from Keycloak before routing

```yaml
# Example: Strict mTLS mesh-wide
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT
```
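The other two rows of the table might look like the following sketch. All names in it are hypothetical placeholders, not part of this ADR:

- the workload label `app: graphql`;
- the service accounts `frontend-bff` and `istio-ingressgateway-service-account`;
- the Keycloak realm URL under `keycloak.example.com`;
- the `/graphql` path.

The sketch also assumes the default `cluster.local` trust domain. One design detail matters here. A `RequestAuthentication` rejects only requests that carry an *invalid* token; requests with no token still pass. Policy (c) therefore denies token-less API requests explicitly, while exempting CORS preflight (`OPTIONS`) requests.

```yaml
# (a) Per-namespace allow list: only the listed identities may call this workload.
#     Peer identities come from mTLS certificates, so this relies on STRICT mode.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: graphql-allowed-callers
  namespace: tenant-agile-network
spec:
  selector:
    matchLabels:
      app: graphql                     # hypothetical workload label
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              # Hypothetical service accounts (default cluster.local trust domain)
              - cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account
              - cluster.local/ns/tenant-agile-network/sa/frontend-bff
---
# (b) Validate Keycloak-issued JWTs at the IngressGateway (placeholder realm URL).
apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: keycloak-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
    - issuer: https://keycloak.example.com/realms/platform
      jwksUri: https://keycloak.example.com/realms/platform/protocol/openid-connect/certs
---
# (c) Deny API requests that present no valid JWT; let CORS preflights through.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: require-jwt-on-api
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: DENY
  rules:
    - from:
        - source:
            notRequestPrincipals: ["*"]
      to:
        - operation:
            paths: ["/graphql"]        # hypothetical API path
            notMethods: ["OPTIONS"]
```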

Layer 3: Cluster Network (NetworkPolicies)

Default-deny ingress for all namespaces, with explicit allow rules:

Namespace	Allowed Ingress From	Rationale
`tenant-{name}`	Istio IngressGateway (external traffic)	All external traffic enters via the mesh
`platform` (Keycloak)	All tenant namespaces	All services validate JWTs
All namespaces	`monitoring` (metrics scrape)	Prometheus must reach pod metrics endpoints
Within tenant namespace	Same namespace only	Services within one tenant can communicate; cross-tenant traffic is denied

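```yaml
# Default deny all ingress per tenant namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: tenant-agile-network
spec:
  podSelector: {}   # selects every pod in the namespace
  policyTypes:
    - Ingress       # Ingress listed with no ingress rules = deny all inbound
```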
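The explicit allow rules from the table could be layered on top of the default deny, because NetworkPolicies are additive. The sketch below makes several assumptions:

- the gateway runs in `istio-system` with the default `istio: ingressgateway` label;
- namespaces carry the built-in `kubernetes.io/metadata.name` label;
- tenant namespaces are marked with a hypothetical `tenant` label;
- Keycloak pods use a hypothetical `app: keycloak` label.

For scraping, Prometheus initiates the connection, so the rule lives in each tenant namespace and admits traffic *from* `monitoring`.

```yaml
# Allow external traffic that enters via the Istio IngressGateway.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingressgateway
  namespace: tenant-agile-network
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        # namespaceSelector + podSelector in ONE entry: both must match
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
          podSelector:
            matchLabels:
              istio: ingressgateway
---
# Allow pods within the same tenant namespace to reach each other.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-agile-network
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # any pod in this namespace, and no other
---
# Allow Prometheus in the monitoring namespace to scrape tenant pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-monitoring
  namespace: tenant-agile-network
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
---
# In the platform namespace: let every tenant namespace reach Keycloak.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tenants-to-keycloak
  namespace: platform
spec:
  podSelector:
    matchLabels:
      app: keycloak          # hypothetical Keycloak pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchExpressions:
              - key: tenant  # hypothetical label set on tenant namespaces
                operator: Exists
```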

Layer 4: Storage Security (GCS Signed URLs)

Current	Target
Public GCS buckets	Private buckets + signed URLs
Direct GCS URLs in database	Signed URL generation at read time
No expiry	URLs expire after configurable TTL (default 1 hour)
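"Private buckets" could be enforced rather than audited bucket by bucket. One option, sketched here as an assumption and not a stated part of this decision, is the `storage.publicAccessPrevention` organization policy constraint. It is written in the v2 org-policy format and applied with `gcloud org-policies set-policy`. `PROJECT_ID` is a placeholder. The constraint also applies to existing buckets and objects. Enforcing it therefore cuts off any content still served through public URLs, so it would follow the signed-URL rollout rather than precede it.

```yaml
# Enforce public access prevention for every bucket in the project.
# Apply with: gcloud org-policies set-policy pap-policy.yaml
name: projects/PROJECT_ID/policies/storage.publicAccessPrevention
spec:
  rules:
    - enforce: true
```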

Content delivery flow:

1. Frontend requests content via GraphQL.
2. Service generates a signed URL (signed via GCS IAM).
3. Frontend receives the time-limited URL.
4. CDN caches the response fetched with the signed URL, not the signature itself.

CORS Hardening

Current	Target
`Access-Control-Allow-Origin: *`	Explicit tenant domains only
No preflight caching	`Access-Control-Max-Age: 3600`
All methods allowed	Only `GET, POST, OPTIONS` for GraphQL

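Allowed origins are derived per tenant from values-globals.yaml:

- https://theagilenetwork.com
- https://nilgameplan.com
- https://vtnil.com
- https://speedofai.com
- Plus `*.preview.app` for staging/preview environments

Because `Access-Control-Allow-Origin` accepts only a single origin or `*`, the server must match each request's `Origin` header against this list (including the `*.preview.app` pattern) and echo back the matching origin. It must also send `Vary: Origin`, so that caches do not reuse one origin's CORS response for another.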
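If CORS enforcement were moved from service code to the mesh edge, an Istio `VirtualService` `corsPolicy` could express the target table. This is a sketch under that assumption, not the current design; today CORS is set in service code. Several names are hypothetical:

- the API host `api.theagilenetwork.com`;
- the gateway `istio-system/public-gateway`;
- the destination service `graphql` on port 8080.

The preview wildcard becomes a regex match. The services' own `Access-Control-Allow-Origin: *` headers would need to be removed so the two layers do not conflict.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: graphql
  namespace: tenant-agile-network
spec:
  hosts:
    - api.theagilenetwork.com          # hypothetical API host
  gateways:
    - istio-system/public-gateway      # hypothetical Gateway resource
  http:
    - corsPolicy:
        allowOrigins:
          - exact: https://theagilenetwork.com
          # Preview/staging origins such as https://pr-123.preview.app
          - regex: 'https://[a-z0-9-]+\.preview\.app'
        allowMethods:
          - GET
          - POST
          - OPTIONS
        allowHeaders:
          - authorization
          - content-type
        maxAge: 3600s                  # Access-Control-Max-Age: 3600
      route:
        - destination:
            host: graphql              # hypothetical Service name
            port:
              number: 8080
```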

Implementation Priority

NetworkPolicies — prerequisite for shared cluster (ADR-002)
CORS hardening — low effort, high impact
mTLS STRICT — service mesh security
Cloud Armor WAF — edge protection
GCS signed URLs — storage security
Secret rotation — operational hygiene

Hypothesis Background

Primary: Defense-in-depth network security with four layers provides adequate isolation for multi-tenant shared cluster operation.

Evidence: No NetworkPolicies today — all pods communicate freely (L2)
Evidence: CORS allows all origins (L2 — observed in service configurations)
Evidence: Some GCS buckets have public access (L1 — observed in Terraform)
Evidence: Shared cluster requires tenant isolation (L2 — architectural requirement from ADR-002)

Alternative: Keep cluster-per-tenant (no NetworkPolicies needed).

- Rejected for cost reasons (ADR-002): cluster-per-tenant costs scale linearly with the number of tenants.

Falsifiability Criteria

If NetworkPolicies break >5% of service-to-service communication on first deployment → integration pattern documentation is incomplete
If signed URLs add >200ms latency to content delivery → evaluate CDN-based signing (Cloud CDN signed URLs)
If Cloud Armor blocks >1% of legitimate traffic → tune rules with 2-week learning mode
If mTLS STRICT mode causes service connectivity failures → roll back to PERMISSIVE per affected service
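The rollback in the last criterion can be scoped to a single workload. A workload-level `PeerAuthentication` with a `selector` takes precedence over namespace-wide and mesh-wide policies, so the rest of the mesh stays `STRICT`. A sketch, with the `legacy-service` label and the namespace as hypothetical placeholders:

```yaml
# Only pods matching the selector accept plaintext; every other workload stays STRICT.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: legacy-service-permissive
  namespace: tenant-agile-network
spec:
  selector:
    matchLabels:
      app: legacy-service   # hypothetical label of the affected workload
  mtls:
    mode: PERMISSIVE
```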

Evidence Quality

Evidence	Assurance
No NetworkPolicies in any cluster	L2 (verified from Terraform/Helm)
CORS allows all origins	L2 (verified from service code)
Public GCS buckets	L1 (observed, not exhaustively audited)
Integration patterns (allow list)	L1 (documented in integration-patterns.md)
Cloud Armor effectiveness	L0 (not tested)

Overall: L1 (WLNK capped by untested Cloud Armor configuration)

Bounded Validity

Scope: All GKE clusters, all backend services, GCS storage.
Expiry: Re-evaluate NetworkPolicy rules after shared cluster migration is complete.
Review trigger: if multi-tenant isolation is deemed insufficient for compliance, or if Cloud Armor causes significant false positives.
Monitoring: Track blocked requests (Cloud Armor), NetworkPolicy drops (Cilium/Calico metrics), mTLS handshake failures.

Consequences

Positive:

- Enables shared cluster multi-tenancy (ADR-002 prerequisite)
- Defense-in-depth: WAF + mesh + network + storage
- Eliminates public GCS bucket exposure
- CORS hardening blocks cross-origin reads from untrusted websites
- mTLS encrypts all service-to-service traffic

Negative:

- NetworkPolicy maintenance as services change
- Signed URL generation adds compute overhead
- Cloud Armor has per-request cost
- CORS changes may break preview/development environments initially

Decision date: 2026-02-01 | Review by: 2026-08-01