ADR-020: Network Security Model
Status
Proposed — Pending engineering team review
Context
The platform has no NetworkPolicies, permissive CORS configuration, no WAF, and public GCS buckets. All pods can communicate freely within each GKE cluster. Moving to a shared cluster (ADR-002) makes this gap critical — without network isolation, tenant A’s pods could reach tenant B’s services.
Current State
| Component | Current | Gap |
|---|---|---|
| NetworkPolicies | None | All pods communicate freely — lateral movement possible |
| CORS | Allow all origins | Cross-origin attacks possible on all 24+ GraphQL endpoints |
| WAF | None | No application-layer attack filtering (SQL injection, XSS) |
| GCS buckets | Some public access | Unauthorized content access; data exposure risk |
| Istio mTLS | PERMISSIVE mode | Not all traffic is encrypted in transit |
| Namespace isolation | Single namespace per cluster | No tenant isolation within cluster |
| Secret rotation | None automated | Secrets persist indefinitely once created |
Impact
- Shared cluster (ADR-002) is blocked without NetworkPolicies — cannot co-locate tenants safely
- Public GCS buckets expose user-uploaded content to unauthorized access
- Permissive CORS allows any website to make API requests to backend services
- No WAF means OWASP Top 10 attacks (SQL injection, XSS, etc.) are not filtered at the edge
Decision
Implement defense-in-depth network security across four layers: edge (WAF), mesh (mTLS + AuthorizationPolicy), cluster (NetworkPolicies), and storage (signed URLs).
Layer 1: Edge Security (Cloud Armor WAF)
| Rule | Purpose |
|---|---|
| OWASP Top 10 managed rule set | Block SQL injection, XSS, RCE |
| Rate limiting | 1000 req/min per IP (adjustable per path) |
| Geo-restriction | Optional — restrict to operating regions |
| Bot management | Block known malicious user agents |
| Custom rules | Block requests >10MB (except file upload paths) |
Applied at GCP Load Balancer, in front of Istio IngressGateway.
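One possible way to attach a Cloud Armor policy at the load balancer is a GKE BackendConfig, sketched below. This assumes the Istio IngressGateway Service is exposed through a GKE Ingress (external Application Load Balancer) and that a Cloud Armor security policy already exists. The names `ingressgateway-armor` and `platform-edge-policy` are hypothetical, and the WAF rules themselves (OWASP rule set, rate limits) would be defined on the policy, not in this manifest.

```yaml
# Sketch only: binds an existing Cloud Armor policy to the backend service
# that GKE creates for the IngressGateway Service.
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: ingressgateway-armor        # hypothetical name
  namespace: istio-system
spec:
  securityPolicy:
    name: platform-edge-policy      # hypothetical Cloud Armor policy name
# The IngressGateway Service would then reference it with the annotation:
#   cloud.google.com/backend-config: '{"default": "ingressgateway-armor"}'
```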
Layer 2: Service Mesh (Istio)
| Policy | Scope | Effect |
|---|---|---|
| PeerAuthentication | Mesh-wide | STRICT mTLS: all service-to-service traffic encrypted |
| AuthorizationPolicy | Per namespace | Only named services can reach each other |
| RequestAuthentication | IngressGateway | Validate JWT from Keycloak before routing |
# Example: Strict mTLS mesh-wide
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
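The other two rows of the Layer 2 table could look roughly like the following. This is a minimal sketch, not the platform's actual configuration: the workload label `content-service`, the service account `graphql-gateway`, and the Keycloak host and realm are all hypothetical placeholders.

```yaml
# AuthorizationPolicy: only a named caller may reach the selected workload.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-named-callers           # hypothetical name
  namespace: tenant-agile-network
spec:
  selector:
    matchLabels:
      app: content-service            # hypothetical workload label
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              # mTLS identity of the caller (hypothetical service account)
              - cluster.local/ns/tenant-agile-network/sa/graphql-gateway
---
# RequestAuthentication: validate Keycloak-issued JWTs at the IngressGateway.
apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: keycloak-jwt                  # hypothetical name
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
    # Hypothetical Keycloak host and realm
    - issuer: https://keycloak.example.com/realms/platform
      jwksUri: https://keycloak.example.com/realms/platform/protocol/openid-connect/certs
```

Note that RequestAuthentication on its own rejects only requests carrying an invalid token. Requests with no token at all still pass unless an AuthorizationPolicy additionally requires a request principal.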
Layer 3: Cluster Network (NetworkPolicies)
Default-deny ingress for all namespaces, with explicit allow rules:
| Namespace | Allowed Ingress From | Rationale |
|---|---|---|
| tenant-{name} | Istio IngressGateway only | All external traffic enters via mesh |
| platform (Keycloak) | All tenant namespaces | All services validate JWT |
| monitoring | All namespaces (metrics scrape) | Prometheus needs pod access |
| Within tenant namespace | Same namespace only | Services within one tenant can communicate |
# Default deny all ingress per tenant namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: tenant-agile-network
spec:
  podSelector: {}
  policyTypes:
    - Ingress
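The explicit allow rules from the table above might be expressed as follows. This is a sketch under assumptions: the IngressGateway is taken to run in `istio-system` with the label `istio: ingressgateway`, Prometheus is taken to run in a namespace named `monitoring`, and the `app: keycloak` pod label and the `platform.example.com/role: tenant` namespace label are hypothetical. The `kubernetes.io/metadata.name` label is set automatically on namespaces by current Kubernetes versions.

```yaml
# Allow traffic between pods of the same tenant namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-agile-network
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
---
# Allow the Istio IngressGateway and Prometheus scrapes into the tenant.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingressgateway-and-monitoring
  namespace: tenant-agile-network
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        # namespaceSelector + podSelector in ONE entry means both must match
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
          podSelector:
            matchLabels:
              istio: ingressgateway
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
---
# Allow all tenant namespaces to reach Keycloak in the platform namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tenants-to-keycloak
  namespace: platform
spec:
  podSelector:
    matchLabels:
      app: keycloak                          # hypothetical pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              platform.example.com/role: tenant   # hypothetical namespace label
```

NetworkPolicies are additive, so these allow rules combine with the default-deny policy: anything not matched by an allow rule stays blocked. They take effect only when the cluster's network plugin enforces NetworkPolicy.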
Layer 4: Storage Security (GCS Signed URLs)
| Current | Target |
|---|---|
| Public GCS buckets | Private buckets + signed URLs |
| Direct GCS URLs in database | Signed URL generation at read time |
| No expiry | URLs expire after configurable TTL (default 1 hour) |
Content delivery flow:
1. Frontend requests content via GraphQL
2. Service generates signed URL (GCS IAM)
3. Frontend receives time-limited URL
4. CDN caches signed URL response (not the signature)
CORS Hardening
| Current | Target |
|---|---|
| Access-Control-Allow-Origin: * | Explicit tenant domains only |
| No preflight caching | Access-Control-Max-Age: 3600 |
| All methods allowed | Only GET, POST, OPTIONS for GraphQL |
Allowed origins derived from values-globals.yaml per tenant:
- https://theagilenetwork.com
- https://nilgameplan.com
- https://vtnil.com
- https://speedofai.com
- Plus *.preview.app for staging/preview environments
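If CORS were enforced at the mesh rather than in each service, the target state could be sketched with an Istio VirtualService `corsPolicy`. Enforcing CORS in Istio is an assumption here (today it lives in service configuration), and the host `api.theagilenetwork.com`, gateway `istio-system/public-gateway`, destination `graphql-service`, port, and allowed headers are hypothetical.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: graphql-cors                      # hypothetical name
  namespace: tenant-agile-network
spec:
  hosts:
    - api.theagilenetwork.com             # hypothetical API host
  gateways:
    - istio-system/public-gateway         # hypothetical Gateway
  http:
    - corsPolicy:
        allowOrigins:
          - exact: https://theagilenetwork.com
          # Preview environments: any single-label subdomain of preview.app
          - regex: 'https://[a-z0-9-]+\.preview\.app'
        allowMethods:
          - GET
          - POST
          - OPTIONS
        allowHeaders:                     # hypothetical header list
          - authorization
          - content-type
        maxAge: 3600s                     # Access-Control-Max-Age: 3600
      route:
        - destination:
            host: graphql-service         # hypothetical backend Service
            port:
              number: 8080                # hypothetical port
```

Each tenant would get its own origin list generated from values-globals.yaml, so one tenant's domain is never an allowed origin for another tenant's API.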
Implementation Priority
- NetworkPolicies — prerequisite for shared cluster (ADR-002)
- CORS hardening — low effort, high impact
- mTLS STRICT — service mesh security
- Cloud Armor WAF — edge protection
- GCS signed URLs — storage security
- Secret rotation — operational hygiene
Hypothesis Background
Primary: Defense-in-depth network security with four layers provides adequate isolation for multi-tenant shared cluster operation.
- Evidence: No NetworkPolicies today — all pods communicate freely (L2)
- Evidence: CORS allows all origins (L2 — observed in service configurations)
- Evidence: Some GCS buckets have public access (L1 — observed in Terraform)
- Evidence: Shared cluster requires tenant isolation (L2 — architectural requirement from ADR-002)
Alternative: Keep cluster-per-tenant (no NetworkPolicy needed).
- Rejected for cost reasons (ADR-002). Cluster-per-tenant costs scale linearly with tenants.
Falsifiability Criteria
- If NetworkPolicies break >5% of service-to-service communication on first deployment → integration pattern documentation is incomplete
- If signed URLs add >200ms latency to content delivery → evaluate CDN-based signing (Cloud CDN signed URLs)
- If Cloud Armor blocks >1% of legitimate traffic → tune rules with 2-week learning mode
- If mTLS STRICT mode causes service connectivity failures → roll back to PERMISSIVE per affected service
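The per-service rollback in the last criterion can be done without touching the mesh-wide policy, because a workload-scoped PeerAuthentication takes precedence over the mesh-wide one. A minimal sketch, assuming the affected workload carries the hypothetical label `app: legacy-service`:

```yaml
# Temporarily accept both plaintext and mTLS for one workload only;
# the mesh-wide STRICT policy in istio-system stays in place.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: legacy-service-permissive     # hypothetical name
  namespace: tenant-agile-network
spec:
  selector:
    matchLabels:
      app: legacy-service             # hypothetical workload label
  mtls:
    mode: PERMISSIVE
```

Deleting this resource once the service is fixed returns the workload to STRICT.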
Evidence Quality
| Evidence | Assurance |
|---|---|
| No NetworkPolicies in any cluster | L2 (verified from Terraform/Helm) |
| CORS allows all origins | L2 (verified from service code) |
| Public GCS buckets | L1 (observed, not exhaustively audited) |
| Integration patterns (allow list) | L1 (documented in integration-patterns.md) |
| Cloud Armor effectiveness | L0 (not tested) |
Overall: L1 (weakest link, WLNK; capped by untested Cloud Armor configuration)
Bounded Validity
- Scope: All GKE clusters, all backend services, GCS storage.
- Expiry: Re-evaluate NetworkPolicy rules after shared cluster migration is complete.
- Review trigger: If multi-tenant isolation is deemed insufficient for compliance, or if Cloud Armor causes significant false positives.
- Monitoring: Track blocked requests (Cloud Armor), NetworkPolicy drops (Cilium/Calico metrics), mTLS handshake failures.
Consequences
Positive:
- Enables shared cluster multi-tenancy (ADR-002 prerequisite)
- Defense-in-depth: WAF + mesh + network + storage
- Eliminates public GCS bucket exposure
- CORS hardening prevents cross-origin attacks
- mTLS encrypts all service-to-service traffic
Negative:
- NetworkPolicy maintenance as services change
- Signed URL generation adds compute overhead
- Cloud Armor has per-request cost
- CORS changes may break preview/development environments initially
Decision date: 2026-02-01
Review by: 2026-08-01