ADR-018: API Strategy — GraphQL Federation & Schema Evolution
Status
Proposed — Pending engineering team review
Context
The platform exposes 24+ separate GraphQL gateways, one per service, behind Istio path-based routing (/api/{service}). Each service owns its own schema. There is no API versioning, no deprecation policy, no rate limiting, and no centralized schema registry. Frontend clients must know which service owns which data.
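Service consolidation (ADR-001) reduces the service count from ~35 to ~18, which in turn reduces the number of GraphQL endpoints. That makes this the right time to decide whether to unify the API surface, and whether to do it through federation or a simpler gateway.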
Current State
| Metric | Value |
|---|---|
| GraphQL gateways | 24+ (one per service) |
| API versioning | None |
| Schema registry | None |
| Rate limiting | None (Istio routing only) |
| Deprecation policy | Ad-hoc (@deprecated annotation, no enforcement) |
| Frontend API knowledge | Must know which service owns each query/mutation |
| Dead API calls | ~17% of frontend API calls target non-existent services |
Decision
Adopt GraphQL schema stitching via a lightweight API gateway, consolidating the 24+ separate endpoints into a single unified GraphQL endpoint. Use schema evolution best practices (additive-only changes, deprecation annotations with sunset dates) rather than versioning.
Why Schema Stitching (Not Full Federation)
| Factor | Schema Stitching | Apollo Federation |
|---|---|---|
| Complexity | Moderate: gateway merges schemas | High: requires Apollo Router and subgraph protocol changes |
| Code changes | Minimal: services keep existing GraphQL endpoints | Significant: each service needs @key and @external directives |
| Operational overhead | Single gateway service | Apollo Router + registry + composition pipeline |
| Cross-service references | Handled via client-side joins or gateway resolvers | Native entity references across subgraphs |
| Fit for our scale | 18 services, moderate query complexity | Enterprise scale, complex entity graphs |
| Existing patterns | Services already expose standalone GraphQL | Would require schema restructuring |
At 18 services with moderate query complexity, full Apollo Federation would be over-engineering. Schema stitching via a gateway (graphql-mesh, graphql-tools, or a custom Spring GraphQL gateway) provides a unified endpoint without restructuring every service.
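At its core, stitching merges every service's root fields into one schema and delegates each root field back to the service that owns it. The sketch below models only that merge step in plain TypeScript, with no GraphQL library. The `ServiceSchema` shape and the `buildRootFieldRoutes` function are hypothetical; a real gateway would rely on graphql-tools stitching or graphql-mesh for the actual merge and delegation. The sketch also shows where the conflicting type names listed under Consequences would surface.

```typescript
// Hypothetical, library-free model of one service's schema: only what the
// merge step needs (root field names and declared type names).
interface ServiceSchema {
  service: string;          // e.g. "identity-service"
  url: string;              // e.g. "/api/identity/graphql"
  queryFields: string[];    // root Query fields the service owns
  mutationFields: string[]; // root Mutation fields the service owns
  typeNames: string[];      // types the service declares
}

interface MergeResult {
  routes: Map<string, string>; // "Query.me" -> "/api/identity/graphql"
  conflicts: string[];         // collisions needing type merging or prefixing
}

// Merge all root fields into one routing table and record any root field
// or type name declared by more than one service.
function buildRootFieldRoutes(schemas: ServiceSchema[]): MergeResult {
  const routes = new Map<string, string>();
  const fieldOwner = new Map<string, string>();
  const typeOwner = new Map<string, string>();
  const conflicts: string[] = [];

  for (const s of schemas) {
    const rootFields = [
      ...s.queryFields.map((f) => `Query.${f}`),
      ...s.mutationFields.map((f) => `Mutation.${f}`),
    ];
    for (const field of rootFields) {
      const owner = fieldOwner.get(field);
      if (owner !== undefined) {
        // Keep the first owner's route and report the duplicate.
        conflicts.push(`${field} defined by ${owner} and ${s.service}`);
        continue;
      }
      fieldOwner.set(field, s.service);
      routes.set(field, s.url);
    }
    for (const typeName of s.typeNames) {
      const owner = typeOwner.get(typeName);
      if (owner !== undefined) {
        conflicts.push(`type ${typeName} declared by ${owner} and ${s.service}`);
      } else {
        typeOwner.set(typeName, s.service);
      }
    }
  }
  return { routes, conflicts };
}

// Example: two services that both declare a `User` type.
const merged = buildRootFieldRoutes([
  { service: "identity-service", url: "/api/identity/graphql",
    queryFields: ["me"], mutationFields: ["updateProfile"], typeNames: ["User"] },
  { service: "content-service", url: "/api/content/graphql",
    queryFields: ["video"], mutationFields: [], typeNames: ["Video", "User"] },
]);
console.log(merged.routes.get("Query.video")); // /api/content/graphql
console.log(merged.conflicts); // one conflict: the shared User type
```

Whether a shared type such as `User` should be merged across services or namespaced is the same choice behind the namespace-prefixing criterion in Falsifiability Criteria.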
Target Architecture
Frontend (Next.js / React Native)
│
└── Single GraphQL endpoint: /api/graphql
    │
    └── API Gateway (Spring GraphQL or graphql-mesh)
        ├── Rate limiting
        ├── Auth validation (Keycloak JWT)
        ├── Schema stitching (merges all service schemas)
        ├── Query complexity limiting
        ├── Deprecation enforcement
        │
        └── Routes to service endpoints:
            ├── identity-service   /api/identity/graphql
            ├── content-service    /api/content/graphql
            ├── payment-service    /api/payment/graphql
            ├── shoutout-service   /api/shoutout/graphql
            └── ... (18 total)
Frontend talks to ONE endpoint. Gateway routes to the correct service. Services keep their existing GraphQL schemas unchanged.
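The gateway's query complexity limiting step can reject an expensive request before it is forwarded to any service. A minimal sketch, assuming a simplified selection tree instead of a parsed GraphQL document and an assumed cost model (1 per field, with a hypothetical `listSize` multiplier for list fields); the ADR fixes only the limits (max depth 10, max complexity 500), not how complexity is scored. If the gateway is built on Spring GraphQL, graphql-java's `MaxQueryDepthInstrumentation` and `MaxQueryComplexityInstrumentation` provide comparable checks.

```typescript
// Simplified selection tree; a real gateway would walk the parsed query.
// listSize is a hypothetical multiplier for list fields (e.g. `first: 50`).
interface Selection {
  name: string;
  listSize?: number;
  children?: Selection[];
}

const MAX_DEPTH = 10;       // Rate Limiting table: max depth
const MAX_COMPLEXITY = 500; // Rate Limiting table: max complexity

// Depth of the deepest path; a top-level field has depth 1.
function depth(sel: Selection): number {
  const kids = sel.children ?? [];
  return kids.length === 0 ? 1 : 1 + Math.max(...kids.map(depth));
}

// Assumed cost model: each field costs 1, and a list field multiplies the
// cost of its sub-selections by listSize.
function complexity(sel: Selection): number {
  const kids = sel.children ?? [];
  const childCost = kids.reduce((sum, k) => sum + complexity(k), 0);
  return 1 + (sel.listSize ?? 1) * childCost;
}

// Returns a rejection reason, or null if the query may be forwarded.
function checkQuery(roots: Selection[]): string | null {
  const d = Math.max(0, ...roots.map(depth));
  if (d > MAX_DEPTH) return `depth ${d} exceeds ${MAX_DEPTH}`;
  const c = roots.reduce((sum, r) => sum + complexity(r), 0);
  if (c > MAX_COMPLEXITY) return `complexity ${c} exceeds ${MAX_COMPLEXITY}`;
  return null;
}

// { videos(first: 50) { title author { name } } }
const query: Selection[] = [
  { name: "videos", listSize: 50, children: [
    { name: "title" },
    { name: "author", children: [{ name: "name" }] },
  ] },
];
console.log(checkQuery(query)); // null (depth 3, complexity 151)
```

Raising the list size to 200 in this example yields complexity 601, which the check rejects; depth alone would not catch that kind of fan-out.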
Schema Evolution Policy
| Rule | Enforcement |
|---|---|
| Additive only | New fields, types, and queries are always safe. Never remove or rename a field without deprecating it first. |
| Deprecation annotations | @deprecated(reason: "Use X instead. Sunset: YYYY-MM-DD") |
| Sunset period | Minimum 90 days between deprecation and removal |
| Breaking change gate | CI checks schema diff — blocks removal of non-deprecated fields |
| Schema changelog | Auto-generated from schema diffs per PR |
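The breaking-change gate and the 90-day sunset period can be enforced by one CI check over the schema diff. A minimal sketch, assuming each schema snapshot has already been flattened into a map from field coordinate (e.g. `User.email`) to its `@deprecated` reason, or `null` if the field is not deprecated; the snapshot format and the `checkSchemaDiff` and `parseSunset` functions are hypothetical. Because the annotation carries only a sunset date, the sketch enforces the 90-day minimum when a field is first deprecated, then blocks removal until that date has passed.

```typescript
// Field coordinate (e.g. "User.email") -> @deprecated reason, or null if the
// field is not deprecated. Extracting this from SDL is out of scope here.
type SchemaSnapshot = Map<string, string | null>;

const MIN_SUNSET_DAYS = 90; // Schema Evolution Policy: minimum sunset period
const DAY_MS = 24 * 60 * 60 * 1000;

// Read the date from a reason such as "Use X instead. Sunset: 2026-06-01".
function parseSunset(reason: string): Date | null {
  const match = /Sunset:\s*(\d{4}-\d{2}-\d{2})/.exec(reason);
  if (match === null) return null;
  const date = new Date(`${match[1]}T00:00:00Z`);
  return Number.isNaN(date.getTime()) ? null : date;
}

// Returns violations; an empty list means the schema change may merge.
function checkSchemaDiff(before: SchemaSnapshot, after: SchemaSnapshot, today: Date): string[] {
  const violations: string[] = [];
  for (const [field, oldReason] of before) {
    const newReason = after.get(field);
    if (newReason === undefined) {
      // Removed (a rename appears as a removal plus an addition).
      if (oldReason === null) {
        violations.push(`${field}: removed without prior deprecation`);
        continue;
      }
      const sunset = parseSunset(oldReason);
      if (sunset === null) {
        violations.push(`${field}: removed, but its deprecation has no Sunset date`);
      } else if (today.getTime() < sunset.getTime()) {
        violations.push(`${field}: removed before its sunset date`);
      }
    } else if (oldReason === null && newReason !== null) {
      // Newly deprecated: the sunset date must be at least 90 days out.
      const sunset = parseSunset(newReason);
      if (sunset === null) {
        violations.push(`${field}: deprecation reason lacks "Sunset: YYYY-MM-DD"`);
      } else if (sunset.getTime() - today.getTime() < MIN_SUNSET_DAYS * DAY_MS) {
        violations.push(`${field}: sunset is less than ${MIN_SUNSET_DAYS} days away`);
      }
    }
  }
  return violations;
}

const now = new Date("2026-03-01T00:00:00Z");
const oldSnapshot = new Map<string, string | null>([
  ["User.email", null],
  ["User.legacyId", "Use User.id instead. Sunset: 2026-02-15"],
  ["Video.title", null],
]);
const newSnapshot = new Map<string, string | null>([
  ["User.email", "Use User.primaryEmail instead. Sunset: 2026-04-01"],
  ["Video.title", null],
]);
// User.legacyId is past its sunset, so removing it is allowed; the new
// deprecation of User.email gives only 31 days of notice and is flagged.
console.log(checkSchemaDiff(oldSnapshot, newSnapshot, now));
```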
Rate Limiting
| Tier | Limit | Applies To |
|---|---|---|
| Authenticated | 1000 req/min | Normal users |
| Admin | 5000 req/min | Admin dashboard |
| Webhook | No limit | Stripe/Mux webhook callbacks |
| Query complexity | Max depth 10, max complexity 500 | All queries |
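The per-tier limits map naturally onto a token bucket per caller. A minimal in-memory sketch, assuming the tier has already been derived from the validated Keycloak JWT and that limits apply per caller; both are assumptions, as are the `RateLimiter` class and its `allow` method. A gateway running several replicas would need shared state for the buckets so limits hold across instances.

```typescript
type Tier = "authenticated" | "admin" | "webhook";

// Requests per minute per tier, from the Rate Limiting table; null = no limit.
const LIMITS_PER_MINUTE: Record<Tier, number | null> = {
  authenticated: 1000,
  admin: 5000,
  webhook: null,
};

interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

// Hypothetical in-memory token bucket per (tier, caller): capacity equals the
// per-minute limit, and tokens refill continuously at limit / 60 000 per ms.
class RateLimiter {
  private readonly buckets = new Map<string, Bucket>();

  allow(callerId: string, tier: Tier, nowMs: number): boolean {
    const limit = LIMITS_PER_MINUTE[tier];
    if (limit === null) return true; // webhook callbacks are not limited

    const key = `${tier}:${callerId}`;
    const bucket = this.buckets.get(key) ?? { tokens: limit, lastRefillMs: nowMs };

    // Refill for the time elapsed since the last request, capped at capacity.
    const elapsedMs = Math.max(0, nowMs - bucket.lastRefillMs);
    bucket.tokens = Math.min(limit, bucket.tokens + (elapsedMs * limit) / 60_000);
    bucket.lastRefillMs = nowMs;

    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(key, bucket);
    return allowed;
  }
}

// 1001 requests from one user within the same millisecond.
const limiter = new RateLimiter();
let accepted = 0;
for (let i = 0; i < 1001; i++) {
  if (limiter.allow("user-42", "authenticated", 0)) accepted++;
}
console.log(accepted); // 1000: the 1001st request is rejected
```

A token bucket allows a short burst up to the full per-minute allowance and then smooths traffic to the steady rate; a fixed one-minute window would be simpler but permits up to twice the limit across a window boundary.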
Hypothesis Background
Primary: A unified GraphQL gateway with schema stitching provides a better developer experience and operational model than 24+ separate endpoints, without the complexity of full federation.
- Evidence: 24+ separate GraphQL gateways today — frontend must know service topology (L2)
- Evidence: ~17% of frontend API calls target non-existent services — no centralized contract validation (L2)
- Evidence: Service consolidation reduces endpoints from 24+ to ~18 — natural simplification point (L1)
- Evidence: Schema stitching is a well-documented pattern in graphql-tools (L1)
Alternative 1: Apollo Federation.
- Not rejected permanently; re-evaluate if cross-service entity references become common (e.g., “load user’s purchases and their content in a single query”). Currently, queries are mostly scoped to a single service.
Alternative 2: Keep separate endpoints (status quo with fewer services).
- Viable, but misses the opportunity to simplify the frontend. After consolidation, 18 endpoints are manageable, but a single endpoint gives better DX.
Falsifiability Criteria
- If schema stitching introduces >100ms latency overhead per query → optimize or evaluate Apollo Router
- If cross-service entity references are needed in >30% of queries → migrate to Apollo Federation
- If the gateway becomes a single point of failure → add redundancy and circuit breakers
- If schema merge conflicts between services are frequent → enforce namespace prefixing
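Two of these criteria are numeric, so once benchmarks exist (gateway latency is still L0 evidence) they can be checked automatically. A minimal sketch; the `Measurements` input, the use of nearest-rank p95 to approximate "latency overhead per query", and how a query is classified as cross-service are all assumptions, since the ADR does not define them.

```typescript
// Hypothetical benchmark / query-log inputs; the ADR does not define them.
interface Measurements {
  directLatenciesMs: number[];  // queries sent straight to the owning service
  gatewayLatenciesMs: number[]; // the same queries sent through /api/graphql
  totalQueries: number;
  crossServiceQueries: number;  // queries needing cross-service entity references
}

const MAX_OVERHEAD_MS = 100;         // criterion: >100ms overhead per query
const MAX_CROSS_SERVICE_RATIO = 0.3; // criterion: >30% of queries

// Nearest-rank percentile, p in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Returns the criteria that fire, each with the action the ADR prescribes.
function evaluateCriteria(m: Measurements): string[] {
  const fired: string[] = [];
  // Overhead approximated as the difference between the two p95 latencies.
  const overheadMs = percentile(m.gatewayLatenciesMs, 95) - percentile(m.directLatenciesMs, 95);
  if (overheadMs > MAX_OVERHEAD_MS) {
    fired.push(`p95 overhead ${overheadMs}ms: optimize or evaluate Apollo Router`);
  }
  const ratio = m.totalQueries === 0 ? 0 : m.crossServiceQueries / m.totalQueries;
  if (ratio > MAX_CROSS_SERVICE_RATIO) {
    fired.push(`${(ratio * 100).toFixed(1)}% cross-service queries: migrate to Apollo Federation`);
  }
  return fired;
}

console.log(evaluateCriteria({
  directLatenciesMs: [20, 25, 30, 40],
  gatewayLatenciesMs: [35, 40, 45, 60],
  totalQueries: 1000,
  crossServiceQueries: 120,
})); // [] (p95 overhead 20ms, 12% cross-service)
```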
Evidence Quality
| Evidence | Assurance |
|---|---|
| 24+ separate GraphQL endpoints | L2 (verified from service catalog) |
| ~17% dead frontend API calls | L2 (verified from code analysis) |
| No API versioning | L2 (verified — no version annotations found) |
| Schema stitching works at our scale | L1 (documented pattern, not tested) |
| Gateway latency overhead | L0 (needs benchmarking) |
Overall: L1 (WLNK capped by untested gateway performance)
Bounded Validity
- Scope: All backend GraphQL services. Does not cover REST endpoints (Stripe webhooks, file uploads).
- Expiry: Re-evaluate if query patterns shift toward heavy cross-service joins.
- Review trigger: gateway latency becomes unacceptable, or federation would significantly reduce frontend complexity.
Consequences
Positive:
- Single GraphQL endpoint for the frontend, which simplifies client code
- Centralized rate limiting, auth validation, and schema enforcement
- Schema evolution policy prevents breaking changes without a deprecation period
- Dead API calls surface immediately (the gateway validates every query against the merged schema), so they can be removed
- Foundation for API analytics and monitoring
Negative:
- Gateway is a new service to operate and scale
- Schema stitching can have edge cases with conflicting type names
- Additional network hop for every query
- Gateway must be updated when services add or change schemas
Decision date: 2026-02-01
Review by: 2026-08-01