ADR-010: Passwordless-First Authentication & Authorization Simplification
Status
Proposed — Pending engineering team review
Context
Authentication and authorization are among the most complex and tightly coupled parts of the platform. Passwordless authentication is already the primary login method: Magic Link (email/SMS) is how fans and experts access the platform. However, this critical capability is buried inside a custom Keycloak SPI, which makes it fragile, hard to extend, and coupled to the Keycloak version. The #1 priority is to make passwordless authentication a first-class, standalone capability that can evolve independently and expand to modern standards (passkeys, WebAuthn, biometrics).
Keycloak 26.3.2 serves as the identity hub for all 28+ production services across 4 tenants. The current setup includes:
Current Auth Architecture
| Component | Detail | Complexity |
|---|---|---|
| Keycloak instances | identityx-26 (4 tenants) + identityx-25 (1 tenant, legacy) | 5 instances |
| Custom SPIs | Magic Link (passwordless email/SMS), Session Restrictor | Custom Java code in Keycloak Docker image |
| CIB Seven plugin | Keycloak → Camunda identity sync | EOL dependency |
| Realms | 1 per tenant (agilenetwork, nilgameplan, vtnil, speedofai) | 4+ realms |
| Client configs | Multiple per realm (fan, celeb, admin) | ~12+ client configs |
| Token validation | Every service validates JWT via issuer-uri | 28+ services dependent |
| User types | Expert, Fan, Handler, Admin, Organization | 5 roles across multiple realms |
| Theme files | celeb-keycloak-theme, fan-keycloak-theme | 2 custom Freemarker themes |
| Users service | Bridge to Keycloak Admin API (role/group management) | Separate microservice |
| Magic Link delivery | Keycloak SPI → Twilio (SMS), Keycloak SPI → email | Custom delivery pipeline |
Passwordless: Current State & Limitations
The platform already relies on passwordless as its primary authentication method, but the implementation is constrained:
| Aspect | Current State | Limitation |
|---|---|---|
| Magic Link (email) | Custom Keycloak SPI | Tightly coupled to Keycloak version; must rebuild SPI for every upgrade |
| Magic Link (SMS) | Custom Keycloak SPI → Twilio | Same version coupling; SMS delivery is inside Keycloak Docker container |
| Passkeys / WebAuthn | Not supported | Cannot add without building another custom Keycloak SPI |
| Biometric auth | Not supported | No integration path from mobile apps |
| Social login | Not implemented | Keycloak supports it natively but not configured |
| Token lifetime | Keycloak defaults | No Magic Link-specific session management |
| Deep linking (mobile) | Not implemented | Magic Link opens web browser, not native app |
| Rate limiting | None | No protection against Magic Link abuse |
| Passwordless analytics | None | No visibility into delivery rates, click-through, login success |
The core problem: Passwordless is the platform’s competitive UX advantage (no passwords = lower friction for fans) but it’s trapped inside the most fragile part of the stack (custom Keycloak SPIs). Every Keycloak upgrade puts passwordless at risk.
Other Auth Problems
| Problem | Impact |
|---|---|
| CIB Seven plugin is EOL | Camunda 7 CE support ended. The Keycloak identity sync plugin has no upgrade path. |
| Duplicate Keycloak instances | identityx-25 still running on agilenetwork tenant alongside identityx-26. Operational overhead. |
| Per-tenant Keycloak deployment | Each tenant gets a separate Keycloak instance with separate realm, config, themes, and SPIs. Config drift between tenants. |
| users service is a thin wrapper | The users service exists mainly to call Keycloak Admin API. GraphQL operations like updateKeycloakUser, createRole, addRoleToUser are pass-throughs. |
| 28+ services depend on JWT format | Any change to token claims, issuer URL, or signing keys requires coordinated update across all services. |
| Email verification has dual paths | Deprecated email verification API in email service + Keycloak-native verification. Confusing. |
| No centralized authorization | Each service independently checks JWT roles. No policy-as-code or centralized authorization rules. |
Decision
Simplify authentication and authorization through four targeted changes, with passwordless authentication as the #1 priority: build a standalone passwordless authentication service, consolidate to a single Keycloak instance with a shared realm, eliminate the users service, and adopt centralized authorization policies.
Change 1: Passwordless Authentication Service (TOP PRIORITY)
Current: Magic Link is a custom Keycloak SPI (Java code compiled into Keycloak Docker image). Tightly coupled to Keycloak internals. Must be rebuilt for every Keycloak version. Cannot be extended with modern passwordless methods.
Target: A standalone passwordless-auth-service that owns all passwordless authentication methods and works alongside Keycloak. This is not just “extracting Magic Link” — it’s building a passwordless platform that can grow.
Passwordless Method Roadmap
| Method | Priority | Phase | Description |
|---|---|---|---|
| Magic Link (email) | P0 | MVP | Existing flow, extracted from Keycloak SPI |
| Magic Link (SMS) | P0 | MVP | Existing flow, extracted from Keycloak SPI |
| Mobile deep linking | P0 | MVP | Magic Links open native app directly (ADR-009) |
| Passkeys / WebAuthn | P1 | Phase 2 | FIDO2 passwordless — browser/device biometric auth |
| Biometric unlock | P1 | Phase 2 | Face ID / Touch ID for returning mobile users |
| Social login (Google/Apple) | P2 | Phase 3 | OAuth2 via Keycloak identity brokering |
| One-time passcode (OTP) | P2 | Phase 3 | 6-digit code via SMS/email as Magic Link alternative |
| QR code login | P3 | Future | Scan from mobile to log into web session |
Architecture: Passwordless Auth Service
sequenceDiagram
participant User
participant App as Web/Mobile App
participant PAS as Passwordless Auth Service
participant KC as Keycloak
participant SMS as Twilio
participant Email as Email Service
participant Redis as Redis (tokens)
Note over PAS: MVP: Magic Link
User->>App: Request passwordless login
App->>PAS: POST /auth/passwordless {method: "magic-link", email/phone}
PAS->>PAS: Rate limit check (per email/phone)
PAS->>Redis: Store one-time token (15min TTL)
alt Email
PAS->>Email: Send Magic Link email
else SMS
PAS->>SMS: Send Magic Link SMS via Twilio
end
User->>App: Click magic link / deep link
App->>PAS: POST /auth/passwordless/verify {token}
PAS->>Redis: Validate + consume token
PAS->>KC: Request tokens for verified user (token exchange)
KC->>PAS: Access token + refresh token
PAS->>App: Return Keycloak tokens
App->>App: Store tokens, proceed as authenticated
Note over PAS: Phase 2: Passkeys
User->>App: Register passkey (WebAuthn)
App->>PAS: POST /auth/passkey/register {attestation}
PAS->>PAS: Store public key credential
User->>App: Login with passkey
App->>PAS: POST /auth/passkey/authenticate {assertion}
PAS->>PAS: Verify signature against stored credential
PAS->>KC: Admin API: authenticate user
KC->>PAS: Tokens
PAS->>App: Return Keycloak tokens
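The "store one-time token" and "validate + consume token" steps in the diagram above can be sketched as follows. This is a minimal illustration, not the service's implementation: the class and method names are hypothetical, and an in-memory map stands in for Redis so the example runs on its own. The point it shows is that validation and consumption happen in a single atomic step, so a replayed link finds nothing to redeem.

```java
import java.security.SecureRandom;
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Base64;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the Redis token store; all names here are hypothetical. */
class MagicLinkTokenStore {
    private record Entry(String subject, Instant expiresAt) {}

    private final Map<String, Entry> tokens = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();
    private final Duration ttl;
    private final Clock clock;

    MagicLinkTokenStore(Duration ttl, Clock clock) {
        this.ttl = ttl;
        this.clock = clock;
    }

    /** Issues an unguessable, URL-safe token bound to an email address or phone number. */
    String issue(String subject) {
        byte[] bytes = new byte[32];
        random.nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        tokens.put(token, new Entry(subject, clock.instant().plus(ttl)));
        return token;
    }

    /** Validates and consumes in one step, so a token can be redeemed at most once. */
    Optional<String> consume(String token) {
        if (token == null) {
            return Optional.empty();
        }
        Entry entry = tokens.remove(token); // atomic removal: a replay finds nothing
        if (entry == null || !clock.instant().isBefore(entry.expiresAt())) {
            return Optional.empty(); // unknown, already used, or expired
        }
        return Optional.of(entry.subject());
    }
}
```

With Redis as the backing store, the same single-use property would need an atomic read-and-delete (for example the GETDEL command, available from Redis 6.2) rather than a separate GET followed by DEL.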
Passwordless Auth Service Capabilities
| Capability | Detail |
|---|---|
| Multi-method support | Pluggable authentication methods behind a unified API |
| Rate limiting | Per-email/phone rate limits to prevent abuse (e.g., max 5 Magic Links per email per hour) |
| Deep linking | Mobile Magic Links use theagilenetwork://auth/verify?token=xxx for native app launch |
| Universal Links | iOS Universal Links + Android App Links for seamless web-to-app handoff |
| Delivery tracking | Track Magic Link send, delivery, click, and login success rates |
| Fallback chain | If SMS fails, auto-fallback to email; if passkey fails, offer Magic Link |
| Session management | Configurable session durations per auth method (passkey = longer, Magic Link = shorter) |
| Device trust | Remember trusted devices to reduce re-authentication frequency |
| Branding | Per-tenant email/SMS templates without Keycloak Freemarker themes |
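As an illustration of the rate-limiting capability in the table above, the sketch below implements a sliding-window limit per recipient, using the table's example of 5 Magic Links per hour. The class name and method signature are assumptions made for this example, and state is kept in memory; a multi-instance deployment would need shared state (for instance in Redis).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Sliding-window rate limiter keyed by email address or phone number (hypothetical sketch). */
class MagicLinkRateLimiter {
    private final int maxRequests;
    private final Duration window;
    private final Map<String, Deque<Instant>> sendTimes = new HashMap<>();

    MagicLinkRateLimiter(int maxRequests, Duration window) {
        this.maxRequests = maxRequests;
        this.window = window;
    }

    /** Records a send and returns true if the recipient is still under the limit at time now. */
    synchronized boolean tryAcquire(String recipient, Instant now) {
        Deque<Instant> times = sendTimes.computeIfAbsent(recipient, key -> new ArrayDeque<>());
        Instant cutoff = now.minus(window);
        // Drop sends that have fallen out of the window.
        while (!times.isEmpty() && !times.peekFirst().isAfter(cutoff)) {
            times.pollFirst();
        }
        if (times.size() >= maxRequests) {
            return false; // limit reached: do not send another link
        }
        times.addLast(now);
        return true;
    }
}
```

Passing the current time as a parameter keeps the limiter deterministic under test; a caller would typically pass Instant.now().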
Passwordless Auth Service Stack
- Java 21 / Spring Boot 3.x (same as all other services, uses core-lib patterns)
- Redis for token storage (15-minute TTL for Magic Links, longer for passkey challenges)
- Twilio for SMS delivery
- Email service for email delivery
- WebAuthn4J for passkey/FIDO2 support (Phase 2)
- Keycloak Admin API for token exchange
- PostgreSQL for passkey credential storage (Phase 2)
Mobile Passwordless Flow (ADR-009 Integration)
sequenceDiagram
participant User
participant RN as React Native App
participant PAS as Passwordless Auth Service
participant KC as Keycloak
participant SS as SecureStore
Note over User,SS: First Login (Magic Link)
User->>RN: Tap "Sign In"
RN->>PAS: POST /auth/passwordless {method: "magic-link", email}
PAS-->>User: Email with deep link
User->>RN: Tap deep link (Universal Link)
RN->>PAS: POST /auth/passwordless/verify {token}
PAS->>KC: Authenticate user
KC->>PAS: Tokens
PAS->>RN: Return tokens
RN->>SS: Store tokens + enable biometric
Note over User,SS: Returning User (Biometric)
User->>RN: Open app
RN->>RN: Check biometric enrollment
RN->>User: Face ID / Touch ID prompt
User->>RN: Biometric success
RN->>SS: Retrieve refresh token
RN->>KC: Refresh token exchange
KC->>RN: New access token
RN->>RN: Authenticated session
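The deep link sent in the first-login flow above could be built along these lines. This is a sketch under stated assumptions: the custom scheme theagilenetwork://auth/verify comes from the capabilities table, while the class name, the HTTPS path, and the host parameter for Universal Links / App Links are placeholders invented for the example.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds Magic Link URLs for mobile clients (hypothetical helper). */
class MagicLinkUrls {
    /** Custom-scheme link that launches the native app directly. */
    static URI customScheme(String token) {
        return URI.create("theagilenetwork://auth/verify?token=" + encode(token));
    }

    /** HTTPS link for iOS Universal Links / Android App Links; host and path are assumptions. */
    static URI universalLink(String host, String token) {
        return URI.create("https://" + host + "/auth/verify?token=" + encode(token));
    }

    // Form-encodes the token so reserved characters cannot break the query string.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

An HTTPS link has the practical advantage that it still opens a web page when the app is not installed, which fits the fallback web flow mentioned in the falsifiability criteria.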
Why passwordless is #1 priority: Every user hits authentication on every session. A poor passwordless experience directly impacts engagement, conversion, and retention. Extracting this from Keycloak SPI unblocks: Keycloak upgrades, passkey support, mobile deep linking, and delivery analytics — all of which are blocked today.
Change 2: Single Keycloak Instance with Multi-Tenant Realm
Current: Separate Keycloak instances per tenant, each with its own realm.
Target: Single Keycloak instance (regional HA) with a shared realm and tenant-scoped groups.
graph TB
subgraph "Current: Per-Tenant Keycloak"
KC1[Keycloak<br/>agilenetwork realm]
KC2[Keycloak<br/>nilgameplan realm]
KC3[Keycloak<br/>vtnil realm]
KC4[Keycloak<br/>speedofai realm]
end
subgraph "Target: Shared Keycloak"
KC[Single Keycloak 26.x<br/>Regional HA]
R1[Realm: platform]
G1[Group: agilenetwork]
G2[Group: nilgameplan]
G3[Group: vtnil]
G4[Group: speedofai]
KC --> R1
R1 --> G1
R1 --> G2
R1 --> G3
R1 --> G4
end
Benefits:
- Single Keycloak to maintain, upgrade, and monitor
- Users can exist across tenants (future: single login for multiple brands)
- Client configs managed once, scoped by group
- Zero custom SPIs (passwordless externalized in Change 1)
- Simpler Terraform (1 instance, not 4+)
Tenant isolation: Keycloak groups + custom claims in JWT. Token includes tenant: "agilenetwork". Services validate tenant claim matches request context. Keycloak authorization services enforce tenant-scoped access.
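A service-side check that the tenant claim matches the request context might look like the sketch below. It assumes the JWT has already been validated and decoded into a claims map, and that the request's tenant has been resolved elsewhere (for example from the x-tenant-id header used in the policy example later in this ADR). The class and method names are hypothetical.

```java
import java.util.Map;

/** Compares the token's tenant claim with the tenant of the current request (hypothetical). */
class TenantGuard {
    /**
     * Returns true only when the token carries a non-blank string tenant claim
     * that equals the tenant resolved for the request. Missing or malformed
     * claims are treated as a mismatch, so the check fails closed.
     */
    static boolean tenantMatches(Map<String, ?> claims, String requestTenant) {
        Object claim = claims.get("tenant");
        return claim instanceof String tenant
                && requestTenant != null
                && !tenant.isBlank()
                && tenant.equals(requestTenant);
    }
}
```

Failing closed matters here because a shared realm removes the physical separation that per-tenant instances used to provide.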
Migration approach:
1. Export realm configs from all 4 Keycloak instances
2. Create unified platform realm with tenant groups
3. Import users into unified realm (Keycloak user IDs are UUIDs; preserve them)
4. Update Istio routing to point to single Keycloak
5. Update all service issuer-uri configs
6. Run parallel for validation period
7. Decommission per-tenant instances
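During the parallel-run period, one simple validation is to confirm that every user ID exported from the source realms exists unchanged in the unified realm. The sketch below assumes the IDs have already been extracted into plain sets (how they are extracted is out of scope here); the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Checks that user IDs survived the move into the unified realm (hypothetical helper). */
class MigrationCheck {
    /** Returns the IDs present in any source realm export but absent from the unified realm. */
    static Set<String> missingIds(Map<String, Set<String>> idsBySourceRealm, Set<String> unifiedRealmIds) {
        Set<String> missing = new TreeSet<>(); // sorted for stable, diff-friendly output
        for (Set<String> ids : idsBySourceRealm.values()) {
            for (String id : ids) {
                if (!unifiedRealmIds.contains(id)) {
                    missing.add(id);
                }
            }
        }
        return missing;
    }
}
```

An empty result is a necessary condition for step 7, not a sufficient one: it says nothing about roles, group membership, or sessions.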
Change 3: Eliminate Users Service
Current: The users service is a GraphQL wrapper around the Keycloak Admin API. Operations like createRole, addRoleToUser, searchUsers are direct pass-throughs.
Target: Absorb users service functionality into the identity-service (ADR-001) and the passwordless-auth-service (Change 1).
| Current users Service Operation | Target |
|---|---|
| magicEmailLogin, magicSmsCode, magicSmsLogin | → passwordless-auth-service (Magic Link methods) |
| updateKeycloakUser, searchUsers, whoAmI | → identity-service (Keycloak Admin API calls) |
| createRole, deleteRole, addRoleToUser, removeRoleFromUser | → identity-service (admin operations) |
| createGroup, deleteGroup, addUserToGroup, removeUserFromGroup | → identity-service (admin operations) |
| verifyEmail, verifyPhone | → Keycloak native verification (complete migration from deprecated email service API) |
Benefit: Eliminates an entire microservice. The identity-service (celebrity + fan + users merged per ADR-001) handles all identity operations. Magic Link operations move to the passwordless-auth-service from Change 1 rather than into the identity-service because they have a distinct lifecycle (stateless, Redis-backed, independently scalable).
Change 4: Centralized Authorization with OPA/Cedar
Current: Each service independently parses JWT claims and checks roles in application code. Authorization logic is scattered across 28+ services.
Target: Centralized authorization policy engine using OPA (Open Policy Agent) or Cedar, evaluated at the Istio sidecar level.
graph LR
subgraph "Request Flow"
CLIENT[Client<br/>JWT Bearer token]
ISTIO[Istio Sidecar<br/>+ ExtAuthz filter]
OPA[OPA / Cedar<br/>Policy Engine]
SVC[Backend Service]
end
CLIENT -->|Request + JWT| ISTIO
ISTIO -->|AuthZ check| OPA
OPA -->|Allow/Deny| ISTIO
ISTIO -->|Allowed request| SVC
Policy Example (OPA/Rego):
package authz

import rego.v1

default allow := false

# Tenant isolation is a precondition of every allow rule below. As a
# standalone allow rule it would permit any request whose tenant matches,
# because multiple allow rules are combined with OR.
tenant_matches if input.token.tenant == input.headers["x-tenant-id"]

# Fans can view content
allow if {
    tenant_matches
    input.method == "POST"
    input.path == "/api/content/graphql"
    input.token.realm_access.roles[_] == "fan"
    input.body.operationName == "queryContent"
}

# Experts can manage their own profile
allow if {
    tenant_matches
    input.path == "/api/celebrity/graphql"
    input.token.realm_access.roles[_] == "celebrity"
    input.token.sub == input.body.variables.celebrityUserId
}

# Admins can access admin endpoints
allow if {
    tenant_matches
    input.token.realm_access.roles[_] == "admin"
}
Benefits:
- Authorization logic defined in one place (policy files), not scattered across 28+ services
- Policies can be tested independently (OPA has a built-in test framework)
- Tenant isolation enforced at the mesh level, not in application code
- Services become simpler: they trust that requests reaching them are authorized
- Policy changes don’t require service redeployment
Implementation approach:
1. Deploy OPA as an Istio ExternalAuthorization provider
2. Start with coarse-grained policies (role-based, tenant-scoped)
3. Gradually move fine-grained authorization from services to OPA
4. Services keep JWT validation as defense-in-depth but delegate authorization
Combined Architecture
graph TB
subgraph "Passwordless Layer (Priority #1)"
PAS[Passwordless Auth Service<br/>Magic Link + Passkeys + Biometric<br/>Spring Boot + Redis]
end
subgraph "Identity Provider"
KC[Single Keycloak 26.x<br/>Multi-tenant realm<br/>Regional HA]
end
subgraph "Authorization Layer"
OPA[OPA Policy Engine<br/>Istio ExtAuthz]
end
subgraph "Identity Domain (ADR-001)"
ID[identity-service<br/>celebrity + fan + users merged<br/>Profile management + Keycloak Admin API]
end
subgraph "All Other Services"
SVC1[payment-service]
SVC2[content-service]
SVC3[notification-service]
SVCN[... other services]
end
subgraph "Clients"
WEB[Next.js Web]
MOB[React Native Mobile]
end
WEB -->|Magic Link / Passkey| PAS
MOB -->|Magic Link deep link / Biometric| PAS
WEB -->|OAuth2 PKCE fallback| KC
MOB -->|OAuth2 PKCE fallback| KC
PAS -->|Token exchange| KC
WEB -->|GraphQL + JWT| OPA
MOB -->|GraphQL + JWT| OPA
OPA -->|Authorized| SVC1
OPA -->|Authorized| SVC2
OPA -->|Authorized| SVC3
OPA -->|Authorized| SVCN
OPA -->|Authorized| ID
ID -->|Admin API| KC
Simplification Summary
| Metric | Current | After Simplification |
|---|---|---|
| Passwordless methods supported | 1 (Magic Link only, locked in SPI) | 4+ (Magic Link, Passkeys, Biometric, OTP) |
| Passwordless extensibility | Requires Keycloak SPI rebuild | Standard REST API, independently deployable |
| Keycloak instances | 5 (4 prod + 1 legacy) | 1 (regional HA) |
| Custom Keycloak SPIs | 2 (Magic Link, Session Restrictor) | 0 (externalized to passwordless-auth-service) |
| CIB Seven plugin | 1 (EOL) | 0 (eliminated with BPM replacement) |
| Auth-related services | 4 (Keycloak, users, celebrity, fan) | 2 (Keycloak, passwordless-auth-service) + identity-service |
| Authorization logic locations | 28+ services (scattered) | 1 (OPA policy files) |
| Keycloak themes | 2 repos (Freemarker) | 1 unified theme (or Keycloak 26 built-in theming) |
| Email verification paths | 2 (deprecated email API + Keycloak native) | 1 (Keycloak native only) |
| Tenant Keycloak configs | 4+ separate realm configs | 1 unified realm with tenant groups |
| Mobile auth deep linking | Not supported | Universal Links + App Links for native app launch |
| Auth delivery analytics | None | Full funnel: send → deliver → click → login |
Hypothesis Background
Primary: Building a standalone passwordless authentication service that decouples passwordless flows from Keycloak SPIs will unlock passkey/WebAuthn support, mobile deep linking, and delivery analytics — while reducing auth complexity by ~60% through Keycloak consolidation, users service elimination, and centralized authorization.
- Evidence: Passwordless (Magic Link) is the primary login method for all users today, but it’s trapped in a custom Keycloak SPI that must be rebuilt for every Keycloak upgrade.
- Evidence: Modern passwordless standards (FIDO2/WebAuthn, passkeys) cannot be added without building yet another custom Keycloak SPI. Externalizing passwordless to a standard service makes these methods additive, not invasive.
- Evidence: H11 (L2) confirms all tenant differentiation is config-only. There’s no reason for separate Keycloak instances per tenant — a shared realm with tenant groups achieves the same isolation with less operational overhead.
- Evidence: The users service is a thin wrapper (15 GraphQL operations, most are Keycloak Admin API pass-throughs). Eliminating it removes a deployment, CI pipeline, Helm chart, and monitoring surface.
Alternative 1: Replace Keycloak with Auth0/Okta.
- Rejected: Auth0 has native passwordless but at significant cost per MAU, and would require migrating all client configs, updating 28+ service issuer-uri configs, and reimplementing theme customization. Keycloak 26.x is mature and actively maintained. The problem isn’t Keycloak; it’s how passwordless is coupled to it via custom SPIs.
Alternative 2: Keep Magic Link as Keycloak SPI, add passkeys as another SPI.
- Rejected: This doubles down on the coupling problem. Every Keycloak upgrade would now require rebuilding two custom SPIs. The SPI lifecycle (compile → build Docker image → deploy) is far heavier than a standalone service deployment.
Alternative 3: Use Keycloak’s built-in WebAuthn support.
- Partially viable but insufficient: Keycloak 26.x has WebAuthn authenticator support, but it’s configured through the admin UI and authentication flows, not through APIs that the mobile app can call directly. A standalone passwordless service gives full API control for both web and React Native clients.
Alternative 4: Skip OPA, keep per-service authorization.
- Not fully rejected; can be deferred. OPA adds infrastructure complexity. However, with 18 consolidated services (ADR-001) all needing tenant isolation (ADR-004), centralizing authorization prevents repeating the same tenant-check logic in every service.
Falsifiability Criteria
Passwordless-specific:
- If passwordless-auth-service adds >1s latency to Magic Link login flow compared to current Keycloak SPI → optimize token exchange or implement caching
- If Magic Link delivery rates drop below current levels after migration → investigate email/SMS delivery pipeline differences
- If WebAuthn4J passkey registration fails on >5% of target devices (iOS 16+, Android 9+, Chrome 100+) → evaluate alternative FIDO2 libraries
- If mobile deep linking (Universal Links / App Links) fails to open native app >10% of the time → implement fallback web flow with app banner
- If passwordless-auth-service cannot handle peak authentication load (measure: >500 concurrent logins) → scale horizontally or add connection pooling
Infrastructure-specific:
- If single Keycloak realm with tenant groups cannot enforce sufficient isolation (users seeing cross-tenant data) → revert to per-tenant realms within single instance
- If OPA policy evaluation adds >50ms to request latency → simplify policies or move to application-level authorization
- If migrating 4 realm configs to 1 unified realm loses user data or breaks existing sessions → extend parallel-run period
- If React Native / Next.js Keycloak PKCE integration proves unreliable → implement BFF (Backend-for-Frontend) token exchange
Evidence Quality
| Evidence | Assurance |
|---|---|
| Magic Link is primary login method | L2 (verified — custom SPI in Keycloak Docker image) |
| Magic Link SPI is version-coupled | L1 (custom Java code, must rebuild per Keycloak version) |
| Passkeys/WebAuthn cannot be added via current SPI | L1 (Keycloak SPI architecture limits, no existing passkey SPI) |
| WebAuthn4J library maturity | L1 (open-source, actively maintained, used by Spring Security) |
| Multi-tenant is config-only (H11) | L2 (verified across all domains) |
| CIB Seven plugin is EOL | L2 (Camunda 7 CE support ended Oct 2025) |
| users service is thin wrapper | L1 (15 operations, mostly Keycloak Admin API) |
| Dual Keycloak instances running | L2 (identityx-25 + identityx-26 verified) |
| OPA/Istio ExtAuthz integration | L1 (Istio documentation, community adoption) |
| Keycloak multi-tenant realm feasibility | L1 (Keycloak documentation, group-based isolation) |
| User migration preserves IDs | L0 (Keycloak export/import should preserve UUIDs — needs testing) |
| Passkey device compatibility | L0 (need to verify target user device distribution) |
| Magic Link delivery rate baseline | L0 (no current metrics on send/deliver/click rates) |
Overall: L1 (WLNK capped by user migration testing L0 and passwordless delivery baseline L0)
Bounded Validity
- Scope: All authentication and authorization across all tenants and services. Affects every service that validates JWT tokens.
- Expiry: Re-evaluate if Keycloak introduces breaking changes to realm/group model, or if the platform needs federating with external identity providers (e.g., enterprise SSO for organization tenants).
- Review trigger: If auth-related incidents increase after consolidation, or if OPA policy complexity exceeds what the team can maintain.
- Monitoring: Passwordless login funnel (send → deliver → click → login success rate), auth latency (login flow p99), passkey registration/auth success rate, token validation latency, OPA decision latency, failed auth rate, deep link open rate (mobile).
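The passwordless login funnel listed under Monitoring reduces to a few ratios over raw event counts. The sketch below shows one way to compute them; the class name and the rate labels are assumptions for illustration, and no baseline figures are implied, since the ADR notes that no delivery metrics exist today.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Computes stage-to-stage conversion for the send → deliver → click → login funnel (hypothetical). */
class LoginFunnel {
    /** Returns each stage's conversion relative to the previous stage, in percent, plus the end-to-end rate. */
    static Map<String, Double> stageRates(long sent, long delivered, long clicked, long loggedIn) {
        Map<String, Double> rates = new LinkedHashMap<>(); // keeps funnel order when printed
        rates.put("delivered/sent", percent(delivered, sent));
        rates.put("clicked/delivered", percent(clicked, delivered));
        rates.put("login/clicked", percent(loggedIn, clicked));
        rates.put("login/sent", percent(loggedIn, sent));
        return rates;
    }

    // A stage with no input events reports 0% instead of dividing by zero.
    private static double percent(long numerator, long denominator) {
        return denominator == 0 ? 0.0 : 100.0 * numerator / denominator;
    }
}
```

Reporting per-stage rates rather than only the end-to-end rate makes it possible to tell a delivery problem (low delivered/sent) apart from a UX problem (low login/clicked).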
Consequences
Positive:
- Passwordless becomes a first-class, independently deployable capability
- Path to passkeys, biometrics, and modern passwordless standards without Keycloak SPI changes
- Mobile deep linking for Magic Links (direct native app launch)
- Full passwordless analytics funnel (delivery, click-through, login success)
- 80% fewer Keycloak instances to manage (5 → 1)
- Zero custom Keycloak SPIs (removes version coupling, so Keycloak becomes upgradeable)
- One fewer microservice (users eliminated)
- Centralized authorization logic (testable, auditable)
- Simpler Terraform (1 Keycloak deployment, not 5)
- Foundation for cross-brand user accounts
- Aligns with ADR-004 (shared cluster), ADR-008 (web auth), and ADR-009 (mobile auth)
Negative:
- Passwordless-auth-service is a new service (though simpler than the Keycloak SPI it replaces)
- Passkey credential storage adds a data management responsibility
- Single Keycloak instance is a bigger blast radius (mitigated by regional HA)
- OPA adds a new technology to the stack (learning curve)
- User migration from 4 realms to 1 requires careful planning and a maintenance window
- Tenant isolation relies on claims and policies rather than physical separation
Mitigated by: Passwordless-auth-service follows existing core-lib patterns and is stateless (Redis + Keycloak). Passkey credentials can be stored in the identity database (ADR-005). Regional HA Keycloak eliminates SPOF. OPA policies are declarative and testable. User migration can be done per-tenant with parallel running. Istio + OPA provides defense-in-depth for tenant isolation.
Decision date: 2026-01-31
Review by: 2026-07-31