ADR-010: Passwordless-First Authentication & Authorization Simplification

Last updated: 2026-02-01

Status

Proposed — Pending engineering team review

Context

Authentication and authorization are among the most complex and tightly coupled parts of the platform. Passwordless authentication is already the primary login method: fans and experts access the platform through Magic Link (email/SMS). However, this critical capability lives inside a custom Keycloak SPI, which makes it fragile, hard to extend, and coupled to the Keycloak version. The #1 priority is to make passwordless authentication a first-class, standalone capability that can evolve independently and expand to modern standards (passkeys, WebAuthn, biometrics).

Keycloak 26.3.2 serves as the identity hub for all 28+ production services across 4 tenants. The current setup includes:

Current Auth Architecture

| Component | Detail | Complexity |
| --- | --- | --- |
| Keycloak instances | identityx-26 (4 tenants) + identityx-25 (1 tenant, legacy) | 5 instances |
| Custom SPIs | Magic Link (passwordless email/SMS), Session Restrictor | Custom Java code in Keycloak Docker image |
| CIB Seven plugin | Keycloak → Camunda identity sync | EOL dependency |
| Realms | 1 per tenant (agilenetwork, nilgameplan, vtnil, speedofai) | 4+ realms |
| Client configs | Multiple per realm (fan, celeb, admin) | ~12+ client configs |
| Token validation | Every service validates JWT via issuer-uri | 28+ services dependent |
| User types | Expert, Fan, Handler, Admin, Organization | 5 roles across multiple realms |
| Theme files | celeb-keycloak-theme, fan-keycloak-theme | 2 custom Freemarker themes |
| Users service | Bridge to Keycloak Admin API (role/group management) | Separate microservice |
| Magic Link delivery | Keycloak SPI → Twilio (SMS), Keycloak SPI → email | Custom delivery pipeline |

Passwordless: Current State & Limitations

The platform already relies on passwordless as its primary authentication method, but the implementation is constrained:

| Aspect | Current State | Limitation |
| --- | --- | --- |
| Magic Link (email) | Custom Keycloak SPI | Tightly coupled to Keycloak version; must rebuild SPI for every upgrade |
| Magic Link (SMS) | Custom Keycloak SPI → Twilio | Same version coupling; SMS delivery is inside Keycloak Docker container |
| Passkeys / WebAuthn | Not supported | Cannot add without building another custom Keycloak SPI |
| Biometric auth | Not supported | No integration path from mobile apps |
| Social login | Not implemented | Keycloak supports it natively but not configured |
| Token lifetime | Keycloak defaults | No Magic Link-specific session management |
| Deep linking (mobile) | Not implemented | Magic Link opens web browser, not native app |
| Rate limiting | None | No protection against Magic Link abuse |
| Passwordless analytics | None | No visibility into delivery rates, click-through, login success |

The core problem: Passwordless is the platform’s competitive UX advantage (no passwords = lower friction for fans) but it’s trapped inside the most fragile part of the stack (custom Keycloak SPIs). Every Keycloak upgrade puts passwordless at risk.

Other Auth Problems

| Problem | Impact |
| --- | --- |
| CIB Seven plugin is EOL | Camunda 7 CE support ended. The Keycloak identity sync plugin has no upgrade path. |
| Duplicate Keycloak instances | identityx-25 still running on agilenetwork tenant alongside identityx-26. Operational overhead. |
| Per-tenant Keycloak deployment | Each tenant gets a separate Keycloak instance with separate realm, config, themes, and SPIs. Config drift between tenants. |
| users service is a thin wrapper | The users service exists mainly to call Keycloak Admin API. GraphQL operations like updateKeycloakUser, createRole, addRoleToUser are pass-throughs. |
| 28+ services depend on JWT format | Any change to token claims, issuer URL, or signing keys requires coordinated update across all services. |
| Email verification has dual paths | Deprecated email verification API in email service + Keycloak-native verification. Confusing. |
| No centralized authorization | Each service independently checks JWT roles. No policy-as-code or centralized authorization rules. |

Decision

Simplify authentication and authorization through four targeted changes, with passwordless authentication as the #1 priority: build a standalone passwordless authentication service, consolidate to a single Keycloak instance with a shared realm, eliminate the users service, and adopt centralized authorization policies.

Change 1: Passwordless Authentication Service (TOP PRIORITY)

Current: Magic Link is a custom Keycloak SPI (Java code compiled into Keycloak Docker image). Tightly coupled to Keycloak internals. Must be rebuilt for every Keycloak version. Cannot be extended with modern passwordless methods.

Target: A standalone passwordless-auth-service that owns all passwordless authentication methods and works alongside Keycloak. This is not just “extracting Magic Link” — it’s building a passwordless platform that can grow.

Passwordless Method Roadmap

| Method | Priority | Phase | Description |
| --- | --- | --- | --- |
| Magic Link (email) | P0 | MVP | Existing flow, extracted from Keycloak SPI |
| Magic Link (SMS) | P0 | MVP | Existing flow, extracted from Keycloak SPI |
| Mobile deep linking | P0 | MVP | Magic Links open native app directly (ADR-009) |
| Passkeys / WebAuthn | P1 | Phase 2 | FIDO2 passwordless: browser/device biometric auth |
| Biometric unlock | P1 | Phase 2 | Face ID / Touch ID for returning mobile users |
| Social login (Google/Apple) | P2 | Phase 3 | OAuth2 via Keycloak identity brokering |
| One-time passcode (OTP) | P2 | Phase 3 | 6-digit code via SMS/email as Magic Link alternative |
| QR code login | P3 | Future | Scan from mobile to log into web session |

Architecture: Passwordless Auth Service

sequenceDiagram
    participant User
    participant App as Web/Mobile App
    participant PAS as Passwordless Auth Service
    participant KC as Keycloak
    participant SMS as Twilio
    participant Email as Email Service
    participant Redis as Redis (tokens)

    Note over PAS: MVP: Magic Link
    User->>App: Request passwordless login
    App->>PAS: POST /auth/passwordless {method: "magic-link", email/phone}
    PAS->>PAS: Rate limit check (per email/phone)
    PAS->>Redis: Store one-time token (15min TTL)
    alt Email
        PAS->>Email: Send Magic Link email
    else SMS
        PAS->>SMS: Send Magic Link SMS via Twilio
    end
    User->>App: Click magic link / deep link
    App->>PAS: POST /auth/passwordless/verify {token}
    PAS->>Redis: Validate + consume token
    PAS->>KC: Admin API: authenticate user (direct grant)
    KC->>PAS: Access token + refresh token
    PAS->>App: Return Keycloak tokens
    App->>App: Store tokens, proceed as authenticated

    Note over PAS: Phase 2: Passkeys
    User->>App: Register passkey (WebAuthn)
    App->>PAS: POST /auth/passkey/register {attestation}
    PAS->>PAS: Store public key credential
    User->>App: Login with passkey
    App->>PAS: POST /auth/passkey/authenticate {assertion}
    PAS->>PAS: Verify signature against stored credential
    PAS->>KC: Admin API: authenticate user
    KC->>PAS: Tokens
    PAS->>App: Return Keycloak tokens
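
The verify step in the MVP flow depends on tokens that are single-use and expire after 15 minutes. The following is a minimal Python sketch of that behavior, using an in-memory dict as a stand-in for Redis. The class name `OneTimeTokenStore`, its methods, and the injectable clock are illustrative assumptions, not the service's actual API.

```python
import hashlib
import secrets
import time
from typing import Callable, Dict, Optional, Tuple

TOKEN_TTL_SECONDS = 15 * 60  # 15-minute TTL, as in the sequence diagram


class OneTimeTokenStore:
    """Hypothetical in-memory stand-in for the Redis-backed token store."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # Maps SHA-256(token) -> (user identifier, expiry timestamp).
        self._entries: Dict[str, Tuple[str, float]] = {}

    @staticmethod
    def _digest(token: str) -> str:
        # Keep only a hash so a leaked store does not expose usable tokens.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, identifier: str) -> str:
        """Create a random token bound to an email address or phone number."""
        token = secrets.token_urlsafe(32)
        expires_at = self._clock() + TOKEN_TTL_SECONDS
        self._entries[self._digest(token)] = (identifier, expires_at)
        return token

    def consume(self, token: str) -> Optional[str]:
        """Return the identifier once; None if unknown, expired, or already used."""
        entry = self._entries.pop(self._digest(token), None)
        if entry is None:
            return None
        identifier, expires_at = entry
        if self._clock() >= expires_at:
            return None
        return identifier
```

With Redis, the validate-and-consume step could plausibly be a single atomic read-and-delete (for example `GETDEL`, available since Redis 6.2) on a key written with an expiry, so that two concurrent verify calls cannot both succeed.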

Passwordless Auth Service Capabilities

| Capability | Detail |
| --- | --- |
| Multi-method support | Pluggable authentication methods behind a unified API |
| Rate limiting | Per-email/phone rate limits to prevent abuse (e.g., max 5 Magic Links per email per hour) |
| Deep linking | Mobile Magic Links use theagilenetwork://auth/verify?token=xxx for native app launch |
| Universal Links | iOS Universal Links + Android App Links for seamless web-to-app handoff |
| Delivery tracking | Track Magic Link send, delivery, click, and login success rates |
| Fallback chain | If SMS fails, auto-fallback to email; if passkey fails, offer Magic Link |
| Session management | Configurable session durations per auth method (passkey = longer, Magic Link = shorter) |
| Device trust | Remember trusted devices to reduce re-authentication frequency |
| Branding | Per-tenant email/SMS templates without Keycloak Freemarker themes |
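
The rate-limiting capability (for example, at most 5 Magic Links per email per hour) can be sketched as a sliding-window counter. This is an illustrative Python sketch only: `SlidingWindowRateLimiter` and its parameters are hypothetical names, and an in-process dict stands in for whatever shared store (such as Redis) the service would actually use.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowRateLimiter:
    """Hypothetical per-identifier limiter: max_requests per window_seconds."""

    def __init__(
        self,
        max_requests: int = 5,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        # Maps normalized email/phone -> timestamps of recent accepted requests.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, identifier: str) -> bool:
        """Record and allow the request, or reject it if the limit is reached."""
        now = self._clock()
        # Normalize so "Fan@Example.com" and "fan@example.com" share a bucket.
        hits = self._hits[identifier.strip().lower()]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps retrying while blocked does not extend its own lockout; whether that is the desired policy is a design choice left open here.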

Passwordless Auth Service Stack

Mobile Passwordless Flow (ADR-009 Integration)

sequenceDiagram
    participant User
    participant RN as React Native App
    participant PAS as Passwordless Auth Service
    participant KC as Keycloak
    participant SS as SecureStore

    Note over User,SS: First Login (Magic Link)
    User->>RN: Tap "Sign In"
    RN->>PAS: POST /auth/passwordless {method: "magic-link", email}
    PAS-->>User: Email with deep link
    User->>RN: Tap deep link (Universal Link)
    RN->>PAS: POST /auth/passwordless/verify {token}
    PAS->>KC: Authenticate user
    KC->>PAS: Tokens
    PAS->>RN: Return tokens
    RN->>SS: Store tokens + enable biometric

    Note over User,SS: Returning User (Biometric)
    User->>RN: Open app
    RN->>RN: Check biometric enrollment
    RN->>User: Face ID / Touch ID prompt
    User->>RN: Biometric success
    RN->>SS: Retrieve refresh token
    RN->>KC: Refresh token exchange
    KC->>RN: New access token
    RN->>RN: Authenticated session
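
The deep link tapped in the first-login flow carries the one-time token in its query string. A small Python sketch of building and strictly parsing such a link follows, assuming the `theagilenetwork://auth/verify?token=...` shape listed under the service capabilities. The function names are hypothetical, and the real parsing would happen in the React Native app rather than in Python.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

DEEP_LINK_SCHEME = "theagilenetwork"  # custom scheme named in this ADR


def build_deep_link(token: str) -> str:
    """Build the Magic Link deep link, URL-encoding the token."""
    return f"{DEEP_LINK_SCHEME}://auth/verify?{urlencode({'token': token})}"


def extract_token(link: str) -> Optional[str]:
    """Return the token only if the link has exactly the expected shape."""
    parts = urlsplit(link)
    if parts.scheme != DEEP_LINK_SCHEME:
        return None
    if parts.netloc != "auth" or parts.path != "/verify":
        return None
    values = parse_qs(parts.query).get("token", [])
    # Reject missing or repeated token parameters instead of guessing.
    return values[0] if len(values) == 1 else None
```

Rejecting anything that does not match the expected scheme, host, and path keeps unrelated or malformed links from reaching the verify call.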

Why passwordless is #1 priority: Every user hits authentication on every session. A poor passwordless experience directly impacts engagement, conversion, and retention. Extracting this from Keycloak SPI unblocks: Keycloak upgrades, passkey support, mobile deep linking, and delivery analytics — all of which are blocked today.

Change 2: Single Keycloak Instance with Multi-Tenant Realm

Current: Separate Keycloak instances per tenant, each with its own realm.

Target: Single Keycloak instance (regional HA) with a shared realm and tenant-scoped groups.

graph TB
    subgraph "Current: Per-Tenant Keycloak"
        KC1[Keycloak<br/>agilenetwork realm]
        KC2[Keycloak<br/>nilgameplan realm]
        KC3[Keycloak<br/>vtnil realm]
        KC4[Keycloak<br/>speedofai realm]
    end

    subgraph "Target: Shared Keycloak"
        KC[Single Keycloak 26.x<br/>Regional HA]
        R1[Realm: platform]
        G1[Group: agilenetwork]
        G2[Group: nilgameplan]
        G3[Group: vtnil]
        G4[Group: speedofai]
        KC --> R1
        R1 --> G1
        R1 --> G2
        R1 --> G3
        R1 --> G4
    end

Benefits:
- Single Keycloak to maintain, upgrade, and monitor
- Users can exist across tenants (future: single login for multiple brands)
- Client configs managed once, scoped by group
- Zero custom SPIs (passwordless externalized in Change 1)
- Simpler Terraform (1 instance, not 4+)

Tenant isolation: Keycloak groups + custom claims in JWT. Token includes tenant: "agilenetwork". Services validate tenant claim matches request context. Keycloak authorization services enforce tenant-scoped access.
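
As a rough illustration of the service-side check described above, the sketch below compares the token's `tenant` claim with the tenant taken from the request. It assumes the JWT has already been verified, and that the request context is carried in an `x-tenant-id` header as in the policy example in Change 4. The function name is hypothetical.

```python
from typing import Any, Mapping


def tenant_allowed(claims: Mapping[str, Any], headers: Mapping[str, str]) -> bool:
    """True only when the token's tenant claim matches the request's tenant header.

    Assumes `claims` come from a JWT whose signature and issuer were already verified.
    """
    token_tenant = claims.get("tenant")
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    request_tenant = normalized.get("x-tenant-id")
    # Fail closed when either side is missing or malformed.
    if not isinstance(token_tenant, str) or not token_tenant or not request_tenant:
        return False
    return token_tenant == request_tenant
```

The check fails closed: a token without a tenant claim or a request without a tenant header is rejected rather than treated as tenant-agnostic.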

Migration approach:
1. Export realm configs from all 4 Keycloak instances
2. Create unified platform realm with tenant groups
3. Import users into unified realm (Keycloak user IDs are UUIDs; preserve them)
4. Update Istio routing to point to single Keycloak
5. Update all service issuer-uri configs
6. Run parallel for validation period
7. Decommission per-tenant instances
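
Step 3 of the migration (importing users while preserving their IDs) can be sketched as a merge over per-tenant user lists. This is a simplified illustration. The input shape (`id` and `email` per user), the `/tenant` group path, and the conflict rule are assumptions for the example rather than the actual Keycloak export schema or the team's migration tooling.

```python
from typing import Any, Dict, List, Mapping, Tuple

User = Dict[str, Any]


def merge_realm_users(
    exports: Mapping[str, List[User]],
) -> Tuple[List[User], List[str]]:
    """Merge per-tenant user lists into one realm, keeping original user IDs.

    Returns (merged users, emails needing manual resolution). An email that
    appears under two different user IDs is reported as a conflict and the
    later record is skipped instead of being silently merged.
    """
    merged: Dict[str, User] = {}
    id_by_email: Dict[str, str] = {}
    conflicts: List[str] = []
    for tenant in sorted(exports):  # deterministic processing order
        for user in exports[tenant]:
            user_id = user["id"]
            email = user["email"].lower()
            owner = id_by_email.setdefault(email, user_id)
            if owner != user_id:
                conflicts.append(email)
                continue
            record = merged.setdefault(
                user_id, {"id": user_id, "email": email, "groups": []}
            )
            # Tenant membership becomes a group in the unified realm.
            record["groups"].append(f"/{tenant}")
    return list(merged.values()), conflicts
```

Surfacing cross-tenant email collisions early matters because a shared realm can hold only one account per email address if email is used as the login identifier.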

Change 3: Eliminate Users Service

Current: The users service is a GraphQL wrapper around the Keycloak Admin API. Operations like createRole, addRoleToUser, searchUsers are direct pass-throughs.

Target: Absorb users service functionality into the identity-service (ADR-001) and the passwordless-auth-service (Change 1).

| Current users Service Operation | Target |
| --- | --- |
| magicEmailLogin, magicSmsCode, magicSmsLogin | passwordless-auth-service |
| updateKeycloakUser, searchUsers, whoAmI | identity-service (Keycloak Admin API calls) |
| createRole, deleteRole, addRoleToUser, removeRoleFromUser | identity-service (admin operations) |
| createGroup, deleteGroup, addUserToGroup, removeUserFromGroup | identity-service (admin operations) |
| verifyEmail, verifyPhone | Keycloak native verification (complete migration from deprecated email service API) |

Benefit: Eliminates an entire microservice. The identity-service (celebrity + fan + users merged per ADR-001) handles all identity operations. Magic Link operations move to the passwordless-auth-service from Change 1, which has a distinct lifecycle (stateless, Redis-backed, independently scalable).

Change 4: Centralized Authorization with OPA/Cedar

Current: Each service independently parses JWT claims and checks roles in application code. Authorization logic is scattered across 28+ services.

Target: Centralized authorization policy engine using OPA (Open Policy Agent) or Cedar, evaluated at the Istio sidecar level.

graph LR
    subgraph "Request Flow"
        CLIENT[Client<br/>JWT Bearer token]
        ISTIO[Istio Sidecar<br/>+ ExtAuthz filter]
        OPA[OPA / Cedar<br/>Policy Engine]
        SVC[Backend Service]
    end

    CLIENT -->|Request + JWT| ISTIO
    ISTIO -->|AuthZ check| OPA
    OPA -->|Allow/Deny| ISTIO
    ISTIO -->|Allowed request| SVC

Policy Example (OPA/Rego):

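package authz

import rego.v1

default allow := false

# Tenant isolation: the token's tenant claim must match the request's tenant header.
# Every allow rule requires it, so a role match alone never grants cross-tenant access,
# and a tenant match alone never grants access without a role.
tenant_match if {
    input.token.tenant == input.headers["x-tenant-id"]
}

# Fans can view content
allow if {
    tenant_match
    input.method == "POST"
    input.path == "/api/content/graphql"
    "fan" in input.token.realm_access.roles
    input.body.operationName == "queryContent"
}

# Experts can manage their own profile
allow if {
    tenant_match
    input.path == "/api/celebrity/graphql"
    "celebrity" in input.token.realm_access.roles
    input.token.sub == input.body.variables.celebrityUserId
}

# Admins can access admin endpoints within their own tenant
allow if {
    tenant_match
    "admin" in input.token.realm_access.roles
}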
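
The policy reads `input.method`, `input.path`, `input.headers`, `input.token`, and `input.body`. The Python sketch below shows one way such an input document could be assembled from a request, as an illustration only. The function names are hypothetical, and the JWT payload is decoded without signature verification on the assumption that verification has already happened upstream (for example at the mesh).

```python
import base64
import json
from typing import Any, Dict, Mapping


def decode_jwt_payload(jwt: str) -> Dict[str, Any]:
    """Decode the payload segment WITHOUT verifying the signature.

    Only safe when the token was already verified by an earlier component.
    """
    segments = jwt.split(".")
    if len(segments) != 3:
        raise ValueError("expected header.payload.signature")
    payload = segments[1]
    # JWTs use unpadded base64url; restore padding before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def build_opa_input(
    method: str,
    path: str,
    headers: Mapping[str, str],
    bearer_jwt: str,
    body: Mapping[str, Any],
) -> Dict[str, Any]:
    """Assemble the document the policy sees as `input`."""
    return {
        "method": method,
        "path": path,
        # Lower-case header names so the policy can index "x-tenant-id".
        "headers": {name.lower(): value for name, value in headers.items()},
        "token": decode_jwt_payload(bearer_jwt),
        "body": dict(body),
    }
```

In an Istio ExtAuthz deployment the input is normally produced by the OPA-Envoy integration rather than hand-built, so this sketch is mainly useful for writing policy test fixtures.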

Benefits:
- Authorization logic defined in one place (policy files), not scattered across 28+ services
- Policies can be tested independently (OPA has a built-in test framework)
- Tenant isolation enforced at the mesh level, not in application code
- Services become simpler: they trust that requests reaching them are authorized
- Policy changes don't require service redeployment

Implementation approach:
1. Deploy OPA as an Istio ExternalAuthorization provider
2. Start with coarse-grained policies (role-based, tenant-scoped)
3. Gradually move fine-grained authorization from services to OPA
4. Services keep JWT validation as defense-in-depth but delegate authorization

Combined Architecture

graph TB
    subgraph "Passwordless Layer (Priority #1)"
        PAS[Passwordless Auth Service<br/>Magic Link + Passkeys + Biometric<br/>Spring Boot + Redis]
    end

    subgraph "Identity Provider"
        KC[Single Keycloak 26.x<br/>Multi-tenant realm<br/>Regional HA]
    end

    subgraph "Authorization Layer"
        OPA[OPA Policy Engine<br/>Istio ExtAuthz]
    end

    subgraph "Identity Domain (ADR-001)"
        ID[identity-service<br/>celebrity + fan + users merged<br/>Profile management + Keycloak Admin API]
    end

    subgraph "All Other Services"
        SVC1[payment-service]
        SVC2[content-service]
        SVC3[notification-service]
        SVCN[... other services]
    end

    subgraph "Clients"
        WEB[Next.js Web]
        MOB[React Native Mobile]
    end

    WEB -->|Magic Link / Passkey| PAS
    MOB -->|Magic Link deep link / Biometric| PAS
    WEB -->|OAuth2 PKCE fallback| KC
    MOB -->|OAuth2 PKCE fallback| KC
    PAS -->|Token exchange| KC

    WEB -->|GraphQL + JWT| OPA
    MOB -->|GraphQL + JWT| OPA
    OPA -->|Authorized| SVC1
    OPA -->|Authorized| SVC2
    OPA -->|Authorized| SVC3
    OPA -->|Authorized| SVCN
    OPA -->|Authorized| ID

    ID -->|Admin API| KC

Simplification Summary

| Metric | Current | After Simplification |
| --- | --- | --- |
| Passwordless methods supported | 1 (Magic Link only, locked in SPI) | 4+ (Magic Link, Passkeys, Biometric, OTP) |
| Passwordless extensibility | Requires Keycloak SPI rebuild | Standard REST API, independently deployable |
| Keycloak instances | 5 (4 prod + 1 legacy) | 1 (regional HA) |
| Custom Keycloak SPIs | 2 (Magic Link, Session Restrictor) | 0 (externalized to passwordless-auth-service) |
| CIB Seven plugin | 1 (EOL) | 0 (eliminated with BPM replacement) |
| Auth-related services | 4 (Keycloak, users, celebrity, fan) | 2 (Keycloak, passwordless-auth-service) + identity-service |
| Authorization logic locations | 28+ services (scattered) | 1 (OPA policy files) |
| Keycloak themes | 2 repos (Freemarker) | 1 unified theme (or Keycloak 26 built-in theming) |
| Email verification paths | 2 (deprecated email API + Keycloak native) | 1 (Keycloak native only) |
| Tenant Keycloak configs | 4+ separate realm configs | 1 unified realm with tenant groups |
| Mobile auth deep linking | Not supported | Universal Links + App Links for native app launch |
| Auth delivery analytics | None | Full funnel: send → deliver → click → login |
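
The "full funnel" metric in the last row can be made concrete as stage-to-stage conversion rates. The sketch below is illustrative. The stage names follow the send → deliver → click → login sequence above, while the function name and the shape of the counts are assumptions.

```python
from typing import Dict, Mapping

# Funnel stages in order, as named in the summary above.
FUNNEL_STAGES = ("send", "deliver", "click", "login")


def funnel_rates(counts: Mapping[str, int]) -> Dict[str, float]:
    """Compute the conversion rate between each pair of consecutive stages."""
    rates: Dict[str, float] = {}
    for previous, current in zip(FUNNEL_STAGES, FUNNEL_STAGES[1:]):
        base = counts.get(previous, 0)
        # Report 0.0 instead of dividing by zero when a stage has no events.
        rates[f"{previous}->{current}"] = counts.get(current, 0) / base if base else 0.0
    return rates
```

For example, counts of 1000 sent, 950 delivered, 600 clicked, and 540 logged in give a delivery rate of 0.95 and a click-to-login rate of 0.9; these numbers are made up for illustration, since no baseline metrics exist yet.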

Hypothesis Background

Primary: Building a standalone passwordless authentication service that decouples passwordless flows from Keycloak SPIs will unlock passkey/WebAuthn support, mobile deep linking, and delivery analytics — while reducing auth complexity by ~60% through Keycloak consolidation, users service elimination, and centralized authorization.

Alternative 1: Replace Keycloak with Auth0/Okta.
- Rejected: Auth0 has native passwordless but at significant cost per MAU, and would require migrating all client configs, updating 28+ service issuer-uri configs, and reimplementing theme customization. Keycloak 26.x is mature and actively maintained. The problem isn't Keycloak; it's how passwordless is coupled to it via custom SPIs.

Alternative 2: Keep Magic Link as Keycloak SPI, add passkeys as another SPI.
- Rejected: This doubles down on the coupling problem. Every Keycloak upgrade would now require rebuilding two custom SPIs. The SPI lifecycle (compile → build Docker image → deploy) is far heavier than a standalone service deployment.

Alternative 3: Use Keycloak's built-in WebAuthn support.
- Partially viable but insufficient: Keycloak 26.x has WebAuthn authenticator support, but it's configured through the admin UI and authentication flows, not through APIs that the mobile app can call directly. A standalone passwordless service gives full API control for both web and React Native clients.

Alternative 4: Skip OPA, keep per-service authorization.
- Not fully rejected; can be deferred. OPA adds infrastructure complexity. However, with 18 consolidated services (ADR-001) all needing tenant isolation (ADR-004), centralizing authorization prevents repeating the same tenant-check logic in every service.

Falsifiability Criteria

Passwordless-specific:
- If passwordless-auth-service adds >1s latency to Magic Link login flow compared to current Keycloak SPI → optimize token exchange or implement caching
- If Magic Link delivery rates drop below current levels after migration → investigate email/SMS delivery pipeline differences
- If WebAuthn4J passkey registration fails on >5% of target devices (iOS 16+, Android 9+, Chrome 100+) → evaluate alternative FIDO2 libraries
- If mobile deep linking (Universal Links / App Links) fails to open native app >10% of the time → implement fallback web flow with app banner
- If passwordless-auth-service cannot handle peak authentication load (measure: >500 concurrent logins) → scale horizontally or add connection pooling

Infrastructure-specific:
- If single Keycloak realm with tenant groups cannot enforce sufficient isolation (users seeing cross-tenant data) → revert to per-tenant realms within single instance
- If OPA policy evaluation adds >50ms to request latency → simplify policies or move to application-level authorization
- If migrating 4 realm configs to 1 unified realm loses user data or breaks existing sessions → extend parallel-run period
- If React Native / Next.js Keycloak PKCE integration proves unreliable → implement BFF (Backend-for-Frontend) token exchange

Evidence Quality

| Evidence | Assurance |
| --- | --- |
| Magic Link is primary login method | L2 (verified; custom SPI in Keycloak Docker image) |
| Magic Link SPI is version-coupled | L1 (custom Java code, must rebuild per Keycloak version) |
| Passkeys/WebAuthn cannot be added via current SPI | L1 (Keycloak SPI architecture limits, no existing passkey SPI) |
| WebAuthn4J library maturity | L1 (open-source, actively maintained, used by Spring Security) |
| Multi-tenant is config-only (H11) | L2 (verified across all domains) |
| CIB Seven plugin is EOL | L2 (Camunda 7 CE support ended Oct 2025) |
| users service is thin wrapper | L1 (15 operations, mostly Keycloak Admin API) |
| Dual Keycloak instances running | L2 (identityx-25 + identityx-26 verified) |
| OPA/Istio ExtAuthz integration | L1 (Istio documentation, community adoption) |
| Keycloak multi-tenant realm feasibility | L1 (Keycloak documentation, group-based isolation) |
| User migration preserves IDs | L0 (Keycloak export/import should preserve UUIDs; needs testing) |
| Passkey device compatibility | L0 (need to verify target user device distribution) |
| Magic Link delivery rate baseline | L0 (no current metrics on send/deliver/click rates) |

Overall: L1 (WLNK capped by user migration testing L0 and passwordless delivery baseline L0)

Bounded Validity

Consequences

Positive:
- Passwordless becomes a first-class, independently deployable capability
- Path to passkeys, biometrics, and modern passwordless standards without Keycloak SPI changes
- Mobile deep linking for Magic Links (direct native app launch)
- Full passwordless analytics funnel (delivery, click-through, login success)
- 80% fewer Keycloak instances to manage (5 → 1)
- Zero custom Keycloak SPIs (removes version coupling, so Keycloak becomes upgradeable)
- One fewer microservice (users eliminated)
- Centralized authorization logic (testable, auditable)
- Simpler Terraform (1 Keycloak deployment, not 5)
- Foundation for cross-brand user accounts
- Aligns with ADR-004 (shared cluster), ADR-008 (web auth), and ADR-009 (mobile auth)

Negative:
- passwordless-auth-service is a new service to build and operate (though simpler than the Keycloak SPI it replaces)
- Passkey credential storage adds a data management responsibility
- A single Keycloak instance has a larger blast radius (mitigated by regional HA)
- OPA adds a new technology to the stack (learning curve)
- User migration from 4 realms to 1 requires careful planning and a maintenance window
- Tenant isolation relies on claims and policies rather than physical separation

Mitigated by: passwordless-auth-service follows existing core-lib patterns and holds no local state (tokens live in Redis, identities in Keycloak). Passkey credentials can be stored in the identity database (ADR-005). Regional HA reduces the single-point-of-failure risk of running one Keycloak instance. OPA policies are declarative and testable. User migration can be done per tenant with parallel running. Istio + OPA provides defense-in-depth for tenant isolation.


Decision date: 2026-01-31
Review by: 2026-07-31