Migration Strategy

Key Takeaways

  1. Domain-by-domain strangler fig — Migrate one domain at a time while Gen 2 services continue running. Each phase is independently deployable and rollback-safe. No big-bang cutover.
  2. 6 migration phases over 4 waves — Wave 1: Foundation (infra + testing + BPM). Wave 2: Low-risk domains (wallet/transaction, communication). Wave 3: Medium-risk domains (identity, content, events). Wave 4: High-risk domains (inventory, Keycloak).
  3. Backward-compatible API throughout — GraphQL schemas evolve additively (new fields, deprecate old). Both Gen 2 and consolidated services serve traffic simultaneously during transition. Frontend switches per-domain.
  4. BPM migration requires drain-and-switch — Stop new workflow instances on CIB Seven, let in-flight instances complete (hours at most), then switch to Spring State Machine. The same workflow never executes on both engines in parallel.
  5. Keycloak is the last migration — All services depend on Keycloak JWT. Migration approach: export/import realm, reconfigure all services to new issuer URI in a coordinated maintenance window.

Migration Decision Question

How do we get from here to Gen 3?


Migration Approach: Strangler Fig

Each domain follows this pattern:

1. Build consolidated service alongside Gen 2
2. Route traffic to the consolidated service (Istio weight-based routing)
3. Validate with monitoring + SLOs (see the parity-check sketch below)
4. Drain Gen 2 service
5. Remove Gen 2 deployment
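Step 3's validation can include a response-parity check: issue the same read-only query to the Gen 2 service and the consolidated service and compare results. A minimal sketch using the JDK's built-in HttpClient; the hostnames and the query body are illustrative, not the real API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal parity check: send the same read-only GraphQL query to both the
// Gen 2 and the consolidated endpoint and compare the raw responses.
public class ParityCheck {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static String fetch(String baseUrl, String graphqlQuery) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/graphql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(graphqlQuery))
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical hosts and query; "..." is a placeholder user ID.
        String query = "{\"query\":\"{ walletBalance(userId: \\\"...\\\") }\"}";
        String gen2 = fetch("https://gen2.internal.example", query);
        String consolidated = fetch("https://consolidated.internal.example", query);
        System.out.println(gen2.equals(consolidated) ? "MATCH" : "DIVERGED");
    }
}
```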

Key Principles

- One domain at a time; Gen 2 keeps serving traffic until the consolidated service is validated.
- Every phase is independently deployable with an independent rollback.
- APIs evolve additively; no breaking GraphQL changes during the transition.
- No big-bang cutover; the only coordinated window is the final Keycloak switch.

Wave 1: Foundation (Prerequisites)

Phase 0: Infrastructure & Platform Readiness

What: Establish infrastructure prerequisites before any service migration.

| Task | Description | Dependencies |
|------|-------------|--------------|
| Regional GKE | Upgrade clusters from zonal to regional | Terraform module update |
| Regional Cloud SQL | Enable HA for Cloud SQL instances | Terraform, maintenance window |
| NetworkPolicies | Implement default-deny + explicit allow | Integration patterns doc |
| CI Security Gates | Enforce Trivy/Qwiet failures on high/critical | GitHub Actions workflows |
| OpenTelemetry | Add auto-instrumentation to common Helm chart | Helm chart update |
| Alerting | Configure PrometheusRules for critical alerts | SLI/SLO definitions |
| Test Framework | Set up integration test infrastructure | CI pipeline update |
| Frontend Dead Code | Remove 5 dead API gateways | peeq-mono + frontends |

Rollback: Infrastructure changes are independent of application services. Revert Terraform if needed.

Success criteria:

- [ ] Regional GKE cluster operational
- [ ] Regional Cloud SQL with automated failover tested
- [ ] NetworkPolicies deployed (not breaking existing traffic)
- [ ] CI pipeline blocks on high/critical vulnerabilities
- [ ] OpenTelemetry traces visible in Grafana
- [ ] At least 5 critical alert rules firing correctly
- [ ] Integration test framework running in CI

Phase 0.5: BPM Engine Replacement

What: Replace CIB Seven with Spring State Machine for both workflows.

| Step | Description | Risk |
|------|-------------|------|
| 1. Build state machines | Implement purchase + shoutout state machines (sketched below) | Low — well-defined states |
| 2. Dual-write validation | New state machine logs alongside CIB Seven (read-only) | Low — no data mutation |
| 3. Shadow traffic | Route new workflow instances to state machine, compare results | Medium — needs monitoring |
| 4. Drain CIB Seven | Stop new instances on CIB Seven, let in-flight complete | Medium — timer events |
| 5. Switch | All new workflows use state machine exclusively | Low — validated in step 3 |
| 6. Remove CIB Seven | Remove CIB Seven dependency, Keycloak plugin | Low — no longer used |
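A minimal Spring State Machine configuration for the purchase workflow in step 1. The states and events here are hypothetical; the real set must be lifted from the existing CIB Seven process definitions:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.statemachine.config.EnableStateMachine;
import org.springframework.statemachine.config.EnumStateMachineConfigurerAdapter;
import org.springframework.statemachine.config.builders.StateMachineStateConfigurer;
import org.springframework.statemachine.config.builders.StateMachineTransitionConfigurer;

import java.util.EnumSet;

// Hypothetical states/events; the real state set comes from the CIB Seven
// process definition extracted during step 1.
enum PurchaseState { CREATED, PAYMENT_PENDING, PAID, FULFILLED, FAILED }
enum PurchaseEvent { SUBMIT_PAYMENT, PAYMENT_CONFIRMED, FULFILL, FAIL }

@Configuration
@EnableStateMachine
public class PurchaseStateMachineConfig
        extends EnumStateMachineConfigurerAdapter<PurchaseState, PurchaseEvent> {

    @Override
    public void configure(StateMachineStateConfigurer<PurchaseState, PurchaseEvent> states)
            throws Exception {
        states.withStates()
                .initial(PurchaseState.CREATED)
                .states(EnumSet.allOf(PurchaseState.class))
                .end(PurchaseState.FULFILLED)
                .end(PurchaseState.FAILED);
    }

    @Override
    public void configure(StateMachineTransitionConfigurer<PurchaseState, PurchaseEvent> transitions)
            throws Exception {
        // One transition per well-defined workflow edge.
        transitions
                .withExternal().source(PurchaseState.CREATED).target(PurchaseState.PAYMENT_PENDING)
                    .event(PurchaseEvent.SUBMIT_PAYMENT).and()
                .withExternal().source(PurchaseState.PAYMENT_PENDING).target(PurchaseState.PAID)
                    .event(PurchaseEvent.PAYMENT_CONFIRMED).and()
                .withExternal().source(PurchaseState.PAID).target(PurchaseState.FULFILLED)
                    .event(PurchaseEvent.FULFILL).and()
                .withExternal().source(PurchaseState.PAYMENT_PENDING).target(PurchaseState.FAILED)
                    .event(PurchaseEvent.FAIL);
    }
}
```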

Rollback: Revert to CIB Seven at any step. In-flight instances on CIB Seven continue regardless.

Success criteria:

- [ ] Both state machines pass all integration tests
- [ ] Shadow traffic shows identical state transitions for >100 workflows
- [ ] Zero CIB Seven in-flight instances (fully drained)
- [ ] Keycloak plugin removed without identity sync issues


Wave 2: Low-Risk Domains

Phase 1: Payment Consolidation (wallet + transaction → payment-service)

What: Merge wallet (3 tables), transaction (1 table), stripe (7 tables), and subscriptions (4 tables) into a single payment-service.

Data migration:

- Merge 4 databases into a single payment schema
- Preserve all Stripe external IDs (customer, product, subscription)
- Wallet balance + transaction history must be exact (financial data; see the reconciliation sketch below)
- No schema redesign — additive merge
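The balance-exactness requirement can be verified with a reconciliation pass before cutover. A minimal sketch, assuming Spring's JdbcTemplate and illustrative table/column names (user_id, balance):

```java
import org.springframework.jdbc.core.JdbcTemplate;

import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

// Compares every wallet balance in the legacy wallet DB against the merged
// payment schema. Table and column names are illustrative.
public class WalletReconciliation {

    private final JdbcTemplate legacyWalletDb;
    private final JdbcTemplate paymentDb;

    public WalletReconciliation(JdbcTemplate legacyWalletDb, JdbcTemplate paymentDb) {
        this.legacyWalletDb = legacyWalletDb;
        this.paymentDb = paymentDb;
    }

    public boolean reconcile() {
        Map<String, BigDecimal> before = loadBalances(legacyWalletDb, "wallet");
        Map<String, BigDecimal> after = loadBalances(paymentDb, "payment.wallet");
        if (!before.keySet().equals(after.keySet())) return false;
        // Financial data: require exact equality, never tolerance-based comparison.
        return before.entrySet().stream()
                .allMatch(e -> e.getValue().compareTo(after.get(e.getKey())) == 0);
    }

    private Map<String, BigDecimal> loadBalances(JdbcTemplate db, String table) {
        Map<String, BigDecimal> balances = new HashMap<>();
        db.query("SELECT user_id, balance FROM " + table,
                rs -> { balances.put(rs.getString("user_id"), rs.getBigDecimal("balance")); });
        return balances;
    }
}
```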

API migration:

- Consolidated GraphQL schema includes all existing operations (see the controller sketch below)
- Frontend switches from /api/wallet, /api/transaction, /api/stripe, /api/subscriptions to /api/payment
- Istio route: old paths proxy to new service during transition
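The additive rule in practice: a sketch using Spring for GraphQL, with illustrative operation and type names. The consolidated service answers the legacy wallet query and the new consolidated query from the same merged schema; the SDL (not shown) would mark the legacy field @deprecated so frontends can migrate per-domain:

```java
import org.springframework.graphql.data.method.annotation.Argument;
import org.springframework.graphql.data.method.annotation.QueryMapping;
import org.springframework.stereotype.Controller;

import java.math.BigDecimal;

// Both legacy and consolidated operations are served simultaneously during
// the transition. Names here are illustrative, not the actual schema.
@Controller
public class PaymentQueryController {

    // Legacy operation, formerly served by the wallet service. In the SDL:
    // walletBalance(userId: ID!): BigDecimal @deprecated(reason: "Use payment")
    @QueryMapping
    public BigDecimal walletBalance(@Argument String userId) {
        return paymentView(userId).walletBalance();
    }

    // New consolidated operation.
    @QueryMapping
    public PaymentView payment(@Argument String userId) {
        return paymentView(userId);
    }

    private PaymentView paymentView(String userId) {
        // Both paths read from the single merged payment schema.
        return new PaymentView(userId, BigDecimal.ZERO);
    }

    public record PaymentView(String userId, BigDecimal walletBalance) {}
}
```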

Rollback: Split service back into 4; restore individual databases from backup.

Success criteria:

- [ ] All financial operations pass integration tests
- [ ] Wallet balances match pre-migration values (100% reconciliation)
- [ ] Stripe webhooks received and processed correctly
- [ ] Zero payment failures during migration

Phase 2: Communication Consolidation (email + sms + notifications → notification-service)

What: Merge email, sms, and notifications into a single notification-service. The three services already share a database.

Data migration:

- Database already shared (peeq-notification-service-db) — no data migration needed
- Replace the lutung Mandrill library with a maintained HTTP client (see the sketch below)
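A minimal sketch of the lutung replacement using the JDK's built-in HttpClient. The endpoint and payload shape follow Mandrill's messages/send API; verify both against the current Mailchimp Transactional documentation before relying on this, and use a real JSON library for escaping in production:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Replaces the unmaintained lutung client with plain HTTP. The string
// templating below does no JSON escaping; production code should build the
// payload with a JSON library (e.g. Jackson).
public class MandrillHttpSender {

    private static final URI SEND_URI =
            URI.create("https://mandrillapp.com/api/1.0/messages/send.json");

    private final HttpClient client = HttpClient.newHttpClient();
    private final String apiKey;

    public MandrillHttpSender(String apiKey) {
        this.apiKey = apiKey;
    }

    public int send(String fromEmail, String toEmail, String subject, String html)
            throws Exception {
        String body = """
                {"key":"%s","message":{"from_email":"%s","subject":"%s","html":"%s",
                 "to":[{"email":"%s"}]}}"""
                .formatted(apiKey, fromEmail, subject, html, toEmail);
        HttpRequest request = HttpRequest.newBuilder(SEND_URI)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode();
    }
}
```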

API migration:

- Consolidated service absorbs all RabbitMQ handlers (15 inbound total)
- Email/SMS delivery becomes an internal implementation detail
- Frontend has no direct dependency (all async via RabbitMQ)

Rollback: Re-deploy individual services; they already share the same database.

Success criteria:

- [ ] All notification channels (email, SMS, push) deliver correctly
- [ ] Mandrill library replacement sends identical emails
- [ ] RabbitMQ message processing latency unchanged
- [ ] Zero missed notifications during migration


Wave 3: Medium-Risk Domains

Phase 3: Identity Consolidation (celebrity + fan + users → identity-service)

What: Merge three identity services into one, preserving all Keycloak integration.

Data migration:

- Merge celebrity-db, fan-db, users-db into a single identity schema
- All tables already use Keycloak UUIDs as primary keys — no ID mapping needed (illustrated below)
- Preserve celebrity profile images, banners, shortlinks
- Preserve fan follow relationships (MD5 composite PKs)
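Why no ID mapping is needed: every profile row is already keyed directly by the Keycloak user UUID, so rows from the three source databases can be merged by primary key. An illustrative JPA entity; table and column names are hypothetical, not the actual schema:

```java
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

import java.util.UUID;

// The Keycloak user UUID is the primary key in celebrity, fan, and users
// tables alike, so the merge is a straight schema merge.
@Entity
@Table(name = "profile", schema = "identity")
public class Profile {

    @Id
    @Column(name = "keycloak_user_id")
    private UUID keycloakUserId;

    @Column(name = "display_name")
    private String displayName;

    protected Profile() {} // required by JPA

    public Profile(UUID keycloakUserId, String displayName) {
        this.keycloakUserId = keycloakUserId;
        this.displayName = displayName;
    }

    public UUID getKeycloakUserId() { return keycloakUserId; }
    public String getDisplayName() { return displayName; }
}
```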

API migration:

- Consolidated GraphQL schema: ~63 operations (27 celebrity + 21 fan + 15 users)
- Remove deprecated queries (3 celebrity queries)
- Frontend switches from /api/celebrity, /api/fan, /api/users to /api/identity

Rollback: Re-deploy individual services; restore individual databases.

Success criteria:

- [ ] All profile operations (create, update, delete, query) pass tests
- [ ] Follow/unfollow system works correctly
- [ ] Magic link authentication unaffected
- [ ] Keycloak admin operations (role, group management) functional

Phase 4: Content Consolidation (content + media → content-service)

What: Merge content and media services, unifying Mux integration.

Data migration:

- Merge content-db and media-db into a single content schema
- Preserve all Mux asset IDs and playback IDs
- Migrate NFS file paths to GCS references (separate sub-project; see the sketch below)
- Content service already runs Java 24 — downgrade to Java 21 LTS for consistency
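A minimal sketch of the NFS-to-GCS sub-project using the google-cloud-storage client: copy a file from the NFS mount into a bucket and return the new reference to store in the content schema. The bucket name and object layout are assumptions:

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.file.Path;

// Uploads one file from the NFS mount to GCS and returns the gs:// URI to
// write back into the content schema. Mux IDs on the row stay untouched.
public class NfsToGcsMigrator {

    private final Storage storage = StorageOptions.getDefaultInstance().getService();
    private final String bucket = "peeq-content-media"; // hypothetical bucket name

    public String migrate(Path nfsPath) throws Exception {
        String objectName = "migrated/" + nfsPath.getFileName();
        BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of(bucket, objectName)).build();
        storage.createFrom(blobInfo, nfsPath); // streams the file contents to GCS
        return "gs://" + bucket + "/" + objectName;
    }
}
```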

API migration:

- Consolidated GraphQL schema: ~40+ operations
- Unified Mux webhook handler (currently duplicated in both services)
- Frontend switches /api/media traffic to /api/content; /api/content paths are unchanged

Rollback: Re-deploy individual services; restore databases. NFS migration is separate and independently rollbackable.

Success criteria:

- [ ] All content CRUD operations pass tests
- [ ] Mux video upload, transcode, and playback work end-to-end
- [ ] File upload (multipart) works with new storage backend
- [ ] Search indices rebuilt and returning correct results

Phase 5: Events Domain (shoutout consolidation + class-catalog upgrade)

What: Merge shoutout + shoutout-bpm (BPM already replaced in Phase 0.5). Upgrade class-catalog in place (remove Arlo dead code). Merge journey into class-catalog.

Data migration:

- Merge shoutout-db and shoutout-bpm-db into a single shoutout schema
- Preserve Mux video asset IDs for shoutout fulfillment
- Class-catalog: remove Arlo LMS tables/columns
- Journey: merge into class-catalog schema

API migration:

- Shoutout: single GraphQL endpoint replaces separate service + BPM
- Class-catalog: unchanged GraphQL schema, cleaned internals

Rollback: Re-deploy individual services; restore databases.

Success criteria:

- [ ] Shoutout offer→purchase→record→deliver workflow completes end-to-end
- [ ] Class/course/session CRUD operations pass tests
- [ ] Learning credits (CEU/PDU) tracking functional
- [ ] No Arlo-related code paths remain


Wave 4: High-Risk & Infrastructure

Phase 6: Platform Services + Frontend + Keycloak

Platform services:

- Merge tags, tracking, group-profile, org-manager into platform-service
- Low data complexity, few dependencies

Frontend unification:

- Execute the 4-phase plan from target-architecture.md
- Shared component library → CSS standardization → repo merge → dead code removal
- Can run in parallel with backend migrations

Inventory service:

- Upgrade in place (do NOT merge — too many dependents)
- Align core-lib version, add tests
- Last application service to touch

Keycloak migration (LAST):

- Export realm from existing Keycloak
- Import into target Keycloak instance (same or upgraded version)
- Reconfigure all services to the new issuer URI (see the configuration sketch below)
- Coordinate a maintenance window: a brief auth interruption is acceptable
- Magic Link SPI + Session Restrictor SPI must be tested on the target
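One way to keep the issuer cutover to a pure configuration change is to resolve each service's JWT decoder from an externalized issuer URI, so both the maintenance-window switch and the DNS rollback below are property rollouts. A sketch using Spring Security's standard property name; the actual services may wire this differently:

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.oauth2.jwt.JwtDecoder;
import org.springframework.security.oauth2.jwt.JwtDecoders;

// The issuer URI comes from configuration, so switching Keycloak instances
// never requires a code change across the 28+ services.
@Configuration
public class JwtDecoderConfig {

    @Bean
    public JwtDecoder jwtDecoder(
            @Value("${spring.security.oauth2.resourceserver.jwt.issuer-uri}") String issuerUri) {
        // Fetches the issuer's OIDC metadata and validates tokens against it.
        return JwtDecoders.fromIssuerLocation(issuerUri);
    }
}
```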

Rollback: Keycloak rollback = revert DNS to old Keycloak instance. All services re-validate against old issuer.

Success criteria:

- [ ] Platform services consolidated and tested
- [ ] Unified frontend monorepo deploying all 5 apps
- [ ] Inventory service upgraded with tests
- [ ] Keycloak realm import/export validated in staging
- [ ] All 28+ services authenticating against migrated Keycloak
- [ ] Magic Link and Session Restrictor SPIs functional


Shared Cluster Consolidation

This can happen in parallel with application migration:

| Step | Description | When |
|------|-------------|------|
| 1 | Provision regional shared GKE cluster | Phase 0 |
| 2 | Deploy first tenant namespace (dev/staging) | Phase 0 |
| 3 | Migrate dev tenant to shared cluster | After Phase 0 validated |
| 4 | Migrate staging tenant | After dev validated |
| 5 | Migrate first production tenant | After Wave 2 |
| 6 | Migrate remaining production tenants | After Wave 3 |
| 7 | Decommission old per-tenant clusters | After Wave 4 |

RabbitMQ Migration

Approach: Migrate queue-by-queue as services consolidate.

| When Service Consolidates | Queue Migration |
|---------------------------|-----------------|
| Payment consolidation | Merge wallet/transaction/stripe queues into payment queues |
| Notification consolidation | Merge email/sms/notification queues into notification queues |
| Identity consolidation | Merge celebrity/fan/users queues into identity queues |

Message contract preservation: All ~75 message types preserved. Consolidated services handle the same message types as their predecessors. Queue names change but message schemas don’t.

Drain strategy per consolidation:

1. Deploy consolidated service alongside Gen 2 services
2. Consolidated service starts consuming from NEW queues (see the listener sketch below)
3. Route publishers to new queues (update MessageSender config)
4. Drain old queues (stop old consumers, let messages exhaust)
5. Remove old queue declarations
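A sketch of step 2: the consolidated service consumes from a new queue while the old consumers drain the old one. Queue and message names are illustrative, and a JSON message converter is assumed for the record payload; the actual ~75 message schemas are preserved as-is:

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

// Consolidated payment-service consumer. It handles the same message
// contract the Gen 2 wallet service did; only the queue name is new.
@Component
public class PaymentQueueConsumer {

    @RabbitListener(queues = "payment.wallet-credited") // hypothetical new queue name
    public void onWalletCredited(WalletCreditedMessage message) {
        // Same message schema as before the consolidation.
        System.out.printf("credit %s to wallet %s%n", message.amount(), message.walletId());
    }

    public record WalletCreditedMessage(String walletId, String amount) {}
}
```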


Elasticsearch & Search Migration

| Component | Strategy |
|-----------|----------|
| Log indices | Replace peeq-logging pipeline with Cloud Logging. Rebuild log aggregation if needed. |
| Search indices | Rebuild from source databases after service consolidation. |
| Kibana dashboards | Export manually (not in code). Import into Elastic Cloud 8.x or Grafana. |
| Elasticsearch version | Upgrade from 7.x to 8.x or switch to managed Elastic Cloud. |

Data Migration Per Domain

| Domain | Tables | External IDs | Strategy | Risk |
|--------|--------|--------------|----------|------|
| Payment | 15 | Stripe (customer, product, subscription) | Schema merge + validate balances | High |
| Notification | ~14 | Mandrill, Twilio SIDs (in logs only) | Already shared DB — no data migration | Low |
| Identity | ~12 | None (Keycloak UUIDs only) | Schema merge | Medium |
| Content | ~20 | Mux (asset, playback), NFS paths | Schema merge + NFS→GCS | High |
| Shoutout | ~6 | Mux (asset, playback) | Schema merge | Medium |
| Class-Catalog | ~15+ | None | In-place upgrade + remove Arlo | Medium |
| Platform | ~8 | None | Schema merge | Low |
| Inventory | ~12 | Stripe product IDs | In-place upgrade | Low |

Rollback Plans

Per-Phase Rollback

Every phase has an independent rollback:

| Phase | Rollback Mechanism | Data Recovery |
|-------|--------------------|---------------|
| 0 (Infra) | Revert Terraform | N/A |
| 0.5 (BPM) | Re-enable CIB Seven, disable state machine | CIB Seven DB untouched |
| 1 (Payment) | Re-deploy 4 services, restore 4 databases | Point-in-time recovery |
| 2 (Notification) | Re-deploy 3 services | Shared DB unchanged |
| 3 (Identity) | Re-deploy 3 services, restore 3 databases | Point-in-time recovery |
| 4 (Content) | Re-deploy 2 services, restore 2 databases | Point-in-time recovery + NFS |
| 5 (Events) | Re-deploy services, restore databases | Point-in-time recovery |
| 6 (Platform) | Re-deploy services, revert Keycloak DNS | Point-in-time recovery |

Rollback Triggers

Automatically roll back if:

- Error rate exceeds 1% for >5 minutes post-migration
- Payment failure rate exceeds 0.1% for >2 minutes
- P99 latency exceeds 2x baseline for >10 minutes
- Any financial data inconsistency is detected
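The triggers above, encoded as a small decision function to make the thresholds explicit and testable. In practice these would live in PrometheusRules and deployment automation, not application code:

```java
import java.time.Duration;

// Rollback thresholds from the list above, as a pure function. Rates are
// fractions (1% = 0.01); each window is how long the condition has held.
public final class RollbackPolicy {

    public static boolean shouldRollback(double errorRate, Duration errorWindow,
                                         double paymentFailureRate, Duration paymentWindow,
                                         double p99Latency, double baselineP99, Duration latencyWindow,
                                         boolean financialInconsistency) {
        if (financialInconsistency) return true; // any inconsistency rolls back immediately
        if (errorRate > 0.01 && errorWindow.compareTo(Duration.ofMinutes(5)) > 0) return true;
        if (paymentFailureRate > 0.001 && paymentWindow.compareTo(Duration.ofMinutes(2)) > 0) return true;
        return p99Latency > 2 * baselineP99 && latencyWindow.compareTo(Duration.ofMinutes(10)) > 0;
    }

    private RollbackPolicy() {}
}
```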


Success Criteria (Overall)

Technical

Business

Operational


Last updated: 2026-01-30 (Session 12) | Review by: 2026-04-30 | Staleness risk: High — migration plan should be validated with the engineering team before execution