Migration Strategy
Key Takeaways
- Domain-by-domain strangler fig — Migrate one domain at a time while Gen 2 services continue running. Each phase is independently deployable and rollback-safe. No big-bang cutover.
- Phases 0–6 over 4 waves — Wave 1: Foundation (infra + testing + BPM). Wave 2: Low-risk domains (wallet/transaction, communication). Wave 3: Medium-risk domains (identity, content, events). Wave 4: High-risk domains and infrastructure (platform services, frontend, inventory, Keycloak).
- Backward-compatible API throughout — GraphQL schemas evolve additively (new fields, deprecate old). Both Gen 2 and consolidated services serve traffic simultaneously during transition. Frontend switches per-domain.
- BPM migration requires drain-and-switch — Stop new workflow instances on CIB Seven, let in-flight instances complete (hours at most), then switch to Spring State Machine. No parallel execution of the same workflow on both engines.
- Keycloak is the last migration — All services depend on Keycloak JWT. Migration approach: export/import realm, reconfigure all services to new issuer URI in a coordinated maintenance window.
Migration Decision Question
How do we get from here to Gen 3?
Migration Approach: Strangler Fig
Each domain follows this pattern:
1. Build consolidated service alongside Gen 2
2. Route traffic to consolidated service (Istio weight-based)
3. Validate with monitoring + SLOs
4. Drain Gen 2 service
5. Remove Gen 2 deployment
Key Principles
- Zero-downtime requirement — all migrations use Istio traffic shifting
- Rollback always available — Gen 2 services remain deployed until consolidated service is validated
- One domain at a time — reduce blast radius
- Data migration is per-domain — database merges happen within domain boundaries
- External IDs preserved — Stripe, Mux, Zoom, Stream Chat IDs carried forward
Wave 1: Foundation (Prerequisites)
Phase 0: Infrastructure & Platform Readiness
What: Establish infrastructure prerequisites before any service migration.
| Task | Description | Dependencies |
|---|---|---|
| Regional GKE | Upgrade clusters from zonal to regional | Terraform module update |
| Regional Cloud SQL | Enable HA for Cloud SQL instances | Terraform, maintenance window |
| NetworkPolicies | Implement default-deny + explicit allow | Integration patterns doc |
| CI Security Gates | Enforce Trivy/Qwiet failures on high/critical | GitHub Actions workflows |
| OpenTelemetry | Add auto-instrumentation to common Helm chart | Helm chart update |
| Alerting | Configure PrometheusRules for critical alerts | SLI/SLO definitions |
| Test Framework | Set up integration test infrastructure | CI pipeline update |
| Frontend Dead Code | Remove 5 dead API gateways | peeq-mono + frontends |
Rollback: Infrastructure changes are independent of application services. Revert Terraform if needed.
Success criteria:
- [ ] Regional GKE cluster operational
- [ ] Regional Cloud SQL with automated failover tested
- [ ] NetworkPolicies deployed (not breaking existing traffic)
- [ ] CI pipeline blocks on high/critical vulnerabilities
- [ ] OpenTelemetry traces visible in Grafana
- [ ] At least 5 critical alert rules firing correctly
- [ ] Integration test framework running in CI
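As a concrete starting point for the Test Framework row above, integration tests can run against throwaway containers instead of shared environments. A minimal sketch, assuming JUnit 5, Spring Boot's test support, and the Testcontainers PostgreSQL module; the table and values are placeholders:

```java
// Sketch of the integration test setup: a real Postgres in a throwaway container,
// no shared dev database. Assumes a Spring Boot application is on the classpath;
// the wallet table and values here are placeholders.
import javax.sql.DataSource;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest
@Testcontainers
class DatabaseSmokeIT {

    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16-alpine");

    // Point the Spring datasource at the container before the context starts.
    @DynamicPropertySource
    static void datasourceProps(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
    }

    @Autowired
    DataSource dataSource;

    @Test
    void canRoundTripARow() throws Exception {
        try (var conn = dataSource.getConnection(); var stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE wallet (user_id TEXT PRIMARY KEY, balance_cents BIGINT)");
            stmt.execute("INSERT INTO wallet VALUES ('user-123', 1500)");
            try (var rs = stmt.executeQuery("SELECT balance_cents FROM wallet WHERE user_id = 'user-123'")) {
                rs.next();
                assertThat(rs.getLong(1)).isEqualTo(1500L);
            }
        }
    }
}
```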
Phase 0.5: BPM Engine Replacement
What: Replace CIB Seven with Spring State Machine for both workflows.
| Step | Description | Risk |
|---|---|---|
| 1. Build state machines | Implement purchase + shoutout state machines | Low — well-defined states |
| 2. Dual-write validation | New state machine logs alongside CIB Seven (read-only) | Low — no data mutation |
| 3. Shadow traffic | Route new workflow instances to state machine, compare results | Medium — needs monitoring |
| 4. Drain CIB Seven | Stop new instances on CIB Seven, let in-flight complete | Medium — timer events |
| 5. Switch | All new workflows use state machine exclusively | Low — validated in step 3 |
| 6. Remove CIB Seven | Remove CIB Seven dependency, Keycloak plugin | Low — no longer used |
Rollback: Revert to CIB Seven at any step. In-flight instances on CIB Seven continue regardless.
Success criteria:
- [ ] Both state machines pass all integration tests
- [ ] Shadow traffic shows identical state transitions for >100 workflows
- [ ] Zero CIB Seven in-flight instances (fully drained)
- [ ] Keycloak plugin removed without identity sync issues
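Step 1 in the table above ("Build state machines") covers the purchase and shoutout workflows. A minimal Spring State Machine sketch of the purchase flow follows; the state and event names are assumptions and would be derived from the existing CIB Seven BPMN definitions:

```java
// Sketch only: state and event names are assumptions, not the real BPMN definitions.
import java.util.EnumSet;

import org.springframework.statemachine.StateMachine;
import org.springframework.statemachine.config.StateMachineBuilder;

public class PurchaseStateMachineSketch {

    enum PurchaseState { CREATED, PAYMENT_PENDING, PAID, FAILED, COMPLETED }
    enum PurchaseEvent { CHECKOUT, PAYMENT_SUCCEEDED, PAYMENT_FAILED, FULFILLED }

    // Builds an in-memory machine; persistence (e.g. a StateMachinePersister backed
    // by the service database) would be added so in-flight workflows survive restarts.
    public static StateMachine<PurchaseState, PurchaseEvent> build() throws Exception {
        StateMachineBuilder.Builder<PurchaseState, PurchaseEvent> builder = StateMachineBuilder.builder();

        builder.configureStates()
            .withStates()
                .initial(PurchaseState.CREATED)
                .states(EnumSet.allOf(PurchaseState.class))
                .end(PurchaseState.COMPLETED)
                .end(PurchaseState.FAILED);

        builder.configureTransitions()
            .withExternal().source(PurchaseState.CREATED).target(PurchaseState.PAYMENT_PENDING)
                .event(PurchaseEvent.CHECKOUT)
                .and()
            .withExternal().source(PurchaseState.PAYMENT_PENDING).target(PurchaseState.PAID)
                .event(PurchaseEvent.PAYMENT_SUCCEEDED)
                .and()
            .withExternal().source(PurchaseState.PAYMENT_PENDING).target(PurchaseState.FAILED)
                .event(PurchaseEvent.PAYMENT_FAILED)
                .and()
            .withExternal().source(PurchaseState.PAID).target(PurchaseState.COMPLETED)
                .event(PurchaseEvent.FULFILLED);

        return builder.build();
    }
}
```

The dual-write and shadow-traffic steps can then compare the end state this machine reaches with what CIB Seven recorded for the same business key.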
Wave 2: Low-Risk Domains
Phase 1: Payment Consolidation (wallet + transaction → payment-service)
What: Merge wallet (3 tables), transaction (1 table), stripe (7 tables), and subscriptions (4 tables) into a single payment-service.
Data migration:
- Merge 4 databases into single payment schema
- Preserve all Stripe external IDs (customer, product, subscription)
- Wallet balance + transaction history must be exact (financial data)
- No schema redesign — additive merge

API migration:
- Consolidated GraphQL schema includes all existing operations
- Frontend switches from /api/wallet, /api/transaction, /api/stripe, /api/subscriptions to /api/payment
- Istio route: old paths proxy to new service during transition
Rollback: Split service back into 4; restore individual databases from backup.
Success criteria:
- [ ] All financial operations pass integration tests
- [ ] Wallet balances match pre-migration values (100% reconciliation)
- [ ] Stripe webhooks received and processed correctly
- [ ] Zero payment failures during migration
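The 100% reconciliation criterion is easiest to enforce with a standalone comparison job run before cutover and again before decommissioning Gen 2. A minimal JDBC sketch, where table and column names are assumptions:

```java
// Reconciliation sketch: compares wallet balances between the Gen 2 wallet DB
// and the consolidated payment DB. Table and column names are assumptions;
// credentials are expected to be embedded in the JDBC URLs passed as arguments.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class WalletReconciliation {

    public static void main(String[] args) throws Exception {
        Map<String, Long> oldBalances = load(args[0], "SELECT user_id, balance_cents FROM wallet");
        Map<String, Long> newBalances = load(args[1], "SELECT user_id, balance_cents FROM payment.wallet");

        int mismatches = 0;
        for (Map.Entry<String, Long> e : oldBalances.entrySet()) {
            Long migrated = newBalances.get(e.getKey());
            if (!e.getValue().equals(migrated)) {
                mismatches++;
                System.out.printf("MISMATCH user=%s old=%d new=%s%n", e.getKey(), e.getValue(), migrated);
            }
        }
        System.out.printf("Checked %d wallets, %d mismatches%n", oldBalances.size(), mismatches);
        if (mismatches > 0 || oldBalances.size() != newBalances.size()) {
            System.exit(1); // fail the migration gate
        }
    }

    private static Map<String, Long> load(String jdbcUrl, String sql) throws Exception {
        Map<String, Long> balances = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                balances.put(rs.getString(1), rs.getLong(2));
            }
        }
        return balances;
    }
}
```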
Phase 2: Communication Consolidation (email + sms + notifications → notification-service)
What: Merge email, sms, and notifications into a single notification-service. The three services already share a database.
Data migration:
- Database already shared (peeq-notification-service-db) — no data migration needed
- Replace lutung Mandrill library with maintained HTTP client

API migration:
- Consolidated service absorbs all RabbitMQ handlers (15 inbound total)
- Email/SMS delivery becomes internal implementation detail
- Frontend has no direct dependency (all async via RabbitMQ)
Rollback: Re-deploy individual services; they already share the same database.
Success criteria:
- [ ] All notification channels (email, SMS, push) deliver correctly
- [ ] Mandrill library replacement sends identical emails
- [ ] RabbitMQ message processing latency unchanged
- [ ] Zero missed notifications during migration
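One option for the lutung replacement is a thin client over Mandrill's HTTP API built on the JDK's HttpClient. A sketch, assuming the standard messages/send endpoint; a real implementation would serialize with Jackson, escape content properly, and add retries and timeouts:

```java
// Sketch of a lutung replacement using the JDK HttpClient. Assumes Mandrill's
// messages/send endpoint; real code would build the JSON with Jackson (the naive
// formatting below does not escape values) and check the response body.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MandrillHttpSender {

    private static final String SEND_URL = "https://mandrillapp.com/api/1.0/messages/send.json";

    private final HttpClient http = HttpClient.newHttpClient();
    private final String apiKey;

    public MandrillHttpSender(String apiKey) {
        this.apiKey = apiKey;
    }

    public int send(String fromEmail, String toEmail, String subject, String html) throws Exception {
        String body = """
            {"key":"%s","message":{"from_email":"%s","subject":"%s","html":"%s",
             "to":[{"email":"%s","type":"to"}]}}
            """.formatted(apiKey, fromEmail, subject, html, toEmail);

        HttpRequest request = HttpRequest.newBuilder(URI.create(SEND_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode(); // caller decides whether to retry or dead-letter
    }
}
```

Validating the "sends identical emails" criterion can then be a staging-only diff of rendered payloads between the old and new senders.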
Wave 3: Medium-Risk Domains
Phase 3: Identity Consolidation (celebrity + fan + users → identity-service)
What: Merge three identity services into one, preserving all Keycloak integration.
Data migration:
- Merge celebrity-db, fan-db, users-db into single identity schema
- All tables already use Keycloak UUIDs as primary keys — no ID mapping needed
- Preserve celebrity profile images, banners, shortlinks
- Preserve fan follow relationships (MD5 composite PKs)

API migration:
- Consolidated GraphQL schema: ~63 operations (27 celebrity + 21 fan + 15 users)
- Remove deprecated queries (3 celebrity queries)
- Frontend switches from /api/celebrity, /api/fan, /api/users to /api/identity
Rollback: Re-deploy individual services; restore individual databases.
Success criteria:
- [ ] All profile operations (create, update, delete, query) pass tests
- [ ] Follow/unfollow system works correctly
- [ ] Magic link authentication unaffected
- [ ] Keycloak admin operations (role, group management) functional
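On the API side, consolidation mostly means co-locating the former celebrity, fan, and users resolvers behind one schema. A sketch assuming Spring for GraphQL; the record shapes, field names, and placeholder return values are illustrative only:

```java
// Sketch of a consolidated identity GraphQL controller (assumes Spring for GraphQL).
// Record shapes and the placeholder values stand in for the real celebrity/fan/users
// persistence, which keys everything by Keycloak UUID.
import java.util.List;

import org.springframework.graphql.data.method.annotation.Argument;
import org.springframework.graphql.data.method.annotation.QueryMapping;
import org.springframework.stereotype.Controller;

@Controller
public class IdentityController {

    public record CelebrityProfile(String id, String displayName, String shortlink) {}
    public record FanProfile(String id, String displayName, List<String> following) {}

    // Former celebrity-service query, now served from /api/identity.
    @QueryMapping
    public CelebrityProfile celebrityProfile(@Argument String id) {
        return new CelebrityProfile(id, "placeholder", "placeholder");
    }

    // Former fan-service query, same endpoint, same Keycloak UUID key.
    @QueryMapping
    public FanProfile fanProfile(@Argument String id) {
        return new FanProfile(id, "placeholder", List.of());
    }
}
```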
Phase 4: Content Consolidation (content + media → content-service)
What: Merge content and media services, unifying Mux integration.
Data migration:
- Merge content-db and media-db into single content schema
- Preserve all Mux asset IDs and playback IDs
- Migrate NFS file paths to GCS references (separate sub-project)
- Content service already runs Java 24 — downgrade to Java 21 LTS for consistency

API migration:
- Consolidated GraphQL schema: 40+ operations
- Unified Mux webhook handler (currently in both services)
- Frontend switches /api/media calls to /api/content (the existing /api/content path is unchanged)
Rollback: Re-deploy individual services; restore databases. NFS migration is separate and independently rollbackable.
Success criteria:
- [ ] All content CRUD operations pass tests
- [ ] Mux video upload, transcode, and playback work end-to-end
- [ ] File upload (multipart) works with new storage backend
- [ ] Search indices rebuilt and returning correct results
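The NFS→GCS sub-project reduces to copying files and rewriting stored references. A sketch using the google-cloud-storage client; the bucket name, path layout, and surrounding bookkeeping are assumptions:

```java
// Sketch of the NFS→GCS copy step. Bucket name and path layout are assumptions;
// the real job would also update the content-db rows with the GCS references and
// run idempotently so it can be resumed after a partial failure.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class NfsToGcsCopier {

    public static void main(String[] args) throws Exception {
        Path nfsRoot = Path.of(args[0]);   // mounted NFS export (assumption)
        String bucket = args[1];           // target GCS bucket (assumption)
        Storage storage = StorageOptions.getDefaultInstance().getService();

        try (Stream<Path> files = Files.walk(nfsRoot)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                String objectName = nfsRoot.relativize(file).toString();
                BlobInfo blob = BlobInfo.newBuilder(BlobId.of(bucket, objectName)).build();
                try {
                    storage.createFrom(blob, file); // streams the file to GCS
                    System.out.println("gs://" + bucket + "/" + objectName);
                } catch (Exception e) {
                    System.err.println("FAILED " + file + ": " + e.getMessage());
                }
            });
        }
    }
}
```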
Phase 5: Events Domain (shoutout consolidation + class-catalog upgrade)
What: Merge shoutout + shoutout-bpm (BPM already replaced in Phase 0.5). Upgrade class-catalog in place (remove Arlo dead code). Merge journey into class-catalog.
Data migration:
- Merge shoutout-db and shoutout-bpm-db into single shoutout schema
- Preserve Mux video asset IDs for shoutout fulfillment
- Class-catalog: remove Arlo LMS tables/columns
- Journey: merge into class-catalog schema

API migration:
- Shoutout: single GraphQL endpoint replaces separate service + BPM
- Class-catalog: unchanged GraphQL schema, cleaned internals
Rollback: Re-deploy individual services; restore databases.
Success criteria:
- [ ] Shoutout offer→purchase→record→deliver workflow completes end-to-end
- [ ] Class/course/session CRUD operations pass tests
- [ ] Learning credits (CEU/PDU) tracking functional
- [ ] No Arlo-related code paths remain
Wave 4: High-Risk & Infrastructure
Phase 6: Platform Services + Frontend + Keycloak
Platform services:
- Merge tags, tracking, group-profile, org-manager into platform-service
- Low data complexity, few dependencies

Frontend unification:
- Execute 4-phase plan from target-architecture.md
- Shared component library → CSS standardization → repo merge → dead code removal
- Can run in parallel with backend migrations

Inventory service:
- Upgrade in place (do NOT merge — too many dependents)
- Align core-lib version, add tests
- Last application service to touch

Keycloak migration (LAST):
- Export realm from existing Keycloak
- Import into target Keycloak instance (same or upgraded version)
- Reconfigure all services to new issuer URI
- Coordinate maintenance window: brief auth interruption acceptable
- Magic Link SPI + Session Restrictor SPI must be tested on target
Rollback: Keycloak rollback = revert DNS to old Keycloak instance. All services re-validate against old issuer.
Success criteria:
- [ ] Platform services consolidated and tested
- [ ] Unified frontend monorepo deploying all 5 apps
- [ ] Inventory service upgraded with tests
- [ ] Keycloak realm import/export validated in staging
- [ ] All 28+ services authenticating against migrated Keycloak
- [ ] Magic Link and Session Restrictor SPIs functional
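Reconfiguring services to the new issuer URI stays a pure configuration change as long as each service derives its JWT validation from a single issuer property. A Spring Security sketch of that pattern; the property name is Spring Boot's standard one, and the rest is illustrative:

```java
// Sketch: JWT validation keyed off one issuer-uri property, so the Keycloak
// cutover for each service is a config change plus a rolling restart.
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.oauth2.jwt.JwtDecoder;
import org.springframework.security.oauth2.jwt.JwtDecoders;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ResourceServerConfig {

    // Points at the Keycloak realm, e.g. https://<new-keycloak-host>/realms/<realm> (placeholder).
    @Value("${spring.security.oauth2.resourceserver.jwt.issuer-uri}")
    private String issuerUri;

    @Bean
    JwtDecoder jwtDecoder() {
        // Discovers JWKS from the issuer's OIDC metadata and validates the iss claim,
        // so switching issuers only requires changing the property above.
        return JwtDecoders.fromIssuerLocation(issuerUri);
    }

    @Bean
    SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
        http
            .authorizeHttpRequests(auth -> auth.anyRequest().authenticated())
            .oauth2ResourceServer(oauth2 -> oauth2.jwt(jwt -> jwt.decoder(jwtDecoder())));
        return http.build();
    }
}
```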
Shared Cluster Consolidation
This can happen in parallel with application migration:
| Step | Description | When |
|---|---|---|
| 1 | Provision regional shared GKE cluster | Phase 0 |
| 2 | Deploy first tenant namespace (dev/staging) | Phase 0 |
| 3 | Migrate dev tenant to shared cluster | After Phase 0 validated |
| 4 | Migrate staging tenant | After dev validated |
| 5 | Migrate first production tenant | After Wave 2 |
| 6 | Migrate remaining production tenants | After Wave 3 |
| 7 | Decommission old per-tenant clusters | After Wave 4 |
RabbitMQ Migration
Approach: Migrate queue-by-queue as services consolidate.
| When Service Consolidates | Queue Migration |
|---|---|
| Payment consolidation | Merge wallet/transaction/stripe queues into payment queues |
| Notification consolidation | Merge email/sms/notification queues into notification queues |
| Identity consolidation | Merge celebrity/fan/users queues into identity queues |
Message contract preservation: All ~75 message types preserved. Consolidated services handle the same message types as their predecessors. Queue names change but message schemas don’t.
Drain strategy per consolidation:
1. Deploy consolidated service alongside Gen 2 services
2. Consolidated service starts consuming from NEW queues
3. Route publishers to new queues (update MessageSender config)
4. Drain old queues (stop old consumers, let messages exhaust)
5. Remove old queue declarations
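Steps 2 and 3 above amount to declaring the new queue in the consolidated service and re-pointing the binding, while the message contract stays untouched. A Spring AMQP sketch; the queue, exchange, routing key, and payload fields are assumptions, and a Jackson JSON message converter is assumed for the record payload:

```java
// Sketch of steps 2/3: the consolidated notification-service declares and consumes
// the NEW queue while the message contract (payload shape) is unchanged.
// Queue, exchange, routing key, and payload fields are assumptions; a JSON message
// converter is assumed on the listener container factory.
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.TopicExchange;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Component;

@Configuration
class NotificationQueueConfig {

    @Bean
    Queue notificationEmailQueue() {
        return new Queue("notification.email.requested", true); // new durable queue
    }

    @Bean
    TopicExchange notificationExchange() {
        return new TopicExchange("notification.events");
    }

    @Bean
    Binding emailBinding(Queue notificationEmailQueue, TopicExchange notificationExchange) {
        // Publishers keep sending the same message type; only the destination changes.
        return BindingBuilder.bind(notificationEmailQueue).to(notificationExchange).with("email.requested");
    }
}

@Component
class EmailRequestedListener {

    // Same payload shape the Gen 2 email-service consumed, now handled here.
    record EmailRequested(String toEmail, String template, String locale) {}

    @RabbitListener(queues = "notification.email.requested")
    public void onEmailRequested(EmailRequested message) {
        // hand off to the internal email channel (Mandrill replacement, etc.)
        System.out.println("email requested for " + message.toEmail());
    }
}
```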
Elasticsearch & Search Migration
| Component | Strategy |
|---|---|
| Log indices | Replace peeq-logging pipeline with Cloud Logging. Rebuild log aggregation if needed. |
| Search indices | Rebuild from source databases after service consolidation. |
| Kibana dashboards | Export manually (not in code). Import into Elastic Cloud 8.x or Grafana. |
| Elasticsearch version | Upgrade from 7.x to 8.x or switch to managed Elastic Cloud. |
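Rebuilding search indices from source databases can be a one-off batch job rather than a reindex from the old cluster. A sketch using the Elasticsearch 8.x Java API client; the index name, SQL, and document shape are assumptions, and a real job would batch writes through the bulk API:

```java
// Sketch: rebuild the content search index directly from the consolidated
// content database. Index name, column names, and document shape are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.rest_client.RestClientTransport;

public class ContentIndexRebuild {

    record ContentDoc(String id, String title, String celebrityId) {}

    public static void main(String[] args) throws Exception {
        RestClient rest = RestClient.builder(HttpHost.create(args[0])).build(); // ES endpoint
        ElasticsearchClient es = new ElasticsearchClient(new RestClientTransport(rest, new JacksonJsonpMapper()));

        try (Connection conn = DriverManager.getConnection(args[1]); // content-db JDBC URL
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, title, celebrity_id FROM content")) {
            while (rs.next()) {
                ContentDoc doc = new ContentDoc(rs.getString("id"), rs.getString("title"), rs.getString("celebrity_id"));
                // One document per row; a production job would use the bulk API in batches.
                es.index(i -> i.index("content").id(doc.id()).document(doc));
            }
        }
        rest.close();
    }
}
```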
Data Migration Per Domain
| Domain | Tables | External IDs | Strategy | Risk |
|---|---|---|---|---|
| Payment | 15 | Stripe (customer, product, subscription) | Schema merge + validate balances | High |
| Notification | ~14 | Mandrill, Twilio SIDs (in logs only) | Already shared DB — no data migration | Low |
| Identity | ~12 | None (Keycloak UUIDs only) | Schema merge | Medium |
| Content | ~20 | Mux (asset, playback), NFS paths | Schema merge + NFS→GCS | High |
| Shoutout | ~6 | Mux (asset, playback) | Schema merge | Medium |
| Class-Catalog | ~15+ | None | In-place upgrade + remove Arlo | Medium |
| Platform | ~8 | None | Schema merge | Low |
| Inventory | ~12 | Stripe product IDs | In-place upgrade | Low |
Rollback Plans
Per-Phase Rollback
Every phase has an independent rollback:
| Phase | Rollback Mechanism | Data Recovery |
|---|---|---|
| 0 (Infra) | Revert Terraform | N/A |
| 0.5 (BPM) | Re-enable CIB Seven, disable state machine | CIB Seven DB untouched |
| 1 (Payment) | Re-deploy 4 services, restore 4 databases | Point-in-time recovery |
| 2 (Notification) | Re-deploy 3 services | Shared DB unchanged |
| 3 (Identity) | Re-deploy 3 services, restore 3 databases | Point-in-time recovery |
| 4 (Content) | Re-deploy 2 services, restore 2 databases | Point-in-time recovery + NFS |
| 5 (Events) | Re-deploy services, restore databases | Point-in-time recovery |
| 6 (Platform) | Re-deploy services, revert Keycloak DNS | Point-in-time recovery |
Rollback Triggers
Roll back automatically if:
- Error rate exceeds 1% for >5 minutes post-migration
- Payment failure rate exceeds 0.1% for >2 minutes
- P99 latency exceeds 2x baseline for >10 minutes
- Any financial data inconsistency is detected
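These triggers belong in PrometheusRules, but the same thresholds can also gate the deployment pipeline directly. A sketch that runs the error-rate check against Prometheus's standard instant-query API; the metric names in the PromQL expression and the Prometheus address are assumptions:

```java
// Sketch of an automated rollback gate: query Prometheus for the post-migration
// error rate of a consolidated service. Metric names in the PromQL are assumptions;
// a real gate would parse the JSON with Jackson and fail the rollout above 0.01.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RollbackGate {

    public static void main(String[] args) throws Exception {
        String prometheusBase = args[0]; // e.g. http://prometheus.monitoring:9090 (assumption)
        String promql = """
            sum(rate(http_server_requests_seconds_count{service="payment-service",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{service="payment-service"}[5m]))
            """;

        URI uri = URI.create(prometheusBase + "/api/v1/query?query="
            + URLEncoder.encode(promql, StandardCharsets.UTF_8));

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());

        // Prints the raw instant-query result; threshold comparison is left to the pipeline.
        System.out.println(response.body());
    }
}
```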
Success Criteria (Overall)
Technical
Business
Operational
Last updated: 2026-01-30 (Session 12)
Review by: 2026-04-30
Staleness risk: High — migration plan should be validated with engineering team before execution