Migration Strategy
Key Takeaways
- Domain-by-domain strangler fig — Migrate one domain at a time while Gen 2 services continue running. Each phase is independently deployable and rollback-safe. No big-bang cutover.
- Phases 0–6 over 4 waves — Wave 1: Foundation (infra + testing + BPM). Wave 2: Low-risk domains (wallet/transaction, communication). Wave 3: Medium-risk domains (identity, content, events). Wave 4: High-risk domains and infrastructure (platform services, frontend, inventory, Keycloak).
- Backward-compatible API throughout — GraphQL schemas evolve additively (new fields, deprecate old). Both Gen 2 and consolidated services serve traffic simultaneously during transition. Frontend switches per-domain.
- BPM migration requires drain-and-switch — Stop new workflow instances on CIB Seven, let in-flight instances complete (hours at most), then switch to Spring State Machine. No parallel execution of the same workflow on both engines.
- Keycloak is the last migration — All services depend on Keycloak JWT. Migration approach: export/import realm, reconfigure all services to new issuer URI in a coordinated maintenance window.
Migration Decision Question
How do we get from here to Gen 3?
Migration Approach: Strangler Fig
Each domain follows this pattern:
1. Build consolidated service alongside Gen 2
2. Route traffic to consolidated service (Istio weight-based)
3. Validate with monitoring + SLOs
4. Drain Gen 2 service
5. Remove Gen 2 deployment
Key Principles
- Zero-downtime requirement — all migrations use Istio traffic shifting
- Rollback always available — Gen 2 services remain deployed until consolidated service is validated
- One domain at a time — reduce blast radius
- Data migration is per-domain — database merges happen within domain boundaries
- External IDs preserved — Stripe, Mux, Zoom, Stream Chat IDs carried forward
Wave 1: Foundation (Prerequisites)
Phase 0: Infrastructure & Platform Readiness
What: Establish infrastructure prerequisites before any service migration.
| Task | Description | Dependencies |
|---|---|---|
| Regional GKE | Upgrade clusters from zonal to regional | Terraform module update |
| Regional Cloud SQL | Enable HA for Cloud SQL instances | Terraform, maintenance window |
| NetworkPolicies | Implement default-deny + explicit allow | Integration patterns doc |
| CI Security Gates | Enforce Trivy/Qwiet failures on high/critical | GitHub Actions workflows |
| OpenTelemetry | Add auto-instrumentation to common Helm chart | Helm chart update |
| Alerting | Configure PrometheusRules for critical alerts | SLI/SLO definitions |
| Test Framework | Set up integration test infrastructure | CI pipeline update |
| Frontend Dead Code | Remove 5 dead API gateways | peeq-mono + frontends |
Rollback: Infrastructure changes are independent of application services. Revert Terraform if needed.
Success criteria:
- [ ] Regional GKE cluster operational
- [ ] Regional Cloud SQL with automated failover tested
- [ ] NetworkPolicies deployed (not breaking existing traffic)
- [ ] CI pipeline blocks on high/critical vulnerabilities
- [ ] OpenTelemetry traces visible in Grafana
- [ ] At least 5 critical alert rules firing correctly
- [ ] Integration test framework running in CI
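As a concrete starting point for the Test Framework row above, integration tests can run against throwaway containers instead of shared environments. A minimal sketch, assuming JUnit 5, Spring Boot's test support, and the Testcontainers PostgreSQL module; the table and values are placeholders:

```java
// Sketch of the integration test setup: a real Postgres in a throwaway container,
// no shared dev database. Assumes a Spring Boot application is on the classpath;
// the wallet table and values here are placeholders.
import javax.sql.DataSource;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest
@Testcontainers
class DatabaseSmokeIT {

    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16-alpine");

    // Point the Spring datasource at the container before the context starts.
    @DynamicPropertySource
    static void datasourceProps(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
    }

    @Autowired
    DataSource dataSource;

    @Test
    void canRoundTripARow() throws Exception {
        try (var conn = dataSource.getConnection(); var stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE wallet (user_id TEXT PRIMARY KEY, balance_cents BIGINT)");
            stmt.execute("INSERT INTO wallet VALUES ('user-123', 1500)");
            try (var rs = stmt.executeQuery("SELECT balance_cents FROM wallet WHERE user_id = 'user-123'")) {
                rs.next();
                assertThat(rs.getLong(1)).isEqualTo(1500L);
            }
        }
    }
}
```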
Phase 0.5: BPM Engine Replacement
What: Replace CIB Seven with Spring State Machine for both workflows.
| Step | Description | Risk |
|---|---|---|
| 1. Build state machines | Implement purchase + shoutout state machines | Low — well-defined states |
| 2. Dual-write validation | New state machine logs alongside CIB Seven (read-only) | Low — no data mutation |
| 3. Shadow traffic | Route new workflow instances to state machine, compare results | Medium — needs monitoring |
| 4. Drain CIB Seven | Stop new instances on CIB Seven, let in-flight complete | Medium — timer events |
| 5. Switch | All new workflows use state machine exclusively | Low — validated in step 3 |
| 6. Remove CIB Seven | Remove CIB Seven dependency, Keycloak plugin | Low — no longer used |
Rollback: Revert to CIB Seven at any step. In-flight instances on CIB Seven continue regardless.
Success criteria:
- [ ] Both state machines pass all integration tests
- [ ] Shadow traffic shows identical state transitions for >100 workflows
- [ ] Zero CIB Seven in-flight instances (fully drained)
- [ ] Keycloak plugin removed without identity sync issues
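Step 1 in the table above ("Build state machines") covers the purchase and shoutout workflows. A minimal Spring State Machine sketch of the purchase flow follows; the state and event names are assumptions and would be derived from the existing CIB Seven BPMN definitions:

```java
// Sketch only: state and event names are assumptions, not the real BPMN definitions.
import java.util.EnumSet;

import org.springframework.statemachine.StateMachine;
import org.springframework.statemachine.config.StateMachineBuilder;

public class PurchaseStateMachineSketch {

    enum PurchaseState { CREATED, PAYMENT_PENDING, PAID, FAILED, COMPLETED }
    enum PurchaseEvent { CHECKOUT, PAYMENT_SUCCEEDED, PAYMENT_FAILED, FULFILLED }

    // Builds an in-memory machine; persistence (e.g. a StateMachinePersister backed
    // by the service database) would be added so in-flight workflows survive restarts.
    public static StateMachine<PurchaseState, PurchaseEvent> build() throws Exception {
        StateMachineBuilder.Builder<PurchaseState, PurchaseEvent> builder = StateMachineBuilder.builder();

        builder.configureStates()
            .withStates()
                .initial(PurchaseState.CREATED)
                .states(EnumSet.allOf(PurchaseState.class))
                .end(PurchaseState.COMPLETED)
                .end(PurchaseState.FAILED);

        builder.configureTransitions()
            .withExternal().source(PurchaseState.CREATED).target(PurchaseState.PAYMENT_PENDING)
                .event(PurchaseEvent.CHECKOUT)
                .and()
            .withExternal().source(PurchaseState.PAYMENT_PENDING).target(PurchaseState.PAID)
                .event(PurchaseEvent.PAYMENT_SUCCEEDED)
                .and()
            .withExternal().source(PurchaseState.PAYMENT_PENDING).target(PurchaseState.FAILED)
                .event(PurchaseEvent.PAYMENT_FAILED)
                .and()
            .withExternal().source(PurchaseState.PAID).target(PurchaseState.COMPLETED)
                .event(PurchaseEvent.FULFILLED);

        return builder.build();
    }
}
```

The dual-write and shadow-traffic steps can then compare the end state this machine reaches with what CIB Seven recorded for the same business key.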
Wave 2: Low-Risk Domains
Phase 1: Payment Consolidation (wallet + transaction → payment-service)
What: Merge wallet (3 tables), transaction (1 table), stripe (7 tables), and subscriptions (4 tables) into a single payment-service.
Data migration:
- Merge 4 databases into single payment schema
- Preserve all Stripe external IDs (customer, product, subscription)
- Wallet balance + transaction history must be exact (financial data)
- No schema redesign — additive merge

API migration:
- Consolidated GraphQL schema includes all existing operations
- Frontend switches from /api/wallet, /api/transaction, /api/stripe, /api/subscriptions to /api/payment
- Istio route: old paths proxy to new service during transition
Rollback: Split service back into 4; restore individual databases from backup.
Success criteria:
- [ ] All financial operations pass integration tests
- [ ] Wallet balances match pre-migration values (100% reconciliation)
- [ ] Stripe webhooks received and processed correctly
- [ ] Zero payment failures during migration
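The 100% reconciliation criterion is easiest to enforce with a standalone comparison job run before cutover and again before decommissioning Gen 2. A minimal JDBC sketch, where table and column names are assumptions:

```java
// Reconciliation sketch: compares wallet balances between the Gen 2 wallet DB
// and the consolidated payment DB. Table and column names are assumptions;
// credentials are expected to be embedded in the JDBC URLs passed as arguments.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class WalletReconciliation {

    public static void main(String[] args) throws Exception {
        Map<String, Long> oldBalances = load(args[0], "SELECT user_id, balance_cents FROM wallet");
        Map<String, Long> newBalances = load(args[1], "SELECT user_id, balance_cents FROM payment.wallet");

        int mismatches = 0;
        for (Map.Entry<String, Long> e : oldBalances.entrySet()) {
            Long migrated = newBalances.get(e.getKey());
            if (!e.getValue().equals(migrated)) {
                mismatches++;
                System.out.printf("MISMATCH user=%s old=%d new=%s%n", e.getKey(), e.getValue(), migrated);
            }
        }
        System.out.printf("Checked %d wallets, %d mismatches%n", oldBalances.size(), mismatches);
        if (mismatches > 0 || oldBalances.size() != newBalances.size()) {
            System.exit(1); // fail the migration gate
        }
    }

    private static Map<String, Long> load(String jdbcUrl, String sql) throws Exception {
        Map<String, Long> balances = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                balances.put(rs.getString(1), rs.getLong(2));
            }
        }
        return balances;
    }
}
```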
Phase 2: Communication Consolidation (email + sms + notifications → notification-service)
What: Merge email, sms, and notifications into a single notification-service. The three services already share a database.
Data migration:
- Database already shared (peeq-notification-service-db) — no data migration needed
- Replace lutung Mandrill library with maintained HTTP client

API migration:
- Consolidated service absorbs all RabbitMQ handlers (15 inbound total)
- Email/SMS delivery becomes internal implementation detail
- Frontend has no direct dependency (all async via RabbitMQ)
Rollback: Re-deploy individual services; they already share the same database.
Success criteria:
- [ ] All notification channels (email, SMS, push) deliver correctly
- [ ] Mandrill library replacement sends identical emails
- [ ] RabbitMQ message processing latency unchanged
- [ ] Zero missed notifications during migration
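One option for the lutung replacement is a thin client over Mandrill's HTTP API built on the JDK's HttpClient. A sketch, assuming the standard messages/send endpoint; a real implementation would serialize with Jackson, escape content properly, and add retries and timeouts:

```java
// Sketch of a lutung replacement using the JDK HttpClient. Assumes Mandrill's
// messages/send endpoint; real code would build the JSON with Jackson (the naive
// formatting below does not escape values) and check the response body.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MandrillHttpSender {

    private static final String SEND_URL = "https://mandrillapp.com/api/1.0/messages/send.json";

    private final HttpClient http = HttpClient.newHttpClient();
    private final String apiKey;

    public MandrillHttpSender(String apiKey) {
        this.apiKey = apiKey;
    }

    public int send(String fromEmail, String toEmail, String subject, String html) throws Exception {
        String body = """
            {"key":"%s","message":{"from_email":"%s","subject":"%s","html":"%s",
             "to":[{"email":"%s","type":"to"}]}}
            """.formatted(apiKey, fromEmail, subject, html, toEmail);

        HttpRequest request = HttpRequest.newBuilder(URI.create(SEND_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode(); // caller decides whether to retry or dead-letter
    }
}
```

Validating the "sends identical emails" criterion can then be a staging-only diff of rendered payloads between the old and new senders.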
Wave 3: Medium-Risk Domains
Phase 3: Identity Consolidation (celebrity + fan + users → identity-service)
What: Merge three identity services into one, preserving all Keycloak integration.
Data migration:
- Merge celebrity-db, fan-db, users-db into single identity schema
- All tables already use Keycloak UUIDs as primary keys — no ID mapping needed
- Preserve celebrity profile images, banners, shortlinks
- Preserve fan follow relationships (MD5 composite PKs)

API migration:
- Consolidated GraphQL schema: ~63 operations (27 celebrity + 21 fan + 15 users)
- Remove deprecated queries (3 celebrity queries)
- Frontend switches from /api/celebrity, /api/fan, /api/users to /api/identity
Rollback: Re-deploy individual services; restore individual databases.
Success criteria:
- [ ] All profile operations (create, update, delete, query) pass tests
- [ ] Follow/unfollow system works correctly
- [ ] Magic link authentication unaffected
- [ ] Keycloak admin operations (role, group management) functional
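On the API side, consolidation mostly means co-locating the former celebrity, fan, and users resolvers behind one schema. A sketch assuming Spring for GraphQL; the record shapes, field names, and placeholder return values are illustrative only:

```java
// Sketch of a consolidated identity GraphQL controller (assumes Spring for GraphQL).
// Record shapes and the placeholder values stand in for the real celebrity/fan/users
// persistence, which keys everything by Keycloak UUID.
import java.util.List;

import org.springframework.graphql.data.method.annotation.Argument;
import org.springframework.graphql.data.method.annotation.QueryMapping;
import org.springframework.stereotype.Controller;

@Controller
public class IdentityController {

    public record CelebrityProfile(String id, String displayName, String shortlink) {}
    public record FanProfile(String id, String displayName, List<String> following) {}

    // Former celebrity-service query, now served from /api/identity.
    @QueryMapping
    public CelebrityProfile celebrityProfile(@Argument String id) {
        return new CelebrityProfile(id, "placeholder", "placeholder");
    }

    // Former fan-service query, same endpoint, same Keycloak UUID key.
    @QueryMapping
    public FanProfile fanProfile(@Argument String id) {
        return new FanProfile(id, "placeholder", List.of());
    }
}
```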
Phase 4: Content Consolidation (content + media → content-service)
What: Merge content and media services, unifying Mux integration.
Data migration:
- Merge content-db and media-db into single content schema
- Preserve all Mux asset IDs and playback IDs
- Migrate NFS file paths to GCS references (separate sub-project)
- Content service already runs Java 24 — downgrade to Java 21 LTS for consistency

API migration:
- Consolidated GraphQL schema: 40+ operations
- Unified Mux webhook handler (currently in both services)
- Frontend switches /api/media calls to /api/content (the existing /api/content path is unchanged)
Rollback: Re-deploy individual services; restore databases. NFS migration is separate and independently rollbackable.
Success criteria:
- [ ] All content CRUD operations pass tests
- [ ] Mux video upload, transcode, and playback work end-to-end
- [ ] File upload (multipart) works with new storage backend
- [ ] Search indices rebuilt and returning correct results
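The NFS→GCS sub-project reduces to copying files and rewriting stored references. A sketch using the google-cloud-storage client; the bucket name, path layout, and surrounding bookkeeping are assumptions:

```java
// Sketch of the NFS→GCS copy step. Bucket name and path layout are assumptions;
// the real job would also update the content-db rows with the GCS references and
// run idempotently so it can be resumed after a partial failure.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class NfsToGcsCopier {

    public static void main(String[] args) throws Exception {
        Path nfsRoot = Path.of(args[0]);   // mounted NFS export (assumption)
        String bucket = args[1];           // target GCS bucket (assumption)
        Storage storage = StorageOptions.getDefaultInstance().getService();

        try (Stream<Path> files = Files.walk(nfsRoot)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                String objectName = nfsRoot.relativize(file).toString();
                BlobInfo blob = BlobInfo.newBuilder(BlobId.of(bucket, objectName)).build();
                try {
                    storage.createFrom(blob, file); // streams the file to GCS
                    System.out.println("gs://" + bucket + "/" + objectName);
                } catch (Exception e) {
                    System.err.println("FAILED " + file + ": " + e.getMessage());
                }
            });
        }
    }
}
```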
Phase 5: Events Domain (shoutout consolidation + class-catalog upgrade)
What: Merge shoutout + shoutout-bpm (BPM already replaced in Phase 0.5). Upgrade class-catalog in place (remove Arlo dead code). Merge journey into class-catalog.
Data migration:
- Merge shoutout-db and shoutout-bpm-db into single shoutout schema
- Preserve Mux video asset IDs for shoutout fulfillment
- Class-catalog: remove Arlo LMS tables/columns
- Journey: merge into class-catalog schema

API migration:
- Shoutout: single GraphQL endpoint replaces separate service + BPM
- Class-catalog: unchanged GraphQL schema, cleaned internals
Rollback: Re-deploy individual services; restore databases.
Success criteria:
- [ ] Shoutout offer→purchase→record→deliver workflow completes end-to-end
- [ ] Class/course/session CRUD operations pass tests
- [ ] Learning credits (CEU/PDU) tracking functional
- [ ] No Arlo-related code paths remain
Wave 4: High-Risk & Infrastructure
Phase 6: Platform Services + Frontend + Keycloak
Platform services:
- Merge tags, tracking, group-profile, org-manager into platform-service
- Low data complexity, few dependencies

Frontend unification:
- Execute 4-phase plan from target-architecture.md
- Shared component library → CSS standardization → repo merge → dead code removal
- Can run in parallel with backend migrations

Inventory service:
- Upgrade in place (do NOT merge — too many dependents)
- Align core-lib version, add tests
- Last application service to touch

Keycloak migration (LAST):
- Export realm from existing Keycloak
- Import into target Keycloak instance (same or upgraded version)
- Reconfigure all services to new issuer URI
- Coordinate maintenance window: brief auth interruption acceptable
- Magic Link SPI + Session Restrictor SPI must be tested on target
Rollback: Keycloak rollback = revert DNS to old Keycloak instance. All services re-validate against old issuer.
Success criteria:
- [ ] Platform services consolidated and tested
- [ ] Unified frontend monorepo deploying all 5 apps
- [ ] Inventory service upgraded with tests
- [ ] Keycloak realm import/export validated in staging
- [ ] All 28+ services authenticating against migrated Keycloak
- [ ] Magic Link and Session Restrictor SPIs functional
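Reconfiguring services to the new issuer URI stays a pure configuration change as long as each service derives its JWT validation from a single issuer property. A Spring Security sketch of that pattern; the property name is Spring Boot's standard one, and the rest is illustrative:

```java
// Sketch: JWT validation keyed off one issuer-uri property, so the Keycloak
// cutover for each service is a config change plus a rolling restart.
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.oauth2.jwt.JwtDecoder;
import org.springframework.security.oauth2.jwt.JwtDecoders;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ResourceServerConfig {

    // Points at the Keycloak realm, e.g. https://<new-keycloak-host>/realms/<realm> (placeholder).
    @Value("${spring.security.oauth2.resourceserver.jwt.issuer-uri}")
    private String issuerUri;

    @Bean
    JwtDecoder jwtDecoder() {
        // Discovers JWKS from the issuer's OIDC metadata and validates the iss claim,
        // so switching issuers only requires changing the property above.
        return JwtDecoders.fromIssuerLocation(issuerUri);
    }

    @Bean
    SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
        http
            .authorizeHttpRequests(auth -> auth.anyRequest().authenticated())
            .oauth2ResourceServer(oauth2 -> oauth2.jwt(jwt -> jwt.decoder(jwtDecoder())));
        return http.build();
    }
}
```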
Shared Cluster Consolidation
This can happen in parallel with application migration:
| Step | Description | When |
|---|---|---|
| 1 | Provision regional shared GKE cluster | Phase 0 |
| 2 | Deploy first tenant namespace (dev/staging) | Phase 0 |
| 3 | Migrate dev tenant to shared cluster | After Phase 0 validated |
| 4 | Migrate staging tenant | After dev validated |
| 5 | Migrate first production tenant | After Wave 2 |
| 6 | Migrate remaining production tenants | After Wave 3 |
| 7 | Decommission old per-tenant clusters | After Wave 4 |
RabbitMQ Migration
Approach: Migrate queue-by-queue as services consolidate.
| When Service Consolidates | Queue Migration |
|---|---|
| Payment consolidation | Merge wallet/transaction/stripe queues into payment queues |
| Notification consolidation | Merge email/sms/notification queues into notification queues |
| Identity consolidation | Merge celebrity/fan/users queues into identity queues |
Message contract preservation: All ~75 message types preserved. Consolidated services handle the same message types as their predecessors. Queue names change but message schemas don’t.
Drain strategy per consolidation:
1. Deploy consolidated service alongside Gen 2 services
2. Consolidated service starts consuming from NEW queues
3. Route publishers to new queues (update MessageSender config)
4. Drain old queues (stop old consumers, let messages exhaust)
5. Remove old queue declarations
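Steps 2 and 3 above amount to declaring the new queue in the consolidated service and re-pointing the binding, while the message contract stays untouched. A Spring AMQP sketch; the queue, exchange, routing key, and payload fields are assumptions, and a Jackson JSON message converter is assumed for the record payload:

```java
// Sketch of steps 2/3: the consolidated notification-service declares and consumes
// the NEW queue while the message contract (payload shape) is unchanged.
// Queue, exchange, routing key, and payload fields are assumptions; a JSON message
// converter is assumed on the listener container factory.
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.TopicExchange;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Component;

@Configuration
class NotificationQueueConfig {

    @Bean
    Queue notificationEmailQueue() {
        return new Queue("notification.email.requested", true); // new durable queue
    }

    @Bean
    TopicExchange notificationExchange() {
        return new TopicExchange("notification.events");
    }

    @Bean
    Binding emailBinding(Queue notificationEmailQueue, TopicExchange notificationExchange) {
        // Publishers keep sending the same message type; only the destination changes.
        return BindingBuilder.bind(notificationEmailQueue).to(notificationExchange).with("email.requested");
    }
}

@Component
class EmailRequestedListener {

    // Same payload shape the Gen 2 email-service consumed, now handled here.
    record EmailRequested(String toEmail, String template, String locale) {}

    @RabbitListener(queues = "notification.email.requested")
    public void onEmailRequested(EmailRequested message) {
        // hand off to the internal email channel (Mandrill replacement, etc.)
        System.out.println("email requested for " + message.toEmail());
    }
}
```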
Elasticsearch & Search Migration
| Component | Strategy |
|---|---|
| Log indices | Replace peeq-logging pipeline with Cloud Logging. Rebuild log aggregation if needed. |
| Search indices | Rebuild from source databases after service consolidation. |
| Kibana dashboards | Export manually (not in code). Import into Elastic Cloud 8.x or Grafana. |
| Elasticsearch version | Upgrade from 7.x to 8.x or switch to managed Elastic Cloud. |
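Rebuilding search indices from source databases can be a one-off batch job rather than a reindex from the old cluster. A sketch using the Elasticsearch 8.x Java API client; the index name, SQL, and document shape are assumptions, and a real job would batch writes through the bulk API:

```java
// Sketch: rebuild the content search index directly from the consolidated
// content database. Index name, column names, and document shape are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.rest_client.RestClientTransport;

public class ContentIndexRebuild {

    record ContentDoc(String id, String title, String celebrityId) {}

    public static void main(String[] args) throws Exception {
        RestClient rest = RestClient.builder(HttpHost.create(args[0])).build(); // ES endpoint
        ElasticsearchClient es = new ElasticsearchClient(new RestClientTransport(rest, new JacksonJsonpMapper()));

        try (Connection conn = DriverManager.getConnection(args[1]); // content-db JDBC URL
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, title, celebrity_id FROM content")) {
            while (rs.next()) {
                ContentDoc doc = new ContentDoc(rs.getString("id"), rs.getString("title"), rs.getString("celebrity_id"));
                // One document per row; a production job would use the bulk API in batches.
                es.index(i -> i.index("content").id(doc.id()).document(doc));
            }
        }
        rest.close();
    }
}
```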
Data Migration Per Domain
| Domain | Tables | External IDs | Strategy | Risk |
|---|---|---|---|---|
| Payment | 15 | Stripe (customer, product, subscription) | Schema merge + validate balances | High |
| Notification | ~14 | Mandrill, Twilio SIDs (in logs only) | Already shared DB — no data migration | Low |
| Identity | ~12 | None (Keycloak UUIDs only) | Schema merge | Medium |
| Content | ~20 | Mux (asset, playback), NFS paths | Schema merge + NFS→GCS | High |
| Shoutout | ~6 | Mux (asset, playback) | Schema merge | Medium |
| Class-Catalog | ~15+ | None | In-place upgrade + remove Arlo | Medium |
| Platform | ~8 | None | Schema merge | Low |
| Inventory | ~12 | Stripe product IDs | In-place upgrade | Low |
Rollback Plans
Per-Phase Rollback
Every phase has an independent rollback:
| Phase | Rollback Mechanism | Data Recovery |
|---|---|---|
| 0 (Infra) | Revert Terraform | N/A |
| 0.5 (BPM) | Re-enable CIB Seven, disable state machine | CIB Seven DB untouched |
| 1 (Payment) | Re-deploy 4 services, restore 4 databases | Point-in-time recovery |
| 2 (Notification) | Re-deploy 3 services | Shared DB unchanged |
| 3 (Identity) | Re-deploy 3 services, restore 3 databases | Point-in-time recovery |
| 4 (Content) | Re-deploy 2 services, restore 2 databases | Point-in-time recovery + NFS |
| 5 (Events) | Re-deploy services, restore databases | Point-in-time recovery |
| 6 (Platform) | Re-deploy services, revert Keycloak DNS | Point-in-time recovery |
Rollback Triggers
Roll back automatically if:
- Error rate exceeds 1% for >5 minutes post-migration
- Payment failure rate exceeds 0.1% for >2 minutes
- P99 latency exceeds 2x baseline for >10 minutes
- Any financial data inconsistency is detected
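These triggers belong in PrometheusRules, but the same thresholds can also gate the deployment pipeline directly. A sketch that runs the error-rate check against Prometheus's standard instant-query API; the metric names in the PromQL expression and the Prometheus address are assumptions:

```java
// Sketch of an automated rollback gate: query Prometheus for the post-migration
// error rate of a consolidated service. Metric names in the PromQL are assumptions;
// a real gate would parse the JSON with Jackson and fail the rollout above 0.01.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RollbackGate {

    public static void main(String[] args) throws Exception {
        String prometheusBase = args[0]; // e.g. http://prometheus.monitoring:9090 (assumption)
        String promql = """
            sum(rate(http_server_requests_seconds_count{service="payment-service",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{service="payment-service"}[5m]))
            """;

        URI uri = URI.create(prometheusBase + "/api/v1/query?query="
            + URLEncoder.encode(promql, StandardCharsets.UTF_8));

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());

        // Prints the raw instant-query result; threshold comparison is left to the pipeline.
        System.out.println(response.body());
    }
}
```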
Success Criteria (Overall)
Technical
Business
Operational
Last updated: 2026-01-30 (Session 12)
Review by: 2026-04-30
Staleness risk: High — migration plan should be validated with engineering team before execution