Modernization
Engineering Kickoff Package
Key Takeaways
- Go recommendation: CONDITIONAL GO — Proceed with Wave 1 (Foundation) immediately. The evidence base is strong (4 L2, 5 L1 hypotheses), the target architecture is validated, and the migration strategy is phased with rollback at every step. Two blockers remain (H8 data volumes, H9 PCI scope) but neither prevents Wave 1 from starting.
- First migration domain: Notification pipeline (email + sms + notifications) — lowest risk (shared DB already, no external ID coupling, self-contained), highest immediate value (fixes deprecated Mandrill library, reduces 3 services to 1).
- Sprint 0 scope: 6 items — regional GKE upgrade, CI security enforcement, OpenTelemetry setup, integration test framework, BPM state machine POC, notification service consolidation POC.
- 15 prioritized engineering stories spanning all 4 migration waves — ready for backlog grooming.
- 2 blockers to resolve before Wave 2: obtain production database row counts (H8), confirm Stripe PCI scope (H9).
Migration Decision Question
Is the team ready to start Sprint 0 of the modernization?
Go/No-Go Recommendation
Verdict: CONDITIONAL GO
Go for Wave 1 (Foundation + BPM replacement). Proceed immediately.
Conditional on resolving before Wave 2:
1. H8: Obtain actual production database row counts and table sizes (needed for migration window planning)
2. H9: Confirm PCI scope via Stripe dashboard (likely SAQ-A but must verify)
Evidence Supporting Go
| Factor | Evidence | Confidence |
|---|---|---|
| Architecture is sound | H14 falsified — upgrade, not rewrite | L1 |
| Service boundaries clean | H6 — no shared DB backdoors | L1 |
| Multi-brand is config-only | H11 — verified across all domains | L2 |
| Contracts discoverable | H12 — 75+ message types mapped | L2 |
| Shared libraries stable | H13 — core-lib proven foundation | L1 |
| No feature parity gaps | All Gen 1 features deprecated or replaced | L1 |
| Dead code identified | ~110 repos archivable, 12+ dead services | L1 |
| Migration strategy phased | Every phase has rollback plan | L1 |
Risks Accepted
| Risk | Mitigation |
|---|---|
| Near-zero test coverage (H7 falsified) | Add tests as part of each migration phase |
| Data volumes unknown (H8 L0) | Start with small-data domains first (Wave 2) |
| PCI scope unconfirmed (H9 L0) | Payment domain is Wave 2, not Wave 1 |
Prioritized Migration Backlog
Sprint 0: Foundation (Wave 1)
| # | Story | Priority | Effort | Acceptance Criteria |
|---|---|---|---|---|
| 1 | Upgrade GKE to regional | P0 | L | Regional cluster operational, automatic zone failover tested, all tenants migrated |
| 2 | Enforce CI security scanning | P0 | S | Trivy blocks on high/critical, Qwiet fails on critical findings, all pipelines updated |
| 3 | Deploy OpenTelemetry auto-instrumentation | P0 | M | OTel agent in common Helm chart, traces visible in Grafana for all services |
| 4 | Set up integration test framework | P0 | M | Test runner in CI, first 5 service tests passing, test database provisioning automated |
| 5 | POC: Spring State Machine for purchase workflow | P1 | M | State machine replicates all 10 purchase states, unit tests pass, shadow validation against CIB Seven |
| 6 | POC: Notification service consolidation | P1 | M | Merged email+sms+notifications with maintained Mandrill client, all 15 RabbitMQ handlers, integration tests |
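
As a rough illustration of what story 5 has to capture, the sketch below models a purchase workflow as an explicit transition table in plain Java. It is a framework-free stand-in, not the Spring State Machine API, and the state and event names are hypothetical placeholders: the 10 real purchase states are not listed in this document.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical states and events; the real workflow has 10 states
// that would be taken from the existing CIB Seven process definition.
enum PurchaseState { CREATED, PAYMENT_PENDING, PAID, CANCELLED }

enum PurchaseEvent { SUBMIT, PAYMENT_CONFIRMED, CANCEL }

final class PurchaseTransitions {
    // Transition table: current state -> (event -> next state).
    private static final Map<PurchaseState, Map<PurchaseEvent, PurchaseState>> TABLE =
            new EnumMap<>(PurchaseState.class);

    static {
        add(PurchaseState.CREATED, PurchaseEvent.SUBMIT, PurchaseState.PAYMENT_PENDING);
        add(PurchaseState.CREATED, PurchaseEvent.CANCEL, PurchaseState.CANCELLED);
        add(PurchaseState.PAYMENT_PENDING, PurchaseEvent.PAYMENT_CONFIRMED, PurchaseState.PAID);
        add(PurchaseState.PAYMENT_PENDING, PurchaseEvent.CANCEL, PurchaseState.CANCELLED);
    }

    private PurchaseTransitions() {}

    private static void add(PurchaseState from, PurchaseEvent on, PurchaseState to) {
        TABLE.computeIfAbsent(from, k -> new EnumMap<>(PurchaseEvent.class)).put(on, to);
    }

    // Returns the next state, or empty when the event is not allowed in the current state.
    static Optional<PurchaseState> next(PurchaseState current, PurchaseEvent event) {
        Map<PurchaseEvent, PurchaseState> row = TABLE.get(current);
        return row == null ? Optional.empty() : Optional.ofNullable(row.get(event));
    }
}
```

Calling `PurchaseTransitions.next(PurchaseState.CREATED, PurchaseEvent.SUBMIT)` yields `PAYMENT_PENDING`, while any event on a terminal state such as `PAID` yields an empty result. Keeping the table explicit makes it straightforward to unit test every state, which is what the acceptance criteria for the POC ask for.
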
Wave 2: Low-Risk Migrations
| # | Story | Priority | Effort | Acceptance Criteria |
|---|---|---|---|---|
| 7 | Consolidate notification-service | P0 | M | email+sms+notifications merged, lutung replaced, all notification channels tested, deployed to staging |
| 8 | Consolidate payment-service | P1 | L | stripe+subscriptions+wallet+transaction merged, financial reconciliation validated, Stripe webhooks tested |
| 9 | Replace purchase-request BPM | P1 | M | Spring State Machine in production, CIB Seven drained, Keycloak plugin removed |
Wave 3: Medium-Risk Migrations
| # | Story | Priority | Effort | Acceptance Criteria |
|---|---|---|---|---|
| 10 | Consolidate identity-service | P1 | M | celebrity+fan+users merged, all profile operations tested, Keycloak integration preserved |
| 11 | Consolidate content-service | P1 | L | content+media merged, Mux integration unified, NFS→GCS migration started |
| 12 | Consolidate shoutout-service | P1 | M | shoutout+shoutout-bpm merged (BPM state machine from story 9 pattern), Mux video workflow tested |
| 13 | Upgrade class-catalog | P2 | M | Arlo dead code removed, journey merged, learning credits tested |
Wave 4: High-Risk & Frontend
| # | Story | Priority | Effort | Acceptance Criteria |
|---|---|---|---|---|
| 14 | Unify frontend monorepo | P1 | XL | Shared component library, CSS standardization, repo merged, dead code removed |
| 15 | Shared cluster consolidation | P1 | XL | All tenants in shared regional cluster, NetworkPolicies enforced, cost reduction measured |
First Migration Domain: Notification Pipeline
Why Notification First
| Criterion | Notification Score | Alternative (Payment) Score |
|---|---|---|
| Database migration needed | None (shared DB already) | Yes (4 DBs → 1) |
| External ID coupling | None | Stripe IDs (high risk) |
| Dependencies on other domains | Low (inbound only via RabbitMQ) | Medium (inventory, BPM) |
| Immediate value | High (fixes deprecated Mandrill lib) | High (reduces 4 services) |
| Rollback complexity | Low (re-deploy 3 services) | Medium (restore 4 databases) |
| Test coverage needed | Low (async message processing) | High (financial transactions) |
Notification Consolidation Plan
- Create notification-service with all 3 source services’ code as Maven modules
- Replace lutung 0.0.8 with direct Mandrill HTTP API client
- Merge RabbitMQ handlers (15 inbound total)
- Add integration tests for all notification channels
- Deploy to staging, validate with synthetic traffic
- Deploy to production with Istio traffic shifting
- Drain old services, remove deployments
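
To make the "direct Mandrill HTTP API client" step more concrete, here is a minimal sketch that builds a send request with the JDK's built-in `java.net.http` types. It assumes the Mandrill `messages/send` endpoint and its `key` / `message` JSON payload, both of which should be verified against the current Mandrill API reference. The class and method names are hypothetical, and a real service would likely serialize the body with its existing JSON library rather than by hand.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class MandrillRequests {
    // Assumed endpoint for sending a transactional message; verify before use.
    static final URI SEND_URI = URI.create("https://mandrillapp.com/api/1.0/messages/send");

    private MandrillRequests() {}

    // Escapes characters that must not appear raw inside a JSON string.
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                case '\r' -> out.append("\\r");
                case '\t' -> out.append("\\t");
                default -> {
                    if (c < 0x20) {
                        out.append('\\').append(String.format("u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
                }
            }
        }
        return out.toString();
    }

    // Builds the JSON body for a single-recipient plain-text message.
    static String sendBody(String apiKey, String from, String to, String subject, String text) {
        return "{\"key\":\"" + jsonEscape(apiKey) + "\",\"message\":{"
                + "\"from_email\":\"" + jsonEscape(from) + "\","
                + "\"to\":[{\"email\":\"" + jsonEscape(to) + "\",\"type\":\"to\"}],"
                + "\"subject\":\"" + jsonEscape(subject) + "\","
                + "\"text\":\"" + jsonEscape(text) + "\"}}";
    }

    // Builds (but does not send) the HTTP request, so it can be inspected in tests.
    static HttpRequest sendRequest(String apiKey, String from, String to, String subject, String text) {
        return HttpRequest.newBuilder(SEND_URI)
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(sendBody(apiKey, from, to, subject, text)))
                .build();
    }
}
```

Separating request construction from sending keeps the client testable without network access, which fits the integration-test step in the plan: the built request can be asserted on directly, and the actual send can go through a single `HttpClient` instance shared by the merged service.
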
Sprint 0 Scope
Infrastructure Prerequisites
| Item | Owner | Definition of Done |
|---|---|---|
| Regional GKE cluster | Platform team | All tenants on regional clusters, zone failover tested |
| CI security enforcement | DevOps | All pipelines fail on high/critical vulnerabilities |
| OpenTelemetry | Platform team | Traces in Grafana for all services, 10% sampling rate |
| Integration test framework | Engineering | Test runner in CI, database provisioning, first 5 services covered |
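
For readers unfamiliar with what a 10% sampling rate means in practice, the following sketch shows the general idea behind ratio-based head sampling: the decision is derived deterministically from the trace ID, so every service that sees the same trace makes the same choice. This is a simplified illustration with a hypothetical class name, not the OpenTelemetry SDK's own sampler. With the OpenTelemetry Java agent, the rate would normally be set through configuration (for example `OTEL_TRACES_SAMPLER=parentbased_traceidratio` with `OTEL_TRACES_SAMPLER_ARG=0.1`) rather than custom code.

```java
final class RatioSampler {
    private final double ratio;
    private final long upperBound;

    RatioSampler(double ratio) {
        if (Double.isNaN(ratio) || ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        this.ratio = ratio;
        // Threshold in the non-negative long range; IDs below it are sampled.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    // traceIdLowBits: the low 64 bits of the 128-bit trace ID, assumed to be random.
    boolean shouldSample(long traceIdLowBits) {
        if (ratio >= 1.0) {
            return true;
        }
        // Clear the sign bit so the value is uniform over [0, Long.MAX_VALUE].
        long value = traceIdLowBits & Long.MAX_VALUE;
        return value < upperBound;
    }
}
```

With `new RatioSampler(0.1)`, roughly one in ten randomly generated trace IDs falls under the threshold. Because the decision depends only on the trace ID, a trace is either kept end to end or dropped end to end, which keeps the traces that do reach Grafana complete.
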
Application POCs
| Item | Owner | Definition of Done |
|---|---|---|
| Purchase state machine | Backend team | Replicates all CIB Seven states, unit + integration tests, shadow validation |
| Notification consolidation | Backend team | Merged service with all handlers, replaced Mandrill library, staging deployment |
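
Shadow validation for the purchase state machine can be thought of as replaying the same events through both engines and recording where they disagree. The sketch below is one possible shape for that comparison, under the assumption that each engine can be wrapped as a function from (current state, event) to next state. The names are hypothetical, and in practice the legacy side would more likely be read from CIB Seven process history than invoked directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class ShadowValidator {
    // One disagreement between the legacy engine and the candidate.
    record Mismatch(int step, String event, String legacyState, String candidateState) {}

    private ShadowValidator() {}

    // Replays events, treating the legacy engine as the source of truth.
    // Each step starts both engines from the legacy state, so one divergence
    // does not cascade into mismatches on every later step.
    static List<Mismatch> compare(String start, List<String> events,
                                  BiFunction<String, String, String> legacy,
                                  BiFunction<String, String, String> candidate) {
        List<Mismatch> mismatches = new ArrayList<>();
        String current = start;
        for (int i = 0; i < events.size(); i++) {
            String event = events.get(i);
            String legacyNext = legacy.apply(current, event);
            String candidateNext = candidate.apply(current, event);
            if (!legacyNext.equals(candidateNext)) {
                mismatches.add(new Mismatch(i, event, legacyNext, candidateNext));
            }
            current = legacyNext;
        }
        return mismatches;
    }
}
```

An empty result over a representative set of recorded event sequences is a reasonable signal that the new state machine replicates the legacy behavior; any `Mismatch` entry points at the exact step and event to investigate.
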
Resolve Before Wave 2
| Item | Owner | Definition of Done |
|---|---|---|
| H8: Production data volumes | Platform/DBA | Row counts and table sizes for all 35 databases (at least 1 tenant) |
| H9: PCI scope | Engineering lead | Stripe dashboard accessed, SAQ level confirmed, documented |
Team Skills Assessment
Skills Needed for Migration
| Skill | Current State | Gap |
|---|---|---|
| Java 21 / Spring Boot 3.5 | All Gen 2 services use this | No gap |
| Angular 18 / Nx | Both frontend repos use this | No gap |
| Spring GraphQL | All services use this | No gap |
| RabbitMQ / core-lib messaging | All services use this | No gap |
| Keycloak administration | In use across 4 tenants | No gap |
| Terraform / Atlantis | Mature IaC workflow | No gap |
| Helm / ArgoCD | Common chart v0.0.179, GitOps mature | No gap |
| Istio service mesh | Path routing, mTLS | No gap |
| Spring State Machine | Not currently used | Training needed |
| OpenTelemetry | Not currently adopted | Training needed |
| NetworkPolicies | Not currently deployed | Training needed |
| Tailwind CSS (for frontends team) | peeq-mono uses it; frontends team uses Bootstrap | Training needed |
| Integration testing | Very low coverage (H7) | Practice needed |
Training Recommendations
| Topic | Audience | Format |
|---|---|---|
| Spring State Machine | Backend developers | Workshop + POC (Sprint 0 story #5) |
| OpenTelemetry for Java | All backend developers | Documentation review + enablement |
| Kubernetes NetworkPolicies | Platform/DevOps | Hands-on lab + staging deployment |
| Tailwind CSS | Frontend (frontends repo team) | Pair programming during CSS migration |
| Integration testing patterns | All developers | Test framework setup (Sprint 0 story #4) + code reviews |
Final Hypotheses Scorecard
| # | Hypothesis | Final Assurance | Verdict |
|---|---|---|---|
| H1 | Broadcast not in production | L2 (Verified) | Confirmed — archive related repos |
| H2 | Dwolla inactive | L2 (Verified) | Confirmed — archive repos |
| H3 | Gen 1 fully replaced by Gen 2 | L1 (Validated) | Only infra Gen 1 remains (retire) |
| H4 | Frontend unification feasible | L1 (Validated) | CSS restyling, not logic rewrite |
| H5 | >50% repos archivable | L1 (Validated) | ~110 of 191 (58%) |
| H6 | No shared DB backdoors | L1 (Validated) | Clean boundaries confirmed |
| H7 | >60% test coverage | L0 (Falsified) | 2-3 test files per service — major gap |
| H8 | Data volumes manageable | L0 (Partial) | DB tier known, row counts needed |
| H9 | No compliance blockers | L0 (Partial) | Likely SAQ-A, need Stripe confirmation |
| H10 | APIs backward-compatible | L0 (Partial) | GraphQL additive pattern supports it |
| H11 | Multi-brand is config-only | L2 (Verified) | All domains + infrastructure confirmed |
| H12 | RabbitMQ contracts discoverable | L2 (Verified) | ~75 message types fully mapped |
| H13 | core-lib stable foundation | L1 (Validated) | Consistent across all services |
| H14 | Gen 3 rewrite justified | L1 (Falsified) | Incremental upgrade recommended |
Scorecard Summary
- L2 (Verified): 4 hypotheses (H1, H2, H11, H12)
- L1 (Validated): 5 hypotheses (H3, H4, H5, H6, H13)
- Falsified: 2 hypotheses (H7 at L0, H14 at L1); both inform the approach
- L0 (Partial/Pending): 3 hypotheses (H8, H9, H10) — none block Wave 1
Knowledge Base Deliverables Summary
Phase 2/3 Documents (Sessions 0-9)
| Document | Session | Lines | Purpose |
|---|---|---|---|
| domain-model/glossary.md | 0 | ~100 | Canonical domain terms |
| architecture/current-state.md | 0 | ~220 | Platform overview + Mermaid |
| architecture/service-catalog.md | 0 | ~360 | All 191 repos cataloged |
| frontend/frontend-architecture.md | 1 | ~400 | Frontend analysis + API inventory |
| architecture/user-identity.md | 2 | ~350 | Keycloak + identity domain |
| architecture/content-streaming.md | 3 | ~400 | Content + media + webinar |
| architecture/payment-processing.md | 4-5 | ~500 | Billing + wallet + transactions |
| architecture/events-business-logic.md | 6 | ~400 | Shoutout + inventory + classes |
| architecture/communication-infrastructure.md | 7 | ~400 | Email + SMS + chat + SSE |
| architecture/infrastructure-devops.md | 8 | ~770 | GKE + Terraform + CI/CD + Helm |
| architecture/integration-patterns.md | 9 | ~490 | Cross-domain synthesis + RabbitMQ map |
| architecture/data-models.md | 9 | ~305 | Database inventory + data lineage |
Phase 4 Documents (Sessions 10-13)
| Document | Session | Lines | Purpose |
|---|---|---|---|
| modernization/migration-decisions.md | 0-12 | ~80 | Progressive migration register (27 entries) |
| modernization/gap-analysis.md | 10 | ~260 | Constraints + gaps + buy-vs-build |
| modernization/tech-debt-inventory.md | 10 | ~470 | 32 prioritized debt items |
| modernization/target-architecture.md | 11 | ~560 | H14 evaluation + consolidation map |
| modernization/migration-strategy.md | 12 | ~370 | Strangler fig + 4 waves + rollback |
| modernization/engineering-kickoff.md | 13 | This doc | Backlog + Sprint 0 + Go/No-Go |
| decisions/ADR-001-service-consolidation.md | 12 | ~80 | ~35 → ~18 services |
| decisions/ADR-002-frontend-unification.md | 12 | ~80 | Single Angular monorepo |
| decisions/ADR-003-java-standardization.md | 12 | ~75 | Java 21 LTS alignment |
| decisions/ADR-004-multi-brand-architecture.md | 12 | ~85 | Shared cluster + namespaces |
Total: 20 documents + 4 ADRs across 14 sessions
Recommended Next Actions
- Immediately: Archive ~110 repos (script in service-catalog.md archive candidate list)
- This week: Remove frontend dead code (5 API gateways — low effort, high clarity)
- Sprint 0: Execute 6 foundation stories (regional GKE, CI security, OTel, test framework, 2 POCs)
- Before Wave 2: Resolve H8 (data volumes) and H9 (PCI scope)
- Engineering review: Present 4 ADRs to team for approval/feedback
Last updated: 2026-01-30 (Session 13)
Review by: 2026-04-30
Staleness risk: High; kickoff package should be actioned promptly