Modernization
Tech Debt Inventory
Key Takeaways
- P0 debt (must fix for migration): CIB Seven EOL (2 BPM services + Keycloak plugin), deprecated Mandrill library (lutung 0.0.8, unmaintained), zonal GKE/Cloud SQL (no HA), and near-zero test coverage across all services.
- P1 debt (should fix during migration): Cluster-per-tenant cost scaling, security scanning non-enforcement, frontend CSS framework split (Tailwind vs Bootstrap), and missing observability (alerting, tracing, SLOs).
- Over-decomposed services identified: 6 services with roughly 3-6 endpoints each — wallet (3 tables), transaction (1 table), onsite-event (2 tables), chat (thin Stream wrapper), and message-board (5 tables) are consolidation candidates; SSE (2 tables) is equally small but should stay separate as platform infrastructure.
- Gen 1/Gen 2 overlap: ~35 Gen 1 repos still exist alongside Gen 2 replacements. 12+ confirmed dead services still have repos. Archive effort needed before migration to reduce confusion.
- Total tech debt items: 32 items across 7 categories, with 8 P0 (blocking), 14 P1 (important), and 10 P2 (improve when convenient).
Migration Decision Question
What technical debt blocks or complicates the modernization effort?
Priority Definitions
| Priority | Meaning | Action |
|---|---|---|
| P0 | Blocks migration — must resolve before or during migration | Fix immediately or as prerequisite |
| P1 | Complicates migration — should resolve during migration | Fix as part of migration work |
| P2 | Quality improvement — fix when touching related code | Fix opportunistically |
Effort Definitions
| Size | Scope |
|---|---|
| S | 1-2 services, <1 week equivalent effort |
| M | 2-4 services, 1-3 weeks equivalent effort |
| L | 4-6 services, 3-6 weeks equivalent effort |
| XL | 6+ services, 6+ weeks equivalent effort |
1. BPM Engine (CIB Seven / Camunda 7) — P0
| Item | Detail |
|---|---|
| Debt | CIB Seven 2.0 (fork of Camunda 7 CE) — community support ended October 2025 |
| Affected Services | purchase-request-bpm, shoutout-bpm, cibseven-keycloak plugin |
| Risk | No security patches, no bug fixes, JDK compatibility concerns |
| Effort | M |
| Remediation | Replace with lightweight state machine (workflows are ~10 states each, not complex enough to justify a full BPM engine). Alternatively, migrate to Temporal or Conductor. |
| Dependencies | Keycloak identity sync plugin must also be replaced |
| Session | 4, 6 |
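The remediation argues these workflows are simple enough for a plain state machine. A minimal sketch of that alternative in Java, assuming an enum-based model; the state and transition names are illustrative, not taken from the existing BPMN definitions:

```java
// Illustrative only: an enum-based state machine for a purchase-request
// workflow, sketching the "lightweight state machine" remediation.
import java.util.Map;
import java.util.Set;

public enum PurchaseRequestState {
    DRAFT, SUBMITTED, MANAGER_REVIEW, APPROVED, REJECTED, FULFILLED;

    // Allowed transitions; anything absent from the map is rejected.
    private static final Map<PurchaseRequestState, Set<PurchaseRequestState>> TRANSITIONS = Map.of(
        DRAFT, Set.of(SUBMITTED),
        SUBMITTED, Set.of(MANAGER_REVIEW, REJECTED),
        MANAGER_REVIEW, Set.of(APPROVED, REJECTED),
        APPROVED, Set.of(FULFILLED),
        REJECTED, Set.of(DRAFT),
        FULFILLED, Set.of()
    );

    public PurchaseRequestState transitionTo(PurchaseRequestState next) {
        if (!TRANSITIONS.get(this).contains(next)) {
            throw new IllegalStateException("Illegal transition: " + this + " -> " + next);
        }
        return next;
    }
}
```

Persisting the current state as a column on the request row replaces the BPM engine's process-instance tables for workflows of this size.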
2. Test Coverage — P0
| Item | Detail |
|---|---|
| Debt | Near-zero automated test coverage across all Gen 2 services (2-3 test files per service) |
| Affected Services | All 28+ Gen 2 services |
| Risk | Regression during migration — no safety net for refactoring or upgrading |
| Effort | XL (foundational — ongoing effort) |
| Remediation | Phase 1: Add integration tests for critical paths (payment, auth, inventory). Phase 2: Add unit tests as services are touched during migration. Target 60% coverage for migrated services. |
| Dependencies | CI pipeline needs test enforcement gates |
| Session | 2-7 (H7 falsified) |
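A sketch of the Phase 1 critical-path test shape, assuming Spring Boot with MockMvc; the query and field names are hypothetical, and the real test would post each service's own GraphQL operation:

```java
// Hypothetical integration test exercising a critical path through the HTTP
// layer. The walletBalance query and coins field are illustrative names.
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.post;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

@SpringBootTest
@AutoConfigureMockMvc
class WalletBalanceIntegrationTest {

    @Autowired
    private MockMvc mockMvc;

    @Test
    void balanceQueryReturnsCurrentBalance() throws Exception {
        mockMvc.perform(post("/graphql")
                .contentType("application/json")
                .content("{\"query\":\"{ walletBalance(userId: \\\"u-1\\\") { coins } }\"}"))
            .andExpect(status().isOk())
            .andExpect(jsonPath("$.data.walletBalance.coins").isNumber());
    }
}
```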
3. Zonal Infrastructure — P0
| Item | Detail |
|---|---|
| Debt | GKE clusters and Cloud SQL instances are zonal (us-central1-a), not regional |
| Affected Services | All services + databases (entire platform) |
| Risk | Zone failure = complete tenant outage, no automatic failover |
| Effort | L |
| Remediation | Upgrade GKE to regional clusters, Cloud SQL to regional HA. Requires Terraform module updates + maintenance window per tenant. |
| Dependencies | Terraform modules, CastAI node policies |
| Session | 8 |
4. Mandrill Library (lutung) — P0
| Item | Detail |
|---|---|
| Debt | Email service uses lutung 0.0.8, an unmaintained Java Mandrill library |
| Affected Services | email service |
| Risk | No security updates, potential incompatibility with newer JDKs |
| Effort | S |
| Remediation | Replace with official Mandrill REST API calls (HTTP client) or switch to SendGrid/Postmark with maintained SDK |
| Dependencies | None — isolated to email service |
| Session | 7 |
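A sketch of the first remediation option: calling Mandrill's REST API directly with the JDK's built-in HttpClient, dropping the lutung dependency. The endpoint and payload shape follow Mandrill's documented messages/send call, but verify against the current API docs before adopting:

```java
// Minimal direct-REST replacement for lutung. Real code should build the
// payload with a JSON library and escape inputs rather than format strings.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MandrillClient {

    private static final String SEND_URL = "https://mandrillapp.com/api/1.0/messages/send.json";

    private final HttpClient http = HttpClient.newHttpClient();
    private final String apiKey;

    public MandrillClient(String apiKey) {
        this.apiKey = apiKey;
    }

    public String send(String to, String subject, String html) throws Exception {
        String body = """
            {"key":"%s","message":{"to":[{"email":"%s"}],"subject":"%s","html":"%s"}}
            """.formatted(apiKey, to, subject, html);
        HttpRequest request = HttpRequest.newBuilder(URI.create(SEND_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```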
5. Deprecated Keycloak Email Verification — P0
| Item | Detail |
|---|---|
| Debt | Email service has GraphQL API for email verification (checkStatus, confirmCode, sendCode) marked as deprecated — migrating to Keycloak native |
| Affected Services | email, users, Keycloak SPIs |
| Risk | Dual verification paths create confusion; deprecated code still callable |
| Effort | S |
| Remediation | Complete migration to Keycloak native verification, remove deprecated GraphQL operations |
| Dependencies | Frontend must stop calling deprecated email verification API |
| Session | 7 |
6. Security Scanning Non-Enforcement — P0
| Item | Detail |
|---|---|
| Debt | Trivy and Qwiet (ShiftLeft) scan containers and code but don’t fail CI builds |
| Affected Services | All services (CI pipeline) |
| Risk | Known vulnerabilities ship to production without blocking |
| Effort | S |
| Remediation | Add --exit-code 1 to the Trivy scan and configure Qwiet to fail on high/critical findings (see the CI step sketch below). Add Binary Authorization for container signing. |
| Dependencies | GitHub Actions reusable workflows |
| Session | 8 |
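A sketch of the enforcement gate as a GitHub Actions step, assuming the aquasecurity/trivy-action wrapper; the image reference and version pin would come from the existing reusable workflow:

```yaml
# Fail the build on high/critical findings instead of report-only scanning.
- name: Scan image
  uses: aquasecurity/trivy-action@master   # pin to a released version in CI
  with:
    image-ref: ${{ env.IMAGE }}
    exit-code: '1'             # non-zero exit fails the job
    severity: 'HIGH,CRITICAL'  # gate only on high/critical findings
```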
7. Frontend Dead Code — P0
| Item | Detail |
|---|---|
| Debt | ~17% of frontend API gateway code calls non-existent production services (broadcast, conference, stream, dwolla, logging) |
| Affected Services | peeq-mono, frontends |
| Risk | Confuses developers, generates runtime errors, complicates migration analysis |
| Effort | S |
| Remediation | Remove dead gateway services and related components from both frontend repos |
| Dependencies | None — services don’t exist |
| Session | 1 |
8. No Double-Entry Bookkeeping — P0
| Item | Detail |
|---|---|
| Debt | Transaction service uses single-table JSON payment log — no debit/credit ledger |
| Affected Services | transaction, wallet |
| Risk | Financial reporting limited, refund/chargeback audit difficult, coin balance discrepancies possible |
| Effort | M |
| Remediation | If Gen 3 needs financial reporting, redesign transaction model with proper double-entry bookkeeping. If current simple log is sufficient, document the limitation and add reconciliation checks. |
| Dependencies | wallet (coin balance), stripe (payment records) |
| Session | 5, 9 |
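If the redesign option is chosen, the core shape is two balanced legs per payment instead of one JSON log row. A minimal sketch with hypothetical account names, not the current transaction-service schema:

```java
// Illustrative double-entry shape: every payment writes two balanced legs,
// making reconciliation a check that debits equal credits per posting.
import java.math.BigDecimal;
import java.time.Instant;
import java.util.List;

public record LedgerEntry(String account, BigDecimal debit, BigDecimal credit, Instant at) {

    // A coin purchase debits cash and credits the user's coin-liability
    // account; the two legs must always sum to the same amount.
    public static List<LedgerEntry> coinPurchase(String userId, BigDecimal amount) {
        Instant now = Instant.now();
        return List.of(
            new LedgerEntry("cash", amount, BigDecimal.ZERO, now),
            new LedgerEntry("coin-liability:" + userId, BigDecimal.ZERO, amount, now)
        );
    }
}
```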
9. Cluster-Per-Tenant Cost — P1
| Item | Detail |
|---|---|
| Debt | Each of 4 production brands gets a dedicated GKE cluster + Cloud SQL + RabbitMQ + Redis + Keycloak |
| Affected Services | All (infrastructure) |
| Risk | Infrastructure cost scales linearly with tenant count; blocks affordable scaling to more brands |
| Effort | XL |
| Remediation | Consolidate to shared cluster with namespace-per-tenant isolation (NetworkPolicies, ResourceQuotas, Istio AuthorizationPolicy). H11 L2 confirms no code-level tenant branching. |
| Dependencies | NetworkPolicies, Istio RBAC, Terraform refactor |
| Session | 8 |
10. Frontend CSS Framework Split — P1
| Item | Detail |
|---|---|
| Debt | peeq-mono uses Tailwind/DaisyUI; frontends uses Bootstrap/Angular Material |
| Affected Services | peeq-mono, frontends |
| Risk | Primary blocker for frontend unification — cannot merge repos without component restyling |
| Effort | XL |
| Remediation | Pick one framework (Tailwind recommended — modern, more flexible), migrate component-by-component. Not a big-bang rewrite. |
| Dependencies | Design system decision must come first |
| Session | 1 |
11. Missing Alerting Configuration — P1
| Item | Detail |
|---|---|
| Debt | Prometheus deployed with kube-prometheus-stack but no custom alert rules configured |
| Affected Services | All (monitoring) |
| Risk | Zero automated incident detection — outages discovered by users, not monitoring |
| Effort | M |
| Remediation | Define SLIs/SLOs per service, create PrometheusRules for key indicators (error rate, latency, availability), configure PagerDuty/Slack integration |
| Dependencies | SLO definitions (business input needed) |
| Session | 8 |
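A starter rule of the kind the remediation describes, assuming Istio's standard request metrics are already scraped by kube-prometheus-stack; the threshold and severity label are placeholders pending SLO definitions:

```yaml
# Hypothetical PrometheusRule: page when any service's 5xx rate exceeds 5%.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-slo-alerts
spec:
  groups:
    - name: availability
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
              / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "{{ $labels.destination_service_name }} 5xx rate above 5%"
```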
12. No Distributed Tracing — P1
| Item | Detail |
|---|---|
| Debt | Istio Stackdriver tracing configured but not adopted by services; no OpenTelemetry SDK integration |
| Affected Services | All services |
| Risk | Cannot debug cross-service request flows — critical for 28+ microservice architecture |
| Effort | M |
| Remediation | Add OpenTelemetry Java agent to common Helm chart (auto-instrumentation), configure sampling, enable trace visualization in Grafana or Cloud Trace |
| Dependencies | Common Helm chart update, trace backend selection |
| Session | 8 |
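A sketch of the common Helm chart change: attach the OpenTelemetry Java agent (assumed baked into the image or mounted by an init container) and set its standard OTEL_* variables. The collector endpoint and sampling ratio are assumptions to settle with the backend choice:

```yaml
# Container env additions for auto-instrumentation via the OTel Java agent.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-javaagent:/otel/opentelemetry-javaagent.jar"
  - name: OTEL_SERVICE_NAME
    value: "{{ .Chart.Name }}"           # one trace identity per service
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"                         # sample 10% of root traces
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"  # OTLP gRPC; backend still to be chosen
```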
13. No NetworkPolicies — P1
| Item | Detail |
|---|---|
| Debt | All pods in GKE can communicate freely — no namespace-level network isolation |
| Affected Services | All (infrastructure) |
| Risk | Lateral movement possible if any pod is compromised; blocks multi-tenant consolidation |
| Effort | M |
| Remediation | Define NetworkPolicies per namespace: default-deny ingress, explicit allow for known service-to-service paths |
| Dependencies | Integration patterns doc provides the allow-list of communication paths |
| Session | 8 |
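A minimal sketch of the default-deny-plus-allow pattern; the pod labels and the gateway-to-wallet pairing are illustrative, with the real allow-list coming from the integration patterns doc:

```yaml
# Default-deny ingress for a tenant namespace...
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes: ["Ingress"]   # no ingress unless another policy allows it
---
# ...plus one explicit allow per known communication path.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-wallet
spec:
  podSelector:
    matchLabels:
      app: wallet
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: istio-ingressgateway
```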
14. CORS Allows All Origins — P1
| Item | Detail |
|---|---|
| Debt | Backend services observed with permissive CORS configuration |
| Affected Services | All GraphQL services |
| Risk | Cross-origin attacks possible; violates defense-in-depth |
| Effort | S |
| Remediation | Restrict CORS origins to known tenant domains (4 production + dev/preview) via common configuration |
| Dependencies | Tenant domain registry in values-globals.yaml |
| Session | 8 |
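A sketch of the shared restriction for the Spring-based GraphQL services, assuming the tenant origin list is injected from common configuration (e.g. a property fed from values-globals.yaml); the property name and mapping are hypothetical:

```java
// Restrict CORS to the known tenant domains instead of allowing all origins.
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.CorsRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class CorsConfig {

    @Bean
    public WebMvcConfigurer corsConfigurer(@Value("${tenant.allowed-origins}") String[] allowedOrigins) {
        return new WebMvcConfigurer() {
            @Override
            public void addCorsMappings(CorsRegistry registry) {
                registry.addMapping("/graphql")
                        .allowedOrigins(allowedOrigins)   // known tenant domains, never "*"
                        .allowedMethods("POST", "OPTIONS");
            }
        };
    }
}
```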
15. Deprecated Arlo LMS Integration — P1
| Item | Detail |
|---|---|
| Debt | class-catalog contains deprecated Arlo LMS integration code |
| Affected Services | class-catalog |
| Risk | Dead code confusion during migration |
| Effort | S |
| Remediation | Remove Arlo-related code paths, DB columns (if safe), and configuration |
| Dependencies | None — Arlo confirmed deprecated |
| Session | 6 |
16. Deprecated Celebrity GraphQL Queries — P1
| Item | Detail |
|---|---|
| Debt | 3 GraphQL queries in celebrity service marked deprecated (celebrity, celebrities, celebritiesPaged) |
| Affected Services | celebrity, frontends |
| Risk | Frontend may still call deprecated queries; dual API surface complicates migration |
| Effort | S |
| Remediation | Verify frontend usage, migrate to non-deprecated equivalents, remove deprecated queries |
| Dependencies | Frontend API call audit |
| Session | 2 |
17. Inconsistent core-lib Versions — P1
| Item | Detail |
|---|---|
| Debt | core-lib versions range from 0.0.67 to 0.0.69, and messages lib versions from 0.0.48 to 0.0.73, across services |
| Affected Services | All Gen 2 services |
| Risk | Version drift could cause subtle behavior differences; complicates shared library upgrades |
| Effort | S |
| Remediation | Align all services to latest core-lib and messages versions as part of Spring Boot upgrade |
| Dependencies | Must test each service after version bump |
| Session | 2-7 (H13) |
18. APM Disabled — P1
| Item | Detail |
|---|---|
| Debt | Elastic APM agent is available but disabled by default across all services |
| Affected Services | All services |
| Risk | No application performance visibility; cannot identify slow queries, memory leaks, or performance regressions |
| Effort | S |
| Remediation | Enable APM via Helm chart toggle (already configurable), or switch to OpenTelemetry-based APM |
| Dependencies | Elasticsearch/APM Server capacity planning |
| Session | 8 |
19. PgBouncer Routing for 41 Databases — P1
| Item | Detail |
|---|---|
| Debt | PgBouncer routes 41 databases across 3 replicas — some databases may be unused |
| Affected Services | All database-connected services |
| Risk | Connection overhead for unused databases; 750 max connection limit shared across 35 active DBs |
| Effort | S |
| Remediation | Audit PgBouncer routing table against actual database usage; remove unused database entries |
| Dependencies | Production database access needed |
| Session | 8 |
20. CastAI On-Demand Overrides — P1
| Item | Detail |
|---|---|
| Debt | 25+ services forced to on-demand instances despite CastAI spot-first policy |
| Affected Services | Infrastructure (GKE nodes) |
| Risk | Reduced cost savings from spot instance optimization |
| Effort | S |
| Remediation | Audit on-demand overrides; most stateless services should tolerate spot eviction with proper PodDisruptionBudgets |
| Dependencies | Service resilience testing |
| Session | 8 |
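A sketch of the PodDisruptionBudget that would let a stateless service move back to spot nodes safely; the service name and minAvailable value are illustrative:

```yaml
# Keep at least one replica running while CastAI drains a spot node.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: wallet-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: wallet
```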
21. Gen 1 DB Repos Still Active — P1
| Item | Detail |
|---|---|
| Debt | ~20 peeq-*-db repos contain Gen 1 Flyway migrations, but Gen 2 services manage their own schemas |
| Affected Services | None (repos are inert) |
| Risk | Developer confusion about which schema source is authoritative |
| Effort | S |
| Remediation | Archive all Gen 1 DB repos with README pointing to Gen 2 service |
| Dependencies | Confirm no CI pipeline references Gen 1 DB repos |
| Session | 2-7 |
22. Elasticsearch 7.x EOL — P1
| Item | Detail |
|---|---|
| Debt | Elasticsearch 7.15.2 and Kibana 7.15.2 are end-of-life |
| Affected Services | search, peeq-logging, Kibana dashboards |
| Risk | No security patches; compatibility issues with newer clients |
| Effort | M |
| Remediation | Upgrade to Elasticsearch 8.x, or replace log aggregation with Cloud Logging and search with a managed service |
| Dependencies | Kibana dashboard export/import; search index rebuild |
| Session | 8, 9 |
23. NFS Storage Coupling — P2
| Item | Detail |
|---|---|
| Debt | 4 PVCs (50Gi each) for content, media, shoutout, streaming tied to GKE NFS provisioner |
| Affected Services | content, media, shoutout, streaming |
| Risk | NFS not cloud-native; blocks multi-region; tied to specific GKE cluster |
| Effort | M |
| Remediation | Migrate file storage to GCS (Google Cloud Storage) with signed URLs |
| Dependencies | Spring Content filesystem abstraction in content service |
| Session | 3, 9 |
24. No API Versioning — P2
| Item | Detail |
|---|---|
| Debt | GraphQL schemas have no versioning strategy; breaking changes affect all consumers |
| Affected Services | All 24+ GraphQL gateways |
| Risk | Cannot evolve APIs without coordinated frontend+backend deployment |
| Effort | M |
| Remediation | Adopt GraphQL schema evolution best practices (additive changes, deprecation annotations, sunset period) |
| Dependencies | Frontend deployment coordination |
| Session | 1, 9 |
25. No Feature Flags — P2
| Item | Detail |
|---|---|
| Debt | No feature flag system; only tenant-level config toggles |
| Affected Services | All services |
| Risk | Cannot do gradual rollouts, canary deployments, or A/B testing |
| Effort | M |
| Remediation | Add feature flag service (LaunchDarkly, Unleash, or custom) as part of Gen 3 |
| Dependencies | Frontend and backend integration |
| Session | General |
26. No Circuit Breakers — P2
| Item | Detail |
|---|---|
| Debt | No circuit breaker pattern in service-to-service calls |
| Affected Services | All services making synchronous calls |
| Risk | Cascading failures when downstream service degrades |
| Effort | S |
| Remediation | Add Resilience4j circuit breakers to GraphQL client calls; Istio can also provide mesh-level circuit breaking |
| Dependencies | Service dependency map (integration-patterns.md) |
| Session | 9 |
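A minimal Resilience4j sketch around a downstream GraphQL client call; the thresholds are placeholders to tune per service, and the client supplier is assumed:

```java
// Short-circuit calls to a degraded downstream instead of piling on requests.
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class DownstreamClient {

    private final CircuitBreaker breaker = CircuitBreaker.of("wallet-client",
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                         // open at 50% failures
            .waitDurationInOpenState(Duration.ofSeconds(30))  // probe again after 30s
            .build());

    public String fetchBalance(Supplier<String> graphQlCall) {
        // Throws CallNotPermittedException immediately while the breaker is open.
        return breaker.executeSupplier(graphQlCall);
    }
}
```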
27. No Message Idempotency — P2
| Item | Detail |
|---|---|
| Debt | RabbitMQ consumers don’t implement idempotency checks |
| Affected Services | All RabbitMQ consumers (~28 services) |
| Risk | Duplicate message processing possible during network issues or consumer restarts |
| Effort | M |
| Remediation | Add message deduplication (idempotency key tracking) to core-lib MessageHandler base class |
| Dependencies | core-lib update affecting all services |
| Session | 9 |
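A sketch of the dedup hook the remediation proposes for the core-lib MessageHandler base class; the class and store interface here are hypothetical stand-ins, and the store would likely be a table in each service's database:

```java
// Wrap message handling in an atomic "seen before?" check so RabbitMQ
// redeliveries are processed exactly once.
public abstract class IdempotentMessageHandler<T> {

    private final ProcessedMessageStore store;

    protected IdempotentMessageHandler(ProcessedMessageStore store) {
        this.store = store;
    }

    public final void onMessage(String messageId, T payload) {
        // markProcessed must be atomic (e.g. INSERT ... ON CONFLICT DO NOTHING)
        // so concurrent redeliveries cannot both pass the check.
        if (!store.markProcessed(messageId)) {
            return; // duplicate delivery, already handled
        }
        handle(payload);
    }

    protected abstract void handle(T payload);

    public interface ProcessedMessageStore {
        /** Returns false if the id was already recorded. */
        boolean markProcessed(String messageId);
    }
}
```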
28. Public GCS Buckets — P2
| Item | Detail |
|---|---|
| Debt | Some GCS buckets observed with public access |
| Affected Services | Content, media storage |
| Risk | Unauthorized access to uploaded content; data exposure |
| Effort | S |
| Remediation | Audit all GCS bucket IAM policies; switch to signed URLs for content delivery |
| Dependencies | Frontend URL rewriting |
| Session | 8 |
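A sketch of signed-URL delivery with the google-cloud-storage client, replacing public object access; the bucket and object names are illustrative:

```java
// Serve content through short-lived signed URLs instead of public buckets.
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.net.URL;
import java.util.concurrent.TimeUnit;

public class SignedUrlService {

    private final Storage storage = StorageOptions.getDefaultInstance().getService();

    public URL contentUrl(String bucket, String objectName) {
        // 15-minute V4 signature; the object itself stays private.
        BlobInfo blob = BlobInfo.newBuilder(bucket, objectName).build();
        return storage.signUrl(blob, 15, TimeUnit.MINUTES,
            Storage.SignUrlOption.withV4Signature());
    }
}
```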
29. No Data Retention Policy — P2
| Item | Detail |
|---|---|
| Debt | No automated data retention or purge policies; soft delete flags exist but no cleanup |
| Affected Services | All services with databases |
| Risk | Unbounded data growth; GDPR/CCPA compliance risk |
| Effort | M |
| Remediation | Define retention policies per data category; implement automated purge jobs; add data subject request workflow |
| Dependencies | Legal/compliance input on retention periods |
| Session | 9 |
30. No WAF/DDoS Protection — P2
| Item | Detail |
|---|---|
| Debt | No Web Application Firewall in front of Istio IngressGateway |
| Affected Services | All public-facing services |
| Risk | Application-layer attacks not filtered |
| Effort | S |
| Remediation | Add Google Cloud Armor to GCP Load Balancer with OWASP rules |
| Dependencies | Load balancer configuration |
| Session | 8 |
31. No Audit Trail Service — P2
| Item | Detail |
|---|---|
| Debt | created_on/updated_on timestamps exist but no centralized audit trail |
| Affected Services | All services |
| Risk | Cannot trace who changed what; compliance audit difficult |
| Effort | M |
| Remediation | Add event sourcing or audit log service that captures change events |
| Dependencies | RabbitMQ message infrastructure (already exists) |
| Session | General |
32. Dual Keycloak Instances — P2
| Item | Detail |
|---|---|
| Debt | identityx-26 (Keycloak 26.3.2) and identityx-25 (Keycloak 25.x) both deployed to agilenetwork tenant |
| Affected Services | Identity domain |
| Risk | Configuration drift between versions; increased operational burden |
| Effort | S |
| Remediation | Complete migration to Keycloak 26 for all tenants; retire identityx-25 |
| Dependencies | Realm config migration, SPI compatibility |
| Session | 2 |
Summary Matrix
By Priority
| Priority | Count | Effort Range | Key Items |
|---|---|---|---|
| P0 | 8 | S to XL | CIB Seven EOL, test coverage, zonal HA, Mandrill lib, security enforcement, frontend dead code, deprecated email verification, no bookkeeping |
| P1 | 14 | S to XL | Cluster cost, CSS split, alerting, tracing, NetworkPolicies, CORS, Arlo, deprecated APIs, core-lib versions, APM, PgBouncer, CastAI, Gen 1 DB repos, Elasticsearch EOL |
| P2 | 10 | S to M | NFS storage, API versioning, feature flags, circuit breakers, idempotency, public GCS, data retention, WAF, audit trail, dual Keycloak |
By Domain
| Domain | P0 | P1 | P2 | Total |
|---|---|---|---|---|
| Infrastructure | 2 | 8 | 2 | 12 |
| Payment/BPM | 2 | 0 | 0 | 2 |
| Frontend | 2 | 1 | 0 | 3 |
| Communication | 1 | 0 | 0 | 1 |
| Cross-cutting | 1 | 3 | 5 | 9 |
| Identity | 0 | 1 | 1 | 2 |
| Events | 0 | 1 | 0 | 1 |
| Content | 0 | 0 | 1 | 1 |
| All services | 0 | 0 | 1 | 1 |
Over-Decomposed Services (Consolidation Candidates)
| Service | Endpoints | Tables/Migrations | Consolidation Target |
|---|---|---|---|
| wallet | ~5 | 3 tables, 3 migrations | → Payment domain service |
| transaction | ~3 | 1 table, 6 migrations | → Payment domain service |
| onsite-event | ~3 | 2 tables, 2 migrations | → Events domain service |
| chat | ~5 | 2 tables (Stream metadata) | → Communication service |
| message-board | ~6 | 5 tables, 5 migrations | → Communication service |
| SSE | ~3 | 2 tables, 3 migrations | Keep separate (infrastructure) |
Note: SSE is listed as low-complexity but serves as platform infrastructure (8 inbound handlers, 7+ publishers). It should remain a separate service despite its small schema.
Last updated: 2026-01-30 (Session 10). Review by: 2026-04-30. Staleness risk: Medium — debt items may be resolved as modernization progresses.