ADR-014: Observability Strategy
Status
Proposed — Pending engineering team review
Context
The platform has critical observability gaps across all four pillars (metrics, tracing, logging, alerting). Monitoring infrastructure exists (Prometheus, Grafana, Elasticsearch) but is severely under-utilized. There is no automated incident detection: outages are discovered by users, not by monitoring.
Current State
| Pillar | Current | Gap |
|---|---|---|
| Metrics | Prometheus + Grafana deployed (kube-prometheus-stack) | No custom alert rules configured. Zero automated incident detection. |
| Tracing | Istio Stackdriver tracing configured | Not adopted by services. No OpenTelemetry SDK integration. Cannot debug cross-service requests. |
| Logging | peeq-logging (Gen 1 Node.js) → Elasticsearch 7.15.2 | ES 7.x is EOL. Custom pipeline is unnecessary overhead. No structured logging standard. |
| APM | Elastic APM agent available | Disabled by default across all services. Zero application performance visibility. |
| SLOs | None | No SLI definitions, no error budgets, no reliability targets. |
| Error tracking | None | No centralized error aggregation, deduplication, or alerting. |
| Session replay | LogRocket | Working — no change needed. |
Impact
- P1 tech debt items: #11 (alerting), #12 (tracing), #18 (APM), #22 (ES 7.x EOL)
- Cannot measure service reliability during or after migration
- Cannot detect regressions introduced by service consolidation (ADR-001)
- Cannot debug cross-service request failures in the 18-service architecture
Decision
Adopt OpenTelemetry as the unified observability framework, replacing the fragmented Elastic/Stackdriver/custom stack. Consolidate on Grafana ecosystem (LGTM stack) for visualization and alerting.
Target Architecture
| Pillar | Target | Rationale |
|---|---|---|
| Metrics | Prometheus + Grafana + AlertManager | Already deployed. Add custom alert rules and SLO-based alerting. |
| Tracing | OpenTelemetry auto-instrumentation → Grafana Tempo | OTel Java agent auto-instruments Spring Boot, GraphQL, RabbitMQ, JDBC. Zero code changes. |
| Logging | Structured JSON logs → GCP Cloud Logging + Loki | Replace peeq-logging pipeline. Cloud Logging for infrastructure, Loki for application logs in Grafana. |
| APM | OpenTelemetry spans + metrics (replaces Elastic APM) | Single instrumentation agent covers tracing + metrics + APM. |
| SLOs | Grafana SLO plugin + Sloth | Define SLIs per service tier, generate PrometheusRules automatically. |
| Error tracking | Sentry | Dedicated error aggregation with deduplication, stack traces, release tracking. |
| Dashboards | Grafana (unified) | Single pane of glass: metrics, traces, logs, SLOs, errors. |
Why OpenTelemetry + Grafana (Not Elastic Observability)
| Factor | OpenTelemetry + Grafana | Elastic Observability |
|---|---|---|
| Instrumentation | Vendor-neutral OTel SDK. Auto-instruments Spring Boot with zero code changes. | Elastic APM agent — vendor-specific. |
| Cost | Open source (Tempo, Loki, Prometheus, Grafana). Cloud option available (Grafana Cloud). | Elastic Cloud pricing per GB ingested. Self-managed ES requires significant ops. |
| Existing investment | Prometheus + Grafana already deployed and working. | ES 7.x is EOL. Upgrade to 8.x or Elastic Cloud is a migration either way. |
| Ecosystem | CNCF graduated project. De facto industry standard. | Strong but vendor-specific ecosystem. |
| Correlation | Grafana correlates metrics ↔ traces ↔ logs natively (Exemplars). | Elastic APM correlates within the Elastic stack only. |
SLI/SLO Framework
| Service Tier | SLI | SLO Target | Error Budget |
|---|---|---|---|
| Tier 1: Payment (payment-service, purchase-workflow) | Success rate (non-5xx) | 99.95% | 21.6 min/month |
| Tier 1: Payment | p99 latency | <1s | — |
| Tier 2: Core (identity, content, shoutout, inventory) | Success rate | 99.9% | 43.2 min/month |
| Tier 2: Core | p99 latency | <500ms | — |
| Tier 3: Supporting (notification, search, platform-services) | Success rate | 99.5% | 3.6 hrs/month |
| Tier 3: Supporting | p99 latency | <2s | — |
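The error-budget column is the permitted downtime over a rolling 30-day window: for 99.95%, (1 − 0.9995) × 30 × 24 × 60 ≈ 21.6 minutes. As a concrete illustration, the Tier 1 success-rate SLO could be expressed in Sloth roughly as in the sketch below; the metric name (`http_server_requests_seconds_count`, Spring Boot's Micrometer default) and label names are assumptions, not values verified against our Prometheus.

```yaml
# slos/payment-service.yaml: Sloth SLO spec (sketch; metric/label names assumed)
version: "prometheus/v1"
service: "payment-service"
labels:
  tier: "1"
slos:
  - name: "requests-availability"
    objective: 99.95          # Tier 1 success-rate target
    description: "Non-5xx responses from payment-service"
    sli:
      events:
        # error and total event rates over Sloth's templated window
        error_query: sum(rate(http_server_requests_seconds_count{service="payment-service", status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_server_requests_seconds_count{service="payment-service"}[{{.window}}]))
    alerting:
      name: PaymentServiceAvailability
      page_alert:
        labels:
          severity: critical   # fast burn rate, page on-call
      ticket_alert:
        labels:
          severity: warning    # slow burn, file a ticket
```

Running `sloth generate` against this spec emits multiwindow, multi-burn-rate PrometheusRules that AlertManager can route, which is what makes the error-budget targets above enforceable rather than aspirational.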
Implementation Approach
- Add OTel Java agent to common Helm chart — auto-instrumentation via the `-javaagent` JVM flag. Zero application code changes. Instruments Spring Boot, GraphQL, RabbitMQ, JDBC, and Redis automatically (values sketch after this list).
- Deploy Grafana Tempo — distributed tracing backend. Receives OTel traces via the OTLP protocol (config sketch after this list).
- Deploy Loki — log aggregation. Application logs in structured JSON. Correlates with traces via trace ID.
- Configure AlertManager rules — start with the four golden signals (latency, traffic, errors, saturation) per service (rule sketch after this list).
- Define SLOs with Sloth — generate PrometheusRules from SLO definitions (see the Sloth sketch under the SLI/SLO framework above) and visualize them in Grafana SLO dashboards.
- Deploy Sentry — error tracking with release correlation. SDKs for Java (backend) and Angular/Next.js (frontend).
- Decommission peeq-logging — replace with Cloud Logging for infrastructure + Loki for application logs.
- Retire Elasticsearch 7.x for logging — keep only for content search (see ADR-019).
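To make step 1 concrete, a minimal sketch of the common-chart values, assuming the chart exposes a `javaOpts` value and an `env` passthrough (those key names are hypothetical; the `OTEL_*` variables are the agent's standard configuration surface):

```yaml
# values.yaml fragment: attach the OTel Java agent (chart keys are hypothetical)
javaOpts: "-javaagent:/otel/opentelemetry-javaagent.jar"
env:
  - name: OTEL_SERVICE_NAME
    value: "payment-service"                      # set per service
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://tempo.observability.svc:4317"  # Tempo OTLP/gRPC
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"             # head-based sampling
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"                                  # keep 10% of traces; tune per the falsifiability criteria
```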
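For step 2, a minimal single-binary Tempo configuration that accepts OTLP over gRPC and stores traces in GCS; the bucket name is a placeholder pending the capacity planning flagged under Evidence Quality:

```yaml
# tempo.yaml: minimal receiver + storage config (sketch)
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc: {}              # listens on 4317 by default
storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: peeq-tempo-traces   # placeholder bucket
```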
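And for step 4, one golden signal (error rate) expressed as a PrometheusRule for the operator already installed by kube-prometheus-stack; the metric name again assumes Spring Boot's Micrometer defaults, and the 1% threshold is a starting guess to be recalibrated against baseline data:

```yaml
# error-rate alert for one service (sketch; threshold needs baseline data)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: golden-signals-payment-service
spec:
  groups:
    - name: payment-service.errors
      rules:
        - alert: PaymentServiceHighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{service="payment-service", status=~"5.."}[5m]))
              /
            sum(rate(http_server_requests_seconds_count{service="payment-service"}[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "payment-service 5xx rate above 1% for 5 minutes"
```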
Hypothesis Background
Primary: OpenTelemetry + Grafana LGTM stack provides comprehensive observability with minimal application code changes and lower operational overhead than Elastic.
- Evidence: Prometheus + Grafana already deployed and operational (L2)
- Evidence: OTel Java agent auto-instruments Spring Boot applications (L1 — documented, not tested in our codebase)
- Evidence: Elastic APM disabled across all services — no existing investment to protect (L2)
- Evidence: ES 7.x is EOL — migration required regardless (L2)
Alternative 1: Upgrade to Elastic Cloud 8.x (full Elastic Observability). Rejected: higher cost, vendor lock-in, and an ES migration is required anyway, while Prometheus + Grafana are already working.
Alternative 2: GCP-native (Cloud Monitoring, Cloud Trace, Cloud Logging only). Not rejected entirely: Cloud Logging is part of the recommendation, but GCP-native tracing and dashboards are less flexible than Grafana for custom SLO dashboards and cross-signal correlation.
Falsifiability Criteria
- If OTel auto-instrumentation causes >5% latency overhead in production → reduce sampling rate or switch to manual instrumentation for hot paths
- If Tempo storage costs exceed $X/month at production trace volume → reduce retention or sampling
- If SLO targets prove unrealistic (error budget exhausted within first month) → revise targets based on baseline data
- If Loki query performance is insufficient for debugging → evaluate Elastic Cloud for log search specifically
Evidence Quality
| Evidence | Assurance |
|---|---|
| Prometheus + Grafana deployed and working | L2 |
| ES 7.x is EOL | L2 |
| Elastic APM disabled across all services | L2 |
| No alert rules configured | L2 |
| OTel auto-instruments Spring Boot | L1 (documented, not tested in codebase) |
| Tempo/Loki storage costs at our scale | L0 (need capacity planning) |
| SLO targets appropriate for our traffic patterns | L0 (need baseline data) |
Overall: L0 (WLNK capped by untested OTel overhead and unknown storage costs)
Bounded Validity
- Scope: All backend services, frontend error tracking. Does not cover infrastructure monitoring (GKE, Cloud SQL — already handled by GCP).
- Expiry: Re-evaluate if Grafana Cloud pricing becomes more favorable than self-managed, or if traffic grows 10x.
- Review trigger: If OTel overhead exceeds acceptable thresholds, or if trace storage costs are prohibitive.
- Monitoring: Track OTel agent overhead (CPU/memory), trace storage growth rate, and alert noise ratio.
Consequences
Positive:
- Unified observability across all four pillars in a single tool (Grafana)
- Zero application code changes for tracing (OTel auto-instrumentation)
- Eliminates the EOL Elasticsearch 7.x dependency for logging
- Removes the custom peeq-logging pipeline (Gen 1 Node.js service)
- SLO-based alerting prevents alert fatigue
- Enables debugging of cross-service requests during and after migration
- Industry-standard (CNCF) — no vendor lock-in
Negative:
- Additional infrastructure to operate (Tempo, Loki, Sentry)
- Learning curve for Grafana Tempo's query language (TraceQL)
- SLO target calibration requires a baseline data-collection period
Decision date: 2026-02-01
Review by: 2026-08-01