ADR-014: Observability Strategy

Last updated: 2026-02-01 | Decisions

Status

Proposed — Pending engineering team review

Context

The platform has critical observability gaps across all four pillars (metrics, tracing, logging, alerting). Monitoring infrastructure exists (Prometheus, Grafana, Elasticsearch) but is severely under-utilized. There is no automated incident detection: outages are discovered by users, not by monitoring.

Current State

Pillar | Current | Gap
Metrics | Prometheus + Grafana deployed (kube-prometheus-stack) | No custom alert rules configured. Zero automated incident detection.
Tracing | Istio Stackdriver tracing configured | Not adopted by services. No OpenTelemetry SDK integration. Cannot debug cross-service requests.
Logging | peeq-logging (Gen 1 Node.js) → Elasticsearch 7.15.2 | ES 7.x is EOL. Custom pipeline is unnecessary overhead. No structured logging standard.
APM | Elastic APM agent available | Disabled by default across all services. Zero application performance visibility.
SLOs | None | No SLI definitions, no error budgets, no reliability targets.
Error tracking | None | No centralized error aggregation, deduplication, or alerting.
Session replay | LogRocket | Working; no change needed.

Impact

Decision

Adopt OpenTelemetry as the unified observability framework, replacing the fragmented Elastic/Stackdriver/custom stack. Consolidate on Grafana ecosystem (LGTM stack) for visualization and alerting.

Target Architecture

Pillar | Target | Rationale
Metrics | Prometheus + Grafana + AlertManager | Already deployed. Add custom alert rules and SLO-based alerting.
Tracing | OpenTelemetry auto-instrumentation → Grafana Tempo | OTel Java agent auto-instruments Spring Boot, GraphQL, RabbitMQ, JDBC. Zero code changes.
Logging | Structured JSON logs → GCP Cloud Logging + Loki | Replace peeq-logging pipeline. Cloud Logging for infrastructure, Loki for application logs in Grafana.
APM | OpenTelemetry spans + metrics (replaces Elastic APM) | Single instrumentation agent covers tracing + metrics + APM.
SLOs | Grafana SLO plugin + Sloth | Define SLIs per service tier, generate PrometheusRules automatically.
Error tracking | Sentry | Dedicated error aggregation with deduplication, stack traces, release tracking.
Dashboards | Grafana (unified) | Single pane of glass: metrics, traces, logs, SLOs, errors.
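
The logging target calls for structured JSON logs that can be joined to traces. As a rough illustration of what one such log line could look like, here is a minimal Python sketch using only the standard library. The field names (service, trace_id, span_id) and the sample IDs are assumptions made for illustration, not an agreed schema, and the platform's Spring Boot services would emit the equivalent through their own logging framework rather than through this code.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical correlation fields (assumed names, not a standard
            # defined by this ADR). They are None when the caller omits them.
            "service": getattr(record, "service", None),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payment-service")
    # Sample IDs are made up; real values would come from the active trace.
    log.info(
        "charge accepted",
        extra={
            "service": "payment-service",
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "span_id": "00f067aa0ba902b7",
        },
    )
```

With a trace ID present on every line, a log entry in Loki can be followed to the matching trace in Tempo, which is the correlation the Logging and Dashboards rows rely on.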

Why OpenTelemetry + Grafana (Not Elastic Observability)

Factor | OpenTelemetry + Grafana | Elastic Observability
Instrumentation | Vendor-neutral OTel SDK. Auto-instruments Spring Boot with zero code changes. | Elastic APM agent (vendor-specific).
Cost | Open source (Tempo, Loki, Prometheus, Grafana). Cloud option available (Grafana Cloud). | Elastic Cloud pricing per GB ingested. Self-managed ES requires significant ops.
Existing investment | Prometheus + Grafana already deployed and working. | ES 7.x is EOL. Upgrade to 8.x or Elastic Cloud is a migration either way.
Ecosystem | CNCF graduated project. De facto industry standard. | Strong but vendor-specific ecosystem.
Correlation | Grafana correlates metrics ↔︎ traces ↔︎ logs natively (Exemplars). | Elastic APM correlates within Elastic stack only.

SLI/SLO Framework

Service Tier | SLI | SLO Target | Error Budget
Tier 1: Payment (payment-service, purchase-workflow) | Success rate (non-5xx) | 99.95% | 21.6 min/month
Tier 1: Payment | p99 latency | <1s | -
Tier 2: Core (identity, content, shoutout, inventory) | Success rate | 99.9% | 43.2 min/month
Tier 2: Core | p99 latency | <500ms | -
Tier 3: Supporting (notification, search, platform-services) | Success rate | 99.5% | 3.6 hrs/month
Tier 3: Supporting | p99 latency | <2s | -
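
The error budget figures above follow from the SLO targets by simple arithmetic. The sketch below reproduces them, assuming a 30-day month and expressing the budget as minutes of full outage. Both are simplifying assumptions, since the SLIs themselves are request success rates rather than uptime.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of total failure allowed by an SLO over the given window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    tiers = [("Tier 1", 99.95), ("Tier 2", 99.9), ("Tier 3", 99.5)]
    for tier, slo in tiers:
        minutes = error_budget_minutes(slo)
        print(f"{tier}: SLO {slo}% allows {minutes:.1f} min per 30 days")
```

For Tier 3 the result is 216 minutes, which is the 3.6 hours listed in the table.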

Implementation Approach

  1. Add OTel Java agent to common Helm chart — auto-instrumentation via -javaagent JVM flag. Zero application code changes. Instruments Spring Boot, GraphQL, RabbitMQ, JDBC, Redis automatically.
  2. Deploy Grafana Tempo — distributed tracing backend. Receives OTel traces via OTLP protocol.
  3. Deploy Loki — log aggregation. Application logs in structured JSON. Correlates with traces via trace ID.
  4. Configure Prometheus alert rules and AlertManager routing, starting with golden signals (latency, traffic, errors, saturation) per service.
  5. Define SLOs with Sloth — generate PrometheusRules from SLO definitions. Grafana SLO dashboards.
  6. Deploy Sentry — error tracking with release correlation. SDKs for Java (backend) and Angular/Next.js (frontend).
  7. Decommission peeq-logging — replace with Cloud Logging for infrastructure + Loki for application logs.
  8. Retire Elasticsearch 7.x for logging — keep only for content search (see ADR-019).
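
Steps 4 and 5 depend on turning an SLO into an alert condition. One common way to do this is burn-rate alerting, where the observed error ratio is divided by the ratio the SLO allows. The sketch below illustrates the calculation only. The two-window check and the 14.4 threshold (2% of a 30-day budget consumed in one hour) follow the multiwindow approach described in the Google SRE Workbook. Whether the rules Sloth generates for this platform use the same windows and thresholds is an assumption to verify against its configuration.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    allowed_ratio = (100.0 - slo_percent) / 100.0
    return error_ratio / allowed_ratio


def should_page(
    long_window_errors: float,
    short_window_errors: float,
    slo_percent: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    The short window keeps the alert from firing long after a spike has
    already ended. The default threshold is an assumed starting point.
    """
    return (
        burn_rate(long_window_errors, slo_percent) >= threshold
        and burn_rate(short_window_errors, slo_percent) >= threshold
    )


if __name__ == "__main__":
    # Tier 2 target (99.9%): 2% of requests failing in both windows.
    print(should_page(0.02, 0.02, 99.9))
    # Same long-window errors, but the short window has recovered.
    print(should_page(0.02, 0.0005, 99.9))
```

Alerting on burn rate rather than on raw error counts is what ties the alert rules in step 4 to the SLO definitions in step 5.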

Hypothesis Background

Primary: OpenTelemetry + Grafana LGTM stack provides comprehensive observability with minimal application code changes and lower operational overhead than Elastic.

Alternative 1: Upgrade to Elastic Cloud 8.x (full Elastic Observability).
- Rejected: Higher cost, vendor lock-in, requires ES migration. We already have Prometheus + Grafana working.

Alternative 2: GCP-native (Cloud Monitoring, Cloud Trace, Cloud Logging only).
- Not rejected entirely: Cloud Logging is part of the recommendation. But GCP-native tracing and dashboards are less flexible than Grafana for custom SLO dashboards and cross-signal correlation.

Falsifiability Criteria

Evidence Quality

Evidence | Assurance
Prometheus + Grafana deployed and working | L2
ES 7.x is EOL | L2
Elastic APM disabled across all services | L2
No alert rules configured | L2
OTel auto-instruments Spring Boot | L1 (documented, not tested in codebase)
Tempo/Loki storage costs at our scale | L0 (need capacity planning)
SLO targets appropriate for our traffic patterns | L0 (need baseline data)

Overall: L0 (weakest-link rule, WLNK: the overall level is capped by the least-assured evidence, here untested OTel overhead and unknown storage costs)
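
The overall rating is the minimum of the individual assurance levels. The sketch below applies that rule to the evidence table, with the level numbers transcribed from it. The function names are illustrative only.

```python
# Assurance levels transcribed from the evidence table (L2 -> 2, and so on).
EVIDENCE_LEVELS = {
    "Prometheus + Grafana deployed and working": 2,
    "ES 7.x is EOL": 2,
    "Elastic APM disabled across all services": 2,
    "No alert rules configured": 2,
    "OTel auto-instruments Spring Boot": 1,
    "Tempo/Loki storage costs at our scale": 0,
    "SLO targets appropriate for our traffic patterns": 0,
}


def overall_assurance(levels: dict[str, int]) -> int:
    """Weakest-link rule: the overall level is the lowest individual level."""
    return min(levels.values())


def weakest_links(levels: dict[str, int]) -> list[str]:
    """Return the claims that hold the overall level down."""
    floor = overall_assurance(levels)
    return [claim for claim, level in levels.items() if level == floor]


if __name__ == "__main__":
    print(f"Overall: L{overall_assurance(EVIDENCE_LEVELS)}")
    for claim in weakest_links(EVIDENCE_LEVELS):
        print(f"  capped by: {claim}")
```

Raising the overall level therefore requires raising every L0 item, in this table the storage cost and SLO target entries.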

Bounded Validity

Consequences

Positive:
- Unified observability across all four pillars in a single tool (Grafana)
- Zero application code changes for tracing (OTel auto-instrumentation)
- Eliminates EOL Elasticsearch 7.x dependency for logging
- Removes custom peeq-logging pipeline (Gen 1 Node.js service)
- SLO-based alerting prevents alert fatigue
- Enables debugging cross-service requests during and after migration
- Industry-standard (CNCF), no vendor lock-in

Negative:
- Additional infrastructure to operate (Tempo, Loki, Sentry)
- Learning curve for Grafana Tempo query language (TraceQL)
- SLO target calibration requires baseline data collection period


Decision date: 2026-02-01 | Review by: 2026-08-01