ADR-014: Observability Strategy
Status
Proposed — Pending engineering team review
Context
The platform has critical observability gaps across all four pillars (metrics, tracing, logging, alerting). Monitoring infrastructure exists (Prometheus, Grafana, Elasticsearch) but is severely under-utilized. There is no automated incident detection: outages are discovered by users, not by monitoring.
Current State
| Pillar | Current | Gap |
|---|---|---|
| Metrics | Prometheus + Grafana deployed (kube-prometheus-stack) | No custom alert rules configured. Zero automated incident detection. |
| Tracing | Istio Stackdriver tracing configured | Not adopted by services. No OpenTelemetry SDK integration. Cannot debug cross-service requests. |
| Logging | peeq-logging (Gen 1 Node.js) → Elasticsearch 7.15.2 | ES 7.x is EOL. Custom pipeline is unnecessary overhead. No structured logging standard. |
| APM | Elastic APM agent available | Disabled by default across all services. Zero application performance visibility. |
| SLOs | None | No SLI definitions, no error budgets, no reliability targets. |
| Error tracking | None | No centralized error aggregation, deduplication, or alerting. |
| Session replay | LogRocket | Working — no change needed. |
Impact
- P1 tech debt items: #11 (alerting), #12 (tracing), #18 (APM), #22 (ES 7.x EOL)
- Cannot measure service reliability during or after migration
- Cannot detect regressions introduced by service consolidation (ADR-001)
- Cannot debug cross-service request failures in the 18-service architecture
Decision
Adopt OpenTelemetry as the unified observability framework, replacing the fragmented Elastic/Stackdriver/custom stack. Consolidate on Grafana ecosystem (LGTM stack) for visualization and alerting.
Target Architecture
| Pillar | Target | Rationale |
|---|---|---|
| Metrics | Prometheus + Grafana + AlertManager | Already deployed. Add custom alert rules and SLO-based alerting. |
| Tracing | OpenTelemetry auto-instrumentation → Grafana Tempo | OTel Java agent auto-instruments Spring Boot, GraphQL, RabbitMQ, JDBC. Zero code changes. |
| Logging | Structured JSON logs → GCP Cloud Logging + Loki | Replace peeq-logging pipeline. Cloud Logging for infrastructure, Loki for application logs in Grafana. |
| APM | OpenTelemetry spans + metrics (replaces Elastic APM) | Single instrumentation agent covers tracing + metrics + APM. |
| SLOs | Grafana SLO plugin + Sloth | Define SLIs per service tier, generate PrometheusRules automatically. |
| Error tracking | Sentry | Dedicated error aggregation with deduplication, stack traces, release tracking. |
| Dashboards | Grafana (unified) | Single pane of glass: metrics, traces, logs, SLOs, errors. |
Why OpenTelemetry + Grafana (Not Elastic Observability)
| Factor | OpenTelemetry + Grafana | Elastic Observability |
|---|---|---|
| Instrumentation | Vendor-neutral OTel SDK. Auto-instruments Spring Boot with zero code changes. | Elastic APM agent — vendor-specific. |
| Cost | Open source (Tempo, Loki, Prometheus, Grafana). Cloud option available (Grafana Cloud). | Elastic Cloud pricing per GB ingested. Self-managed ES requires significant ops. |
| Existing investment | Prometheus + Grafana already deployed and working. | ES 7.x is EOL. Upgrade to 8.x or Elastic Cloud is a migration either way. |
| Ecosystem | CNCF graduated project. De facto industry standard. | Strong but vendor-specific ecosystem. |
| Correlation | Grafana correlates metrics ↔ traces ↔ logs natively (Exemplars). | Elastic APM correlates within the Elastic stack only. |
SLI/SLO Framework
| Service Tier | SLI | SLO Target | Error Budget |
|---|---|---|---|
| Tier 1: Payment (payment-service, purchase-workflow) | Success rate (non-5xx) | 99.95% | 21.6 min/month |
| Tier 1: Payment | p99 latency | <1s | — |
| Tier 2: Core (identity, content, shoutout, inventory) | Success rate | 99.9% | 43.2 min/month |
| Tier 2: Core | p99 latency | <500ms | — |
| Tier 3: Supporting (notification, search, platform-services) | Success rate | 99.5% | 3.6 hrs/month |
| Tier 3: Supporting | p99 latency | <2s | — |
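The error-budget column is the permitted downtime over a rolling 30-day window: for 99.95%, (1 − 0.9995) × 30 × 24 × 60 ≈ 21.6 minutes. As a concrete illustration, the Tier 1 success-rate SLO could be expressed in Sloth roughly as in the sketch below; the metric name (`http_server_requests_seconds_count`, Spring Boot's Micrometer default) and label names are assumptions, not values verified against our Prometheus.

```yaml
# slos/payment-service.yaml: Sloth SLO spec (sketch; metric/label names assumed)
version: "prometheus/v1"
service: "payment-service"
labels:
  tier: "1"
slos:
  - name: "requests-availability"
    objective: 99.95          # Tier 1 success-rate target
    description: "Non-5xx responses from payment-service"
    sli:
      events:
        # error and total event rates over Sloth's templated window
        error_query: sum(rate(http_server_requests_seconds_count{service="payment-service", status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_server_requests_seconds_count{service="payment-service"}[{{.window}}]))
    alerting:
      name: PaymentServiceAvailability
      page_alert:
        labels:
          severity: critical   # fast burn rate, page on-call
      ticket_alert:
        labels:
          severity: warning    # slow burn, file a ticket
```

Running `sloth generate` against this spec emits multiwindow, multi-burn-rate PrometheusRules that AlertManager can route, which is what makes the error-budget targets above enforceable rather than aspirational.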
Implementation Approach
- Add OTel Java agent to common Helm chart — auto-instrumentation via the `-javaagent` JVM flag. Zero application code changes. Instruments Spring Boot, GraphQL, RabbitMQ, JDBC, and Redis automatically (values sketch after this list).
- Deploy Grafana Tempo — distributed tracing backend. Receives OTel traces via the OTLP protocol (config sketch after this list).
- Deploy Loki — log aggregation. Application logs in structured JSON. Correlates with traces via trace ID.
- Configure AlertManager rules — start with the four golden signals (latency, traffic, errors, saturation) per service (rule sketch after this list).
- Define SLOs with Sloth — generate PrometheusRules from SLO definitions (see the Sloth sketch under the SLI/SLO framework above) and visualize them in Grafana SLO dashboards.
- Deploy Sentry — error tracking with release correlation. SDKs for Java (backend) and Angular/Next.js (frontend).
- Decommission peeq-logging — replace with Cloud Logging for infrastructure + Loki for application logs.
- Retire Elasticsearch 7.x for logging — keep only for content search (see ADR-019).
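To make step 1 concrete, a minimal sketch of the common-chart values, assuming the chart exposes a `javaOpts` value and an `env` passthrough (those key names are hypothetical; the `OTEL_*` variables are the agent's standard configuration surface):

```yaml
# values.yaml fragment: attach the OTel Java agent (chart keys are hypothetical)
javaOpts: "-javaagent:/otel/opentelemetry-javaagent.jar"
env:
  - name: OTEL_SERVICE_NAME
    value: "payment-service"                      # set per service
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://tempo.observability.svc:4317"  # Tempo OTLP/gRPC
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"             # head-based sampling
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"                                  # keep 10% of traces; tune per the falsifiability criteria
```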
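For step 2, a minimal single-binary Tempo configuration that accepts OTLP over gRPC and stores traces in GCS; the bucket name is a placeholder pending the capacity planning flagged under Evidence Quality:

```yaml
# tempo.yaml: minimal receiver + storage config (sketch)
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc: {}              # listens on 4317 by default
storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: peeq-tempo-traces   # placeholder bucket
```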
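And for step 4, one golden signal (error rate) expressed as a PrometheusRule for the operator already installed by kube-prometheus-stack; the metric name again assumes Spring Boot's Micrometer defaults, and the 1% threshold is a starting guess to be recalibrated against baseline data:

```yaml
# error-rate alert for one service (sketch; threshold needs baseline data)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: golden-signals-payment-service
spec:
  groups:
    - name: payment-service.errors
      rules:
        - alert: PaymentServiceHighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{service="payment-service", status=~"5.."}[5m]))
              /
            sum(rate(http_server_requests_seconds_count{service="payment-service"}[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "payment-service 5xx rate above 1% for 5 minutes"
```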
Hypothesis Background
Primary: OpenTelemetry + Grafana LGTM stack provides comprehensive observability with minimal application code changes and lower operational overhead than Elastic.
- Evidence: Prometheus + Grafana already deployed and operational (L2)
- Evidence: OTel Java agent auto-instruments Spring Boot applications (L1 — documented, not tested in our codebase)
- Evidence: Elastic APM disabled across all services — no existing investment to protect (L2)
- Evidence: ES 7.x is EOL — migration required regardless (L2)
Alternative 1: Upgrade to Elastic Cloud 8.x (full Elastic Observability). Rejected: higher cost, vendor lock-in, and an ES migration is required anyway, while Prometheus + Grafana are already working.
Alternative 2: GCP-native (Cloud Monitoring, Cloud Trace, Cloud Logging only). Not rejected entirely: Cloud Logging is part of the recommendation, but GCP-native tracing and dashboards are less flexible than Grafana for custom SLO dashboards and cross-signal correlation.
Falsifiability Criteria
- If OTel auto-instrumentation causes >5% latency overhead in production → reduce sampling rate or switch to manual instrumentation for hot paths
- If Tempo storage costs exceed $X/month at production trace volume → reduce retention or sampling
- If SLO targets prove unrealistic (error budget exhausted within first month) → revise targets based on baseline data
- If Loki query performance is insufficient for debugging → evaluate Elastic Cloud for log search specifically
Evidence Quality
| Evidence | Assurance |
|---|---|
| Prometheus + Grafana deployed and working | L2 |
| ES 7.x is EOL | L2 |
| Elastic APM disabled across all services | L2 |
| No alert rules configured | L2 |
| OTel auto-instruments Spring Boot | L1 (documented, not tested in codebase) |
| Tempo/Loki storage costs at our scale | L0 (need capacity planning) |
| SLO targets appropriate for our traffic patterns | L0 (need baseline data) |
Overall: L0 (WLNK capped by untested OTel overhead and unknown storage costs)
Bounded Validity
- Scope: All backend services, frontend error tracking. Does not cover infrastructure monitoring (GKE, Cloud SQL — already handled by GCP).
- Expiry: Re-evaluate if Grafana Cloud pricing becomes more favorable than self-managed, or if traffic grows 10x.
- Review trigger: If OTel overhead exceeds acceptable thresholds, or if trace storage costs are prohibitive.
- Monitoring: Track OTel agent overhead (CPU/memory), trace storage growth rate, and alert noise ratio.
Consequences
Positive:
- Unified observability across all four pillars in a single tool (Grafana)
- Zero application code changes for tracing (OTel auto-instrumentation)
- Eliminates the EOL Elasticsearch 7.x dependency for logging
- Removes the custom peeq-logging pipeline (Gen 1 Node.js service)
- SLO-based alerting prevents alert fatigue
- Enables debugging of cross-service requests during and after migration
- Industry-standard (CNCF) — no vendor lock-in
Negative:
- Additional infrastructure to operate (Tempo, Loki, Sentry)
- Learning curve for Grafana Tempo's query language (TraceQL)
- SLO target calibration requires a baseline data-collection period
Decision date: 2026-02-01
Review by: 2026-08-01