Observability — Overview

PerfShop is designed to be observable end to end. That is in fact the pedagogical reason for the platform's existence: a student must be able to see, in real time, what an active anomaly is doing to the system. To make this possible, every application service is instrumented and every type of signal has its own storage sink and its own visualization tool.

This page gives an overview of the observability stack. The pages that follow (prometheus, grafana, dashboards, loki, tempo, pyroscope, opensearch, dashboard-html) cover each component in detail.

Source of truth

All information in this section is extracted from the actual configuration files: prometheus/prometheus.yml, loki/loki-config.yml, promtail/promtail-config.yml, tempo/tempo-config.yml, pyroscope/pyroscope-config.yml, vector/vector.toml, grafana/provisioning/**, and the backend JAVA_OPTS in the compose files.

Four signals, four sinks

PerfShop handles the four pillars of modern observability: metrics, logs, traces, profiles. Each has its own export channel, its own format and its own storage sink, but all converge to Grafana for visualization.

flowchart LR
  APP["perfshop-app<br/>Spring Boot 3.2 + Java 21"]

  APP -->|"metrics<br/>HTTP scrape 5s<br/>/actuator/prometheus"| PROM[("Prometheus<br/>retention 7d / 5GB")]
  APP -->|"traces<br/>OTLP gRPC :4317<br/>(OpenTelemetry agent)"| TEMPO[("Tempo 2.4.2<br/>local blocks")]
  APP -->|"CPU/alloc/lock profiles<br/>JFR HTTP push :4040<br/>(Pyroscope agent)"| PYRO[("Pyroscope<br/>filesystem")]
  APP -->|"stdout logs<br/>(Spring Boot JSON)"| DOC{"Docker logging<br/>driver json-file"}

  DOC --> PT["Promtail<br/>(docker SD<br/>via socket)"]
  DOC --> VEC["Vector<br/>(docker_logs<br/>via socket)"]

  PT --> LOKI[("Loki<br/>retention 168h")]
  VEC -->|"VRL transform<br/>parse JSON +<br/>service_family routing"| OS[("OpenSearch 2.13<br/>perfshop-{family}*")]

  PROM --> GRAF["Grafana 12.0.0"]
  LOKI --> GRAF
  TEMPO --> GRAF
  PYRO --> GRAF
  OS --> OSD["OpenSearch<br/>Dashboards"]

  TEMPO -.->|"metrics_generator<br/>service-graphs +<br/>span-metrics<br/>remote_write"| PROM

  GRAF -.->|"tracesToLogsV2<br/>(trace ↔ log correlation)"| LOKI
  GRAF -.->|"serviceMap"| TEMPO

  style PROM fill:#e6532d20,stroke:#e6532d
  style TEMPO fill:#0099ff20,stroke:#0099ff
  style PYRO fill:#ff930020,stroke:#ff9300
  style LOKI fill:#00b4fa20,stroke:#00b4fa
  style OS fill:#00514520,stroke:#005145
  style GRAF fill:#f4750020,stroke:#f47500

The four flows in detail

1. Metrics — Prometheus

  • Source: Spring Boot Actuator + Micrometer expose metrics on perfshop-app:9090/actuator/prometheus. Three Prometheus jobs scrape every 5 seconds: perfshop-backend (the backend itself), perfshop-docker (the perfshop-monitoring service that aggregates Docker stats via socket), and jmeter (active only during a JMeter test run).
  • Storage: local Prometheus TSDB, retention 7 days / 5 GB (--storage.tsdb.retention.time=7d, --storage.tsdb.retention.size=5GB).
  • Transport: HTTP scrape pull (Prometheus pulls metrics from the exposers).
  • Other ingestion: --web.enable-remote-write-receiver is enabled, which lets Tempo (via its metrics_generator) push service-graph and span-metrics directly into Prometheus.
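To make the scrape step concrete, here is a minimal sketch of reading the text exposition format that an endpoint such as /actuator/prometheus returns. The excerpt in `SAMPLE_TEXT` is hypothetical (metric values and the `uri` label are illustrative, not taken from PerfShop), and the parser handles only simple sample lines, not the full format.

```python
import re

# One sample line looks like: metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a # HELP / # TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a simple sample line; ignored in this sketch
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of a scrape response (illustrative values only).
SAMPLE_TEXT = """\
# HELP jvm_threads_live_threads The current number of live threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 42.0
http_server_requests_seconds_count{method="GET",status="200",uri="/api/products"} 1280.0
"""

samples = parse_exposition(SAMPLE_TEXT)
for name, labels, value in samples:
    print(name, labels, value)
```

Prometheus performs this parsing itself on every scrape; the sketch is only meant to show what travels over the wire every 5 seconds.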

See prometheus.md for the detailed configuration and 8 real PromQL queries taken from the shipped dashboards.

2. Logs — dual sink Loki and OpenSearch

PerfShop collects Docker logs twice in parallel, into two different sinks. The redundancy is deliberate: it lets students compare the two approaches side by side.

| Aspect | Loki + Promtail | OpenSearch + Vector |
| --- | --- | --- |
| Model | Index on labels only, content stays as text | Full-text index on all fields |
| Querying | LogQL (similar to PromQL) | DQL / Lucene / OpenSearch queries |
| UI | Grafana → Explore (Loki datasource) | OpenSearch Dashboards |
| Storage cost | Very low (compression, no inverted index) | Higher (inverted index on all fields) |
| Use case | Filtering by container/level + substring search | Aggregations and facets, structured search |
| Source in PerfShop | Promtail via Docker SD socket | Vector via Docker logs source |
| Transformation | logfmt parsing + multiline Java stacktrace | Spring Boot JSON parsing + extraction of chaos_family, chaos_level, scenario_id via VRL |
| Retention | 168 h (7 days) | Indexed as long as the OpenSearch index exists (no automatic purge configured) |
| Index | A single index_* index | Seven indices per family: perfshop-spring, perfshop-nginx, perfshop-mysql, perfshop-jmeter, perfshop-qa, perfshop-forgejo, perfshop-observability |

See loki.md and opensearch.md for the details of each pipeline.
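The difference between the two models shows up in how a question is asked. The sketch below builds a LogQL selector with a substring filter next to an OpenSearch query DSL body with a facet. The `chaos_family` and `chaos_level` fields come from the table above; the `cpu` value is hypothetical, and the `term` filter assumes those fields are mapped as keywords.

```python
import json


def loki_query(container, needle):
    """LogQL: pick a stream by label, then grep its lines for a substring."""
    return '{container="%s"} |= "%s"' % (container, needle)


def opensearch_query(chaos_family):
    """Query DSL: filter on a structured field, then count hits per chaos_level."""
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {
            "bool": {
                # Assumes chaos_family is indexed as a keyword field.
                "filter": [{"term": {"chaos_family": chaos_family}}]
            }
        },
        "aggs": {
            "per_level": {"terms": {"field": "chaos_level"}}
        },
    }


logql = loki_query("perfshop-app", "ERROR")
body = opensearch_query("cpu")  # "cpu" is an illustrative family name

print(logql)
print(json.dumps(body, indent=2))
```

The Loki side can only narrow by label and then scan text, whereas the OpenSearch side can aggregate on any extracted field, which is the trade-off the table summarizes.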

3. Distributed traces — Tempo + OpenTelemetry

The perfshop-backend image embeds the OpenTelemetry Java agent in /agents/opentelemetry-javaagent.jar. It is unconditionally activated via JAVA_OPTS:

-javaagent:/agents/opentelemetry-javaagent.jar
-Dotel.service.name=perfshop
-Dotel.exporter.otlp.endpoint=http://perfshop-tempo:4317
-Dotel.exporter.otlp.protocol=grpc
-Dotel.traces.exporter=otlp
-Dotel.metrics.exporter=none
-Dotel.logs.exporter=none
-Dotel.instrumentation.http.capture-headers.server.request=X-Admin-Token,Content-Type
-Dotel.instrumentation.jdbc.captured-statements.enabled=true
-Dotel.span.attribute.count.limit=256

Key points:

  • Traces export only (OTel metrics and logs are disabled — Prometheus and Loki/OpenSearch are the preferred sinks for those signals).
  • Capture of the X-Admin-Token header on the server side — this is what enables the instructor panel "Traces with admin trigger" on the APM dashboard.
  • JDBC statement capture — this is what makes it possible to observe the actual SQL queries in traces, and to support the Security S1 scenario (SQL injection).
  • Limit of 256 attributes per span, which caps the size of each exported span.
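One way to check the "traces only" point is to read the flags back as system properties. The following sketch parses a JAVA_OPTS string and lists the signals whose exporter is explicitly set to something other than `none`. The helper names are hypothetical, and unset exporters are deliberately ignored because their defaults depend on the agent version.

```python
import shlex


def system_properties(java_opts):
    """Extract -Dkey=value flags from a JAVA_OPTS string into a dict."""
    props = {}
    for token in shlex.split(java_opts):
        if token.startswith("-D") and "=" in token:
            key, value = token[2:].split("=", 1)
            props[key] = value
    return props


def explicitly_enabled_signals(props):
    """Signals whose otel.<signal>.exporter is set and is not 'none'."""
    enabled = []
    for signal in ("traces", "metrics", "logs"):
        exporter = props.get("otel.%s.exporter" % signal)
        if exporter is not None and exporter != "none":
            enabled.append(signal)
    return enabled


# Subset of the flags listed above.
JAVA_OPTS = (
    "-javaagent:/agents/opentelemetry-javaagent.jar "
    "-Dotel.service.name=perfshop "
    "-Dotel.exporter.otlp.endpoint=http://perfshop-tempo:4317 "
    "-Dotel.traces.exporter=otlp "
    "-Dotel.metrics.exporter=none "
    "-Dotel.logs.exporter=none"
)

props = system_properties(JAVA_OPTS)
enabled = explicitly_enabled_signals(props)
print(enabled)
```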

Tempo is configured with its metrics_generator, which produces two families of metrics derived from spans (service-graphs + span-metrics) and pushes them into Prometheus via remote-write — these metrics then feed panels such as "P95 latency by operation" on the Instructor APM dashboard.
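A percentile panel of this kind is computed from cumulative histogram buckets. The sketch below is a simplified version of that estimation (linear interpolation inside the bucket containing the target rank), in the spirit of PromQL's histogram_quantile rather than a faithful reimplementation. The bucket bounds and counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket at rank 0; nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets for one operation: 100 spans in total.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]

p95 = histogram_quantile(0.95, BUCKETS)
print(p95)
```

With these buckets the 95th span falls halfway through the 0.5 s to 1.0 s bucket, so the estimate depends on the bucket layout as much as on the data.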

See tempo.md for the details.

4. Continuous profiling — Pyroscope

The perfshop-backend image also embeds the Pyroscope Java agent in /agents/pyroscope.jar, with an aggressive configuration:

-javaagent:/agents/pyroscope.jar
-Dpyroscope.server.address=http://perfshop-pyroscope:4040
-Dpyroscope.application.name=perfshop
-Dpyroscope.format=jfr
-Dpyroscope.profiler.event=cpu
-Dpyroscope.profiler.alloc=512k
-Dpyroscope.profiler.lock=10ms
-Dpyroscope.profilingInterval=PT0.02S

Three profiles are collected continuously:

  • CPU: sampling every 20 ms (PT0.02S), via perf_event on Linux or itimer on Docker Desktop.
  • Heap allocations: a sample every 512 KB allocated.
  • Lock contention: any blocking ≥ 10 ms.
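The sampling settings translate into rates with a little arithmetic. This sketch parses the two values from the flags above and derives the resulting sample rates. The 100 MiB/s allocation rate is a hypothetical workload, the helper names are illustrative, and the `k` suffix is assumed to mean 1024 bytes.

```python
import re


def iso_seconds(duration):
    """Parse a seconds-only ISO-8601 duration such as 'PT0.02S'."""
    match = re.fullmatch(r"PT(\d+(?:\.\d+)?)S", duration)
    if match is None:
        raise ValueError("unsupported duration: %r" % duration)
    return float(match.group(1))


def size_bytes(value):
    """Parse a size such as '512k' (binary multiples assumed)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = value[-1].lower()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)


# CPU: one sample per 20 ms interval.
cpu_samples_per_second = round(1 / iso_seconds("PT0.02S"))

# Allocations: one sample per 512 KiB, for a hypothetical 100 MiB/s workload.
alloc_rate_bytes = 100 * 1024 ** 2
alloc_samples_per_second = alloc_rate_bytes / size_bytes("512k")

print(cpu_samples_per_second, alloc_samples_per_second)
```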

The format is JFR (Java Flight Recorder), parsed on the Pyroscope server side. These profiles feed the "Flamegraph" panels of the Student and Instructor APM dashboards — see pyroscope.md.

Visualization layer

flowchart TB
  subgraph datasources["Grafana datasources<br/>(automatic provisioning)"]
    direction LR
    DS_P["Prometheus<br/>(default)"]
    DS_L["Loki"]
    DS_T["Tempo"]
    DS_PY["Pyroscope"]
  end

  subgraph grafana["Grafana 12.0.0"]
    direction TB
    FE_E["Students folder<br/>(anonymous access<br/>Viewer role)"]
    FE_F["Instructors folder<br/>(Admin-only ACL<br/>set by grafana-seed)"]

    FE_E --> D1["dashboard-apm-eleve"]
    FE_E --> D2["dashboard-backend-eleve"]
    FE_E --> D3["dashboard-frontend-eleve"]
    FE_E --> D4["dashboard-jmeter"]
    FE_E --> D5["dashboard-logs-eleve"]

    FE_F --> D6["dashboard-apm-formateur"]
    FE_F --> D7["dashboard-backend-formateur"]
    FE_F --> D8["dashboard-frontend-formateur"]
    FE_F --> D9["dashboard-logs-formateur"]
    FE_F --> D10["perfshop-general-v1<br/>(home dashboard)"]
  end

  subgraph osd["OpenSearch Dashboards"]
    OSD_DASH["PerfShop — All Logs"]
  end

  subgraph mon["perfshop-monitoring"]
    HTML["Real-time HTML dashboard<br/>(2s polling on the Node side)"]
  end

  datasources --> grafana

PerfShop exposes three distinct visualization interfaces:

  1. Grafana (port 3002 by default): the main tool, with 10 shipped dashboards (5 Students + 5 Instructors), automatically provisioned datasources, and anonymous read access to the Students folder. See grafana.md and dashboards.md.
  2. OpenSearch Dashboards (port 5601 by default): an alternative interface for exploring the logs that OpenSearch indexes in full text. At first startup, perfshop-opensearch-seed creates 8 index patterns (perfshop-*, perfshop-spring, perfshop-nginx…) and imports the PerfShop — All Logs dashboard. See opensearch.md.
  3. HTML monitoring dashboard (port 3001 by default): a custom real-time view served by perfshop-monitoring (Node + Express), with 2 s polling, a compact display, and no Grafana dependency. This is the screen students keep open alongside their chaos page to see the immediate impact. See dashboard-html.md.

Cross-signal correlation

The Tempo and Loki datasources are configured to enable cross-correlation:

  • Trace → Log: tracesToLogsV2 on the Tempo side allows clicking on a span to open Loki with an automatic time filter (spanStartTimeShift: -1m, spanEndTimeShift: 1m) and a trace_id filter.
  • Trace → Metric: tracesToMetrics on the Tempo side allows clicking on a span to open Prometheus with a query on the operation in question.
  • Service Map: Tempo automatically builds the map of inter-service calls from the service-graphs produced by its metrics_generator.
  • Loki Search from Tempo: lokiSearch allows searching for logs related to a trace.

sequenceDiagram
  autonumber
  actor F as Instructor
  participant G as Grafana
  participant T as Tempo
  participant L as Loki
  participant P as Prometheus

  F->>G: Opens Instructor APM dashboard
  G->>T: TraceQL { span.exception.type="NullPointerException" }
  T-->>G: List of NPE traces
  F->>G: Clicks on a trace
  G->>T: GET /api/traces/{traceID}
  T-->>G: Detailed spans
  F->>G: Clicks on a span
  G->>L: Loki query { container="perfshop-app" } |= "{traceID}"<br/>(±1 min window around the span)
  L-->>G: Correlated logs
  F->>G: Clicks on "Metric for span"
  G->>P: PromQL p99 latency on the operation
  P-->>G: Time series
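The trace-to-log jump in this sequence amounts to widening the span's time range and filtering on the trace ID. Here is a minimal sketch of that computation, assuming the shifts of -1m and +1m mentioned above. The function name, the returned dictionary shape, and the trace ID are hypothetical; Grafana performs the equivalent step internally.

```python
from datetime import datetime, timedelta, timezone


def loki_window(span_start, span_end, trace_id,
                start_shift=timedelta(minutes=-1),
                end_shift=timedelta(minutes=1),
                container="perfshop-app"):
    """Build the log query and the widened time range for one span."""
    query = '{container="%s"} |= "%s"' % (container, trace_id)
    return {
        "query": query,
        "start": (span_start + start_shift).isoformat(),
        "end": (span_end + end_shift).isoformat(),
    }


# Hypothetical span lasting two seconds.
span_start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
span_end = span_start + timedelta(seconds=2)

window = loki_window(span_start, span_end, "4bf92f3577b34da6a3ce929d0e0e4736")
print(window)
```

The padding matters because log lines tied to a request are often written slightly before or after the span boundaries recorded by the agent.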

Recap — where to look at what?

| Question | Tool | Datasource | Dashboard / screen |
| --- | --- | --- | --- |
| "How many requests/s? P95 latency?" | Grafana | Prometheus | Student Backend / Student APM |
| "Why is my endpoint slow?" | Grafana | Tempo | Instructor APM (TraceQL) |
| "Which code consumes the CPU?" | Grafana | Pyroscope | APM (Flamegraph panel) |
| "Which logs were produced during the chaos?" | Grafana or OpenSearch | Loki or OpenSearch | Student Logs / Instructor Logs |
| "How many containers, how much RAM?" | perfshop-monitoring or Grafana | Prometheus (docker job) | HTML dashboard / perfshop-general-v1 |
| "My JMeter test run, how is it doing?" | Grafana | Prometheus (jmeter job) | dashboard-jmeter |
| "Are there any exceptions?" | Grafana | Loki or Tempo | Student Logs (ERROR filter) or Instructor APM ({span.exception.type=...}) |
| "Full-text search across all logs" | OpenSearch Dashboards | OpenSearch | Discover on perfshop-* |
| "Heap dump to analyze a memory leak" | perfshop-monitoring | Spring Boot Actuator | /api/heapdump endpoint (proxy to /actuator/heapdump) |

To go further