Observability: Overview
PerfShop is designed to be observable end to end. That is in fact the pedagogical reason for the platform's existence: a student must be able to see, in real time, what an active anomaly is doing to the system. To make this possible, every application service is instrumented and every type of signal has its own storage sink and its own visualization tool.
This page gives the overview of the observability stack. The following pages (prometheus, grafana, dashboards, loki, tempo, pyroscope, opensearch, dashboard-html) drill down into the details of each component.
Source of truth
All information in this section is extracted from the actual configuration files: prometheus/prometheus.yml, loki/loki-config.yml, promtail/promtail-config.yml, tempo/tempo-config.yml, pyroscope/pyroscope-config.yml, vector/vector.toml, grafana/provisioning/**, and the backend JAVA_OPTS in the compose files.
Four signals, four sinks
PerfShop handles the four pillars of modern observability: metrics, logs, traces, profiles. Each has its own export channel, its own format and its own storage sink, but all converge to Grafana for visualization.
flowchart LR
APP["perfshop-app<br/>Spring Boot 3.2 + Java 21"]
APP -->|"metrics<br/>HTTP scrape 5s<br/>/actuator/prometheus"| PROM[("Prometheus<br/>retention 7d / 5GB")]
APP -->|"traces<br/>OTLP gRPC :4317<br/>(OpenTelemetry agent)"| TEMPO[("Tempo 2.4.2<br/>local blocks")]
APP -->|"CPU/alloc/lock profiles<br/>JFR HTTP push :4040<br/>(Pyroscope agent)"| PYRO[("Pyroscope<br/>filesystem")]
APP -->|"stdout logs<br/>(Spring Boot JSON)"| DOC{"Docker logging<br/>driver json-file"}
DOC --> PT["Promtail<br/>(docker SD<br/>via socket)"]
DOC --> VEC["Vector<br/>(docker_logs<br/>via socket)"]
PT --> LOKI[("Loki<br/>retention 168h")]
VEC -->|"VRL transform<br/>parse JSON +<br/>service_family routing"| OS[("OpenSearch 2.13<br/>perfshop-{family}*")]
PROM --> GRAF["Grafana 12.0.0"]
LOKI --> GRAF
TEMPO --> GRAF
PYRO --> GRAF
OS --> OSD["OpenSearch<br/>Dashboards"]
TEMPO -.->|"metrics_generator<br/>service-graphs +<br/>span-metrics<br/>remote_write"| PROM
GRAF -.->|"tracesToLogsV2<br/>(trace ↔ log correlation)"| LOKI
GRAF -.->|"serviceMap"| TEMPO
style PROM fill:#e6532d20,stroke:#e6532d
style TEMPO fill:#0099ff20,stroke:#0099ff
style PYRO fill:#ff930020,stroke:#ff9300
style LOKI fill:#00b4fa20,stroke:#00b4fa
style OS fill:#00514520,stroke:#005145
style GRAF fill:#f4750020,stroke:#f47500
The four flows in detail
1. Metrics: Prometheus
- Source: Spring Boot Actuator + Micrometer expose metrics on perfshop-app:9090/actuator/prometheus. Three Prometheus jobs scrape every 5 seconds: perfshop-backend (the backend itself), perfshop-docker (the perfshop-monitoring service that aggregates Docker stats via socket), and jmeter (active only during a JMeter test run).
- Storage: local Prometheus TSDB, retention 7 days / 5 GB (--storage.tsdb.retention.time=7d, --storage.tsdb.retention.size=5GB).
- Transport: HTTP scrape pull (Prometheus pulls metrics from the exposers).
- Other ingestion: --web.enable-remote-write-receiver is enabled, which lets Tempo (via its metrics_generator) push service-graph and span-metrics directly into Prometheus.
See prometheus.md for the detailed configuration and 8 real PromQL queries taken from the shipped dashboards.
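To get a feel for what the 5-second scrape interval and the 7 d / 5 GB retention mean together, here is a rough back-of-the-envelope sketch. The bytes-per-sample figure and the reading of 5GB as 5 GiB are assumptions made for illustration; the estimate ignores index and WAL overhead and is not a sizing guarantee.

```python
# Rough retention arithmetic for the Prometheus settings described above.
SCRAPE_INTERVAL_S = 5            # scrape interval of the three jobs
RETENTION_DAYS = 7               # --storage.tsdb.retention.time=7d
RETENTION_BYTES = 5 * 1024 ** 3  # --storage.tsdb.retention.size=5GB, read as GiB (assumption)
BYTES_PER_SAMPLE = 2             # assumption: pessimistic size of one compressed sample


def samples_per_series(scrape_interval_s: int, retention_days: int) -> int:
    """Number of samples one time series accumulates over the retention window."""
    return retention_days * 24 * 3600 // scrape_interval_s


def max_series_within_size(retention_bytes: int, per_series_samples: int,
                           bytes_per_sample: int) -> int:
    """Upper bound on the series count that fits under the size limit (samples only)."""
    return retention_bytes // (per_series_samples * bytes_per_sample)


if __name__ == "__main__":
    per_series = samples_per_series(SCRAPE_INTERVAL_S, RETENTION_DAYS)
    print("samples per series over 7 days:", per_series)
    print("series fitting under the size limit:",
          max_series_within_size(RETENTION_BYTES, per_series, BYTES_PER_SAMPLE))
```

Whichever limit is reached first wins: if the number of active series is well below this bound, the 7-day limit is what actually removes old data.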
2. Logs: dual sink Loki and OpenSearch
PerfShop collects Docker logs twice, in parallel, into two different sinks. The redundancy is deliberate: it lets students compare two log-storage models side by side.
| Aspect | Loki + Promtail | OpenSearch + Vector |
|---|---|---|
| Model | Index on labels only, content stays as text | Full-text index on all fields |
| Querying | LogQL (similar to PromQL) | DQL / Lucene / OpenSearch queries |
| UI | Grafana → Explore (Loki datasource) | OpenSearch Dashboards |
| Storage cost | Very low (compression, no inverted index) | Higher (inverted index on all fields) |
| Use case | Filtering by container/level + substring search | Aggregations and facets, structured search |
| Source in PerfShop | Promtail via Docker SD socket | Vector via Docker logs source |
| Transformation | logfmt parsing + multiline Java stacktrace | Spring Boot JSON parsing + extraction of chaos_family, chaos_level, scenario_id via VRL |
| Retention | 168 h (7 days) | Indexed as long as the OpenSearch index exists (no automatic purge configured) |
| Index | A single index_* index | Seven indices, one per family: perfshop-spring, perfshop-nginx, perfshop-mysql, perfshop-jmeter, perfshop-qa, perfshop-forgejo, perfshop-observability |
See loki.md and opensearch.md for the details of each pipeline.
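The OpenSearch side of the table (JSON parsing, field extraction, routing by service family) can be illustrated with a small Python sketch. It is not the actual VRL program from vector/vector.toml. The container-to-family mapping, the fallback family, and the assumption that the chaos fields are top-level JSON keys are all hypothetical.

```python
import json

# Hypothetical mapping from container-name prefix to service family.
# Only the family names themselves come from the index list above.
CONTAINER_PREFIX_TO_FAMILY = {
    "perfshop-app": "spring",
    "perfshop-nginx": "nginx",
    "perfshop-mysql": "mysql",
}
FALLBACK_FAMILY = "observability"  # assumption: catch-all for everything else
CHAOS_FIELDS = ("chaos_family", "chaos_level", "scenario_id")


def route_log(container: str, raw_line: str) -> dict:
    """Parse one log line and decide which perfshop-{family} index receives it."""
    try:
        event = json.loads(raw_line)
        if not isinstance(event, dict):
            event = {"message": raw_line}
    except json.JSONDecodeError:
        # Non-JSON output (for example an access log) is kept as plain text.
        event = {"message": raw_line}

    family = next(
        (fam for prefix, fam in CONTAINER_PREFIX_TO_FAMILY.items()
         if container.startswith(prefix)),
        FALLBACK_FAMILY,
    )
    document = {
        "container": container,
        "service_family": family,
        "message": event.get("message", raw_line),
    }
    # Copy the chaos metadata only when the application actually emitted it.
    for field in CHAOS_FIELDS:
        if field in event:
            document[field] = event[field]
    return {"index": f"perfshop-{family}", "document": document}


if __name__ == "__main__":
    line = '{"message": "boom", "chaos_family": "cpu", "chaos_level": 3, "scenario_id": "S1"}'
    print(route_log("perfshop-app", line))
    print(route_log("perfshop-nginx", "GET /api/products 200"))
```

Because every extracted field becomes an indexed document field, aggregations such as "errors per chaos_family" are possible in OpenSearch, whereas Loki would treat the same content as text to filter at query time.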
3. Distributed traces: Tempo + OpenTelemetry
The perfshop-backend image embeds the OpenTelemetry Java agent in /agents/opentelemetry-javaagent.jar. It is unconditionally activated via JAVA_OPTS:
-javaagent:/agents/opentelemetry-javaagent.jar
-Dotel.service.name=perfshop
-Dotel.exporter.otlp.endpoint=http://perfshop-tempo:4317
-Dotel.exporter.otlp.protocol=grpc
-Dotel.traces.exporter=otlp
-Dotel.metrics.exporter=none
-Dotel.logs.exporter=none
-Dotel.instrumentation.http.capture-headers.server.request=X-Admin-Token,Content-Type
-Dotel.instrumentation.jdbc.captured-statements.enabled=true
-Dotel.span.attribute.count.limit=256
Key points:
- Traces export only (OTel metrics and logs are disabled — Prometheus and Loki/OpenSearch are the preferred sinks for those signals).
- Capture of the X-Admin-Token header on the server side: this is what enables the instructor panel "Traces with admin trigger" on the APM dashboard.
- JDBC statement capture: this is what makes it possible to observe the actual SQL queries in traces, and to support the Security S1 scenario (SQL injection).
- Limit of 256 attributes per span to avoid cardinality explosions.
Tempo is configured with its metrics_generator, which produces two families of metrics derived from spans (service-graphs + span-metrics) and pushes them into Prometheus via remote-write — these metrics then feed panels such as "P95 latency by operation" on the Instructor APM dashboard.
See tempo.md for the details.
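Span-metrics are exported as histograms, and a "P95 latency" panel is typically computed from the cumulative bucket counters. The sketch below shows the linear-interpolation estimate that PromQL's histogram_quantile performs, in simplified form. The bucket boundaries and counts are invented for illustration, and the actual panel query is not reproduced here.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Invented latency buckets in seconds: 50 spans <= 0.1 s, 90 <= 0.5 s, 100 <= 1 s.
    latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print("estimated P95 (s):", histogram_quantile(0.95, latency_buckets))
```

The result is an estimate bounded by the bucket layout: a coarse set of buckets gives a coarse P95, which is worth keeping in mind when reading such panels during a chaos scenario.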
4. Continuous profiling: Pyroscope
The perfshop-backend image also embeds the Pyroscope Java agent in /agents/pyroscope.jar, with an aggressive configuration:
-javaagent:/agents/pyroscope.jar
-Dpyroscope.server.address=http://perfshop-pyroscope:4040
-Dpyroscope.application.name=perfshop
-Dpyroscope.format=jfr
-Dpyroscope.profiler.event=cpu
-Dpyroscope.profiler.alloc=512k
-Dpyroscope.profiler.lock=10ms
-Dpyroscope.profilingInterval=PT0.02S
Three profiles are collected continuously:
- CPU: sampling every 20 ms (PT0.02S), via perf_event on Linux or itimer on Docker Desktop.
- Heap allocations: a sample every 512 KB allocated.
- Lock contention: any blocking ≥ 10 ms.
- Lock contention: any blocking ≥ 10 ms.
The format is JFR (Java Flight Recorder), parsed on the Pyroscope server side. These profiles feed the "Flamegraph" panels of the Student and Instructor APM dashboards — see pyroscope.md.
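To make these settings concrete, the following sketch converts them into approximate sample rates. It assumes that the k suffix means 1024 bytes and that sampling is evenly spaced; the 100 MiB/s allocation rate is an invented workload figure, not a PerfShop measurement.

```python
import re


def iso_seconds(duration: str) -> float:
    """Parse a seconds-only ISO-8601 duration such as 'PT0.02S'."""
    match = re.fullmatch(r"PT(\d+(?:\.\d+)?)S", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return float(match.group(1))


def size_bytes(size: str) -> int:
    """Parse a size such as '512k' (assumption: binary units, k = 1024)."""
    units = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    match = re.fullmatch(r"(\d+)([kmg]?)", size.lower())
    if match is None:
        raise ValueError(f"unsupported size: {size!r}")
    number, unit = match.groups()
    return int(number) * units[unit]


def cpu_samples_per_second(interval: str) -> float:
    """Stack samples taken per second of profiled CPU time."""
    return 1.0 / iso_seconds(interval)


def alloc_samples_per_second(alloc_rate_bytes_per_s: float, threshold: str) -> float:
    """Allocation samples per second for a given allocation rate."""
    return alloc_rate_bytes_per_s / size_bytes(threshold)


if __name__ == "__main__":
    print("CPU samples/s:", cpu_samples_per_second("PT0.02S"))
    # Invented workload: the JVM allocates 100 MiB per second.
    print("alloc samples/s:", alloc_samples_per_second(100 * 1024 ** 2, "512k"))
```

A shorter interval or a smaller allocation threshold gives finer flamegraphs at the cost of more profiling overhead and more data pushed to Pyroscope.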
Visualization layer
flowchart TB
subgraph datasources["Grafana datasources<br/>(automatic provisioning)"]
direction LR
DS_P["Prometheus<br/>(default)"]
DS_L["Loki"]
DS_T["Tempo"]
DS_PY["Pyroscope"]
end
subgraph grafana["Grafana 12.0.0"]
direction TB
FE_E["Students folder<br/>(anonymous access<br/>Viewer role)"]
FE_F["Instructors folder<br/>(Admin-only ACL<br/>set by grafana-seed)"]
FE_E --> D1["dashboard-apm-eleve"]
FE_E --> D2["dashboard-backend-eleve"]
FE_E --> D3["dashboard-frontend-eleve"]
FE_E --> D4["dashboard-jmeter"]
FE_E --> D5["dashboard-logs-eleve"]
FE_F --> D6["dashboard-apm-formateur"]
FE_F --> D7["dashboard-backend-formateur"]
FE_F --> D8["dashboard-frontend-formateur"]
FE_F --> D9["dashboard-logs-formateur"]
FE_F --> D10["perfshop-general-v1<br/>(home dashboard)"]
end
subgraph osd["OpenSearch Dashboards"]
OSD_DASH["PerfShop — All Logs"]
end
subgraph mon["perfshop-monitoring"]
HTML["Real-time HTML dashboard<br/>(2s polling on the Node side)"]
end
datasources --> grafana
PerfShop exposes three distinct visualization interfaces:
- Grafana (port 3002 by default): the main tool, with 10 shipped dashboards (5 Students + 5 Instructors), automatically provisioned datasources, and anonymous read access for the Students folder. See grafana.md and dashboards.md.
- OpenSearch Dashboards (port 5601 by default): alternative interface to explore logs full-text indexed by OpenSearch. The perfshop-opensearch-seed seed creates 8 index patterns (perfshop-*, perfshop-spring, perfshop-nginx…) and imports a PerfShop — All Logs dashboard at first startup. See opensearch.md.
- HTML monitoring dashboard (port 3001 by default): custom real-time view served by perfshop-monitoring (Node + Express). 2 s polling, compact display, no Grafana dependency. This is the screen the student keeps open alongside their chaos page to see the instant impact. See dashboard-html.md.
Cross-signal correlation
The Tempo and Loki datasources are configured to enable cross-correlation:
- Trace → Log: tracesToLogsV2 on the Tempo side allows clicking on a span to open Loki with an automatic time filter (spanStartTimeShift: -1m, spanEndTimeShift: 1m) and a trace_id filter.
- Trace → Metric: tracesToMetrics on the Tempo side allows clicking on a span to open Prometheus with a query on the operation in question.
- Service Map: Tempo automatically generates the inter-service call cartography from the service-graphs produced by its metrics_generator.
- Loki Search from Tempo: lokiSearch allows searching for logs related to a trace.
sequenceDiagram
autonumber
actor F as Instructor
participant G as Grafana
participant T as Tempo
participant L as Loki
participant P as Prometheus
F->>G: Opens Instructor APM dashboard
G->>T: TraceQL { span.exception.type="NullPointerException" }
T-->>G: List of NPE traces
F->>G: Clicks on a trace
G->>T: GET /api/traces/{traceID}
T-->>G: Detailed spans
F->>G: Clicks on a span
G->>L: Loki query { container="perfshop-app" } |= "{traceID}"<br/>(±1 min window around the span)
L-->>G: Correlated logs
F->>G: Clicks on "Metric for span"
G->>P: PromQL p99 latency on the operation
P-->>G: Time series
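The span-to-logs step of this sequence can be sketched in a few lines: widen the span's time range by the configured shifts, then filter the container's logs on the trace ID. The function name and return shape are hypothetical. Grafana builds this query internally, and the exact LogQL it generates may differ from the simplified form shown in the diagram.

```python
from datetime import datetime, timedelta, timezone


def span_log_query(
    trace_id: str,
    span_start: datetime,
    span_end: datetime,
    container: str = "perfshop-app",
    start_shift: timedelta = timedelta(minutes=-1),  # spanStartTimeShift: -1m
    end_shift: timedelta = timedelta(minutes=1),     # spanEndTimeShift: 1m
) -> dict:
    """Build the LogQL query and time window used to find logs for one span."""
    return {
        # Label selector on the container, then a substring filter on the trace ID.
        "query": f'{{container="{container}"}} |= "{trace_id}"',
        "start": span_start + start_shift,
        "end": span_end + end_shift,
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    end = start + timedelta(milliseconds=350)
    # "abc123" is a placeholder trace ID for illustration.
    print(span_log_query("abc123", start, end))
```

The one-minute margin on each side matters because log lines related to a request can be written slightly before the span starts or after it ends.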
Recap: where to look at what?
| Question | Tool | Datasource | Dashboard / screen |
|---|---|---|---|
| "How many requests/s? P95 latency?" | Grafana | Prometheus | Student Backend / Student APM |
| "Why is my endpoint slow?" | Grafana | Tempo | Instructor APM (TraceQL) |
| "Which code consumes the CPU?" | Grafana | Pyroscope | APM (Flamegraph panel) |
| "Which logs were produced during the chaos?" | Grafana or OpenSearch | Loki or OpenSearch | Student Logs / Instructor Logs |
| "How many containers, how much RAM?" | perfshop-monitoring or Grafana | Prometheus (docker job) | HTML dashboard / perfshop-general-v1 |
| "My JMeter test run, how is it doing?" | Grafana | Prometheus (jmeter job) | dashboard-jmeter |
| "Are there any exceptions?" | Grafana | Loki or Tempo | Student Logs (ERROR filter) or Instructor APM ({span.exception.type=...}) |
| "Full-text search across all logs" | OpenSearch Dashboards | OpenSearch | Discover on perfshop-* |
| "Heap dump to analyze a memory leak" | perfshop-monitoring | Spring Boot Actuator | /api/heapdump endpoint (proxy to /actuator/heapdump) |
To go further
- Prometheus — config, jobs, real PromQL examples
- Grafana — datasources, ACL, anonymous access, language
- Shipped dashboards — details of the 10 dashboards, panel by panel
- Loki and Promtail — log pipelines, multiline Java, LogQL examples
- Tempo (OpenTelemetry) — traces, JVM agents, span-metrics
- Pyroscope — continuous profiling, JFR format, flamegraphs
- OpenSearch and Vector — full-text indexing, VRL transform
- Real-time HTML dashboard — custom Node monitoring