Prometheus
Prometheus is the single metrics sink of PerfShop. It scrapes three targets every 5 seconds, keeps time series for 7 days, and feeds the Grafana datasources as well as the service-graphs/span-metrics panel produced by Tempo.
Source of truth
This page is taken from prometheus/prometheus.yml and the command: block of the prometheus service in the compose files.
Global configuration
global:
  scrape_interval: 15s # default, overridden by each job to 5s
  evaluation_interval: 15s # default, for rules
The Prometheus service is launched with these explicit command-line options:
--config.file=/etc/prometheus/prometheus.yml
--storage.tsdb.path=/prometheus
--storage.tsdb.retention.time=7d
--storage.tsdb.retention.size=5GB
--web.console.libraries=/usr/share/prometheus/console_libraries
--web.console.templates=/usr/share/prometheus/consoles
--web.enable-remote-write-receiver
--web.enable-lifecycle
| Option | Effect |
|---|---|
| --storage.tsdb.retention.time=7d | Automatic deletion of series older than 7 days |
| --storage.tsdb.retention.size=5GB | Hard cap on TSDB size |
| --web.enable-remote-write-receiver | Enables the /api/v1/write endpoint, used by Tempo (metrics_generator) to push its span-metrics and service-graphs |
| --web.enable-lifecycle | Enables /-/reload and /-/quit for hot reload without restart |
The internal management port is 9090, exposed on host port 9091 by default (PROMETHEUS_HTTP_PORT variable).
Scraped targets
PerfShop declares three Prometheus jobs, all with scrape_interval: 5s (more aggressive than the global default of 15 s, to allow fine real-time observation during chaos demos).
flowchart LR
PROM["prometheus<br/>(9090 internal)"]
subgraph t1["Job perfshop-backend"]
APP["perfshop-app:9090<br/>/actuator/prometheus"]
end
subgraph t2["Job perfshop-docker"]
MON["perfshop-monitoring:3001<br/>/metrics"]
end
subgraph t3["Job jmeter"]
JM["perfshop-jmeter:9270<br/>(active during a test run only)"]
end
PROM -->|scrape 5s| APP
PROM -->|scrape 5s| MON
PROM -->|scrape 5s| JM
Job perfshop-backend
- job_name: 'perfshop-backend'
  static_configs:
    - targets: ['perfshop-app:9090']
  metrics_path: '/actuator/prometheus'
  scrape_interval: 5s
This is the main job. It scrapes the Spring Boot Actuator endpoint exposed by Micrometer on management port 9090, which is deliberately separate from application port 8080 so that business traffic and observability traffic do not mix. All JVM, HTTP, JDBC, and Tomcat metrics, as well as the custom chaos metrics, are produced by this single endpoint.
The main metric families scraped:
| Family | Prefix / name | Origin |
|---|---|---|
| HTTP server | http_server_requests_seconds_* | Spring Boot Web Micrometer auto-config |
| JVM memory | jvm_memory_used_bytes, jvm_memory_max_bytes, jvm_memory_committed_bytes | Micrometer JVM binder |
| JVM threads | jvm_threads_live_threads, jvm_threads_daemon_threads, jvm_threads_peak_threads, jvm_threads_states_threads | Micrometer JVM binder |
| JVM GC | jvm_gc_pause_seconds_count, jvm_gc_pause_seconds_sum | Micrometer JVM binder |
| Tomcat | tomcat_threads_busy_threads, tomcat_threads_current_threads, tomcat_threads_config_max_threads | Micrometer Tomcat binder (enabled via management.metrics.enable.tomcat=true) |
| HikariCP | hikaricp_connections_active, _idle, _max, _pending, _acquire_seconds_* | Micrometer HikariCP binder, label pool="PerfShopHikariPool" |
| Process | process_cpu_usage, process_uptime_seconds, process_files_open_files | Micrometer process binder |
| Custom chaos | chaos_intensity{type="cpu\|memory\|thread_pool\|db_pool\|slow_query\|deadlock\|network"}, chaos_functional_level, chaos_functional_f4_corruption | Custom metrics declared in the backend chaos services |
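The HTTP histogram is enabled in application.yml, typically through Spring Boot's management.metrics.distribution.percentiles-histogram.http.server.requests property.
Without this setting, histogram_quantile() queries on http_server_requests_seconds_bucket would not work, because the buckets would not be exported.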
Job perfshop-docker
- job_name: 'perfshop-docker'
  static_configs:
    - targets: ['perfshop-monitoring:3001']
  metrics_path: '/metrics'
  scrape_interval: 5s
The perfshop-monitoring service (Node + Express) queries the Docker API via the Unix socket mounted as a bind mount, computes CPU/RAM/network/I/O statistics for the main containers, and exposes them in Prometheus format on its own /metrics route. See dashboard-html.md for the details.
The metrics produced by this job all use the docker_container_ prefix:
| Metric | Type | Description |
|---|---|---|
| docker_container_cpu_percent{container="..."} | gauge | CPU percentage computed as (cpuDelta / sysDelta) × numCpus × 100 |
| docker_container_mem_usage_bytes{container="..."} | gauge | Memory used (cache excluded) |
| docker_container_mem_limit_bytes{container="..."} | gauge | Container memory limit |
| docker_container_mem_percent{container="..."} | gauge | Percentage of memory used |
| docker_container_net_rx_bytes{container="..."} | counter | Cumulative bytes received |
| docker_container_net_tx_bytes{container="..."} | counter | Cumulative bytes sent |
| docker_container_io_read_bytes{container="..."} | counter | Cumulative disk bytes read |
| docker_container_io_write_bytes{container="..."} | counter | Cumulative disk bytes written |
| docker_container_pids{container="..."} | gauge | Number of processes in the container |
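The CPU formula in the table can be checked by hand. The sketch below is a minimal Python illustration of that arithmetic, not the Node code of perfshop-monitoring; the function name, parameter names, and the guard for unusable deltas are assumptions made for the example.

```python
def cpu_percent(cpu_delta: float, sys_delta: float, num_cpus: int) -> float:
    """CPU percentage as (cpuDelta / sysDelta) x numCpus x 100.

    cpu_delta and sys_delta are the differences between two successive
    stats samples: container CPU time and total host CPU time.
    """
    if sys_delta <= 0 or cpu_delta < 0:
        # Assumption: report 0 when the deltas are unusable
        # (first sample, container restart, counter reset).
        return 0.0
    return (cpu_delta / sys_delta) * num_cpus * 100.0


if __name__ == "__main__":
    # A container that consumed 5% of the host CPU time on a 4-CPU host.
    print(cpu_percent(cpu_delta=2e8, sys_delta=4e9, num_cpus=4))
```

With this formula, 100 means one fully used core, so a container on a 4-CPU host can reach 400.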
The same /metrics route also re-exposes the browser metrics that chaos-agent.js pushes from the frontend every 2 seconds (POST /api/chaos/client-metrics):
| Metric | Description |
|---|---|
| perfshop_client_fps | FPS measured in the student's browser |
| perfshop_client_heap_used_mb | JS heap used (MB) |
| perfshop_client_long_tasks_per_sec | Long tasks (tasks > 50 ms) per second |
| perfshop_client_fetch_req_per_sec | Fetch requests issued per second |
| perfshop_client_dom_node_count | DOM node count |
| perfshop_client_cpu_worker_active | 1 if a CPU-intensive Web Worker is running in the browser |
| perfshop_client_last_received_timestamp | Unix timestamp (ms) of the last received push, useful to compute freshness |
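Freshness can be derived from perfshop_client_last_received_timestamp by comparing it with the current time. In PromQL this would presumably look like time() - perfshop_client_last_received_timestamp / 1000. The Python sketch below does the same arithmetic; the function names and the 6-second staleness threshold (three missed pushes) are assumptions, not values taken from the project.

```python
import time
from typing import Optional


def client_metrics_age_seconds(last_received_ms: float,
                               now_s: Optional[float] = None) -> float:
    """Age of the last browser push in seconds (the metric is in milliseconds)."""
    if now_s is None:
        now_s = time.time()
    return now_s - last_received_ms / 1000.0


def is_stale(last_received_ms: float,
             now_s: Optional[float] = None,
             max_age_s: float = 6.0) -> bool:
    # Assumption: 6 s corresponds to three missed pushes at the 2-second interval.
    return client_metrics_age_seconds(last_received_ms, now_s) > max_age_s
```

A stale value usually means that no browser tab is currently running chaos-agent.js, so the other perfshop_client_* gauges should be read with caution.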
Monitored containers
The Node code of perfshop-monitoring only monitors four containers: perfshop-frontend, perfshop-app, perfshop-db, perfshop-monitoring. This is intentional — the HTML dashboard targets the "front → back → DB" chain, and the Grafana "General Containers View" table uses the same job.
Job jmeter
- job_name: 'jmeter'
  static_configs:
    - targets: ['perfshop-jmeter:9270']
  scrape_interval: 5s
  honor_labels: true
The perfshop-jmeter container idles permanently (tail -f /dev/null) and exposes nothing on port 9270 between test runs. When a run is launched via perfshop-jmeter-ui, the JMeter Prometheus listener plugin opens port 9270 for the duration of the run and exposes jmeter_* metrics. Once the run is over, the port closes and Prometheus reports scrape errors for this target. This is normal and expected, and is documented as a comment in prometheus.yml.
honor_labels: true makes Prometheus keep the labels exported by JMeter (notably label="...", which carries the sampler name) whenever they conflict with the labels Prometheus attaches to the job's targets, instead of overwriting them.
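To make the effect of honor_labels concrete, the following Python sketch mimics how Prometheus resolves a conflict between a scraped label and a label it attaches to the target (such as job or instance). It is a simplified model written for this page, not Prometheus code; the function name and the sample label values are hypothetical.

```python
def attach_target_labels(scraped: dict, target: dict, honor_labels: bool) -> dict:
    """Merge labels from the scraped sample with the target's own labels."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in scraped:
            # No conflict: the target label is simply attached.
            merged[name] = value
        elif not honor_labels:
            # Conflict, default behaviour: the target label wins and the
            # scraped value is preserved under an "exported_" prefix.
            merged["exported_" + name] = scraped[name]
            merged[name] = value
        # Conflict with honor_labels=True: the scraped value is kept as-is.
    return merged


if __name__ == "__main__":
    scraped = {"label": "Login", "job": "load-test"}  # hypothetical JMeter labels
    target = {"job": "jmeter", "instance": "perfshop-jmeter:9270"}
    print(attach_target_labels(scraped, target, honor_labels=True))
    print(attach_target_labels(scraped, target, honor_labels=False))
```

In this model a label with no conflict, such as label="Login", passes through unchanged in both modes; only conflicting names are affected.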
Metrics exposed by the JMeter Prometheus listener plugin:
| Metric | Description |
|---|---|
| jmeter_threads{state="active\|started\|finished"} | Active / started / finished virtual threads (vUsers) |
| jmeter_transactions_total{label="..."} | Cumulative transaction counter per sampler |
| jmeter_response_time{quantile="0.5\|0.95\|0.99",label="..."} | Latency quantiles per sampler |
| jmeter_response_time_count{label="..."} | Sample counter per sampler |
| jmeter_response_time_sum{label="..."} | Cumulative sum of latencies (for average computation) |
| jmeter_success_ratio_success{label="..."} | Success counter |
| jmeter_success_ratio_failure{label="..."} | Failure counter |
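The _sum/_count pair and the success/failure counters are meant to be combined. As a minimal sketch, assuming the values below are counter increases over the same time window for one sampler, the average latency and the failure ratio can be computed as follows. The function names are hypothetical, and the latency unit is whatever unit the plugin reports.

```python
from typing import Optional


def average_response_time(rt_sum: float, rt_count: float) -> Optional[float]:
    """Average latency = increase of _sum / increase of _count."""
    return rt_sum / rt_count if rt_count > 0 else None


def failure_ratio(success: float, failure: float) -> Optional[float]:
    """Share of failed samples, between 0 and 1."""
    total = success + failure
    return failure / total if total > 0 else None


if __name__ == "__main__":
    print(average_response_time(rt_sum=1500.0, rt_count=10))
    print(failure_ratio(success=95, failure=5))
```

Both helpers return None when the sampler recorded nothing in the window, which mirrors the empty or NaN result a division of two zero rates gives in PromQL.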
PromQL query examples
All the queries below are extracted from the Grafana dashboards actually shipped in grafana/dashboards/{eleves,formateurs}/*.json. They are therefore both pedagogical and representative of real usage.
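Any of these queries can also be run outside Grafana through the Prometheus HTTP API (instant queries are served at /api/v1/query). The Python sketch below only builds the request URL; the helper name is hypothetical, and the base URL assumes the default host port 9091 mentioned on this page.

```python
from urllib.parse import urlencode

# Assumption: default host port; adjust if PROMETHEUS_HTTP_PORT was changed.
PROMETHEUS_URL = "http://localhost:9091"


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """URL of an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    print(instant_query_url('hikaricp_connections_pending{job="perfshop-backend"}'))
```

The resulting URL can be opened with any HTTP client; the response is a JSON document whose data.result field holds the matching series.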
1. P95 HTTP latency, all routes combined
histogram_quantile(
  0.95,
  sum(rate(http_server_requests_seconds_bucket{job="perfshop-backend"}[5m])) by (le)
) * 1000
This is the reference query for observing the global backend latency. The * 1000 converts seconds to milliseconds. The [5m] window is wide enough to absorb short variations while remaining responsive.
Variants: 0.50 for the median, 0.99 for p99. The "HTTP response times" panel uses all three in parallel.
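To see why the exported buckets matter, it helps to reproduce what histogram_quantile() does with them: it finds the bucket containing the requested rank and interpolates linearly inside it. The Python sketch below is a simplified model of that logic for classic histograms, not the Prometheus implementation, and the bucket values are invented for the example.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    ordered = sorted(buckets)
    total = ordered[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: return the highest finite bound.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical cumulative buckets (le in seconds, request count).
    buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, buckets) * 1000, "ms")
```

With these sample buckets the 95th request falls halfway through the 0.5 s to 1.0 s bucket, so the estimate is about 750 ms. The result is only as precise as the bucket boundaries, which is why very wide buckets give coarse percentiles.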
2. HTTP 5xx error rate (as a percentage)
sum(rate(http_server_requests_seconds_count{job="perfshop-backend",outcome="SERVER_ERROR"}[5m]))
/
sum(rate(http_server_requests_seconds_count{job="perfshop-backend"}[5m]))
* 100
Uses the outcome label produced by Spring Boot Actuator (SUCCESS, CLIENT_ERROR, SERVER_ERROR, INFORMATIONAL, REDIRECTION, UNKNOWN).
3. JVM heap used (sum of all heap pools)
area="heap" filters the heap memory pools (Eden, Survivor, Old Gen). For maximum heap, the same filter applies to jvm_memory_max_bytes.
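The query text for this panel does not appear on this page. A plausible form, given the heading and the area label, is sum(jvm_memory_used_bytes{job="perfshop-backend",area="heap"}); this is an assumption, not a copy of the dashboard JSON. The Python sketch below shows what such a sum computes, using invented samples and a hypothetical helper name.

```python
def sum_by_area(samples, area="heap"):
    """Sum the values whose `area` label matches, like sum(metric{area="heap"})."""
    return sum(value for labels, value in samples if labels.get("area") == area)


# Hypothetical jvm_memory_used_bytes samples; pool ids depend on the garbage collector.
SAMPLES = [
    ({"area": "heap", "id": "G1 Eden Space"}, 64 * 1024**2),
    ({"area": "heap", "id": "G1 Survivor Space"}, 8 * 1024**2),
    ({"area": "heap", "id": "G1 Old Gen"}, 128 * 1024**2),
    ({"area": "nonheap", "id": "Metaspace"}, 96 * 1024**2),
]

if __name__ == "__main__":
    print(sum_by_area(SAMPLES) / 1024**2, "MiB of heap used")
```

The non-heap pool is excluded by the label filter, which is the whole point of area="heap": without it, Metaspace and code cache would be added to the heap figure.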
4. HikariCP pool: active vs max connections
hikaricp_connections_active{job="perfshop-backend"}
hikaricp_connections_idle{job="perfshop-backend"}
hikaricp_connections_max{job="perfshop-backend"}
hikaricp_connections_pending{job="perfshop-backend"}
The four metrics are overlaid in the "DB connections — HikariCP pool state" panel of the Student Backend dashboard. When pending > 0, at least one thread is waiting for a connection, which is a typical symptom of pool exhaustion and is what the db_pool backend chaos triggers.
5. DB connection acquisition latency
rate(hikaricp_connections_acquire_seconds_sum{pool="PerfShopHikariPool"}[1m])
/
rate(hikaricp_connections_acquire_seconds_count{pool="PerfShopHikariPool"}[1m])
* 1000
Computes the average time, in milliseconds, spent waiting for a HikariCP connection. The pool="PerfShopHikariPool" label comes from the name configured in application.yml (spring.datasource.hikari.pool-name).
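Dividing the two rates works because both are taken over the same window: the per-second divisor cancels out, leaving the increase of _sum divided by the increase of _count. The Python sketch below reproduces that arithmetic on two hypothetical counter readings; the function name and the sample values are assumptions.

```python
from typing import Optional


def average_acquire_ms(sum_start: float, sum_end: float,
                       count_start: float, count_end: float) -> Optional[float]:
    """Average wait per acquisition over a window, in milliseconds."""
    acquired = count_end - count_start
    if acquired <= 0:
        # No acquisition in the window; PromQL would yield NaN from 0 / 0.
        return None
    return (sum_end - sum_start) / acquired * 1000.0


if __name__ == "__main__":
    # 30 acquisitions that cost 0.6 s in total over the window.
    print(average_acquire_ms(1.0, 1.6, 100, 130))
```

In this example the average is about 20 ms per acquisition. A value that climbs while hikaricp_connections_pending is above zero points to pool exhaustion rather than a slow database.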
6. Tomcat HTTP threads: busy vs configured
tomcat_threads_busy_threads{job="perfshop-backend"}
tomcat_threads_current_threads{job="perfshop-backend"}
tomcat_threads_config_max_threads{job="perfshop-backend"}
The three curves are plotted together and normally satisfy busy ≤ current ≤ config_max. When busy sticks to config_max, the HTTP thread pool is exhausted, which is what the thread_pool backend chaos triggers.
7. Request throughput 2xx vs 5xx vs 4xx
rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"2.."}[1m])
rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"5.."}[1m])
rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"4.."}[1m])
Uses regex on the status label. The three series are stacked in the "HTTP request throughput" panel to visualize the ratio.
8. Current chaos intensity (type by type)
chaos_intensity{type="cpu"}
chaos_intensity{type="memory"}
chaos_intensity{type="thread_pool"}
chaos_intensity{type="db_pool"}
chaos_intensity{type="slow_query"}
chaos_intensity{type="deadlock"}
chaos_intensity{type="network"}
This is a custom gauge declared by the backend ChaosService. Its value (between 0 and 100, or 0 and the max level) reflects the current intensity of each backend chaos. The instructor "Time evolution of chaos anomalies" panel overlays the seven series for a global view of what is active.
Bonus: counting instrumented operations
Returns the number of distinct URIs known to Spring Boot. Useful to confirm that new endpoints have been correctly instrumented after a backend update.
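The query itself does not appear on this page. One common way to count distinct label values in PromQL is a nested count, for example count(count by (uri) (http_server_requests_seconds_count{job="perfshop-backend"})); this form is an assumption, not a copy of the dashboard. The Python sketch below shows the equivalent computation on invented label sets.

```python
def distinct_uris(series_labels) -> int:
    """Number of distinct `uri` label values across a set of series."""
    return len({labels["uri"] for labels in series_labels if "uri" in labels})


# Hypothetical label sets of http_server_requests_seconds_count series.
SERIES = [
    {"method": "GET", "status": "200", "uri": "/api/products"},
    {"method": "GET", "status": "500", "uri": "/api/products"},
    {"method": "POST", "status": "200", "uri": "/api/cart"},
]

if __name__ == "__main__":
    print(distinct_uris(SERIES))
```

Several series share the same uri because method, status, and outcome are also labels, so counting series directly would overestimate the number of endpoints.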
Expected scrape errors
The jmeter job blinks red between test runs — this is normal
As long as no JMeter test run is in progress, no process listens on port 9270, so every scrape of this target fails and Prometheus reports it as down. This is intentional and documented as a comment in prometheus.yml. To hide these errors during a demo, temporarily disable the job in Prometheus, or ignore the series in Grafana.
Useful endpoints on the Prometheus side
| Endpoint | Usage |
|---|---|
| http://localhost:9091/ | Prometheus UI |
| http://localhost:9091/targets | State of the scraped targets |
| http://localhost:9091/graph | PromQL query console |
| http://localhost:9091/api/v1/write | Remote-write endpoint (used by Tempo) |
| http://localhost:9091/-/reload | Reloads the configuration without restart (POST or PUT request) |
| http://localhost:9091/-/healthy | Healthcheck |
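The reload and health endpoints are easy to script. The sketch below uses only the Python standard library; the helper names are hypothetical and the base URL assumes the default host port 9091. Note that /-/reload must be called with POST or PUT and only works because --web.enable-lifecycle is set.

```python
import urllib.request

# Assumption: default host port; adjust if PROMETHEUS_HTTP_PORT was changed.
PROMETHEUS_URL = "http://localhost:9091"


def build_reload_request(base_url: str = PROMETHEUS_URL) -> urllib.request.Request:
    """Prepare the hot-reload call; a plain GET would be rejected."""
    return urllib.request.Request(f"{base_url}/-/reload", method="POST")


def is_healthy(base_url: str = PROMETHEUS_URL, timeout: float = 2.0) -> bool:
    """True if /-/healthy answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors.
        return False


if __name__ == "__main__":
    if is_healthy():
        urllib.request.urlopen(build_reload_request(), timeout=5.0)
        print("configuration reloaded")
    else:
        print("Prometheus is not reachable")
```

Checking health before reloading avoids a confusing stack trace when the stack is simply not started.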
To go further
- Overview: global observability flow
- Grafana: Prometheus datasource and exploration
- Shipped dashboards: panel-by-panel detail of the 10 dashboards
- Tempo: metrics_generator that pushes its span-metrics into Prometheus via remote-write
- Real-time HTML dashboard: Node code that produces the docker_container_* metrics