Prometheus

Prometheus is the single metrics sink of PerfShop. It scrapes three targets every 5 seconds, keeps time series for 7 days (within a 5 GB size cap), and backs the Grafana Prometheus datasource, including the service-graph and span-metrics panels built from the metrics that Tempo generates.

Source of truth

This page is derived from prometheus/prometheus.yml and from the command: block of the prometheus service in the compose files.

Global configuration

global:
  scrape_interval: 15s        # default, overridden by each job to 5s
  evaluation_interval: 15s    # default, for rules

The Prometheus service is launched with these explicit command-line options:

--config.file=/etc/prometheus/prometheus.yml
--storage.tsdb.path=/prometheus
--storage.tsdb.retention.time=7d
--storage.tsdb.retention.size=5GB
--web.console.libraries=/usr/share/prometheus/console_libraries
--web.console.templates=/usr/share/prometheus/consoles
--web.enable-remote-write-receiver
--web.enable-lifecycle

| Option | Effect |
| --- | --- |
| --storage.tsdb.retention.time=7d | Automatic deletion of series older than 7 days |
| --storage.tsdb.retention.size=5GB | Hard cap on TSDB size |
| --web.enable-remote-write-receiver | Enables the /api/v1/write endpoint, used by Tempo (metrics_generator) to push its span-metrics and service-graphs |
| --web.enable-lifecycle | Enables /-/reload and /-/quit for hot reload without restart |

The internal management port is 9090, exposed on host port 9091 by default (PROMETHEUS_HTTP_PORT variable).
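As a minimal sketch of how the lifecycle endpoint can be used, the snippet below builds and sends the reload request. The function names are hypothetical, and the base URL assumes the default host port 9091 described above. Prometheus only accepts POST or PUT on /-/reload, and only when --web.enable-lifecycle is set.

```python
import urllib.request

DEFAULT_BASE_URL = "http://localhost:9091"  # assumption: default PROMETHEUS_HTTP_PORT


def build_reload_request(base_url: str = DEFAULT_BASE_URL) -> urllib.request.Request:
    """Build the request that asks Prometheus to re-read prometheus.yml."""
    # /-/reload rejects GET; it must be called with POST (or PUT).
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str = DEFAULT_BASE_URL, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling reload_prometheus() after editing prometheus.yml should return 200 when the new configuration is valid; an invalid configuration makes the call raise an HTTP error and Prometheus keeps running with the previous one.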

Scraped targets

PerfShop declares three Prometheus jobs, all with scrape_interval: 5s (more aggressive than the global default of 15 s, to allow fine real-time observation during chaos demos).

flowchart LR
  PROM["prometheus<br/>(9090 internal)"]

  subgraph t1["Job perfshop-backend"]
    APP["perfshop-app:9090<br/>/actuator/prometheus"]
  end

  subgraph t2["Job perfshop-docker"]
    MON["perfshop-monitoring:3001<br/>/metrics"]
  end

  subgraph t3["Job jmeter"]
    JM["perfshop-jmeter:9270<br/>(active during a test run only)"]
  end

  PROM -->|scrape 5s| APP
  PROM -->|scrape 5s| MON
  PROM -->|scrape 5s| JM

Job perfshop-backend

- job_name: 'perfshop-backend'
  static_configs:
    - targets: ['perfshop-app:9090']
  metrics_path: '/actuator/prometheus'
  scrape_interval: 5s

This is the main job. It scrapes the Spring Boot Actuator endpoint exposed by Micrometer on management port 9090, which is deliberately separate from application port 8080 so that business traffic and observability traffic do not mix. All JVM, HTTP, JDBC, and Tomcat metrics, plus the custom chaos metrics, come from this single endpoint.

The main metric families scraped:

| Family | Prefix / name | Origin |
| --- | --- | --- |
| HTTP server | http_server_requests_seconds_* | Spring Boot Web Micrometer auto-config |
| JVM memory | jvm_memory_used_bytes, jvm_memory_max_bytes, jvm_memory_committed_bytes | Micrometer JVM binder |
| JVM threads | jvm_threads_live_threads, jvm_threads_daemon_threads, jvm_threads_peak_threads, jvm_threads_states_threads | Micrometer JVM binder |
| JVM GC | jvm_gc_pause_seconds_count, jvm_gc_pause_seconds_sum | Micrometer JVM binder |
| Tomcat | tomcat_threads_busy_threads, tomcat_threads_current_threads, tomcat_threads_config_max_threads | Micrometer Tomcat binder (enabled via management.metrics.enable.tomcat=true) |
| HikariCP | hikaricp_connections_active, _idle, _max, _pending, _acquire_seconds_* | Micrometer HikariCP binder, label pool="PerfShopHikariPool" |
| Process | process_cpu_usage, process_uptime_seconds, process_files_open_files | Micrometer process binder |
| Custom chaos | chaos_intensity{type="cpu\|memory\|thread_pool\|db_pool\|slow_query\|deadlock\|network"}, chaos_functional_level, chaos_functional_f4_corruption | Custom metrics declared in the backend chaos services |

The HTTP histogram is enabled in application.yml:

management:
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true

Without this setting, the http_server_requests_seconds_bucket series are not exported, so histogram_quantile() queries on them return nothing.

Job perfshop-docker

- job_name: 'perfshop-docker'
  static_configs:
    - targets: ['perfshop-monitoring:3001']
  metrics_path: '/metrics'
  scrape_interval: 5s

The perfshop-monitoring service (Node + Express) queries the Docker API via the Unix socket mounted as a bind mount, computes CPU/RAM/network/I/O statistics for the main containers, and exposes them in Prometheus format on its own /metrics route. See dashboard-html.md for the details.

The metrics produced by this job all use the docker_container_ prefix:

| Metric | Type | Description |
| --- | --- | --- |
| docker_container_cpu_percent{container="..."} | gauge | CPU percentage computed as (cpuDelta / sysDelta) × numCpus × 100 |
| docker_container_mem_usage_bytes{container="..."} | gauge | Memory used (cache excluded) |
| docker_container_mem_limit_bytes{container="..."} | gauge | Container memory limit |
| docker_container_mem_percent{container="..."} | gauge | Percentage of memory used |
| docker_container_net_rx_bytes{container="..."} | counter | Cumulative bytes received |
| docker_container_net_tx_bytes{container="..."} | counter | Cumulative bytes sent |
| docker_container_io_read_bytes{container="..."} | counter | Cumulative disk bytes read |
| docker_container_io_write_bytes{container="..."} | counter | Cumulative disk bytes written |
| docker_container_pids{container="..."} | gauge | Number of processes in the container |
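The CPU formula used for docker_container_cpu_percent can be illustrated with a short Python sketch. The actual implementation lives in the Node service; the function name, the parameter names, and the choice to return 0 when a delta is not positive are assumptions made for this example.

```python
def cpu_percent(cpu_total: float, prev_cpu_total: float,
                system_total: float, prev_system_total: float,
                num_cpus: int) -> float:
    """Apply (cpuDelta / sysDelta) x numCpus x 100 to two consecutive samples."""
    cpu_delta = cpu_total - prev_cpu_total        # CPU time consumed by the container
    sys_delta = system_total - prev_system_total  # CPU time consumed by the whole host
    if cpu_delta <= 0 or sys_delta <= 0:
        # Assumption: treat a missing or reset sample as "no usage".
        return 0.0
    return cpu_delta / sys_delta * num_cpus * 100.0
```

With this formula a container that saturates one core on a 4-core host reads as 100 %, so the value can exceed 100 on multi-core hosts.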

The same /metrics route also re-exposes the browser metrics that chaos-agent.js pushes from the frontend every 2 seconds (POST /api/chaos/client-metrics):

| Metric | Description |
| --- | --- |
| perfshop_client_fps | FPS measured in the student's browser |
| perfshop_client_heap_used_mb | JS heap used (MB) |
| perfshop_client_long_tasks_per_sec | Long tasks (tasks > 50 ms) per second |
| perfshop_client_fetch_req_per_sec | Fetch requests issued per second |
| perfshop_client_dom_node_count | DOM node count |
| perfshop_client_cpu_worker_active | 1 if a CPU-intensive Web Worker is running in the browser |
| perfshop_client_last_received_timestamp | Unix timestamp (ms) of the last received push, useful to compute freshness |
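Freshness can be derived from perfshop_client_last_received_timestamp by subtracting it from the current time. The sketch below shows the arithmetic; the function names and the 10-second staleness threshold (five missed pushes at the 2-second cadence) are assumptions for illustration, not values taken from the project.

```python
import time
from typing import Optional


def client_metrics_age_seconds(last_received_ms: float,
                               now_s: Optional[float] = None) -> float:
    """Return how many seconds ago the browser last pushed its metrics."""
    if now_s is None:
        now_s = time.time()
    # The metric is expressed in milliseconds, time.time() in seconds.
    return now_s - last_received_ms / 1000.0


def client_metrics_are_stale(last_received_ms: float,
                             max_age_s: float = 10.0,  # hypothetical threshold
                             now_s: Optional[float] = None) -> bool:
    """True when no push has been received within max_age_s seconds."""
    return client_metrics_age_seconds(last_received_ms, now_s) > max_age_s
```

A stale value means the other perfshop_client_* gauges still show the last pushed numbers and no longer describe what the browser is doing.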

Monitored containers

The Node code of perfshop-monitoring only monitors four containers: perfshop-frontend, perfshop-app, perfshop-db, perfshop-monitoring. This is intentional — the HTML dashboard targets the "front → back → DB" chain, and the Grafana "General Containers View" table uses the same job.

Job jmeter

- job_name: 'jmeter'
  static_configs:
    - targets: ['perfshop-jmeter:9270']
  scrape_interval: 5s
  honor_labels: true

The perfshop-jmeter container idles permanently (tail -f /dev/null) and exposes nothing on port 9270 between test runs. When a run is launched via perfshop-jmeter-ui, the JMeter Prometheus listener plugin opens port 9270 for the duration of the run and exposes jmeter_* metrics. Once the run is over, the port closes and Prometheus reports scrape errors for the target. This is normal and expected, and is documented as a comment in prometheus.yml.

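honor_labels: true tells Prometheus to keep the labels exported by JMeter (notably label="...", which carries the sampler name) whenever they collide with labels that Prometheus attaches on its side, such as job or instance. With the default honor_labels: false, a colliding scraped label would instead be renamed with an exported_ prefix.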
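The conflict rule behind honor_labels can be modelled in a few lines of Python. This is a simplified sketch of the documented behaviour, not Prometheus code: the function name is hypothetical, and edge cases such as empty label values are ignored.

```python
def merge_labels(scraped: dict, target: dict, honor_labels: bool) -> dict:
    """Combine labels from the scraped sample with the labels of the target."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped:
            if honor_labels:
                # Conflict: keep the scraped value, ignore the target label.
                continue
            # Conflict: keep the target label, rename the scraped one.
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged
```

For a sample exported as {label="Login", job="loadtest"} (illustrative values) and scraped from the jmeter job, honor_labels=True keeps job="loadtest", while honor_labels=False yields job="jmeter" plus exported_job="loadtest". Labels that do not collide, such as label, are kept in both modes.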

Metrics exposed by the JMeter Prometheus listener plugin:

| Metric | Description |
| --- | --- |
| jmeter_threads{state="active\|started\|finished"} | Active / started / finished virtual threads (vUsers) |
| jmeter_transactions_total{label="..."} | Cumulative transaction counter per sampler |
| jmeter_response_time{quantile="0.5\|0.95\|0.99",label="..."} | Latency quantiles per sampler |
| jmeter_response_time_count{label="..."} | Sample counter per sampler |
| jmeter_response_time_sum{label="..."} | Cumulative sum of latencies (for average computation) |
| jmeter_success_ratio_success{label="..."} | Success counter |
| jmeter_success_ratio_failure{label="..."} | Failure counter |

PromQL query examples

All the queries below are extracted from the Grafana dashboards actually shipped in grafana/dashboards/{eleves,formateurs}/*.json. They are therefore both pedagogical and representative of real usage.

1. P95 HTTP latency (all routes combined)

histogram_quantile(
  0.95,
  sum(rate(http_server_requests_seconds_bucket{job="perfshop-backend"}[5m])) by (le)
) * 1000

This is the reference query for observing the global backend latency. The * 1000 converts seconds to milliseconds. The [5m] window is wide enough to absorb short variations while remaining responsive.

Variants: 0.50 for the median, 0.99 for p99. The "HTTP response times" panel uses all three in parallel.
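To make the query above less of a black box, here is a simplified Python sketch of the estimation that histogram_quantile() performs on classic buckets: it locates the bucket that contains the requested rank, then interpolates linearly inside it. The function is a teaching aid under stated assumptions (0 < q ≤ 1, lowest bucket starting at 0), not the Prometheus implementation, and the bucket values in the usage note are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must contain the +Inf bucket, as exported in the `le` label.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        return math.nan
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # number of observations below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls beyond the last finite bucket: cap the result.
                return ordered[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan
```

With buckets [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)], the 0.95 quantile lands in the 0.5 to 1.0 bucket and is estimated at about 0.8125 s. The 0.99 quantile lands in the +Inf bucket and is capped at 1.0 s, which is why a flat p99 equal to the largest bucket bound usually means the real latency is higher.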

2. HTTP 5xx error rate (as a percentage)

sum(rate(http_server_requests_seconds_count{job="perfshop-backend",outcome="SERVER_ERROR"}[5m]))
/
sum(rate(http_server_requests_seconds_count{job="perfshop-backend"}[5m]))
* 100

Uses the outcome label produced by Spring Boot Actuator (SUCCESS, CLIENT_ERROR, SERVER_ERROR, INFORMATIONAL, REDIRECTION, UNKNOWN).

3. JVM heap used (sum of all heap pools)

sum(jvm_memory_used_bytes{job="perfshop-backend",area="heap"})

area="heap" filters the heap memory pools (Eden, Survivor, Old Gen). For maximum heap:

sum(jvm_memory_max_bytes{job="perfshop-backend",area="heap"})

4. HikariCP pool — active vs max connections

hikaricp_connections_active{job="perfshop-backend"}
hikaricp_connections_idle{job="perfshop-backend"}
hikaricp_connections_max{job="perfshop-backend"}
hikaricp_connections_pending{job="perfshop-backend"}

The four metrics are overlaid in the "DB connections — HikariCP pool state" panel of the Student Backend dashboard. When pending > 0, a connection is being awaited — a typical symptom of pool exhaustion, triggered by the db_pool backend chaos.

5. DB connection acquisition latency

rate(hikaricp_connections_acquire_seconds_sum{pool="PerfShopHikariPool"}[1m])
/
rate(hikaricp_connections_acquire_seconds_count{pool="PerfShopHikariPool"}[1m])
* 1000

Computes the average time, in milliseconds, spent waiting for a HikariCP connection. The pool="PerfShopHikariPool" label comes from the name configured in application.yml (spring.datasource.hikari.pool-name).
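Dividing the two rates over the same window amounts to dividing the increase of the _sum counter by the increase of the _count counter. The worked example below reproduces that arithmetic in Python; the function name and the sample values are illustrative only.

```python
import math


def average_acquire_ms(sum_start: float, sum_end: float,
                       count_start: float, count_end: float) -> float:
    """Average wait per acquisition, in ms, between two samples of the counters."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        # No acquisition in the window: PromQL would yield NaN (0 / 0).
        return math.nan
    delta_sum = sum_end - sum_start  # total seconds spent waiting in the window
    return delta_sum / delta_count * 1000.0
```

If the sum grows from 10.0 s to 10.75 s while the count grows from 100 to 250, the average is 0.75 s spread over 150 acquisitions, that is 5 ms per acquisition.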

6. Tomcat HTTP threads — busy vs configured

tomcat_threads_busy_threads{job="perfshop-backend"}
tomcat_threads_current_threads{job="perfshop-backend"}
tomcat_threads_config_max_threads{job="perfshop-backend"}

The three curves together: busy ≤ current ≤ config_max. When busy sticks to config_max, the HTTP thread pool is exhausted — triggered by the thread_pool backend chaos.

7. Request throughput 2xx vs 5xx vs 4xx

rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"2.."}[1m])
rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"5.."}[1m])
rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"4.."}[1m])

Uses regex on the status label. The three series are stacked in the "HTTP request throughput" panel to visualize the ratio.

8. Current chaos intensity (type by type)

chaos_intensity{type="cpu"}
chaos_intensity{type="memory"}
chaos_intensity{type="thread_pool"}
chaos_intensity{type="db_pool"}
chaos_intensity{type="slow_query"}
chaos_intensity{type="deadlock"}
chaos_intensity{type="network"}

This is a custom gauge declared by the backend ChaosService. Its value (between 0 and 100, or 0 and the max level) reflects the current intensity of each backend chaos. The instructor "Time evolution of chaos anomalies" panel overlays the seven series for a global view of what is active.

Bonus — counting instrumented operations

count(count by (uri) (http_server_requests_seconds_count{job="perfshop-backend"}))

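Returns the number of distinct uri label values exported so far. Micrometer creates an http_server_requests series only once an endpoint has served a request, so call a new endpoint at least once before using this count to confirm that it is instrumented after a backend update.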
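Any of the queries above can also be run outside Grafana through the Prometheus HTTP API (GET /api/v1/query). The sketch below is one way to do it with the Python standard library; the function names are hypothetical and the base URL assumes the default host port 9091.

```python
import json
import urllib.parse
import urllib.request

DEFAULT_BASE_URL = "http://localhost:9091"  # assumption: default PROMETHEUS_HTTP_PORT


def build_query_url(promql: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def instant_query(promql: str, base_url: str = DEFAULT_BASE_URL,
                  timeout: float = 5.0) -> list:
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(promql, base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % payload.get("error"))
    return payload["data"]["result"]
```

For example, instant_query('chaos_intensity{type="cpu"}') returns one entry per matching series, each with its labels under "metric" and a [timestamp, value] pair under "value".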

Expected scrape errors

The jmeter job blinks red between test runs — this is normal

As long as no JMeter test run is in progress, port 9270 is not opened by any process — Prometheus therefore logs a scrape error for this target. This is intentional and documented as a comment in prometheus.yml. To hide these errors during a demo, temporarily disable the job in Prometheus, or ignore the series in Grafana.
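When checking target health from a script, the jmeter job can be filtered out so that only unexpected failures are reported. The sketch below works on the JSON returned by GET /api/v1/targets; the function name and the set of jobs treated as expected-down are assumptions for this example.

```python
EXPECTED_DOWN_JOBS = frozenset({"jmeter"})  # assumption: only jmeter may be down


def unexpected_down_targets(payload: dict,
                            expected_down: frozenset = EXPECTED_DOWN_JOBS) -> list:
    """Return (job, scrape_url, last_error) for targets that should be up but are not."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job")
        if target["health"] != "up" and job not in expected_down:
            problems.append((job, target.get("scrapeUrl"), target.get("lastError")))
    return problems
```

An empty list means every target other than jmeter is healthy, whether or not a test run is in progress.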

Useful endpoints on the Prometheus side

| Endpoint | Usage |
| --- | --- |
| http://localhost:9091/ | Prometheus UI |
| http://localhost:9091/targets | State of the scraped targets |
| http://localhost:9091/graph | PromQL query console |
| http://localhost:9091/api/v1/write | Remote-write endpoint (used by Tempo) |
| http://localhost:9091/-/reload | Reloads the configuration without restart (must be called with POST or PUT) |
| http://localhost:9091/-/healthy | Healthcheck |

To go further

  • Overview: global observability flow
  • Grafana: Prometheus datasource and exploration
  • Shipped dashboards: panel-by-panel detail of the 10 dashboards
  • Tempo: metrics_generator that pushes its span-metrics into Prometheus via remote-write
  • Real-time HTML dashboard: Node code that produces the docker_container_* metrics