Prometheus¶

Prometheus is the only metrics store in PerfShop. It scrapes three targets every 5 seconds, keeps time series for 7 days (within a 5 GB size limit), backs the Grafana datasource, and receives the span metrics and service graphs that Tempo pushes via remote write.

Source of truth

This page is derived from prometheus/prometheus.yml and from the command: block of the prometheus service in the compose files.

Global configuration¶

global:
  scrape_interval: 15s        # default, overridden by each job to 5s
  evaluation_interval: 15s    # default, for rules

The Prometheus service is launched with these explicit command-line options:

--config.file=/etc/prometheus/prometheus.yml
--storage.tsdb.path=/prometheus
--storage.tsdb.retention.time=7d
--storage.tsdb.retention.size=5GB
--web.console.libraries=/usr/share/prometheus/console_libraries
--web.console.templates=/usr/share/prometheus/consoles
--web.enable-remote-write-receiver
--web.enable-lifecycle

Option	Effect
`--storage.tsdb.retention.time=7d`	Samples older than 7 days are deleted automatically
`--storage.tsdb.retention.size=5GB`	Caps the space used by TSDB blocks at about 5 GB; the oldest data is removed first, and whichever of the two retention limits is reached first applies
`--web.enable-remote-write-receiver`	Enables the `/api/v1/write` endpoint, used by Tempo (`metrics_generator`) to push its span metrics and service graphs
`--web.enable-lifecycle`	Enables `/-/reload` and `/-/quit` (called with HTTP POST or PUT), so the configuration can be reloaded without a restart

Prometheus listens on port 9090 inside the container. That port is published on host port 9091 by default (PROMETHEUS_HTTP_PORT variable).
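
Because `--web.enable-lifecycle` is set, a configuration reload can be scripted. The following is a minimal Python sketch, assuming Prometheus is reachable on the default host port 9091; the function names are illustrative and not part of PerfShop.

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9091") -> urllib.request.Request:
    """Build the POST request that asks Prometheus to re-read its configuration."""
    url = f"{base_url.rstrip('/')}/-/reload"
    # An empty body is enough; the endpoint only cares about the HTTP method.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str = "http://localhost:9091", timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling `reload_prometheus()` requires a running stack. If the new configuration is rejected, `urlopen` raises `urllib.error.HTTPError`, so a failed reload is not silent.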

Scraped targets¶

PerfShop declares three Prometheus jobs, all with scrape_interval: 5s (more aggressive than the global default of 15 s, to allow fine real-time observation during chaos demos).

flowchart LR
  PROM["prometheus<br/>(9090 internal)"]

  subgraph t1["Job perfshop-backend"]
    APP["perfshop-app:9090<br/>/actuator/prometheus"]
  end

  subgraph t2["Job perfshop-docker"]
    MON["perfshop-monitoring:3001<br/>/metrics"]
  end

  subgraph t3["Job jmeter"]
    JM["perfshop-jmeter:9270<br/>(active during a test run only)"]
  end

  PROM -->|scrape 5s| APP
  PROM -->|scrape 5s| MON
  PROM -->|scrape 5s| JM

Job `perfshop-backend`¶

- job_name: 'perfshop-backend'
  static_configs:
    - targets: ['perfshop-app:9090']
  metrics_path: '/actuator/prometheus'
  scrape_interval: 5s

This is the main job. It scrapes the Spring Boot Actuator endpoint exposed by Micrometer on management port 9090. This port is deliberately separate from application port 8080, so that business traffic and observability traffic do not mix. All JVM, HTTP, JDBC, and Tomcat metrics, as well as the custom chaos metrics, are produced by this single endpoint.

The main metric families scraped:

Family	Prefix / name	Origin
HTTP server	`http_server_requests_seconds_*`	Spring Boot Web Micrometer auto-config
JVM memory	`jvm_memory_used_bytes`, `jvm_memory_max_bytes`, `jvm_memory_committed_bytes`	Micrometer JVM binder
JVM threads	`jvm_threads_live_threads`, `jvm_threads_daemon_threads`, `jvm_threads_peak_threads`, `jvm_threads_states_threads`	Micrometer JVM binder
JVM GC	`jvm_gc_pause_seconds_count`, `jvm_gc_pause_seconds_sum`	Micrometer JVM binder
Tomcat	`tomcat_threads_busy_threads`, `tomcat_threads_current_threads`, `tomcat_threads_config_max_threads`	Micrometer Tomcat binder (enabled via `management.metrics.enable.tomcat=true`)
HikariCP	`hikaricp_connections_active`, `_idle`, `_max`, `_pending`, `_acquire_seconds_*`	Micrometer HikariCP binder, label `pool="PerfShopHikariPool"`
Process	`process_cpu_usage`, `process_uptime_seconds`, `process_files_open_files`	Micrometer process binder
Custom chaos	`chaos_intensity{type="cpu|memory|thread_pool|db_pool|slow_query|deadlock|network"}`, `chaos_functional_level`, `chaos_functional_f4_corruption`	Custom metrics declared in the backend chaos services

The HTTP histogram is enabled in application.yml:

management:
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true

Without this setting, the _bucket series are not exported, and histogram_quantile() queries on http_server_requests_seconds_bucket return no data.

Job `perfshop-docker`¶

- job_name: 'perfshop-docker'
  static_configs:
    - targets: ['perfshop-monitoring:3001']
  metrics_path: '/metrics'
  scrape_interval: 5s

The perfshop-monitoring service (Node + Express) queries the Docker API through the Docker Unix socket, which is bind-mounted into the container. It computes CPU/RAM/network/I/O statistics for the main containers and exposes them in Prometheus format on its own /metrics route. See dashboard-html.md for the details.

The container metrics produced by this job use the docker_container_ prefix:

Metric	Type	Description
`docker_container_cpu_percent{container="..."}`	gauge	CPU percentage computed as `(cpuDelta / sysDelta) × numCpus × 100`
`docker_container_mem_usage_bytes{container="..."}`	gauge	Memory used (cache excluded)
`docker_container_mem_limit_bytes{container="..."}`	gauge	Container memory limit
`docker_container_mem_percent{container="..."}`	gauge	Percentage of memory used
`docker_container_net_rx_bytes{container="..."}`	counter	Cumulative bytes received
`docker_container_net_tx_bytes{container="..."}`	counter	Cumulative bytes sent
`docker_container_io_read_bytes{container="..."}`	counter	Cumulative disk bytes read
`docker_container_io_write_bytes{container="..."}`	counter	Cumulative disk bytes written
`docker_container_pids{container="..."}`	gauge	Number of processes in the container
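
The CPU formula in the table can be sketched as follows. This is a Python illustration, not the Node code of perfshop-monitoring. It assumes the input is one sample from the Docker stats API, with its `cpu_stats` and `precpu_stats` objects. Returning 0 when a delta is not positive is a choice made for this sketch.

```python
def cpu_percent(stats: dict) -> float:
    """Compute (cpuDelta / sysDelta) x numCpus x 100 from one Docker stats sample."""
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]

    # CPU time consumed by the container between the two samples.
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    # CPU time consumed by the whole host over the same interval.
    sys_delta = cpu.get("system_cpu_usage", 0) - pre.get("system_cpu_usage", 0)
    num_cpus = cpu.get("online_cpus") or 1

    # The first sample after a container start has no usable previous values.
    if cpu_delta <= 0 or sys_delta <= 0:
        return 0.0
    return (cpu_delta / sys_delta) * num_cpus * 100.0
```

With this formula, a container that saturates two cores on a four-core host reports 200, not 50: the value is expressed in percent of one core.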

The same job also carries browser metrics. chaos-agent.js pushes them from the frontend every 2 seconds (POST /api/chaos/client-metrics), and they are re-exposed on the /metrics route:

Metric	Description
`perfshop_client_fps`	FPS measured in the student's browser
`perfshop_client_heap_used_mb`	JS heap used (MB)
`perfshop_client_long_tasks_per_sec`	Long tasks (tasks > 50 ms) per second
`perfshop_client_fetch_req_per_sec`	Fetch requests issued per second
`perfshop_client_dom_node_count`	DOM node count
`perfshop_client_cpu_worker_active`	1 if a CPU-intensive Web Worker is running in the browser
`perfshop_client_last_received_timestamp`	Unix timestamp (ms) of the last received push — useful to compute freshness
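
Freshness can be derived from `perfshop_client_last_received_timestamp` by comparing it with the current time. The sketch below is a hedged Python illustration; the 10-second threshold (five missed pushes at the 2-second cadence) is an assumption, not a value taken from PerfShop.

```python
import time
from typing import Optional


def client_metrics_age_seconds(last_received_ms: float, now_ms: Optional[float] = None) -> float:
    """Return how many seconds have passed since the last browser push."""
    if now_ms is None:
        now_ms = time.time() * 1000.0  # the metric is expressed in milliseconds
    # Clamp at zero in case the two clocks disagree slightly.
    return max(0.0, (now_ms - last_received_ms) / 1000.0)


def is_client_stale(last_received_ms: float, now_ms: Optional[float] = None,
                    max_age_seconds: float = 10.0) -> bool:
    """True when no push has arrived for longer than max_age_seconds (assumed threshold)."""
    return client_metrics_age_seconds(last_received_ms, now_ms) > max_age_seconds
```

A stale value usually means the student's browser tab is closed or in the background, so the other `perfshop_client_*` gauges should then be read as the last known state rather than live data.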

Monitored containers

The Node code of perfshop-monitoring only monitors four containers: perfshop-frontend, perfshop-app, perfshop-db, perfshop-monitoring. This is intentional — the HTML dashboard targets the "front → back → DB" chain, and the Grafana "General Containers View" table uses the same job.

Job `jmeter`¶

- job_name: 'jmeter'
  static_configs:
    - targets: ['perfshop-jmeter:9270']
  scrape_interval: 5s
  honor_labels: true

The perfshop-jmeter container stays idle by default (tail -f /dev/null) and exposes nothing on port 9270 between test runs. When a run is launched via perfshop-jmeter-ui, the JMeter Prometheus listener plugin opens port 9270 for the duration of the run and exposes jmeter_* metrics. Once the run is over, the port closes and every scrape fails, so Prometheus reports the target as down. This is normal and expected, and is documented as a comment in prometheus.yml.

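honor_labels: true tells Prometheus to keep the labels exported by JMeter (notably label="...", which carries the sampler name) whenever they collide with labels that Prometheus attaches itself. Without it, a colliding scraped label is renamed with an exported_ prefix and the server-side value wins.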
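
The effect of honor_labels on label conflicts can be pictured with a simplified Python model. It is a sketch of the documented behaviour, not Prometheus code, and the scraped `job` value below is hypothetical: the source does not say which labels the JMeter plugin actually exports besides `label`.

```python
def merge_labels(scraped: dict, target: dict, honor_labels: bool) -> dict:
    """Combine labels from the scraped sample with labels attached by Prometheus."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped:
            if honor_labels:
                # Conflict: the scraped value is kept, the server-side one is dropped.
                continue
            # Conflict: the scraped value is preserved under an exported_ prefix.
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged


# Hypothetical sample exposed by the JMeter listener.
scraped = {"label": "GET /products", "job": "load-test"}
# Labels Prometheus attaches for the 'jmeter' job.
target = {"job": "jmeter", "instance": "perfshop-jmeter:9270"}

kept = merge_labels(scraped, target, honor_labels=True)
renamed = merge_labels(scraped, target, honor_labels=False)
```

Labels that do not collide, such as `label` here, pass through unchanged in both modes.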

Metrics exposed by the JMeter Prometheus listener plugin:

Metric	Description
`jmeter_threads{state="active|started|finished"}`	Active / started / finished virtual threads (vUsers)
`jmeter_transactions_total{label="..."}`	Cumulative transaction counter per sampler
`jmeter_response_time{quantile="0.5|0.95|0.99",label="..."}`	Latency quantiles per sampler
`jmeter_response_time_count{label="..."}`	Sample counter per sampler
`jmeter_response_time_sum{label="..."}`	Cumulative sum of latencies (for average computation)
`jmeter_success_ratio_success{label="..."}`	Success counter
`jmeter_success_ratio_failure{label="..."}`	Failure counter

PromQL query examples¶

All the queries below are extracted from the Grafana dashboards actually shipped in grafana/dashboards/{eleves,formateurs}/*.json. They are therefore both pedagogical and representative of real usage.

1. P95 HTTP latency (all routes combined)¶

histogram_quantile(
  0.95,
  sum(rate(http_server_requests_seconds_bucket{job="perfshop-backend"}[5m])) by (le)
) * 1000

This is the reference query for observing the global backend latency. The * 1000 converts seconds to milliseconds. The [5m] window is wide enough to absorb short variations while remaining responsive.

Variants: 0.50 for the median, 0.99 for p99. The "HTTP response times" panel uses all three in parallel.
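
To show why the query needs the _bucket series, here is a minimal Python sketch of the linear interpolation that histogram_quantile() applies to cumulative buckets. It is a simplified illustration, not Prometheus code, and the bucket bounds and counts in the usage note are invented values.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs ending with +Inf."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    # A usable histogram needs at least two buckets, the last one being +Inf.
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # how many observations lie at or below the quantile
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # The quantile falls beyond the last finite bound.
                return prev_le
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan
```

With the invented buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the 0.95 quantile lands halfway through the 0.5 to 1.0 bucket, that is 0.75 s, or 750 ms after the `* 1000` conversion. The real query feeds per-second rates rather than raw counts, but the interpolation is the same because only the ratios between buckets matter.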

2. HTTP 5xx error rate (as a percentage)¶

sum(rate(http_server_requests_seconds_count{job="perfshop-backend",outcome="SERVER_ERROR"}[5m]))
/
sum(rate(http_server_requests_seconds_count{job="perfshop-backend"}[5m]))
* 100

Uses the outcome label produced by Spring Boot Actuator (SUCCESS, CLIENT_ERROR, SERVER_ERROR, INFORMATIONAL, REDIRECTION, UNKNOWN).

3. JVM heap used (sum of all heap pools)¶

sum(jvm_memory_used_bytes{job="perfshop-backend",area="heap"})

area="heap" filters the heap memory pools (Eden, Survivor, Old Gen). For maximum heap:

sum(jvm_memory_max_bytes{job="perfshop-backend",area="heap"})

4. HikariCP pool — active vs max connections¶

hikaricp_connections_active{job="perfshop-backend"}
hikaricp_connections_idle{job="perfshop-backend"}
hikaricp_connections_max{job="perfshop-backend"}
hikaricp_connections_pending{job="perfshop-backend"}

The four metrics are overlaid in the "DB connections — HikariCP pool state" panel of the Student Backend dashboard. When pending > 0, at least one thread is waiting for a connection. This is a typical symptom of pool exhaustion, which the db_pool backend chaos triggers.

5. DB connection acquisition latency¶

rate(hikaricp_connections_acquire_seconds_sum{pool="PerfShopHikariPool"}[1m])
/
rate(hikaricp_connections_acquire_seconds_count{pool="PerfShopHikariPool"}[1m])
* 1000

Computes the average time, in milliseconds, spent waiting for a HikariCP connection. The pool="PerfShopHikariPool" label comes from the name configured in application.yml (spring.datasource.hikari.pool-name).
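
The ratio of the two rate() expressions reduces to "increase of the sum divided by increase of the count", because the window length cancels out. A minimal Python sketch of that arithmetic, using two hypothetical samples of each counter taken at the start and end of the window:

```python
from typing import Optional


def average_acquire_ms(sum_start: float, count_start: float,
                       sum_end: float, count_end: float) -> Optional[float]:
    """Average connection acquisition time over a window, in milliseconds."""
    delta_count = count_end - count_start
    # No acquisition in the window: the PromQL division yields no usable value either.
    if delta_count <= 0:
        return None
    delta_sum_seconds = sum_end - sum_start
    return delta_sum_seconds / delta_count * 1000.0
```

Unlike rate(), this sketch does not compensate for counter resets, so it only holds when the backend did not restart inside the window.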

6. Tomcat HTTP threads — busy vs configured¶

tomcat_threads_busy_threads{job="perfshop-backend"}
tomcat_threads_current_threads{job="perfshop-backend"}
tomcat_threads_config_max_threads{job="perfshop-backend"}

The three curves always satisfy busy ≤ current ≤ config_max. When busy plateaus at config_max, the HTTP thread pool is exhausted, which is what the thread_pool backend chaos triggers.

7. Request throughput 2xx vs 5xx vs 4xx¶

rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"2.."}[1m])
rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"5.."}[1m])
rate(http_server_requests_seconds_count{job="perfshop-backend",status=~"4.."}[1m])

Uses regex matchers (=~) on the status label. The three series are stacked in the "HTTP request throughput" panel to visualize the ratio.

8. Current chaos intensity (type by type)¶

chaos_intensity{type="cpu"}
chaos_intensity{type="memory"}
chaos_intensity{type="thread_pool"}
chaos_intensity{type="db_pool"}
chaos_intensity{type="slow_query"}
chaos_intensity{type="deadlock"}
chaos_intensity{type="network"}

This is a custom gauge declared by the backend ChaosService. Its value (between 0 and 100, or 0 and the max level) reflects the current intensity of each backend chaos. The instructor "Time evolution of chaos anomalies" panel overlays the seven series for a global view of what is active.

Bonus — counting instrumented operations¶

count(count by (uri) (http_server_requests_seconds_count{job="perfshop-backend"}))

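Returns the number of distinct values of the uri label, that is, the URIs for which the backend has recorded at least one request. Useful to confirm that new endpoints have been correctly instrumented after a backend update, once each of them has been called.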

Expected scrape errors¶

The jmeter job blinks red between test runs — this is normal

As long as no JMeter test run is in progress, no process listens on port 9270, so every scrape of this target fails and Prometheus shows it as down on the /targets page. This is intentional and documented as a comment in prometheus.yml. To hide these errors during a demo, temporarily disable the job in Prometheus, or ignore the series in Grafana.

Useful endpoints on the Prometheus side¶

Endpoint	Usage
`http://localhost:9091/`	Prometheus UI
`http://localhost:9091/targets`	State of the scraped targets
`http://localhost:9091/graph`	PromQL query console
`http://localhost:9091/api/v1/write`	Remote-write endpoint (used by Tempo)
`http://localhost:9091/-/reload`	Reloads the configuration without restart (HTTP POST or PUT)
`http://localhost:9091/-/healthy`	Healthcheck
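
Any of the PromQL queries on this page can also be run from a script through the standard Prometheus HTTP API. The sketch below assumes the default host port 9091 and uses the `/api/v1/query` endpoint, which is part of Prometheus but is not listed in the table above; the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(promql: str, base_url: str = "http://localhost:9091") -> str:
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(promql: str, base_url: str = "http://localhost:9091",
                  timeout: float = 5.0) -> list:
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(promql, base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

For example, `instant_query('up{job="jmeter"}')` returns a value of 0 between test runs and 1 while the JMeter listener is being scraped successfully.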

To go further¶

Overview — global observability flow
Grafana — Prometheus datasource and exploration
Shipped dashboards — panel-by-panel detail of the 10 dashboards
Tempo — metrics_generator that pushes its span-metrics into Prometheus via remote-write
Real-time HTML dashboard — Node code that produces the docker_container_* metrics