Introduction to Chaos Engineering
PerfShop is a pedagogical chaos engineering platform designed for teaching performance testing, software quality, and application security. This section exhaustively documents the seven chaos families that the platform injects into the supporting e-commerce application, the levers exposed to instructors and students, and the associated metrics.
This documentation is a technical reference. For lab scenarios, refer to the training materials shipped with the license.
Pedagogical philosophy
The principle is simple: deliberately inject realistic anomalies into a working e-commerce system so that students learn to identify them through observability tools (Grafana, Tempo, Loki, Pyroscope), not to fix them in the code. Each anomaly is modeled on a real-world production incident: missing SQL indexes, lack of idempotency, React listener leaks, IDOR in a REST API, hard-coded obsolete VAT rates.
The golden rule is that, with the right test data, the e-commerce journey must always be able to complete, whatever the level of active chaos. The one exception is Functional Chaos, whose F1, F2, and F3 anomalies deliberately throw terminal JVM exceptions.
The seven chaos families
flowchart TB
subgraph ADM[Chaos Admin — instructor control]
AF[chaos-admin panel]
end
subgraph CHAOS[The 7 PerfShop chaos families]
P[**1. Performance**<br/>CPU, memory, GC, DB pool,<br/>threads, slow query,<br/>deadlock, network]
S[**2. Weather scenarios**<br/>20 presets combining<br/>multiple levers<br/>N1-01 → N4-05]
F[**3. Functional**<br/>F1 NPE · F2 SOE<br/>F3 OOM · F4 corruption]
B[**4. Business**<br/>16 anomalies<br/>A1 → A16<br/>TMAP / ISTQB]
SC[**5. Security**<br/>12 flaws<br/>S1 → S12<br/>OWASP Top 10]
SK[**6. Scripting**<br/>Progressive HTTP tokens<br/>Junior → Maestro<br/>correlation, HMAC]
FR[**7. Frontend**<br/>CPU, memory, DOM,<br/>fetch flood,<br/>double fetch]
end
subgraph APP[E-commerce application]
BE[Spring Boot backend]
FE[React frontend]
end
AF --> P & S & F & B & SC & SK & FR
P & F & B & SC & SK --> BE
FR --> FE
S --> BE & FE
Each family is independent: one can simultaneously enable Business Chaos level 3 and Performance Chaos level 2, for example. The Weather scenarios are pre-calibrated combinations of Performance levers (backend + frontend) — they do not touch the Functional, Business, Security, or Scripting families.
Unified levels 0 – 4
Five families (Functional, Business, Security, Scripting, Performance through scenarios) use a unified, cumulative level system aligned with the PerfShop nomenclature:
| Level | Label | Meaning |
|---|---|---|
| 0 | Disabled | No active anomaly — nominal reference behavior |
| 1 | Junior | The most visible anomalies — direct diagnosis |
| 2 | Intermediate | More subtle anomalies, cumulative with level 1 |
| 3 | Expert | Diagnosis requires Tempo / Pyroscope / correlation |
| 4 | Master | Everything is active — Master journey / fine-grained business validation |
Scripting Chaos labels level 4 Maestro instead of Master; at that level it adds dynamic HMAC key derivation per session.
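The levels are cumulative: level 3 includes all anomalies of levels 1 and 2. Each chaos service exposes a Micrometer Gauge named chaos.<family>.level that reflects the current level; with Micrometer's default Prometheus naming convention, dots become underscores, so it would typically be queried as chaos_<family>_level.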
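The cumulative rule can be illustrated with a minimal Java sketch. The class, record, and anomaly ids below are hypothetical and chosen for illustration only; they are not taken from the PerfShop code base, and the level assigned to each id is made up.

```java
import java.util.List;

/** Hypothetical sketch of cumulative chaos levels; names are illustrative, not PerfShop's API. */
final class CumulativeLevels {

    /** An anomaly tagged with the lowest level (1..4) at which it becomes active. */
    record Anomaly(String id, int minLevel) {}

    /** Cumulative rule: an anomaly is active once the current level reaches its minimum level. */
    static boolean isActive(Anomaly anomaly, int currentLevel) {
        if (currentLevel < 0 || currentLevel > 4) {
            throw new IllegalArgumentException("level must be in 0..4, got " + currentLevel);
        }
        return currentLevel >= anomaly.minLevel();
    }

    /** Returns the ids of every anomaly active at the given level, in catalog order. */
    static List<String> activeIds(List<Anomaly> catalog, int currentLevel) {
        return catalog.stream()
                .filter(a -> isActive(a, currentLevel))
                .map(Anomaly::id)
                .toList();
    }

    public static void main(String[] args) {
        // Illustrative catalog: ids and level assignments are invented for this example.
        List<Anomaly> catalog = List.of(
                new Anomaly("visible", 1), new Anomaly("subtle", 2),
                new Anomaly("deep", 3), new Anomaly("master", 4));
        System.out.println(activeIds(catalog, 3)); // prints [visible, subtle, deep]
    }
}
```

Storing only the minimum level per anomaly keeps the rule to a single comparison, and level 0 naturally yields an empty set.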
Sliders and presets
Performance Chaos (infrastructure) and Frontend Chaos use 0 – 100 % sliders that are not arbitrary percentages: each value is calibrated to match a measurable impact. The exact formulas are documented in Performance Chaos and Frontend Chaos.
The weather scenarios (N1-01 through N4-05) are 20 presets, each combining several Performance levers that a student can trigger with a single click. They are documented in Weather scenarios.
Freemium vs Pro
PerfShop is dual-licensed: it is distributed under AGPL-3.0-or-later, with an optional commercial license. Without an active license, the platform remains functional but restricts access on the student side:
| Family | Without license (Freemium) | With license (Pro) |
|---|---|---|
| Performance (CPU) | Level ≤ 1 | Levels 0 → 4 |
| Weather scenarios | N1-01 and N1-02 only | All 20 scenarios |
| Scripting | Level ≤ 1 | Levels 0 → 4 |
| Business | Blocked | Levels 0 → 4 |
| Functional | Blocked | Levels 0 → 4 |
| Security | Blocked | Levels 0 → 4 |
| Pedagogical BAC1-BAC5 | Blocked | Levels 0 → 5 |
Freemium restrictions are enforced on the backend side in ChaosStudentController: any attempt to exceed them returns an HTTP 402 Payment Required with the error code LICENSE_REQUIRED. On the instructor side (chaos-admin), all levels are always accessible regardless of the license — the license only applies to the student surface.
The two freemium scenarios (N1-01 Light breeze and N1-02 Morning mist) are built with the .free() flag in PerformanceScenario.java. They enable backend CPU at 40 % and a frontend memory leak at 30 %, respectively.
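The freemium rules in the table above can be approximated with the following Java sketch. It is not the actual ChaosStudentController: the FreemiumGate class, its methods, and the Decision record are hypothetical. Two points are assumptions rather than documented behavior: allowed requests are modeled as HTTP 200, and "Blocked" families still accept level 0 (disabled).

```java
import java.util.Set;

/** Hypothetical sketch of the freemium table; not the real ChaosStudentController. */
final class FreemiumGate {

    enum Family { PERFORMANCE, SCRIPTING, BUSINESS, FUNCTIONAL, SECURITY }

    /** Outcome of a student request: HTTP status plus an error code (null when allowed). */
    record Decision(int status, String errorCode) {
        static final Decision OK = new Decision(200, null); // 200 is an assumption
        static final Decision LICENSE_REQUIRED = new Decision(402, "LICENSE_REQUIRED");
    }

    /** The two scenarios the documentation lists as available without a license. */
    private static final Set<String> FREE_SCENARIOS = Set.of("N1-01", "N1-02");

    /** Checks a requested level against the freemium ceiling of the family. */
    static Decision checkLevel(Family family, int requestedLevel, boolean licensed) {
        if (licensed) {
            return Decision.OK;
        }
        int freeCeiling = switch (family) {
            case PERFORMANCE, SCRIPTING -> 1;           // "Level <= 1" in the table
            case BUSINESS, FUNCTIONAL, SECURITY -> 0;   // "Blocked": assumed to allow level 0 only
        };
        return requestedLevel <= freeCeiling ? Decision.OK : Decision.LICENSE_REQUIRED;
    }

    /** Checks whether a weather scenario may be started without a license. */
    static Decision checkScenario(String scenarioId, boolean licensed) {
        return (licensed || FREE_SCENARIOS.contains(scenarioId))
                ? Decision.OK
                : Decision.LICENSE_REQUIRED;
    }

    public static void main(String[] args) {
        System.out.println(checkLevel(Family.BUSINESS, 2, false).status()); // prints 402
        System.out.println(checkScenario("N1-02", false).status());         // prints 200
    }
}
```

Because the license only applies to the student surface, a gate of this kind would sit in front of student requests only; the instructor endpoints would bypass it.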
Software architecture
flowchart LR
subgraph FRONT[Browser]
R[React App]
CA[chaos-agent.js]
end
subgraph BACK[Spring Boot backend]
CI[ChaosInterceptor]
CS[ChaosService]
MLS[MemoryLeakSimulator]
GPS[GcPressureSimulator]
CCS[CpuChaosScheduler]
DPS[DbPoolChaosScheduler]
BCS[BusinessChaosService]
FCS[FunctionalChaosService]
SCS[SecurityChaosService]
SKS[ChaosScriptingService]
FCC[FrontendChaosController]
end
subgraph OBS[Observability]
PROM[Prometheus]
GRAF[Grafana]
TEMPO[Tempo]
LOKI[Loki]
end
R --> CI
CA -->|poll 5s| FCC
CI --> CS
CS --> MLS & GPS & CCS & DPS
CI -.metrics.-> PROM
CS & MLS & GPS & BCS & FCS & SCS & SKS -.gauges.-> PROM
BCS & FCS & SCS -.logs.-> LOKI
CI -.spans.-> TEMPO
PROM --> GRAF
The backend services expose their current state through Micrometer Gauges (prefix chaos.) scraped by Prometheus every 15 seconds. Business, Functional, and Security anomalies additionally write to a circular activity log (200 entries max) readable through the public endpoints /api/chaos/public/<family>/logs to feed the real-time dashboards.
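A circular activity log capped at 200 entries could look like the sketch below. This is a minimal illustration, assuming plain string entries and coarse synchronization; the ActivityLog class and its method names are hypothetical and do not come from the PerfShop sources.

```java
import java.util.ArrayDeque;
import java.util.List;

/** Hypothetical bounded activity log: keeps only the most recent MAX_ENTRIES messages. */
final class ActivityLog {

    static final int MAX_ENTRIES = 200;

    private final ArrayDeque<String> entries = new ArrayDeque<>(MAX_ENTRIES);

    /** Appends a message, evicting the oldest one once the log is full. */
    synchronized void record(String message) {
        if (entries.size() == MAX_ENTRIES) {
            entries.removeFirst();
        }
        entries.addLast(message);
    }

    /** Returns an immutable copy, oldest entry first, safe to serialize for a dashboard. */
    synchronized List<String> snapshot() {
        return List.copyOf(entries);
    }

    public static void main(String[] args) {
        ActivityLog log = new ActivityLog();
        for (int i = 0; i < 250; i++) {
            log.record("event-" + i);
        }
        List<String> view = log.snapshot();
        System.out.println(view.size());  // prints 200
        System.out.println(view.get(0));  // prints event-50
    }
}
```

Returning a copy from snapshot() lets a log endpoint serialize the entries without holding the lock while anomalies keep writing.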
Frontend Chaos is driven by simple polling: the chaos-agent.js component queries GET /api/chaos/frontend/state every 5 seconds and applies the received 0 – 100 intensities to the five anomalies on the browser side.
Endpoints excluded from chaos
Whatever the active chaos level, some endpoints are never impacted by ChaosInterceptor so that monitoring and control remain operational even when the backend is 100 % saturated:
- /actuator/**: Prometheus scraping, health checks, heap dump
- /api/admin/chaos/**: chaos control by the instructor
- /api/chaos/**: public monitoring endpoints, student page, frontend state
- /api/license/**: license activation and status
This guarantee is unconditional: it is implemented at the top of ChaosInterceptor.preHandle() through an early return before any anomaly injection.
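The early-return pattern can be sketched in plain Java as follows. This is an approximation under stated assumptions: the real ChaosInterceptor is a Spring component whose code is not shown here, the ChaosExclusions class is hypothetical, and the /** patterns are approximated by simple prefix checks rather than Spring's path matching.

```java
import java.util.List;

/** Hypothetical sketch of the exclusion check; prefix matching approximates the /** patterns. */
final class ChaosExclusions {

    private static final List<String> EXCLUDED_PREFIXES = List.of(
            "/actuator", "/api/admin/chaos", "/api/chaos", "/api/license");

    /** True when the path is one of the excluded roots or lies beneath one of them. */
    static boolean isExcluded(String path) {
        for (String prefix : EXCLUDED_PREFIXES) {
            if (path.equals(prefix) || path.startsWith(prefix + "/")) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the shape of a preHandle method: excluded paths return before any injection. */
    static boolean preHandle(String path, Runnable injectChaos) {
        if (isExcluded(path)) {
            return true; // early return: monitoring and control stay untouched
        }
        injectChaos.run();
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("/actuator/prometheus")); // prints true
        System.out.println(isExcluded("/api/products"));        // prints false
    }
}
```

Appending "/" before the startsWith check avoids accidentally excluding a sibling path such as /api/chaosmonkey.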
Going further
- Performance Chaos — calibrated backend infrastructure
- Weather scenarios — the 20 presets N1 through N4
- Functional Chaos — injected JVM exceptions (F1 – F4)
- Business Chaos — functional anomalies (A1 – A16)
- Security Chaos — OWASP flaws (S1 – S12)
- Scripting Chaos — HTTP correlation and HMAC
- Frontend Chaos — browser degradation
- Metrics reference — exhaustive Prometheus catalog