Introduction to Chaos Engineering

PerfShop is a pedagogical chaos engineering platform designed for teaching performance testing, software quality, and application security. This section exhaustively documents the seven chaos families that the platform injects into the supporting e-commerce application, the levers exposed to instructors and students, and the associated metrics.

This documentation is a technical reference. For lab scenarios, refer to the training materials shipped with the license.

Pedagogical philosophy

The principle is simple: deliberately inject realistic anomalies into a working e-commerce system so that students learn to identify them through observability tools (Grafana, Tempo, Loki, Pyroscope), not to fix them in the code. Each anomaly is modeled on a real-world production incident: missing SQL indexes, lack of idempotency, React listener leaks, IDOR in a REST API, hard-coded obsolete VAT rates.

The golden rule is that, with the right test data, the e-commerce journey must always be able to complete, whatever the level of active chaos. The only exception is functional chaos, which deliberately injects terminal JVM exceptions and errors (F1 NullPointerException, F2 StackOverflowError, F3 OutOfMemoryError).

The seven chaos families

flowchart TB
    subgraph ADM[Chaos Admin — instructor control]
      AF[chaos-admin panel]
    end
    subgraph CHAOS[The 7 PerfShop chaos families]
      P[**1. Performance**<br/>CPU, memory, GC, DB pool,<br/>threads, slow query,<br/>deadlock, network]
      S[**2. Weather scenarios**<br/>20 presets combining<br/>multiple levers<br/>N1-01 → N4-05]
      F[**3. Functional**<br/>F1 NPE · F2 SOE<br/>F3 OOM · F4 corruption]
      B[**4. Business**<br/>16 anomalies<br/>A1 → A16<br/>TMAP / ISTQB]
      SC[**5. Security**<br/>12 flaws<br/>S1 → S12<br/>OWASP Top 10]
      SK[**6. Scripting**<br/>Progressive HTTP tokens<br/>Junior → Maestro<br/>correlation, HMAC]
      FR[**7. Frontend**<br/>CPU, memory, DOM,<br/>fetch flood,<br/>double fetch]
    end
    subgraph APP[E-commerce application]
      BE[Spring Boot backend]
      FE[React frontend]
    end
    AF --> P & S & F & B & SC & SK & FR
    P & F & B & SC & SK --> BE
    FR --> FE
    S --> BE & FE

Each family is independent: one can simultaneously enable Business Chaos level 3 and Performance Chaos level 2, for example. The Weather scenarios are pre-calibrated combinations of Performance levers (backend + frontend) — they do not touch the Functional, Business, Security, or Scripting families.

Unified levels 0 – 4

Five families (Functional, Business, Security, Scripting, and Performance by way of the Weather scenarios) use a unified, cumulative level system aligned with the PerfShop nomenclature:

| Level | Label | Meaning |
|---|---|---|
| 0 | Disabled | No active anomaly; nominal reference behavior |
| 1 | Junior | The most visible anomalies; direct diagnosis |
| 2 | Intermediate | More subtle anomalies, cumulative with level 1 |
| 3 | Expert | Diagnosis requires Tempo / Pyroscope / correlation |
| 4 | Master | Everything is active: Master journey / fine-grained business validation |

Scripting Chaos labels level 4 Maestro instead of Master; at that level the HMAC key is derived dynamically per session.

The levels are cumulative: level 3 includes all anomalies of levels 1 and 2. Each chaos service exposes a Prometheus Gauge chaos.<family>.level that reflects the current state.
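The cumulative rule can be illustrated with a minimal sketch. The `Anomaly` enum, its entries, and `CumulativeLevels` are hypothetical names introduced for illustration only; they are not taken from the PerfShop code base. The sketch assumes each anomaly declares the lowest level at which it becomes active.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical anomaly catalogue: each entry declares the lowest level that activates it. */
enum Anomaly {
    VISIBLE(1),     // Junior
    SUBTLE(2),      // Intermediate
    CORRELATED(3),  // Expert
    FINE_GRAINED(4); // Master

    final int minLevel;

    Anomaly(int minLevel) {
        this.minLevel = minLevel;
    }
}

final class CumulativeLevels {
    private CumulativeLevels() {}

    /** Returns every anomaly whose activation level is at or below the current level. */
    static List<Anomaly> activeAt(int level) {
        if (level < 0 || level > 4) {
            throw new IllegalArgumentException("level must be between 0 and 4: " + level);
        }
        List<Anomaly> active = new ArrayList<>();
        for (Anomaly a : Anomaly.values()) {
            if (a.minLevel <= level) {
                active.add(a);
            }
        }
        return active;
    }
}
```

Under these assumptions, `CumulativeLevels.activeAt(3)` returns the level 1, 2, and 3 anomalies, and `activeAt(0)` returns an empty list, which corresponds to the nominal reference behavior.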

Sliders and presets

Performance Chaos (infrastructure) and Frontend Chaos use 0 – 100 % sliders that are not arbitrary percentages: each value is calibrated to match a measurable impact. The exact formulas are documented in Performance Chaos and Frontend Chaos.

The weather scenarios (N1-01 through N4-05) are 20 presets, each of which combines several Performance levers and can be triggered by a student with a single click. They are documented in Weather scenarios.

Freemium vs Pro

PerfShop is distributed under the AGPL-3.0-or-later license with an optional dual commercial license. Without an active license, the platform remains functional but restricts access on the student side:

| Family | Without license (Freemium) | With license (Pro) |
|---|---|---|
| Performance (CPU) | Level ≤ 1 | Levels 0 → 4 |
| Weather scenarios | N1-01 and N1-02 only | All 20 scenarios |
| Scripting | Level ≤ 1 | Levels 0 → 4 |
| Business | Blocked | Levels 0 → 4 |
| Functional | Blocked | Levels 0 → 4 |
| Security | Blocked | Levels 0 → 4 |
| Pedagogical BAC1-BAC5 | Blocked | Levels 0 → 5 |

Freemium restrictions are enforced on the backend side in ChaosStudentController: any attempt to exceed them returns an HTTP 402 Payment Required with the error code LICENSE_REQUIRED. On the instructor side (chaos-admin), all levels are always accessible regardless of the license — the license only applies to the student surface.
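The restriction logic can be sketched in plain Java as follows. This is not the code of ChaosStudentController: `FreemiumGate`, `GateResult`, and the family keys are hypothetical names, and modeling "Blocked" as a maximum level of 0 is an assumption. Only the 402 status, the LICENSE_REQUIRED code, and the per-family limits come from the table above.

```java
import java.util.Map;
import java.util.Set;

/** Outcome of a student request: HTTP status plus an optional error code. */
final class GateResult {
    static final GateResult OK = new GateResult(200, null);
    static final GateResult LICENSE_REQUIRED = new GateResult(402, "LICENSE_REQUIRED");

    final int status;
    final String errorCode;

    private GateResult(int status, String errorCode) {
        this.status = status;
        this.errorCode = errorCode;
    }
}

/** Hypothetical gate applying the Freemium limits to student requests. */
final class FreemiumGate {
    // Highest level a student may request without a license, per family.
    // Assumption: "Blocked" is modeled as a cap of 0 (the family cannot be enabled).
    private static final Map<String, Integer> FREE_MAX_LEVEL = Map.of(
            "performance", 1,
            "scripting", 1,
            "business", 0,
            "functional", 0,
            "security", 0);

    private static final Set<String> FREE_SCENARIOS = Set.of("N1-01", "N1-02");

    private final boolean licensed;

    FreemiumGate(boolean licensed) {
        this.licensed = licensed;
    }

    GateResult checkLevel(String family, int requestedLevel) {
        if (licensed) {
            return GateResult.OK;
        }
        // Unknown families are treated as blocked in this sketch.
        int cap = FREE_MAX_LEVEL.getOrDefault(family, 0);
        return requestedLevel <= cap ? GateResult.OK : GateResult.LICENSE_REQUIRED;
    }

    GateResult checkScenario(String scenarioId) {
        if (licensed || FREE_SCENARIOS.contains(scenarioId)) {
            return GateResult.OK;
        }
        return GateResult.LICENSE_REQUIRED;
    }
}
```

With `new FreemiumGate(false)`, a request for Performance level 2 or for scenario N4-05 yields status 402 with LICENSE_REQUIRED, while Performance level 1 and scenario N1-02 pass. The instructor surface would bypass such a gate entirely, since the license applies only to student requests.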

The two freemium scenarios (N1-01 Light breeze and N1-02 Morning mist) are built with the .free() flag in PerformanceScenario.java. N1-01 sets backend CPU to 40 %, and N1-02 sets the frontend memory leak to 30 %.

Software architecture

flowchart LR
    subgraph FRONT[Browser]
      R[React App]
      CA[chaos-agent.js]
    end
    subgraph BACK[Spring Boot backend]
      CI[ChaosInterceptor]
      CS[ChaosService]
      MLS[MemoryLeakSimulator]
      GPS[GcPressureSimulator]
      CCS[CpuChaosScheduler]
      DPS[DbPoolChaosScheduler]
      BCS[BusinessChaosService]
      FCS[FunctionalChaosService]
      SCS[SecurityChaosService]
      SKS[ChaosScriptingService]
      FCC[FrontendChaosController]
    end
    subgraph OBS[Observability]
      PROM[Prometheus]
      GRAF[Grafana]
      TEMPO[Tempo]
      LOKI[Loki]
    end
    R --> CI
    CA -->|poll 5s| FCC
    CI --> CS
    CS --> MLS & GPS & CCS & DPS
    CI -.metrics.-> PROM
    CS & MLS & GPS & BCS & FCS & SCS & SKS -.gauges.-> PROM
    BCS & FCS & SCS -.logs.-> LOKI
    CI -.spans.-> TEMPO
    PROM --> GRAF

The backend services expose their current state through Micrometer Gauges (prefix chaos.) scraped by Prometheus every 15 seconds. Business, Functional, and Security anomalies additionally write to a circular activity log (200 entries max) readable through the public endpoints /api/chaos/public/<family>/logs to feed the real-time dashboards.
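A circular log of this kind can be sketched with a bounded deque. `ActivityLog` is a hypothetical class written for illustration; the actual PerfShop data structure is not described here, and only the 200-entry cap comes from the text above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Bounded activity log: once full, each new entry evicts the oldest one. */
final class ActivityLog {
    private final int capacity;
    private final Deque<String> entries = new ArrayDeque<>();

    ActivityLog(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    synchronized void add(String entry) {
        if (entries.size() == capacity) {
            entries.removeFirst(); // drop the oldest entry
        }
        entries.addLast(entry);
    }

    /** Returns a copy, oldest first, so callers can serialize it without holding the lock. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

With `new ActivityLog(200)`, writing 250 entries leaves only the last 200 in the snapshot, so memory use stays constant no matter how long a chaos session runs.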

Frontend Chaos is driven by simple polling: the chaos-agent.js component queries GET /api/chaos/frontend/state every 5 seconds and applies the received 0 – 100 intensities to the five anomalies on the browser side.

Endpoints excluded from chaos

Whatever the active chaos level, some endpoints are never impacted by ChaosInterceptor so that monitoring and control remain operational even when the backend is 100 % saturated:

  • /actuator/** — Prometheus scraping, health checks, heap dump
  • /api/admin/chaos/** — chaos control by the instructor
  • /api/chaos/** — public monitoring endpoints, student page, frontend state
  • /api/license/** — license activation and status

This guarantee is unconditional: it is implemented at the top of ChaosInterceptor.preHandle() through an early return before any anomaly injection.
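The early-return pattern can be sketched without any framework dependency. `ChaosExclusions` is a hypothetical helper, not the actual ChaosInterceptor code: it reduces the `/**` patterns listed above to plain prefix checks, and the boolean return value is assumed to mean "continue processing the request".

```java
import java.util.List;

/** Hypothetical helper showing an exclusion check that runs before any injection. */
final class ChaosExclusions {
    // Base paths taken from the exclusion list; "/base/**" is treated as
    // "the base path itself or anything below it".
    private static final List<String> EXCLUDED_BASES = List.of(
            "/actuator",
            "/api/admin/chaos",
            "/api/chaos",
            "/api/license");

    private ChaosExclusions() {}

    static boolean isExcluded(String requestPath) {
        for (String base : EXCLUDED_BASES) {
            if (requestPath.equals(base) || requestPath.startsWith(base + "/")) {
                return true;
            }
        }
        return false;
    }

    /** The exclusion check comes first; anomaly injection only runs for other paths. */
    static boolean preHandle(String requestPath, Runnable injectAnomalies) {
        if (isExcluded(requestPath)) {
            return true; // early return: the request proceeds untouched
        }
        injectAnomalies.run();
        return true;
    }
}
```

Because the check precedes every injection step, a request such as `/actuator/prometheus` never runs the injection callback, whereas an application path such as `/api/cart` (a hypothetical example) does.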

Going further