Running Experiments

Using Reproducible Example Scripts

The project currently provides four reproducible risk cases via scripts in examples/: R2, R9, R10, and R13. Each script loads a pre-configured YAML file from its configs/ subdirectory, runs the experiment, and writes results to results/.

# Tacit Collusion (R2)
cd examples/R2
python run_r2.py --condition C1

# Strategic Misreporting (R9)
cd ../R9
python run_r9.py

# Normative Deadlock (R10)
cd ../R10
python run_r10.py --condition e1

# Excessive Rigidity to Initial Directives (R13)
cd ../R13
python run_r13.py


Using the Python API

For full programmatic control, use config_loader and ExperimentRunner directly. This lets you modify components in code before running, or integrate experiments into larger pipelines.

from risklab.experiments.config_loader import (
    load_experiment_config,
    build_experiment_from_config,
)
from risklab.experiments.runner import ExperimentRunner

# Load and build
config = load_experiment_config("path/to/config.yaml")
components = build_experiment_from_config(config)

# Run
runner = ExperimentRunner(**components)
results = runner.run()       # returns list[dict], one per seed

# Access results
for result in results:
    print(result["risk_results"])    # {risk_id: {detected, score, ...}}
    print(result["metric_results"])  # {metric_name: value}
    print(result["task_result"])     # task evaluation or None
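Since each result is a plain dict with the fields shown above, detections can be filtered with ordinary dict operations. A minimal sketch, assuming each risk entry carries the boolean detected flag and numeric score described above; the sample data is illustrative, not real output:

```python
# Sketch: collect the risks flagged in a single run's result dict.
# The result shape follows the fields shown above; sample values are made up.
def detected_risks(result):
    """Return {risk_id: score} for every risk flagged in one run."""
    return {
        risk_id: info["score"]
        for risk_id, info in result["risk_results"].items()
        if info["detected"]
    }

sample = {
    "risk_results": {
        "R2": {"detected": True, "score": 0.81},
        "R9": {"detected": False, "score": 0.12},
    },
    "metric_results": {"num_messages": 42},
    "task_result": None,
}

print(detected_risks(sample))  # {'R2': 0.81}
```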

Multi-Seed Runs

LLM outputs are stochastic. Running with multiple seeds produces independent repetitions so you can measure variance.

results = runner.run(num_seeds=5)   # 5 independent runs

Note

In the current framework, seed is a run index recorded in outputs. It is not guaranteed to map to a deterministic random seed in external LLM provider APIs.
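With one result dict per seed, cross-seed spread of a metric can be summarized with the standard library alone. A sketch using hand-written results; the metric name "cooperation_rate" is illustrative, not a framework-defined key:

```python
from statistics import mean, stdev

# Sketch: summarize one metric across seeds.
# These results are hand-written stand-ins for runner.run(num_seeds=3) output.
results = [
    {"seed": 0, "metric_results": {"cooperation_rate": 0.70}},
    {"seed": 1, "metric_results": {"cooperation_rate": 0.62}},
    {"seed": 2, "metric_results": {"cooperation_rate": 0.75}},
]

values = [r["metric_results"]["cooperation_rate"] for r in results]
print(f"mean={mean(values):.3f} stdev={stdev(values):.3f}")
```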

Output Structure

results/
├── ExperimentId_aggregate.json         # all seed results
└── trajectories/
    ├── ExperimentId_seed0_cyclic.json
    ├── ExperimentId_seed1_cyclic.json
    └── ExperimentId_seed2_cyclic.json

  • Aggregate file — list of result dicts with risk_results, metric_results, task_result, num_rounds, seed, failure

  • Trajectory files — full message logs for replay and analysis

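The aggregate file is easy to post-process offline. A sketch, assuming the layout described above (a JSON list of per-seed result dicts, each with a failure field); the path in the docstring is illustrative:

```python
import json
from pathlib import Path

def load_aggregate(path):
    """Read an aggregate file, e.g. results/<ExperimentId>_aggregate.json,
    and return the list of per-seed result dicts it contains."""
    return json.loads(Path(path).read_text())

def count_failures(results):
    """Count seeds whose 'failure' field is set (non-empty)."""
    return sum(1 for r in results if r.get("failure"))
```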
Inspecting Configs

Validate and preview a config before running:

python -m risklab.inspect_config path/to/config.yaml -A

See CLI Reference for all inspector flags.