Evaluation

The evaluation module records what happened during an experiment and measures the outcome. It provides three complementary tools: trajectory logging, custom metrics, and rule-based task evaluation.

Trajectory

A Trajectory is an ordered list of TrajectoryStep objects — the complete record of an experiment run. Each step captures a single agent turn, including what the agent observed, what it said, and any side information. Trajectories are the primary input to both risk detectors and metrics.

from risklab.evaluation.trajectory import TrajectoryStep

# Each TrajectoryStep records:
step.round           # int — interaction round number
step.speaker         # str — which agent acted
step.observation     # Any — what the agent observed
step.message         # Any — the raw LLM output
step.action          # Any — the parsed action
step.local_utility   # float | None — per-agent reward
step.system_state    # dict — snapshot of the global state
step.metadata        # dict — additional key-value data
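For intuition, the fields above can be mirrored with a small dataclass. This is a hypothetical stand-in for illustration only; the real TrajectoryStep may differ in defaults, typing, and constructor signature:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class StepSketch:
    # Illustrative stand-in mirroring the fields listed above;
    # not the actual risklab TrajectoryStep class.
    round: int
    speaker: str
    observation: Any = None
    message: Any = None
    action: Any = None
    local_utility: Optional[float] = None
    system_state: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

step = StepSketch(round=0, speaker="seller_0", observation={"prices": [50, 55]})
```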

Trajectory Logger

TrajectoryLogger builds a Trajectory in memory and can flush it to disk as JSON. It is typically used inside ExperimentRunner but can also be used standalone for custom experiment loops.

from risklab.evaluation.logger import TrajectoryLogger

logger = TrajectoryLogger(experiment_id="exp_001", output_dir="results/")

logger.log_step(
    round=0,
    speaker="seller_0",
    observation={"prices": [50, 55]},
    action={"price": 48},
)

# Persist to JSON
path = logger.save("exp_001_seed0.json")

Metrics

Metrics quantify trajectory properties that are not necessarily risks — for example, price convergence speed or message diversity. The framework defines three metric families:

  • Outcome — task success, efficiency

  • Interaction — agreement rate, entropy collapse, repetition

  • Risk indicator — collusion score, drift distance

Subclass Metric to implement your own metrics, then group them into a MetricSuite for batch evaluation:

from risklab.evaluation.metrics import (
    Metric, MetricResult, MetricType, MetricSuite,
)

class PriceConvergence(Metric):
    def __init__(self):
        super().__init__("price_convergence", MetricType.OUTCOME)

    def compute(self, trajectory) -> MetricResult:
        # Calculate convergence ...
        return MetricResult(
            name=self.name,
            metric_type=self.metric_type,
            value=0.85,
        )

suite = MetricSuite()
suite.add(PriceConvergence())
results = suite.evaluate(trajectory)           # list[MetricResult]
flat    = suite.evaluate_as_dict(trajectory)    # {"price_convergence": 0.85}
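One hedged way to fill in the compute body above: treat convergence as how tightly prices cluster in the final rounds relative to the overall spread. The framework does not prescribe this definition; the helper below is purely illustrative:

```python
def price_convergence_score(prices: list[float]) -> float:
    """Score in [0, 1]: 1.0 when the price spread vanishes by the end.

    Compares the spread (max - min) over the last quarter of rounds
    to the spread over the whole series. Illustrative only; the real
    metric may define convergence differently.
    """
    window = max(2, len(prices) // 4)
    total_spread = max(prices) - min(prices)
    if total_spread == 0:
        return 1.0  # prices never moved: trivially converged
    late = prices[-window:]
    late_spread = max(late) - min(late)
    return max(0.0, 1.0 - late_spread / total_spread)
```

A compute implementation would extract the per-round prices from the trajectory's actions, call this helper, and wrap the result in a MetricResult.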

Task Evaluator

RuleBasedTaskEvaluator checks whether agents achieved the task objective defined in TaskConfig.success_criteria. Supported criteria types include task_completed, round_budget, output_match, and numeric_threshold.

from risklab.evaluation.task_evaluator import RuleBasedTaskEvaluator

evaluator = RuleBasedTaskEvaluator()
result = evaluator.evaluate(task_config, trajectory)
# result.success  → bool
# result.score    → float in [0, 1]
# result.details  → dict with per-criterion breakdown
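As a sketch of what a single criterion check and the success/score aggregation might look like, here is a numeric_threshold-style rule in isolation. The evaluator's actual criterion schema is not shown above, so the function names, the op parameter, and the all-criteria/fraction-satisfied aggregation are assumptions:

```python
def check_numeric_threshold(value: float, threshold: float, op: str = ">=") -> bool:
    """Hypothetical check: compare an observed value against a threshold."""
    ops = {
        ">=": value >= threshold,
        "<=": value <= threshold,
        ">":  value > threshold,
        "<":  value < threshold,
    }
    return ops[op]

def aggregate(checks: list[bool]) -> tuple[bool, float]:
    # Assumed policy: success requires every criterion to pass,
    # while score reports the fraction of criteria satisfied.
    score = sum(checks) / len(checks) if checks else 0.0
    return all(checks), score
```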