==========
Evaluation
==========

The evaluation module records what happened during an experiment and measures the outcome. It provides three complementary tools: trajectory logging, custom metrics, and rule-based task evaluation.

Trajectory
----------

A ``Trajectory`` is an ordered list of ``TrajectoryStep`` objects — the complete record of an experiment run. Each step captures a single agent turn, including what the agent observed, what it said, and any side information. Trajectories are the primary input to both risk detectors and metrics.

.. code-block:: python

    from risklab.evaluation.trajectory import TrajectoryStep

    # Each TrajectoryStep records:
    step.round          # int — interaction round number
    step.speaker        # str — which agent acted
    step.observation    # Any — what the agent observed
    step.message        # Any — the raw LLM output
    step.action         # Any — the parsed action
    step.local_utility  # float | None — per-agent reward
    step.system_state   # dict — snapshot of the global state
    step.metadata       # dict — additional key-value data

Trajectory Logger
-----------------

``TrajectoryLogger`` builds a ``Trajectory`` in memory and can flush it to disk as JSON. It is typically used inside ``ExperimentRunner`` but can also be used standalone for custom experiment loops.

.. code-block:: python

    from risklab.evaluation.logger import TrajectoryLogger

    logger = TrajectoryLogger(experiment_id="exp_001", output_dir="results/")

    logger.log_step(
        round=0,
        speaker="seller_0",
        observation={"prices": [50, 55]},
        action={"price": 48},
    )

    # Persist to JSON
    path = logger.save("exp_001_seed0.json")

Metrics
-------

Metrics quantify trajectory properties that are not necessarily risks — for example, price convergence speed or message diversity.
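Because a trajectory is just an ordered sequence of steps, a metric boils down to a pass over that sequence. The following minimal, self-contained sketch illustrates the idea using plain dicts in place of ``TrajectoryStep`` objects; the ``utility_by_speaker`` helper is hypothetical and not part of risklab.

.. code-block:: python

    # Stand-in trajectory: dicts with the same fields a TrajectoryStep exposes.
    trajectory = [
        {"round": 0, "speaker": "seller_0", "local_utility": 0.2},
        {"round": 0, "speaker": "seller_1", "local_utility": 0.1},
        {"round": 1, "speaker": "seller_0", "local_utility": 0.4},
    ]

    def utility_by_speaker(steps):
        """Sum per-agent rewards over a run, skipping steps with no reward."""
        totals: dict[str, float] = {}
        for step in steps:
            if step["local_utility"] is not None:
                totals[step["speaker"]] = (
                    totals.get(step["speaker"], 0.0) + step["local_utility"]
                )
        return totals

    print(utility_by_speaker(trajectory))

A real metric would do the same kind of traversal over ``step.local_utility`` and friends, then wrap the result in a ``MetricResult``.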
The framework defines three metric families:

- **Outcome** — task success, efficiency
- **Interaction** — agreement rate, entropy collapse, repetition
- **Risk indicator** — collusion score, drift distance

Subclass ``Metric`` to implement your own, then group them into a ``MetricSuite`` for batch evaluation:

.. code-block:: python

    from risklab.evaluation.metrics import (
        Metric,
        MetricResult,
        MetricType,
        MetricSuite,
    )

    class PriceConvergence(Metric):
        def __init__(self):
            super().__init__("price_convergence", MetricType.OUTCOME)

        def compute(self, trajectory) -> MetricResult:
            # Calculate convergence ...
            return MetricResult(
                name=self.name,
                metric_type=self.metric_type,
                value=0.85,
            )

    suite = MetricSuite()
    suite.add(PriceConvergence())

    results = suite.evaluate(trajectory)       # list[MetricResult]
    flat = suite.evaluate_as_dict(trajectory)  # {"price_convergence": 0.85}

Task Evaluator
--------------

``RuleBasedTaskEvaluator`` checks whether agents achieved the task objective defined in ``TaskConfig.success_criteria``. Supported criteria types include ``task_completed``, ``round_budget``, ``output_match``, and ``numeric_threshold``.

.. code-block:: python

    from risklab.evaluation.task_evaluator import RuleBasedTaskEvaluator

    evaluator = RuleBasedTaskEvaluator()
    result = evaluator.evaluate(task_config, trajectory)

    # result.success → bool
    # result.score   → float in [0, 1]
    # result.details → dict with per-criterion breakdown
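To make the criterion types concrete, here is a hypothetical sketch of what a single ``numeric_threshold`` check could look like; the page does not specify the criteria schema, so the function, field names, and threshold semantics below are illustrative assumptions, not risklab's actual implementation.

.. code-block:: python

    # Hypothetical: one "numeric_threshold"-style check against the final
    # system_state snapshot of a run. Field names are invented for illustration.
    def check_numeric_threshold(final_state: dict, key: str, threshold: float) -> bool:
        """Pass when the tracked quantity meets or exceeds the threshold."""
        value = final_state.get(key)
        return value is not None and value >= threshold

    final_state = {"total_surplus": 0.72}  # e.g. the last step's system_state
    print(check_numeric_threshold(final_state, "total_surplus", 0.5))

The evaluator's per-criterion breakdown in ``result.details`` would then aggregate the outcomes of checks like this one into the overall ``success`` flag and ``score``.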