Adding an evaluation benchmark¶

This guide walks through creating a new evaluation benchmark for OpenSage.

1. Create the evaluation module¶

Create a directory under benchmarks/ with your benchmark name:

benchmarks/
└── my_benchmark/
    ├── __init__.py
    └── my_evaluation.py
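If you prefer to script the layout above, the following is a minimal sketch. The `scaffold_benchmark` helper is hypothetical and not part of OpenSage; it only creates the two empty files shown in the tree.

```python
from pathlib import Path
import tempfile


def scaffold_benchmark(root: Path, name: str) -> list[Path]:
    """Create benchmarks/<name>/ with an empty __init__.py and my_evaluation.py.

    Hypothetical helper for illustration; OpenSage does not ship it.
    """
    package_dir = root / "benchmarks" / name
    package_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for filename in ("__init__.py", "my_evaluation.py"):
        path = package_dir / filename
        path.touch(exist_ok=True)  # existing files keep their contents
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory instead of the real repository.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_benchmark(Path(tmp), "my_benchmark"):
            print(path.relative_to(tmp))
```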

2. Implement the evaluation class¶

Subclass Evaluation and implement the two required abstract methods:

from __future__ import annotations

from dataclasses import dataclass

import datasets  # Hugging Face datasets; needed for the datasets.Dataset annotation below

from opensage.evaluation.base import Evaluation, EvaluationTask


@dataclass(kw_only=True)
class MyEvaluation(Evaluation):
    """Custom evaluation benchmark."""

    # Required fields
    dataset_path: str = "org/dataset_name"
    agent_dir: str = "examples/agents/my_agent"

    # Optional overrides
    max_llm_calls: int = 100
    max_workers: int = 6
    use_multiprocessing: bool = True
    run_until_explicit_finish: bool = True
    use_sandbox_cache: bool = True

    # Custom fields
    custom_param: str = "default_value"

    # --- Required abstract methods ---

    def _get_sample_id(self, sample: dict) -> str:
        """Extract unique task ID from a dataset sample."""
        return sample["task_id"]

    def _get_user_msg_first(self, sample: dict) -> str:
        """Extract the initial prompt to send to the agent."""
        return sample["prompt"]

    # --- Optional overrides ---

    def _get_dataset(self) -> datasets.Dataset:
        """Custom dataset loading or filtering."""
        dataset = super()._get_dataset()
        # dataset = dataset.filter(lambda x: x["difficulty"] == "hard")
        return dataset

    def _create_task(self, sample: dict) -> EvaluationTask:
        """Attach additional fields to the task if needed."""
        task = super()._create_task(sample)
        return task

    def _get_export_dir_in_sandbox(self, sample: dict) -> str | tuple | None:
        """Sandbox directories to export after execution."""
        return "/output"  # or ("/output1", "/output2")

    def customized_modify_and_save_results(
        self,
        *,
        results: list | None,
        failed_samples: list[str] | None,
        mode: str,
    ) -> None:
        """Post-process and save aggregated results."""
        pass

    def evaluate(self) -> None:
        """Calculate final metrics after all samples complete."""
        pass

Required abstract methods¶

Method	Purpose
`_get_sample_id(sample) -> str`	Extract unique task ID
`_get_user_msg_first(sample) -> str`	Extract initial prompt
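Because every sample passes through these two methods, it can help to fail early on malformed rows. The sketch below shows one way to do that as standalone functions, assuming the dataset uses the `task_id` and `prompt` fields from the example class; the validation rules are illustrative and not something OpenSage requires.

```python
def get_sample_id(sample: dict) -> str:
    """Return the sample's unique ID as a string, or raise if it is missing."""
    task_id = sample.get("task_id")
    if task_id is None or not str(task_id).strip():
        raise KeyError("sample has no usable 'task_id' field")
    return str(task_id)


def get_user_msg_first(sample: dict) -> str:
    """Return the initial prompt, or raise if it is absent or blank."""
    prompt = sample.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError(f"sample {sample.get('task_id')!r} has an empty 'prompt'")
    return prompt


if __name__ == "__main__":
    sample = {"task_id": 7, "prompt": "Fix the failing test."}
    print(get_sample_id(sample), "|", get_user_msg_first(sample))
```

Inside the evaluation class, the two required methods could simply delegate to functions like these.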

Optional methods¶

Method	Purpose
`_get_dataset()`	Load and filter dataset
`_create_task(sample)`	Create task instance
`_get_input_data_path(sample)`	Input data directory
`_get_cache_dir(sample)`	Cache directory
`_get_export_dir_in_sandbox(sample)`	Output dirs to export
`_prepare_general_env()`	Setup shared across all samples
`_before_initialize_hooks(session, task)`	Hooks before sandbox init
`customized_modify_and_save_results(...)`	Post-processing
`evaluate()`	Final evaluation and metrics
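As an illustration of what the post-processing hooks might do, here is a minimal sketch that aggregates per-sample results into a summary file. The result schema (a dict with a boolean `passed` key), the `summary.json` filename, and both helper names are assumptions made for this example; the real structure of `results` depends on your benchmark.

```python
from __future__ import annotations

import json
from pathlib import Path


def summarize_results(results: list[dict] | None, failed_samples: list[str] | None) -> dict:
    """Compute simple aggregate metrics.

    Assumes each result dict carries a boolean 'passed' key (hypothetical schema).
    Samples that failed to run count against the pass rate.
    """
    results = results or []
    failed_samples = failed_samples or []
    total = len(results) + len(failed_samples)
    passed = sum(1 for result in results if result.get("passed"))
    return {
        "total": total,
        "completed": len(results),
        "failed_to_run": len(failed_samples),
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
    }


def save_summary(summary: dict, output_dir: Path) -> Path:
    """Write the summary as JSON and return the path of the written file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "summary.json"
    path.write_text(json.dumps(summary, indent=2))
    return path
```

A `customized_modify_and_save_results` override could call `summarize_results(results, failed_samples)` and then `save_summary(...)` with the run's output directory.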

3. Add a configuration template¶

Create a TOML config next to your agent:

[llm]
model_name = "gemini-2.0-flash-exp"
temperature = 0.7

[sandbox]
[sandbox.main]
type = "docker"
image = "python:3.12"
working_dir = "/workspace"

# Template variables:
# ${TASK_NAME} - Replaced with actual task ID
# ${ABSOLUTE_SHARED_DATA_PATH} - Replaced with absolute input data dir
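To see what the template variables do, the sketch below performs the same kind of substitution with Python's standard `string.Template`, which uses the `${NAME}` syntax. This is only an illustration: how OpenSage performs the replacement internally is not shown here, and the `working_dir` and `shared_data` values in the sample template are hypothetical.

```python
from string import Template

# Hypothetical config fragment that uses both template variables.
CONFIG_TEMPLATE = """\
[sandbox.main]
type = "docker"
image = "python:3.12"
working_dir = "/workspace/${TASK_NAME}"
shared_data = "${ABSOLUTE_SHARED_DATA_PATH}"
"""


def render_config(template_text: str, task_name: str, shared_data_path: str) -> str:
    """Fill in the two template variables.

    substitute() raises KeyError if the template contains an unknown placeholder.
    """
    return Template(template_text).substitute(
        TASK_NAME=task_name,
        ABSOLUTE_SHARED_DATA_PATH=shared_data_path,
    )


if __name__ == "__main__":
    print(render_config(CONFIG_TEMPLATE, "task_001", "/data/task_001"))
```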

4. Registration¶

The evaluation class is automatically registered when imported. The registered name is the lowercase class name:

MyEvaluation is registered as "myevaluation"
Retrieve with get_evaluation_class("myevaluation")

5. Run the evaluation¶

CLI (recommended):

# Run from the repository root so the benchmarks/ package from step 1 is importable.

# Production run
python -m benchmarks.my_benchmark.my_evaluation run \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  --max_workers=6 \
  --output_dir="results/my_benchmark"

# Debug run (single-threaded)
python -m benchmarks.my_benchmark.my_evaluation run_debug \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent"

Python API:

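from benchmarks.my_benchmark.my_evaluation import MyEvaluation

# Named "evaluation" rather than "eval" to avoid shadowing the built-in.
evaluation = MyEvaluation(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
    max_workers=6,
)
evaluation.run()        # production
evaluation.run_debug()  # debugging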

See execution modes for the full list of methods.
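The practical difference between the two modes is parallel versus sequential execution. The sketch below illustrates that idea with a thread pool from the standard library; it is a simplified stand-in, not the OpenSage implementation (which also has a `use_multiprocessing` option), and `run_sample` is a hypothetical placeholder for the per-sample work.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sample(sample: dict) -> str:
    """Hypothetical placeholder for running the agent on one sample."""
    return f"done:{sample['task_id']}"


def run(samples: list[dict], max_workers: int = 6) -> list[str]:
    """Production-style run: process samples concurrently.

    pool.map returns results in input order, regardless of completion order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_sample, samples))


def run_debug(samples: list[dict]) -> list[str]:
    """Debug-style run: one sample at a time, so tracebacks and breakpoints stay readable."""
    return [run_sample(sample) for sample in samples]


if __name__ == "__main__":
    samples = [{"task_id": f"t{i}"} for i in range(4)]
    print(run(samples, max_workers=2))
    print(run_debug(samples))
```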

Sample lifecycle¶

Each sample goes through six phases:

1. Task creation (_create_task) -- Convert dataset sample to EvaluationTask
2. Environment preparation (_prepare_environment) -- Create session, launch sandboxes, restore cache
3. Agent preparation (_prepare_agent) -- Load agent from agent_dir
4. Agent execution (_run_agent) -- Run with configured limits
5. Output collection (_collect_outputs) -- Export sandbox outputs, save traces and cost info
6. Cleanup -- Stop sandboxes, close session

For the full internal details, see workflow details.
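The ordering of the phases can be sketched as a single function. Everything here is a toy stand-in: the dicts replace real tasks and sessions, the "agent" just upper-cases the prompt, and running cleanup in a `finally` block is a design choice of this sketch rather than a documented guarantee of OpenSage.

```python
def toy_agent(prompt: str) -> str:
    """Hypothetical agent: returns the prompt in upper case."""
    return prompt.upper()


def run_one_sample(sample: dict, log: list[str]) -> dict:
    """Walk one sample through the six phases, recording each phase in `log`."""
    task = {"id": sample["task_id"], "prompt": sample["prompt"]}
    log.append("create_task")

    session = {"sandboxes": ["main"], "open": True}
    log.append("prepare_environment")
    try:
        agent = toy_agent
        log.append("prepare_agent")

        answer = agent(task["prompt"])
        log.append("run_agent")

        outputs = {"task_id": task["id"], "answer": answer}
        log.append("collect_outputs")
        return outputs
    finally:
        # Cleanup runs even if an earlier phase raised.
        session["open"] = False
        log.append("cleanup")


if __name__ == "__main__":
    phases: list[str] = []
    print(run_one_sample({"task_id": "t1", "prompt": "hello"}, phases))
    print(phases)
```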

Existing examples¶

Example	Description
`src/opensage/evaluation/base.py`	Base `Evaluation` / `EvaluationTask` implementation
`benchmarks/cybergym/cybergym_static.py`	Static CyberGym evaluation
`benchmarks/cybergym/cybergym_dynamic.py`	Dynamic CyberGym evaluation
`benchmarks/cybergym/cybergym_vul_detection.py`	Vulnerability-detection CyberGym evaluation
`benchmarks/swe_bench_pro/swe_bench_pro.py`	SWE-Bench Pro benchmark entry point