Evaluations
Evaluations run agents on benchmark datasets to measure and test their performance. The evaluation system is built on the Evaluation base class, which handles parallel execution, sandbox lifecycle, result collection, and metrics.
Running an evaluation
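Evaluations use Python Fire for their CLI. The general form is python -m followed by the evaluation module path, a method name (such as run), and --option=value flags.
Example: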
python -m opensage.evaluations.cybergym.cybergym_vul_detection run \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
--max_workers=6 \
--use_multiprocessing=true
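
When runs are launched from a script rather than typed by hand, the same command line can be assembled programmatically. The following is a minimal sketch, not part of OpenSage: `build_eval_command` is a hypothetical helper, and it assumes only the `python -m <module> <method> --key=value` shape shown in the example above.

```python
import shlex
import sys


def build_eval_command(module: str, method: str = "run", **options) -> list[str]:
    """Build an argv list of the form: python -m <module> <method> --key=value ...

    Hypothetical helper for illustration; it only mirrors the CLI shape
    shown in the example command and is not an OpenSage API.
    """
    argv = [sys.executable, "-m", module, method]
    for key, value in options.items():
        # Render booleans in lowercase to match `--use_multiprocessing=true`.
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv.append(f"--{key}={value}")
    return argv


cmd = build_eval_command(
    "opensage.evaluations.cybergym.cybergym_vul_detection",
    "run",
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
    max_workers=6,
    use_multiprocessing=True,
)

# Print a shell-safe rendering of the command for logging or copy-paste.
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the evaluation without going through a shell.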
Execution modes
| Method | Mode | Best for |
|---|---|---|
| run | Auto-select (multiprocessing or threading based on use_multiprocessing), then evaluate() | Production runs |
| run_debug | Single-threaded, then evaluate() | Debugging |
| generate | Multiprocessing only (no evaluation step) | Explicit parallelism |
| generate_threaded | Threading only (no evaluation step) | When multiprocessing has serialization issues |
| generate_single_thread | Sequential (no evaluation step) | Step-by-step debugging |
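
The relationship between these modes can be sketched with the standard library's `concurrent.futures`. This is an illustrative sketch only, not the Evaluation base class: the function names `generate_parallel`, `generate_sequential`, and `run_sketch` are hypothetical, and `square` stands in for running one task.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square(x: int) -> int:
    """Stand-in for running one task. Defined at module level so that a
    process pool can pickle it (the serialization issue the table mentions)."""
    return x * x


def generate_parallel(tasks, worker, max_workers=6, use_multiprocessing=True):
    """Run every task in a process pool or a thread pool, without evaluating."""
    executor_cls = ProcessPoolExecutor if use_multiprocessing else ThreadPoolExecutor
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the tasks.
        return list(pool.map(worker, tasks))


def generate_sequential(tasks, worker):
    """Run tasks one at a time, which is easiest to step through in a debugger."""
    return [worker(task) for task in tasks]


def run_sketch(tasks, worker, evaluate, **kwargs):
    """Generate results with the selected backend, then apply an evaluation step."""
    results = generate_parallel(tasks, worker, **kwargs)
    return evaluate(results)


if __name__ == "__main__":
    # Threading backend, with sum() as a toy evaluation step.
    print(run_sketch([1, 2, 3], square, evaluate=sum, use_multiprocessing=False))
```

Threads share the interpreter's memory, so workers and their arguments need not be picklable; that is the usual reason to fall back from multiprocessing to threading.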
Configuration options
| Option | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | Required | HuggingFace dataset or local path |
| agent_dir | str | Required | Directory containing agent.py |
| max_llm_calls | int | 100 | Maximum LLM calls per task |
| max_workers | int | 6 | Parallel workers |
| use_multiprocessing | bool | True | Use multiprocessing vs threading |
| use_sandbox_cache | bool | True | Cache/restore sandbox states |
| run_until_explicit_finish | bool | True | Keep running until agent signals done |
| use_config_model | bool | False | Use model from config file |
| llm_retry_count | int | 3 | Retries for LLM API calls |
| llm_retry_timeout | int | 30 | Timeout per LLM request (seconds) |
| log_level | str | "INFO" | Terminal log level |
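
For reference, the options and defaults in the table can be mirrored in a small dataclass. This is a sketch under stated assumptions: `EvalParams` is a hypothetical name, the validation rules are illustrative additions, and the JSON printed here is not claimed to match the layout of the real eval_params.json.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalParams:
    """Hypothetical container mirroring the documented options and defaults."""

    dataset_path: str                       # required
    agent_dir: str                          # required
    max_llm_calls: int = 100
    max_workers: int = 6
    use_multiprocessing: bool = True
    use_sandbox_cache: bool = True
    run_until_explicit_finish: bool = True
    use_config_model: bool = False
    llm_retry_count: int = 3
    llm_retry_timeout: int = 30             # seconds
    log_level: str = "INFO"

    def __post_init__(self):
        # Illustrative sanity checks; these are assumptions, not documented rules.
        if self.max_workers < 1:
            raise ValueError("max_workers must be at least 1")
        if self.llm_retry_count < 0:
            raise ValueError("llm_retry_count must not be negative")


params = EvalParams(dataset_path="org/dataset", agent_dir="examples/agents/my_agent")

# Serialize the effective parameters, defaults included.
print(json.dumps(asdict(params), indent=2))
```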
Output structure
Each run creates a timestamped output directory:
evals/
└── myevaluation/
    └── yymmdd_HHMMSS/
        ├── evaluation_master.log      # Master log
        ├── eval_params.json           # Parameters used
        ├── task_001/
        │   ├── execution_debug.log    # DEBUG-level log
        │   ├── execution_info.log     # INFO-level log
        │   ├── config_used.toml       # Config for this task
        │   ├── cost_info.json         # Token usage and costs
        │   ├── session_trace.json     # Complete session events
        │   ├── session_trace.txt      # Human-readable trace
        │   ├── metadata.json          # Task metadata
        │   ├── sandbox_output/        # Exported from sandbox
        │   └── neo4j_history/         # Neo4j database export
        └── task_002/
            └── ...
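
After a run, it can be useful to check which task directories are missing expected files, for example because a task crashed early. The sketch below assumes only the file names shown in the tree above and treats every subdirectory of the run directory as a task directory; `missing_artifacts` is a hypothetical helper, and the demo builds a throwaway directory rather than reading a real run.

```python
import tempfile
from pathlib import Path

# Per-task files listed in the output tree.
EXPECTED_FILES = (
    "execution_debug.log",
    "execution_info.log",
    "config_used.toml",
    "cost_info.json",
    "session_trace.json",
    "session_trace.txt",
    "metadata.json",
)


def missing_artifacts(run_dir: Path) -> dict[str, list[str]]:
    """Map each task directory name to the expected files it lacks.

    Assumes every subdirectory of run_dir is a task directory.
    Tasks with all expected files are omitted from the result.
    """
    report = {}
    for task_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        missing = [name for name in EXPECTED_FILES if not (task_dir / name).is_file()]
        if missing:
            report[task_dir.name] = missing
    return report


# Demo on a temporary directory: task_001 is complete, task_002 lacks cost_info.json.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for task, skip in (("task_001", None), ("task_002", "cost_info.json")):
        (run_dir / task).mkdir()
        for name in EXPECTED_FILES:
            if name != skip:
                (run_dir / task / name).touch()
    demo_report = missing_artifacts(run_dir)

print(demo_report)
```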
Pages in this section
- Workflow details -- Step-by-step internals of how each evaluation runs
- Adding evaluations -- How to create a new benchmark