The big picture

When you run a benchmark, Benchspan does this for each task:
1. Spins up a Docker container with the benchmark environment
2. Injects your agent code at /runner/
3. Sets environment variables (problem statement, working dir, your API keys)
4. Calls: bash /runner/runner.sh
5. Your agent runs, solves the problem, exits
6. Benchspan grades the result and collects artifacts
7. Container destroyed
Your runner.sh is the only thing you write. Everything else is handled by the platform.
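Putting the lifecycle together, a minimal runner.sh skeleton might look like the sketch below. The agent invocation is a hypothetical placeholder, and the fallback values exist only so the sketch runs outside the container (in a real run the platform sets these variables):

```shell
#!/usr/bin/env bash
# Minimal runner.sh skeleton. In a real run the platform sets
# PROBLEM_STATEMENT, WORKING_DIR, and OUTPUT_DIR (steps 2-3); the
# fallbacks below only let this sketch run standalone.
set -euo pipefail
: "${PROBLEM_STATEMENT:=example problem}"
: "${WORKING_DIR:=.}"
: "${OUTPUT_DIR:=/tmp}"

cd "$WORKING_DIR"

# Hypothetical agent entry point -- replace with your own:
# my-agent --task "$PROBLEM_STATEMENT"

# Anything printed to stdout is captured for grading (step 6).
echo "FINAL ANSWER: 42"
```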

What your agent receives

When runner.sh starts, these environment variables are set:
| Variable | What it is | Example |
|---|---|---|
| `$PROBLEM_STATEMENT` | The task to solve | "Fix the bug in QuerySet.bulk_create()..." |
| `$WORKING_DIR` | Directory with files to work on | `/app` or `/testbed` |
| `$OUTPUT_DIR` | Where to write logs and telemetry | `/output` |
| `$INSTANCE_ID` | Unique task identifier | "django__django-11099" |
| Your env vars | Whatever you set on the dashboard | `LLM_API_KEY`, `LLM_MODEL`, etc. |
Env vars on the dashboard have no naming restrictions. Set whatever your agent expects — ANTHROPIC_API_KEY, LLM_API_KEY, MY_CUSTOM_VAR — it all works.
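A sketch of how dashboard variables arrive at runtime. LLM_API_KEY and LLM_MODEL are example names only, not required by the platform, and the fallback value exists just so the sketch runs standalone:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dashboard env vars arrive as ordinary environment variables.
# LLM_API_KEY / LLM_MODEL are example names -- use whatever your agent
# expects. The fallback below only lets this sketch run standalone.
: "${LLM_MODEL:=gpt-4o}"

if [ -z "${LLM_API_KEY:-}" ]; then
  # Fail loudly and early rather than with a cryptic crash mid-run.
  echo "warning: LLM_API_KEY is not set; add it on the dashboard" >&2
fi

echo "instance: ${INSTANCE_ID:-unknown}"
echo "model:    $LLM_MODEL"
```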

What your agent produces

Two output channels matter:
- Stdout: the platform captures everything your runner.sh prints to stdout. Many benchmarks (math, QA, reasoning) grade this output. If your agent's answer isn't on stdout, it can't be graded.
- Files in $WORKING_DIR: for coding benchmarks (SWEbench, HumanEvalFix), the platform checks which files your agent changed. Your agent edits code; the platform diffs it and runs tests.

Optionally, write $OUTPUT_DIR/trajectory.json with token usage and tool-call data for analytics.
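A sketch covering both channels in one script. The trajectory.json fields shown are assumptions for illustration, not a documented schema:

```shell
#!/usr/bin/env bash
set -euo pipefail
: "${OUTPUT_DIR:=/tmp/benchspan-demo}"   # set by the platform in a real run
mkdir -p "$OUTPUT_DIR"

# Channel 1: the graded answer goes to stdout.
echo "ANSWER: 42"

# Channel 2 (optional): telemetry for analytics. The field names here
# are assumptions -- check the platform docs for the actual schema.
cat > "$OUTPUT_DIR/trajectory.json" <<'EOF'
{"tokens_used": 1234, "tool_calls": 5}
EOF
```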

Two integration patterns

Published package

Your runner.sh installs your agent from a package registry (PyPI, npm) and runs it. The agent directory contains just runner.sh.
agents/my-agent/
└── runner.sh
benchspan run --benchmark swebench \
  --agent ./agents/my-agent
Best for: stable releases, agents published to a registry.
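A sketch of the published-package pattern. Here my-agent is a hypothetical package name, and the install and run lines are shown as comments because they depend on your registry and CLI:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Published-package pattern: the agent directory contains only this
# runner.sh, which installs a released version and runs it.
# "my-agent" is a hypothetical package name -- substitute your own.
AGENT_PKG="my-agent==1.2.3"

# In the real container you would run something like:
#   pip install --quiet "$AGENT_PKG"
#   cd "$WORKING_DIR" && my-agent solve --task "$PROBLEM_STATEMENT"
echo "installing $AGENT_PKG"
```

Pinning an exact version keeps runs reproducible; an unpinned install can silently change behavior between benchmark runs.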

Build from source

Your runner.sh lives at the root of your repo. The entire repo is packaged and built inside the container.
my-agent-repo/
├── runner.sh           ← you add this
├── .benchspanignore    ← exclude large files
├── pyproject.toml
├── src/
└── ...
benchspan run --benchmark swebench \
  --agent /path/to/my-agent-repo
Best for: active development, benchmarking HEAD of your codebase.
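A sketch of the build-from-source pattern. The module name my_agent is hypothetical, and the install and run lines are comments for the same reason as above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build-from-source pattern: the whole repo ships into the container,
# so runner.sh can install the local checkout and run it.
REPO_DIR="$(cd "$(dirname "$0")" && pwd)"   # directory containing runner.sh

# In the real container you would run something like:
#   pip install --quiet -e "$REPO_DIR"
#   cd "$WORKING_DIR" && python -m my_agent "$PROBLEM_STATEMENT"
echo "repo root: $REPO_DIR"
```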

The container environment

| Property | Value |
|---|---|
| OS | Linux (amd64) |
| User | root (by default) |
| Network | Full internet access |
| Stdin | Not available; your agent must run non-interactively |
| Pre-installed | bash; varies by benchmark |
| Not guaranteed | curl, git, python3, node; install what you need |
The container is destroyed after each task. Nothing persists between tasks.
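Since only bash is guaranteed, a defensive preamble at the top of runner.sh can check for the tools your agent needs. The apt-get line assumes a Debian-based image, which is common but not promised by the platform, so it is left commented:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Only bash is guaranteed pre-installed; probe for everything else.
need() { command -v "$1" >/dev/null 2>&1; }

for tool in curl git python3; do
  if ! need "$tool"; then
    echo "missing: $tool"
    # Assumes a Debian/Ubuntu base image -- verify before uncommenting:
    # apt-get update -qq && apt-get install -y -qq "$tool"
  fi
done
```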