The big picture
When you run a benchmark, Benchspan does this for each task:
runner.sh is the only thing you write. Everything else is handled by the platform.
What your agent receives
When runner.sh starts, these environment variables are set:

| Variable | What it is | Example |
|---|---|---|
| $PROBLEM_STATEMENT | The task to solve | "Fix the bug in QuerySet.bulk_create()..." |
| $WORKING_DIR | Directory with files to work on | /app or /testbed |
| $OUTPUT_DIR | Where to write logs and telemetry | /output |
| $INSTANCE_ID | Unique task identifier | "django__django-11099" |
| Your env vars | Whatever you set on the dashboard | LLM_API_KEY, LLM_MODEL, etc. |
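
As a minimal sketch of how a runner.sh might confirm the variables in the table above before doing any work: the helper name `missing_vars` and the `REQUIRED_VARS` array are illustrative, not part of the platform, and the check relies on bash indirect expansion.

```shell
#!/usr/bin/env bash
# Variables the platform sets before runner.sh starts (per the table above).
REQUIRED_VARS=(PROBLEM_STATEMENT WORKING_DIR OUTPUT_DIR INSTANCE_ID)

# Print the name of every required variable that is unset or empty.
# Returns 0 when all are present, 1 otherwise.
# (Hypothetical helper; the name is not defined by the platform.)
missing_vars() {
  local name status=0
  for name in "${REQUIRED_VARS[@]}"; do
    # ${!name} expands the variable whose name is stored in $name.
    if [ -z "${!name:-}" ]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}
```

A runner.sh could call `missing_vars` first and exit with an error if it prints anything. Add your own dashboard variables to `REQUIRED_VARS` if your agent cannot run without them.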
Env vars on the dashboard have no naming restrictions. Set whatever your agent expects (ANTHROPIC_API_KEY, LLM_API_KEY, MY_CUSTOM_VAR); it all works.
What your agent produces
Two output channels matter:
Stdout: The platform captures everything your runner.sh prints to stdout. Many benchmarks (math, QA, reasoning) grade based on this output. If your agent's answer isn't on stdout, it can't be graded.
Files in $WORKING_DIR: For coding benchmarks (SWEbench, HumanEvalFix), the platform checks which files your agent changed. Your agent edits the code; the platform diffs it and runs the tests.
Optionally, write $OUTPUT_DIR/trajectory.json with token usage and tool call data for analytics.
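
The helpers below sketch one way to keep these channels separate. The JSON field names (`input_tokens`, `output_tokens`, `tool_calls`) and the log file name `agent.log` are assumptions made for illustration; this page does not specify the trajectory.json schema, so check what the platform actually expects before relying on them.

```shell
#!/usr/bin/env bash
# Print the final answer on stdout, the channel the platform captures.
emit_answer() {
  printf '%s\n' "$1"
}

# Append diagnostics to a log file under $OUTPUT_DIR so they do not
# end up on stdout. "agent.log" is a hypothetical file name.
log() {
  mkdir -p "$OUTPUT_DIR"
  printf '%s\n' "$*" >> "$OUTPUT_DIR/agent.log"
}

# Write $OUTPUT_DIR/trajectory.json from three integer counts.
# The field names are illustrative assumptions, not a documented schema.
write_trajectory() {
  local input_tokens="$1" output_tokens="$2" tool_calls="$3"
  mkdir -p "$OUTPUT_DIR"
  printf '{"input_tokens": %d, "output_tokens": %d, "tool_calls": %d}\n' \
    "$input_tokens" "$output_tokens" "$tool_calls" \
    > "$OUTPUT_DIR/trajectory.json"
}
```

For a QA-style benchmark, runner.sh would call `log` while the agent works and `emit_answer` once at the end. For a coding benchmark, the edits made under $WORKING_DIR are the result.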
Two integration patterns
Published package
Best for: stable releases, agents published to a registry.
Your runner.sh installs your agent from pip or npm and runs it. The agent directory contains just runner.sh.
Build from source
Best for: active development, benchmarking HEAD of your codebase.
Your runner.sh lives at the root of your repo. The entire repo is packaged and built inside the container.
The container environment
| Property | Value |
|---|---|
| OS | Linux (amd64) |
| User | root (by default) |
| Network | Full internet access |
| Stdin | Not available — your agent must run non-interactively |
| Pre-installed | bash; other tools vary by benchmark |
| Not guaranteed | curl, git, python3, node — install what you need |
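
Because tools such as curl or git may be absent, a runner.sh can check for them before use. The sketch below assumes a Debian or Ubuntu based image where `apt-get` is available; that is an assumption, not something the table guarantees, and other base images need a different package manager. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 if the named command is on PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# Print each tool from the argument list that is not installed.
missing_tools() {
  local tool
  for tool in "$@"; do
    have "$tool" || echo "$tool"
  done
}

# Install whatever is missing. Assumes apt-get exists (Debian/Ubuntu).
# Stdin is redirected from /dev/null because the container has no stdin.
install_missing() {
  local missing
  missing="$(missing_tools "$@")"
  [ -z "$missing" ] && return 0
  apt-get update -qq </dev/null || return 1
  # $missing is intentionally unquoted so each name becomes one argument.
  DEBIAN_FRONTEND=noninteractive apt-get install -y -qq $missing </dev/null
}
```

A call such as `install_missing curl git python3` near the top of runner.sh would cover the common cases. Command names and package names can differ (on Debian the `node` command comes from the `nodejs` package), so map them explicitly where needed.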