The big picture

When you run a benchmark, Benchspan does this for each task:
1. Spins up a Docker container with the benchmark environment
2. Injects your agent code at /runner/
3. Sets environment variables (problem statement, working dir, your API keys)
4. Calls: bash /runner/runner.sh
5. Your agent runs, solves the problem, exits
6. Benchspan grades the result and collects artifacts
7. Container destroyed
Your runner.sh is the only thing you write. Everything else is handled by the platform.
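Putting the lifecycle together, a minimal runner.sh skeleton might look like the sketch below. The agent invocation is a hypothetical placeholder, and the fallback values exist only so the sketch runs outside the container (in a real run the platform sets these variables):

```shell
#!/usr/bin/env bash
# Minimal runner.sh skeleton. In a real run the platform sets
# PROBLEM_STATEMENT, WORKING_DIR, and OUTPUT_DIR (steps 2-3); the
# fallbacks below only let this sketch run standalone.
set -euo pipefail
: "${PROBLEM_STATEMENT:=example problem}"
: "${WORKING_DIR:=.}"
: "${OUTPUT_DIR:=/tmp}"

cd "$WORKING_DIR"

# Hypothetical agent entry point -- replace with your own:
# my-agent --task "$PROBLEM_STATEMENT"

# Anything printed to stdout is captured for grading (step 6).
echo "FINAL ANSWER: 42"
```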

What your agent receives

When runner.sh starts, these environment variables are set:
| Variable | What it is | Example |
|---|---|---|
| `$PROBLEM_STATEMENT` | The task to solve | "Fix the bug in QuerySet.bulk_create()..." |
| `$WORKING_DIR` | Directory with files to work on | `/app` or `/testbed` |
| `$OUTPUT_DIR` | Where to write logs and telemetry | `/output` |
| `$INSTANCE_ID` | Unique task identifier | "django__django-11099" |
| Your env vars | Whatever you set on the dashboard | `LLM_API_KEY`, `LLM_MODEL`, etc. |
Env vars on the dashboard have no naming restrictions. Set whatever your agent expects — ANTHROPIC_API_KEY, LLM_API_KEY, MY_CUSTOM_VAR — it all works.
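A sketch of how dashboard variables arrive at runtime. LLM_API_KEY and LLM_MODEL are example names only, not required by the platform, and the fallback value exists just so the sketch runs standalone:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dashboard env vars arrive as ordinary environment variables.
# LLM_API_KEY / LLM_MODEL are example names -- use whatever your agent
# expects. The fallback below only lets this sketch run standalone.
: "${LLM_MODEL:=gpt-4o}"

if [ -z "${LLM_API_KEY:-}" ]; then
  # Fail loudly and early rather than with a cryptic crash mid-run.
  echo "warning: LLM_API_KEY is not set; add it on the dashboard" >&2
fi

echo "instance: ${INSTANCE_ID:-unknown}"
echo "model:    $LLM_MODEL"
```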

What your agent produces

Two output channels matter:
- Stdout: the platform captures everything your runner.sh prints to stdout. Many benchmarks (math, QA, reasoning) grade this output. If your agent's answer isn't on stdout, it can't be graded.
- Files in $WORKING_DIR: for coding benchmarks (SWEbench, HumanEvalFix), the platform checks which files your agent changed. Your agent edits code; the platform diffs it and runs tests.

Optionally, write $OUTPUT_DIR/trajectory.json with token usage and tool-call data for analytics.
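A sketch covering both channels in one script. The trajectory.json fields shown are assumptions for illustration, not a documented schema:

```shell
#!/usr/bin/env bash
set -euo pipefail
: "${OUTPUT_DIR:=/tmp/benchspan-demo}"   # set by the platform in a real run
mkdir -p "$OUTPUT_DIR"

# Channel 1: the graded answer goes to stdout.
echo "ANSWER: 42"

# Channel 2 (optional): telemetry for analytics. The field names here
# are assumptions -- check the platform docs for the actual schema.
cat > "$OUTPUT_DIR/trajectory.json" <<'EOF'
{"tokens_used": 1234, "tool_calls": 5}
EOF
```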

Two integration patterns

Published package

Your runner.sh installs your agent from a package registry (PyPI, npm) and runs it. The agent directory contains just runner.sh.
agents/my-agent/
└── runner.sh
benchspan run --benchmark swebench \
  --agent ./agents/my-agent
Best for: stable releases, agents published to a registry.
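A sketch of the published-package pattern. Here my-agent is a hypothetical package name, and the install and run lines are shown as comments because they depend on your registry and CLI:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Published-package pattern: the agent directory contains only this
# runner.sh, which installs a released version and runs it.
# "my-agent" is a hypothetical package name -- substitute your own.
AGENT_PKG="my-agent==1.2.3"

# In the real container you would run something like:
#   pip install --quiet "$AGENT_PKG"
#   cd "$WORKING_DIR" && my-agent solve --task "$PROBLEM_STATEMENT"
echo "installing $AGENT_PKG"
```

Pinning an exact version keeps runs reproducible; an unpinned install can silently change behavior between benchmark runs.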

Build from source

Your runner.sh lives at the root of your repo. The entire repo is packaged and built inside the container.
my-agent-repo/
├── runner.sh           ← you add this
├── .benchspanignore    ← exclude large files
├── pyproject.toml
├── src/
└── ...
benchspan run --benchmark swebench \
  --agent /path/to/my-agent-repo
Best for: active development, benchmarking HEAD of your codebase.
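A sketch of the build-from-source pattern. The module name my_agent is hypothetical, and the install and run lines are comments for the same reason as above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build-from-source pattern: the whole repo ships into the container,
# so runner.sh can install the local checkout and run it.
REPO_DIR="$(cd "$(dirname "$0")" && pwd)"   # directory containing runner.sh

# In the real container you would run something like:
#   pip install --quiet -e "$REPO_DIR"
#   cd "$WORKING_DIR" && python -m my_agent "$PROBLEM_STATEMENT"
echo "repo root: $REPO_DIR"
```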

The container environment

| Property | Value |
|---|---|
| OS | Linux (amd64) |
| User | root (by default) |
| Network | Full internet access |
| Stdin | Not available; your agent must run non-interactively |
| Pre-installed | bash; varies by benchmark |
| Not guaranteed | curl, git, python3, node; install what you need |
The container is destroyed after each task. Nothing persists between tasks.
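Since only bash is guaranteed, a defensive preamble at the top of runner.sh can check for the tools your agent needs. The apt-get line assumes a Debian-based image, which is common but not promised by the platform, so it is left commented:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Only bash is guaranteed pre-installed; probe for everything else.
need() { command -v "$1" >/dev/null 2>&1; }

for tool in curl git python3; do
  if ! need "$tool"; then
    echo "missing: $tool"
    # Assumes a Debian/Ubuntu base image -- verify before uncommenting:
    # apt-get update -qq && apt-get install -y -qq "$tool"
  fi
done
```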