List runs

benchspan runs list
benchspan runs list --benchmark swebench
benchspan runs list --tag "v2-haiku"
Shows all your runs with resolve rate, instance count, and timestamps.

Show run details

benchspan runs show <run_id>
Shows per-instance results: resolved/failed status, score, token usage, and latency.

Compare two runs

benchspan runs compare <run_a> <run_b>
Side-by-side diff of two runs showing which instances improved, regressed, or stayed the same. Useful for A/B testing model changes or agent improvements.
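The improved/regressed/same classification can be illustrated locally. A minimal sketch (not the CLI's actual implementation), assuming each run is summarized as a mapping of instance ID to resolved status:

```python
def diff_runs(run_a, run_b):
    """Classify instances present in both runs as improved, regressed, or same.

    run_a / run_b: dicts mapping instance ID -> resolved (True/False).
    Hypothetical in-memory shape; the real CLI reads stored run results.
    """
    shared = run_a.keys() & run_b.keys()
    improved = sorted(i for i in shared if not run_a[i] and run_b[i])
    regressed = sorted(i for i in shared if run_a[i] and not run_b[i])
    same = sorted(i for i in shared if run_a[i] == run_b[i])
    return {"improved": improved, "regressed": regressed, "same": same}

# Hypothetical instance IDs for illustration:
a = {"x-1": False, "x-2": True, "x-3": True}
b = {"x-1": True, "x-2": True, "x-3": False}
print(diff_runs(a, b))
# → {'improved': ['x-1'], 'regressed': ['x-3'], 'same': ['x-2']}
```

Only instances present in both runs are compared; instances added or dropped between runs fall outside all three buckets.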

Watch a run in progress

benchspan runs watch <run_id>
Live-updates in your terminal as instances complete. Shows progress, resolve rate, and errors in real time. Exits when the run finishes.

Download logs and artifacts

benchspan runs download <run_id>
Downloads all run artifacts (logs, scores, trajectory files) to a local directory.

Options

Flag         Short  Description
--failed     -f     Download only failed (unresolved) instances
--errored    -e     Download only errored instances
--instance   -i     Download a specific instance by ID
--output     -o     Directory to save the download (default: .)

Examples

# Download everything
benchspan runs download <run_id>

# Only failed instances (most common for debugging)
benchspan runs download <run_id> --failed

# A specific instance
benchspan runs download <run_id> -i django__django-11099

# Save to a specific directory
benchspan runs download <run_id> --failed -o ./debug

What you get

Each instance directory contains logs, scores, and any artifacts your agent wrote to $OUTPUT_DIR. The key files:
  • 100_runner.log — your runner.sh stdout + stderr. Start here when debugging.
  • 200_scoring.log — verifier/grading output. Shows why it passed or failed.
  • score.json — resolved status and score.
You may also see trajectory.json, reward.txt, and any other files your runner.sh wrote to $OUTPUT_DIR.
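Once downloaded, the per-instance score.json files can be rolled up into a quick local summary. A minimal sketch, assuming each instance directory sits directly under the download root and that score.json contains at least a boolean "resolved" field (an assumed schema; check your own files):

```python
import json
from pathlib import Path

def summarize(download_dir):
    """Scan instance directories for score.json and tally resolved status."""
    resolved, failed = [], []
    for score_file in Path(download_dir).glob("*/score.json"):
        data = json.loads(score_file.read_text())
        # Assumes a boolean "resolved" key; adjust to the schema you actually see.
        (resolved if data.get("resolved") else failed).append(score_file.parent.name)
    total = len(resolved) + len(failed)
    return {
        "resolve_rate": len(resolved) / total if total else 0.0,
        "failed": sorted(failed),
    }
```

Point it at the directory you passed to -o; the failed instance IDs come back sorted, which keeps diffs stable when comparing summaries across runs.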

Cancel a run

benchspan runs cancel <run_id>
Cancels all pending instances in a run. Instances already running will finish, but no new ones will start.