List runs

benchspan runs list
benchspan runs list --benchmark swebench
benchspan runs list --tag "v2-haiku"
Shows all your runs with resolve rate, instance count, and timestamps.

Show run details

benchspan runs show <run_id>
Shows per-instance results: resolved/failed status, score, token usage, and latency.

Compare two runs

benchspan runs compare <run_a> <run_b>
Side-by-side diff of two runs showing which instances improved, regressed, or stayed the same. Useful for A/B testing model changes or agent improvements.
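The improved/regressed/same classification can be illustrated locally. A minimal sketch (not the CLI's actual implementation), assuming each run is summarized as a mapping of instance ID to resolved status:

```python
def diff_runs(run_a, run_b):
    """Classify instances present in both runs as improved, regressed, or same.

    run_a / run_b: dicts mapping instance ID -> resolved (True/False).
    Hypothetical in-memory shape; the real CLI reads stored run results.
    """
    shared = run_a.keys() & run_b.keys()
    improved = sorted(i for i in shared if not run_a[i] and run_b[i])
    regressed = sorted(i for i in shared if run_a[i] and not run_b[i])
    same = sorted(i for i in shared if run_a[i] == run_b[i])
    return {"improved": improved, "regressed": regressed, "same": same}

# Hypothetical instance IDs for illustration:
a = {"x-1": False, "x-2": True, "x-3": True}
b = {"x-1": True, "x-2": True, "x-3": False}
print(diff_runs(a, b))
# → {'improved': ['x-1'], 'regressed': ['x-3'], 'same': ['x-2']}
```

Only instances present in both runs are compared; instances added or dropped between runs fall outside all three buckets.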

Watch a run in progress

benchspan runs watch <run_id>
Live-updates in your terminal as instances complete. Shows progress, resolve rate, and errors in real time. Exits when the run finishes.

Download logs and artifacts

benchspan runs download <run_id>
Downloads all run artifacts (logs, scores, trajectory files) to a local directory.

Options

Flag         Short  Description
--failed     -f     Download only failed (unresolved) instances
--errored    -e     Download only errored instances
--instance   -i     Download a specific instance by ID
--output     -o     Directory to save the download (default: .)

Examples

# Download everything
benchspan runs download <run_id>

# Only failed instances (most common for debugging)
benchspan runs download <run_id> --failed

# A specific instance
benchspan runs download <run_id> -i django__django-11099

# Save to a specific directory
benchspan runs download <run_id> --failed -o ./debug

What you get

Each instance directory contains logs, scores, and any artifacts your agent wrote to $OUTPUT_DIR. The key files:
  • 100_runner.log — your runner.sh stdout + stderr. Start here when debugging.
  • 200_scoring.log — verifier/grading output. Shows why it passed or failed.
  • score.json — resolved status and score.
You may also see trajectory.json, reward.txt, and any other files your runner.sh wrote to $OUTPUT_DIR.
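Once downloaded, the per-instance score.json files can be rolled up into a quick local summary. A minimal sketch, assuming each instance directory sits directly under the download root and that score.json contains at least a boolean "resolved" field (an assumed schema; check your own files):

```python
import json
from pathlib import Path

def summarize(download_dir):
    """Scan instance directories for score.json and tally resolved status."""
    resolved, failed = [], []
    for score_file in Path(download_dir).glob("*/score.json"):
        data = json.loads(score_file.read_text())
        # Assumes a boolean "resolved" key; adjust to the schema you actually see.
        (resolved if data.get("resolved") else failed).append(score_file.parent.name)
    total = len(resolved) + len(failed)
    return {
        "resolve_rate": len(resolved) / total if total else 0.0,
        "failed": sorted(failed),
    }
```

Point it at the directory you passed to -o; the failed instance IDs come back sorted, which keeps diffs stable when comparing summaries across runs.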

Cancel a run

benchspan runs cancel <run_id>
Cancels all pending instances in a run. Instances already running will finish, but no new ones will start.