Benchspan is a platform for benchmarking AI agents. You write a single runner.sh that tells us how to install and run your agent. We handle everything else: Docker containers, benchmark infrastructure, scoring, and analytics.

How it works

Your agent  →  runner.sh  →  Benchspan  →  Results
  1. You write a runner.sh that installs and runs your agent
  2. We package it and run it against benchmark tasks in isolated Docker containers
  3. Each task gives your agent a problem statement and a working directory
  4. Your agent solves the problem, and we grade the result
  5. You get scores, token usage, latency, and detailed logs
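
To make steps 1 to 3 concrete, here is a minimal sketch of what a runner.sh could look like. It is an illustration under stated assumptions, not Benchspan's documented interface: the PROBLEM_FILE and TASK_DIR variables, the install_agent and run_agent function names, and the placeholder commands are all hypothetical. Check the onboarding guide for how the problem statement and working directory are actually passed to your script.

```shell
#!/bin/sh
# Hypothetical runner.sh sketch. PROBLEM_FILE, TASK_DIR, and both function
# names are illustrative assumptions, not Benchspan's documented interface.
set -eu

# Install step: placeholder for your agent's real install command.
install_agent() {
  echo "installing agent (placeholder)"
}

# Run step: $1 is the problem statement file, $2 is the task's working directory.
run_agent() {
  if [ ! -f "$1" ]; then
    echo "problem statement not found: $1" >&2
    return 1
  fi
  # Stand-in for the real agent call: report the first line of the task
  # from inside the working directory.
  first_line="$(head -n 1 "$1")"
  ( cd "$2" && printf 'agent task: %s\n' "$first_line" )
}

# Entry point. A real runner would probably fail when no task is supplied;
# this sketch only warns so the file can be sourced and tried safely.
if [ -n "${PROBLEM_FILE:-}" ]; then
  install_agent
  run_agent "$PROBLEM_FILE" "${TASK_DIR:-$PWD}"
else
  echo "PROBLEM_FILE not set; nothing to run" >&2
fi
```

Keeping the install and run steps in separate functions makes it easy to swap the placeholders for your agent's real commands without touching the rest of the script.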

Get started

Quickstart

Install the CLI, log in, and run your first benchmark in 5 minutes.

Onboard your agent

Write a runner.sh for your agent and verify it works with the healthcheck benchmark.

Built-in agents

Run pre-configured agents like Claude Code or OpenHands without writing any code.

CLI Reference

Full reference for all Benchspan CLI commands.