runner.sh that tells us how to run your agent. We handle everything else — Docker containers, benchmark infrastructure, scoring, and analytics.
How it works
- You write a runner.sh that installs and runs your agent
- We package it and run it against benchmark tasks in isolated Docker containers
- Each task gives your agent a problem statement and a working directory
- Your agent solves the problem, we grade the result
- You get scores, token usage, latency, and detailed logs
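As a rough illustration of the first step, a runner.sh might have an install phase followed by a run phase. This is only a sketch: the variable names PROBLEM_FILE and WORK_DIR, the solution.txt output file, and the placeholder agent are assumptions made for this example, not Benchspan's documented interface. Check the onboarding guide for how the problem statement and working directory are actually passed in.

```shell
#!/usr/bin/env bash
# Hypothetical runner.sh sketch. PROBLEM_FILE, WORK_DIR, and solution.txt are
# illustrative assumptions, not names defined by Benchspan.
set -euo pipefail

# Assumed inputs: the problem statement file and the task's working directory.
PROBLEM_FILE="${PROBLEM_FILE:-./problem.txt}"
WORK_DIR="${WORK_DIR:-$PWD}"

install_agent() {
  # Placeholder install step; a real script would install the agent's
  # dependencies here (for example with pip or npm).
  echo "installing agent (placeholder)"
}

run_agent() {
  # Placeholder agent: reads the problem statement (or a fallback message if
  # the file is missing) and writes a result into the working directory.
  local problem
  problem="$(cat "$PROBLEM_FILE" 2>/dev/null || echo "no problem statement found")"
  printf 'solved: %s\n' "$problem" > "$WORK_DIR/solution.txt"
}

install_agent
run_agent
```

Keeping installation and execution in separate functions makes it easier to see which phase failed when reading the logs from a run.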
Get started
Quickstart
Install the CLI, log in, and run your first benchmark in 5 minutes.
Onboard your agent
Write a runner.sh for your agent and verify it works with the healthcheck benchmark.
Built-in agents
Run pre-configured agents like Claude Code or OpenHands without writing any code.
CLI Reference
Full reference for all Benchspan CLI commands.