The agent-healthcheck benchmark runs 10 simple tasks that stress-test your runner.sh across different environments. Tasks are trivial (to save tokens) — the environment is what’s being tested.

Run it

# Quick smoke test — 2 tasks, ~30 seconds
benchspan run --benchmark agent-healthcheck.quick --agent ./my-agent

# Full suite — 10 tasks
benchspan run --benchmark agent-healthcheck --agent ./my-agent

Subsets

Pick the subset that matches your agent:
| Subset | Tasks | Who should run it |
| --- | --- | --- |
| quick | 2 | Everyone: fast smoke test |
| universal | 7 | All agents: stdout capture, env vars, edge cases, missing deps |
| coding | 3 | Coding agents: file create, file edit, read-only files |
All agents should pass universal. Coding agents should also pass coding.

What each task tests

| Task | Environment | What it catches |
| --- | --- | --- |
| echo-answer | Standard | Agent output not reaching stdout |
| env-vars | Standard | Env vars not accessible to agent |
| special-chars | Standard | Quoting issues with $PROBLEM_STATEMENT |
| large-problem | Standard | Agent breaks on large inputs (~16KB) |
| no-python3 | No Python | runner.sh crashes when Python is missing |
| no-git | No git | runner.sh crashes when git is missing |
| conda-env | Conda base | PATH issues with conda (bash -lc vs bash -c) |
| file-create | Standard | Agent can’t create files in working directory |
| file-edit | Existing file | Agent can’t edit existing files |
| readonly-files | Read-only file | Agent crashes on permission errors |
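
Two of these failure modes are easy to reproduce locally. The sketch below is illustrative only: `count_args` and `have` are hypothetical helpers, and the sample problem text is made up. It shows why $PROBLEM_STATEMENT should be expanded inside double quotes, and how a runner.sh might probe for an optional tool instead of assuming it exists.

```shell
#!/bin/sh
# Hypothetical sample input with characters that commonly break quoting.
PROBLEM_STATEMENT='Fix the "parser" so $HOME and * are kept literally'

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell word-splits the text (and may glob-expand the *),
# so the agent would receive many separate arguments.
count_args $PROBLEM_STATEMENT

# Quoted: the whole statement arrives as exactly one argument.
count_args "$PROBLEM_STATEMENT"

# Hypothetical helper: succeeds only if the command is found on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Degrade gracefully instead of crashing when a tool is absent.
# The messages go to stderr so they do not pollute the agent's stdout.
if have python3; then
  echo "python3 available" >&2
else
  echo "python3 missing, skipping Python-based setup" >&2
fi
```

The same `have` check works for git or any other tool the runner treats as optional.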

Check results

benchspan runs show <run_id>

When something fails

Download the logs to see what happened:
# Download only failed instances
benchspan runs download <run_id> --failed

# Or a specific instance
benchspan runs download <run_id> -i echo-answer
Then read 100_runner.log (your runner.sh output) and 200_scoring.log (the verifier output):
cat run_<id>/echo-answer/100_runner.log
cat run_<id>/echo-answer/200_scoring.log
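
If several instances failed, a small loop saves opening each log by hand. This is a sketch that assumes the downloaded layout matches the paths above (one directory per instance, each holding 100_runner.log and 200_scoring.log). The function name `show_logs` and the 20-line tail are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical helper: print the tail of both logs for every instance
# directory found under a downloaded run directory.
show_logs() {
  run_dir="$1"
  for instance_dir in "$run_dir"/*/; do
    # If the glob matched nothing it stays literal, so skip non-directories.
    [ -d "$instance_dir" ] || continue
    echo "=== $(basename "$instance_dir") ==="
    for log in 100_runner.log 200_scoring.log; do
      if [ -f "$instance_dir$log" ]; then
        echo "--- $log ---"
        tail -n 20 "$instance_dir$log"
      else
        echo "--- $log (missing) ---"
      fi
    done
  done
}
```

Call it with the downloaded run directory as its only argument, quoted so the shell does not interpret any special characters in the name.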
Common patterns:
| What you see | What’s wrong | How to fix |
| --- | --- | --- |
| Output empty | Agent stdout swallowed | Use tee instead of > redirect |
| Output full of install noise | Install output on stdout | Redirect installs to >/dev/null 2>&1 |
| command not found | Missing system dep | `which curl \|\| apt-get install curl` |
| Timeout | Agent waiting for input | Add --headless or --non-interactive flag |
| Wrong directory edited | Agent working on /runner/ | Set agent’s work dir to $WORKING_DIR |
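
Several of these fixes can sit side by side in one runner.sh. The following is a minimal sketch, not a canonical runner: `run_agent` and `setup` stand in for your real agent and install commands, `agent_output.log` is an arbitrary file name, and the defaults assigned to WORKING_DIR and PROBLEM_STATEMENT exist only so the snippet runs outside the benchmark.

```shell
#!/bin/sh
# Defaults so the sketch runs standalone; the benchmark is assumed to set both.
WORKING_DIR="${WORKING_DIR:-$(mktemp -d)}"
PROBLEM_STATEMENT="${PROBLEM_STATEMENT:-say hello}"

# Hypothetical stand-in for your real agent command.
run_agent() { printf 'answer for: %s\n' "$1"; }

# Hypothetical stand-in for a noisy install step.
setup() { echo "installing dependencies..."; }

# Silence install output so it never mixes with the agent's answer.
setup >/dev/null 2>&1

# Work inside $WORKING_DIR so edits land where the verifier looks.
cd "$WORKING_DIR" || exit 1

# tee keeps a copy on disk while still passing the output to stdout;
# a plain > redirect would leave stdout empty.
run_agent "$PROBLEM_STATEMENT" | tee agent_output.log
```

The design point is that stdout carries only the agent's output, while everything else is either discarded or written to a file.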
For more, see Common Issues.