Testing with Healthcheck

The agent-healthcheck benchmark runs 10 simple tasks that stress-test your runner.sh across different environments. Tasks are trivial (to save tokens) — the environment is what’s being tested.

Run it

# Quick smoke test — 2 tasks, ~30 seconds
benchspan run --benchmark agent-healthcheck.quick --agent ./my-agent

# Full suite — 10 tasks
benchspan run --benchmark agent-healthcheck --agent ./my-agent

Subsets

Pick the subset that matches your agent:

Subset	Tasks	Who should run it
`quick`	2	Everyone — fast smoke test
`universal`	7	All agents — stdout capture, env vars, edge cases, missing deps
`coding`	3	Coding agents — file create, file edit, read-only files

All agents should pass universal. Coding agents should also pass coding.

What each task tests

Task	Environment	What it catches
`echo-answer`	Standard	Agent output not reaching stdout
`env-vars`	Standard	Env vars not accessible to agent
`special-chars`	Standard	Quoting issues with `$PROBLEM_STATEMENT`
`large-problem`	Standard	Agent breaks on large inputs (~16KB)
`no-python3`	No Python	Runner.sh crashes when Python is missing
`no-git`	No git	Runner.sh crashes when git is missing
`conda-env`	Conda base	PATH issues with conda (`bash -lc` vs `bash -c`)
`file-create`	Standard	Agent can’t create files in working directory
`file-edit`	Existing file	Agent can’t edit existing files
`readonly-files`	Read-only file	Agent crashes on permission errors

Check results

benchspan runs show <run_id>

When something fails

Download the logs to see what happened:

# Download only failed instances
benchspan runs download <run_id> --failed

# Or a specific instance
benchspan runs download <run_id> -i echo-answer

Then read 100_runner.log (your runner.sh output) and 200_scoring.log (the verifier output):

cat run_<id>/echo-answer/100_runner.log
cat run_<id>/echo-answer/200_scoring.log

Common patterns:

What you see	What’s wrong	How to fix
Output empty	Agent stdout swallowed	Use `tee` instead of `>` redirect
Output full of install noise	Install output on stdout	Redirect installs to `>/dev/null 2>&1`
`command not found`	Missing system dep	`which curl \|\| apt-get install curl`
Timeout	Agent waiting for input	Add `--headless` or `--non-interactive` flag
Wrong directory edited	Agent working on `/runner/`	Set agent’s work dir to `$WORKING_DIR`

For more, see Common Issues.

Getting Started

Onboard Your Agent

Built-in Agents

Benchmarks

Testing with Healthcheck

Run it

Subsets

What each task tests

Check results

When something fails

Getting Started

Onboard Your Agent

Built-in Agents

Benchmarks

​Run it

​Subsets

​What each task tests

​Check results

​When something fails

Run it

Subsets

What each task tests

Check results

When something fails