The agent-healthcheck benchmark runs 10 simple tasks that stress-test your runner.sh across different environments. The tasks are deliberately trivial to save tokens; the environment is what is being tested.
Run it
Subsets
Pick the subset that matches your agent:

| Subset | Tasks | Who should run it |
|---|---|---|
| quick | 2 | Everyone: fast smoke test |
| universal | 7 | All agents: stdout capture, env vars, edge cases, missing deps |
| coding | 3 | Coding agents: file create, file edit, read-only files |
All agents should pass universal. Coding agents should also pass coding.
What each task tests
| Task | Environment | What it catches |
|---|---|---|
| echo-answer | Standard | Agent output not reaching stdout |
| env-vars | Standard | Env vars not accessible to agent |
| special-chars | Standard | Quoting issues with $PROBLEM_STATEMENT |
| large-problem | Standard | Agent breaks on large inputs (~16KB) |
| no-python3 | No Python | runner.sh crashes when Python is missing |
| no-git | No git | runner.sh crashes when git is missing |
| conda-env | Conda base | PATH issues with conda (bash -lc vs bash -c) |
| file-create | Standard | Agent can’t create files in working directory |
| file-edit | Existing file | Agent can’t edit existing files |
| readonly-files | Read-only file | Agent crashes on permission errors |
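
Several of these failures (special-chars, no-python3, no-git) come down to defensive shell habits in runner.sh. The following is a minimal sketch, not the benchmark's own code: the helper names `have`, `pick_python`, and `show_problem` are hypothetical, and it assumes only that the problem text arrives in `$PROBLEM_STATEMENT` as described above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a runner.sh; all function names are illustrative.

# have: succeed if a command exists on PATH, printing nothing either way.
have() {
  command -v "$1" >/dev/null 2>&1
}

# pick_python: print the first Python interpreter found, or return 1 quietly
# when none is installed (the no-python3 environment).
pick_python() {
  for candidate in python3 python; do
    if have "$candidate"; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# show_problem: print a string unchanged.
# Double-quoting "$1" prevents word splitting and glob expansion, and
# printf '%s' avoids echo's special handling of backslashes and leading dashes.
show_problem() {
  printf '%s' "$1"
}

# Optional tools are probed rather than assumed, so a missing git
# (the no-git environment) does not abort the run.
if have git; then
  git_available=yes
else
  git_available=no
fi
```

In a real runner.sh, the call would look like `show_problem "$PROBLEM_STATEMENT"`, always with the double quotes; steps that need Python or git would be skipped when `pick_python` fails or `git_available` is `no`.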
Check results
When something fails
Download the logs to see what happened: 100_runner.log (your runner.sh output) and 200_scoring.log (the verifier output). Common symptoms and fixes:
| What you see | What’s wrong | How to fix |
|---|---|---|
| Output empty | Agent stdout swallowed | Use tee instead of a > redirect |
| Output full of install noise | Install output on stdout | Redirect installs to >/dev/null 2>&1 |
| command not found | Missing system dep | which curl \|\| apt-get install -y curl |
| Timeout | Agent waiting for input | Add the agent’s --headless or --non-interactive flag |
| Wrong directory edited | Agent working in /runner/ | Set agent’s work dir to $WORKING_DIR |
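
The fixes in the table can be combined into two small helpers. This is a sketch under stated assumptions: `quiet` and `run_agent` are hypothetical names, and redirecting stdin from /dev/null is an extra guard against agents that wait for input, not something the benchmark requires.

```shell
# quiet: run a setup command with all of its output discarded, so install
# noise never reaches the stdout that gets scored.
quiet() {
  "$@" >/dev/null 2>&1
}

# run_agent: run a command from the given working directory, printing its
# stdout and also copying it to a log file.
# Usage: run_agent <work_dir> <log_file> <command> [args...]
run_agent() {
  work_dir=$1
  log_file=$2
  shift 2
  # The subshell keeps the cd from leaking into the rest of the script.
  # stdin comes from /dev/null, so a command that tries to read input sees
  # end-of-file immediately instead of blocking until the timeout.
  # Note: the pipeline's exit status is tee's, not the command's.
  ( cd "$work_dir" && "$@" </dev/null ) | tee "$log_file"
}
```

A call might look like `quiet apt-get install -y curl` followed by `run_agent "$WORKING_DIR" agent.log my-agent --headless "$PROBLEM_STATEMENT"`, where `my-agent` and its flag are placeholders for your agent's actual command line.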