Available benchmarks
| Category | Benchmarks |
|---|---|
| Coding | SWEbench, SWEbench-Pro, HumanEvalFix, AutoCodeBench, CRUSTBench, QuixBugs |
| Reasoning | AIME, GPQA-Diamond, IneqMath, ReasoningGym |
| Function Calling | BFCL |
| QA | SimpleQA, MMMLU, MMAU |
| Multi-modal | ARC-AGI-2 |
| Data Science | DS1000, DABStep |
| Other | GAIA, LawBench, QCircuitBench, ReplicationBench, SatBench, StrongReject, USACO |

Onboarding new benchmarks
Benchmark onboarding is currently white-glove. If you have a specific benchmark you’d like to run on Benchspan, reach out to us and we’ll work with you to get it set up.

Contact us
Email avi@benchspan.com with your benchmark requirements.