After your agent runs, you can write a trajectory.json file to $OUTPUT_DIR with token usage, tool calls, and latency data. This powers the analytics on your dashboard — cost tracking, token breakdowns, latency percentiles, and tool call patterns. It’s optional. Your agent will work fine without it. But if you want to understand what your agent is doing across hundreds of benchmark instances, trajectory data is how you get there.

The format

We designed trajectory.json to be minimal and flexible. Only two fields are required:
{
  "schema_version": "1.0",
  "instance_id": "django__django-11099"
}
Everything else is optional. Add what you have:
{
  "schema_version": "1.0",
  "instance_id": "django__django-11099",
  "model": "claude-sonnet-4-6",
  "total_tokens": 48500,
  "prompt_tokens": 36000,
  "completion_tokens": 12500,
  "total_latency_ms": 95000,
  "cache_read_tokens": 14000,
  "cache_write_tokens": 4000,
  "steps": [
    {
      "step": 1,
      "type": "tool_call",
      "tool": "Bash",
      "input": {"command": "find /testbed -name '*.py' | head"},
      "output_tokens": 42,
      "latency_ms": 310,
      "cache_hit": true
    },
    {
      "step": 2,
      "type": "tool_call",
      "tool": "Edit",
      "input": {"file_path": "/testbed/django/forms/widgets.py"},
      "output_tokens": 89,
      "latency_ms": 520,
      "cache_hit": false
    }
  ]
}
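Before uploading, it can be worth sanity-checking a trajectory file against the schema. A minimal sketch (the `validate_trajectory` helper is ours, not part of any SDK; it checks only the rules stated on this page):

```python
import json

REQUIRED = ("schema_version", "instance_id")

def validate_trajectory(path):
    """Return a list of problems; an empty list means the file is usable."""
    with open(path) as f:
        traj = json.load(f)
    problems = [k for k in REQUIRED if k not in traj]
    # Only flag schema_version's value if the key is present at all.
    if traj.get("schema_version") not in (None, "1.0"):
        problems.append("schema_version must be '1.0'")
    # Steps, when present, must be 1-indexed and in order.
    for i, step in enumerate(traj.get("steps", []), start=1):
        if step.get("step") != i:
            problems.append(f"steps must be 1-indexed and ordered (step {i})")
    return problems
```

Run it against `$OUTPUT_DIR/trajectory.json` before your runner exits to catch malformed output early.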

Field reference

Top-level fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| schema_version | string | Yes | Always "1.0" |
| instance_id | string | Yes | From $INSTANCE_ID env var |
| model | string | No | Model name/ID used |
| total_tokens | int | No | prompt_tokens + completion_tokens |
| prompt_tokens | int | No | Total input tokens |
| completion_tokens | int | No | Total output tokens |
| total_latency_ms | int | No | Wall-clock time in milliseconds |
| cache_read_tokens | int | No | Tokens served from cache |
| cache_write_tokens | int | No | Tokens written to cache |
| steps | array | No | Ordered list of agent actions |

Step fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| step | int | Yes (if steps) | 1-indexed step number |
| type | string | Yes (if steps) | "tool_call", "model_call", or "observation" |
| tool | string | No | Tool name (Bash, Edit, Read, etc.) |
| input | any | No | Input to the tool or model |
| output_tokens | int | No | Tokens produced in this step |
| latency_ms | int | No | Wall-clock time for this step |
| cache_hit | bool | No | Whether this step hit a cache |
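The two tables can be mirrored as type hints if your converter is Python. A sketch (these `TypedDict` names are ours; `total=False` reflects that everything beyond the two required fields is optional):

```python
from typing import Any, List, TypedDict

class Step(TypedDict, total=False):
    step: int           # required when steps is present; 1-indexed
    type: str           # "tool_call", "model_call", or "observation"
    tool: str           # Bash, Edit, Read, etc.
    input: Any
    output_tokens: int
    latency_ms: int
    cache_hit: bool

class Trajectory(TypedDict, total=False):
    schema_version: str  # always "1.0"; required
    instance_id: str     # from $INSTANCE_ID; required
    model: str
    total_tokens: int
    prompt_tokens: int
    completion_tokens: int
    total_latency_ms: int
    cache_read_tokens: int
    cache_write_tokens: int
    steps: List[Step]
```

A `TypedDict` is still a plain dict at runtime, so the result serializes with `json.dump` unchanged; the annotations just let a type checker catch misspelled field names.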

What the dashboard computes

From trajectory data across all instances in a run, the dashboard shows:
  • Resolve rate — instances solved / total
  • Token usage — avg, p50, p95 across instances
  • Latency — avg, p50, p95 wall-clock time
  • Tool calls — avg per instance, breakdown by tool name
  • Cache hit rate — across all steps
  • Cost — estimated from token counts and model pricing
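To see roughly how those aggregates fall out of the trajectory files, here is a sketch that computes a few of them locally. It assumes one `trajectory.json` per instance subdirectory (the layout and the `run_stats` helper are ours, not the dashboard's actual code):

```python
import json
from glob import glob
from statistics import mean, quantiles

def run_stats(run_dir):
    """Aggregate token, latency, tool-call, and cache stats across instances."""
    tokens, latencies, tool_calls = [], [], {}
    hits = steps_total = 0
    for path in glob(f"{run_dir}/*/trajectory.json"):
        with open(path) as f:
            traj = json.load(f)
        if "total_tokens" in traj:
            tokens.append(traj["total_tokens"])
        if "total_latency_ms" in traj:
            latencies.append(traj["total_latency_ms"])
        for step in traj.get("steps", []):
            steps_total += 1
            hits += bool(step.get("cache_hit"))
            if step.get("type") == "tool_call":
                name = step.get("tool", "?")
                tool_calls[name] = tool_calls.get(name, 0) + 1

    def pct(xs, p):
        # statistics.quantiles needs >= 2 points; fall back gracefully.
        return quantiles(xs, n=100)[p - 1] if len(xs) > 1 else (xs[0] if xs else 0)

    return {
        "avg_tokens": mean(tokens) if tokens else 0,
        "p95_tokens": pct(tokens, 95),
        "avg_latency_ms": mean(latencies) if latencies else 0,
        "tool_calls": tool_calls,
        "cache_hit_rate": hits / steps_total if steps_total else 0.0,
    }
```

Percentiles across instances (rather than steps) are what make outlier instances visible: one runaway instance barely moves the average but shows up immediately in p95.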

Writing a converter

Your agent probably already logs its output in some format — JSONL events, structured logs, a custom format. You don’t need to change your agent. Just write a small converter that runs at the end of runner.sh and transforms your agent’s native output into trajectory.json. Here’s the pattern:
#!/bin/bash
set -uo pipefail

# ... install and run your agent ...
my-agent --task "$PROBLEM_STATEMENT" \
  2>"$OUTPUT_DIR/agent_stderr.log" | tee "$OUTPUT_DIR/agent_output.log"

# Convert your agent's output format to trajectory.json
python3 << 'PYEOF'
import json, os

output_path = os.environ["OUTPUT_DIR"] + "/agent_output.log"
traj_path = os.environ["OUTPUT_DIR"] + "/trajectory.json"
instance_id = os.environ["INSTANCE_ID"]

# Parse your agent's native format
steps = []
total_input = 0
total_output = 0

with open(output_path) as f:
    for line in f:
        try:
            obj = json.loads(line.strip())
        except (json.JSONDecodeError, ValueError):
            continue
        # TODO: adapt this to your agent's event format
        # Example: extract tool calls and token counts

traj = {
    "schema_version": "1.0",
    "instance_id": instance_id,
    "model": "your-model-name",
    "total_tokens": total_input + total_output,
    "prompt_tokens": total_input,
    "completion_tokens": total_output,
    "steps": steps,
}
with open(traj_path, "w") as f:
    json.dump(traj, f, indent=2)
PYEOF
This converter is a one-time thing. Write it once, and it works for every benchmark run.
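You can also exercise the converter outside the harness by faking its inputs. A sketch (here a stand-in writes the minimal fallback so the snippet is self-contained; in practice you would paste in your converter heredoc from runner.sh):

```shell
# Fake the env vars the harness would normally provide.
export OUTPUT_DIR="$(mktemp -d)"
export INSTANCE_ID="django__django-11099"

# A fake agent log in whatever format your agent emits (JSONL here).
printf '%s\n' '{"type":"tool_call","tool":"Bash"}' > "$OUTPUT_DIR/agent_output.log"

# Stand-in for your converter heredoc from runner.sh.
python3 - << 'PYEOF'
import json, os
traj = {"schema_version": "1.0", "instance_id": os.environ["INSTANCE_ID"], "steps": []}
with open(os.environ["OUTPUT_DIR"] + "/trajectory.json", "w") as f:
    json.dump(traj, f, indent=2)
PYEOF

# Confirm the output parses and carries the two required fields.
python3 -c '
import json, os
traj = json.load(open(os.environ["OUTPUT_DIR"] + "/trajectory.json"))
assert traj["schema_version"] == "1.0"
assert traj["instance_id"] == os.environ["INSTANCE_ID"]
'
```

A few seconds of this locally beats discovering a malformed trajectory after a multi-hour benchmark run.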

Real examples

Claude Code (stream-json format)

Claude Code outputs JSONL with type: "assistant" and type: "result" events:
messages = []
result = None
with open(output_path) as f:
    for line in f:
        try:
            obj = json.loads(line.strip())
        except json.JSONDecodeError:
            continue
        if obj.get("type") == "result":
            result = obj
        elif obj.get("type") == "assistant":
            messages.append(obj)

steps = []
for msg in messages:
    for block in msg.get("message", {}).get("content", []):
        if block.get("type") == "tool_use":
            steps.append({
                "step": len(steps) + 1,
                "type": "tool_call",
                "tool": block.get("name", ""),
            })

usage = result.get("usage", {}) if result else {}
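Continuing the fragment above, the parsed `usage` and `steps` can then be mapped onto trajectory fields. A sketch with placeholder inputs standing in for the parsed values; the cache key names (`cache_read_input_tokens`, `cache_creation_input_tokens`) match what recent Claude Code versions emit, but verify them against your own logs:

```python
# Placeholder inputs standing in for the values parsed above.
instance_id = "django__django-11099"
steps = [{"step": 1, "type": "tool_call", "tool": "Bash"}]
usage = {"input_tokens": 36000, "output_tokens": 12500,
         "cache_read_input_tokens": 14000, "cache_creation_input_tokens": 4000}

# Map the result event's usage block onto trajectory fields.
traj = {
    "schema_version": "1.0",
    "instance_id": instance_id,
    "prompt_tokens": usage.get("input_tokens", 0),
    "completion_tokens": usage.get("output_tokens", 0),
    "total_tokens": usage.get("input_tokens", 0) + usage.get("output_tokens", 0),
    "cache_read_tokens": usage.get("cache_read_input_tokens", 0),
    "cache_write_tokens": usage.get("cache_creation_input_tokens", 0),
    "steps": steps,
}
```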

Ante (event stream format)

Ante outputs JSONL with event.ToolStart and event.UsageUpdate:
for line in f:
    try:
        obj = json.loads(line.strip())
    except json.JSONDecodeError:
        continue
    event = obj.get("event", {})
    if "ToolStart" in event:
        steps.append({
            "step": len(steps) + 1,
            "type": "tool_call",
            "tool": event["ToolStart"].get("name", ""),
        })
    if "UsageUpdate" in event:
        usage = event["UsageUpdate"].get("usage", {})
        total_input += usage.get("input_tokens", 0)
        total_output += usage.get("output_tokens", 0)

If you have nothing to parse

Write a minimal trajectory so the dashboard has something:
echo '{"schema_version":"1.0","instance_id":"'"$INSTANCE_ID"'","total_tokens":0,"steps":[]}' \
  > "$OUTPUT_DIR/trajectory.json"
Better than nothing — it still tells the dashboard the instance ran.
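If `$INSTANCE_ID` could ever contain characters that break shell quoting, building the same fallback through `json.dump` sidesteps the problem. A sketch (the defaulting lines exist only so the snippet can be tried standalone; in runner.sh the harness sets both variables):

```shell
# $OUTPUT_DIR and $INSTANCE_ID are provided by the harness; default them
# here only so the snippet can be tried standalone.
: "${OUTPUT_DIR:=$(mktemp -d)}"
: "${INSTANCE_ID:=local-test}"
export OUTPUT_DIR INSTANCE_ID

python3 -c '
import json, os
traj = {"schema_version": "1.0",
        "instance_id": os.environ["INSTANCE_ID"],
        "total_tokens": 0,
        "steps": []}
with open(os.path.join(os.environ["OUTPUT_DIR"], "trajectory.json"), "w") as f:
    json.dump(traj, f)
'
```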