
BenchSpan

BenchSpan runs AI agent benchmarks in parallel, captures scores and failures in run history, and uses commit-tagged executions to improve reproducibility.


What is BenchSpan?

BenchSpan helps teams run AI agent benchmarks in a way that’s faster, more reproducible, and easier to share. Instead of manually wiring an agent into different benchmark harnesses and copying results into scattered files, BenchSpan standardizes benchmark execution and funnels scores, errors, and timing into an organized run history.

Its core purpose is to reduce the time and cost of running benchmark suites, including large sweeps of hundreds of instances, while improving trust in results: every run is tied to your agent’s commit hash, and runs can be compared side by side.

Key Features

  • Benchmark runner that standardizes agent setup via a shell script: BenchSpan can run agents that start via a bash command, minimizing glue code and harness-specific interface work.
  • Benchmark library plus bring-your-own benchmarks: You can choose from an included set of benchmarks or bring your own benchmark definition.
  • Parallel execution in isolated Docker containers: Each benchmark instance runs in its own Docker container, and instances execute in parallel so large suites finish far faster than sequential runs.
  • Automatic result capture and organization: BenchSpan captures scores, trajectories, errors, and timing, then organizes them for later comparison.
  • Commit-tagged runs for reproducibility and comparison: Results are tagged with the agent’s commit hash so teams can compare different runs and know what code produced which numbers.
  • Rerun only failed instances: If a run encounters partial failures (e.g., network errors or rate limits on some instances), you can retry only the failed subset rather than re-running everything.
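The parallel, per-container execution model above can be sketched in bash. This is an illustrative sketch, not BenchSpan's actual internals: the `docker run` call is commented out and replaced by a placeholder write so the sketch runs anywhere, and the image name and result layout are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Each instance gets its own isolated run; results land in one directory.
RESULTS="$(mktemp -d)"

run_instance() {
  # In a real setup each instance would execute in its own container, e.g.:
  #   docker run --rm bench-image run-instance "$1"
  # Placeholder result capture so the sketch is runnable without Docker:
  echo '{"score": 1.0}' > "$RESULTS/$1.json"
}
export -f run_instance
export RESULTS

# Fan four instances out across up to four parallel workers.
printf '%s\n' i1 i2 i3 i4 | xargs -P4 -I{} bash -c 'run_instance "$1"' _ {}

ls "$RESULTS"
```

The key design point the sketch captures is that parallelism plus per-instance isolation means one misbehaving instance cannot corrupt another's environment or results.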

How to Use BenchSpan

  1. Onboard your agent by writing a bash script that starts your agent, then point BenchSpan to it.
  2. Select a benchmark from BenchSpan’s library or use a benchmark you provide.
  3. Run the suite by setting the number of instances and starting the run; BenchSpan executes instances in parallel using Docker containers.
  4. Review results in the organized output, then compare runs using the commit hash tags. If some instances failed, rerun only those failed instances.
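Step 1 above amounts to a small entry-point script. The sketch below is purely illustrative: the argument convention and the commented-out agent command are assumptions, since BenchSpan's exact script contract isn't documented here.

```shell
#!/usr/bin/env bash
# Hypothetical agent entry point -- BenchSpan only needs a bash command
# that starts your agent.
set -euo pipefail

# Assumed convention: the benchmark instance ID arrives as the first argument.
INSTANCE_ID="${1:-demo-instance}"
echo "starting agent for ${INSTANCE_ID}"

# Replace the line above with your agent's real start command, e.g.:
#   exec python3 -m my_agent --task "${INSTANCE_ID}"
```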

Use Cases

  • Comparing agent iterations during development: Run a benchmark suite after updating prompts or agent code, then compare resolve rates and failure patterns across commits.
  • Scaling SWE-style evaluations across hundreds of instances: Execute large benchmark suites that would be impractical to run sequentially, where parallel Docker execution reduces total runtime.
  • Recovering from partial failures without starting over: When some instances fail due to rate limits or timeouts, rerun just the failed instances instead of repeating a full suite.
  • Team collaboration on benchmark claims: Share a single benchmark run record with your team so results aren’t lost in separate spreadsheets or chat messages.
  • Testing agents with different underlying prompts or configurations: Track which prompt version and code commit produced which results via commit-tagged runs, helping avoid disputes over “which config” was used.

FAQ

  • What kind of agent does BenchSpan support? The site states that “any agent that runs via bash” can work, meaning you can start the agent with a shell command and BenchSpan will integrate through that.

  • Do benchmarks run sequentially or in parallel? BenchSpan runs benchmark instances in parallel, with each instance isolated in its own Docker container.

  • How does BenchSpan handle failed runs? If some instances fail, BenchSpan can rerun only the failed instances rather than requiring a full restart of the entire suite.
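A minimal sketch of that retry-the-failed-subset idea, assuming a one-status-file-per-instance result layout (the layout is invented for illustration; BenchSpan's real format may differ):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fake run history: one status file per instance.
RUN="$(mktemp -d)"
echo ok    > "$RUN/i1"
echo error > "$RUN/i2"
echo ok    > "$RUN/i3"
echo error > "$RUN/i4"

# Collect only the instances that failed...
FAILED="$(grep -l '^error$' "$RUN"/* | xargs -n1 basename)"

# ...and retry just that subset, leaving successful instances untouched.
for id in $FAILED; do
  echo ok > "$RUN/$id"   # stand-in for re-running the instance
done
```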

  • How are results organized for comparison? Results (scores, trajectories, errors, and timing) are captured and organized, and tagged with the agent’s commit hash for side-by-side comparison.
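The commit tagging itself is the standard git pattern, sketched below; the run-record filename and JSON shape are invented for illustration and are not BenchSpan's actual output format.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read the agent repo's current commit; fall back gracefully outside a git repo.
COMMIT="$(git rev-parse --short HEAD 2>/dev/null || echo no-git)"

# Tag the run record with that hash so every score traces back to exact code.
printf '{"commit": "%s", "score": 0.42}\n' "$COMMIT" > "run-$COMMIT.json"
cat "run-$COMMIT.json"
```

Because the hash is recorded at run time, two runs with identical scores but different hashes are immediately distinguishable, which is what makes side-by-side comparison trustworthy.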

Alternatives

  • Local or single-machine benchmark scripts: Running benchmark suites on a laptop can be simpler initially, but the workflow is slower, and results often remain fragmented unless you build your own tracking and reproducibility tooling.
  • Manual orchestration with Docker and custom harness glue: You can parallelize with containers and write glue code for each benchmark, but you still need to implement interface shims, resume logic, and a centralized results history.
  • Ad-hoc spreadsheet/Notion/Slack result logging: Copying numbers into shared documents can work for small experiments, but it doesn’t provide standardized run management, commit-tagged history, or structured comparisons automatically.