APIEval-20

What is APIEval-20?

APIEval-20 is a task benchmark that evaluates AI agents on real-world API test suite generation under a black-box constraint. Rather than measuring general model quality or superficial schema compliance, it measures whether an agent can reason about an API surface and generate tests that actually uncover bugs.

In each scenario, the agent receives only an API request schema and a sample payload: no source code, no documentation beyond what is in the schema, and no prior knowledge. The generated test suite is then run against a live reference implementation to see which bugs the tests expose.

Key Features

Task benchmark for AI agents (not a model benchmark): Evaluates end-to-end agent behavior—test design and bug discovery—rather than text-generation quality.
20-scenario set drawn from real-world domains: Scenarios cover e-commerce, payments, authentication, user management, scheduling, notifications, and search/filtering patterns.
Black-box input constraint: The agent is given exactly two inputs per scenario, (1) the request JSON schema and (2) a sample request payload, with no response schemas, implementation details, error messages, or changelogs.
Bug spectrum with complexity-based labeling: Each scenario includes 3–8 planted bugs classified by reasoning complexity: simple structural issues, moderate field-constraint violations, and complex multi-field/business-logic interactions.
Test suite output format (request-only test cases): The agent produces a list of test cases, each with a short test name and a complete request payload as valid JSON; no expected outcomes are required.
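As a rough illustration of the request-only output format described above, the sketch below builds a small test suite and checks that each case has a name and a JSON-serializable payload. The key names `name` and `payload`, the helper `validate_suite`, and the order-style fields (`email`, `quantity`, `currency`) are assumptions made for this example; the benchmark's exact field names are not given here.

```python
import json

# Hypothetical suite for an imagined order-creation endpoint. The keys
# "name" and "payload" and every field inside the payloads are assumptions.
test_suite = [
    {
        "name": "baseline valid order",
        "payload": {"email": "user@example.com", "quantity": 2, "currency": "USD"},
    },
    {
        "name": "quantity below plausible minimum",
        "payload": {"email": "user@example.com", "quantity": 0, "currency": "USD"},
    },
    {
        "name": "missing currency field",
        "payload": {"email": "user@example.com", "quantity": 2},
    },
]


def validate_suite(suite):
    """Return the number of cases after checking each has a name and a JSON payload."""
    for case in suite:
        name = case.get("name")
        if not isinstance(name, str) or not name.strip():
            raise ValueError("each test case needs a non-empty name")
        if "payload" not in case:
            raise ValueError(f"test case {name!r} has no payload")
        json.dumps(case["payload"])  # raises TypeError if not JSON-serializable
    return len(suite)


# Serialize the whole suite as JSON, ready to hand to a runner.
serialized = json.dumps(test_suite, indent=2)
```

Note that no case carries an expected status or response, which matches the request-only format: the test name describes intent, and the payload is the whole test.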

How to Use APIEval-20

Select a scenario from the APIEval-20 benchmark. Each scenario provides an API request JSON schema and a sample payload.
Provide those two inputs to your AI agent. The benchmark is specifically designed so the agent cannot rely on implementation details or extra documentation.
Generate a test suite: Have the agent output test cases where each case includes a human-readable name and a complete JSON request payload.
Run the produced test cases against the live reference implementation: Evaluation is based on what the tests reveal when executed, not on whether the agent predicts expected outcomes.
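The final step above can be sketched as a small runner that sends each payload and records what comes back. Everything here is an assumption for illustration: `run_suite`, `stub_reference_api`, the `name`/`payload` keys, and the planted bug are hypothetical, and the stub stands in for the live reference implementation, whose endpoint is not specified here.

```python
from typing import Any, Callable, Dict, List, Tuple

Payload = Dict[str, Any]
Response = Tuple[int, Dict[str, Any]]


def stub_reference_api(payload: Payload) -> Response:
    """Hypothetical stand-in for the live reference implementation.

    Illustrative planted bug: negative quantities are accepted.
    """
    if "quantity" not in payload:
        return 400, {"error": "quantity is required"}
    if not isinstance(payload["quantity"], int):
        return 400, {"error": "quantity must be an integer"}
    # Bug: no lower-bound check, so a negative quantity yields a negative total.
    return 200, {"total": payload["quantity"] * 10}


def run_suite(
    suite: List[Dict[str, Any]], send: Callable[[Payload], Response]
) -> List[Dict[str, Any]]:
    """Send every payload and record the observed status and body."""
    observations = []
    for case in suite:
        status, body = send(case["payload"])
        observations.append({"name": case["name"], "status": status, "body": body})
    return observations


suite = [
    {"name": "baseline valid order", "payload": {"quantity": 2}},
    {"name": "missing quantity", "payload": {}},
    {"name": "negative quantity", "payload": {"quantity": -5}},
]
observations = run_suite(suite, stub_reference_api)
for obs in observations:
    print(obs["name"], obs["status"], obs["body"])
```

In a real run, `send` would presumably wrap an HTTP call to the reference implementation. Keeping it as a plain callable means the same runner works against a stub or a live endpoint, and the negative-quantity case shows how a bug surfaces from observed behavior alone, without the agent predicting an outcome.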

Use Cases

Evaluating an agent’s ability to generate meaningful API tests: Useful when you want to know whether an agent can go beyond schema-conformant generation and produce tests that reveal real bugs.
Comparing agent strategies under the same black-box constraint: Because the inputs are limited to schema + example payload, differences in performance reflect test reasoning and coverage rather than access to additional information.
Testing for structural robustness (simple bug detection): Scenarios include checks for missing required fields, empty values (e.g., "", null, []), and wrong data types—helpful for validating baseline request handling.
Assessing constraint and validation reasoning (moderate bug detection): The benchmark includes cases such as out-of-range numeric values, malformed field formats (e.g., email, currency code, date format), and boundary or undocumented enum values.
Assessing business-logic and cross-field reasoning (complex bug detection): Some scenarios require detecting issues involving mutually exclusive fields, discounts applied to ineligible orders, or field validity dependent on other fields.
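To make the simple and moderate categories above concrete, the sketch below derives request payloads mechanically from a schema and a sample payload. The schema, the sample, and the helpers `simple_cases` and `moderate_cases` are all hypothetical, written for this example only. Complex cross-field and business-logic cases are deliberately left out, since they depend on reasoning about the domain rather than on schema keywords.

```python
import copy

# Hypothetical request schema and sample payload, invented for illustration.
schema = {
    "type": "object",
    "required": ["email", "quantity"],
    "properties": {
        "email": {"type": "string", "format": "email"},
        "quantity": {"type": "integer", "minimum": 1, "maximum": 100},
    },
}
sample = {"email": "user@example.com", "quantity": 2}


def mutate(sample, field, value):
    """Return a copy of the sample with one field replaced."""
    payload = copy.deepcopy(sample)
    payload[field] = value
    return payload


def simple_cases(schema, sample):
    """Structural cases: missing required fields, empty values, wrong types."""
    cases = []
    for field in schema.get("required", []):
        payload = copy.deepcopy(sample)
        payload.pop(field, None)
        cases.append({"name": f"missing required field: {field}", "payload": payload})
    for field, spec in schema.get("properties", {}).items():
        for label, value in (("empty string", ""), ("null", None), ("empty array", [])):
            cases.append({"name": f"{field} is {label}", "payload": mutate(sample, field, value)})
        wrong = 123 if spec.get("type") == "string" else "not-a-number"
        cases.append({"name": f"{field} has wrong type", "payload": mutate(sample, field, wrong)})
    return cases


def moderate_cases(schema, sample):
    """Constraint cases: out-of-range numbers and malformed formats."""
    cases = []
    for field, spec in schema.get("properties", {}).items():
        if "minimum" in spec:
            cases.append({"name": f"{field} below minimum",
                          "payload": mutate(sample, field, spec["minimum"] - 1)})
        if "maximum" in spec:
            cases.append({"name": f"{field} above maximum",
                          "payload": mutate(sample, field, spec["maximum"] + 1)})
        if spec.get("format") == "email":
            cases.append({"name": f"{field} is malformed email",
                          "payload": mutate(sample, field, "not-an-email")})
    return cases


suite = simple_cases(schema, sample) + moderate_cases(schema, sample)
```

Mechanical mutation of this kind only reaches the first two categories. An agent aiming at the complex category would need to add hand-reasoned cases, for example combining fields that should be mutually exclusive.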

FAQ

What inputs does the agent get for each scenario? The agent receives exactly two inputs: the full request JSON schema and a sample payload. No response schema, implementation details, error messages, or other documentation are provided.

Does the agent need to predict expected outcomes? No. The produced test suite consists of test cases with request payloads; the evaluation is done by running those tests against the live reference implementation and observing what happens.

How are bugs represented in the benchmark? Each scenario contains multiple planted bugs (between 3 and 8), categorized by complexity: simple structural issues, moderate field-level constraint violations, and complex multi-field or semantic/business-logic relationships.

What does APIEval-20 evaluate: schema compliance or bug-finding? Bug-finding. While schema information is provided to enable test generation, the benchmark is designed to test whether the agent’s tests uncover bugs when executed.

Alternatives

Schema-focused test generation / schema compliance checkers: These focus on validating that generated requests match a schema (or that a system follows a schema). Unlike APIEval-20, they do not directly evaluate bug-finding behavior under black-box constraints.
Conventional API testing frameworks and tooling (e.g., request/contract test tools): These workflows typically rely on human-authored test cases or additional knowledge. Compared to APIEval-20, they may not evaluate an agent’s ability to generate targeted test suites from schema + example alone.
General AI evaluation benchmarks for code or text generation: Some benchmarks assess output quality rather than executable test effectiveness. APIEval-20 specifically targets end-to-end agent behavior: generating tests that expose bugs when they are run.
API property-based / fuzz testing approaches: These can exercise an API broadly by generating many inputs, but may not evaluate the agent’s reasoning process for designing targeted tests from schema and example payloads.

