agent-benchmark
agent-benchmark is a CLI tool that runs the same coding task across multiple AI assistant config variants in parallel. It produces a side-by-side comparison of cost, speed, token usage, and diff output, and can score each variant on code quality axes using AI-driven review.
Put numbers to the vibes and make data-driven decisions by understanding how different configurations (e.g. CLAUDE.md, AGENTS.md, agent skills, repo documentation, model choice) influence the way AI coding agents interact with your codebase in terms of cost, speed, and output quality.
How it works
- Scaffold a benchmark directory from a target repo
- Edit the variant config files (CLAUDE.md, AGENTS.md, README.md, etc.) to test your ideas
- Run the benchmark – each variant gets its own git worktree, Claude runs in all of them in parallel, and you get a comparison table of metrics and diffs
- Score the code quality of each variant's changes along configurable axes (0-100) to see which configuration produces better-quality code
Requirements
- Node.js 18+
- Git 2.5+ (for worktree support)
- GitHub CLI (gh) installed and authenticated (required for copilot-review)
- Claude Code CLI installed and authenticated (claude on your PATH)
Installation
npm install -g agent-benchmark
Or use without installing:
npx agent-benchmark <command>
Quick start
# 1. Scaffold a benchmark from your project
npx agent-benchmark init /path/to/your-project
# 2. Edit the variant files to test your hypothesis
# variants/baseline/ – leave as-is to use the repo's existing config
# variants/variant_b/ – add or modify CLAUDE.md, AGENTS.md, README.md, etc.
# 3. Configure the benchmark in agent-benchmark/benchmark.yaml
# Set the task prompt, model, per-variant budget, and any other settings
# 4. Run the benchmark — each variant runs in parallel in its own worktree
npx agent-benchmark run agent-benchmark/benchmark.yaml
# 5. Score the code quality of each variant's changeset
npx agent-benchmark review agent-benchmark/benchmark.yaml
# 6. Create pull requests and request Copilot reviews for each variant
npx agent-benchmark copilot-review agent-benchmark/benchmark.yaml
What tasks work best
agent-benchmark is designed for tasks that translate a single, self-contained prompt into a PR-ready changeset — no conversational back-and-forth, no clarifying questions, just a clear scope and requirements that an AI agent can execute autonomously from start to finish.
Good prompts:
- Have a clear, bounded outcome ("add pagination to the /users endpoint", "migrate all var declarations in src/ to const/let")
- Specify any constraints that matter ("keep the existing API shape", "don't add new dependencies")
- Are representative of real tasks you'd delegate to an agent in practice
Less suitable: open-ended explorations, multi-step dialogues, tasks that require human decisions mid-way, or tasks so large that a single run reliably hits the budget cap.
Benchmarks
Benchmark configuration
A benchmark.yaml file drives each run:
# The prompt given to Claude in every variant.
prompt: 'Refactor the auth middleware to use async/await'
# Global model to use (optional, defaults to opusplan).
# Can be overridden per-variant.
model: opusplan
# Maximum spend per variant in USD (safety cap).
max_budget_usd: 1.00
# The target repo to benchmark against (defaults to cwd if omitted).
repo: /path/to/your-project
# Each variant gets its own worktree and config overlay.
variants:
  baseline:
    label: 'A – No changes'
    # No config_files: uses the repo's existing config as-is.
  structured_claude:
    label: 'B – Structured CLAUDE.md'
    config_files:
      CLAUDE.md: ./variants/variant_b/CLAUDE.md
  with_agents:
    label: 'C – CLAUDE.md + AGENTS.md (sonnet)'
    model: sonnet
    config_files:
      CLAUDE.md: ./variants/variant_c/CLAUDE.md
      AGENTS.md: ./variants/variant_c/AGENTS.md
  minimal_readme:
    label: 'D – Lean README context'
    config_files:
      CLAUDE.md: ./variants/variant_d/CLAUDE.md
      README.md: ./variants/variant_d/README.md
Configuration fields
| Field | Required | Default | Description |
|---|---|---|---|
| id | no | – | Unique benchmark ID generated by init; used to namespace branch names |
| prompt | yes | – | The task prompt sent to Claude in every variant |
| model | no | opusplan | Global default Claude model (can be overridden per-variant) |
| max_budget_usd | no | 1.00 | Per-variant spend cap in USD |
| repo | no | cwd | Absolute path to the target git repository |
| variants | yes | – | Map of variant keys to variant definitions |
id is written automatically by init and used to namespace branch names (agent-benchmark/<id>/<variant-key>), preventing collisions when running multiple benchmarks against the same repository.
Variant definition
| Field | Required | Description |
|---|---|---|
| label | no | Human-readable name shown in the report (defaults to variant key) |
| model | no | Claude model to use for this variant (inherits global model if unspecified) |
| config_files | no | Map of <repo-relative dest>: <source path> file overlays |
config_files source paths are resolved relative to the directory containing benchmark.yaml. Destination paths are repo-relative (e.g. CLAUDE.md, .github/copilot-instructions.md). Any file can be used as an overlay -- the destination path is not limited to AI config files. For example, you could override src/config.json or tsconfig.json if that is relevant to your benchmark.
A variant with no config_files entry uses the repo's existing files as-is.
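As a rough illustration of the rules above, the following Python sketch applies the documented defaults (label falls back to the variant key, model falls back to the global model, which itself defaults to opusplan) and resolves each overlay source relative to the directory containing benchmark.yaml. The function name resolve_variants and the dict-based input are assumptions made for this example; agent-benchmark is a Node.js tool and may implement this differently.

```python
from pathlib import Path

DEFAULT_MODEL = "opusplan"  # documented global default


def resolve_variants(config: dict, config_dir: Path) -> dict:
    """Apply defaults and resolve overlay source paths for every variant.

    `config` is the already-parsed benchmark.yaml; `config_dir` is the
    directory that contains it. Both names are hypothetical.
    """
    global_model = config.get("model", DEFAULT_MODEL)
    resolved = {}
    for key, variant in config["variants"].items():
        variant = variant or {}  # a variant may be declared with no fields
        overlays = {
            # Destination stays repo-relative; source becomes absolute.
            dest: (config_dir / src).resolve()
            for dest, src in variant.get("config_files", {}).items()
        }
        resolved[key] = {
            "label": variant.get("label", key),
            "model": variant.get("model", global_model),
            "config_files": overlays,
        }
    return resolved
```

A variant such as baseline, which lists no config_files, comes back with an empty overlay map, matching the "uses the repo's existing files as-is" behaviour.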
Automatically recognized config files
init scans for these files and copies whichever exist:
AI assistant config:
- CLAUDE.md and AGENTS.md (from any location, including subfolders)
- .claude/ folder and contents
- .github/copilot-instructions.md
Repo documentation:
- README.md
- CONTRIBUTING.md
Report output
After a run, a comparison table is printed to the terminal:
Benchmark: "Refactor auth middleware" (2026-05-01T12:00:00Z)
Base commit: abc1234
Metric             | A – No changes    | B – Structured
-------------------+-------------------+---------------
Model              | opusplan          | sonnet
Duration           | 45s               | 32s
Input tokens       | 12,340            | 9,800
Output tokens      | 3,210             | 2,100
Cache write tokens | 4,200             | 0
Cache read tokens  | 6,100             | 8,300
Cost               | $0.42             | $0.31
Normalized cost    | $0.40             | $0.35
Tool calls         | Bash:5 Edit:3     | Bash:3 Edit:2
Diff (+/-)         | +120/-80          | +95/-60
Input tokens include cache creation and cache read tokens. Normalized cost re-prices all input tokens at the standard (non-cached) rate, removing variance caused by cache hits and misses between variants.
The report also lists the git branch created for each variant:
Variant branches:
A – No changes: agent-benchmark/baseline
B – Structured: agent-benchmark/structured_claude
Results are also written to .agent-benchmark-results/<timestamp>/:
| File | Contents |
|---|---|
| results.json | Structured metrics for all variants |
| results.md | The table above in Markdown |
| <variant>/events.jsonl | Raw Claude stream-json events |
| <variant>/diff.patch | Full unified diff relative to base commit |
Config file exclusion
Config overlay files (those listed under config_files in a variant) are excluded from the variant's branch and diff unless Claude actually modified them. This ensures that diffs reflect only Claude's code changes, not the configuration differences between variants.
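How the tool detects that Claude modified an overlay file is not documented here. One plausible approach, sketched below in Python under that assumption, is to compare a content hash of each overlay source with the corresponding file in the worktree after the run, and to exclude only the files that still match. The names file_digest and files_to_exclude are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_exclude(worktree: Path, overlays: dict) -> list:
    """List repo-relative overlay destinations the agent left untouched.

    `overlays` maps a repo-relative destination to the overlay source path.
    A destination is excluded only when its content in the worktree is
    byte-for-byte identical to the overlay that was copied in.
    """
    untouched = []
    for dest, source in overlays.items():
        target = worktree / dest
        if target.exists() and file_digest(target) == file_digest(source):
            untouched.append(dest)
    return sorted(untouched)
```

An overlay file that was edited or deleted during the run would not be returned, so it would stay in the diff.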
Reproducibility
The config file, base commit SHA, model, and prompt are recorded in every results.json. All worktrees branch from the same HEAD commit, so the only variables between variants are the config overlay and, where overridden, the model.
Notes on prompt caching
Whichever variant's request reaches the API first pays the cache creation cost. Variants that follow on the same model and account may benefit from cached system prompts. Token counts in the report reflect the actual API charges for each variant, including cache hits.
The Normalized cost row removes this variance by re-pricing all cached input tokens (both cache writes and cache reads) at the standard input rate. Use this row when comparing cost efficiency between variants, since it is independent of execution order and cache state.
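The difference between the two cost rows can be sketched in a few lines of Python. The per-million-token rates and cache multipliers below are placeholders chosen for the example, not published prices, and the Usage class and function names are assumptions; the only behaviour taken from this document is that input tokens include cache writes and reads, and that the normalized figure prices all of them at the standard input rate.

```python
from dataclasses import dataclass

# Hypothetical prices (USD per million tokens) and cache multipliers.
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10


@dataclass
class Usage:
    input_tokens: int  # total input, including cache writes and cache reads
    output_tokens: int
    cache_write_tokens: int = 0
    cache_read_tokens: int = 0


def actual_cost(u: Usage) -> float:
    """Price cache writes and reads at their own (assumed) multipliers."""
    uncached = u.input_tokens - u.cache_write_tokens - u.cache_read_tokens
    weighted_input = (
        uncached
        + u.cache_write_tokens * CACHE_WRITE_MULTIPLIER
        + u.cache_read_tokens * CACHE_READ_MULTIPLIER
    )
    return (weighted_input * INPUT_RATE + u.output_tokens * OUTPUT_RATE) / 1_000_000


def normalized_cost(u: Usage) -> float:
    """Price every input token at the standard rate, ignoring cache state."""
    return (u.input_tokens * INPUT_RATE + u.output_tokens * OUTPUT_RATE) / 1_000_000
```

Two variants that consume the same number of tokens get the same normalized cost, even if one wrote the cache and the other only read from it.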
Reviews
Review configuration
The review key in benchmark.yaml controls the review command:
review:
  # Which axes to score. Each entry can be a string (built-in description)
  # or an object with name and optional custom description.
  axes:
    - focused
    - clear
    - conventional
    - robust
    - concise
    - tested
    # Override the description of a built-in axis:
    - name: secure
      description: 'Security best practices followed, input validation present'
    # Custom axis:
    - name: domain-correct
      description: 'The implementation is correct with respect to the business domain'
  # Model for review sessions (defaults to the global model).
  model: opus
  # Max budget per review session in USD (default: 0.50).
  max_budget_usd: 0.50
If review is omitted, agent-benchmark review uses all 14 default axes, the global model, and a $0.50 budget per session.
Default scoring axes
| Axis | What it measures |
|---|---|
| accessible | a11y was adequately considered |
| clear | Easy to understand what was changed |
| concise | Minimal but still effective |
| conventional | Follows conventions of the repo |
| documented | Changes to public APIs or complex logic are documented where needed |
| focused | Doesn't do stuff outside of the requested changes |
| idiomatic | Follows language/framework idioms |
| localized | i18n was considered |
| modular | Clear separation of concerns |
| nonbreaking | Doesn't break existing contracts or APIs |
| performant | Performance was adequately considered |
| robust | Handles edge cases |
| secure | Security was adequately considered |
| tested | Meaningful change in test coverage, broken tests are patched |
Duplicate axes are deduplicated (first occurrence wins). Axes not in the built-in list are treated as custom axes using their description as-is.
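The normalization rules for the axes list can be sketched as follows. This is an illustrative Python version, not the tool's code: the function name normalize_axes is invented, and falling back to an empty description for a custom axis that has none is an assumption, since that case is not described here.

```python
# Built-in descriptions, copied from the table of default axes.
BUILTIN_AXES = {
    "accessible": "a11y was adequately considered",
    "clear": "Easy to understand what was changed",
    "concise": "Minimal but still effective",
    "conventional": "Follows conventions of the repo",
    "documented": "Changes to public APIs or complex logic are documented where needed",
    "focused": "Doesn't do stuff outside of the requested changes",
    "idiomatic": "Follows language/framework idioms",
    "localized": "i18n was considered",
    "modular": "Clear separation of concerns",
    "nonbreaking": "Doesn't break existing contracts or APIs",
    "performant": "Performance was adequately considered",
    "robust": "Handles edge cases",
    "secure": "Security was adequately considered",
    "tested": "Meaningful change in test coverage, broken tests are patched",
}


def normalize_axes(entries: list) -> dict:
    """Map axis name to description, keeping the first occurrence of each name."""
    axes = {}
    for entry in entries:
        if isinstance(entry, str):
            name, description = entry, None
        else:
            name, description = entry["name"], entry.get("description")
        if name in axes:
            continue  # duplicate: first occurrence wins
        # An explicit description overrides the built-in one; a custom axis
        # without a description gets an empty string (assumption).
        axes[name] = description or BUILTIN_AXES.get(name, "")
    return axes
```

Because Python dicts preserve insertion order, the result also keeps the order in which axes were first listed.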
Review output
After a review run, two tables are printed:
Per-axis scores (one column per variant, null means not applicable):
Review scores for run 2026-05-01T12-00-00Z
Axis         | A – No changes | B – Structured
-------------+----------------+---------------
focused      | 85             | 92
clear        | 90             | 88
conventional | 75             | 80
robust       | 60             | 70
...          | ...            | ...
Aggregate statistics (one column per variant, null values excluded):
Aggregate scores per variant (null values excluded)
Metric | A – No changes | B – Structured
-------+----------------+---------------
Min    | 60             | 45
Max    | 90             | 95
Avg    | 77.5           | 80.0
Median | 80.0           | 82.5
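The aggregate rows can be reproduced with a few lines of Python. This sketch assumes scores arrive as a mapping from axis name to a number or None, and that the average and median are rounded to one decimal place as in the sample output. The function name aggregate is made up for the example.

```python
from statistics import mean, median


def aggregate(scores: dict):
    """Compute min, max, average, and median over the non-null scores.

    Returns None when every axis was scored as not applicable.
    """
    values = [s for s in scores.values() if s is not None]
    if not values:
        return None
    return {
        "min": min(values),
        "max": max(values),
        "avg": round(mean(values), 1),
        "median": round(float(median(values)), 1),
    }
```

Given the four scores 85, 90, 75, and 60 plus any number of null axes, this returns a minimum of 60, a maximum of 90, an average of 77.5, and a median of 80.0.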
Results are written to .agent-benchmark-results/<timestamp>/:
| File | Contents |
|---|---|
| review.json | Structured scores for all variants |
| review.md | The tables above in Markdown |
Additional review workflows
Copilot review
For Copilot-based code review, use the agent-benchmark copilot-review command. It automates creating PRs for each variant and requesting Copilot reviews.
Human review
After a benchmark run, inspect the diffs manually:
# View the diff for a specific variant
git diff <base-commit>..<agent-benchmark/variant-key>
# Or read the saved patch files
cat .agent-benchmark-results/<timestamp>/<variant>/diff.patch
Score each variant on the axes that matter to you and record the scores alongside review.json for comparison.
Config file exclusion
Config overlay files (those listed under config_files in a variant) are excluded from the variant's branch and diff unless Claude actually modified them. This ensures that reviews reflect only Claude's code changes, not the configuration differences between variants.
Commands
Init
Scaffolds a benchmark directory from a target repo.
agent-benchmark init <repo-path> [--variants <n>] [--name <name>]
| Argument / Flag | Default | Description |
|---|---|---|
| <repo-path> | – | Local path to the target repository |
| --variants <n> | 2 | Number of variants to create (minimum 2) |
| --name <name> | agent-benchmark | Name of the benchmark directory to create |
What it does:
- Verifies the target path is a git repository
- Scans for recognized config files
- Creates an agent-benchmark/ directory in the current working directory
- Copies found config files into each variant subdirectory
- Generates a pre-filled benchmark.yaml
If agent-benchmark/ already exists, you will be prompted to use a numbered suffix, delete the existing directory, or cancel.
Run
Runs the benchmark defined in a YAML config file.
agent-benchmark run <benchmark.yaml> [--dry-run] [--yes] [--concurrency <n>]
| Argument / Flag | Default | Description |
|---|---|---|
| <benchmark.yaml> | – | Path to the benchmark config |
| --dry-run | – | Validate config and print what would happen, without running Claude |
| --yes | – | Skip the confirmation prompt before running |
| --concurrency <n> | all | Max number of parallel Claude processes |
Security note: This command runs Claude with --dangerously-skip-permissions, giving it full filesystem and shell access with no confirmation prompts. You will be asked to confirm before any processes are spawned (bypass with --yes).
Each variant's changes are committed to a branch named agent-benchmark/<variant-key>. Worktrees are removed automatically when the run finishes. Branch names are printed at the end of the run and recorded in results.json.
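To make the --concurrency behaviour concrete, here is a minimal Python sketch of a bounded parallel runner. It assumes that the default of "all" means one worker per variant; run_all and run_variant are hypothetical names, and the real tool spawns Claude processes from Node.js rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(variant_keys: list, run_variant, concurrency=None) -> dict:
    """Run `run_variant(key)` for every variant with at most `concurrency` in flight.

    When `concurrency` is None, every variant runs at once, mirroring the
    documented default of "all".
    """
    workers = max(1, concurrency or len(variant_keys))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = list(pool.map(run_variant, variant_keys))
    return dict(zip(variant_keys, results))
```

Lowering the limit trades wall-clock time for fewer simultaneous API sessions.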
Results
Display benchmark result sets stored in .agent-benchmark-results/.
agent-benchmark results [<timestamp>] [--list]
| Argument / Flag | Description |
|---|---|
| <timestamp> | Which result set to display |
| --list | List all available result sets |
Review
Score the code quality of each variant's changeset along configurable axes (0-100) using AI-driven review sessions. Produces per-axis scores and aggregate statistics for each variant.
agent-benchmark review <benchmark.yaml> [<timestamp>] [--dry-run] [--yes] [--concurrency <n>]
| Argument / Flag | Default | Description |
|---|---|---|
| <benchmark.yaml> | – | Path to the benchmark config |
| <timestamp> | latest | Which result set to review |
| --dry-run | – | Print what would happen without spawning Claude |
| --yes | – | Skip confirmation prompt |
| --concurrency <n> | all | Max parallel review sessions |
Each review session:
- Gets the full repository checked out at the variant's branch
- Receives the original task prompt and is asked to score the change on each configured axis
- Uses git log and git diff to locate and inspect the exact changes, and reads related files for context
- Writes scores to .review-scores.json using the Write tool: a JSON object with a score (0-100 or null) and a one-sentence rationale per axis
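The exact layout of .review-scores.json is not spelled out here beyond "a score and a rationale per axis". Assuming a shape of {"<axis>": {"score": ..., "rationale": "..."}}, a reader who wants to validate such a file could start from a sketch like this. The key names and the function parse_review_scores are assumptions for illustration.

```python
import json


def parse_review_scores(text: str, axes: list) -> dict:
    """Parse and validate review scores for the configured axes.

    Raises ValueError when an axis is missing or a score is neither null
    nor a number between 0 and 100.
    """
    data = json.loads(text)
    result = {}
    for axis in axes:
        entry = data.get(axis)
        if not isinstance(entry, dict):
            raise ValueError(f"missing or malformed entry for axis: {axis}")
        score = entry.get("score")
        if score is not None:
            is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
            if not is_number or not 0 <= score <= 100:
                raise ValueError(f"score out of range for axis {axis}: {score!r}")
        result[axis] = {"score": score, "rationale": str(entry.get("rationale", ""))}
    return result
```

A null score passes validation and is kept as None, so it can later be left out of the aggregate statistics.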
Results are written to .agent-benchmark-results/<timestamp>/review.json and review.md.
Copilot review
Create pull requests for each benchmark variant and request Copilot code reviews.
agent-benchmark copilot-review <benchmark.yaml> [<timestamp>] [--dry-run] [--yes] [--concurrency <n>]
| Argument / Flag | Default | Description |
|---|---|---|
| <benchmark.yaml> | – | Path to the benchmark config |
| <timestamp> | latest | Which result set to create PRs for |
| --dry-run | – | Print what would happen without creating PRs |
| --yes | – | Skip confirmation prompt |
| --concurrency <n> | all | Max parallel PR creation |
Security note: This command uses the gh CLI to create PRs and request reviews. You will be asked to confirm before any operations are performed (bypass with --yes).
For each variant:
- Checks out or creates a worktree at the variant's branch
- Pushes the branch to the remote
- Creates a PR with the benchmark task prompt and variant metadata
- Requests Copilot review via gh pr review --copilot
- Collects the PR URL for the final report
Worktrees are removed automatically when the command finishes. Results are printed as a summary table showing PR URLs.
License
Licensed under MIT. Do what you will.