agent-benchmark

agent-benchmark is a CLI tool that runs the same coding task across multiple AI assistant config variants in parallel. It then produces a side-by-side comparison of cost, speed, token usage, and diff output, and scores each variant on code quality axes using AI-driven review.

Put numbers to the vibes and make data-driven decisions by understanding how different configurations (e.g. CLAUDE.md, AGENTS.md, agent skills, repo documentation, model choice) influence the way AI coding agents interact with your codebase in terms of cost, speed, and output quality.

How it works

Scaffold a benchmark directory from a target repo
Edit the variant config files (CLAUDE.md, AGENTS.md, README.md etc.) to test your ideas
Run the benchmark – each variant gets its own git worktree, Claude runs in all of them in parallel, and you get a comparison table of metrics and diffs
Score the code quality of each variant's changes along configurable axes (0-100) to see which configuration produces better-quality code

Requirements

Node.js 18+
Git 2.5+ (for worktree support)
GitHub CLI (gh) installed and authenticated (required for copilot-review)
Claude Code CLI installed and authenticated (claude on your PATH)

Installation

npm install -g agent-benchmark

Or use without installing:

npx agent-benchmark <command>

Quick start

# 1. Scaffold a benchmark from your project
npx agent-benchmark init /path/to/your-project

# 2. Edit the variant files to test your hypothesis
#    variants/baseline/  – leave as-is to use the repo's existing config
#    variants/variant_b/ – add or modify CLAUDE.md, AGENTS.md, README.md, etc.

# 3. Configure the benchmark in agent-benchmark/benchmark.yaml
#    Set the task prompt, model, per-variant budget, and any other settings

# 4. Run the benchmark — each variant runs in parallel in its own worktree
npx agent-benchmark run agent-benchmark/benchmark.yaml

# 5. Score the code quality of each variant's changeset
npx agent-benchmark review agent-benchmark/benchmark.yaml

# 6. Create pull requests and request Copilot reviews for each variant
npx agent-benchmark copilot-review agent-benchmark/benchmark.yaml

What tasks work best

agent-benchmark is designed for tasks that translate a single, self-contained prompt into a PR-ready changeset — no conversational back-and-forth, no clarifying questions, just a clear scope and requirements that an AI agent can execute autonomously from start to finish.

Good prompts:

Have a clear, bounded outcome ("add pagination to the /users endpoint", "migrate all var declarations in src/ to const/let")
Specify any constraints that matter ("keep the existing API shape", "don't add new dependencies")
Are representative of real tasks you'd delegate to an agent in practice

Less suitable: open-ended explorations, multi-step dialogues, tasks that require human decisions mid-way, or tasks so large that a single run reliably hits the budget cap.

Benchmarks

Benchmark configuration

A benchmark.yaml file drives each run:

# The prompt given to Claude in every variant.
prompt: 'Refactor the auth middleware to use async/await'

# Global model to use (optional, defaults to opusplan).
# Can be overridden per-variant.
model: opusplan

# Maximum spend per variant in USD (safety cap).
max_budget_usd: 1.00

# The target repo to benchmark against (defaults to cwd if omitted).
repo: /path/to/your-project

# Each variant gets its own worktree and config overlay.
variants:
  baseline:
    label: 'A – No changes'
    # No config_files: uses the repo's existing config as-is.

  structured_claude:
    label: 'B – Structured CLAUDE.md'
    config_files:
      CLAUDE.md: ./variants/variant_b/CLAUDE.md

  with_agents:
    label: 'C – CLAUDE.md + AGENTS.md (sonnet)'
    model: sonnet
    config_files:
      CLAUDE.md: ./variants/variant_c/CLAUDE.md
      AGENTS.md: ./variants/variant_c/AGENTS.md

  minimal_readme:
    label: 'D – Lean README context'
    config_files:
      CLAUDE.md: ./variants/variant_d/CLAUDE.md
      README.md: ./variants/variant_d/README.md

Configuration fields

Field	Required	Default	Description
`id`	no	–	Unique benchmark ID generated by `init`; used to namespace branch names
`prompt`	yes	–	The task prompt sent to Claude in every variant
`model`	no	`opusplan`	Global default Claude model (can be overridden per-variant)
`max_budget_usd`	no	`1.00`	Per-variant spend cap in USD
`repo`	no	`cwd`	Absolute path to the target git repository
`variants`	yes	–	Map of variant keys to variant definitions

id is written automatically by init and used to namespace branch names (agent-benchmark/<id>/<variant-key>), preventing collisions when running multiple benchmarks against the same repository.

Variant definition

Field	Required	Description
`label`	no	Human-readable name shown in the report (defaults to variant key)
`model`	no	Claude model to use for this variant (inherits global `model` if unspecified)
`config_files`	no	Map of `<repo-relative dest>: <source path>` file overlays

config_files source paths are resolved relative to the directory containing benchmark.yaml. Destination paths are repo-relative (e.g. CLAUDE.md, .github/copilot-instructions.md). Any file can be used as an overlay; the destination path is not limited to AI config files. For example, you could override src/config.json or tsconfig.json if that is relevant to your benchmark.

A variant with no config_files entry uses the repo's existing files as-is.
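
The path rules above can be illustrated with a short sketch. This is not the tool's implementation; `resolve_overlays` and its parameters are hypothetical names, used only to show how sources resolve against the directory containing benchmark.yaml and destinations against the repo root.

```python
import os
from pathlib import Path


def resolve_overlays(benchmark_yaml, repo_root, config_files):
    """Return (source, destination) path pairs for one variant's overlays.

    Hypothetical helper: sources are resolved relative to the directory
    containing benchmark.yaml, destinations relative to the repo root.
    """
    yaml_dir = Path(benchmark_yaml).parent
    pairs = []
    # A variant without config_files has nothing to overlay.
    for dest, source in (config_files or {}).items():
        src_path = Path(os.path.normpath(yaml_dir / source))
        dest_path = Path(os.path.normpath(Path(repo_root) / dest))
        pairs.append((src_path, dest_path))
    return pairs
```

Called with `{"CLAUDE.md": "./variants/variant_b/CLAUDE.md"}`, the sketch pairs the file next to benchmark.yaml with `CLAUDE.md` at the root of the repo.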

Automatically recognized config files

init scans for these files and copies whichever exist:

AI assistant config:

CLAUDE.md and AGENTS.md (from any location, including subfolders)
.claude/ folder and contents
.github/copilot-instructions.md

Repo documentation:

README.md
CONTRIBUTING.md

Report output

After a run, a comparison table is printed to the terminal:

Benchmark: "Refactor auth middleware" (2026-05-01T12:00:00Z)
Base commit: abc1234

Metric             | A – No changes    | B – Structured
-------------------+-------------------+---------------
Model              | opusplan          | opusplan
Duration           | 45s               | 32s
Input tokens       | 12,340            | 9,800
Output tokens      | 3,210             | 2,100
Cache write tokens | 4,200             | 0
Cache read tokens  | 6,100             | 8,300
Cost               | $0.42             | $0.31
Normalized cost    | $0.40             | $0.35
Tool calls         | Bash:5 Edit:3     | Bash:3 Edit:2
Diff (+/-)         | +120/-80          | +95/-60

Input tokens include cache creation and cache read tokens. Normalized cost re-prices all input tokens at the standard (non-cached) rate, removing variance caused by cache hits and misses between variants.

The report also lists the git branch created for each variant:

Variant branches:
  A – No changes: agent-benchmark/baseline
  B – Structured: agent-benchmark/structured_claude

Results are also written to .agent-benchmark-results/<timestamp>/:

File	Contents
`results.json`	Structured metrics for all variants
`results.md`	The table above in Markdown
`<variant>/events.jsonl`	Raw Claude stream-json events
`<variant>/diff.patch`	Full unified diff relative to base commit

Config file exclusion

Config overlay files (those listed under config_files in a variant) are excluded from the variant's branch and diff unless Claude actually modified them. This ensures that diffs reflect only Claude's code changes, not the configuration differences between variants.
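
One way to picture the exclusion rule is as a filter over the changed files. The README does not say how the tool detects that Claude modified an overlay, so the content-hash comparison below is an assumption, and `digest` and `files_to_keep` are hypothetical names.

```python
import hashlib


def digest(content: bytes) -> str:
    """SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def files_to_keep(changed_files, overlay_digests, final_digests):
    """Drop overlay files that are byte-identical to what was applied.

    overlay_digests: destination path -> digest of the overlay as applied.
    final_digests:   path -> digest of the file after the run.
    Assumption: "modified by Claude" means the content hash changed.
    """
    kept = []
    for path in changed_files:
        applied = overlay_digests.get(path)
        if applied is not None and final_digests.get(path) == applied:
            continue  # untouched overlay: keep it out of the branch and diff
        kept.append(path)
    return kept
```

Under this sketch, an overlaid CLAUDE.md that Claude left alone is dropped, while an overlaid AGENTS.md that Claude edited stays in the diff alongside the code changes.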

Reproducibility

The config file, base commit SHA, model, and prompt are recorded in every results.json. All worktrees branch from the same HEAD commit, so the only configured differences between variants are the config overlay and any per-variant model override.

Notes on prompt caching

The first variant whose request reaches the API pays the cache creation cost. Later variants running on the same model and account may benefit from cached system prompts. Token counts and costs in the report reflect what each variant was actually charged, including cache hits.

The Normalized cost row removes this variance by re-pricing all cached input tokens (both cache writes and cache reads) at the standard input rate. Use this row when comparing cost efficiency between variants, since it is independent of execution order and cache state.
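
The difference between the two cost rows can be sketched as follows. The function names and the `rates` dictionary are hypothetical, and the rates in the usage note are placeholders in USD per million tokens, not real pricing. The sketch relies on the report's convention that input tokens already include cache write and cache read tokens.

```python
def billed_cost(input_tokens, cache_write, cache_read, output_tokens, rates):
    """Cost with cached tokens priced at their own (assumed) rates."""
    uncached = input_tokens - cache_write - cache_read
    total = (
        uncached * rates["input"]
        + cache_write * rates["cache_write"]
        + cache_read * rates["cache_read"]
        + output_tokens * rates["output"]
    )
    return total / 1_000_000


def normalized_cost(input_tokens, output_tokens, rates):
    """Cost with every input token re-priced at the standard input rate."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```

With placeholder rates such as `{"input": 3.0, "cache_write": 3.75, "cache_read": 0.3, "output": 15.0}`, two variants with identical token totals get the same normalized cost regardless of how many of those tokens were cache hits, which is what makes the row comparable across variants.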

Reviews

Review configuration

The review key in benchmark.yaml controls the review command:

review:
  # Which axes to score. Each entry can be a string (built-in description)
  # or an object with name and optional custom description.
  axes:
    - focused
    - clear
    - conventional
    - robust
    - concise
    - tested
    # Override the description of a built-in axis:
    - name: secure
      description: 'Security best practices followed, input validation present'
    # Custom axis:
    - name: domain-correct
      description: 'The implementation is correct with respect to the business domain'

  # Model for review sessions (defaults to the global model).
  model: opus

  # Max budget per review session in USD (default: 0.50).
  max_budget_usd: 0.50

If review is omitted, agent-benchmark review uses all 14 default axes, the global model, and a $0.50 budget per session.

Default scoring axes

Axis	What it measures
`accessible`	a11y was adequately considered
`clear`	Easy to understand what was changed
`concise`	Minimal but still effective
`conventional`	Follows conventions of the repo
`documented`	Changes to public APIs or complex logic are documented where needed
`focused`	Doesn't do stuff outside of the requested changes
`idiomatic`	Follows language/framework idioms
`localized`	i18n was considered
`modular`	Clear separation of concerns
`nonbreaking`	Doesn't break existing contracts or APIs
`performant`	Performance was adequately considered
`robust`	Handles edge cases
`secure`	Security was adequately considered
`tested`	Meaningful change in test coverage, broken tests are patched

Duplicate axes are deduplicated (first occurrence wins). Axes not in the built-in list are treated as custom axes using their description as-is.
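
The axis rules can be sketched as a small normalization step. `normalize_axes` is a hypothetical helper, and `BUILT_IN` lists only a few of the default axes for brevity.

```python
# Subset of the built-in axes, copied from the table above.
BUILT_IN = {
    "clear": "Easy to understand what was changed",
    "robust": "Handles edge cases",
    "secure": "Security was adequately considered",
}


def normalize_axes(entries):
    """Map axis name -> description, keeping the first occurrence of each name.

    Each entry is either a string (axis name) or a dict with "name" and an
    optional "description", mirroring the review.axes list in benchmark.yaml.
    """
    axes = {}
    for entry in entries:
        if isinstance(entry, str):
            name, description = entry, None
        else:
            name, description = entry["name"], entry.get("description")
        if name in axes:
            continue  # duplicate: first occurrence wins
        # A given description overrides the built-in one; custom axes use it as-is.
        axes[name] = description or BUILT_IN.get(name, "")
    return axes
```

Because Python dictionaries preserve insertion order, the result also keeps the order in which axes were first listed.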

Review output

After a review run, two tables are printed:

Per-axis scores (one column per variant, null means not applicable):

Review scores for run 2026-05-01T12-00-00Z

Axis         | A – No changes | B – Structured
-------------+----------------+---------------
focused      | 85             | 92
clear        | 90             | 88
conventional | 75             | 80
robust       | 60             | 70
...          | ...            | ...

Aggregate statistics (one column per variant, null values excluded):

Aggregate scores per variant (null values excluded)

Metric | A – No changes | B – Structured
-------+----------------+---------------
Min    | 60             | 45
Max    | 90             | 95
Avg    | 77.5           | 80.0
Median | 80.0           | 82.5
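
As a rough illustration of how such aggregates can be derived, the sketch below computes the four statistics for one variant while skipping null scores. `aggregate` is a hypothetical helper, and rounding to one decimal place is an assumption based on the sample table.

```python
from statistics import mean, median


def aggregate(scores):
    """Min, max, average, and median of one variant's scores, ignoring nulls."""
    values = [score for score in scores.values() if score is not None]
    if not values:
        return None  # every axis was "not applicable"
    return {
        "min": min(values),
        "max": max(values),
        "avg": round(mean(values), 1),
        "median": round(float(median(values)), 1),
    }
```

Fed only the four scores visible for variant A above (85, 90, 75, 60), this yields a minimum of 60, a maximum of 90, an average of 77.5, and a median of 80.0.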

Results are written to .agent-benchmark-results/<timestamp>/:

File	Contents
`review.json`	Structured scores for all variants
`review.md`	The tables above in Markdown

Additional review workflows

Copilot review

For Copilot-based code review, use the agent-benchmark copilot-review command. It automates creating PRs for each variant and requesting Copilot reviews.

Human review

After a benchmark run, inspect the diffs manually:

# View the diff for a specific variant (branch names are printed at the end of the run)
git diff <base-commit>..<variant-branch>

# Or read the saved patch files
cat .agent-benchmark-results/<timestamp>/<variant>/diff.patch

Score each variant on the axes that matter to you and record the scores alongside review.json for comparison.

Config file exclusion

Config overlay files (those listed under config_files in a variant) are excluded from the variant's branch and diff unless Claude actually modified them. This ensures that reviews reflect only Claude's code changes, not the configuration differences between variants.

Commands

Init

Scaffolds a benchmark directory from a target repo.

agent-benchmark init <repo-path> [--variants <n>] [--name <name>]

Argument / Flag	Default	Description
`<repo-path>`		Local path to the target repository
`--variants <n>`	`2`	Number of variants to create (minimum 2)
`--name <name>`	`agent-benchmark`	Name of the benchmark directory to create

What it does:

Verifies the target path is a git repository
Scans for recognized config files
Creates the benchmark directory (agent-benchmark/ by default, or the name given with --name) in the current working directory
Copies found config files into each variant subdirectory
Generates a pre-filled benchmark.yaml

If the benchmark directory already exists, you will be prompted to use a numbered suffix, delete the existing directory, or cancel.

Run

Runs the benchmark defined in a YAML config file.

agent-benchmark run <benchmark.yaml> [--dry-run] [--yes] [--concurrency <n>]

Argument / Flag	Default	Description
`<benchmark.yaml>`		Path to the benchmark config
`--dry-run`		Validate config and print what would happen, without running Claude
`--yes`		Skip the confirmation prompt before running
`--concurrency <n>`	all	Max number of parallel Claude processes

Security note: This command runs Claude with --dangerously-skip-permissions, giving it full filesystem and shell access with no confirmation prompts. You will be asked to confirm before any processes are spawned (bypass with --yes).

Each variant's changes are committed to a branch named agent-benchmark/<variant-key> (agent-benchmark/<id>/<variant-key> when the config contains an id). Worktrees are removed automatically when the run finishes. Branch names are printed at the end of the run and recorded in results.json.
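
The effect of --concurrency can be pictured as a bounded worker pool. This sketch is not how the tool spawns Claude; `run_variants` and `run_one` are hypothetical names, and a thread pool stands in for whatever process management the CLI actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def run_variants(variant_keys, run_one, concurrency=None):
    """Run run_one(key) for every variant, at most `concurrency` at a time.

    With no limit given, all variants run at once, mirroring the
    documented default of "all".
    """
    keys = list(variant_keys)
    if not keys:
        return {}
    workers = concurrency or len(keys)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        results = list(pool.map(run_one, keys))
    return dict(zip(keys, results))
```

Lowering the limit trades wall-clock time for fewer simultaneous sessions, which can help when rate limits or local resources are a concern.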

Results

Display benchmark result sets stored in .agent-benchmark-results/.

agent-benchmark results [<timestamp>] [--list]

Argument / Flag	Description
`<timestamp>`	Which result set to display
`--list`	List all available result sets

Review

Score the code quality of each variant's changeset along configurable axes (0-100) using AI-driven review sessions. Produces per-axis scores and aggregate statistics (min, max, average, median) for each variant.

agent-benchmark review <benchmark.yaml> [<timestamp>] [--dry-run] [--yes] [--concurrency <n>]

Argument / Flag	Default	Description
`<benchmark.yaml>`		Path to the benchmark config
`<timestamp>`	latest	Which result set to review
`--dry-run`		Print what would happen without spawning Claude
`--yes`		Skip confirmation prompt
`--concurrency <n>`	all	Max parallel review sessions

Each review session:

Gets the full repository checked out at the variant's branch
Receives the original task prompt and is asked to score the change on each configured axis
Uses git log and git diff to locate and inspect the exact changes, and reads related files for context
Writes scores to .review-scores.json using the Write tool — a JSON object with a score (0-100 or null) and a one-sentence rationale per axis
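
A consumer of the score file might validate it along these lines. The exact JSON layout is not specified here, so the shape assumed below (one object per axis with `score` and `rationale` keys) is a guess, and `parse_review_scores` is a hypothetical helper.

```python
import json


def parse_review_scores(text, axes):
    """Return axis -> score (0-100 or None) from a review score file.

    Assumed shape: {"<axis>": {"score": 85, "rationale": "..."}, ...}
    """
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by axis name")
    scores = {}
    for axis in axes:
        entry = data.get(axis)
        if not isinstance(entry, dict):
            raise ValueError(f"no entry for axis {axis!r}")
        score = entry.get("score")
        if score is not None:
            is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
            if not is_number or not 0 <= score <= 100:
                raise ValueError(f"score for {axis!r} must be 0-100 or null")
        scores[axis] = score
    return scores
```

A null score is passed through as `None`, so downstream aggregation can exclude it as "not applicable".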

Results are written to .agent-benchmark-results/<timestamp>/review.json and review.md.

Copilot review

Create pull requests for each benchmark variant and request Copilot code reviews.

agent-benchmark copilot-review <benchmark.yaml> [<timestamp>] [--dry-run] [--yes] [--concurrency <n>]

Argument / Flag	Default	Description
`<benchmark.yaml>`		Path to the benchmark config
`<timestamp>`	latest	Which result set to create PRs for
`--dry-run`		Print what would happen without creating PRs
`--yes`		Skip confirmation prompt
`--concurrency <n>`	all	Max parallel PR creation

Security note: This command uses the gh CLI to create PRs and request reviews. You will be asked to confirm before any operations are performed (bypass with --yes).

For each variant:

Checks out or creates a worktree at the variant's branch
Pushes the branch to the remote
Creates a PR with the benchmark task prompt and variant metadata
Requests Copilot review via gh pr review --copilot
Collects the PR URL for the final report

Worktrees are removed automatically when the command finishes. Results are printed to a summary table showing PR URLs.

License

Licensed under MIT. Do what you will.