Benchmark proofs

Benchmark receipts, not a single headline number.

Every number on these pages is reproduced from a committed file in the repo. Each benchmark links straight to the raw evidence (Anthropic-billed usage JSON, prompts, and answers) plus a small verify.sh. These pages are receipts for specific repos and prompts, not a single-number cross-repo claim.

Source: github.com/mohanagy/madar
License: MIT
Updated: 2026-05-02

Published benchmarks

Two published receipts today, with broader per-repo spread tracked in the suite scaffold.

Retrieval benchmark

receiptretrieval run

2026-04-30 — GoValidate `native_agent`

The same Claude Code question against a 1,268-file NestJS + Next.js production codebase, asked once with file tools only and once with madar available. Read it as one dated receipt with raw evidence attached, not as a universal headline claim.

same model same prompt dated artifact

PR review benchmark

receiptPR review run

2026-05-02 — GoValidate `review-compare`

A real 36-file branch diff measured with review-compare, comparing the verbose and compact pr_impact prompts side by side. It shows one real compact-review receipt, not a cross-repo guarantee.

same review target share-safe artifact 36 files

Why these are reproducible

The committed evidence is small, plain text, and signed by the runner.

Provider-reported numbers: The retrieval benchmark numbers come from Anthropic's own usage field in the claude --output-format json output. That makes the receipts inspectable, but still repo- and prompt-specific.
One-line verify.sh: Each benchmark ships a shell script that recomputes the saved numbers from the committed evidence files. Clone the repo, run it, get the same answer for that receipt.
Prompts and answers are committed: Verbose and compact prompts are in the repo, plus the runner's actual answers, plus the structured report.json. Nothing here is sourced from a screenshot or hand-written table.
Privacy-sanitized: review-compare sanitizes path-derived identifiers before persisting artifacts; usernames and workstation paths don't leak into the committed evidence.
Per-repo spread comes next: The broader benchmark direction is a reproducible suite with per-repo spread and no single-number cross-repo headline. See docs/benchmarks/suite/.

Benchmark receipts, not a single headline number.

Published benchmarks

2026-04-30 — GoValidate native_agent

2026-05-02 — GoValidate review-compare

Why these are reproducible

2026-04-30 — GoValidate `native_agent`

2026-05-02 — GoValidate `review-compare`