Benchmark proofs

Benchmark receipts, not a single headline number.

Every number on these pages is reproduced from a committed file in the repo. Each benchmark links straight to the raw evidence (Anthropic-billed usage JSON, prompts, and answers) plus a small verify.sh. These pages are receipts for specific repos and prompts, not a single-number cross-repo claim.

License
MIT
Updated
2026-05-02

Published benchmarks

Two published receipts today, with broader per-repo spread tracked in the suite scaffold.

Retrieval benchmark

receiptretrieval run

2026-04-30 — GoValidate native_agent

The same Claude Code question against a 1,268-file NestJS + Next.js production codebase, asked once with file tools only and once with madar available. Read it as one dated receipt with raw evidence attached, not as a universal headline claim.

same model same prompt dated artifact

PR review benchmark

receiptPR review run

2026-05-02 — GoValidate review-compare

A real 36-file branch diff measured with review-compare, comparing the verbose and compact pr_impact prompts side by side. It shows one real compact-review receipt, not a cross-repo guarantee.

same review target share-safe artifact 36 files

Why these are reproducible

The committed evidence is small, plain text, and signed by the runner.

Provider-reported numbers
The retrieval benchmark numbers come from Anthropic's own usage field in the claude --output-format json output. That makes the receipts inspectable, but still repo- and prompt-specific.
One-line verify.sh
Each benchmark ships a shell script that recomputes the saved numbers from the committed evidence files. Clone the repo, run it, get the same answer for that receipt.
Prompts and answers are committed
Verbose and compact prompts are in the repo, plus the runner's actual answers, plus the structured report.json. Nothing here is sourced from a screenshot or hand-written table.
Privacy-sanitized
review-compare sanitizes path-derived identifiers before persisting artifacts; usernames and workstation paths don't leak into the committed evidence.
Per-repo spread comes next
The broader benchmark direction is a reproducible suite with per-repo spread and no single-number cross-repo headline. See docs/benchmarks/suite/.