Benchmark proofs

Same model. Same question. With and without graphify-ts.

Every number on these pages can be reproduced from a committed file in the repo. Each benchmark links straight to the raw evidence (Anthropic-billed usage JSON, prompts, and answers) and to a one-line verify.sh that recomputes the headline ratio from those files.

License: MIT
Updated: 2026-05-02

Published benchmarks

Two scenarios, both measured on real production code.

Retrieval benchmark

3× fewer turns

2026-04-30 — GoValidate native_agent

The same Claude Code question against a 1,268-file NestJS + Next.js production codebase, asked once with file tools only and once with graphify-ts available. Anthropic-billed usage from claude --output-format json.

9 → 3 turns · 96 → 35 sec · 615K → 234K tokens
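For orientation, here is a minimal TypeScript sketch of the recomputation the retrieval verify step performs. The evidence file names (evidence/usage_baseline.json, evidence/usage_graphify.json) are hypothetical, and the field names (num_turns, duration_ms, usage.*) follow the result shape that claude --output-format json has emitted in recent Claude Code releases; treat both as assumptions, not the repo's actual script.

// verify_retrieval.ts: recompute the retrieval headline from committed evidence.
// File names are hypothetical; field names mirror the claude --output-format json
// result shape (num_turns, duration_ms, usage.*) as seen in recent Claude Code builds.
import { readFileSync } from "node:fs";

interface ClaudeRunResult {
  num_turns: number;
  duration_ms: number;
  usage: {
    input_tokens: number;
    output_tokens: number;
    cache_creation_input_tokens?: number;
    cache_read_input_tokens?: number;
  };
}

const load = (path: string): ClaudeRunResult =>
  JSON.parse(readFileSync(path, "utf8"));

// Count cache-creation and cache-read tokens alongside plain input/output tokens.
const totalTokens = (r: ClaudeRunResult): number =>
  r.usage.input_tokens +
  r.usage.output_tokens +
  (r.usage.cache_creation_input_tokens ?? 0) +
  (r.usage.cache_read_input_tokens ?? 0);

const baseline = load("evidence/usage_baseline.json"); // file tools only
const graphify = load("evidence/usage_graphify.json"); // graphify-ts available

console.log(`turns:  ${baseline.num_turns} -> ${graphify.num_turns}`);
console.log(`sec:    ${Math.round(baseline.duration_ms / 1000)} -> ${Math.round(graphify.duration_ms / 1000)}`);
console.log(`tokens: ${totalTokens(baseline)} -> ${totalTokens(graphify)}`);
console.log(`turn ratio: ${(baseline.num_turns / graphify.num_turns).toFixed(2)}x`);

The token total here folds cache-creation and cache-read tokens in with plain input and output; if the committed headline uses a narrower definition, drop the cache terms.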

PR review benchmark

7.25× smaller

2026-05-02 — GoValidate review-compare

A real 36-file branch diff measured with review-compare, comparing the verbose and compact pr_impact prompts side by side. Same review target, smaller prompt.

63K → 8.7K prompt · 42K → 6.1K payload · 36 files
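The PR review headline is the same kind of straight division over committed artifacts. Below is a TypeScript sketch that assumes report.json exposes per-variant prompt_tokens and payload_tokens fields; the actual review-compare schema may name things differently, so read this as an illustration of the arithmetic, not the shipped verifier.

// compare_prompts.ts: recompute the PR-review headline from the committed report.
// The report.json shape below (verbose/compact blocks with prompt_tokens and
// payload_tokens) is assumed for illustration, not review-compare's real schema.
import { readFileSync } from "node:fs";

interface VariantStats {
  prompt_tokens: number;  // full prompt sent to the model
  payload_tokens: number; // diff/graph payload embedded in that prompt
}

interface CompareReport {
  files_changed: number;
  verbose: VariantStats;
  compact: VariantStats;
}

const report: CompareReport = JSON.parse(
  readFileSync("evidence/report.json", "utf8"),
);

const ratio = report.verbose.prompt_tokens / report.compact.prompt_tokens;

console.log(`files:   ${report.files_changed}`);
console.log(`prompt:  ${report.verbose.prompt_tokens} -> ${report.compact.prompt_tokens}`);
console.log(`payload: ${report.verbose.payload_tokens} -> ${report.compact.payload_tokens}`);
console.log(`headline: ${ratio.toFixed(2)}x smaller prompt`);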

Why these are reproducible

The committed evidence is small, plain text, and signed by the runner.

Provider-reported numbers
The retrieval benchmark numbers come from Anthropic's own usage field in the claude --output-format json output. No local cl100k_base estimates in the headline.
One-line verify.sh
Each benchmark ships a shell script that recomputes the headline number from the committed evidence files. Clone the repo, run it, get the same answer.
Prompts and answers are committed
Verbose and compact prompts are in the repo, plus the runner's actual answers, plus the structured report.json. Nothing in the headline is sourced from a screenshot.
Privacy-sanitized
review-compare sanitizes path-derived identifiers before persisting artifacts; usernames and workstation paths don't leak into the committed evidence.
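To make the sanitization claim concrete, here is a minimal sketch of the idea, assuming only that path-derived identifiers are rewritten before artifacts are persisted; it is illustrative, not review-compare's actual implementation.

// sanitize_paths.ts: illustrative only. Scrub workstation-specific identifiers
// from a path before it is written into committed evidence.
import { homedir, userInfo } from "node:os";

export function sanitizePath(p: string): string {
  const home = homedir();
  const user = userInfo().username;
  return p
    .split(home).join("~")          // collapse the home directory prefix
    .split(user).join("<user>")     // mask the username wherever else it appears
    .replace(/\\/g, "/");           // normalize Windows separators
}

// Example: "/Users/alice/work/govalidate/src/app.ts" -> "~/work/govalidate/src/app.ts"

Replacing the home prefix before the bare username keeps the common case readable (paths start with ~) while still catching a username that shows up outside the home directory.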