Benchmark proofs

Same model. Same question. With and without graphify-ts.

Every number on these pages can be reproduced from a committed file in the repo. Each benchmark links straight to the raw evidence (Anthropic-billed usage JSON, prompts, and answers) and to a one-line verify.sh that recomputes the headline ratio from those files.

License: MIT
Updated: 2026-05-02

Published benchmarks

Two scenarios, both measured on real production code.

Retrieval benchmark

3× fewer turns

2026-04-30 — GoValidate native_agent

The same Claude Code question against a 1,268-file NestJS + Next.js production codebase, asked once with file tools only and once with graphify-ts available. Anthropic-billed usage from claude --output-format json.

9 → 3 turns · 96 → 35 sec · 615K → 234K tokens
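For orientation, here is a minimal TypeScript sketch of the recomputation the retrieval verify step performs. The evidence file names (evidence/usage_baseline.json, evidence/usage_graphify.json) are hypothetical, and the field names (num_turns, duration_ms, usage.*) follow the result shape that claude --output-format json has emitted in recent Claude Code releases; treat both as assumptions, not the repo's actual script.

// verify_retrieval.ts: recompute the retrieval headline from committed evidence.
// File names are hypothetical; field names mirror the claude --output-format json
// result shape (num_turns, duration_ms, usage.*) as seen in recent Claude Code builds.
import { readFileSync } from "node:fs";

interface ClaudeRunResult {
  num_turns: number;
  duration_ms: number;
  usage: {
    input_tokens: number;
    output_tokens: number;
    cache_creation_input_tokens?: number;
    cache_read_input_tokens?: number;
  };
}

const load = (path: string): ClaudeRunResult =>
  JSON.parse(readFileSync(path, "utf8"));

// Count cache-creation and cache-read tokens alongside plain input/output tokens.
const totalTokens = (r: ClaudeRunResult): number =>
  r.usage.input_tokens +
  r.usage.output_tokens +
  (r.usage.cache_creation_input_tokens ?? 0) +
  (r.usage.cache_read_input_tokens ?? 0);

const baseline = load("evidence/usage_baseline.json"); // file tools only
const graphify = load("evidence/usage_graphify.json"); // graphify-ts available

console.log(`turns:  ${baseline.num_turns} -> ${graphify.num_turns}`);
console.log(`sec:    ${Math.round(baseline.duration_ms / 1000)} -> ${Math.round(graphify.duration_ms / 1000)}`);
console.log(`tokens: ${totalTokens(baseline)} -> ${totalTokens(graphify)}`);
console.log(`turn ratio: ${(baseline.num_turns / graphify.num_turns).toFixed(2)}x`);

The token total here folds cache-creation and cache-read tokens in with plain input and output; if the committed headline uses a narrower definition, drop the cache terms.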

PR review benchmark

7.25× smaller

2026-05-02 — GoValidate review-compare

A real 36-file branch diff measured with review-compare, comparing the verbose and compact pr_impact prompts side by side. Same review target, smaller prompt.

63K → 8.7K prompt · 42K → 6.1K payload · 36 files
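The PR review headline is the same kind of straight division over committed artifacts. Below is a TypeScript sketch that assumes report.json exposes per-variant prompt_tokens and payload_tokens fields; the actual review-compare schema may name things differently, so read this as an illustration of the arithmetic, not the shipped verifier.

// compare_prompts.ts: recompute the PR-review headline from the committed report.
// The report.json shape below (verbose/compact blocks with prompt_tokens and
// payload_tokens) is assumed for illustration, not review-compare's real schema.
import { readFileSync } from "node:fs";

interface VariantStats {
  prompt_tokens: number;  // full prompt sent to the model
  payload_tokens: number; // diff/graph payload embedded in that prompt
}

interface CompareReport {
  files_changed: number;
  verbose: VariantStats;
  compact: VariantStats;
}

const report: CompareReport = JSON.parse(
  readFileSync("evidence/report.json", "utf8"),
);

const ratio = report.verbose.prompt_tokens / report.compact.prompt_tokens;

console.log(`files:   ${report.files_changed}`);
console.log(`prompt:  ${report.verbose.prompt_tokens} -> ${report.compact.prompt_tokens}`);
console.log(`payload: ${report.verbose.payload_tokens} -> ${report.compact.payload_tokens}`);
console.log(`headline: ${ratio.toFixed(2)}x smaller prompt`);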

Why these are reproducible

The committed evidence is small, plain text, and signed by the runner.

Provider-reported numbers
The retrieval benchmark numbers come from Anthropic's own usage field in the claude --output-format json output. No local cl100k_base estimates in the headline.
One-line verify.sh
Each benchmark ships a shell script that recomputes the headline number from the committed evidence files. Clone the repo, run it, get the same answer.
Prompts and answers are committed
Verbose and compact prompts are in the repo, plus the runner's actual answers, plus the structured report.json. Nothing in the headline is sourced from a screenshot.
Privacy-sanitized
review-compare sanitizes path-derived identifiers before persisting artifacts; usernames and workstation paths don't leak into the committed evidence.
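To make the sanitization claim concrete, here is a minimal sketch of the idea, assuming only that path-derived identifiers are rewritten before artifacts are persisted; it is illustrative, not review-compare's actual implementation.

// sanitize_paths.ts: illustrative only. Scrub workstation-specific identifiers
// from a path before it is written into committed evidence.
import { homedir, userInfo } from "node:os";

export function sanitizePath(p: string): string {
  const home = homedir();
  const user = userInfo().username;
  return p
    .split(home).join("~")          // collapse the home directory prefix
    .split(user).join("<user>")     // mask the username wherever else it appears
    .replace(/\\/g, "/");           // normalize Windows separators
}

// Example: "/Users/alice/work/govalidate/src/app.ts" -> "~/work/govalidate/src/app.ts"

Replacing the home prefix before the bare username keeps the common case readable (paths start with ~) while still catching a username that shows up outside the home directory.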