Retrieval benchmark · 2026-04-30
GoValidate native_agent benchmark
The same question, "How does the v2 idea generation pipeline work end-to-end?", was asked twice against a real 1,268-file NestJS + Next.js production SaaS: once with file tools only, once with graphify-ts available. Numbers come from Anthropic's own `usage` field in `claude --output-format json`.
| Metric | Baseline (file tools) | Graphify | Change |
|---|---|---|---|
| Tool-call turns | 9 | 3 | 3× fewer |
| End-to-end latency | 96.4 s | 34.7 s | 2.77× faster |
| Input tokens | 615,190 | 233,508 | 2.63× fewer |
| Cold-start session cost | $0.62 | $0.70 | +13% |
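The multipliers follow directly from the raw numbers. A quick arithmetic check, assuming nothing beyond the table above:

```ts
// Recompute the headline multipliers from the measured values.
const turns = 9 / 3;                    // 3.00x fewer tool-call turns
const latency = 96.4 / 34.7;            // ≈ 2.78x (headline rounds to 2.77x)
const tokens = 615_190 / 233_508;       // ≈ 2.63x fewer input tokens
const costDelta = (0.70 - 0.62) / 0.62; // ≈ +12.9%, reported as +13%

console.log(turns, latency.toFixed(2), tokens.toFixed(2), `+${(costDelta * 100).toFixed(1)}%`);
```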
Visual comparison

[Bar chart omitted; bars are proportional to the measured values in the table above.]
Honest disclosure: the cold-start session pays an MCP-overhead premium (about +13% on a single-question first run). Multi-question sessions amortize this fixed cost and end up cheaper than the baseline. The speed and turn-count wins are unconditional in the measured run.
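A rough break-even sketch makes the amortization claim concrete. Two assumptions that were not measured in this run: the $0.08 premium is treated as a one-time graph-loading cost, and each graphify follow-up question is priced at the baseline per-question cost divided by the 2.63× input-token reduction. Both are simplifications for illustration only.

```ts
// Back-of-envelope break-even estimate. ASSUMPTIONS (not measured here):
// the +$0.08 cold-start premium is a one-time cost, and follow-up cost
// scales with the 2.63x input-token reduction.
const baselinePerQuestion = 0.62;                    // measured: baseline cold start
const graphifyColdStart = 0.70;                      // measured: graphify cold start
const graphifyFollowUp = baselinePerQuestion / 2.63; // hypothetical: ≈ $0.24

const sessionCost = (n: number, cold: number, followUp: number): number =>
  cold + Math.max(0, n - 1) * followUp;

for (let n = 1; n <= 4; n++) {
  const base = sessionCost(n, baselinePerQuestion, baselinePerQuestion);
  const graph = sessionCost(n, graphifyColdStart, graphifyFollowUp);
  console.log(`${n} question(s): baseline $${base.toFixed(2)} vs graphify $${graph.toFixed(2)}`);
}
// Under these assumptions graphify is cheaper from the second question on.
```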
Setup
- Codebase: a real production NestJS + Next.js SaaS. 1,268 files, approximately 860K words.
- Model: `claude-opus-4.7` (1M context).
- Question: "How does the v2 idea generation pipeline work end-to-end?"
- Baseline runner: plain Claude Code with the standard `Read`/`Grep`/`Glob` file tools, no MCP server attached.
- Graphify runner: same Claude Code, with the graphify-ts MCP server (`core` tool profile) attached and a fresh `graphify-out/graph.json`.
- Token source: Anthropic-reported `usage` from `claude --output-format json`, not local `cl100k_base` estimates (see the parsing sketch after this list).
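For anyone rerunning the benchmark, here is a minimal sketch of pulling token counts out of a captured result. The field names (`usage`, `input_tokens`, the cache counters) assume the Anthropic API usage shape and may differ across CLI versions; `run.json` is a hypothetical capture path, not a file in this repo.

```ts
// Minimal sketch: read token counts from a captured `claude
// --output-format json` result. ASSUMPTION: the result object carries a
// `usage` block in the Anthropic API shape; verify against your CLI version.
import { readFileSync } from "node:fs";

interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Hypothetical capture: claude -p "<question>" --output-format json > run.json
const result = JSON.parse(readFileSync(process.argv[2] ?? "run.json", "utf8"));
const u: Usage = result.usage;

// One plausible accounting for "input tokens": everything the model read,
// including prompt-cache writes and reads. The benchmark's exact
// aggregation may differ.
const totalInput =
  u.input_tokens +
  (u.cache_creation_input_tokens ?? 0) +
  (u.cache_read_input_tokens ?? 0);

console.log(`input tokens: ${totalInput}, output tokens: ${u.output_tokens}`);
```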
Reproduce the headline numbers
```sh
$ git clone https://github.com/mohanagy/graphify-ts.git
$ cd graphify-ts
$ bash docs/benchmarks/2026-04-30-govalidate/verify.sh
# Recomputes the headline ratio from the committed evidence files.
```
Evidence files
All committed in the repo. Inspect, hash, or rerun against your own runner.