
Retrieval benchmark · 2026-04-30

GoValidate native_agent benchmark

The same question — "How does the v2 idea generation pipeline work end-to-end?" — was asked twice against a real 1,268-file NestJS + Next.js production SaaS: once with file tools only, once with graphify-ts available. All numbers come from Anthropic's own usage field in the claude --output-format json output.

Codebase: NestJS + Next.js, 1,268 files, ~860K words
Model: Claude Opus 4.7
Runner: claude --output-format json

3× fewer tool-call turns (9 → 3)
2.77× faster end-to-end latency (96.4 s → 34.7 s)
2.63× fewer input tokens (615,190 → 233,508)
+13% cold-start session cost ($0.62 → $0.70)

Measured values

Metric                        | Without graphify-ts | With graphify-ts | Delta
Tool-call turns               | 9                   | 3                | 3× fewer
End-to-end latency            | 96.4 s              | 34.7 s           | 2.77× faster
Anthropic-billed input tokens | 615,190             | 233,508          | 2.63× fewer
Cold-start session cost       | $0.62               | $0.70            | +13% trade-off

Honest disclosure: the cold-start session pays an MCP-overhead premium of about +13% on a single-question first run. Multi-question sessions amortize that one-time overhead and end up cheaper than the baseline. The speed and turn-count wins hold even in this worst-case single-question run.
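The token and cost deltas above can be rechecked directly from the raw measurements on this page; only the arithmetic below is new:

```shell
# Recompute two headline figures from the measured values quoted above.
awk 'BEGIN {
  printf "input-token ratio: %.2fx\n", 615190 / 233508                   # ~2.63x
  printf "cold-start cost delta: +%.0f%%\n", (0.70 - 0.62) / 0.62 * 100  # ~+13%
}'
```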

Setup

Codebase: a real production NestJS + Next.js SaaS. 1,268 files, approximately 860K words.
Model: claude-opus-4.7 (1M context).
Question: "How does the v2 idea generation pipeline work end-to-end?"
Baseline runner: plain Claude Code with the standard Read / Grep / Glob file tools, no MCP server attached.
Graphify runner: the same Claude Code, with the graphify-ts MCP server (core tool profile) attached and a fresh graphify-out/graph.json.
Token source: Anthropic-reported usage from claude --output-format json, not local cl100k_base estimates.

Reproduce the headline numbers

$ git clone https://github.com/mohanagy/graphify-ts.git
$ cd graphify-ts
$ bash docs/benchmarks/2026-04-30-govalidate/verify.sh
# Recomputes the headline ratio from the committed evidence files.

Evidence files

All evidence files are committed in the repo. Inspect them, hash them, or rerun them against your own runner.
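To compare a rerun against the committed evidence byte-for-byte, hashing is enough. A sketch using a stand-in file (in the repo, point sha256sum at the committed evidence paths instead):

```shell
# Demo with a stand-in file; substitute the committed evidence paths in the repo.
mkdir -p /tmp/evidence-demo
printf '{"input_tokens": 233508}\n' > /tmp/evidence-demo/run.json
sha256sum /tmp/evidence-demo/run.json
```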