← All benchmarks

Retrieval benchmark · 2026-04-30

GoValidate native_agent benchmark

The same question — "How does the v2 idea generation pipeline work end-to-end?" — asked twice against a real 1,268-file NestJS + Next.js production SaaS. Once with file tools only, once with madar available. Numbers come from Anthropic's own usage field in claude --output-format json.

Codebase
NestJS + Next.js, 1,268 files, ~860K words
Model
Claude Opus 4.7
Runner
claude --output-format json
3×
fewer tool-call turns
9 → 3
2.77×
faster end-to-end latency
96.4 sec → 34.7 sec
2.63×
fewer input tokens
615,190 → 233,508
+13%
cold-start session cost
$0.62 → $0.70

Visual comparison

Bars are proportional to the actual measured values.

Tool-call turns 3× fewer
Without madar 9
With madar 3
End-to-end latency 2.77× faster
Without madar 96.4 s
With madar 34.7 s
Anthropic-billed input tokens 2.63× fewer
Without madar 615,190
With madar 233,508
Cold-start session cost +13% trade-off
Without madar $0.62
With madar $0.70

Honest disclosure: the cold-start session pays an MCP-overhead premium (about +13% on a single-question first run). Multi-question sessions amortize this and end up cheaper than baseline. Speed and turn-count wins are unconditional in the measured run.

Setup

Codebase
A real production NestJS + Next.js SaaS. 1,268 files, approximately 860K words.
Model
claude-opus-4.7 (1M context).
Question
"How does the v2 idea generation pipeline work end-to-end?"
Baseline runner
Plain Claude Code with the standard Read / Grep / Glob file tools, no MCP server attached.
Madar runner
Same Claude Code, with madar MCP server (core tool profile) attached and a fresh out/graph.json.
Token source
Anthropic-reported usage from claude --output-format json. Not local cl100k_base estimates.

Reproduce the headline numbers

$ git clone https://github.com/mohanagy/madar.git
$ cd madar
$ bash docs/benchmarks/2026-04-30-govalidate/verify.sh
# Recomputes the headline ratio from the committed evidence files.

Evidence files

All committed in the repo. Inspect, hash, or rerun against your own runner.