Retrieval benchmark · 2026-04-30

GoValidate `native_agent` benchmark

The same question — "How does the v2 idea generation pipeline work end-to-end?" — asked twice against a real 1,268-file NestJS + Next.js production SaaS. Once with file tools only, once with madar available. Numbers come from Anthropic's own usage field in claude --output-format json.

Codebase: NestJS + Next.js, 1,268 files, ~860K words
Model: Claude Opus 4.7
Runner: claude --output-format json

3× fewer tool-call turns (9 → 3)
2.77× faster end-to-end latency (96.4 sec → 34.7 sec)
2.63× fewer input tokens (615,190 → 233,508)
+13% cold-start session cost ($0.62 → $0.70)

Visual comparison

Bar chart of the two runs on four metrics, with bars proportional to the measured values:

Tool-call turns (3× fewer): without madar 9, with madar 3
End-to-end latency (2.77× faster): without madar 96.4 s, with madar 34.7 s
Anthropic-billed input tokens (2.63× fewer): without madar 615,190, with madar 233,508
Cold-start session cost (+13% trade-off): without madar $0.62, with madar $0.70

Honest disclosure: the cold-start session pays an MCP-overhead premium, about +13% on a single-question first run ($0.62 → $0.70). Multi-question sessions amortize this overhead and end up cheaper than baseline. The speed and turn-count wins in the measured run do not depend on that amortization.
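As a quick sanity check, the headline figures can be recomputed from the measured values quoted on this page. The sketch below is only an illustration of that arithmetic; it is not the repo's verify.sh, and the function and dictionary names are made up for this example.

```python
# Measured values copied from this page: (without madar, with madar).
MEASURED = {
    "tool_call_turns": (9, 3),
    "latency_seconds": (96.4, 34.7),
    "input_tokens": (615_190, 233_508),
    "session_cost_usd": (0.62, 0.70),
}


def reduction_factor(baseline, with_madar):
    """How many times smaller the madar value is than the baseline value."""
    return baseline / with_madar


def percent_change(baseline, with_madar):
    """Signed percentage change of the madar value relative to the baseline."""
    return (with_madar - baseline) / baseline * 100.0


if __name__ == "__main__":
    for name in ("tool_call_turns", "latency_seconds", "input_tokens"):
        print(f"{name}: {reduction_factor(*MEASURED[name]):.2f}x")
    cost = percent_change(*MEASURED["session_cost_usd"])
    print(f"session_cost_usd: {cost:+.1f}%")
```

With the rounded latencies shown here the quotient comes out near 2.78, so the published 2.77× presumably derives from the unrounded timings in the evidence files.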

Setup

Codebase: A real production NestJS + Next.js SaaS. 1,268 files, approximately 860K words.
Model: claude-opus-4.7 (1M context).
Question: "How does the v2 idea generation pipeline work end-to-end?"
Baseline runner: Plain Claude Code with the standard Read / Grep / Glob file tools, no MCP server attached.
Madar runner: Same Claude Code, with madar MCP server (core tool profile) attached and a fresh out/graph.json.
Token source: Anthropic-reported usage from claude --output-format json. Not local cl100k_base estimates.
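To make the token source concrete, here is a minimal sketch of reading an input-token total out of the JSON that the runner prints. It assumes the output carries a top-level "usage" object whose keys mirror the Anthropic Messages API usage object (input_tokens, cache_creation_input_tokens, cache_read_input_tokens). Whether the benchmark's input-token figure sums exactly these three fields is also an assumption; the committed evidence files are the authority.

```python
import json

# Assumed input-side keys, mirroring the Anthropic Messages API usage object.
INPUT_KEYS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def total_input_tokens(result_json: str) -> int:
    """Sum the input-side token counts under the top-level "usage" key.

    Keys that are missing or null count as zero, so the function also
    works on payloads that omit the cache fields.
    """
    usage = json.loads(result_json).get("usage") or {}
    return sum(int(usage.get(key) or 0) for key in INPUT_KEYS)
```

Calling total_input_tokens on the saved JSON from each run gives two totals that can be compared directly, with no local tokenizer estimate involved.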

Reproduce the headline numbers

$ git clone https://github.com/mohanagy/madar.git
$ cd madar
$ bash docs/benchmarks/2026-04-30-govalidate/verify.sh
# Recomputes the headline ratio from the committed evidence files.

Evidence files

All committed in the repo. Inspect, hash, or rerun against your own runner.
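For the "hash" step, one option is to record a SHA-256 digest per file so a later checkout can be compared against it. This is a generic sketch rather than part of the repo; the function name is hypothetical, and pointing it at docs/benchmarks/2026-04-30-govalidate assumes the evidence files sit in the same directory as verify.sh.

```python
import hashlib
from pathlib import Path


def sha256_of_files(directory):
    """Return {relative path: SHA-256 hex digest} for every file under directory."""
    root = Path(directory)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests
```

Saving the returned mapping and diffing it after a fresh clone shows whether any evidence file has changed.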
