PR review benchmark · 2026-05-02
GoValidate review-compare benchmark
A real production branch diff (36 changed files) packaged two ways and reviewed by the same model. The verbose pr_impact prompt sends the full evidence dump; the compact prompt sends the same review target with the structural neighborhood collapsed. Same diff, same reviewer — only the prompt size changes.
Visual comparison
Bars are proportional to the actual measured token counts.
What this measures: the same diff, the same reviewer, the same expected output shape. The win comes from how pr_impact packages the structural neighborhood of the changed lines — not from changing what the reviewer is asked to do. Both runs succeeded and produced a valid review.
Setup
- Codebase: the same production NestJS + Next.js SaaS used in the retrieval benchmark.
- Branch under review: a real working branch with 36 changed files vs `origin/main`.
- Tool: `graphify-ts review-compare`, which runs both prompt variants against the same reviewer and writes a structured `report.json`.
- Reviewer: `cat {prompt_file} | claude -p`. Same model, same flags, both runs.
- Token source: locally counted with `cl100k_base`. Both prompts were measured the same way; the ratio is invariant to the tokenizer choice.
- Privacy: `review-compare` sanitizes path-derived identifiers before persisting the artifacts, so workstation paths and usernames don't leak into the committed evidence.
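The privacy step can be pictured as a path-scrubbing pass over the evidence text before it is written to disk. The sketch below is an illustrative assumption, not `review-compare`'s actual implementation; the regexes and the `USER` placeholder are made up for this example.

```python
import re

def sanitize_paths(text: str) -> str:
    """Replace user home directories (and the usernames inside them)
    with a neutral placeholder before persisting evidence.
    Illustrative only -- not review-compare's actual rules."""
    # macOS/Linux home directories: /Users/alice/... or /home/alice/...
    text = re.sub(r"/(?:Users|home)/[^/\s]+", "/home/USER", text)
    # Windows home directories: C:\Users\alice\...
    text = re.sub(r"[A-Za-z]:\\Users\\[^\\\s]+", r"C:\\Users\\USER", text)
    return text

print(sanitize_paths("error at /Users/alice/work/app/src/main.ts:42"))
# -> error at /home/USER/work/app/src/main.ts:42
```

Any scrubbing pass like this runs before artifacts are committed, so the persisted prompts and reports stay shareable.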
Reproduce the headline numbers
$ git clone https://github.com/mohanagy/graphify-ts.git
$ cd graphify-ts
$ bash docs/benchmarks/2026-05-02-govalidate-pr-review/verify.sh
# Recomputes prompt-token and payload-token ratios from report.json.
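The recomputation that the verify script performs amounts to dividing the two measured token counts. A minimal sketch, assuming hypothetical `report.json` field names (the real schema may differ):

```python
import json

def token_ratio(report: dict) -> float:
    """Verbose-to-compact prompt-token ratio from a parsed report.
    Field names here are assumptions, not the actual report.json schema."""
    verbose = report["verbose"]["prompt_tokens"]
    compact = report["compact"]["prompt_tokens"]
    return verbose / compact

# Made-up counts for illustration; real numbers come from report.json.
sample = json.loads(
    '{"verbose": {"prompt_tokens": 120000}, "compact": {"prompt_tokens": 30000}}'
)
print(f"{token_ratio(sample):.1f}x")  # prints "4.0x"
```

Because both counts come from the same tokenizer applied the same way, the ratio is stable even if you recount with a different encoding.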
Evidence files
All committed in the repo. Inspect, hash, or rerun against your own branch.