PR review benchmark · 2026-05-02

GoValidate review-compare benchmark

A real production branch diff (36 changed files) packaged two ways and reviewed by the same model. The verbose pr_impact prompt sends the full evidence dump; the compact prompt sends the same review target with the structural neighborhood collapsed. Same diff, same reviewer; only the packaging, and with it the prompt size, changes.

Codebase
govalidate/platform (NestJS + Next.js)
Diff
36 files changed vs origin/main
Runner
cat {prompt}.txt | claude -p

7.25× smaller prompt, same review target (63,024 → 8,690 tokens)
6.89× smaller payload (41,969 → 6,093 tokens)
36 changed files in the diff, measured against origin/main
143 seed nodes and 3 structural hotspots, with line-aware seeds from diff hunks
2/2 runs succeeded: verbose and compact both returned a review

Visual comparison

[Bar chart: bars proportional to the actual measured token counts.]

Prompt tokens (entire prompt to the model), 7.25× smaller
  Verbose pr_impact: 63,024
  Compact pr_impact: 8,690
Payload tokens (just the structural evidence block), 6.89× smaller
  Verbose pr_impact: 41,969
  Compact pr_impact: 6,093

What this measures: the same diff, the same reviewer, the same expected output shape. The win comes from how pr_impact packages the structural neighborhood of the changed lines — not from changing what the reviewer is asked to do. Both runs succeeded and produced a valid review.
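
The headline ratios follow directly from the four token counts reported on this page. A minimal Python sketch of that arithmetic (the function name is illustrative and not part of graphify-ts):

```python
def reduction_ratio(verbose_tokens: int, compact_tokens: int) -> float:
    """Return how many times smaller the compact variant is than the verbose one."""
    if compact_tokens <= 0:
        raise ValueError("compact_tokens must be positive")
    return verbose_tokens / compact_tokens


# Token counts as reported in this benchmark.
prompt_ratio = reduction_ratio(63_024, 8_690)
payload_ratio = reduction_ratio(41_969, 6_093)

print(f"prompt:  {prompt_ratio:.2f}x smaller")   # prompt:  7.25x smaller
print(f"payload: {payload_ratio:.2f}x smaller")  # payload: 6.89x smaller
```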

Setup

Codebase
The same production NestJS + Next.js SaaS used in the retrieval benchmark.
Branch under review
A real working branch with 36 changed files vs origin/main.
Tool
graphify-ts review-compare — runs both prompt variants against the same reviewer and writes a structured report.json.
Reviewer
cat {prompt_file} | claude -p. Same model, same flags, both runs.
Token source
Locally counted with cl100k_base. Both prompts were measured the same way, so the ratio should stay roughly stable across tokenizers, although absolute counts under the reviewer model's own tokenizer may differ.
Privacy
review-compare sanitizes path-derived identifiers before persisting the artifacts. Workstation paths and usernames don't leak into the committed evidence.
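
For the token source listed above, counting a prompt file with cl100k_base might look like the sketch below. It assumes the third-party tiktoken package is installed; the source does not say which library produced the published counts, and the helper names and file names are placeholders.

```python
from pathlib import Path
from typing import Callable, Sequence

Encoder = Callable[[str], Sequence]


def load_cl100k_encoder() -> Encoder:
    """Return an encode function backed by cl100k_base.

    Assumption: the third-party `tiktoken` package is available.
    """
    import tiktoken  # imported lazily so the helpers below work without it

    encoding = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() treats any special-token text as ordinary text
    # instead of raising, which is what we want when counting a raw prompt.
    return lambda text: encoding.encode(text, disallowed_special=())


def count_tokens(text: str, encode: Encoder) -> int:
    """Count tokens in `text` using whichever encoder is passed in."""
    return len(encode(text))


def count_file_tokens(path: Path, encode: Encoder) -> int:
    """Count tokens in a UTF-8 text file."""
    return count_tokens(path.read_text(encoding="utf-8"), encode)


# Example usage (placeholder file names):
#   encode = load_cl100k_encoder()
#   verbose = count_file_tokens(Path("verbose.txt"), encode)
#   compact = count_file_tokens(Path("compact.txt"), encode)
#   print(verbose / compact)
```

Passing the encoder in as a parameter keeps the counting logic independent of any one tokenizer, which makes it easy to repeat the measurement with a different one and compare ratios.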

Reproduce the headline numbers

$ git clone https://github.com/mohanagy/graphify-ts.git
$ cd graphify-ts
$ bash docs/benchmarks/2026-05-02-govalidate-pr-review/verify.sh
# Recomputes prompt-token and payload-token ratios from report.json.
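
The contents of verify.sh and the schema of report.json are not shown on this page. As a rough illustration of what such a check involves, the sketch below recomputes both ratios from a JSON report and compares them with the published figures. The field names (`verbose`, `compact`, `prompt_tokens`, `payload_tokens`) are hypothetical and would need to be adapted to the real file.

```python
import json

# Hypothetical report layout; the real report.json may be structured differently.
SAMPLE_REPORT = """
{
  "verbose": {"prompt_tokens": 63024, "payload_tokens": 41969},
  "compact": {"prompt_tokens": 8690, "payload_tokens": 6093}
}
"""

# Headline ratios as published on this page (rounded to two decimals).
CLAIMED = {"prompt_tokens": 7.25, "payload_tokens": 6.89}


def recompute_ratios(report: dict) -> dict:
    """Recompute verbose/compact ratios for each claimed metric."""
    return {key: report["verbose"][key] / report["compact"][key] for key in CLAIMED}


def check_report(report: dict, tolerance: float = 0.005) -> bool:
    """True if every recomputed ratio matches its claim within two-decimal rounding."""
    ratios = recompute_ratios(report)
    return all(abs(ratios[key] - CLAIMED[key]) <= tolerance for key in CLAIMED)


report = json.loads(SAMPLE_REPORT)
print(check_report(report))  # True
```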

Evidence files

All committed in the repo. Inspect, hash, or rerun against your own branch.