PR review benchmark · 2026-05-02
GoValidate review-compare benchmark
A real production branch diff (36 changed files) packaged two ways and reviewed by the same model. The verbose pr_impact prompt sends the full evidence dump; the compact prompt sends the same review target with the structural neighborhood collapsed. Same diff, same reviewer — only the prompt size changes.
Visual comparison
Bars are proportional to the actual measured token counts.
What this measures: the same diff, the same reviewer, the same expected output shape. The win comes from how pr_impact packages the structural neighborhood of the changed lines — not from changing what the reviewer is asked to do. Both runs succeeded and produced a valid review.
Setup
- Codebase: the same production NestJS + Next.js SaaS used in the retrieval benchmark.
- Branch under review: a real working branch with 36 changed files vs `origin/main`.
- Tool: `graphify-ts review-compare`, which runs both prompt variants against the same reviewer and writes a structured `report.json`.
- Reviewer: `cat {prompt_file} | claude -p`. Same model, same flags, both runs.
- Token source: locally counted with `cl100k_base`. Both prompts were measured the same way; the ratio is invariant to the tokenizer choice.
- Privacy: `review-compare` sanitizes path-derived identifiers before persisting the artifacts, so workstation paths and usernames don't leak into the committed evidence.
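The privacy step can be pictured as a path-scrubbing pass over the evidence text before it is written to disk. The sketch below is an illustrative assumption, not `review-compare`'s actual implementation; the regexes and the `USER` placeholder are made up for this example.

```python
import re

def sanitize_paths(text: str) -> str:
    """Replace user home directories (and the usernames inside them)
    with a neutral placeholder before persisting evidence.
    Illustrative only -- not review-compare's actual rules."""
    # macOS/Linux home directories: /Users/alice/... or /home/alice/...
    text = re.sub(r"/(?:Users|home)/[^/\s]+", "/home/USER", text)
    # Windows home directories: C:\Users\alice\...
    text = re.sub(r"[A-Za-z]:\\Users\\[^\\\s]+", r"C:\\Users\\USER", text)
    return text

print(sanitize_paths("error at /Users/alice/work/app/src/main.ts:42"))
# -> error at /home/USER/work/app/src/main.ts:42
```

Any scrubbing pass like this runs before artifacts are committed, so the persisted prompts and reports stay shareable.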
Reproduce the headline numbers
$ git clone https://github.com/mohanagy/graphify-ts.git
$ cd graphify-ts
$ bash docs/benchmarks/2026-05-02-govalidate-pr-review/verify.sh
# Recomputes prompt-token and payload-token ratios from report.json.
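The recomputation that the verify script performs amounts to dividing the two measured token counts. A minimal sketch, assuming hypothetical `report.json` field names (the real schema may differ):

```python
import json

def token_ratio(report: dict) -> float:
    """Verbose-to-compact prompt-token ratio from a parsed report.
    Field names here are assumptions, not the actual report.json schema."""
    verbose = report["verbose"]["prompt_tokens"]
    compact = report["compact"]["prompt_tokens"]
    return verbose / compact

# Made-up counts for illustration; real numbers come from report.json.
sample = json.loads(
    '{"verbose": {"prompt_tokens": 120000}, "compact": {"prompt_tokens": 30000}}'
)
print(f"{token_ratio(sample):.1f}x")  # prints "4.0x"
```

Because both counts come from the same tokenizer applied the same way, the ratio is stable even if you recount with a different encoding.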
Evidence files
All committed in the repo. Inspect, hash, or rerun against your own branch.