The 14% gap: why every code-review finding needs a rationale

Wes Guirra · Founder · Revund

May 20, 2026 · 13 min read

You merged a PR last week where a reviewer commented "nit: maybe pull this into a helper?" and you ignored it. Three weeks later, the duplicated logic forked: someone tweaked one copy and didn't touch the other. Now you're chasing a bug that exists in exactly one of two near-identical functions. The reviewer was right. You didn't ignore them because they were wrong; you ignored them because the comment didn't carry enough information to know whether they were right.

This is the structural problem with code review, and it's the load-bearing problem for any AI tool that tries to fix it. A finding without a rationale is indistinguishable from noise, even when it's correct.

In the previous post we covered the SmartBear cliff: defect detection collapses past 200 lines of changed code. That data was the human side of the curve. This post is about the output side: of the comments that survive a review, how many actually help the author make a better decision, and what separates the ones that do from the ones that get rolled over? The answer reshapes the contract between any reviewer (human or AI) and the engineer reading their work.

The Bacchelli & Bird gap, restated

The Bacchelli & Bird 2013 ICSE paper found that 44% of surveyed engineers say defect-finding is the primary purpose of code review, but only 14% of actual review comments are defect-related. We covered that gap in post #1. The framing there was: bug-hunting capacity is the first thing that gets cut when the diff gets long.

There's a deeper reading.

The 14% is what reviewers wrote. But that's not the same as what was useful to the author. To see the gap that actually matters, we need a second variable: how often each comment type produced a code change in the PR.

Figure 1. Action rate by comment type (primary) overlaid on the original Bacchelli & Bird category mix (secondary). Defect comments are a 14% minority of the output but the highest-acted-on category; they punch well above their weight. The categories with the lowest action rates (knowledge transfer, alternative solutions) are also the most common time-sinks in real reviews.

Sources: Bacchelli & Bird, ICSE 2013, Table 3 (secondary bars). Revund pilot dataset Q1 2026, n=14 teams, ~2,200 findings (primary bars).

What jumps out: defect comments succeed most of the time (a 76% action rate in the pilot data), even though they're a minority of the output. But that 76% has a tail, and the tail is comments that were correct but didn't carry enough context for the author to act.

The same pilot dataset captured comment lengths. Comments above 60 words had a 71% action rate. Comments under 15 words had an 11% action rate. Length predicts action better than category does. A short defect comment loses to a long style comment. The variable that matters isn't what you wrote; it's how well you defended what you wrote.

This is the rationale gap, and it has a simple shape: short comments lose, even when they're right.

What a rationale actually is

Let's narrow the definition. A rationale on a code-review finding is the answer to:

If I (the author) accept this comment and change the code, what real-world consequence am I avoiding, and how would I know if you were wrong?

Two clauses. The first is the what. The second, the falsifiability clause, is what most comments skip and the reason action rates collapse below 60 words. If a reader can't refute the rationale without becoming the reviewer themselves, the rationale isn't load-bearing; it's a request to take the reviewer's word for it.

There are three common rationale failures, in increasing order of how badly they break trust:

Tautological. "This is wrong because it's not right." Restates the finding without adding signal.
Stylistic. "This isn't how I'd write it." Real preference, but not load-bearing; the author can read the room and decide.
Plausible but unverifiable. "This could cause a race condition." (Could it? Under what scheduling? Which two threads?) The reader can't refute it without spinning up the same analysis the reviewer should have done.

The third one is the most dangerous failure mode. It's also the one AI reviewers are most prone to. A language model can produce plausible-sounding security or concurrency objections at a tempo no human can audit. If the tool surfaces these without a falsifiable why, the only signal the human extracts is "this tool fires false positives", and the team turns it off.

The asymmetry of dismissal

Here's why this matters more for AI than it ever did for humans. When a senior engineer leaves a "nit" comment, the author has a social signal: this is someone whose other thirty PRs they've reviewed, whose track record is known. The cost of dismissing them wrong is real; the cost of accepting a nit is low.

An AI tool has no track record. The first false positive lands at trust = 0. The second lands at trust = −1. By the fifth, the tool is filtered out of the workflow either by configuration or by attention.

Figure 2. Reported user trust in an AI reviewer's findings, plotted against the cumulative number of false positives encountered before the survey point. Trust halves between the second and third false positive. By FP #8 the tool is effectively muted: users report reading findings out of obligation rather than expectation. The curve is shallower for tools that ship rationales; this plot shows the worst case, findings without.

Illustrative reconstruction calibrated to adoption patterns in Sadowski et al., 'Modern Code Review at Google', ICSE 2018, Section 5, and to the Revund pilot survey (n=14 teams, 4 weeks).

What this means concretely: an AI reviewer that produces 100 findings with 90% precision is worse than one that produces 60 findings with 99% precision, because the former bleeds out its credibility in the first week and the latter compounds. Coverage doesn't matter if the tool is muted. Recall is irrelevant once trust is zero.
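The arithmetic behind that comparison is worth making explicit. A minimal sketch, assuming false positives are simply findings × (1 − precision); the helper name is illustrative and not part of Revund:

```go
package main

// expectedFalsePositives returns how many findings in a batch are expected
// to be wrong, given the batch size and the tool's precision (0–1).
// Hypothetical helper, for illustration only.
func expectedFalsePositives(findings int, precision float64) float64 {
	return float64(findings) * (1 - precision)
}
```

With the numbers above, expectedFalsePositives(100, 0.90) comes to about 10, which is already past the FP #8 point where the trust curve has the tool muted. expectedFalsePositives(60, 0.99) comes to about 0.6, so most batches contain no false positive at all.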

Precision is the only metric that matters for adoption. And precision, for an AI reviewer, is bottlenecked by rationale quality, because the rationale is what lets the human verify the precision claim in real time, finding by finding, without becoming the reviewer themselves.

What dismissal looks like

The pilot data has another shape worth surfacing. Same dataset, broken by rationale form rather than category:

Rationale shape	Rationale length	Dismissed	Acted on	Author behaviour
No rationale	< 15 words	62%	9%	Dismissed on sight; rarely revisited
Restated finding	15–30 words	47%	14%	Dismissed after first read
Generic principle cited	30–60 words	31%	26%	Discussed; sometimes acted on
Falsifiable claim + scope	60–120 words	11%	71%	Acted on or rebutted with detail

The pattern is sharp. The variable that predicts action is rationale length, but the deeper variable is rationale shape. A 30-word "principle cited" rationale ("DRY violation here") gets dismissed nearly three times as often (31% versus 11%) as a 90-word rationale with a falsifiable claim ("this duplicates the logic in formatPrice() at utils/format.ts:14; if either copy is changed without the other, the prices on the cart and the receipt diverge silently"). Same finding. Two different fates.

The forcing function this implies is uncomfortable: short, correct findings lose to long, correct findings, and they should. A reviewer who can't take 90 seconds to write down why is asking the author to take the same 90 seconds to figure it out, except the author has less context and is solving the problem at the wrong end of the funnel.

The product implication

When we designed Revund, this is the constraint we built around. Every Finding from every pass carries a Why field. It is required at three layers:

The type level. A Finding without a non-empty Why cannot be serialized through the gRPC contract. The schema is the gate.
The prompt level. The specialist prompts explicitly demand a falsifiable rationale before the finding is allowed to surface. "Falsifiable" is checked by a self-critique step (more below).
The render level. The CLI and PR-bot will not display a finding whose Why fails a length floor or fails the rationale-shape classifier.

// from core/types.go, abridged
type Finding struct {
    Pass       string
    Severity   Severity
    File       string
    Line       int
    Body       string  // what is wrong
    Why        string  // why it matters; required, never empty
    Confidence float32 // 0–1; findings below 0.6 are filtered
}

This is not a UI choice. It is a forcing function on the model.
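To make the gates concrete, here is a minimal sketch of what a validation step over a Finding might look like. It is not Revund's actual code: the Validate method, the error names, and the 15-word floor are assumptions, Severity is simplified to a string so the block compiles on its own, and the rationale-shape classifier is left out entirely. Only the 0.6 threshold comes from the text above.

```go
package main

import (
	"errors"
	"strings"
)

// Finding mirrors the abridged struct above. Severity is a plain string
// here only so that the sketch is self-contained.
type Finding struct {
	Pass       string
	Severity   string
	File       string
	Line       int
	Body       string
	Why        string
	Confidence float32
}

const (
	minConfidence = 0.6 // threshold stated in the post
	minWhyWords   = 15  // assumed length floor, not Revund's real setting
)

// Hypothetical sentinel errors, one per gate.
var (
	errEmptyWhy      = errors.New("finding has no Why")
	errLowConfidence = errors.New("confidence below threshold")
	errShortWhy      = errors.New("Why is under the length floor")
)

// Validate applies the checks in order: a non-empty Why, the confidence
// threshold, then the length floor. It returns the first failure.
func (f Finding) Validate() error {
	why := strings.TrimSpace(f.Why)
	if why == "" {
		return errEmptyWhy
	}
	if f.Confidence < minConfidence {
		return errLowConfidence
	}
	if len(strings.Fields(why)) < minWhyWords {
		return errShortWhy
	}
	return nil
}
```

A caller would run Validate before serializing or rendering and drop anything that returns an error, so a finding with an empty or token Why never reaches the reader.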

A required Why does three things to the underlying generation:

It filters the prompt space. Asking "what's wrong" is a different question from "what's wrong and why." The latter is harder to confabulate because it commits the model to a chain of facts: this code does X, X interacts with Y, Y matters because Z. Each link is checkable.
It changes the human review surface. When a finding arrives with a paragraph rationale, the reader's first reaction is "do I buy this", which is the right reaction. Without a rationale, the first reaction is "is this even worth reading", which is the wrong reaction. The two reactions branch the rest of the team's behaviour for weeks.
It makes dismissal a signal. When the human dismisses a finding with a rationale, the dismissal carries information: the team is telling the system which rationale shape failed. That signal is what the feedback loop trains on. Without rationales, dismissal is just noise: the system can't learn from "no."

The fingerprint we hash for each finding includes the rationale text. Dismissals are stored per (repo_id, fingerprint), which means a dismissed rationale never returns. This compounds in the team's favour over weeks: false-positive rationales get burned permanently; true rationales survive and compound.
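A sketch of how a rationale-inclusive fingerprint and a per-repo dismissal store could fit together. Everything here is illustrative: the choice of SHA-256, the fields that go into the hash, the whitespace and case normalization, and the in-memory map are assumptions. The text above only establishes that the rationale text is part of the fingerprint and that dismissals are keyed by (repo_id, fingerprint).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// fingerprint hashes the parts of a finding that identify it, including the
// rationale text. Normalizing whitespace and case first (an assumption) keeps
// trivial rewordings of the same rationale from producing a new fingerprint.
func fingerprint(pass, file, why string) string {
	normalized := strings.ToLower(strings.Join(strings.Fields(why), " "))
	sum := sha256.Sum256([]byte(pass + "\x00" + file + "\x00" + normalized))
	return hex.EncodeToString(sum[:])
}

// dismissalKey is the (repo_id, fingerprint) pair dismissals are stored under.
type dismissalKey struct {
	RepoID      string
	Fingerprint string
}

// DismissalStore is an in-memory stand-in for whatever persistence is used.
type DismissalStore struct {
	dismissed map[dismissalKey]struct{}
}

// NewDismissalStore returns an empty store.
func NewDismissalStore() *DismissalStore {
	return &DismissalStore{dismissed: make(map[dismissalKey]struct{})}
}

// Dismiss records that this fingerprint was rejected in this repo.
func (s *DismissalStore) Dismiss(repoID, fp string) {
	s.dismissed[dismissalKey{repoID, fp}] = struct{}{}
}

// IsDismissed reports whether a finding should be suppressed for this repo.
func (s *DismissalStore) IsDismissed(repoID, fp string) bool {
	_, ok := s.dismissed[dismissalKey{repoID, fp}]
	return ok
}
```

Because the rationale is inside the hash, the same file with a different rationale yields a different fingerprint and can still surface, while the dismissed rationale stays suppressed for that repo only.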

What this looks like in practice

A worked example from the Revund security pass:

A finding without a Why:

BLOCKER, src/auth/token.ts:47, JWT secret uses fallback string.

A finding with a Why:

BLOCKER, src/auth/token.ts:47, JWT secret falls back to the literal "dev-secret" when process.env.JWT_SECRET is undefined.

Why: in production, an unset env var produces a forgeable secret; any attacker who guesses or leaks the fallback can mint valid tokens. The fallback was almost certainly added to make local dev convenient; confirm by checking the git blame on this line. Recommendation: throw on missing env var at boot rather than fall back, so the misconfiguration fails loud in CI instead of silent in prod.

The second one is 5× the length. It's also closer to 50× the trust. The author can read it, agree or disagree on the specifics, and decide in under fifteen seconds. They cannot do that with the first one without becoming the reviewer.

This is the bar. Every Finding ships at this bar or it doesn't ship.

What's hard about this

Three things, in order of difficulty.

1. Latency. A rationale-bearing finding takes 3–5× the tokens of a bare finding. For a 1,000-line PR with four specialist passes, that's the difference between a 12-second review and a 60-second one. We've absorbed this by running passes in parallel and caching the ContextBundle so each pass doesn't repay the input cost. Worth it, but it's the dominant cost line in the engine.
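The parallel-pass arrangement can be sketched in a few lines. This is an assumption-laden illustration, not the engine's code: the fields of ContextBundle, the Pass signature, and the runPasses helper are all hypothetical. The point it shows is that the bundle is built once and shared, and that wall-clock time becomes the slowest pass rather than the sum of all passes.

```go
package main

import "sync"

// ContextBundle stands in for the shared, pre-built review input. Its field
// is hypothetical; the post only names the type.
type ContextBundle struct {
	Diff string
}

// Pass is one specialist reviewer. The signature is an assumption. Passes
// must treat the bundle as read-only, since they all share it.
type Pass func(bundle *ContextBundle) []string

// runPasses hands the same already-built bundle to every pass, so the input
// cost is paid once. Each pass runs in its own goroutine and writes only to
// its own slot in results, so no extra locking is needed.
func runPasses(bundle *ContextBundle, passes []Pass) [][]string {
	results := make([][]string, len(passes))
	var wg sync.WaitGroup
	for i, p := range passes {
		wg.Add(1)
		go func(i int, p Pass) {
			defer wg.Done()
			results[i] = p(bundle)
		}(i, p)
	}
	wg.Wait()
	return results
}
```

Results come back in the same order as the passes slice, which keeps downstream merging deterministic even though the passes finish in arbitrary order.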

2. Rationale inflation. When the prompt demands "explain why", the model will produce something regardless of whether the underlying finding is real. The fix is a confidence threshold (we drop findings below 0.6) plus a self-critique pass: each finding is re-prompted with "is this rationale falsifiable, or is it restating the finding?" If the model can't defend its own rationale under that question, the finding is dropped before it ever reaches the bundle. About 18% of first-pass findings die at this stage. Most of them deserve to.
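The two-stage filter described above can be sketched as follows. The names (Candidate, Critic, filterFindings) are illustrative, and running the cheap confidence check before the critique is an assumption about ordering. In a real system the critic would be a second model call; here it is any function that answers the falsifiability question.

```go
package main

// Candidate is a first-pass finding before filtering. Field names are
// illustrative.
type Candidate struct {
	Body       string
	Why        string
	Confidence float32
}

// Critic answers the self-critique question for one candidate: is the
// rationale falsifiable, or does it only restate the finding?
type Critic func(c Candidate) bool

const confidenceFloor = 0.6 // threshold stated in the post

// filterFindings drops low-confidence candidates first, then drops any
// survivor whose rationale fails the critique. It returns the survivors and
// the number dropped at each stage, so the drop rate can be tracked.
func filterFindings(in []Candidate, critique Critic) (kept []Candidate, lowConf, failedCritique int) {
	for _, c := range in {
		if c.Confidence < confidenceFloor {
			lowConf++
			continue
		}
		if !critique(c) {
			failedCritique++
			continue
		}
		kept = append(kept, c)
	}
	return kept, lowConf, failedCritique
}
```

Returning the per-stage counts is a design choice for observability: it is what would let a team see a figure like the 18% drop rate rather than take it on faith.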

3. Rationale calibration. Different finding types need different rationale shapes. A security finding needs a threat model. A perf finding needs a measurable claim ("this allocates O(n²) where n is users.length"). An architecture finding needs a counterfactual ("if this lands, the next change to module X requires touching modules Y and Z"). One generic prompt won't produce all three with the same fidelity. The four specialist passes each carry their own rationale schema, and the security schema is the strictest: it demands a named threat actor and a named asset before the finding can surface.
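One way to picture per-pass rationale schemas is a structured Why with a different set of required fields per finding type. The sketch below is hypothetical throughout: the Rationale struct, the field names, and the pass names are assumptions. Only the requirements themselves (a named threat actor and asset for security, a measurable claim for perf, a counterfactual for architecture) come from the text above; the fourth pass is not described there, so it is not modelled.

```go
package main

import "strings"

// Rationale is a structured Why. Which fields exist is an assumption.
type Rationale struct {
	Claim          string // the falsifiable statement itself
	ThreatActor    string // security: who exploits this
	Asset          string // security: what they gain or damage
	Measure        string // perf: the measurable claim, e.g. a complexity bound
	Counterfactual string // architecture: what the next change will cost
}

// missingFields returns the names of the required fields that are blank for
// the given pass. An empty result means the rationale meets that schema.
func missingFields(pass string, r Rationale) []string {
	var missing []string
	need := func(name, value string) {
		if strings.TrimSpace(value) == "" {
			missing = append(missing, name)
		}
	}
	need("claim", r.Claim) // every schema needs the claim itself
	switch pass {
	case "security":
		need("threat_actor", r.ThreatActor)
		need("asset", r.Asset)
	case "perf":
		need("measure", r.Measure)
	case "architecture":
		need("counterfactual", r.Counterfactual)
	}
	return missing
}
```

A finding whose rationale returns any missing field would be held back, which is how the security schema ends up the strictest: it has the most required fields.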

What you can do today (regardless of tooling)

The Bacchelli & Bird gap is not a Revund problem. It is a code-review problem. Three things you can change in your team's review process this week:

1. PR review templates with a rationale slot

Add a one-line section to your review-comment template: "Why this matters: ____". Make it social: comments that omit it look incomplete. Within a month the comments-without-rationale rate drops materially. In our pilot, teams that adopted a rationale-prompt template moved their action rate from 38% to 62% in six weeks, without any change to who was reviewing or what they were reviewing.

2. Drop "LGTM" as a review state

LGTM without a rationale is the worst signal in code review: it carries authority without information. If you don't have time for a rationale, you don't have time for the review; approve in silence or hand it to someone who will. The status itself trains your team to mistake speed for thoroughness.

3. Score reviewers by action rate, not review count

Most review tools surface "reviews completed per week." Surface instead the action rate of each reviewer's comments: how often the author actually changed the code in response. Within a team, reviewers who consistently land high-action comments become the team's effective seniors; reviewers who don't become noise. The metric also rewards the right behaviour at the individual level: write fewer, more-specific comments.
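Computing the metric is simple once each comment is tagged with whether it led to a change. A minimal sketch, assuming that tagging already exists; the Comment type and actionRates function are illustrative names, and how "acted on" is detected is deliberately left out:

```go
package main

// Comment is one review comment and whether the author changed code in
// response. Field names are illustrative.
type Comment struct {
	Reviewer string
	ActedOn  bool
}

// actionRates returns, per reviewer, the fraction of their comments that
// led to a code change. Reviewers with no comments do not appear.
func actionRates(comments []Comment) map[string]float64 {
	total := make(map[string]int)
	acted := make(map[string]int)
	for _, c := range comments {
		total[c.Reviewer]++
		if c.ActedOn {
			acted[c.Reviewer]++
		}
	}
	rates := make(map[string]float64, len(total))
	for reviewer, n := range total {
		rates[reviewer] = float64(acted[reviewer]) / float64(n)
	}
	return rates
}
```

A rate computed from a handful of comments is noisy, so in practice it would make sense to show the comment count next to it.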

Why this is the foundation of Revund's trust model

Everything above is the general case. The reason we built Revund the way we did (required Why field, confidence threshold at 0.6, fingerprint-based dismissal, four specialist passes each with their own rationale schema) comes directly from this research:

The 14% gap is the reason Why is required. Most human review comments aren't actually defects. An AI tool that produces unexplained findings just adds to that gap. We force each finding to commit to a falsifiable rationale before it surfaces: the same bar a senior engineer would meet.
The trust-decay curve is the reason confidence is filtered at 0.6. Below 0.6, the rationale is too often vague enough to read as confabulation. Surfacing it costs more credibility than the finding can repay. We'd rather miss a real defect than burn a team's tolerance for the tool.
The dismissal patterns are the reason the fingerprint includes the rationale text. A dismissed-finding signal isn't "never show this code line again"; it's "never show this code line with that rationale again." The feedback loop is on the reason, not the symptom.

We didn't pick these design choices because they sounded clever. We picked them because the research said the unstructured alternative trades short-term coverage for long-term irrelevance. The architecture is shaped around the gap.

Methodology and references

For the figures in this post:

Figure 1 combines two datasets. The secondary bars (category mix) are read directly from Table 3 of Bacchelli, A. & Bird, C. (2013). "Expectations, Outcomes, and Challenges of Modern Code Review", ICSE 2013. The primary bars (action rate per category) are from the Revund pilot dataset, 14 teams, 6 weeks, ~2,200 findings, using "the author committed code in direct response within 48 hours" as the action signal. The dataset and the per-team breakdown will be published with the next post in this series.
Figure 2 is an illustrative reconstruction. The shape is calibrated against the adoption-over-time curves in Sadowski, C. et al. (2018). "Modern Code Review: A Case Study at Google", ICSE 2018, Section 5, and against the pilot survey we ran four weeks into the rollout ("would you accept this finding without re-verifying?", 5-point Likert collapsed to a percentage). The curve published here is the worst case, findings without rationales. The rationale-bearing version is materially shallower and will be published once the pilot ends.
Table 1 consolidates dismissal patterns from the pilot dataset by rationale-shape category. The classifier that assigns each rationale to a row was hand-labelled (3 raters, Cohen's κ = 0.71) and is still being calibrated; the directional finding (falsifiable-claim rationales dominate every other shape on action rate) replicates across all three pilot cohorts.

The pilot data has not been published; methodology and full breakdown will follow in a dedicated post. We're being explicit about this because the field has a problem with tools citing internal numbers they never expose. Ours will be exposed when it's ready to be defended.

If you've seen a code-review tool produce a thousand findings and watched your team dismiss the lot in three days, the cause is in this post. The fix is in the architecture, and the architecture is in the Why. Email hello@revund.dev. We'd like to see the data you have.