A senior engineer used to be the person who could write the code others couldn't. That definition held for thirty years, and it has been breaking down for about eighteen months.
What replaces it is not "the person who can prompt the agent" or "the person who can clean up after the agent." Both of those framings are comfort food. The replacement is the person who can defend any line of code on demand: what it does, why it's there, what changes if it leaves, and what the world would have to look like for it to be wrong. The engineers who learn this skill stop competing with AI authoring entirely. The ones who don't get caught in a race they can't win.
This post is the fourth in a series on what the research is actually saying about code review in the AI era. The first three were about why review fails past 200 lines, why every finding needs a rationale, and why one model can't review another model's code. This one is about the shift those three posts add up to: the work is moving from production to judgment, and the engineering culture that doesn't see it coming will pay for it.
The shift, in one number
The clearest single data point on the perception-vs-reality gap is METR's 2025 randomized controlled trial. Sixteen experienced open-source developers, 246 tasks in mature codebases, randomly assigned to "AI allowed" or "no AI." Before the study, the developers predicted AI would make them 24% faster. After the tasks were measured, they reported they had been 20% faster. The actual measurement showed they were 19% slower.
The headline finding is the 39-point gap, but the deeper one is what it tells you about defensibility. The developers genuinely did not know whether the work in front of them was their own, the model's, or some mix of both. They could not, at the end of a session, defend what had been produced because they had not built the working memory required to defend it. The AI did not slow them down by writing bad code. It slowed them down by introducing a class of work — review and integration of unfamiliar output — that they could not tell apart from authoring.
The skill that would have closed the gap is the skill that closed it for the few developers in the study who actually came out faster: the discipline to defend every line before merging it. That discipline used to be implicit because you wrote the line. It isn't anymore.
What defensibility actually is
A claim is defensible when three things are simultaneously true: it can be refuted, it is anchored to something outside the speaker's opinion, and it is scoped to a real consequence. Less than all three and the claim is a request for the listener to do the work. This applies to code review comments, PR descriptions, design docs, and the line of code you wrote thirty seconds ago.
| Finding text | Falsifiable? | Anchored to a reference? | Scoped to a real consequence? | Reader's experience |
|---|---|---|---|---|
| "This looks wrong" | no | no | no | asks the author to be the reviewer |
| "DRY violation here" | yes | no | no | principle stated, no anchor or scope |
| "This duplicates formatPrice() in utils/format.ts:14" | yes | yes | no | anchored, but no claim about what breaks |
| "This duplicates formatPrice() in utils/format.ts:14 — if either copy changes without the other, cart and receipt prices diverge" | yes | yes | yes | defensible: refutable, anchored, scoped |
The test is mechanical. Can the receiver refute it? If the answer requires re-running the whole analysis, the claim has failed the falsifiability test. Is it anchored to something other than the speaker's opinion? A reference to a file, a line, a test name, a config rule, a type system error — any of these qualify. A reference to "experience" or "common practice" does not. Is the consequence specific enough to act on? "This might break" is not a consequence. "If either copy of this function changes without the other, prices on the cart and the receipt diverge" is.
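To make the mechanics concrete, here is a minimal sketch of the test applied to the four rows of the table. The three labels still have to come from a human or a classifier; the code only combines them. `LabelledFinding`, `missingProperties`, and `isDefensible` are names invented for this illustration.

```typescript
// Illustrative only: the labels are assumed to be supplied by a reviewer
// or a classifier. Nothing here judges the finding text itself.
interface LabelledFinding {
  text: string;
  falsifiable: boolean;
  anchored: boolean;
  scoped: boolean;
}

type Property = "falsifiable" | "anchored" | "scoped";

// Lists which of the three properties a finding is missing.
function missingProperties(f: LabelledFinding): Property[] {
  const all: Property[] = ["falsifiable", "anchored", "scoped"];
  return all.filter((p) => !f[p]);
}

// A finding is defensible only when all three properties hold at once.
function isDefensible(f: LabelledFinding): boolean {
  return missingProperties(f).length === 0;
}

// The four rows of the table above, with the same labels.
const tableRows: LabelledFinding[] = [
  { text: "This looks wrong", falsifiable: false, anchored: false, scoped: false },
  { text: "DRY violation here", falsifiable: true, anchored: false, scoped: false },
  {
    text: "This duplicates formatPrice() in utils/format.ts:14",
    falsifiable: true,
    anchored: true,
    scoped: false,
  },
  {
    text:
      "This duplicates formatPrice() in utils/format.ts:14; if either copy " +
      "changes without the other, cart and receipt prices diverge",
    falsifiable: true,
    anchored: true,
    scoped: true,
  },
];

const defensibleRows = tableRows.filter(isDefensible);
```

Only the last row survives the filter. The useful part in practice is `missingProperties`: it tells the reviewer what to add before posting, instead of just rejecting the comment.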
The reason this matters more in 2026 than in 2016 is volume. When most code-review comments came from humans operating at a slow tempo, you could absorb the cost of indefensible ones — there were a few per PR, the senior reviewer was known to you, and the social contract did the work. AI reviewers operate at a tempo no social contract can keep up with. Every indefensible finding costs credibility, and credibility is a one-way ratchet.
What gets cut when work isn't defended
The clearest signal in the Revund pilot data is that the variable predicting whether a finding gets acted on is not its category — it's the shape of its rationale. Six categories collapse onto one dimension: how well did the speaker defend the claim.
The chart explains why the AI-reviewer market has the precision problem it has. A tool optimised for recall ships hundreds of findings per PR. Most of them sit in the top row — terse, unanchored, unscoped. They look like findings, but the action rate is in the 10–15% range. Teams who measure this number for the first time are usually stunned by it. Teams who don't measure it eventually mute the bot anyway, because the experience of triaging unscoped findings is exhausting and the trust runs out.
The fix isn't a better model. The fix is to refuse to ship a finding that doesn't pass the test. That refusal is uncomfortable because it cuts coverage. It is also the only way to keep the team reading the output past week three.
The reviewing-AI-code tax
The other side of the defensibility story is the cost of reviewing AI-authored work. Han et al. (arXiv 2603.15911) tracked 278,790 review conversations across 300 GitHub projects and found that human reviewers spend 11.8% more rounds reviewing AI-generated code than reviewing human-written code, even when the AI tool is supposedly saving the author time. The throughput case for AI authoring starts to leak the moment you measure the downstream cost on the reviewing side.
What the curve compresses into a single line: the cost of AI authoring is not paid by the author. It is paid by the reviewer, by the next reviewer who comes back to the same file, and by the on-call engineer who triages the production incident two weeks later. When the work isn't defended at the moment it's written, the defense gets distributed across the rest of the team in fragments that are individually small and collectively expensive.
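As back-of-envelope arithmetic, and assuming purely for illustration that the 11.8% figure applies uniformly to every AI-authored PR and that review rounds are a fair proxy for review cost, the relationship can be sketched like this. The function name and the uniformity assumption are ours, not Han et al.'s.

```typescript
// 11.8% more review rounds for AI-generated code, as cited above.
const AI_REVIEW_DELTA = 0.118;

// Review rounds per PR relative to an all-human baseline of 1.0.
// Assumption (illustrative): the delta applies equally to every AI-authored
// PR, so the cost grows linearly with the AI-authored share.
function relativeReviewRounds(aiShare: number, delta: number = AI_REVIEW_DELTA): number {
  if (aiShare < 0 || aiShare > 1) {
    throw new RangeError("aiShare must be between 0 and 1");
  }
  return 1 + delta * aiShare;
}

// With half of all PRs AI-authored, reviewers do roughly 5.9% more rounds overall.
const atHalfShare = relativeReviewRounds(0.5);
```

The number looks small per PR. The point of the paragraph above is that it is paid by people other than the author, on every PR, for as long as the code lives.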
The teams that came out ahead in the Greptile public-PR study were not the ones using the best AI tools. They were the ones whose review processes already required defensibility — a tight PR template, a "why" field on every comment, a culture of citing specific lines instead of waving at "the architecture." Those teams absorbed AI authoring without their review cost going up. The other teams paid for it.
Three failure shapes, in code
Defensibility isn't an abstract property. It's a thing you can see in a diff. Three failure modes show up repeatedly in pilot data, and each is a different way a contribution fails the test.
1. The undefended refactor
The author submits:
```typescript
// src/billing/invoice.ts
export function applyDiscount(amount: number, code: string): number {
  // refactored: use new pipeline
  return discountPipeline.apply(amount, code);
}
```

The diff is small. The PR description says "refactor to use the new pipeline." The reviewer reads the diff and approves. Six weeks later, two regression reports come in: the new pipeline rounds at a different precision and applies coupon codes case-sensitively where the old one didn't.
The author could not have defended this change. They didn't know what the old code did precisely enough to know what the new code had to preserve. The right finding here is not "refactor" — it is "before this change is safe, tell me three concrete behaviours of applyDiscount that must remain identical, and a test that proves it." That's a finding a defensibility-aware reviewer can write. It's invisible to a model that only reads the diff.
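One way to answer that finding is a characterisation test: pin down the behaviours the old code had, then run the old and new implementations side by side. The sketch below uses two hypothetical stand-ins, since the real `applyDiscount` and `discountPipeline` internals are not shown in this post; the `SAVE10` code and the 10% rate are invented for the example.

```typescript
type Discount = (amount: number, code: string) => number;

// Hypothetical stand-in for the old behaviour:
// codes match case-insensitively and the result is rounded to cents.
const legacyApplyDiscount: Discount = (amount, code) => {
  const rate = code.toUpperCase() === "SAVE10" ? 0.1 : 0;
  return Math.round(amount * (1 - rate) * 100) / 100;
};

// Hypothetical stand-in for the new pipeline:
// codes match case-sensitively and nothing is rounded.
const pipelineApplyDiscount: Discount = (amount, code) => {
  const rate = code === "SAVE10" ? 0.1 : 0;
  return amount * (1 - rate);
};

interface BehaviourCase {
  name: string;
  amount: number;
  code: string;
}

// The "three behaviours that must remain identical" from the finding.
const mustStayIdentical: BehaviourCase[] = [
  { name: "unknown code leaves the amount unchanged", amount: 100, code: "" },
  { name: "codes are matched case-insensitively", amount: 100, code: "save10" },
  { name: "result is rounded to cents", amount: 19.99, code: "SAVE10" },
];

// Returns the names of the behaviours on which the two implementations disagree.
function findDivergences(oldFn: Discount, newFn: Discount, cases: BehaviourCase[]): string[] {
  return cases
    .filter((c) => oldFn(c.amount, c.code) !== newFn(c.amount, c.code))
    .map((c) => c.name);
}

const divergences = findDivergences(legacyApplyDiscount, pipelineApplyDiscount, mustStayIdentical);
```

With these stand-ins, `divergences` names the case-sensitivity and rounding behaviours, which are the two regressions from the story. An author who had written this list before opening the PR would have found them in review instead of production.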
2. The plausible AI suggestion
The reviewer (an AI tool) leaves:
"Consider memoising this — it may have a performance impact on large inputs."
The comment looks reasonable. The author looks at the function, can't tell what "large inputs" means in this context, can't refute the claim without running benchmarks, and dismisses it because there's no time. Three months later the function is in a hot path and the bot's "may" turns out to have been a real performance problem. Trust in the bot doesn't recover because the comment was undefended at the moment it was written.
A defensible version of the same finding: "This function is called from app/feed/loader.ts:42 inside a for loop over items which has a median length of 184 (from prod logs). At that size, memoisation saves up to an estimated 22 ms per request (the function's current 0.12 ms cost × 184 calls). A test that fails without memo: tests/perf/feed.bench.ts."
Same insight. Two different fates. The first is noise. The second is a referenced, scoped, refutable claim the author can act on in five minutes.
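For readers who want the claim in runnable form, here is a minimal sketch of single-argument memoisation plus the arithmetic behind the estimate. `memoize` and `estimatedSavingMs` are illustrative helpers, not code from the project in the example. The 0.12 ms × 184 calls figure is the total cost of the loop, so it is an upper bound on the saving; the real saving depends on how many calls repeat an input.

```typescript
// Caches results by argument identity. Suitable for primitive arguments;
// object arguments would need a key function, which is omitted here.
function memoize<A, R>(fn: (arg: A) => R): (arg: A) => R {
  const cache = new Map<A, R>();
  return (arg: A): R => {
    if (cache.has(arg)) {
      return cache.get(arg) as R;
    }
    const value = fn(arg);
    cache.set(arg, value);
    return value;
  };
}

// Estimated time saved per request: only repeated inputs are cache hits,
// so the saving is the per-call cost times (calls - distinct inputs).
function estimatedSavingMs(perCallMs: number, calls: number, distinctInputs: number): number {
  if (distinctInputs < 0 || distinctInputs > calls) {
    throw new RangeError("distinctInputs must be between 0 and calls");
  }
  return perCallMs * (calls - distinctInputs);
}

// Best case for the example: all 184 calls share one input.
const bestCaseSavingMs = estimatedSavingMs(0.12, 184, 1);
```

If every one of the 184 items were distinct, the saving would be zero, which is exactly the kind of refutation a scoped claim makes possible.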
3. The unscoped "this might be wrong"
The most common failure mode in AI review is the half-defended finding: anchored, but not scoped to a real consequence. The bot says "this could be a race condition" and points at a specific line. The line involves setState and a useEffect. The claim is anchored. But there's no consequence attached: under what concurrent flow does this actually race? What does the user see? Without those, the author can't act on the finding because they can't model what they're protecting against.
A defensible version: "If the user clicks Submit twice within 200ms, both useEffect runs will see loading=false (line 23) and both will start fetches. The second response overwrites the first in setState (line 31). Reproducer: double-click Submit on the demo build." The author opens the demo build, double-clicks, sees the bug, fixes it in ten minutes. Same finding the bot was trying to surface. One version compounds, the other doesn't.
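The fix the author lands on usually amounts to an in-flight guard. The sketch below models it without React so it can run anywhere: `createSubmitter` and the callback-style `Fetcher` type are hypothetical, and the real component would hold the flag in a ref or in state.

```typescript
// A fetch-like function that reports its result through a callback.
// Callback style keeps the sketch synchronous and easy to step through.
type Fetcher = (onDone: (payload: string) => void) => void;

function createSubmitter(fetcher: Fetcher) {
  let inFlight = false;
  let started = 0;
  let result: string | null = null;

  return {
    // Returns true if a request was started, false if the click was ignored.
    submit(): boolean {
      if (inFlight) {
        return false; // a second click while a request is pending is dropped
      }
      inFlight = true;
      started += 1;
      fetcher((payload) => {
        result = payload;
        inFlight = false;
      });
      return true;
    },
    get started(): number {
      return started;
    },
    get result(): string | null {
      return result;
    },
  };
}
```

The guard is set synchronously inside `submit`, before any response can arrive, so a double-click cannot start two fetches and a second response cannot overwrite the first.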
Why this is the foundation of Revund
Every architectural choice in Revund's review engine traces back to defensibility. The required Why field is the schema-level enforcement: a finding without a non-empty Why cannot be serialised through the gRPC contract, and the prompt is structured so the model has to produce a defensible rationale or its output gets dropped at the validator. The 0.6 confidence threshold isn't measuring whether the finding is correct — it's measuring whether the rationale is defensible. A finding can be technically right and still drop below 0.6 if the model can't defend it.
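A rough sketch of that gate, for illustration only: the `RawFinding` shape and function names below are invented for this post and are not Revund's actual gRPC schema or validator. The two rules it encodes (a non-empty Why, and a confidence of at least 0.6) are the ones described above.

```typescript
// Hypothetical shape of a finding as it leaves the model.
interface RawFinding {
  message: string;
  why?: string;
  confidence: number; // rationale-defensibility score in [0, 1]
}

const CONFIDENCE_THRESHOLD = 0.6;

// A finding passes only with a non-empty Why and enough confidence.
function passesValidator(f: RawFinding): boolean {
  const hasWhy = typeof f.why === "string" && f.why.trim().length > 0;
  return hasWhy && f.confidence >= CONFIDENCE_THRESHOLD;
}

// Splits model output into findings that surface and findings that are dropped.
function validate(findings: RawFinding[]): { kept: RawFinding[]; dropped: RawFinding[] } {
  const kept: RawFinding[] = [];
  const dropped: RawFinding[] = [];
  for (const f of findings) {
    (passesValidator(f) ? kept : dropped).push(f);
  }
  return { kept, dropped };
}
```

Keeping the dropped list around matters: it is how you measure what the gate costs in coverage.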
The ContextBundle exists so the model has something external to anchor against. When the security pass writes "the JWT secret falls back to the literal 'dev-secret'", the literal it's quoting comes from the resolved file content the worker put in the bundle, not from the model's intuition. The bundle is deterministic, byte-for-byte identical across runs, so the anchor is real. Strip the bundle and the model is reasoning from prior alone — every finding becomes the half-defended kind.
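The anchoring check this enables can be sketched in a few lines. The bundle is modelled here as a plain map from file path to resolved content, which is an assumption for illustration; the real ContextBundle format is not described in this post, and the file path in the example is made up.

```typescript
// Assumed model: file path -> resolved file content.
type Bundle = ReadonlyMap<string, string>;

interface QuotedEvidence {
  path: string;
  quote: string;
}

// A quote counts as anchored only if it appears verbatim in the bundled file.
function isAnchoredInBundle(evidence: QuotedEvidence, bundle: Bundle): boolean {
  const content = bundle.get(evidence.path);
  return content !== undefined && evidence.quote.length > 0 && content.includes(evidence.quote);
}

// Hypothetical bundle containing the fallback secret from the example above.
const exampleBundle: Bundle = new Map([
  ["src/auth/config.ts", "const secret = process.env.JWT_SECRET ?? 'dev-secret';"],
]);

const jwtQuoteIsAnchored = isAnchoredInBundle(
  { path: "src/auth/config.ts", quote: "'dev-secret'" },
  exampleBundle,
);
```

A verbatim match is deliberately strict. A quote the model paraphrased or half-remembered fails the check, which is the behaviour you want from an anchor.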
The fingerprint-based dismissal system is the other side of the same coin: when the team dismisses a finding, the dismissal is keyed on the rationale text. A future finding with the same rationale on the same fingerprint never appears again. The team's "no" becomes a defensible signal that compounds over weeks. Without rationale-keyed dismissals, every dismissal is just noise — the system can't tell which kind of finding the team has decided is wrong.
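A minimal sketch of rationale-keyed dismissal, under stated assumptions: the store is an in-memory set, the fingerprint is an opaque string, and the whitespace and case normalisation is our own guess at what "same rationale" should mean. None of these names are Revund's real API.

```typescript
// Assumption: rationales that differ only in case or spacing are the same.
function normaliseRationale(rationale: string): string {
  return rationale.trim().toLowerCase().replace(/\s+/g, " ");
}

// A dismissal is keyed on both the code fingerprint and the rationale.
function dismissalKey(fingerprint: string, rationale: string): string {
  return fingerprint + "::" + normaliseRationale(rationale);
}

function createDismissalStore() {
  const dismissed = new Set<string>();
  return {
    dismiss(fingerprint: string, rationale: string): void {
      dismissed.add(dismissalKey(fingerprint, rationale));
    },
    // True when the team has already said "no" to this rationale on this code.
    isDismissed(fingerprint: string, rationale: string): boolean {
      return dismissed.has(dismissalKey(fingerprint, rationale));
    },
  };
}
```

Because the key includes the rationale, dismissing one argument about a piece of code does not silence a different argument about the same code.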
Each of the four specialist passes carries its own defensibility shape. Three examples: the security pass demands a named threat actor and a named asset before the finding can surface; the performance pass demands a measurable claim like "this allocates O(n²) where n is users.length"; the architecture pass demands a counterfactual. One generic prompt couldn't enforce all of these; the engine is shaped around the fact that defensibility looks different in each domain.
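In type terms, "a different shape per domain" looks like a discriminated union. The sketch below covers only the three passes whose requirements are spelled out above; the field names are invented for illustration and are not the engine's real types.

```typescript
// One variant per pass, each with its own required evidence.
type PassFinding =
  | { pass: "security"; message: string; threatActor: string; asset: string }
  | { pass: "performance"; message: string; measurableClaim: string }
  | { pass: "architecture"; message: string; counterfactual: string };

const filled = (s: string): boolean => s.trim().length > 0;

// A finding may surface only when its pass-specific evidence is present.
function canSurface(f: PassFinding): boolean {
  switch (f.pass) {
    case "security":
      return filled(f.threatActor) && filled(f.asset);
    case "performance":
      return filled(f.measurableClaim);
    case "architecture":
      return filled(f.counterfactual);
  }
}
```

The compiler does half the enforcement here: a security finding without a `threatActor` field does not type-check, and `canSurface` catches the ones that are present but blank.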
None of these are features in the marketing sense. They're the load-bearing structure that makes the rest of the engine possible. A version of Revund without the Why enforcement would surface findings that look like the bot comments in the second and third failure shapes above: plausible, at best anchored, but unscoped. We've seen what happens to teams using tools shaped like that, and we built the engine to refuse to be one.
What you can do today (regardless of tooling)
Three things you can change in your team's workflow this week to push toward defensibility without buying anything:
1. Add the test to every review comment
Make the rule explicit: before posting a review comment, the reviewer asks themselves "can the author refute this without re-doing my analysis?" If the answer is no, the comment isn't ready to post. This is the cheapest change with the biggest impact. Within a sprint the comments-without-rationale rate drops by half and action rate jumps with it.
2. Make defensibility a PR-description requirement
Add one section to your PR template: "What three behaviours of the touched code must remain identical, and what test proves it?" This is the author-side mirror of the reviewer-side test. The change you can't answer this question for is the change you can't defend, which is the change you shouldn't be merging. Teams that adopt this stop shipping the kind of refactor in failure shape #1 above almost immediately.
3. Score reviewers by action rate, not comment count
Most teams measure how many comments their reviewers leave. The right metric is what fraction of those comments resulted in a code change. A reviewer at 60% action rate is more valuable than a reviewer at 20% action rate, regardless of how many comments either of them left. The metric also pulls reviewers toward defensibility, because the only way to land in the high-action band is to write defensible comments.
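The metric is simple enough to compute from an export of review comments. A minimal sketch, assuming each comment has already been tagged with whether it led to a code change; the `ReviewComment` shape is hypothetical and the tagging itself is the hard part.

```typescript
interface ReviewComment {
  reviewer: string;
  ledToCodeChange: boolean;
}

// Fraction of each reviewer's comments that resulted in a code change.
function actionRateByReviewer(comments: ReviewComment[]): Map<string, number> {
  const totals = new Map<string, { acted: number; total: number }>();
  for (const c of comments) {
    const t = totals.get(c.reviewer) ?? { acted: 0, total: 0 };
    t.total += 1;
    if (c.ledToCodeChange) {
      t.acted += 1;
    }
    totals.set(c.reviewer, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, reviewer) => {
    rates.set(reviewer, t.acted / t.total);
  });
  return rates;
}
```

Comment count never appears in the output, which is the point: a reviewer with five comments and a reviewer with fifty are compared on the same scale.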
These three changes are free. The teams in our pilot that adopted all three saw their review action rate move from 38% to 62% in six weeks, without changing who was reviewing or what was being reviewed.
Methodology and references
For the figures in this post:
- Figure 1 plots the perception-vs-reality finding from METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," arXiv 2507.09089 (July 2025). Sample: 16 experienced open-source developers, 246 tasks across mature projects they maintain, randomised assignment to "AI allowed" / "no AI." The 39-point gap is the difference between the developers' self-reported speed change (+20%) and the measured speed change (−19%).
- Figure 2 is original to this post. Examples are real-finding shapes from the Revund pilot dataset; the test framework (falsifiable / anchored / scoped) is the explicit version of the implicit rubric used by the pilot's rationale-quality classifier. Inter-rater agreement on hand-labelling the three properties: Cohen's κ = 0.71 across 3 raters on a 200-finding sample.
- Figure 3 re-cuts the post-#2 pilot dataset under the defensibility frame. Same n = 14 teams, 6 weeks, ~2,200 findings. The rationale-shape buckets (no rationale / stated principle / principle + scope / falsifiable claim) are the classifier's output; "acted on" = author committed code in direct response within 48 hours.
- Figure 4 combines two data sources. The +11.8% per-AI-PR review delta is from Han et al., "Human-AI Synergy in Agentic Code Review," arXiv 2603.15911 (2026) (278,790 review conversations across 300 GitHub projects). The AI-authorship mix at 50% in April 2026 is from Greptile's public PR dataset, summarised in their May 2026 post. The flatter curve under defensibility scaffolding is from the Revund pilot. The curves are illustrative; the directional shape (review time grows linearly with AI-authored share, scaffolding flattens it) replicates across both datasets we have.
We are explicit about which numbers are from public research and which are from our internal pilot because the field has a problem with tools quoting internal numbers they never expose. Ours will be exposed when they are ready to be defended. The dedicated methodology post is in the queue.
If the work in front of you isn't defensible at the moment it's written, the defense gets paid for downstream by people who didn't write it. The discipline scales, the cost doesn't. Email hello@revund.dev if you want to see what a defensibility-shaped review engine does to the cost.