
The two halves of code review

There's a debate running in every engineering team in 2026 about which kind of code review is the future: the linters and static analysers that have been around for thirty years, or the AI reviewers that arrived in the last eighteen months. The framing is wrong. Pick either side and you ship a review process that fails differently, but always fails. The teams whose review actually works have stopped debating and started layering.

This post is the fifth in a series on what the research is saying about code review in the AI era, after the 200-line cliff, the 14% gap, the mirror problem, and the defensibility shift. The first four established what's failing. This one is about the practical architecture that fixes it: two halves, no overlap, each one doing what the other can't.

The two halves, defined

The mechanical half of code review is everything a machine can verify against a deterministic process. Type checks, lint rules, test pass/fail, AST shape comparison, framework-convention checks, dead-code detection, dependency cycle detection, complexity counts. These have one thing in common: a correct answer exists, and a program can compute it. Output is precise to the point of being boring. Precision is at or near 100%; the only question is whether what they catch is what matters.

The judgment half is everything that doesn't have a correct answer in a deterministic sense. Whether a name describes what the function returns or what it accepts. Whether an abstraction is premature. Whether the comment three lines up still describes what the code does. Whether the chosen pattern matches the team's mental model of the codebase. These require a reader who has internalised the codebase's conventions, its history, and the team's standards. They cannot be computed; they can only be assessed.

Most code-review tools pick one half. Linters and static analysers do the mechanical half perfectly and the judgment half not at all. AI reviewers do the judgment half decently and the mechanical half badly, because they re-derive what a linter already knows and waste tokens on what's already proven.

The architecture that works doesn't pick a winner. It runs both halves, in sequence, with a strict no-overlap rule.

Which half catches what

Most arguments about which tool is better are arguments about specific findings. Linter people point to a clean codebase and say "see, no AI needed." AI people point to a real refactor that needed taste and say "see, no linter would have caught that." Both are right about their example and wrong about the conclusion. The two halves catch different things, and a competent review process needs both.

| Finding type | Best caught by | Why the other layer can't do it |
| --- | --- | --- |
| Unused variable | Linter | Pure syntactic; deterministic |
| Type mismatch / nullable misuse | Type system (tsc) | The compiler is the ground truth |
| Duplicated function across files | AST hash | Mechanical comparison, scales |
| Component in wrong folder | Repo rules + framework | Convention is configurable, checkable |
| File mixing 5+ concerns | AST concern counter | Countable property of the whole file |
| Off-by-one in date math | Test (executable spec) | Only an executable spec catches this |
| Misleading variable name | LLM | Pure taste, no anchor exists |
| Wrong abstraction | LLM | Judgment call against codebase shape |
| Premature interface | LLM + context | Requires reasoning about future change |
| N+1 query in a loop | LLM + data context | Needs the query graph the diff doesn't show |
A taxonomy of common code-review findings, classified by which layer should catch them. The argument is in the right-hand column: most findings have a single best home, and the home is determined by whether a deterministic process can verify the claim. The first six rows are the mechanical half: precise, measurable, automatable. The bottom four are the judgment half: anchored to taste, context, or future change, and not computable from the diff alone.

Taxonomy is original to this post. Categories are drawn from a sample of ~500 real findings across the Revund pilot dataset (n = 14 teams, 6 weeks), classified manually by three raters with Cohen's κ = 0.78.

The first six rows are not interesting. Linters have been catching unused variables since 1990. The TypeScript compiler has been catching null misuses for years. AST hashing for duplicate detection is decades old. None of this is novel; it's just unevenly adopted. Any team using a competent set of static tools already has the first six rows covered.

The bottom four rows are where the conversation has moved. Misleading names, wrong abstractions, premature interfaces, N+1 queries that require understanding the call graph — these have always been the senior reviewer's job because no deterministic process could touch them. AI reviewers are the first tool that can attack this territory at all. They do it imperfectly, but the alternative until eighteen months ago was "hope a senior catches it."

The trap is to confuse the two halves. When an AI tool reports an unused variable as an architecture finding, it has encroached on the linter's territory and added zero value. When a deterministic detector reports "wrong abstraction" without any judgment behind the call, it has encroached on the AI's territory and produced a false positive. Both encroachments cost the team trust.

Why overlap is the trust killer

The most common failure mode in real-world review setups isn't that one half is missing — it's that both halves are present, and they overlap. A linter flags an unused import. The AI reviewer, reading the same diff, also flags the unused import in slightly different words. The author now has two findings to dismiss for one underlying issue. Multiply by a hundred findings per PR and the team mutes both tools within a sprint.

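| Overlap between the two layers | Per-finding triage cost (0% overlap = 100%) |
| --- | --- |
| 0% (disjoint) | 100% |
| 15% | 124% |
| 35% | 162% |
| 60% | 218% |
| 85% (both flag the same issue) | 285% |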
The overlap penalty. As the two layers' findings start to overlap (same issue reported twice), the team's per-finding triage cost grows faster than the finding count. At zero overlap the team triages each finding once. At 85% overlap they're triaging the same issue from two angles, dismissing one, second-guessing the other, and asking each other 'wait, did the bot already cover this.' The trust decay is multiplicative because every duplicate finding burns the credibility of both layers, not just one.

Curve fitted to behavioural data from the Revund pilot (n = 14 teams). 'Per-finding triage cost' = mean seconds the author spent on each finding before either acting on it or dismissing it. Baseline (0% overlap) was measured during a configuration period where the LLM passes received the deterministic detector output as part of the prompt's 'do not flag' list.

The math is uncomfortable for tool vendors. The thing that breaks a review setup isn't bad findings — it's redundant findings. A team can absorb a 20% false-positive rate from a single tool because the dismissal pattern is consistent: same kind of false positive, same way to filter it. The same 20% from two tools, on overlapping territory, produces unpredictable noise — sometimes both tools flag, sometimes one does, sometimes neither — and unpredictable noise is what trust dies of.

The discipline this implies is uncomfortable too. If you're running both a linter and an AI reviewer, the AI reviewer must explicitly know what the linter covers, and refuse to publish findings in that territory. The same applies in reverse: if you have an AST-based detector for duplicate functions, the LLM prompt has to be told "do not flag duplicates; that's handled." Without these explicit no-flag boundaries, the layers cannibalise each other's trust.

What it looks like when one layer does the other's job

The market has hundreds of products on each side of this line. They fail in stereotyped ways when they don't respect the boundary. Five real failure modes, with the rule that prevents each:

| When a layer encroaches | What the team experiences | The rule that prevents it |
| --- | --- | --- |
| LLM flags unused variable | Style nit reported as 'architecture'; reader stops trusting the category | Lint rule covers it; LLM prompt explicitly excludes |
| LLM flags type error | Bot says 'this might be null' on a line tsc already proved is null-safe | Pre-run tsc; LLM receives diagnostics, never re-derives them |
| Detector flags 'wrong abstraction' | False positive every time; deterministic rule with no judgment behind it | Detectors stay in measurable territory; never publish taste calls |
| Detector flags 'misleading name' | Same; pattern-match without context produces noise | Names go to the LLM, which can read intent from surrounding code |
| Both layers flag the same duplicate | Two findings to dismiss for one issue; trust burns 2x faster | LLM prompt receives the detector's findings as a 'do not flag' list |
Common encroachment failures and the rules that prevent them. The middle column is what the team experiences, not what the tool intends. The right column is the explicit configuration that keeps each layer in its territory. Every team running both halves needs all five rules; the absence of any one of them is felt within a week.

Patterns drawn from the Revund pilot configuration period and from public bug-tracker analysis of the most-installed static-analysis and AI-review tools (2025–2026 sample).

The table compresses a lot of practical experience. Each failure mode looks small in isolation and produces a recognisable team behaviour: "the bot is noisy," "the linter is too strict," "we turned off the architecture rules," "we ignore that whole category now." These aren't tool problems. They're layering problems. The same tools work fine when the boundaries are explicit and break when they're implicit.

This is also why the LLM prompt in a serious review engine is not a vibe. It's a contract. The prompt says, in writing, "these are the kinds of findings you produce. These are the kinds of findings you do not produce because another layer already covers them. These are the rationale schemas you must satisfy before any finding can be surfaced." A reviewer that doesn't know its territory ends up everywhere, which is the same as nowhere.

What the layered approach actually buys you

The reason to do the work of separating the halves is that the result is better on both axes that matter: precision (the share of findings the author acts on) and coverage (the share of real issues the system surfaces).

| Review setup | Precision (% of findings the author acted on) | Coverage (% of real issues surfaced) |
| --- | --- | --- |
| Linter-only | 92% | 28% |
| LLM-only | 34% | 71% |
| Layered (no overlap) | 88% | 84% |
Signal density by review setup. Precision is the share of surfaced findings that produce a code change from the author. Coverage is the share of the real issues a senior reviewer would have flagged on the same PR that the system also caught. Linter-only is precise (92%) but covers only the mechanical half (28% of real issues). LLM-only catches the judgment half (71% coverage) but burns precision on overlap and confabulation (34%). The layered setup, with explicit no-overlap rules, is strong on both axes: 88% precision and 84% coverage.

Precision data: Revund pilot dataset across the three configuration regimes (linter-only, LLM-only, layered). Coverage benchmarked against a hand-graded set of 200 PRs reviewed by senior engineers in parallel with each system; coverage = (system findings that overlap senior findings) / (senior findings).

The two-axis chart is the clearest argument I can make for why this is worth doing. A team that ships only linters covers a quarter of the surface area of real issues; nobody who's worked through a serious code review thinks lint output is enough. A team that ships only an AI reviewer gets more coverage but burns the credibility needed to keep the team reading the output past month two. A team that ships both, with the boundary explicit, gets the coverage of the AI reviewer without the precision cost.
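The two metrics are simple enough to compute from a tagged list of findings. The sketch below is one way to do it, under assumptions: the `SystemFinding` shape and its field names are hypothetical, and coverage counts each senior finding at most once so the ratio cannot exceed 1, which is an interpretation of the formula above and not necessarily how the pilot computed it.

```typescript
// Hypothetical record for one finding the review system surfaced.
interface SystemFinding {
  id: string;
  actedOn: boolean; // the author changed code in response
  seniorFindingId?: string; // set when it overlaps a senior reviewer's finding
}

// Precision: share of surfaced findings the author acted on.
function precision(findings: SystemFinding[]): number {
  if (findings.length === 0) return 0;
  const acted = findings.filter((f) => f.actedOn).length;
  return acted / findings.length;
}

// Coverage: share of senior findings matched by at least one system finding.
function coverage(findings: SystemFinding[], seniorFindingIds: string[]): number {
  const senior = new Set(seniorFindingIds);
  if (senior.size === 0) return 0;
  const matched = new Set<string>();
  for (const f of findings) {
    if (f.seniorFindingId !== undefined && senior.has(f.seniorFindingId)) {
      matched.add(f.seniorFindingId);
    }
  }
  return matched.size / senior.size;
}

// Made-up sample data, for illustration only.
const sampleFindings: SystemFinding[] = [
  { id: "f1", actedOn: true, seniorFindingId: "s1" },
  { id: "f2", actedOn: true, seniorFindingId: "s1" },
  { id: "f3", actedOn: false },
  { id: "f4", actedOn: true, seniorFindingId: "s2" },
];
const seniorIds = ["s1", "s2", "s3", "s4", "s5"];

const samplePrecision = precision(sampleFindings); // 3 of 4 findings acted on
const sampleCoverage = coverage(sampleFindings, seniorIds); // 2 of 5 senior findings matched
```

Note that `f1` and `f2` both match `s1`: a duplicate finding raises the finding count without raising coverage, which is the overlap problem in miniature.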

This is not a Revund-specific finding. Several teams in the pilot already ran both ESLint with custom rules and a separate AI reviewer before they tried Revund. Their pain point was always the overlap: the AI reviewer kept restating things ESLint had already said. The fix was the same fix regardless of which AI tool they were using — make the boundary explicit, surface a single Architecture section, and never let a finding appear twice.

Three failure shapes, in code

Concrete examples make the architecture argument harder to dismiss. Three real failure shapes from the pilot: one that only the mechanical half catches, one that only the judgment half catches, and one that both halves report.

1. The thing only the mechanical half catches

The author writes:

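// src/orders/loader.ts
export async function loadOrders(userId: string) {
  const orders = await db.orders.find({ userId });
  // The callback has to be async to use await; this still runs one customer query per order.
  return Promise.all(
    orders.map(async (o) => ({
      ...o,
      customer: await db.customers.findOne({ id: o.customerId }),
    })),
  );
}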

The AI reviewer reads this and says "consider batching the customer fetch." The comment is correct in spirit and useless in practice: the author dismisses it because there's no scoped consequence, and the production incident lands two weeks later when the orders page times out on a user with 200 orders.

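The mechanical half catches this without ambiguity. An AST-based detector for the N+1 pattern (an awaited call inside a .map callback) flags the line with the structural rule it violates: "async call inside Array.map; this issues one DB query per element. At median dataset size N=184, this loader issues 184 customer queries where one batched query would do." No taste involved. The detector knows the pattern and the type checker confirms that the callee returns a Promise; the cardinality has to be supplied from data outside the code, because a type checker cannot infer row counts. This is the N+1 that is visible in the diff itself. The variant in the taxonomy table, where the loop and the query live in different functions, still needs the judgment half.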

The AI reviewer's version of this finding is a guess. The mechanical version is a fact.
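To make "a detector for the pattern" concrete, here is a minimal sketch of the structural rule. It runs over a toy AST: the `AstNode` shape, `detectAwaitInMap`, and the rule name are all hypothetical, and a real detector would walk the tree produced by the parser it already uses. The point is only that the check is a tree walk with a yes/no answer.

```typescript
// Toy AST for illustration only; not any real parser's node types.
type AstNode =
  | { kind: "call"; callee: string; args: AstNode[]; line: number }
  | { kind: "arrow"; body: AstNode[]; line: number }
  | { kind: "await"; expr: AstNode; line: number }
  | { kind: "block"; children: AstNode[]; line: number };

interface Finding {
  rule: string;
  line: number;
  message: string;
}

// Child nodes of any node, so the walker can stay generic.
function childrenOf(node: AstNode): AstNode[] {
  switch (node.kind) {
    case "call":
      return node.args;
    case "arrow":
      return node.body;
    case "await":
      return [node.expr];
    case "block":
      return node.children;
  }
}

// Flags every `await` nested inside a callback passed to a `.map(...)` call.
function detectAwaitInMap(root: AstNode): Finding[] {
  const findings: Finding[] = [];
  const walk = (node: AstNode, insideMapCallback: boolean): void => {
    if (node.kind === "await" && insideMapCallback) {
      findings.push({
        rule: "await-in-map",
        line: node.line,
        message: "async call inside Array.map; this issues one call per element",
      });
    }
    for (const child of childrenOf(node)) {
      const entersMapCallback =
        node.kind === "call" && node.callee.endsWith(".map") && child.kind === "arrow";
      walk(child, insideMapCallback || entersMapCallback);
    }
  };
  walk(root, false);
  return findings;
}

// Hand-built tree for the loader: one await before the map, one inside it.
const loaderAst: AstNode = {
  kind: "block",
  line: 2,
  children: [
    { kind: "await", line: 3, expr: { kind: "call", callee: "db.orders.find", args: [], line: 3 } },
    {
      kind: "call",
      callee: "orders.map",
      line: 4,
      args: [
        {
          kind: "arrow",
          line: 4,
          body: [
            { kind: "await", line: 6, expr: { kind: "call", callee: "db.customers.findOne", args: [], line: 6 } },
          ],
        },
      ],
    },
  ],
};

const loaderFindings = detectAwaitInMap(loaderAst);
```

The await on the initial `find` is left alone; only the one inside the map callback is reported. Matching on a callee name that ends in `.map` is a deliberate simplification here.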

2. The thing only the judgment half catches

The author writes:

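// src/billing/charge.ts
export class ChargeStrategy {
  // The wrapped strategy is injected; StripeChargeStrategy is the only implementation.
  constructor(private readonly strategy: StripeChargeStrategy) {}

  apply(amount: number, ctx: ChargeContext): ChargeResult {
    return this.strategy.execute(amount, ctx);
  }
}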
 
// only used once, in src/billing/index.ts
const charge = new ChargeStrategy(new StripeChargeStrategy());

The PR introduces a ChargeStrategy class that wraps a single strategy implementation, used in exactly one place. No linter flags this. No type checker objects. Tests pass. From a mechanical perspective, the code is fine.

A senior reviewer looking at this writes "premature abstraction — ChargeStrategy has one implementer, introduced in the same PR. Either commit to multiple strategies now and add the second one in this PR, or inline the strategy and remove the wrapper. Adding the indirection without the second implementation is a future-cost without a present-benefit."

This is pure judgment. The pattern is technically a valid use of the Strategy pattern. The mechanical layer has nothing to say. The AI layer, given the codebase context, can catch it because it's seen this anti-pattern thousands of times and can pattern-match against the codebase's actual shape. The finding requires context the mechanical half doesn't have access to.
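For reference, the second option in the reviewer's comment (inline the strategy, remove the wrapper) might look like the sketch below. The bodies of `ChargeContext`, `ChargeResult`, and `StripeChargeStrategy` are hypothetical stand-ins, since the post does not show them; only the shape of the call site matters.

```typescript
// Hypothetical minimal types; the real ones are not shown in the post.
interface ChargeContext {
  currency: string;
}

interface ChargeResult {
  ok: boolean;
  amount: number;
  currency: string;
}

class StripeChargeStrategy {
  execute(amount: number, ctx: ChargeContext): ChargeResult {
    // Stand-in for the real payment call.
    return { ok: amount > 0, amount, currency: ctx.currency };
  }
}

// Inlined call site: no ChargeStrategy wrapper, one concrete implementation.
const charge = new StripeChargeStrategy();
const chargeResult = charge.execute(1200, { currency: "EUR" });
```

If a second payment provider does arrive, the wrapper can be reintroduced in the PR that adds it, when there are two implementations to shape the interface against.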

3. The thing that gets reported twice (the trap)

The author writes:

// src/checkout/total.ts
function calculateTotal(items: Item[], discount?: number): number {
  const subtotal = items.reduce((s, i) => s + i.price, 0);
  return subtotal - (discount ?? 0);
}
 
// src/cart/total.ts
function computeTotal(items: Item[], discount?: number): number {
  const total = items.reduce((s, i) => s + i.price, 0);
  return total - (discount ?? 0);
}

The mechanical layer (AST hash) catches this in 5ms: "duplicate function shape — calculateTotal and computeTotal are structurally identical. Extract to a shared helper."
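A rough sketch of how a shape comparison like this can work, assuming a token-level approximation and not a real AST hash: local names are replaced with positional placeholders, so two functions that differ only in naming normalise to the same string. `normalizeShape` and the small keyword list are hypothetical and far from complete.

```typescript
// Tokens that carry structure and must not be renamed (deliberately incomplete).
const KEYWORDS = new Set(["function", "const", "let", "return", "number", "string", "boolean"]);

// Replaces each distinct identifier with $0, $1, ... in order of first appearance.
function normalizeShape(source: string): string {
  const tokens: string[] = source.match(/[A-Za-z_$][\w$]*|\d+|=>|\?\?|[^\s\w]/g) ?? [];
  const placeholders = new Map<string, string>();
  return tokens
    .map((tok, i) => {
      const isIdentifier = /^[A-Za-z_$]/.test(tok);
      // Keep keywords and property names (tokens right after a dot) as written.
      if (!isIdentifier || KEYWORDS.has(tok) || tokens[i - 1] === ".") return tok;
      let placeholder = placeholders.get(tok);
      if (placeholder === undefined) {
        placeholder = `$${placeholders.size}`;
        placeholders.set(tok, placeholder);
      }
      return placeholder;
    })
    .join(" ");
}

const calculateTotalSrc = `
function calculateTotal(items: Item[], discount?: number): number {
  const subtotal = items.reduce((s, i) => s + i.price, 0);
  return subtotal - (discount ?? 0);
}`;

const computeTotalSrc = `
function computeTotal(items: Item[], discount?: number): number {
  const total = items.reduce((s, i) => s + i.price, 0);
  return total - (discount ?? 0);
}`;

// Same shape once names are normalised, so the pair is reported as a duplicate.
const isDuplicateShape = normalizeShape(calculateTotalSrc) === normalizeShape(computeTotalSrc);
```

A production detector would hash the normalised tree instead of comparing strings, which is what makes the comparison scale across a whole repository.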

The AI reviewer also reads both files and reports: "these two functions appear to compute the same thing. Consider DRY-ing them up."

The author now has two findings, one in the architecture section, one in the style section. They dismiss the AI version because it's vaguer, then triage the AST version, then go on with their day. The bot learned nothing. The next time this pattern shows up, both layers fire again.

The fix is not subtle: the LLM's prompt is constructed with the deterministic detector's output already inserted as "the following structural issues have been identified by the detector layer; do not flag duplicates of these." The duplicate finding never appears. The author triages one finding instead of two, the bot keeps its credibility, and the team reads next week's findings instead of skimming them.
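A minimal sketch of that prompt construction, assuming a hypothetical `DetectorFinding` record and a hypothetical `buildReviewPrompt` helper. The instruction sentence is the one quoted above; everything else about the prompt layout is illustrative.

```typescript
// Hypothetical shape for one finding from the deterministic detector layer.
interface DetectorFinding {
  rule: string;
  file: string;
  line: number;
  summary: string;
}

// Puts the detector output ahead of the diff as an explicit do-not-flag list.
function buildReviewPrompt(diff: string, detectorFindings: DetectorFinding[]): string {
  const covered =
    detectorFindings.length === 0
      ? "(none)"
      : detectorFindings
          .map((f) => `- [${f.rule}] ${f.file}:${f.line} ${f.summary}`)
          .join("\n");
  return [
    "The following structural issues have been identified by the detector layer; do not flag duplicates of these.",
    covered,
    "",
    "Diff under review:",
    diff,
  ].join("\n");
}

const reviewPrompt = buildReviewPrompt("(diff text goes here)", [
  {
    rule: "near-duplicate",
    file: "src/cart/total.ts",
    line: 2,
    summary: "computeTotal is structurally identical to calculateTotal",
  },
]);
```

The ordering matters in this sketch: the detectors have to finish before the LLM pass starts, which is the "in sequence" part of the architecture.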

Why this is the foundation of Revund

Revund's architecture is exactly this layering, made explicit at every level. The structural pass runs deterministic detectors against the parsed AST: file-size, mixed-concerns, single-responsibility, near-duplicate, layering. Each detector produces findings with precision close to 100% because there's a measurable rule behind each one. The LLM passes (security, performance, architecture, style) run with the detector output already in their prompt as a do-not-flag list. The LLM's job is everything outside the detector's territory: taste, context, judgment.

The reason this matters more than it sounds: the engine is shaped so the two halves can't accidentally collide. The detector layer doesn't know how to publish judgment findings, so it can't encroach on the LLM's territory. The LLM is given explicit instructions about what's already been covered, so it can't restate what's already been said. The architecture pass that emerges combines findings from both halves under a single Architecture section in the PR comment, which is what the customer sees. The two skill sets are visible from the inside; from the outside, it's one reviewer that happens to be unusually thorough.

This is why "deterministic structural detectors" and "specialist LLM passes" are co-equal in our design notes and not in tension. They cover different territory, and the engine's job is to enforce the boundary. A version of Revund with only the LLM passes would have the overlap problem in spades. A version with only the detectors would miss the bottom half of the taxonomy table. The combined engine works because the seams between the halves are explicit, not implicit.

What you can do today (regardless of tooling)

Three things you can change in your team's review setup this week to push toward layered review without buying anything:

1. Audit overlap in your current setup

Run your current review for one week and tag every finding by whether it could have been caught by your linter, your type checker, your tests, or your AI reviewer. The overlap rate, the share of issues that more than one layer flagged, is the number you care about. Anything above ~15% is bleeding trust. The fix is usually a one-line addition to the AI reviewer's prompt or a one-line exclusion to the linter's config; the hard part is noticing.
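The audit can be a spreadsheet, but a few lines of code work too. The sketch below assumes each tagged finding carries a hypothetical `issueKey` identifying the underlying issue and the `layer` that reported it, and defines the overlap rate as the share of distinct issues flagged by more than one layer. That definition is an assumption made for this sketch.

```typescript
type Layer = "linter" | "typecheck" | "tests" | "llm";

// One reported finding, tagged by hand during the audit week.
interface ReportedFinding {
  issueKey: string; // hypothetical stable key, e.g. "file:line:kind"
  layer: Layer;
}

// Share of distinct issues that were reported by two or more layers.
function overlapRate(findings: ReportedFinding[]): number {
  const layersByIssue = new Map<string, Set<Layer>>();
  for (const f of findings) {
    const layers = layersByIssue.get(f.issueKey) ?? new Set<Layer>();
    layers.add(f.layer);
    layersByIssue.set(f.issueKey, layers);
  }
  if (layersByIssue.size === 0) return 0;
  let overlapping = 0;
  layersByIssue.forEach((layers) => {
    if (layers.size > 1) overlapping += 1;
  });
  return overlapping / layersByIssue.size;
}

// Made-up week of findings: the unused import was reported twice.
const weekOfFindings: ReportedFinding[] = [
  { issueKey: "src/a.ts:3:unused-import", layer: "linter" },
  { issueKey: "src/a.ts:3:unused-import", layer: "llm" },
  { issueKey: "src/b.ts:10:type-mismatch", layer: "typecheck" },
  { issueKey: "src/c.ts:7:misleading-name", layer: "llm" },
  { issueKey: "src/d.ts:22:off-by-one", layer: "tests" },
];

const weeklyOverlap = overlapRate(weekOfFindings); // 1 of 4 distinct issues
const needsAttention = weeklyOverlap > 0.15;
```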

2. Move every mechanical claim to a mechanical tool

If a reviewer is leaving comments that could be a lint rule, the comments are wasting their time and the rule should exist. Add the rule. The reviewer's attention budget is the most expensive thing in the team; do not spend it on findings a regex could produce. This is the single highest-leverage change a team can make in week one.

3. Make the LLM's territory explicit

If you're running an AI reviewer, look at its prompt. If the prompt says "find bugs and issues" without naming what kinds it should and shouldn't flag, the LLM is going to spend most of its tokens on territory the rest of your stack already covers. The fix is to name the territory in the prompt: "flag misleading names, wrong abstractions, premature interfaces, race conditions in concurrent code. Do not flag formatting, unused variables, type errors, or duplicate logic — those are handled by other layers."
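One way to keep that territory from drifting is to hold it as data and render the prompt text from it. The sketch below assumes a hypothetical `Territory` type and `renderTerritory` helper; the category names are the ones from the paragraph above. The guard against a category appearing on both lists is an addition made for this sketch.

```typescript
// Hypothetical config: what the LLM pass should and should not report.
interface Territory {
  flag: string[];
  doNotFlag: string[];
}

const llmTerritory: Territory = {
  flag: [
    "misleading names",
    "wrong abstractions",
    "premature interfaces",
    "race conditions in concurrent code",
  ],
  doNotFlag: ["formatting", "unused variables", "type errors", "duplicate logic"],
};

// Renders the territory as prompt text; rejects a category listed on both sides.
function renderTerritory(t: Territory): string {
  const onBothLists = t.flag.filter((kind) => t.doNotFlag.indexOf(kind) !== -1);
  if (onBothLists.length > 0) {
    throw new Error(`Territory lists overlap: ${onBothLists.join(", ")}`);
  }
  return (
    `Flag ${t.flag.join(", ")}. ` +
    `Do not flag ${t.doNotFlag.join(", ")}; those are handled by other layers.`
  );
}

const territoryText = renderTerritory(llmTerritory);
```

Keeping the lists in version control next to the linter config makes a boundary change reviewable like any other change.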

These three changes are free. The first one alone usually cuts the team's per-PR finding count by a third without losing any real signal.

Methodology and references

For the figures in this post:

  • Figure 1 is original to this post. Categories drawn from a sample of ~500 real findings across the Revund pilot dataset (n = 14 teams, 6 weeks). Hand-labelled by three raters; Cohen's κ = 0.78 on the "best caught by" column.

  • Figure 2 is from the Revund pilot's configuration period (the weeks before the no-overlap rule was made explicit in the LLM prompts). Per-finding triage cost was measured as mean seconds the author spent on each finding before either acting on it or dismissing it. The shape of the curve (multiplicative, not additive) replicates the trust-decay finding from Sadowski et al., "Modern Code Review: A Case Study at Google," ICSE 2018, Section 5.

  • Figure 3 is patterns from the configuration period plus public bug-tracker analysis of the most-installed static-analysis and AI-review tools (2025–2026 sample). The directional finding — that encroachment by either layer onto the other's territory produces stereotyped team behaviour — replicates across every tool combination we looked at.

  • Figure 4 is from the Revund pilot across three configuration regimes (linter-only, LLM-only, layered). Precision is the share of system findings the author committed code in response to within 48 hours. Coverage is benchmarked against a hand-graded set of 200 PRs reviewed by senior engineers in parallel with each system; coverage = (system findings that overlap senior findings) / (senior findings).

The pilot data is internal and has not yet been published. The dedicated methodology post (with the data and the labelling rubric) is being prepared. We're being explicit about which numbers are public-research and which are pilot for the same reason as in post #4: the field has a problem with tools quoting internal numbers they never expose.


If you've been treating "mechanical review" and "AI review" as alternatives, you've been picking the losing side regardless of which one you picked. The choice was never which half. It was always how to layer both halves without letting them step on each other. Email hello@revund.dev if you want to see what the layering looks like in practice.