There's a debate running in every engineering team in 2026 about which kind of code review is the future: the linters and static analysers that have been around for decades, or the AI reviewers that arrived in the last eighteen months. The framing is wrong. Pick either side and you ship a review process that fails differently, but always fails. The teams whose review actually works have stopped debating and started layering.
This post is the fifth in a series on what the research is saying about code review in the AI era, after the 200-line cliff, the 14% gap, the mirror problem, and the defensibility shift. The first four established what's failing. This one is about the practical architecture that fixes it: two halves, no overlap, each one doing what the other can't.
The two halves, defined
The mechanical half of code review is everything a machine can verify against a deterministic process. Type checks, lint rules, test pass/fail, AST shape comparison, framework-convention checks, dead-code detection, dependency cycle detection, complexity counts. These have one thing in common: a correct answer exists, and a program can compute it. Output is precise to the point of being boring. Precision is at or near 100%; the only question is whether what they catch is what matters.
The judgment half is everything that doesn't have a correct answer in a deterministic sense. Whether a name describes what the function returns or what it accepts. Whether an abstraction is premature. Whether the comment three lines up still describes what the code does. Whether the chosen pattern matches the team's mental model of the codebase. These require a reader who has internalised the codebase's conventions, its history, and the team's standards. They cannot be computed; they can only be assessed.
Most code-review tools pick one half. Linters and static analysers do the mechanical half perfectly and the judgment half not at all. AI reviewers do the judgment half decently and the mechanical half badly, because they re-derive what a linter already knows and waste tokens on what's already proven.
The architecture that works is not a winner. It's both halves, in sequence, with a strict no-overlap rule.
Which half catches what
Most arguments about which tool is better are arguments about specific findings. Linter people point to a clean codebase and say "see, no AI needed." AI people point to a real refactor that needed taste and say "see, no linter would have caught that." Both are right about their example and wrong about the conclusion. The two halves catch different things, and a competent review process needs both.
| Finding type | Best caught by | Why it belongs to that layer |
|---|---|---|
| Unused variable | Linter | Pure syntactic; deterministic |
| Type mismatch / nullable misuse | Type system (tsc) | The compiler is the ground truth |
| Duplicated function across files | AST hash | Mechanical comparison, scales |
| Component in wrong folder | Repo rules + framework | Convention is configurable, checkable |
| File mixing 5+ concerns | AST concern counter | Countable property of the whole file |
| Off-by-one in date math | Test (executable spec) | Only an executable spec catches this |
| Misleading variable name | LLM | Pure taste, no anchor exists |
| Wrong abstraction | LLM | Judgment call against codebase shape |
| Premature interface | LLM + context | Requires reasoning about future change |
| N+1 query in a loop | LLM + data context | Needs the query graph the diff doesn't show |
The first six rows are not interesting. Linters have been catching unused variables since the original Unix lint in the late 1970s. The TypeScript compiler has been catching null misuse since strictNullChecks shipped in TypeScript 2.0 in 2016. AST hashing for duplicate detection is decades old. None of this is novel; it's just unevenly adopted. Any team using a competent set of static tools already has the first six rows covered.
The bottom four rows are where the conversation has moved. Misleading names, wrong abstractions, premature interfaces, N+1 queries that require understanding the call graph — these have always been the senior reviewer's job because no deterministic process could touch them. AI reviewers are the first tool that can attack this territory at all. They do it imperfectly, but the alternative until eighteen months ago was "hope a senior catches it."
The trap is to confuse the two halves. When an AI tool reports an unused variable as an architecture finding, it has encroached on the linter's territory and added zero value. When a deterministic detector reports "wrong abstraction" without any judgment behind the call, it has encroached on the AI's territory and produced a false positive. Both encroachments cost the team trust.
Why overlap is the trust killer
The most common failure mode in real-world review setups isn't that one half is missing — it's that both halves are present, and they overlap. A linter flags an unused import. The AI reviewer, reading the same diff, also flags the unused import in slightly different words. The author now has two findings to dismiss for one underlying issue. Multiply by a hundred findings per PR and the team mutes both tools within a sprint.
The math is uncomfortable for tool vendors. The thing that breaks a review setup isn't bad findings — it's redundant findings. A team can absorb a 20% false-positive rate from a single tool because the dismissal pattern is consistent: same kind of false positive, same way to filter it. The same 20% from two tools, on overlapping territory, produces unpredictable noise — sometimes both tools flag, sometimes one does, sometimes neither — and unpredictable noise is what trust dies of.
The discipline this implies is uncomfortable too. If you're running both a linter and an AI reviewer, the AI reviewer must explicitly know what the linter covers, and refuse to publish findings in that territory. The same applies in reverse: if you have an AST-based detector for duplicate functions, the LLM prompt has to be told "do not flag duplicates; that's handled." Without these explicit no-flag boundaries, the layers cannibalise each other's trust.
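One way to make those no-flag boundaries concrete is to write territory ownership down as data and filter on it before anything is published. The sketch below is a minimal illustration under assumptions: the names `Layer`, `Finding`, `OWNERSHIP` and `publishable` are hypothetical, and the category list is only a sample.

```typescript
// Hypothetical names throughout: Layer, Finding, OWNERSHIP and publishable
// are illustrative and do not come from any specific tool.
type Layer = "linter" | "typechecker" | "detector" | "llm";

interface Finding {
  layer: Layer;     // which layer produced the finding
  category: string; // e.g. "unused-variable", "misleading-name"
  file: string;
  message: string;
}

// One owner per category: the no-overlap rule written down as data.
const OWNERSHIP: Record<string, Layer> = {
  "unused-variable": "linter",
  "type-mismatch": "typechecker",
  "duplicate-function": "detector",
  "misleading-name": "llm",
  "wrong-abstraction": "llm",
};

// Keep a finding only when the layer that produced it owns its category.
// Categories missing from the map are dropped as well, so a new kind of
// finding has to be assigned an owner before it can be published.
function publishable(findings: Finding[]): Finding[] {
  return findings.filter((f) => OWNERSHIP[f.category] === f.layer);
}

const raw: Finding[] = [
  { layer: "linter", category: "unused-variable", file: "a.ts", message: "'x' is never used" },
  { layer: "llm", category: "unused-variable", file: "a.ts", message: "x appears to be unused" },
  { layer: "llm", category: "misleading-name", file: "a.ts", message: "getUser mutates the session" },
  { layer: "detector", category: "wrong-abstraction", file: "b.ts", message: "class has one caller" },
];

const kept = publishable(raw);
console.log(kept.map((f) => `${f.layer}:${f.category}`));
// [ 'linter:unused-variable', 'llm:misleading-name' ]
```

The filter is a backstop, not a substitute for telling the LLM its territory up front: tokens spent producing a finding that is then dropped are still spent.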
What it looks like when one layer does the other's job
The market has hundreds of products on each side of this line. They fail in stereotyped ways when they don't respect the boundary. Five real failure modes, with the rule that prevents each:
| When a layer encroaches | What the team experiences | The rule that prevents it |
|---|---|---|
| LLM flags unused variable | Style nit reported as 'architecture'; reader stops trusting the category | Lint rule covers it; LLM prompt explicitly excludes |
| LLM flags type error | Bot says 'this might be null' on a line tsc already proved is null-safe | Pre-run tsc; LLM receives diagnostics, never re-derives them |
| Detector flags 'wrong abstraction' | False positive every time; deterministic rule with no judgment behind it | Detectors stay in measurable territory; never publish taste calls |
| Detector flags 'misleading name' | Same; pattern-match without context produces noise | Names go to the LLM, which can read intent from surrounding code |
| Both layers flag the same duplicate | Two findings to dismiss for one issue; trust burns 2x faster | LLM prompt receives the detector's findings as a 'do not flag' list |
The table compresses a lot of practical experience. Each failure mode looks small in isolation and produces a recognisable team behaviour: "the bot is noisy," "the linter is too strict," "we turned off the architecture rules," "we ignore that whole category now." These aren't tool problems. They're layering problems. The same tools work fine when the boundaries are explicit and break when they're implicit.
This is also why the LLM prompt in a serious review engine is not a vibe. It's a contract. The prompt says, in writing, "these are the kinds of findings you produce. These are the kinds of findings you do not produce because another layer already covers them. These are the rationale schemas you must satisfy before any finding can be surfaced." A reviewer that doesn't know its territory ends up everywhere, which is the same as nowhere.
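To show what "a contract, not a vibe" could look like once it is checkable, here is a small sketch. Everything in it is an assumption made for illustration: `ReviewContract`, `Rationale`, `Candidate` and `meetsContract` are hypothetical names, and the three rationale fields are one plausible schema rather than the one any particular engine uses.

```typescript
// Hypothetical shapes: not the schema of any particular review engine.
interface Rationale {
  evidence?: string;    // the code the finding points at
  consequence?: string; // what goes wrong if the finding is ignored
  suggestion?: string;  // the concrete change being proposed
}

interface ReviewContract {
  produces: string[]; // categories this reviewer may publish
  excluded: string[]; // categories another layer already covers
  requiredRationale: (keyof Rationale)[];
}

interface Candidate {
  category: string;
  rationale: Rationale;
}

const contract: ReviewContract = {
  produces: ["misleading-name", "wrong-abstraction", "premature-interface"],
  excluded: ["unused-variable", "type-error", "duplicate-function"],
  requiredRationale: ["evidence", "consequence", "suggestion"],
};

// A candidate is surfaced only if it sits inside the reviewer's territory
// and every required rationale field is present and non-empty.
function meetsContract(c: Candidate, k: ReviewContract): boolean {
  if (k.excluded.includes(c.category)) return false;
  if (!k.produces.includes(c.category)) return false;
  return k.requiredRationale.every(
    (field) => (c.rationale[field] ?? "").trim().length > 0,
  );
}

const complete: Candidate = {
  category: "wrong-abstraction",
  rationale: {
    evidence: "ChargeStrategy wraps a single implementation",
    consequence: "extra indirection with no second caller",
    suggestion: "inline the strategy until a second one exists",
  },
};
const outOfTerritory: Candidate = { ...complete, category: "unused-variable" };
const vague: Candidate = { category: "misleading-name", rationale: { evidence: "getUser" } };

console.log(meetsContract(complete, contract));       // true
console.log(meetsContract(outOfTerritory, contract)); // false
console.log(meetsContract(vague, contract));          // false
```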
What the layered approach actually buys you
The reason to do the work of separating the halves is that the result is better on both axes that matter: precision (the share of findings the author acts on) and coverage (the share of real issues the system surfaces).
The two-axis chart is the clearest argument I can make for why this is worth doing. A team that ships only linters covers a quarter of the surface area of real issues; nobody who's worked through a serious code review thinks lint output is enough. A team that ships only an AI reviewer gets more coverage but burns the credibility needed to keep the team reading the output past month two. A team that ships both, with the boundary explicit, gets the coverage of the AI reviewer without the precision cost.
This is not a Revund-specific finding. Several teams in the pilot already ran both ESLint with custom rules and a separate AI reviewer before they tried Revund. Their pain point was always the overlap: the AI reviewer kept restating things ESLint had already said. The fix was the same fix regardless of which AI tool they were using — make the boundary explicit, surface a single Architecture section, and never let a finding appear twice.
Three failure shapes, in code
Concrete examples make the architecture argument harder to dismiss. Three real failure shapes from the pilot, each demonstrating what one half catches that the other can't.
1. The thing only the mechanical half catches
The author writes:
```typescript
// src/orders/loader.ts
export async function loadOrders(userId: string) {
  const orders = await db.orders.find({ userId });
  // One customer lookup per order: this is the N+1 shape under discussion.
  return Promise.all(
    orders.map(async (o) => ({
      ...o,
      customer: await db.customers.findOne({ id: o.customerId }),
    })),
  );
}
```

The AI reviewer reads this and says "consider batching the customer fetch." The comment is correct in spirit and useless in practice: the author dismisses it because there's no scoped consequence, and the production incident lands two weeks later when the orders page times out on a user with 200 orders.
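The mechanical half catches this without ambiguity. An AST-based detector for the N+1 pattern (an awaited call inside a .map callback) flags the line with the structural rule it violates: "async call inside Array.map; this issues one DB query per element. At median dataset size N=184, this loader issues 184 customer queries where one batched query would do." No taste involved. The detector knows the pattern. The cardinality has to be supplied to it from data statistics, because a static type says nothing about row counts; with both in hand, the warning is unambiguous. This is the syntactically visible form of the problem. The "N+1 query in a loop" row in the table above covers the harder case, where the query sits behind a call the diff doesn't show.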
The AI reviewer's version of this finding is a guess. The mechanical version is a fact.
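The "one query per element" claim is easy to check by counting. The sketch below is a toy under stated assumptions: `makeDb` is a hypothetical in-memory stand-in for the `db` object in the loader above, `findMany` is an assumed batch lookup that the original snippet does not show, and the per-row loader is written with an async callback and `Promise.all` so that it compiles.

```typescript
// Hypothetical in-memory stand-in for `db`; findMany is an assumed batch API.
interface Order { id: number; userId: string; customerId: number }
interface Customer { id: number; name: string }

function makeDb(orders: Order[], customers: Customer[]) {
  let queries = 0; // incremented once per simulated round trip
  return {
    queryCount: () => queries,
    orders: {
      find: async (q: { userId: string }) => {
        queries++;
        return orders.filter((o) => o.userId === q.userId);
      },
    },
    customers: {
      findOne: async (q: { id: number }) => {
        queries++;
        return customers.find((c) => c.id === q.id);
      },
      findMany: async (q: { ids: number[] }) => {
        queries++;
        return customers.filter((c) => q.ids.includes(c.id));
      },
    },
  };
}
type Db = ReturnType<typeof makeDb>;

// Same shape as the loader above: one customer lookup per order.
async function loadOrdersPerRow(db: Db, userId: string) {
  const orders = await db.orders.find({ userId });
  return Promise.all(
    orders.map(async (o) => ({
      ...o,
      customer: await db.customers.findOne({ id: o.customerId }),
    })),
  );
}

// Batched variant: one lookup for all distinct customer ids.
async function loadOrdersBatched(db: Db, userId: string) {
  const orders = await db.orders.find({ userId });
  const ids = Array.from(new Set(orders.map((o) => o.customerId)));
  const found = await db.customers.findMany({ ids });
  const byId = new Map<number, Customer>();
  for (const c of found) byId.set(c.id, c);
  return orders.map((o) => ({ ...o, customer: byId.get(o.customerId) }));
}

// Runs both loaders against fresh databases holding n orders for one user.
async function countQueries(n: number) {
  const customers: Customer[] = [];
  const orders: Order[] = [];
  for (let i = 0; i < n; i++) {
    customers.push({ id: i, name: `customer-${i}` });
    orders.push({ id: i, userId: "u1", customerId: i });
  }
  const perRowDb = makeDb(orders, customers);
  await loadOrdersPerRow(perRowDb, "u1");
  const batchedDb = makeDb(orders, customers);
  await loadOrdersBatched(batchedDb, "u1");
  return { perRow: perRowDb.queryCount(), batched: batchedDb.queryCount() };
}

countQueries(184).then((r) => console.log(r)); // { perRow: 185, batched: 2 }
```

The count is the scoped consequence the vague comment was missing: 185 round trips against 2 at N=184, growing linearly with the user's order history.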
2. The thing only the judgment half catches
The author writes:
```typescript
// src/billing/charge.ts
export class ChargeStrategy {
  // The wrapped implementation; StripeChargeStrategy is the only one that exists.
  constructor(private readonly strategy: StripeChargeStrategy) {}

  apply(amount: number, ctx: ChargeContext): ChargeResult {
    return this.strategy.execute(amount, ctx);
  }
}

// only used once, in src/billing/index.ts
const charge = new ChargeStrategy(new StripeChargeStrategy());
```

The PR introduces a ChargeStrategy class that wraps a single strategy implementation, used in exactly one place. No linter flags this. No type checker objects. Tests pass. From a mechanical perspective, the code is fine.
A senior reviewer looking at this writes "premature abstraction — ChargeStrategy has one implementer, introduced in the same PR. Either commit to multiple strategies now and add the second one in this PR, or inline the strategy and remove the wrapper. Adding the indirection without the second implementation is a future-cost without a present-benefit."
This is pure judgment. The pattern is technically a valid use of the Strategy pattern. The mechanical layer has nothing to say. The AI layer, given the codebase context, can catch it because it's seen this anti-pattern thousands of times and can pattern-match against the codebase's actual shape. The finding requires context the mechanical half doesn't have access to.
3. The thing that gets reported twice (the trap)
The author writes:
```typescript
// src/checkout/total.ts
function calculateTotal(items: Item[], discount?: number): number {
  const subtotal = items.reduce((s, i) => s + i.price, 0);
  return subtotal - (discount ?? 0);
}

// src/cart/total.ts
function computeTotal(items: Item[], discount?: number): number {
  const total = items.reduce((s, i) => s + i.price, 0);
  return total - (discount ?? 0);
}
```

The mechanical layer (AST hash) catches this in 5ms: "duplicate function shape: calculateTotal and computeTotal are structurally identical. Extract to a shared helper."
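To give a feel for why this comparison is mechanical, here is a deliberately small stand-in for AST hashing. It is an assumption-laden toy: it works on a token stream rather than a parsed tree, renames identifiers by order of first appearance, and hashes the result with FNV-1a. A real detector would hash the parser's AST; the names `fingerprint` and `KEYWORDS` are hypothetical.

```typescript
// Token-level stand-in for an AST hash; not how a production detector works.
const KEYWORDS = new Set(["function", "const", "let", "return", "if", "else", "for", "while"]);

function fingerprint(source: string): string {
  const withoutComments = source.replace(/\/\/.*$/gm, "");
  // Identifiers, numbers, the two-character operators used here, then any
  // other single punctuation character.
  const tokens =
    withoutComments.match(/[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|=>|\?\?|[^\s\w]/g) ?? [];

  // Rename identifiers by order of first appearance so that naming
  // differences (subtotal vs total) do not affect the result.
  const renamed = new Map<string, string>();
  const normalised = tokens.map((t) => {
    const isIdentifier = /^[A-Za-z_$]/.test(t) && !KEYWORDS.has(t);
    if (!isIdentifier) return t;
    let alias = renamed.get(t);
    if (alias === undefined) {
      alias = `id${renamed.size}`;
      renamed.set(t, alias);
    }
    return alias;
  });

  // FNV-1a (32-bit) over the normalised token stream.
  const text = normalised.join(" ");
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

const calculateTotalSrc = `
function calculateTotal(items: Item[], discount?: number): number {
  const subtotal = items.reduce((s, i) => s + i.price, 0);
  return subtotal - (discount ?? 0);
}`;

const computeTotalSrc = `
function computeTotal(items: Item[], discount?: number): number {
  const total = items.reduce((s, i) => s + i.price, 0);
  return total - (discount ?? 0);
}`;

const totalWithTaxSrc = `
function totalWithTax(items: Item[], rate: number): number {
  const subtotal = items.reduce((s, i) => s + i.price, 0);
  return subtotal * (1 + rate);
}`;

console.log(fingerprint(calculateTotalSrc) === fingerprint(computeTotalSrc)); // true
console.log(fingerprint(calculateTotalSrc) === fingerprint(totalWithTaxSrc)); // false
```

The two totals collapse to the same normalised stream because only the function name and one local differ; the tax variant changes the structure, so its fingerprint differs.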
The AI reviewer also reads both files and reports: "these two functions appear to compute the same thing. Consider DRY-ing them up."
The author now has two findings, one in the architecture section, one in the style section. They dismiss the AI version because it's vaguer, then triage the AST version, then go on with their day. The bot learned nothing. The next time this pattern shows up, both layers fire again.
The fix is not subtle: the LLM's prompt is constructed with the deterministic detector's output already inserted as "the following structural issues have been identified by the detector layer; do not flag duplicates of these." The duplicate finding never appears. The author triages one finding instead of two, the bot keeps its credibility, and the team reads next week's findings instead of skimming them.
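The prompt construction described here can be sketched in a few lines. This is a minimal illustration, not the engine's implementation: `DetectorFinding` and `buildReviewPrompt` are hypothetical names, and the base prompt text is made up for the example. Only the do-not-flag sentence is taken from the description above.

```typescript
// Hypothetical shape for what the detector layer hands to the LLM pass.
interface DetectorFinding {
  rule: string;    // e.g. "near-duplicate"
  file: string;
  summary: string;
}

// Appends the detector output to the base prompt as a do-not-flag list.
// With no detector findings the base prompt is returned unchanged.
function buildReviewPrompt(basePrompt: string, detectorFindings: DetectorFinding[]): string {
  if (detectorFindings.length === 0) return basePrompt;
  const lines = detectorFindings.map((f) => `- [${f.rule}] ${f.file}: ${f.summary}`);
  return [
    basePrompt,
    "",
    "The following structural issues have been identified by the detector layer; do not flag duplicates of these:",
    ...lines,
  ].join("\n");
}

const prompt = buildReviewPrompt(
  "Review this diff for misleading names and wrong abstractions.",
  [
    {
      rule: "near-duplicate",
      file: "src/cart/total.ts",
      summary: "computeTotal is structurally identical to calculateTotal in src/checkout/total.ts",
    },
  ],
);

console.log(prompt);
```

Passing the findings themselves, rather than only the category name, lets the model recognise the specific duplicate it would otherwise have restated.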
Why this is the foundation of Revund
Revund's architecture is exactly this layering, made explicit at every level. The structural pass runs deterministic detectors against the parsed AST: file-size, mixed-concerns, single-responsibility, near-duplicate, layering. Each detector produces findings with precision close to 100% because there's a measurable rule behind each one. The LLM passes (security, performance, architecture, style) run with the detector output already in their prompt as a do-not-flag list. The LLM's job is everything outside the detector's territory: taste, context, judgment.
The reason this matters more than it sounds: the engine is shaped so the two halves can't accidentally collide. The detector layer doesn't know how to publish judgment findings, so it can't encroach on the LLM's territory. The LLM is given explicit instructions about what's already been covered, so it can't restate what's already been said. The architecture pass that emerges combines findings from both halves under a single Architecture section in the PR comment, which is what the customer sees. The two skill sets are visible from the inside; from the outside, it's one reviewer that happens to be unusually thorough.
This is why "deterministic structural detectors" and "specialist LLM passes" are co-equal in our design notes and not in tension. They cover different territory, and the engine's job is to enforce the boundary. A version of Revund with only the LLM passes would have the overlap problem in spades. A version with only the detectors would miss the bottom four rows of the taxonomy table. The combined engine works because the seams between the halves are explicit, not implicit.
What you can do today (regardless of tooling)
Three things you can change in your team's review setup this week to push toward layered review without buying anything:
1. Audit overlap in your current setup
Run your current review for one week and tag every finding by whether it could have been caught by your linter, your type checker, your tests, or your AI reviewer. The overlap count is the number you care about. Anything above ~15% is bleeding trust. The fix is usually a one-line addition to the AI reviewer's prompt or a one-line exclusion to the linter's config; the hard part is noticing.
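The tally itself is a few lines once the week's findings are tagged. The sketch below assumes one particular reading of "overlap": a finding counts as overlapping when more than one layer could have caught it. That definition, and the names `TaggedFinding`, `overlapRate` and `OVERLAP_THRESHOLD`, are assumptions for illustration.

```typescript
type Source = "linter" | "typechecker" | "tests" | "ai";

// One row per finding from the audit week, tagged by hand.
interface TaggedFinding {
  id: string;
  catchableBy: Source[]; // every layer that could have caught it
}

// Share of findings that more than one layer could have caught.
function overlapRate(findings: TaggedFinding[]): number {
  if (findings.length === 0) return 0;
  const overlapping = findings.filter((f) => new Set(f.catchableBy).size > 1).length;
  return overlapping / findings.length;
}

const OVERLAP_THRESHOLD = 0.15; // the ~15% line from the text

const week: TaggedFinding[] = [
  { id: "F1", catchableBy: ["linter", "ai"] }, // unused import, flagged twice
  { id: "F2", catchableBy: ["typechecker"] },
  { id: "F3", catchableBy: ["ai"] },
  { id: "F4", catchableBy: ["tests"] },
  { id: "F5", catchableBy: ["ai"] },
];

const rate = overlapRate(week);
console.log(rate, rate > OVERLAP_THRESHOLD ? "bleeding trust" : "ok"); // 0.2 bleeding trust
```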
2. Move every mechanical claim to a mechanical tool
If a reviewer is leaving comments that could be a lint rule, the comments are wasting their time and the rule should exist. Add the rule. The reviewer's attention budget is the most expensive thing in the team; do not spend it on findings a regex could produce. This is the single highest-leverage change a team can make in week one.
3. Make the LLM's territory explicit
If you're running an AI reviewer, look at its prompt. If the prompt says "find bugs and issues" without naming what kinds it should and shouldn't flag, the LLM is going to spend most of its tokens on territory the rest of your stack already covers. The fix is to name the territory in the prompt: "flag misleading names, wrong abstractions, premature interfaces, race conditions in concurrent code. Do not flag formatting, unused variables, type errors, or duplicate logic — those are handled by other layers."
These three changes are free. The first one alone usually cuts the team's per-PR finding count by a third without losing any real signal.
Methodology and references
For the figures in this post:
- Figure 1 is original to this post. Categories drawn from a sample of ~500 real findings across the Revund pilot dataset (n = 14 teams, 6 weeks). Hand-labelled by three raters; Cohen's κ = 0.78 on the "best caught by" column.
- Figure 2 is from the Revund pilot's configuration period (the weeks before the no-overlap rule was made explicit in the LLM prompts). Per-finding triage cost was measured as mean seconds the author spent on each finding before either acting on it or dismissing it. The shape of the curve (multiplicative, not additive) replicates the trust-decay finding from Sadowski et al., "Modern Code Review: A Case Study at Google," ICSE 2018, Section 5.
- Figure 3 draws on patterns from the configuration period plus public bug-tracker analysis of the most-installed static-analysis and AI-review tools (2025–2026 sample). The directional finding, that encroachment by either layer onto the other's territory produces stereotyped team behaviour, replicates across every tool combination we looked at.
- Figure 4 is from the Revund pilot across three configuration regimes (linter-only, LLM-only, layered). Precision is the share of system findings the author committed code in response to within 48 hours. Coverage is benchmarked against a hand-graded set of 200 PRs reviewed by senior engineers in parallel with each system; coverage = (system findings that overlap senior findings) / (senior findings).
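For readers who want to reproduce the Figure 4 metrics on their own data, the two definitions can be written down directly. This is a sketch under assumptions, not the pilot's code: the field names are hypothetical, the sample data is invented, and coverage is read here as the share of distinct senior findings matched by at least one system finding, which keeps the ratio at or below 1.

```typescript
// Hypothetical record shape for one finding published by the system.
interface SystemFinding {
  id: string;
  // Hours until the author committed code in response; null if they never did.
  actedOnAfterHours: number | null;
  // Id of the senior-reviewer finding it overlaps; null if none.
  matchesSeniorFindingId: string | null;
}

// Precision: share of system findings acted on within 48 hours.
function precision(findings: SystemFinding[]): number {
  if (findings.length === 0) return 0;
  const acted = findings.filter(
    (f) => f.actedOnAfterHours !== null && f.actedOnAfterHours <= 48,
  ).length;
  return acted / findings.length;
}

// Coverage: share of senior findings that some system finding overlaps.
function coverage(findings: SystemFinding[], seniorFindingIds: string[]): number {
  if (seniorFindingIds.length === 0) return 0;
  const matched = new Set(
    findings
      .map((f) => f.matchesSeniorFindingId)
      .filter((id): id is string => id !== null && seniorFindingIds.includes(id)),
  );
  return matched.size / seniorFindingIds.length;
}

// Invented sample data, for illustration only.
const sample: SystemFinding[] = [
  { id: "A", actedOnAfterHours: 3, matchesSeniorFindingId: "S1" },
  { id: "B", actedOnAfterHours: 50, matchesSeniorFindingId: "S1" },
  { id: "C", actedOnAfterHours: null, matchesSeniorFindingId: null },
  { id: "D", actedOnAfterHours: 20, matchesSeniorFindingId: "S3" },
];
const senior = ["S1", "S2", "S3", "S4"];

console.log(precision(sample));        // 0.5
console.log(coverage(sample, senior)); // 0.5
```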
The pilot data is internal and has not yet been published. The dedicated methodology post (with the data and the labelling rubric) is being prepared. We're being explicit about which numbers are public-research and which are pilot for the same reason as in post #4: the field has a problem with tools quoting internal numbers they never expose.
If you've been treating "mechanical review" and "AI review" as alternatives, you've been picking the losing side regardless of which one you picked. The choice was never which half. It was always how to layer both halves without letting them step on each other. Email hello@revund.dev if you want to see what the layering looks like in practice.