
The mirror problem: why a model can't review another model's code

Wes Guirra · Founder · Revund

May 21, 2026 · 19 min read

The model wrote the function. A different prompt of the same model reviewed it. They agreed. The function had a bug. The bug shipped.

This is not a hypothetical. Three independent studies from 2025 and 2026, looking at different cuts of the same phenomenon, converge on the same structural finding: an AI reviewer drawn from the same training distribution as the AI author can't see what the author missed. Cross-family review breaks some of the correlation. Adding an external reference, something the reviewer can verify against a deterministic process rather than its own intuition, breaks the rest.

This is the third post in our research series. The first two looked at the human side of code review: when reviewers fail and what their comments should carry. This one is about the system side. What does it actually take for a reviewer to function as a reviewer, and why does dropping a language model into the role not automatically produce one? The answer is a load-bearing wall in how Revund is built, and the research behind it has hardened materially in the last six months.

The mirror problem, defined

A 2026 paper by Zietsman ("The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review") frames the issue precisely. When an AI agent generates code and another AI agent reviews it, "both the generating agent and the reviewing agent reason from the same artefact, share the same training distribution, and exhibit correlated failures." The reviewer isn't a second perspective. It's a second draft of the writer's perspective with the same blind spots in the same places.

Call this the mirror problem: a reviewer drawn from the same distribution as the author cannot see what the author missed, because the same distribution produces the same misses. This is not a sampling artefact. It is a structural property of the system.

The mirror problem is not solved by a better model. It is solved by introducing something the reviewer can check against that doesn't share the author's distribution. That something can be another model family (partial fix), an external deterministic process (better fix), or both (what we built). It has to come from outside the mirror, and it has to be verifiable. That second word is the part most AI review tooling skips.

The experiment, run on planted bugs

To make this concrete, run the obvious experiment: build a corpus of small functions with planted bugs, ask the same model that wrote the function to review it, and measure detection rate by bug class. Then repeat with a different model family. Then repeat with the reviewer given access to type-system output, lint config, and the resolved declarations of any external symbols the function touches.

Planted bug class	Same-family LLM review	Cross-family LLM review	LLM + external reference
Off-by-one in date math	22%	41%	88%
Cross-file signature drift	18%	27%	96%
Nullable returned, used as non-null	26%	39%	98%
Repo-convention violation	11%	14%	84%
Stale import after rename	14%	19%	92%
SQL injection (parameterised path)	71%	78%	91%
XSS via unescaped template	64%	72%	89%

Detection rate by reviewer composition, broken out by planted-bug class. The pattern is the argument. Bug classes the model is heavily trained on (SQL injection, XSS) get caught even by same-family review, because the failure mode has a small, well-documented surface and lives near the top of every security tutorial in the training set. Bug classes that depend on this codebase's specific reality (cross-file signature drift, nullable types narrowed wrong, repo-specific convention violations) collapse to low-double-digit detection (11–26%) under same-family review. They recover to 84–98% only when the reviewer is given access to an external reference: the type system, symbol resolver, lint config, repo rules. The structural failure of the mirror is not uniform; it is concentrated in exactly the bug classes that determine whether a real change is safe in a real codebase.

Detection rates are a directional reconstruction following the planted-bug-corpus methodology described in Zietsman, 'The Specification as Quality Gate' (arXiv 2603.25773, 2026), extended with the Revund pilot dataset (n = 14 teams, 6 weeks). The paper publishes per-model rates but not the full bug × reviewer matrix; the values here are calibrated to the paper's directional claims and the pilot's per-class catch rates. Methodology will be published with the next post in this series.

Three things to draw from the table. First: the columns are not "good reviewer" versus "bad reviewer." Every reviewer in the table can read code. The difference is what they have access to outside the diff. Same-family review reads the diff and applies the model's prior. Cross-family review reads the diff and applies a different model's prior. Reference-grounded review reads the diff and applies a model's prior plus the output of deterministic processes that don't share that prior.

Second: the bugs that survive same-family review are not random. They cluster on the classes where catching the bug requires knowing something the diff alone doesn't carry. "Is this user nullable?" depends on a signature defined in another file. "Was this symbol renamed?" depends on the commit history of the symbol. "Does this pattern match our repo's conventions?" depends on rules written down (or not) somewhere outside the diff. None of these are detective work the model can do from the diff text. They require the reviewer to consult something else.

Third: the bug classes that survive same-family review are exactly the bug classes that matter most for whether a PR is safe to merge in a working codebase. SQL injection is real but it's also pattern-matchable from training data; the model catches it because every other tutorial on the internet catches it. Cross-file signature drift after a rename last Tuesday catches no one's attention until it ships, and then it pages someone.
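The per-cell numbers in a matrix like the one above are plain hit rates. A minimal sketch of that bookkeeping follows. The `Trial` shape, the setup names, and the sample trials are assumptions made for illustration; they are not the pilot's schema or data.

```typescript
// Hypothetical trial record: one planted bug shown to one reviewer setup.
type ReviewerSetup = "same-family" | "cross-family" | "external-reference";

interface Trial {
  bugClass: string;
  reviewer: ReviewerSetup;
  detected: boolean;
}

// Detection rate (0..1) for each "bugClass|reviewer" cell of the matrix.
function detectionRates(trials: Trial[]): { [cell: string]: number } {
  const hits: { [cell: string]: number } = {};
  const totals: { [cell: string]: number } = {};
  trials.forEach((t) => {
    const cell = t.bugClass + "|" + t.reviewer;
    totals[cell] = (totals[cell] || 0) + 1;
    hits[cell] = (hits[cell] || 0) + (t.detected ? 1 : 0);
  });
  const rates: { [cell: string]: number } = {};
  Object.keys(totals).forEach((cell) => {
    rates[cell] = hits[cell] / totals[cell];
  });
  return rates;
}

// Made-up trials, only to exercise the function; not pilot data.
const sampleTrials: Trial[] = [
  { bugClass: "signature-drift", reviewer: "same-family", detected: false },
  { bugClass: "signature-drift", reviewer: "same-family", detected: true },
  { bugClass: "signature-drift", reviewer: "external-reference", detected: true },
  { bugClass: "signature-drift", reviewer: "external-reference", detected: true },
];

const rates = detectionRates(sampleTrials);
```

Keeping the bug class in the key is the point: a single pooled detection rate would hide the gap between security-class and codebase-specific bugs that the table shows.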

The industry-scale data

The planted-bug experiment is the controlled version. The uncontrolled version was published in 2026 in a study of 278,790 code review conversations across 300 open-source GitHub projects ("Human-AI Synergy in Agentic Code Review," Han et al., arXiv 2603.15911). The findings stack:

AI suggestions are adopted at a significantly lower rate than human suggestions.
Over half of un-adopted AI suggestions were either incorrect or were addressed through an alternative fix the author preferred.
When AI suggestions were adopted, the resulting code grew larger and more complex than when the equivalent human suggestion was adopted. The reviewer's recommendation made the codebase worse on two dimensions at once.
Human reviewers spent 11.8% more conversation rounds when reviewing AI-generated code than when reviewing human-written code.

The first bar in Figure 2 is the human-suggestion adoption rate. The second is the AI-suggestion adoption rate. The third is what we measured in the Revund pilot: an AI reviewer paired with a deterministic external reference (the ContextBundle).

Review-suggestion adoption rate by reviewer type. Human reviewers' suggestions land in the codebase more than twice as often as suggestions from an unaided AI reviewer. The gap closes substantially when the AI reviewer operates with an external reference: every finding can cite a deterministic output (a tsc diagnostic, a failing test, a resolved symbol's signature) rather than asking the author to take the model's word for it. The third bar is not the AI getting smarter. It is the AI being given something the author can verify.

First two bars: Han et al., 'Human-AI Synergy in Agentic Code Review,' arXiv 2603.15911 (2026), n = 278,790 review conversations, 300 projects. Exact per-category adoption percentages are not all published in the paper; values plotted are the midpoint of the ranges the authors report. Third bar: Revund pilot, n = 14 teams over 6 weeks, ~2,200 findings, action = 'author committed code in direct response within 48 hours'.

The last finding from Han et al., 11.8% more rounds when reviewing AI code, deserves its own treatment, because it has a second-order cost most discussions of AI review skip.

The reviewing-AI-code tax

When AI authors a PR and AI reviews it, the throughput argument that justifies the tool starts to leak. An independent 2025 study at ICSE looked at automated code review in practice across roughly 238 practitioners and ten projects. Average pull-request closure duration moved from five hours fifty-two minutes before automated review was introduced to eight hours twenty minutes afterward, even though 73.8% of automated review comments were marked resolved. The tool added work even when its comments were "accepted."
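To put the two closure durations on one scale, the arithmetic can be worked through directly. This small sketch uses only the two figures quoted above; the helper name is made up for illustration.

```typescript
// Convert an "h hours m minutes" duration to minutes.
function toMinutes(hours: number, minutes: number): number {
  return hours * 60 + minutes;
}

const beforeMinutes = toMinutes(5, 52); // 352 minutes
const afterMinutes = toMinutes(8, 20);  // 500 minutes

// Relative change in average PR closure duration.
const relativeIncrease = (afterMinutes - beforeMinutes) / beforeMinutes;
const percentIncrease = Math.round(relativeIncrease * 100); // 42
```

That is 148 extra minutes per pull request on average, roughly a 42% increase over the pre-tool baseline.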

Three forces compound here. The author writes the PR faster (true). The team adds another reviewer to the loop, the AI tool (true). Both the author and the human reviewer now have to triage the AI tool's comments (also true). The net effect on time-to-merge is rarely the speed-up the headline promises.

Mean review rounds to merge, by author-reviewer composition. The middle bar is derived from Han et al.'s 11.8% finding: AI-authored PRs reviewed by an unaided AI reviewer take more rounds than the human-authored baseline, because the AI reviewer's findings have to be triaged by the same human downstream who would have caught the same things faster reading the diff. The third bar shows what happens when the AI reviewer can ground its claims in a deterministic reference: round count returns to near-baseline. The reduction is not because the AI reviewer surfaces fewer findings. It is because each finding carries enough evidence for the author to act on it without a second clarifying round.

Baseline (left): Han et al., human-authored / human-reviewer cohort. Middle: Han et al., AI-authored cohort, mean rounds adjusted using the +11.8% finding. Right: Revund pilot. Round count is measured as the number of comment-revision cycles between PR open and PR merge.

The shape of the cost matters. Each extra review round is not just additional time on the clock. It is a context switch for the author, a re-read for the reviewer, and a credibility tax on the tool that produced the comment that caused the round. Tools whose findings consistently require a clarifying round get filtered out of the workflow within weeks (we covered the trust-decay curve in post #2).

This is where the mirror problem stops being academic. It pays for itself, in working hours, in every codebase where an AI reviewer is grading AI code without an external reference.

What "external reference" actually means

Strip the metaphor. A reference in code review is anything the reviewer can check against that exists outside the reviewer's own judgement. The reference is the part of the world that, if the reviewer claims something about the code, can be used to settle the claim without running the review again.

The cheapest example: the compiler. If a reviewer says "this returns User | undefined and you're treating it as User", the TypeScript compiler can confirm or refute that claim in milliseconds. The reviewer is not asserting an opinion; the reviewer is quoting a fact already established by a deterministic process the author can also run.
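As a rough illustration of "quoting a fact" rather than asserting an opinion, the sketch below settles a reviewer's claim against a list of compiler diagnostics. The `Diagnostic` and `Claim` shapes are simplified assumptions, not tsc's actual output format or Revund's types, and the sample diagnostic is invented.

```typescript
// Hypothetical, simplified shape of one compiler diagnostic.
interface Diagnostic {
  file: string;
  line: number;
  message: string;
}

// What the reviewer asserts: something is wrong at file:line, and the
// compiler's message should mention a given fragment.
interface Claim {
  file: string;
  line: number;
  mustMention: string;
}

type Verdict = "confirmed" | "unsupported";

// A claim counts as confirmed only if a diagnostic at the same location backs it.
function settleClaim(claim: Claim, diagnostics: Diagnostic[]): Verdict {
  const backed = diagnostics.some(
    (d) =>
      d.file === claim.file &&
      d.line === claim.line &&
      d.message.indexOf(claim.mustMention) !== -1
  );
  return backed ? "confirmed" : "unsupported";
}

// Invented diagnostic, for illustration only.
const diagnostics: Diagnostic[] = [
  {
    file: "src/users/profile.ts",
    line: 47,
    message: "Type 'User | undefined' is not assignable to type 'User'.",
  },
];

const grounded = settleClaim(
  { file: "src/users/profile.ts", line: 47, mustMention: "User | undefined" },
  diagnostics
);
const ungrounded = settleClaim(
  { file: "src/users/profile.ts", line: 90, mustMention: "null" },
  diagnostics
);
```

Here `grounded` is "confirmed" and `ungrounded` is "unsupported". Neither verdict required re-running the review, only a lookup in output the author can reproduce.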

Most failure modes of AI review map to a missing external reference. The reviewer "thinks" something looks wrong but has nothing to anchor the claim to. The author "thinks" something looks right and has the same lack of anchor. Without a reference, both are reasoning from prior probability over the same training distribution, and the prior is what produced the bug in the first place.

Reference	What the reviewer checks against	What an unaided LLM misses without it
Type system	tsc --noEmit · TypeScript compiler	Cross-file signature drift; nullable values used as non-null
Test suite	vitest · jest · pytest · go test	Behavioural regressions; off-by-one math the model also makes
Symbol resolver	ts-morph · LSP · tree-sitter	Renames, deletions, shadowed scopes, stale imports
Lint configuration	.eslintrc · ruff · golangci-lint	Patterns allowed in the language but banned in this repo
Per-repo conventions	.revund.yaml · CODEOWNERS · ADRs	Module-boundary violations; naming drift; rejected patterns

Five layers of external reference a deterministic reviewer can consult. Each row is a class of fact the reviewer can quote rather than assert. The right-hand column is the class of failure that lives outside the LLM's reach without it. These are not a feature list; they are a partial taxonomy of what the diff alone doesn't tell you. Any one of them, missing, leaves a class of bugs that the mirror cannot see.

Taxonomy is original to this post. Each reference type maps to a layer the Revund ContextBundle assembles for every review pass.

What each reference produces, the actual claim it makes available to the reviewer, is the part that matters most. A finding's value is set by whether the author can verify it without becoming the reviewer themselves. Compare the LLM-alone framing of a finding to the reference-grounded version:

Reference	LLM-alone framing	Reference-grounded claim	Engineer's next action
Type system	"This could be null somewhere"	tsc --strict line 47: User | undefined used as User	Open line 47; either guard or narrow
Test suite	"This might break existing behaviour"	test/orders.spec.ts › 'applies tax' fails at assertion line 22	Run the failing test; read the diff
Symbol resolver	"formatPrice may not exist"	formatPrice → utils/format.ts:14, signature (number, Currency) → string	Confirm call site matches signature
Lint configuration	"This pattern is unusual"	no-throw-in-fallible: line 12 throws inside Result<>	Refactor to Result.err()
Per-repo conventions	"Wrong module boundary"	.revund.yaml: src/auth cannot import from src/billing	Move dependency to shared/

The same finding, twice: how it reads when the reviewer is reasoning from prior alone, and how it reads when the reviewer can cite a deterministic reference. The right-hand column is the engineer's next action. With the reference-grounded version, that action is concrete and bounded: the engineer opens a specific line, runs a specific test, checks a specific signature. With the LLM-alone version, the action is 'become the reviewer.' That is the action no one has time for, and the reason most AI-review tools get muted in week three.

Drawn from the Revund pilot dataset; each row is a real finding shape with the framing chosen by the engine.

The point of Figure 5 is not that the reference-grounded findings sound more authoritative. It's that they cost the engineer less to act on. The author can open line 47, see the diagnostic, agree or disagree, and decide. They cannot do that with "this could be null somewhere" without spending the same five minutes the reviewer should have spent.
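One way to make that cost difference mechanical is to refuse to emit a finding that carries no evidence. The sketch below is an assumption about how such a shape could look. The `Finding` and `Evidence` types and the output format are illustrative and are not taken from Revund's engine.

```typescript
type Severity = "BLOCKER" | "WARNING";

// Hypothetical: one piece of evidence produced by a deterministic process.
interface Evidence {
  source: string; // e.g. "tsc", "test suite", "symbol resolver"
  detail: string; // the quoted output the author can re-check
}

interface Finding {
  severity: Severity;
  file: string;
  line: number;
  claim: string;
  evidence: Evidence[];
}

// Renders a finding, or returns null when there is nothing the author can verify.
function renderFinding(f: Finding): string | null {
  if (f.evidence.length === 0) {
    return null; // "this could be null somewhere" never reaches the author
  }
  const why = f.evidence.map((e) => e.source + ": " + e.detail).join("; ");
  return `${f.severity}, ${f.file}:${f.line}, ${f.claim}. Why: ${why}.`;
}

const groundedFinding = renderFinding({
  severity: "BLOCKER",
  file: "src/users/profile.ts", // invented path, for illustration
  line: 47,
  claim: "User | undefined used as User",
  evidence: [{ source: "tsc", detail: "diagnostic reported at line 47" }],
});

const vagueFinding = renderFinding({
  severity: "WARNING",
  file: "src/users/profile.ts",
  line: 47,
  claim: "this could be null somewhere",
  evidence: [],
});
```

Dropping evidence-free findings is a deliberately strict choice for the sketch. A gentler variant could keep them but mark them as unverified.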

Three failure shapes, in code

That is the abstract argument. Here is how it looks in code. Three failure modes show up repeatedly in the pilot data, and each maps to a different missing reference.

1. Cross-file signature drift

The author writes:

// src/checkout/handler.ts
import { formatPrice } from "../utils/format";
 
export function renderReceipt(order: Order): string {
  return order.lines
    .map((line) => `${line.name}: ${formatPrice(line.amount)}`)
    .join("\n");
}

Looks fine. On a branch cut before the change, it even compiles. The author asks the AI reviewer for a check; the reviewer reads the diff and says "looks good." Both are wrong. Two days ago, in an unrelated PR, formatPrice was changed from (amount: number) => string to (amount: number, currency: Currency) => string. The new signature requires the currency. The author's call site doesn't pass it.

A reviewer with no symbol resolver can't see this. The diff does not include utils/format.ts. The model cannot know what formatPrice looks like now versus what it looked like in its training data. The bug is invisible from the diff.

A reviewer with a symbol resolver resolves formatPrice to its current declaration before the pass runs and includes the resolved signature in the bundle. The model now has the full type as a fact it can quote, not a prior it has to guess at. The finding writes itself:

BLOCKER, src/checkout/handler.ts:5, formatPrice is called with 1 argument but its signature requires 2. Why: formatPrice resolves to utils/format.ts:14 with signature (amount: number, currency: Currency) => string. The currency argument was added in PR #4421 two days ago. Without it, every line item in the receipt renders with the default currency, which is USD regardless of the order's actual currency.
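The check behind a finding like this can be very small once the declaration has been resolved. The sketch below assumes a resolver has already produced the declaration and that call sites have been extracted from the diff. `ResolvedDecl` and `CallSite` are hypothetical shapes, not ts-morph's API or the ContextBundle's types.

```typescript
// Hypothetical: a declaration as a symbol resolver might report it.
interface ResolvedDecl {
  name: string;
  declaredAt: string; // "file:line" of the current declaration
  requiredParams: string[];
}

// Hypothetical: a call site extracted from the diff.
interface CallSite {
  callee: string;
  file: string;
  line: number;
  argCount: number;
}

// Returns a finding when a call passes fewer arguments than the current
// declaration requires; returns null when there is nothing to report.
function checkArity(call: CallSite, decls: ResolvedDecl[]): string | null {
  const decl = decls.filter((d) => d.name === call.callee)[0];
  if (!decl) return null; // unresolved symbol: nothing to quote, so no claim
  const required = decl.requiredParams.length;
  if (call.argCount >= required) return null;
  return (
    `${call.file}:${call.line}, ${call.callee} is called with ` +
    `${call.argCount} argument(s) but its signature requires ${required} ` +
    `(declared at ${decl.declaredAt})`
  );
}

const decls: ResolvedDecl[] = [
  { name: "formatPrice", declaredAt: "utils/format.ts:14", requiredParams: ["amount", "currency"] },
];

const driftFinding = checkArity(
  { callee: "formatPrice", file: "src/checkout/handler.ts", line: 5, argCount: 1 },
  decls
);
```

In a TypeScript project the compiler reports this arity mismatch itself once it type-checks the merged code. The sketch only shows why the claim needs the resolved declaration, which the diff does not contain.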

2. Same-mistake bugs

The author writes:

// src/billing/invoice.ts
function daysOverdue(invoice: Invoice): number {
  const today = new Date();
  const due = new Date(invoice.dueDate);
  return Math.floor((today.getTime() - due.getTime()) / (1000 * 60 * 60 * 24));
}

The bug: this returns the wrong number across DST boundaries. When the two timestamps are local times and the interval between them spans a DST transition (dueDate set in summer and today in winter, or vice versa), the elapsed milliseconds come out an hour short or an hour long, and the floored day count can be off by one when both timestamps fall near the same time of day, typically local midnight. The off-by-one in date math is one of the most common bugs in any codebase the model has seen, and the model also writes the same bug, because the training distribution is dominated by code that has the same bug.

The mirror reviewer reads the diff and either says "looks good" (most likely) or "watch DST" with no scoped claim (still useless). The bug survives.

A reviewer with the test suite as a reference flags it because there's a test:

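// tests/billing/invoice.spec.ts
// Assumes the suite runs in a DST-observing zone (e.g. TZ=America/New_York).
it("counts days correctly across DST", () => {
  // Local-time timestamps (no "Z"): US clocks spring forward on 2026-03-08,
  // so these two local midnights are 31 days minus one hour apart.
  const invoice = { dueDate: "2026-03-01T00:00:00", /* ... */ };
  mockDate("2026-04-01T00:00:00"); // pins the clock that daysOverdue reads
  expect(daysOverdue(invoice)).toBe(31);
});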

That test fails. The diagnostic is in the bundle. The finding cites it:

BLOCKER, src/billing/invoice.ts:5, daysOverdue is off by one across DST. Why: tests/billing/invoice.spec.ts › 'counts days correctly across DST' expects 31 and receives 30 after this change. The fix is to use a date library that operates on calendar days rather than millisecond differences (e.g., differenceInCalendarDays from date-fns).

The test was the reference. Without it, the reviewer was working from prior, and the prior produces the bug.
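For readers who want to see the shape of the fix without pulling in a dependency, here is a minimal calendar-day difference. It is a hand-rolled sketch, not date-fns's implementation, and the function name is made up. It compares local calendar dates rather than elapsed milliseconds.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Number of calendar days from `from` to `to`, using each date's local
// year/month/day. Projecting both dates onto UTC midnight removes DST from
// the subtraction, because UTC has no clock changes.
function calendarDaysBetween(from: Date, to: Date): number {
  const fromDay = Date.UTC(from.getFullYear(), from.getMonth(), from.getDate());
  const toDay = Date.UTC(to.getFullYear(), to.getMonth(), to.getDate());
  return Math.round((toDay - fromDay) / MS_PER_DAY);
}

const due = new Date(2026, 2, 1);   // 1 March 2026, local midnight
const now = new Date(2026, 3, 1);   // 1 April 2026, local midnight
const overdueDays = calendarDaysBetween(due, now); // 31 in any time zone
```

The time of day is discarded on purpose: an invoice due at 23:30 yesterday is one calendar day overdue at 00:15 today, which is usually what billing code means by "days overdue".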

3. Convention drift

The author writes:

// src/auth/session.ts
export async function getSession(token: string): Promise<Session> {
  const decoded = await verifyToken(token);
  if (!decoded) {
    throw new InvalidTokenError("Token verification failed");
  }
  return decoded;
}

Valid TypeScript. Compiles. Passes tests. The mirror reviewer reads it and approves.

But this repo's .revund.yaml says:

patterns:
  src/auth/**:
    - no-throw-in-fallible: error
    - prefer-result: error

…and an ADR from three quarters ago, also indexed in the bundle, says: "All fallible operations in src/auth/* return Result<T, E> to ensure errors are explicit at every call site. Throws are not caught uniformly and have caused two production incidents."

A reviewer without those references treats this as a stylistic preference and either ignores it or hedges. A reviewer with them flags it with an evidence trail:

WARNING, src/auth/session.ts:4, throws inside a fallible operation in src/auth/*. Why: per .revund.yaml's no-throw-in-fallible rule (enforced) and ADR-0021 ("No throws in src/auth/*"), this module returns Result<Session, AuthError> for all error paths. Two production incidents in 2024-Q4 traced to uncaught throws from this module are the basis for the rule. Refactor to return Result.err(new InvalidTokenError(...)) and update the call sites.

Same finding. Two completely different fates. The difference is whether the reviewer has access to the rules the team has already agreed on.
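A path-scoped rule of this kind can be approximated in a few lines. The sketch below makes several simplifying assumptions: the rules are held as a plain object keyed by path prefix (standing in for the parsed YAML and its glob), and `throw` statements are found by scanning lines with a regular expression. A real lint rule would walk the syntax tree instead.

```typescript
// Hypothetical in-memory form of path-scoped rules; "src/auth/" stands in
// for the "src/auth/**" glob.
type RuleSet = { [pathPrefix: string]: string[] };

interface Violation {
  rule: string;
  file: string;
  line: number;
}

// Flags lines that start a throw statement in files covered by the rule.
function findThrowViolations(file: string, source: string, rules: RuleSet): Violation[] {
  const covered = Object.keys(rules).some(
    (prefix) =>
      file.indexOf(prefix) === 0 &&
      rules[prefix].indexOf("no-throw-in-fallible") !== -1
  );
  if (!covered) return [];
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    if (/^\s*throw\b/.test(text)) {
      violations.push({ rule: "no-throw-in-fallible", file: file, line: index + 1 });
    }
  });
  return violations;
}

const repoRules: RuleSet = {
  "src/auth/": ["no-throw-in-fallible", "prefer-result"],
};

const sessionSource = [
  "export async function getSession(token: string): Promise<Session> {",
  "  const decoded = await verifyToken(token);",
  "  if (!decoded) {",
  '    throw new InvalidTokenError("Token verification failed");',
  "  }",
  "  return decoded;",
  "}",
].join("\n");

const authViolations = findThrowViolations("src/auth/session.ts", sessionSource, repoRules);
const billingViolations = findThrowViolations("src/billing/session.ts", sessionSource, repoRules);
```

The same source yields one violation under `src/auth/` and none under `src/billing/`. The rule lives in the repository, not in the code, which is why the diff alone cannot reveal it.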

Why this is the foundation of Revund's review engine

Everything in Revund's architecture is shaped around the mirror problem. The single most important data structure in the system is the ContextBundle, and it exists specifically to be the external reference for every review pass:

// from core/types.go
type ContextBundle struct {
    Diff        string
    Files       []FileContent      // full content of changed files
    Symbols     []SymbolDecl       // external declarations resolved by ts-worker
    Diagnostics []TscDiagnostic    // tsc --noEmit output
    PRMeta      PRMetadata         // title, description, ticket ref
    RepoConfig  RepoConfig         // per-repo rules from .revund.yaml
}

Three properties make the bundle function as a reference, and any one of them missing collapses the whole argument.

1. It is deterministic. Given the same PR, the bundle is byte-for-byte identical across runs. We hash it and fail the run if two builds of the same PR produce different bundles. Non-determinism in the reference would mean the reviewer is inconsistently grounded across passes, and the correlation-breaking argument fails: passes would diverge for reasons unrelated to the code under review.
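A determinism check of this kind can be sketched as canonical serialisation followed by a hash. The example below is an assumption about the general technique, not Revund's code. It sorts object keys so that logically equal bundles serialise identically, and it uses 32-bit FNV-1a purely as a dependency-free stand-in for a cryptographic hash such as SHA-256.

```typescript
// Serialise with object keys sorted, so key order cannot change the output.
// Assumes JSON-compatible values (no functions, undefined, or cycles).
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as { [key: string]: unknown };
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// 32-bit FNV-1a over UTF-16 code units. Stand-in only: not collision-resistant.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts to stay within 32 bits.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

function bundleHash(bundle: unknown): string {
  return fnv1a(canonicalize(bundle));
}

// Two builds of the "same" bundle with different key order, plus one real change.
const buildA = bundleHash({ diff: "+ return x;", symbols: ["formatPrice"] });
const buildB = bundleHash({ symbols: ["formatPrice"], diff: "+ return x;" });
const buildC = bundleHash({ diff: "+ return y;", symbols: ["formatPrice"] });
```

`buildA` and `buildB` match while `buildC` differs, so a run can compare hashes and fail when two builds of the same PR disagree. Array order is kept significant here; whether symbol lists should be sorted first is a design decision the sketch leaves open.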

2. It is built by infrastructure that is not the LLM. The symbol resolver is ts-morph running in a Node.js sidecar, not the model. The diagnostics are real tsc --noEmit output. The repo config is parsed YAML. Each of these is something the reviewer can be wrong about, but each one is itself not subject to the same training-distribution failure mode. They are the part of the world the reviewer is anchored to.

3. It is shared identically across passes. The security pass, the performance pass, the architecture pass, and the style pass all read the same bundle. They don't each re-derive a worldview from the diff. When the security pass says "the JWT secret falls back to a literal", it's quoting the same file content the architecture pass quoted when it flagged the same module for missing abstraction. The bundle is a shared coordinate system.

The four specialist passes are themselves a second layer of correlation-breaking. Even with a perfect external reference, a single LLM prompt is still a single perspective. Four passes with different prompts, different output schemas, and different rationale shapes give us four perspectives anchored to the same reference. A finding that surfaces in only one pass is signal, a specialist concern. A finding that surfaces in three or four is consensus risk. Either is useful; both are surfaced with the pass label so the human can weight them.
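The specialist-versus-consensus distinction can be sketched as a grouping step over per-pass findings. Two assumptions are made here that the post does not state: findings are treated as "the same" when they share a file and line, and the two-pass case, which the text leaves unclassified, is given the placeholder label "shared". File names in the sample are invented.

```typescript
type Pass = "security" | "performance" | "architecture" | "style";

interface PassFinding {
  pass: Pass;
  file: string;
  line: number;
}

interface GroupedFinding {
  file: string;
  line: number;
  passes: Pass[];
  label: "specialist" | "shared" | "consensus";
}

// Groups findings by location and labels each group by how many distinct
// passes raised it: 1 = specialist, 3 or more = consensus, 2 = shared (assumed).
function groupAcrossPasses(findings: PassFinding[]): GroupedFinding[] {
  const byLocation: { [key: string]: GroupedFinding } = {};
  const order: string[] = [];
  findings.forEach((f) => {
    const key = f.file + ":" + f.line;
    if (!byLocation[key]) {
      byLocation[key] = { file: f.file, line: f.line, passes: [], label: "specialist" };
      order.push(key);
    }
    if (byLocation[key].passes.indexOf(f.pass) === -1) {
      byLocation[key].passes.push(f.pass);
    }
  });
  return order.map((key) => {
    const group = byLocation[key];
    const count = group.passes.length;
    group.label = count >= 3 ? "consensus" : count === 1 ? "specialist" : "shared";
    return group;
  });
}

const grouped = groupAcrossPasses([
  { pass: "security", file: "src/auth/jwt.ts", line: 12 },
  { pass: "architecture", file: "src/auth/jwt.ts", line: 12 },
  { pass: "style", file: "src/auth/jwt.ts", line: 12 },
  { pass: "performance", file: "src/db/query.ts", line: 40 },
]);
```

Keeping the pass names on each group, rather than collapsing them into a score, matches the idea that the human should be able to weight a specialist concern differently from a consensus one.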

This is also why ts-worker runs as a separate Node.js process talking gRPC to the Go core, rather than coaxing the LLM into doing type analysis itself. ts-morph and the TypeScript compiler are external references: we can verify what they return. We cannot verify a model's claim that "this looks type-safe." The architecture is the assertion that any claim made about the code should be checkable against something that isn't another claim made about the code.

What you can do today (regardless of tooling)

The mirror problem is not a Revund problem. It is a code-review-with-AI problem, and it shows up the moment AI is present on either side of the review. Three things you can change in your team's workflow this week without buying anything:

1. Run CI before review, not after

Most teams treat CI as a gate after review. Treat it as a gate before. If you sit down to review a PR and you do not have a green CI run that proves the type system and tests pass, you are being asked to be the reference yourself, at a far slower clock rate than the compiler. That is the worst use of a senior engineer's time. Block reviews on green CI and the type system becomes the cheapest reviewer on the team.

2. Calibrate your linter to the team's actual conventions, not the language's defaults

A lint config that matches the language's defaults catches what the language has already opined on. A lint config that matches your team's actual conventions catches what your team has opined on but the language hasn't. The second is the bigger source of value, and it is a deterministic external reference that no LLM can replicate from the diff alone. Every rule you add to .eslintrc (or ruff, or golangci-lint) is a coordinate the reviewer can anchor to. Custom rules pay back ten times the effort to write them.

3. Require a referenceable claim in every review comment

"I think this is wrong" is a hypothesis. "tsc --strict flags this on line 47" is a referenceable claim. "Our team's pattern in src/auth/* does X, see ADR-0021" is a referenceable claim. The discipline of referencing pulls reviewers off the same training distribution as the author and onto the world they share. In every dataset we have looked at, the action rate of comments that cite a reference is materially higher than comments that don't, and that finding does not depend on whether the reviewer was a human or an AI.

These three changes don't require new tooling. They require treating the world outside the model as a first-class participant in the review.
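For teams that later want to nudge the third habit along, the check is simple enough to sketch. The patterns below are illustrative assumptions about what counts as a reference (a file path with a line number, an ADR id, or a named deterministic tool). Any real team would tune the list to its own conventions.

```typescript
// Hypothetical patterns for "this comment points at something checkable".
const REFERENCE_PATTERNS: RegExp[] = [
  /\b[\w./-]+\.\w+:\d+\b/, // file path with line number, e.g. src/auth/session.ts:4
  /\bADR-\d+\b/, // architecture decision record id
  /\b(tsc|eslint|ruff|golangci-lint|vitest|jest|pytest)\b/, // named tool
];

// True when a review comment cites at least one recognisable reference.
function citesReference(comment: string): boolean {
  return REFERENCE_PATTERNS.some((pattern) => pattern.test(comment));
}

const hypothesis = citesReference("I think this is wrong");
const toolClaim = citesReference("tsc --strict flags this on line 47");
const adrClaim = citesReference("Our team's pattern in src/auth/* does X, see ADR-0021");
```

A pattern match only shows that a comment names something checkable, not that the cited output actually supports the claim. It is a prompt for the reviewer, not a verdict.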

Methodology and references

For the figures in this post:

Figure 1 is a directional reconstruction following the planted-bug-corpus methodology described in Zietsman, The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review, arXiv 2603.25773 (2026). The paper publishes per-model rates for several bug categories but does not publish the full bug × reviewer matrix. Values plotted are calibrated to the paper's directional claims and to per-class catch rates from the Revund pilot (n = 14 teams, 6 weeks, planted-bug subset). The qualitative pattern (security-class bugs robust to mirror review, structural-class bugs collapsed) replicates across both datasets.
Figures 2 and 3 combine published findings with pilot data. Bars sourced from Han et al., Human-AI Synergy in Agentic Code Review, arXiv 2603.15911 (2026) (n = 278,790 review conversations across 300 projects) summarise the paper's directional findings; exact per-category percentages are not all published in the paper and are plotted at the midpoint of the ranges the authors report. The third bar in each figure is the Revund pilot; methodology and per-team breakdown will be published with the next post in this series. The ICSE 2025 SEIP figure cited in the "reviewing-AI-code tax" section (PR closure duration from 5h 52m to 8h 20m) is from Automated Code Review in Practice, arXiv 2412.18531.
Figures 4 and 5 are taxonomies original to this post. The reference categories map to the layers the Revund ContextBundle assembles for every pass; the deterministic-bundle property is verifiable by running the engine against the same PR twice and diffing the resulting bundle (no output should differ).
The "correlated failure" framing is Zietsman's. The three hypotheses in that paper (homogeneous-pipeline error correlation, executable specifications as a domain transition, and the "well-defined residual") are the foundation of the architectural argument here. The empirical evidence in the paper is what its author calls "directional, not a controlled demonstration"; we treat it the same way.
The Han et al. 11.8% finding (more rounds when reviewing AI code) is one data point from one large study. We have not seen this number reproduced independently and are using it as the upper bound of a real-but-narrow effect.

We are being explicit about which numbers are from public peer-reviewed research and which are from our internal pilot because the field has a problem with tools quoting internal numbers they never expose. Ours will be exposed when we are ready to defend them. That post is being drafted.

If the reviewer in your workflow today is a model that reads the diff and nothing else, you don't have a reviewer. You have a second author with the same blind spots as the first. Email hello@revund.dev if you want to see what changes when the reviewer reads the bundle.