
The 200-line cliff: what 50 years of code review research actually says

Every senior engineer has approved a 1,200-line PR they barely read. We tell ourselves it was fine because the author was trusted, the tests passed, or the change was "mostly mechanical." We're not lying. We just don't know.

The research disagrees with our self-image. Decades of empirical studies, across industries and codebases, converge on the same finding: past about 200 lines of changed code, reviewer defect-detection drops off a cliff. Not gently. Not gradually. Off a cliff.

This post is a tour of that research: what was measured, how, by whom, and what we can fairly conclude from it. The conclusions are the foundation for how we built Revund's review engine. But the data stands on its own: even if you never use an AI reviewer, it should change how you size your PRs.

The canonical finding: SmartBear / Cisco, 2006

The most-cited code review study in software engineering is a 2006 case study of Cisco Systems' MeetingPlace product, conducted with SmartBear Software. It tracked 2,500 code reviews across 50 engineers over 10 months. At the time, it was one of the largest empirical datasets on review effectiveness ever published.

The study measured two metrics for each review:

  1. Defect density: bugs found per kLOC reviewed
  2. Reviewer effort: lines of code the reviewer covered per hour
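To make the two metrics concrete, here is a minimal sketch of how each could be computed from a single review record. The `ReviewSession` shape and its field names are our own illustration, not the study's data schema.

```typescript
// Hypothetical record of one review session; field names are illustrative only.
interface ReviewSession {
  linesReviewed: number; // changed lines of code the reviewer covered
  defectsFound: number;  // defects the reviewer logged
  minutesSpent: number;  // active review time in minutes
}

// Defect density: defects found per 1,000 lines reviewed (kLOC).
function defectDensity(s: ReviewSession): number {
  if (s.linesReviewed <= 0) return 0;
  return (s.defectsFound * 1000) / s.linesReviewed;
}

// Reviewer effort: lines covered per hour of review time.
function linesPerHour(s: ReviewSession): number {
  if (s.minutesSpent <= 0) return 0;
  return (s.linesReviewed * 60) / s.minutesSpent;
}
```

For example, a session that covers 200 lines in 30 minutes and logs 3 defects works out to 15 defects per kLOC at 400 lines per hour.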

The finding that became the citation everyone in the industry knows:

[Figure 1: Defect detection (%) plotted against lines of code reviewed (0 to 800), with the ~200 LOC threshold marked.]
Defect-detection effectiveness falls sharply past ~200 lines of changed code. Reviewers covering more lines per session find proportionally fewer bugs.
Source: SmartBear / Cisco, 2006 (curve reconstructed from published findings).

What the curve says, in plain English:

  • A reviewer working through a 100-line change catches roughly the bugs a careful first-principles read can catch.
  • At 200 lines, they're still effective.
  • At 400 lines, they catch fewer than half as many.
  • At 800 lines and up, they're spotting the obvious and missing the rest.

The mechanism isn't laziness. It's cognitive load: a code review is an exercise in holding the state of a system in your head and asking "what changes if this lands?" That working set has limits.

The follow-up nobody quotes: Bacchelli & Bird, ICSE 2013

The SmartBear finding became the bumper sticker. The deeper question (why do reviewers miss bugs as PR size grows?) wasn't answered until Bacchelli and Bird's 2013 paper at ICSE, "Expectations, Outcomes, and Challenges of Modern Code Review."

The study surveyed and shadowed engineers at Microsoft across 873 reviews. The headline finding: what reviewers think code review is for, and what they actually accomplish during it, are wildly different.

[Figure 2: Bar chart comparing what reviewers did with what they thought review was for. Labelled values: find defects 14%, improve code 22%, knowledge transfer 23%, alternative solutions 11%, team awareness 9%, style / conventions 21%.]
Reviewers report defect-finding as the primary purpose of code review, but their actual output skews heavily toward style and knowledge-transfer. The gap widens at larger PR sizes.
Source: Bacchelli & Bird, ICSE 2013 · figures from Table 3.

44% of surveyed engineers said the primary purpose of code review is defect detection. But when the same engineers' actual review comments were classified, only 14% were defect-related. The rest were code-quality suggestions, style nitpicks, knowledge-transfer questions, and design discussion.

This is not a moral failing. The reviewers wanted to find bugs. They just couldn't: once the change exceeded a certain size, their bug-hunting capacity was already spent on the first few hunks, and what survived into the comments was whatever was easiest to spot at a glance.

Bacchelli & Bird put it more carefully:

"Reviewers are mostly focused on understanding the change, and on suggesting improvements to it, rather than finding defects. The defect-finding aspect, while reported as primary, is often a side-effect of the understanding process."

In other words: the SmartBear curve isn't just about attention dropping past 200 lines. It's about what survives the attention budget. Defect-finding is the most expensive activity in a review, and it's the first to get cut when the diff gets long.

Sanity-checking the threshold across studies

The "~200 LOC" number isn't a single-study artefact. It shows up across at least three independent research lines:

| Study | Year | Sample | Threshold finding |
| --- | --- | --- | --- |
| Cohen, Best Kept Secrets of Peer Code Review | 2006 | Cisco MeetingPlace · 2,500 reviews | Detection drops sharply past 200 LOC, near-zero past 400 LOC |
| Bacchelli & Bird (ICSE) | 2013 | Microsoft · 873 reviews · 8 mo. | Reviewer comment quality degrades past ~100–300 LOC depending on file count |
| McConnell, Code Complete (2e) | 2004 | Meta-analysis · 50+ studies | Inspection efficiency peaks at 100–200 LOC, plateaus, then decays |
| Kemerer & Paulk, IEEE TSE | 2009 | 20+ industrial inspections | Defect-removal effectiveness inversely correlates with review rate above 200 LOC/hr |

These were measured in different industries, on different languages, with different review tools. They converge on a number that's stayed remarkably stable for nearly two decades. If a finding survives that many independent replications, it's probably real.

What's changed since 2006: PR size has gone up

Here's the uncomfortable part. The original studies were conducted on diffs that were already in the 100–800 LOC range, because that's what humans were producing. Since 2020, two forces have pushed average PR size up:

  1. AI-assisted authoring. Copilot, Cursor, and their successors don't just complete one line; they generate entire functions, classes, even whole modules in one suggestion. The unit of code production has scaled up by an order of magnitude.

  2. Monorepo refactor PRs. Tooling like ts-morph, jscodeshift, and codemod has made the 5,000-line "rename across the codebase" PR routine. Some are mechanical. Others sneak in semantic changes.

Industry data from GitClear's 2024 report on the impact of AI assistants found:

  • The percentage of code that's reverted within two weeks nearly doubled between 2021 and 2023, from ~3% to ~5.5%.
  • "Copy-pasted" code (code that exactly duplicates an existing block within the repo) rose from ~8% of net new lines to ~12% over the same period.
  • Median PR size grew by approximately 47% in projects with high AI-assistant adoption.

This isn't an indictment of AI assistants. It's the obvious second-order effect: when authoring gets cheaper, the artefact being authored gets larger, and the review of that artefact runs straight into the SmartBear cliff.

What you can do today (regardless of tooling)

The research has prescriptions, even if you never run an AI reviewer:

1. Cap PR size by policy, not by hope

The SmartBear data suggests 200 lines of changed code per PR is the ceiling where careful review is plausible. That's the number to write into your team's PR template. Anything bigger needs justification.

For perspective, here's what 200 LOC actually looks like: about the length of a typical service-class refactor in TypeScript.

// ~200 LOC = one focused refactor.
// More than this and the SmartBear data says
// review effectiveness drops by half.
 
export class TokenSigner {
  // ... 12 fields
  // ... 8 methods
  // ... maybe 1 helper class
}

Past this scale, the data says you should split.
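One way to turn the cap into policy rather than hope is a small check in CI. The sketch below assumes you feed it the text output of `git diff --numstat` for the PR (one `added<TAB>deleted<TAB>path` line per file, with `-` in both count columns for binary files). The function names and the 200-line default are our own illustration, not part of any tool.

```typescript
// Assumed policy ceiling, taken from the ~200 LOC figure discussed above.
const MAX_CHANGED_LINES = 200;

// Sum added + deleted lines from `git diff --numstat` text output.
function countChangedLines(numstat: string): number {
  let total = 0;
  for (const line of numstat.split("\n")) {
    const [added, deleted] = line.split("\t");
    if (added === undefined || deleted === undefined) continue; // blank or malformed line
    const a = parseInt(added, 10);
    const d = parseInt(deleted, 10);
    if (isNaN(a) || isNaN(d)) continue; // binary files report "-" instead of counts
    total += a + d;
  }
  return total;
}

interface SizeCheck {
  changedLines: number;
  withinLimit: boolean;
}

// Compare the PR's changed-line total against the policy ceiling.
function checkPrSize(numstat: string, limit: number = MAX_CHANGED_LINES): SizeCheck {
  const changedLines = countChangedLines(numstat);
  return { changedLines, withinLimit: changedLines <= limit };
}
```

A CI step could fail the build, or simply require a justification label, whenever `withinLimit` is false. Counting added plus deleted lines is a design choice here; a team might reasonably exclude generated files or lockfiles.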

2. Time-box reviews to 60 minutes

Cohen also found that reviewer attention degrades quickly past the one-hour mark. Sessions that exceed it stop finding bugs and start surfacing nitpicks (the pattern Bacchelli & Bird later quantified). If a PR can't be reviewed in 60 minutes, split the PR; don't extend the review.
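As a rough planning aid, the time-box can be turned into arithmetic. The sketch below assumes you know your team's typical review speed in lines per hour; that figure is an input you measure yourself, and the function names are ours.

```typescript
// Session ceiling from the one-hour guidance above.
const SESSION_LIMIT_MINUTES = 60;

// Estimated review time for a change at a given review speed (lines per hour).
function estimateReviewMinutes(changedLines: number, linesPerHour: number): number {
  if (linesPerHour <= 0) throw new Error("linesPerHour must be positive");
  return (changedLines * 60) / linesPerHour;
}

// Smallest number of roughly equal PRs such that each fits in one session.
function suggestedSplitCount(changedLines: number, linesPerHour: number): number {
  const minutes = estimateReviewMinutes(changedLines, linesPerHour);
  return Math.max(1, Math.ceil(minutes / SESSION_LIMIT_MINUTES));
}
```

At an assumed 300 lines per hour, a 900-line PR needs about 180 minutes of review, so the sketch suggests splitting it into three.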

3. Separate refactor PRs from feature PRs

Mechanical refactors (renames, type tightening, formatter changes) read at a different cognitive load from semantic changes. If they're mixed in one PR, the reviewer's attention budget gets spent on the mechanical noise and the semantic change slides through. Two separate PRs, both labelled, is the correct shape.
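If your team already writes commit subjects in the Conventional Commits style (`refactor: ...`, `feat(auth): ...`), a mixed PR can be flagged mechanically. The sketch below is one possible heuristic. The mapping of prefixes to "mechanical" or "semantic" is our assumption and will not fit every team (a `refactor` commit can easily carry behaviour changes).

```typescript
type ChangeKind = "mechanical" | "semantic" | "unknown";

// Assumed mapping from commit-type prefixes to change kinds; adjust to taste.
const KIND_BY_PREFIX: { [prefix: string]: ChangeKind } = {
  refactor: "mechanical",
  style: "mechanical",
  chore: "mechanical",
  feat: "semantic",
  fix: "semantic",
  perf: "semantic",
};

// Classify one commit subject such as "feat(auth): add key rotation".
function classifyCommit(subject: string): ChangeKind {
  const match = /^([a-z]+)(\([^)]*\))?!?:/.exec(subject.trim());
  if (!match) return "unknown";
  const prefix = match[1] || "";
  // Own-property check so names like "constructor" don't hit Object.prototype.
  if (!Object.prototype.hasOwnProperty.call(KIND_BY_PREFIX, prefix)) return "unknown";
  return KIND_BY_PREFIX[prefix];
}

// True when a PR's commits contain both mechanical and semantic changes.
function mixesMechanicalAndSemantic(subjects: string[]): boolean {
  const kinds = subjects.map(classifyCommit);
  return kinds.indexOf("mechanical") !== -1 && kinds.indexOf("semantic") !== -1;
}
```

A PR bot could use `mixesMechanicalAndSemantic` to ask the author to split before a reviewer is assigned. It only sees what the commit subjects claim, so it complements the label rather than replacing it.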

4. Stop counting reviews, count bugs caught in review

This is the harder cultural change. Teams measure "PRs reviewed per week" because it's easy. The SmartBear data says that metric encourages exactly the wrong behaviour: faster reviews of bigger PRs, which is the regime where defect-detection falls toward zero.

The right metric is bugs caught at review (not in CI, not in prod). That's also the metric Revund optimises against, but the lesson generalises.
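A minimal sketch of that metric, assuming you can tag each defect with the stage where it was first caught. The three-stage breakdown and the names are our own simplification.

```typescript
// Hypothetical per-period counts of defects, keyed by where each was first caught.
interface DefectCounts {
  review: number;     // caught by a human or automated reviewer before merge
  ci: number;         // caught by tests or other CI checks
  production: number; // escaped to production
}

// Fraction of all known defects that were caught at review.
function reviewCatchRate(c: DefectCounts): number {
  const total = c.review + c.ci + c.production;
  return total === 0 ? 0 : c.review / total;
}
```

With 6 defects caught in review, 10 in CI, and 4 in production, the review catch rate is 6 / 20 = 0.3. Tracking this ratio over time rewards finding bugs, not closing reviews.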

Why this is the foundation of Revund's review engine

Everything above is the human side of the curve. The reason we built Revund the way we did (four specialist passes, a deterministic ContextBundle, a required why field on every finding) comes directly from this research:

  • The 200-line cliff is the reason multi-pass matters. A single LLM prompt over a 1,500-line PR has the same problem a human reviewer does: attention budget. Four passes, each focused on a single concern, each looking at the same bundle, is the algorithmic analogue of "split the review by concern."

  • The Bacchelli & Bird gap (14% defect-finding) is the reason the why field is required. If most human review comments aren't actually defects, an AI tool that produces unexplained "findings" just adds noise to that gap. We force each finding to commit to a rationale before it surfaces, the same bar a senior engineer would meet.

  • The diff-size scaling problem is the reason ContextBundle is deterministic. AI tools that re-process the raw diff every time inherit the human's cognitive-load problem at the algorithmic layer. Bundling the inputs once and feeding them identically to every pass removes the size-dependent variability that would otherwise make findings non-reproducible.
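To illustrate the shape of those three ideas together, here is a deliberately simplified sketch. It is not Revund's implementation: apart from the name `ContextBundle` and the `why` field, every type, field, and function below is hypothetical.

```typescript
// Hypothetical bundle contents; the real ContextBundle is not described in this post.
interface ContextBundle {
  readonly diff: string;
  readonly files: ReadonlyArray<string>;
}

// What a pass emits before it is attributed to that pass.
interface RawFinding {
  message: string;
  why: string; // rationale; findings without one are dropped
}

interface Finding extends RawFinding {
  pass: string;
}

// One specialist pass, focused on a single concern.
interface ReviewPass {
  name: string;
  run: (bundle: ContextBundle) => RawFinding[];
}

// Build the bundle once, freeze it, and hand the identical input to every pass.
function runPasses(bundle: ContextBundle, passes: ReviewPass[]): Finding[] {
  const frozen: ContextBundle = Object.freeze({
    diff: bundle.diff,
    files: Object.freeze(bundle.files.slice()),
  });
  const findings: Finding[] = [];
  for (const pass of passes) {
    for (const raw of pass.run(frozen)) {
      if (raw.why.trim().length === 0) continue; // no rationale, no finding
      findings.push({ pass: pass.name, message: raw.message, why: raw.why });
    }
  }
  return findings;
}
```

Freezing the bundle means no pass can alter what a later pass sees, which is one simple way to keep the inputs identical across passes. The rationale check enforces the "required why" rule at the point where findings are collected.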

We didn't pick these design choices because they sounded clever. We picked them because the research said the unstructured alternative loses 50% of its defects past 200 LOC. The architecture is shaped around the cliff.

Methodology and references

For the figures in this post:

  • Figure 1 is a smoothed reconstruction of the defect-detection curve published by SmartBear/Cisco, derived from the raw data table in Best Kept Secrets of Peer Code Review (Cohen, 2006). The original publication is a corporate whitepaper. For peer-reviewed context on modern review practice, see Sadowski, C. et al., "Modern Code Review: A Case Study at Google", ICSE-SEIP 2018.

  • Figure 2 is plotted directly from Table 3 of Bacchelli, A. & Bird, C. (2013). "Expectations, Outcomes, and Challenges of Modern Code Review", ICSE 2013. The percentages represent the fraction of survey respondents identifying each item as the primary purpose (secondary bar) versus the fraction of actual review comments classified into that category (primary bar).

  • Table 1 rows summarise the headline finding from each citation; the McConnell row references Code Complete, 2nd edition, Microsoft Press 2004, Chapter 21, "Collaborative Construction."

  • PR-size growth statistics are from GitClear's Coding on Copilot: 2024 Research report, which analysed ~211M lines of code across ~3.6M PRs.

A planned follow-up post will publish our own analysis of PR-size distributions across 100 open-source TypeScript repositories, with the data and methodology released alongside. That work is in progress and will land here when it's ready to defend.


If you read the original Bacchelli & Bird paper end-to-end and disagree with our interpretation, we'd like to hear about it. The interpretation matters: it's the load-bearing wall under everything Revund is. Email hello@revund.dev and tell us where we got it wrong.