
The Quiet Revolution in Code Review

Code review went from optional to mandatory to bottleneck. AI tools are unbottlenecking it -- but the real question is what you optimize for.

Code review has gone through three eras. Understanding where we've been explains why AI review tools matter -- and why they're not the whole story.

Era one: optional. Early teams skipped review entirely. Ship fast, fix later. The codebase became a haunted house of undocumented decisions. Everyone who was there when it was built left. Nobody knew why anything worked.

Era two: mandatory. Companies mandated reviews to solve the haunted house problem. Every PR required an approval. Knowledge spread. Quality improved. But a new problem emerged: reviews became the bottleneck. Engineers waited hours or days for approvals. Reviewers drowned in queues. The sixth PR of the day got a rubber stamp because the reviewer was exhausted after scrutinizing the first five.

Era three: AI-assisted. That's where we are now. And the interesting thing about this transition isn't the technology. It's the question it forces you to answer: what is code review actually for?

The Three Purposes of Code Review

Most teams treat code review as one thing. It's actually three things, and they're often in tension.

Bug detection. Finding logic errors, edge cases, security issues, and performance problems before they reach production. This is what most people think of when they think about code review.

Knowledge transfer. Spreading understanding of the codebase across the team. When I review your PR, I learn about the part of the system you changed. When you address my comments, you learn about conventions and considerations you didn't know about. This is arguably the most valuable purpose of review, and the one most teams underinvest in.

Quality gatekeeping. Enforcing standards -- naming conventions, error handling patterns, test coverage expectations, architectural consistency. This is the most tedious purpose and the one most prone to human fatigue.

AI review tools are excellent at the third purpose. Decent at the first. And currently useless at the second.

This matters because if you adopt AI review tools and only think about bug detection, you'll optimize for the wrong thing. The real opportunity is using AI to handle gatekeeping so that humans can spend more time on knowledge transfer and architectural judgment.

What AI Review Tools Actually Do Well

I've used CodeRabbit, Greptile, and Cursor Bugbot in production workflows. Here's what I've found to be reliably good.

Pattern enforcement is where AI shines brightest. "This endpoint doesn't follow our error response format." "This React component uses class state instead of hooks." "This function doesn't have the error boundary that all our API handlers include." These are rules that human reviewers enforce inconsistently because they're boring to check and easy to miss when fatigued. AI checks them every time, instantly, without getting tired.
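The mechanics are easy to see in miniature. The sketch below flags handler functions that contain no try/except, using a toy AST rule. The `_handler` naming convention, the rule itself, and the sample source are all invented for illustration -- real review tools are far more sophisticated -- but the tireless, mechanical nature of the check is the point.

```python
import ast

SOURCE = """
def list_users_handler(req):
    return db.query(req)   # no error handling

def create_user_handler(req):
    try:
        return db.insert(req)
    except Exception:
        return error_response(500)
"""

def handlers_missing_error_handling(source: str) -> list[str]:
    """Flag handler functions with no try/except anywhere in their body.

    A toy stand-in for the kind of pattern rule an automated reviewer
    applies to every PR, instantly and without fatigue.
    """
    tree = ast.parse(source)
    flagged = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.endswith("_handler"):
            has_try = any(isinstance(n, ast.Try) for n in ast.walk(node))
            if not has_try:
                flagged.append(node.name)
    return flagged

print(handlers_missing_error_handling(SOURCE))  # ['list_users_handler']
```

A human reviewer applies this kind of rule inconsistently; the script applies it identically on the first PR of the day and the fiftieth.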

Surface-level bug detection is genuinely useful. Null reference risks, unhandled promise rejections, off-by-one errors in loop bounds, missing await statements on async calls. These are the bugs that experienced reviewers catch early in the day and miss after lunch. AI catches them at 3 AM on a Saturday with the same accuracy as 9 AM Monday.
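For a concrete sense of this bug class, here's a minimal invented example of the off-by-one loop-bounds error a diff-level reviewer catches: the buggy version silently drops the final element, which is easy to miss in a review and obvious once flagged.

```python
def last_n_buggy(items, n):
    # Off-by-one: the range stops one index early, so the final
    # item is always dropped. No exception, just wrong output --
    # exactly the kind of bounds error a diff-level check flags.
    return [items[i] for i in range(len(items) - n, len(items) - 1)]

def last_n_fixed(items, n):
    # Correct upper bound: range is exclusive, so len(items) is right.
    return [items[i] for i in range(len(items) - n, len(items))]

data = [1, 2, 3, 4, 5]
print(last_n_buggy(data, 3))   # [3, 4] -- silently wrong
print(last_n_fixed(data, 3))   # [3, 4, 5]
```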

Test gap identification is the surprise value. AI tools flag when a function has three code paths but only two test cases. Human reviewers often skip this because evaluating test completeness is mentally expensive. Having an AI say "you didn't test the error branch" saves more production incidents than I expected.
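A toy illustration of the gap (the function and tests are hypothetical): three code paths, two test cases, and the untested error branch is precisely what gets flagged.

```python
def parse_quantity(raw: str) -> int:
    # Three code paths: empty input, non-numeric input, valid input.
    if not raw:
        return 0                                       # path 1: default
    if not raw.isdigit():
        raise ValueError(f"not a quantity: {raw!r}")   # path 2: error branch
    return int(raw)                                    # path 3: happy path

# A typical test file: two of the three paths covered.
assert parse_quantity("12") == 12    # happy path
assert parse_quantity("") == 0       # default path
# Nothing exercises the ValueError branch -- the gap behind the
# comment "you didn't test the error branch".
```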

What AI Review Tools Don't Do

AI review tools operate on the diff. They see what changed. They don't understand why it changed, whether it should have changed, or what it means for the system's future.

Architecture evaluation is beyond them. "This works, but it creates a circular dependency that'll block the refactor we planned for Q2." "This service should be stateless but you're storing user session data in memory." "This solves the immediate problem but the right fix is in a different layer entirely." These require context about the system's trajectory that no review tool possesses.

Domain validation is another gap. "This pricing calculation applies the discount after tax -- is that correct for our enterprise tier?" The AI can verify the math is right. It can't verify the business logic is right.
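To make that distinction concrete, here's a sketch with invented rates and numbers. Both totals are computed correctly -- the arithmetic in each is sound -- but whether tax should be charged on the full price or the discounted price is a business rule no diff can settle.

```python
TAX_RATE = 0.10   # hypothetical 10% tax
DISCOUNT = 0.20   # hypothetical 20% enterprise discount

def total_tax_on_discounted_price(price: float) -> float:
    # Apply the discount first, then tax the reduced amount.
    return round(price * (1 - DISCOUNT) * (1 + TAX_RATE), 2)

def total_tax_on_full_price(price: float) -> float:
    # Discount the base price, but compute tax on the full price.
    return round(price * (1 - DISCOUNT) + price * TAX_RATE, 2)

print(total_tax_on_discounted_price(100.0))  # 88.0
print(total_tax_on_full_price(100.0))        # 90.0
```

An AI reviewer can confirm either function does what its comment says. It can't tell you which comment describes your enterprise contract.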

And the knowledge transfer purpose of review? AI tools don't help here at all. A human reviewer asking "why did you choose this approach?" and the ensuing conversation is where junior engineers learn judgment. AI tools don't have that conversation. They produce comments. They don't produce understanding.

A Practical Model That Works

After experimenting for about a year, here's the structure we've settled on.

Automated checks run first, on every PR, without human intervention. Linting, type checking, tests, and AI review tools all fire when the PR is opened. The engineer addresses automated feedback before requesting human review. This isn't optional -- it's a workflow expectation. The AI handles the gatekeeping layer so the human reviewer never has to.

Human review focuses on two things only. First: does this approach make sense architecturally? Is this the right abstraction? Will this be maintainable? Does this create tech debt we'll regret? Second: does the reviewer learn something from reading this code, and does the author learn something from the review comments? If neither person walks away understanding the system better, the review didn't justify the human time.

Output measurement closes the loop. Here's the part most teams miss. Code review -- manual or AI-assisted -- is an input metric: it tells you about the quality of your process, not your product. What you actually care about is the quality of what ships. Are the merged PRs genuinely complex? Are they well-structured? Are they tackling hard problems or just churning through boilerplate?

This is where scoring the output adds a dimension that review tools can't. Review tools optimize the gate. Output scoring evaluates what gets through the gate. You need both.

The Priority Question

When teams adopt AI review tools, they often ask: "Which tool should we use?" That's the wrong first question.

The right first question is: "What is our review process optimizing for?"

If your bottleneck is review turnaround time -- engineers blocked waiting for approvals -- then any decent AI tool will help. The key intervention is automating the gatekeeping layer so human reviewers spend less time on pattern enforcement and more time on the stuff that matters.

If your bottleneck is review quality -- reviewers approve things they shouldn't -- then AI tools help with surface-level issues but you also need to rethink how you allocate human review time. Point your experienced reviewers at the highest-complexity PRs and let AI handle the routine ones. You can use PR complexity scores to route reviewers where they'll have the most impact.
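One way to sketch that routing idea -- the 0-10 complexity scale, the threshold, and the reviewer names are all illustrative assumptions, not any tool's actual API:

```python
from itertools import cycle

def make_router(seniors, juniors, threshold=7.0):
    """Round-robin PR router: PRs at or above the complexity threshold
    go to senior reviewers for deep review; routine PRs get the AI gate
    plus a lighter human pass."""
    senior_rr, junior_rr = cycle(seniors), cycle(juniors)

    def route(pr_id: str, complexity: float) -> dict:
        if complexity >= threshold:
            return {"pr": pr_id, "reviewer": next(senior_rr),
                    "mode": "deep review"}
        return {"pr": pr_id, "reviewer": next(junior_rr),
                "mode": "AI gate + spot check"}

    return route

route = make_router(["dana", "lee"], ["sam", "kit"])
print(route("PR-101", 8.5))  # high complexity: senior, deep review
print(route("PR-102", 3.0))  # routine: AI gate + spot check
```

The threshold is the policy decision; the routing itself is trivial once you have a complexity score to route on.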

If your bottleneck is knowledge silos -- certain engineers are the only ones who understand certain systems -- then AI tools won't help at all. You need to pair junior reviewers with senior ones on PRs that touch unfamiliar systems. That's a process change, not a tool change.

The tool matters less than the diagnosis.

Where This Is Going

The current generation of AI review tools works on the diff. The next generation will have full codebase context -- understanding not just what changed, but how it fits into the system. That means they'll start catching some architectural issues: "This function duplicates logic that already exists in module X" or "This change conflicts with the pattern established in PR #847."

When that happens, the human review role shifts even further toward strategic judgment and knowledge transfer. The questions become less about "is this code correct?" and more about "is this the right code to write?" and "does the team understand why?"

That's a healthy evolution. It puts human attention on the work that actually requires human judgment. And it makes measuring what ships even more critical, because as the review process gets smarter, you need to verify that smarter reviews translate to better outcomes.


GitVelocity measures engineering velocity by scoring every merged PR using AI. Understand what your team is shipping, not just how they're reviewing.


Written by Conrad Chu

Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.