AI Broke Your Engineering Metrics. Here's What Works Now
Commit counts, lines of code, and story points all fail when AI writes code. Here's the measurement approach that works — with real data from our team.
Engineering leaders are flying blind. In my conversations with CTOs across Headline's portfolio, the same critical questions keep surfacing — questions that don't have established answers yet.
- AI adoption visibility: How is AI being adopted across my engineering organization? Which engineers are actually more productive with AI tools?
- Standardization challenges: How do I standardize AI usage when everyone's doing their own thing? How do I motivate the resistant engineers and channel the enthusiastic ones?
- Broken metrics: Performance reviews are subjective. Commit counts measure activity, not value. Project tracking tools measure process, not output. There's no way to objectively identify who's a good developer.
The fundamental problem: the industry lacks a standardized, verifiable measure of engineering output. And AI agentic coding is making it worse.
The Old Metrics Are Failing Harder Than Usual
Traditional engineering metrics were already flawed. Lines of code, commit count, story points — none of them captured what actually matters. But AI tools make these metrics actively misleading.
Consider lines of code. An engineer using Claude Code can generate 500 lines in the time it used to take to write 50. Does that mean they're 10x more productive? Of course not. Most of those lines might be boilerplate. The real value was in the 20 minutes the engineer spent thinking about the architecture before prompting.
Commit count is even worse. AI-assisted development often produces fewer, larger commits. An engineer might scaffold an entire feature in a single session instead of building it incrementally over days. Fewer commits, dramatically more output.
And story points? They were already divorced from reality. AI just widens the gap. A task estimated at 8 points might take an AI-enabled engineer 30 minutes.
If an engineer uses AI to ship six features in a week instead of one, how do you capture that? Traditional metrics don't adjust for the quality or complexity of the output.
What AI Actually Changes
To measure AI's impact, you need to understand what it changes and what it doesn't.
What AI changes:
- Speed of implementation. Boilerplate, patterns, and well-understood problems get solved faster. Much faster.
- Breadth of capability. Engineers work in unfamiliar languages and frameworks more confidently. The AI fills knowledge gaps in real time.
- Iteration speed. Trying three approaches and picking the best one becomes practical when each takes minutes instead of hours.
What AI doesn't change:
- Problem decomposition. Knowing what to build still requires human judgment.
- System design. Architecture decisions, trade-off analysis, and long-term thinking remain human work.
- Quality standards. AI can generate tests, but deciding what to test and what quality bar to set is still an engineering call.
The new differentiator in the AI era: It's no longer about whether you can write complex code — AI makes that almost effortless. The differentiator is parallelization: how many projects can you work on simultaneously? How quickly can you move from one shipped feature to the next?
AI Is the Best Judge for AI-Era Code
Here's something we believe strongly: if AI is writing the code, AI should judge the code.
Why?
- Consistent judgment. No mood, no politics, no recency bias. The same rubric is applied to every PR.
- Modern code understanding. AI understands modern code patterns that traditional static analysis tools miss.
- AI-agnostic measurement. It scores code based on what shipped — regardless of whether AI assisted in writing it. The output is what matters.
When you score a merged PR across dimensions like Scope, Architecture, Implementation, Risk, and Quality, you're measuring the outcome of engineering work. It doesn't matter whether the code was typed by hand or pair-programmed with Claude. What matters is what shipped and how complex it was.
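To make that concrete, here is a minimal sketch of how per-dimension scores could roll up into a single complexity number per PR. The five dimensions come from the text above; the 0-100 scale, the weights, and the example values are illustrative assumptions, not GitVelocity's actual rubric.

```python
from dataclasses import dataclass, field

# The five dimensions named above. The weights and 0-100 scale are assumptions.
WEIGHTS = {
    "scope": 0.25,
    "architecture": 0.25,
    "implementation": 0.20,
    "risk": 0.15,
    "quality": 0.15,
}

@dataclass
class MergedPR:
    pr_id: str
    # Each dimension is scored 0-100 by an LLM judge applying the same fixed
    # rubric to the merged diff and description. Whether AI wrote the code
    # never enters the score -- only what shipped does.
    dimension_scores: dict[str, float] = field(default_factory=dict)

    def complexity_points(self) -> float:
        """Collapse the per-dimension rubric into one complexity score."""
        return sum(w * self.dimension_scores.get(dim, 0.0)
                   for dim, w in WEIGHTS.items())

# A hypothetical mid-sized PR:
pr = MergedPR("service-api#1287", {
    "scope": 40, "architecture": 55, "implementation": 35,
    "risk": 20, "quality": 60,
})
print(pr.complexity_points())  # 42.75 complexity points
```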
The AI Adoption Signal
Here's what most leaders miss: complexity-based measurement gives you a natural signal for AI adoption effectiveness.
When an engineer starts using AI tools effectively, a predictable pattern emerges:
- Week 1-2: Velocity stays roughly the same. They're learning the tool.
- Week 3-4: Velocity increases 20-40%. They've integrated AI into routine tasks.
- Month 2+: Velocity stabilizes at a new, higher baseline — shipping more work at similar complexity.
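In measurement terms, that signal is just each engineer's weekly complexity-weighted velocity compared against their pre-AI baseline. A rough sketch, with illustrative thresholds and sample data:

```python
# A sketch of the adoption signal: weekly complexity-weighted velocity vs. a
# pre-AI baseline. The thresholds, labels, and sample data are illustrative.
def adoption_signal(weekly_points: list[float], baseline: float) -> list[str]:
    report = []
    for week, points in enumerate(weekly_points, start=1):
        change = (points - baseline) / baseline
        if change < 0.10:
            phase = "near baseline (still learning the tool)"
        elif change < 0.50:
            phase = "ramping (AI folded into routine tasks)"
        else:
            phase = "new, higher baseline"
        report.append(f"week {week}: {points:.0f} pts ({change:+.0%}) - {phase}")
    return report

# Example: an engineer whose pre-adoption velocity averaged 200 points/week.
for line in adoption_signal([205, 212, 255, 275, 320, 335, 330], baseline=200):
    print(line)
```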
We saw this firsthand at Headline. Our team's aggregate productivity nearly doubled from August to November 2025. It kept climbing in December despite the holidays, and January 2026 brought another 50% jump after we announced that GitVelocity scores would factor into annual performance reviews.
Something else happened that we didn't expect: some junior engineers started outperforming seniors on pure velocity metrics. AI had unlocked them. They were shipping complex work that would have been beyond their reach a year ago.
Without complexity-based measurement, none of this would have been visible. We'd be relying on self-reported surveys or vendor seat-usage data that tells you nothing about actual productivity.
What This Looks Like in Practice
Engineer A uses AI heavily. They ship 12 PRs in a week. Most are moderate complexity (25-40 points). Total velocity: ~400 points. Before AI tools, they shipped 6 PRs at similar complexity. AI doubled their output.
Engineer B doesn't use AI much. They ship 5 PRs. But two are high-complexity architectural changes (70+ points). Total velocity: ~280 points. Their work is deeply valuable.
Engineer C uses AI constantly but mostly for trivial tasks. They ship 20 PRs, all simple (5-15 points). Total velocity: ~200 points. They're busy but not impactful.
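In complexity-weighted terms, those three profiles reduce to simple arithmetic. A sketch with made-up point values consistent with the ranges above:

```python
# Weekly velocity = sum of complexity points across an engineer's merged PRs.
# The per-PR point values below are illustrative, matching the ranges above.
def weekly_velocity(pr_points: list[float]) -> float:
    return sum(pr_points)

engineer_a = [33] * 12                 # 12 moderate PRs, 25-40 points each
engineer_b = [72, 75, 45, 44, 44]      # 5 PRs, two 70+ point architectural changes
engineer_c = [10] * 20                 # 20 simple PRs, 5-15 points each

for name, prs in [("A", engineer_a), ("B", engineer_b), ("C", engineer_c)]:
    print(f"Engineer {name}: {len(prs)} PRs, {weekly_velocity(prs):.0f} points")
# Engineer A: 12 PRs, 396 points  -> ~400, AI-heavy, doubled output
# Engineer B: 5 PRs, 280 points   -> fewer PRs, deep architectural work
# Engineer C: 20 PRs, 200 points  -> busy, but mostly trivial changes
```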
None of these stories are visible in commit counts or lines of code. All of them are visible in complexity-weighted velocity.
The Path Forward
AI tools are going to widen the gap between effective and ineffective engineers. As agentic coding reshapes how work gets done, leaders need to see this happening — not to punish anyone, but to invest in the right development opportunities. The engineer struggling to adopt AI might need pairing sessions, not a performance plan. The engineer whose velocity doubled might be ready for a tech lead role.
You can't make these decisions without measurement. And the measurement has to capture what actually matters in the AI era — not what mattered five years ago. For a framework on how to quantify AI tool ROI specifically, see how to measure AI tools ROI.
GitVelocity measures engineering velocity by scoring every merged PR using AI.
Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.