6 min read · Engineering Measurement

Engineering Measurement Is Broken. Here's What We Do About It.


Every engineering leader has been asked this question by their CEO and felt their stomach drop: "How productive is our engineering team?"

The honest answer — the one nobody says out loud — is: we don't know. We have proxies. We have rituals. We have dashboards full of numbers that look precise but measure nothing meaningful. The industry has spent two decades trying to answer this question, and we've failed.

I know because I've been on both sides. As an engineer, I watched my work get reduced to story points that bore no relationship to reality. As an investor at Headline, I ask CTOs across our portfolio this question constantly. The answers range from confident hand-waving to honest despair.

It's time to stop pretending and start fixing.

Everything We've Tried (And Why It Failed)

The history of engineering measurement reads like a graveyard of good intentions.

Lines of code was the first attempt. It had the appeal of simplicity — more lines, more work. But it rewarded verbosity and punished elegance. The engineer who deleted 500 lines of dead code showed up as negatively productive. The engineer who copy-pasted their way to 2,000 lines looked like a hero.

Commit counts seemed smarter. At least they measured discrete units of work. But engineers quickly learned to game them. Atomic commits became micro-commits. "Fix typo" became a productivity signal. The engineer spending three days on a critical security fix looked idle next to one pushing whitespace changes.

Story points were the Agile world's attempt at sophistication. Relative sizing. Fibonacci sequences. Planning poker. The theory was sound — humans are better at relative comparison than absolute estimation. In practice, story points became currency. Teams inflated them. Managers compared them across teams despite being told not to. The numbers went up and to the right while actual output stayed flat.

DORA metrics — deployment frequency, lead time, change failure rate, time to restore — moved the conversation to process health. That was progress. But DORA tells you how fast your pipeline runs, not what's moving through it. A team deploying empty features ten times a day looks great on DORA. A team shipping one transformative change per week doesn't.

PR counts got closer. At least they measured shipped artifacts. But they incentivized splitting work into the smallest possible units. One meaningful feature became seven trivial PRs. The metric went up. The output was identical.

Every one of these was introduced with good intentions. Every one was gamed, misused, or misinterpreted within months.

The Core Problem: We've Been Measuring Everything Except Output

Here's the pattern nobody talks about: every metric we've tried measures an input, a proxy, or a process — never the actual output.

Lines of code measures volume. Commits measure git activity. Story points measure pre-work guesses. DORA measures pipeline speed. PR counts measure how work is packaged. None of them answer the question: what did this engineer actually ship, and how complex was it?

This is like measuring a factory's output by counting how many times workers clocked in, how fast the conveyor belt runs, and how many tools they used — without ever looking at what came off the line.

The output of engineering is code that ships to production. That's the artifact. That's where the truth lives. And until recently, we had no good way to evaluate that artifact at scale.

Why This Problem Is Getting Worse

Measurement was already broken; AI tools are making it catastrophically misleading.

An engineer using Claude Code can generate 500 lines in the time it used to take to write 50. Lines of code as a metric is now actively harmful. Commit counts? AI-assisted development often produces fewer, larger commits. Story points? A task estimated at 8 points might take an AI-enabled engineer 30 minutes.

Here's what I keep hearing from CTOs across Headline's portfolio — the same three questions, over and over:

"How is AI being adopted across my engineering org?" They've bought seats for Copilot, Cursor, Claude Code. They have no idea who's actually using them effectively.

"How do I standardize AI usage when everyone's doing their own thing?" Some engineers have doubled their output. Others haven't changed. Leadership can't tell which is which.

"How do I objectively identify who's a good developer?" Performance reviews are subjective. Commit counts measure activity, not value. Project tracking tools measure process, not output.

The fundamental problem: the industry lacks a standardized, verifiable measure of engineering output. AI is making the gap between reality and measurement wider every day.

What a Real Measurement System Looks Like

After years of frustration — first as an engineer, then as an investor watching portfolio companies struggle with this — we built GitVelocity. Not because the world needed another dashboard. Because we needed an answer to the question that every CTO kept asking.

Here's what we learned about what measurement actually needs to be:

Measure output, not activity. The only artifact that matters is code that reaches production. Not hours online. Not Slack messages. Not attendance at standups. What shipped?

Measure after, not before. Story points try to predict complexity before work begins. That's backwards. The information required for accurate assessment doesn't exist until the work is done. Measure the artifact itself — the actual code changes — after all the surprises have been encountered and resolved.

Be objective and consistent. The same work should get the same score regardless of who did it, what team they're on, or what their manager thinks. No bias. No mood. No politics.

Be transparent. Engineers should see exactly how their score is calculated. Every score should come with a breakdown showing why. Black-box metrics breed distrust.

Be gaming-resistant. Ground measurement in the actual artifact — the code itself — not self-reported estimates or easily manipulated proxies. You can split a feature into five PRs, but the total complexity of the code doesn't change.

Capture complexity, not volume. A brilliant 30-line security fix should score higher than a 500-line copy-paste integration. What matters is the engineering complexity of the change, not its size.
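To make the gaming-resistance point concrete, here is a minimal sketch. The scoring function below is a deliberately toy stand-in (not GitVelocity's actual model, whose rubric the article doesn't publish): if complexity is derived from the code itself, splitting one change into several PRs raises the PR count but leaves the total score untouched.

```python
# Hypothetical illustration: complexity lives in the code, not the packaging.
# complexity_score is a toy stand-in for a real code-complexity evaluator.

def complexity_score(changed_lines: list[str]) -> int:
    """Toy proxy: count non-blank changed lines."""
    return sum(1 for line in changed_lines if line.strip())

feature = [f"line {i}" for i in range(100)]  # one feature's worth of changes

# Shipped as a single PR:
one_pr_total = complexity_score(feature)

# The same work split into five PRs to inflate the PR count:
five_prs = [feature[i : i + 20] for i in range(0, 100, 20)]
five_pr_total = sum(complexity_score(pr) for pr in five_prs)

assert one_pr_total == five_pr_total  # total complexity is unchanged
print(len(five_prs), one_pr_total, five_pr_total)  # 5 100 100
```

The metric an engineer can inflate is the PR count; the artifact-grounded total cannot be moved by repackaging the same work.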

AI Should Judge AI-Era Code

Here's something we believe strongly: if AI is writing the code, AI is now the best judge of code quality.

Traditional static analysis tools check syntax rules. They catch patterns. But they can't understand intent, evaluate architectural decisions, or assess whether a complex algorithm was the right approach. A linting tool can tell you the code follows conventions. It can't tell you the code is good.

AI can. When you score every merged PR across dimensions like Scope, Architecture, Implementation, Risk, Quality, and Performance — using a model that understands modern code patterns across languages — you get a measure that reflects what senior engineers intuitively recognize: some work is genuinely complex, and some isn't.

The key properties:

  • No bias. No mood, no politics, no recency bias. The same rubric is applied to every PR.
  • AI-agnostic. It scores code based on what shipped — regardless of whether AI assisted in writing it. The output is what matters, not the tool that produced it.
  • Universal. Every PR across every team is scored on the same rubric. A 45-point PR on Team A is genuinely comparable to a 45-point PR on Team B. This was never possible with story points.
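As a sketch of what a transparent, rubric-based score could look like: the six dimension names come from the article, but the weights, the 0-100 per-dimension scale, and the breakdown format are all assumptions made for illustration, not GitVelocity's actual rubric.

```python
from dataclasses import dataclass

# Illustrative only. Dimension names are from the article; the weights and
# scales below are assumed for this sketch.
WEIGHTS = {
    "Scope": 0.20,
    "Architecture": 0.20,
    "Implementation": 0.20,
    "Risk": 0.15,
    "Quality": 0.15,
    "Performance": 0.10,
}

@dataclass
class PRScore:
    dimensions: dict[str, float]  # each dimension scored 0-100 by the model

    def total(self) -> float:
        """Weighted aggregate on a 0-100 scale."""
        return round(sum(WEIGHTS[d] * s for d, s in self.dimensions.items()), 1)

    def breakdown(self) -> str:
        """Per-dimension explanation, so the score is never a black box."""
        lines = [f"{d}: {s:.0f} x {WEIGHTS[d]:.2f} = {WEIGHTS[d] * s:.1f}"
                 for d, s in self.dimensions.items()]
        return "\n".join(lines + [f"Total: {self.total()}"])

pr = PRScore({"Scope": 60, "Architecture": 40, "Implementation": 55,
              "Risk": 30, "Quality": 50, "Performance": 20})
print(pr.breakdown())  # ends with "Total: 45.0"
```

Because the rubric and weights are fixed, a 45-point PR means the same thing on every team, and the breakdown shows exactly where the points came from.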

What Happens When You Actually Measure

When we rolled out GitVelocity internally at Headline, the results were concrete.

Our team's aggregate productivity nearly doubled from August to November 2025. Output kept growing through a holiday-heavy December, and January 2026 brought another 50% jump after we announced that scores would factor into annual reviews.

Something else happened that we didn't expect: some junior engineers started outperforming seniors on pure velocity metrics. AI tools had unlocked them. They were shipping complex work that would have been beyond their reach a year ago. Without objective measurement, we never would have seen it.

The adoption followed a predictable pattern: skepticism first ("Can AI really judge my code?"), then testing (engineers checking if the scores matched their intuition), then acceptance (they did), then — and this surprised us — competition. Engineers started wanting to improve their scores. Weekly meetings began celebrating top performers. Natural gamification emerged because fair measurement creates healthy motivation.

The Path Forward

Engineering measurement is broken, but it's not unfixable. The tools to fix it now exist — AI that understands code deeply enough to evaluate it objectively, consistently, and transparently.

The question isn't whether to measure engineering output. It's whether to keep using measurement tools that everyone knows are broken, or to adopt something that actually works.

The code is the code. It tells the truth about what was actually done. We just needed a way to read it at scale.


GitVelocity measures engineering velocity by scoring every merged PR using AI. Every score is transparent and explainable.

See how it works.

Written by Conrad Chu

Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.