4 min read · Engineering Measurement

Why Story Points Failed: A Post-Mortem

Story points promised to predict engineering capacity. They never delivered. Here's why — and what actually works.

I've sat through hundreds of sprint planning sessions. I've watched senior engineers argue about whether a task is a 5 or an 8. I've seen velocity charts that went up and to the right while shipped features stayed flat. Story points had a good run. They gave us a shared vocabulary and a planning ritual. But let's be honest: everyone in the room knew the numbers were made up.

It's time for a post-mortem.

The Promise

Story points were supposed to solve a real problem. Traditional time-based estimates were consistently wrong. Teams would estimate "two days" and deliver in two weeks. Managers would stack estimates into Gantt charts that were fiction from day one.

Story points offered an alternative: relative sizing. Don't estimate in hours — just compare tasks to each other. A login page is a 3, a payment integration is an 8. The theory was that humans are better at relative comparison than absolute prediction.

And that's true. Humans are better at relative comparison. But that insight, applied to software estimation, ran into three fatal problems.

Fatal Flaw #1: The Numbers Became Currency

The moment story points appeared in a sprint velocity chart, they stopped being a planning tool and became a performance metric. Managers tracked velocity. Teams were compared by points-per-sprint. Story points became the thing teams optimized for.

This created perverse incentives:

  • Inflation: Teams gradually estimated everything higher. Last quarter's 3-pointer became this quarter's 5-pointer. Velocity went up. Everyone celebrated. Nothing changed.
  • Gaming: Engineers learned to break work into more tickets. One feature became five sub-tasks. More tickets = more points = better velocity. The actual output was identical.
  • The Fibonacci illusion: We pretended that mapping work to 1, 2, 3, 5, 8, 13 created meaningful precision. It didn't. It just gave the theater of estimation a mathematical costume.

When a metric becomes a target, it ceases to be a good metric — Goodhart's law in action. Story points became a target the moment they appeared on a dashboard.

Fatal Flaw #2: You Can't Measure Output by Predicting Input

Here's the core epistemological problem: you cannot accurately estimate the complexity of knowledge work before doing it. Story points tried to measure output by predicting input — guessing at complexity before anyone had written a line of code.

Consider a typical task: "Add retry logic to the payment webhook handler." In planning poker, the team estimates 5 points. But when the engineer actually opens the code, they discover:

  • The webhook handler is tightly coupled to a synchronous processing pipeline
  • Adding retries requires introducing a queue
  • The queue needs dead-letter handling
  • Dead-letter handling reveals that error types aren't properly categorized
  • Categorizing errors means touching the shared error module that six services depend on

What was estimated as a 5 became, in reality, a 13. This happens constantly. Not because engineers are bad at estimating — because the information required for accurate estimation doesn't exist until you're inside the work.

This isn't a fixable calibration problem. It's a fundamental limitation of pre-work estimation for complex, interdependent systems.
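The cascade above can be sketched in code. This is a hedged, hypothetical illustration — the names (`handle_webhook`, `process`, the error classes) are invented for this post, not taken from any real codebase — showing why "add retry logic" forces a queue and dead-letter handling once you distinguish transient from permanent failures:

```python
import queue

# Hypothetical sketch: what "add retry logic" ends up requiring.

class TransientError(Exception): pass   # retry-worthy (e.g. upstream timeout)
class PermanentError(Exception): pass   # retrying won't help (e.g. bad payload)

MAX_ATTEMPTS = 3
work_queue = queue.Queue()   # the queue the "simple" task forced us to introduce
dead_letters = []            # events that exhausted retries or failed permanently

def process(event):
    """Stand-in for the synchronous payment pipeline."""
    if event.get("fail") == "transient" and event["attempt"] < 2:
        raise TransientError("upstream timeout")
    if event.get("fail") == "permanent":
        raise PermanentError("malformed payload")
    return "ok"

def handle_webhook(event):
    """Re-enqueue on transient failure; dead-letter when retrying can't help."""
    event.setdefault("attempt", 0)
    try:
        return process(event)
    except TransientError:
        event["attempt"] += 1
        if event["attempt"] >= MAX_ATTEMPTS:
            dead_letters.append(event)   # retries exhausted
        else:
            work_queue.put(event)        # retry later via the queue
    except PermanentError:
        dead_letters.append(event)       # requires error categorization to detect

# Simulate three incoming webhooks, then drain the retry queue like a worker would.
handle_webhook({"id": 1})
handle_webhook({"id": 2, "fail": "transient"})
handle_webhook({"id": 3, "fail": "permanent"})
while not work_queue.empty():
    handle_webhook(work_queue.get())
```

Even this toy version needed a queue, an attempt counter, and two categories of error — none of which were visible when the team voted "5" in planning poker.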

Fatal Flaw #3: "Velocity" Measured in Story Points Is a Vanity Metric

Story points were designed to be team-specific. "A 5 means different things to different teams" was a feature, not a bug. But this made them useless for exactly the comparisons organizations needed to make:

  • Which team is delivering the most value?
  • Is Team A's velocity decline a problem or a sign of increasingly complex work?
  • How should we allocate engineers across projects?

When every team has its own incomparable unit of measurement, leadership has no meaningful signal. I talk to CTOs constantly — across Headline's portfolio, the same question keeps surfacing: "How productive is my engineering team?" Story points can't answer it. They might as well be measuring in vibes.

The Alternative: Measure After, Not Before

What if, instead of guessing complexity before work begins, you measured it after work ships?

This is the insight behind GitVelocity. When a PR merges to production, we analyze the actual code changes and score their complexity across six dimensions: Scope, Architecture, Implementation, Risk, Quality, and Performance & Security.

The code is the code. It tells the truth about what was actually done.

This approach eliminates all three fatal flaws:

  1. Can't inflate: The score is based on the actual code diff. You can't claim a typo fix is complex work — the AI reads the code.
  2. No pre-work estimation: Complexity is measured from the artifact itself, after all the surprises have been encountered and resolved.
  3. Universal comparability: Every PR across every team is scored on the same rubric. A 45-point PR on Team A is genuinely comparable to a 45-point PR on Team B.

But Won't Engineers Game This Too?

Fair question. If engineers know they're being scored on complexity, won't they just write unnecessarily complex code?

No. Complex code is expensive to review, merge, and maintain. Writing gratuitously complex code to game a score means longer code reviews (teammates will push back), more bugs (complexity correlates with defect rate), and future maintenance burden.

The scoring rubric includes a Quality dimension. Code that's complex but untested scores poorly. Code that's complex but poorly documented scores poorly. The incentive structure naturally rewards doing genuinely complex work well — which is exactly what we want.
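To make the incentive structure concrete, here is a toy aggregation over the six dimensions named above. The dimension names come from the post; the 0–100 scale, the weights, and the quality-gating function are invented for illustration and are not GitVelocity's actual rubric:

```python
# Hypothetical sketch of a per-PR complexity score. Dimension names are from
# the post; the scoring math is invented for illustration only.

DIMENSIONS = ("scope", "architecture", "implementation",
              "risk", "quality", "performance_security")

def pr_score(scores: dict) -> float:
    """Average 0-100 dimension scores; a low Quality score drags the total down."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    base = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    # Gate on quality: complex-but-untested work shouldn't score well.
    quality_factor = scores["quality"] / 100
    return round(base * quality_factor, 1)

# The same complex diff, once well-tested and once with poor quality signals.
well_done = pr_score({"scope": 60, "architecture": 70, "implementation": 65,
                      "risk": 50, "quality": 90, "performance_security": 55})
gamed = pr_score({"scope": 60, "architecture": 70, "implementation": 65,
                  "risk": 50, "quality": 30, "performance_security": 55})
```

Under any gating scheme like this, gratuitous complexity with weak quality signals scores worse than simpler, well-tested work — which is the anti-gaming argument in miniature.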

The Path Forward

Story points served their purpose. They moved the industry away from the fiction of hour-based estimates. But they were a stepping stone, not a destination.

We built GitVelocity because we needed it ourselves. After months of internal use at Headline, the results were clear: objective measurement based on shipped code works. The AI era gives us tools that previous generations of engineering leaders didn't have. The question isn't whether to measure engineering output — it's whether to keep using a measurement tool that everyone knows is broken.

We know the answer.


GitVelocity measures engineering velocity by scoring every merged PR using AI.

Learn more about how it works.

Written by Conrad Chu

Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.