The Estimation Paradox: Why Predicting Software Complexity Is a Fool's Errand
Software estimates are wrong because the information required for accurate estimation doesn't exist until you're inside the work. This isn't a calibration problem — it's a fundamental limitation.
Every engineer has lived this story:
Monday standup. The ticket says "Add retry logic to the payment webhook handler." The team estimates 5 points. Two days, maybe three. Straightforward.
The engineer opens the code. The webhook handler is tightly coupled to a synchronous processing pipeline. Adding retries means introducing a queue. The queue needs dead-letter handling. Dead-letter handling reveals that error types aren't properly categorized. Categorizing errors means touching a shared error module that six services depend on. The shared module hasn't been updated in two years and has no tests.
What was estimated as a 5-point, two-day task is now a 13-point, two-week effort. Not because the engineer is slow. Because the information required for an accurate estimate didn't exist until they opened the code.
This happens constantly. And it's not fixable with better estimation processes.
The Cone of Uncertainty
Software engineering has known about this problem for decades. The "Cone of Uncertainty" — first described by Barry Boehm in the 1980s — shows that at the start of a project, estimates can be off by a factor of four in either direction. Even after initial design, they can still be off by a factor of two.
The reason is structural: software is an interdependent system. Changing one component reveals dependencies on other components. Those dependencies have their own complexity, their own technical debt, their own surprises. You can't know the full scope of a change until you're inside it, touching the code, discovering what it actually does vs. what you thought it did.
This isn't a failure of discipline. It's an epistemological limitation. You cannot accurately estimate the complexity of knowledge work before doing it.
The Biases That Make It Worse
Even in theory, accurate estimation is impossible. In practice, human cognitive biases compound the problem.
The planning fallacy. People consistently underestimate how long things take, even when they know they've underestimated in the past. Psychologist Daniel Kahneman documented this extensively — humans anchor on best-case scenarios and discount risks they can't yet see.
Anchoring bias. In planning poker, whoever speaks first sets the anchor. If the tech lead says "this feels like a 5," the team clusters around 5. The estimate reflects social dynamics as much as technical assessment. Research on group estimation consistently shows that independent estimates, averaged, are more accurate than group discussion — but that's not how sprint planning works.
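The averaging effect is easy to illustrate with a minimal sketch. The estimates below are invented for illustration; the point is only how pooling independent guesses differs from clustering around an anchor:

```python
# Hypothetical independent estimates (in points) gathered before any discussion.
independent_estimates = [3, 5, 5, 8, 13]

# Averaging independent guesses pools the private information each estimator holds.
pooled = sum(independent_estimates) / len(independent_estimates)
print(pooled)  # 6.8

# After an anchor ("this feels like a 5"), group estimates tend to cluster near it.
anchored_estimates = [5, 5, 5, 5, 8]
print(sum(anchored_estimates) / len(anchored_estimates))  # 5.6
```

The anchored group converges on the first number spoken, while the independent average preserves the dissenting high estimate that may reflect real knowledge of the code.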
Optimism bias. Engineers, especially experienced ones, tend to estimate based on the best path through the code. "If the handler is structured the way I think it is, this is a 5." They don't account for the discovery that the handler isn't structured that way — because they can't know that yet.
Survivor bias in calibration. Teams try to calibrate their estimates by looking at how accurate past estimates were. But they only calibrate on the tasks they completed. The tasks that blew up and got descoped, reassigned, or canceled don't make it into the calibration data. The reference class is systematically biased toward tasks that went well.
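The selection effect is easy to see with a toy calculation. All numbers below are invented to illustrate the mechanism, not real project data:

```python
# Hypothetical task history: (estimated_days, actual_days, completed?).
history = [
    (2, 2, True),
    (3, 4, True),
    (2, 3, True),
    (5, 20, False),  # blew up and was descoped, so it never "finished"
    (3, 15, False),  # reassigned mid-sprint
]

def mean_overrun(tasks):
    """Average ratio of actual effort to estimated effort."""
    return sum(actual / est for est, actual, _ in tasks) / len(tasks)

completed_only = [t for t in history if t[2]]
print(round(mean_overrun(completed_only), 2))  # 1.28 -- what calibration sees
print(round(mean_overrun(history), 2))         # 2.57 -- the true picture
```

Calibrating only on completed tasks suggests estimates run about 30% over; including the tasks that blew up shows the real overrun is far larger.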
The Real-World Cost
Bad estimates aren't just embarrassing. They cascade through organizations:
Broken commitments. Product managers promise features based on engineering estimates. When estimates are wrong, commitments break. Trust erodes between engineering and product. The response is usually to add "padding" — which means estimates become even more divorced from reality.
Misallocated resources. If you're planning the quarter based on estimates, and the estimates are systematically wrong, your resource allocation is wrong too. Teams end up under-resourced on complex work and over-resourced on simple work because the estimates couldn't distinguish between them.
Sprint theater. Teams commit to a certain number of points per sprint. When they realize mid-sprint that the estimates were wrong, they face a choice: rush to hit the number (sacrificing quality) or miss the target (looking "unproductive"). Neither option produces good outcomes.
Death by re-estimation. Some teams respond to bad estimates by adding more estimation process — refinement sessions, technical spikes, estimation workshops. The irony: they're spending more time estimating and less time doing the work that would reveal the actual complexity.
Better Estimation Doesn't Fix the Problem
Here's where I break from the conventional advice.
The standard response to estimation problems is "get better at estimating." Use reference classes. Do technical spikes. Break work down smaller. Calibrate regularly. Use No Estimates for some work. These are all reasonable practices that will make your estimates marginally less wrong.
But they don't fix the fundamental problem. You're still trying to predict complexity before the information exists. Making the prediction more sophisticated doesn't change the fact that it's a prediction about something inherently unpredictable.
Breaking work down helps — smaller tasks have less variance. But it has diminishing returns. At some point, you're spending more time decomposing and estimating than doing the work. And even small tasks can surprise you when you discover a dependency you didn't know existed.
Technical spikes help — doing a time-boxed investigation before estimating. But a spike that reveals the full complexity of a change is essentially doing the work. If the spike is thorough enough to be accurate, you've already solved the problem. If it's not, you're still guessing.
The fundamental issue is that we're trying to measure output by predicting input. We're guessing at complexity before anyone has written a line of code. And then we're surprised when the guess is wrong.
The Alternative: Measure After, Not Before
What if we flipped the model entirely?
Instead of spending hours estimating how complex work will be, measure how complex it was after it ships. Instead of guessing at Fibonacci numbers in a planning meeting, score the actual code changes that merged to production.
This is the core insight behind GitVelocity. When a PR merges, we analyze the actual code diff and score its complexity across six dimensions: Scope, Architecture, Implementation, Risk, Quality, and Performance & Security. The score is 0-100, based on the real artifact.
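As a rough sketch of what a post-hoc, multi-dimensional score could look like, consider the structure below. The equal weighting and the example dimension values are my own assumptions for illustration, not GitVelocity's actual rubric:

```python
from dataclasses import dataclass

@dataclass
class ComplexityScore:
    """Hypothetical post-merge complexity score; each dimension scored 0-100."""
    scope: int
    architecture: int
    implementation: int
    risk: int
    quality: int
    performance_security: int

    def overall(self) -> float:
        # Equal weighting is an assumption for illustration; a real rubric
        # would likely weight dimensions differently.
        dims = [self.scope, self.architecture, self.implementation,
                self.risk, self.quality, self.performance_security]
        return sum(dims) / len(dims)

# Invented example values for one merged PR.
pr = ComplexityScore(scope=60, architecture=40, implementation=55,
                     risk=30, quality=50, performance_security=35)
print(pr.overall())  # 45.0
```

The key property is that every input comes from the merged diff itself, so the score reflects resolved reality rather than a pre-work guess.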
This eliminates every problem with estimation:
No pre-work guessing. The score comes from what was actually done, not what someone predicted would be done. All the surprises — the unexpected dependencies, the technical debt, the scope discoveries — are already resolved and reflected in the code.
No inflation or gaming. The score is based on the code, not on self-reported estimates. You can't claim a typo fix is complex work. The AI reads the code.
No calibration drift. The same rubric applies to every PR. There's no team-specific baseline that drifts over time. A 45-point PR on Team A is comparable to a 45-point PR on Team B.
No meeting overhead. No planning poker. No estimation workshops. No debates about whether something is a 5 or an 8. The engineers do the work, ship the code, and the score reflects what shipped.
What About Planning?
The obvious objection: "If we don't estimate, how do we plan?"
It's a fair question, and the answer isn't "you don't." It's that estimation and planning are different activities, and decoupling them makes both better.
Planning is about deciding what to work on, in what order, and who should do it. You need priorities, dependencies, and rough sizing. "This is a big effort" vs. "this is small" is usually sufficient. You don't need Fibonacci numbers for that.
Measurement is about understanding what happened — what shipped, how complex it was, how fast your team is moving. That information feeds back into planning: if your team ships ~400 velocity points per week, you have a real baseline for capacity. That's more useful than story points because it's based on actual output, not predicted input.
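A capacity baseline from shipped output is just a trailing average. The weekly totals below are invented to illustrate the idea:

```python
# Hypothetical weekly totals of merged-PR complexity points for one team.
weekly_points = [380, 420, 390, 410]

# A trailing average of shipped complexity gives a capacity baseline
# grounded in actual output rather than predicted story points.
baseline = sum(weekly_points) / len(weekly_points)
print(baseline)  # 400.0
```

Planning against that baseline asks "how much did we actually ship?" instead of "how much did we guess we would ship?"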
The estimation paradox isn't that estimation is useless — rough sizing still helps for planning. It's that estimation was never a good measurement tool. We conflated planning estimates with productivity measurement, and both suffered.
The AI Dimension
AI tools make the estimation paradox even more acute.
A task estimated at 8 points in sprint planning might take 30 minutes with Claude Code. The estimate assumed human implementation speed. That assumption is now wrong by an order of magnitude for some types of work.
Teams respond by re-estimating: "Well, with AI tools, that's really a 3." But which engineers are using AI effectively? It varies. The same task is an 8 for one engineer and a 3 for another, depending on their tool proficiency. The estimate becomes person-dependent, which defeats the purpose of shared estimation.
Output measurement bypasses this entirely. It doesn't matter how long the work took or what tools were used. It measures what shipped. A complex feature scores the same whether it took two hours with AI or two weeks without it. The complexity of the code is the complexity of the code.
Stop Predicting, Start Measuring
The estimation paradox has been with us for forty years. We've tried Fibonacci sequences, t-shirt sizes, ideal days, relative points, No Estimates, and every variation in between. Each one was a more sophisticated way of predicting something that can't be predicted.
The alternative is straightforward: stop trying to measure output before it exists. Measure the output itself — the actual code — after it ships.
It took AI to make this practical at scale. But now that it's practical, there's no good reason to keep playing planning poker with numbers everyone knows are made up.
The code is the code. It tells the truth about what was actually done. No estimation required.
GitVelocity measures the complexity of shipped code — not predicted complexity. No story points needed.
Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.