Why These Engineering Metrics Can't Be Gamed
Every engineering metric ever invented has been gamed. Story points got inflated. Commits got split. Lines of code got padded. Here's why scoring actual code is different.
Goodhart's Law — "When a measure becomes a target, it ceases to be a good measure" — haunts every attempt to measure engineering productivity. It's not theoretical. It's documented, predictable, and almost inevitable.
Story points got inflated within months of being tracked. Commit counts got split the moment they appeared on a dashboard. Lines of code got padded when managers started watching. PR counts led to artificially decomposed work. Every metric we've used has been gamed, and usually faster than anyone expected.
So when someone tells you they have a new engineering metric, the first question should be: how do you game it?
Fair question. Let me answer it honestly.
How Story Points Got Gamed
Story points were supposed to be relative estimates — a shared language for complexity. They were never meant to be compared across teams or tracked over time as a performance signal.
But they were. And once they were, the gaming was immediate:
Inflation. Teams gradually estimated everything higher. Last quarter's 3 became this quarter's 5. The velocity chart went up. Leadership celebrated. Nothing changed. The team was doing the same work and calling it bigger numbers.
Splitting. One feature became five sub-tasks. Each sub-task got its own points. Total points went up. Total output was identical. The engineer who split one 8-point story into four 3-point stories scored 12 instead of 8.
Anchoring. Whoever spoke first in planning poker set the anchor. If the senior engineer said "8," everyone followed. The estimates reflected social dynamics, not complexity analysis.
The fundamental problem: story points are self-reported estimates. When people report their own numbers, and those numbers are used to evaluate them, the numbers become fiction.
How Commit Counts Got Gamed
Commit counts seemed more objective — they come from Git, not from self-reporting. But the gaming was just as fast:
Micro-commits. "Fix typo" as its own commit. "Update import" as its own commit. "Add comment" as its own commit. Each one counted equally. Engineers learned to work in the smallest possible increments.
Whitespace commits. Reformatting a file, adjusting indentation, adding blank lines. These are real commits that show up in the count but contain zero engineering value.
Trivial changes. Updating a version number, changing a config value, modifying a comment. All real commits. All noise.
How PR Counts Got Gamed
PR counts at least measure shipped artifacts. But they incentivize the wrong decomposition:
A feature that's naturally one cohesive PR gets split into seven: the model, the migration, the service, the controller, the route, the tests, the docs fix. Each PR is trivially reviewable but individually meaningless. The total count is seven. The output is one feature.
This isn't even conscious gaming in many cases. Engineers naturally decompose work to match the metric. If PRs are counted, smaller PRs are better. The incentive structure does the rest.
Why Code-Based Scoring Is Different
GitVelocity scores the actual code diff of every merged PR across six dimensions: Scope, Architecture, Implementation, Risk, Quality, and Performance & Security. The score reflects the engineering complexity of the change.
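To make the shape of that concrete, here is a minimal sketch of a per-PR score along those six dimensions. The dimension names come from the article; the 0-10 scale, the equal weighting, and the example numbers are invented for illustration and are not GitVelocity's actual rubric:

```python
from dataclasses import dataclass

@dataclass
class PRScore:
    """Hypothetical per-PR score along the six dimensions.
    Scale and weighting are illustrative assumptions."""
    scope: int
    architecture: int
    implementation: int
    risk: int
    quality: int
    performance_security: int

    def total(self) -> int:
        # Equal weights, purely for illustration.
        return (self.scope + self.architecture + self.implementation
                + self.risk + self.quality + self.performance_security)

# A cross-cutting feature PR vs. a generated CRUD PR:
feature = PRScore(8, 7, 6, 5, 7, 4)   # total 37
crud    = PRScore(2, 1, 1, 1, 3, 1)   # total 9
```

The point of the shape, not the numbers: the score is a function of properties of the diff, so the only input an engineer controls is the code itself.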
Here's why the common gaming strategies don't work:
Splitting PRs doesn't increase total score
If you split a feature into multiple PRs, each PR gets scored individually based on the code it contains. The total complexity across all the PRs is roughly the same as if you'd shipped it in one PR.
A PR that touches the model, service, controller, and tests scores higher on Scope (many subsystems) and Architecture (structural changes) than four separate PRs that each touch one layer. Splitting reduces the Scope and Architecture scores of each individual PR because each one crosses fewer boundaries.
The complexity is in the code, not the packaging. You can't create complexity by rearranging how you ship it.
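The splitting argument can be sketched in a few lines. Assume, as a simplification, that a PR's score sums the complexity of the hunks in its diff (real rubrics also credit cross-boundary interaction, which splitting would actually reduce, as noted above). The file names and complexity values here are made up:

```python
# Toy model: a PR's score as a function of the code it contains.
def score(diff_hunks):
    """Hypothetical per-PR score: sum of per-hunk complexity."""
    return sum(hunk["complexity"] for hunk in diff_hunks)

feature = [
    {"file": "model.py",      "complexity": 5},
    {"file": "service.py",    "complexity": 8},
    {"file": "controller.py", "complexity": 3},
    {"file": "tests.py",      "complexity": 4},
]

one_pr   = score(feature)                      # shipped as one PR
four_prs = sum(score([h]) for h in feature)    # split into four PRs

assert one_pr == four_prs  # splitting redistributes complexity, never creates it
```

Under this toy model the totals are exactly equal; under a rubric that also scores how changes interact across subsystems, the split total comes out lower.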
Boilerplate inflates volume but not score
An engineer could generate 1,000 lines of CRUD boilerplate using AI tools. Lots of files, lots of lines. By volume metrics, that's impressive.
By complexity scoring, it's straightforward. The Implementation score is low because there's no complex logic. The Architecture score is low because it follows existing patterns. The Risk score is low because CRUD operations have low blast radius. The volume doesn't matter. The complexity does.
Unnecessary complexity is self-punishing
"What if I just write deliberately complex code to score higher?"
Two problems with this strategy:
Code review catches it. Teammates review PRs before they merge. Gratuitously complex code gets challenged. "Why is this a state machine? It's a form submission." Unnecessary complexity creates friction in the review process, slowing you down.
The Quality dimension penalizes it. Complex code without corresponding tests scores poorly on Quality. Complex code without documentation scores poorly on Quality. The rubric rewards complex work done well — with tests, with documentation, with clear structure. Complexity for its own sake fails the Quality check.
The only reliable way to improve your score is to ship genuinely complex, well-tested, well-structured code. Which is exactly what we want.
You can't claim complexity that isn't there
This is the key difference from every previous metric. Story points are self-reported — you can claim whatever you want. Commit counts measure git commands — you can run as many as you want. PR counts measure packaging — you can split however you want.
Velocity scores are based on the actual code diff. The AI examines what changed, how the changes interact with the system, and what engineering decisions they reflect. You can't claim architectural significance that isn't in the code. You can't claim implementation complexity that isn't in the logic. The artifact speaks for itself.

The Gaming That's Actually Good
Here's something interesting: the behaviors that would improve your velocity score are behaviors we actively want to encourage.
Take on more complex work. Engineers who tackle architectural challenges, complex business logic, and cross-cutting concerns will score higher. Good. We want people leaning into hard problems.
Write comprehensive tests. The Quality dimension rewards testing. Engineers who test thoroughly score higher. Good. We want well-tested code.
Think about security and performance. The Performance & Security dimension rewards deliberate optimization and hardening. Good. We want engineers who go beyond the defaults.
Ship consistently. Weekly velocity is the aggregate of all merged PRs. Engineers who ship consistently — rather than batching everything into one massive PR — build higher weekly totals. Good. We want continuous delivery.
Use AI tools effectively. Engineers who leverage AI to ship more complex work, faster, will have higher velocity. Good. We want AI adoption.
When the "gaming" strategy is indistinguishable from "doing your best work," the metric is working as designed.
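The "ship consistently" point above is a simple aggregation: weekly velocity is the sum of per-PR scores for PRs merged that week. A sketch, with invented scores and merge dates:

```python
from collections import defaultdict
from datetime import date

def weekly_velocity(merged_prs):
    """Hypothetical aggregation: per-week totals of per-PR scores,
    keyed by (ISO year, ISO week) of the merge date."""
    totals = defaultdict(int)
    for pr in merged_prs:
        year, week, _ = pr["merged"].isocalendar()
        totals[(year, week)] += pr["score"]
    return dict(totals)

prs = [
    {"score": 12, "merged": date(2024, 3, 4)},   # week 10
    {"score": 7,  "merged": date(2024, 3, 6)},   # week 10
    {"score": 9,  "merged": date(2024, 3, 12)},  # week 11
]

weekly_velocity(prs)  # {(2024, 10): 19, (2024, 11): 9}
```

Batching all three PRs into one giant merge wouldn't raise any week's total; it would just concentrate the same complexity into one week and zero out the others.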
No Metric Is Perfect
I want to be honest about the limitations.
Investigation work isn't captured. An engineer who spends three days debugging a production issue and fixes it with a one-line change will get a low score on that PR. The investigation effort was real and valuable. The code change was simple. The score reflects the code change, not the investigation.
Mentoring isn't captured. A senior engineer who spends half their time reviewing others' code and pair programming will have lower personal velocity. Their impact on the team is enormous but doesn't show up in their own PRs.
Some valuable work is low-complexity. Dependency updates, config changes, copy fixes — these are necessary and valuable. They score low because the code changes are simple. That's accurate, but it means some genuinely productive weeks show low velocity.
We're transparent about these limitations because acknowledging them builds more trust than pretending they don't exist. The scoring measures the complexity of shipped code changes. That's the most gaming-resistant signal available, but it's not the only dimension of engineering contribution.
The Bar Is Higher Than "Ungameable"
The goal isn't just a metric that can't be gamed. It's a metric that, even when people optimize for it, produces the outcomes you actually want.
Story points, when optimized for, produced inflated estimates and split tickets. Commit counts, when optimized for, produced micro-commits and noise. PR counts, when optimized for, produced artificially decomposed work.
Velocity scores, when optimized for, produce complex, well-tested, frequently shipped code built with effective use of AI tools.
That's the bar. Not just gaming-resistant, but incentive-aligned. The metric and the desired outcome point in the same direction.
GitVelocity measures engineering velocity by scoring the actual code in every merged PR. The score is grounded in the artifact, not in estimates or activity counts.
Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.