6 min read · Engineering Culture

From Skepticism to Competition: How Our Engineers Learned to Love Velocity Scores

When we rolled out GitVelocity internally at Headline, the reaction followed a predictable arc: skepticism, testing, acceptance, then something we didn't expect — competition.

I'm going to tell you exactly what that rollout looked like for our own engineering team. Not the sanitized version. The real one.

Because if you're considering introducing measurement to your team, you deserve to know what the first few weeks actually look like: the pushback, the testing, the turning point, and the thing that happened afterward that nobody predicted.

Phase 1: Skepticism

The initial reaction was exactly what you'd expect from a room full of engineers who've been burned by bad metrics before.

"Can AI really judge my code?"

"What if it scores my infrastructure work low because there's no UI?"

"Is this going to be used against us in reviews?"

"I spent three days on that bug — the fix was one line. How does that score?"

These aren't unreasonable questions. They're the questions any thoughtful engineer would ask. They've seen story points weaponized, commit counts gamified, and "productivity tools" that were actually surveillance tools. The skepticism was earned.

We didn't try to convince anyone the system was perfect. We said: "Look at your scores. Tell us where they're wrong. We'll fix it."

That turned out to be the most important thing we did.

Phase 2: Testing

Within the first week, engineers started checking their scores. Not because they were told to — because they were curious. Engineers are builders. They can't resist testing a system.

The testing was methodical. They'd look at a PR they were proud of — something architecturally significant or technically difficult — and check whether the score reflected the complexity. Then they'd look at a simple PR — a config change, a dependency bump — and see if it scored low.

Some specific moments from that first week:

The infrastructure test. One of our senior engineers had spent a week building out a new caching layer. No UI. No user-facing features. Pure infrastructure. His concern was that the system would score it low because there was nothing "visible." It scored 62 — high complexity across Architecture, Implementation, and Risk. He didn't say anything, but he started checking his scores daily after that.

The one-line fix test. Another engineer had tracked down a race condition that was causing intermittent data corruption. Three days of investigation, one line of code changed. The PR scored 14 — low, because the code change was objectively simple. This was the hardest conversation. We acknowledged: the scoring measures the complexity of the code change, not the investigation effort. A simple fix is a simple fix, even if finding it was hard. The investigation work is real but isn't captured in the PR artifact. We were honest about this limitation rather than pretending the system captured everything.

The boilerplate test. A junior engineer had generated a bunch of CRUD endpoints using patterns from existing code. Lots of files, lots of lines, straightforward work. It scored 22. They initially thought it should be higher because of the volume. Then they looked at the breakdown: Scope was moderate (many files touched), but Architecture, Implementation, and Risk were all low. The score made sense. Volume isn't complexity.
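If it helps to picture the breakdown, here's a minimal sketch of what a per-PR score record could look like. The field names, weights, and numbers are illustrative assumptions, not GitVelocity's actual model; the point is that a high Scope value alone can't drive the total when the other dimensions stay low.

```python
from dataclasses import dataclass

@dataclass
class PRScore:
    """Illustrative per-PR breakdown; the fields mirror the dimensions named above."""
    scope: int           # breadth: files and subsystems touched
    architecture: int    # structural/design complexity introduced or changed
    implementation: int  # difficulty of the code itself
    risk: int            # blast radius if the change is wrong

    def total(self) -> int:
        # Hypothetical weighting (percentages): Scope gets the smallest share,
        # so touching many files without real complexity still scores low.
        weighted = (15 * self.scope + 35 * self.architecture
                    + 30 * self.implementation + 20 * self.risk)
        return weighted // 100

# The boilerplate CRUD PR: many files, low everything else.
crud = PRScore(scope=55, architecture=10, implementation=15, risk=12)
print(crud.total())   # 18 -> volume alone doesn't read as complexity

# A deep infrastructure change: fewer files, far more structural weight.
cache_layer = PRScore(scope=40, architecture=75, implementation=70, risk=65)
print(cache_layer.total())  # 66 -> complexity shows up even with no UI
```

Again, the weights here are invented for illustration; the real scoring is model-driven, with the breakdown and reasoning shown alongside each score.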

Phase 3: Acceptance

Acceptance didn't arrive with a declaration. There was no meeting where everyone said "okay, we trust it now." It happened gradually, over about three weeks.

The signal was behavioral. Engineers stopped questioning whether the scores were fair and started questioning how to improve them. The conversation shifted from "is this metric legitimate?" to "how do I do higher-complexity work?"

That's a profound shift. And it only happens when the underlying measurement is genuinely fair. You can't trick engineers into accepting a bad metric. They'll find the flaws and lose trust permanently. The acceptance came because, PR after PR, the scores matched their intuition about what was simple and what was complex.

A few things that built trust:

Transparency. Every score came with a dimension-by-dimension breakdown and reasoning. Engineers could see exactly why their PR scored what it did. No black box.

Consistency. The same kind of work got the same kind of score, regardless of who did it. No favoritism, no recency bias, no manager judgment.

No punishment. Low scores weren't treated as problems. We explicitly said — and meant — that simple work is necessary. Bug fixes, config changes, dependency updates. The score isn't a grade. It's a measure of complexity.

Phase 4: Competition

This is the part we didn't predict.

About a month in, engineers started competing. Not because we introduced a leaderboard — we hadn't. Not because we tied scores to compensation — we hadn't done that either. The competition emerged organically.

It started in our weekly engineering meeting. Someone mentioned their weekly velocity total. Someone else mentioned theirs. Within two weeks, the weekly meeting had an informal velocity segment where people shared their numbers.

Engineers started celebrating each other's high-scoring PRs. "Did you see Sarah's PR? Scored 71 — that auth refactor was massive." This wasn't management-driven recognition. It was peer recognition, fueled by a shared understanding of what the scores meant.

Some engineers started setting personal targets. Not mandated — self-imposed. "I want to average 35+ per PR this month." The targets were about pushing themselves toward more complex, more impactful work.

The weekly aggregate became a team metric everyone watched. Not as pressure — as pride. When the team's weekly velocity went up, people were genuinely excited. It felt like a team sport.
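If you want to watch the same number for your own team, the weekly aggregate is nothing fancy: the sum of per-PR scores grouped by merge week. A minimal sketch, assuming you can already export (merge date, score) pairs from whatever scoring tool you use:

```python
from collections import defaultdict
from datetime import date

# Hypothetical export of (merge_date, score) pairs; the scores echo the PRs above.
merged_prs = [
    (date(2025, 11, 3), 62),
    (date(2025, 11, 5), 14),
    (date(2025, 11, 6), 22),
    (date(2025, 11, 12), 71),
]

weekly_totals: dict[tuple[int, int], int] = defaultdict(int)
for merged_on, score in merged_prs:
    iso = merged_on.isocalendar()
    weekly_totals[(iso.year, iso.week)] += score  # group by ISO week

for (year, week), total in sorted(weekly_totals.items()):
    print(f"{year}-W{week:02d}: {total}")
# 2025-W45: 98
# 2025-W46: 71
```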

The Junior Surprise

Here's the thing nobody expected: some junior engineers started outperforming seniors on velocity metrics.

AI tools had changed the game. Junior engineers who adopted Claude Code and Cursor aggressively could ship complex work that would have been beyond their reach a year ago. The implementation skill gap between junior and senior engineers had compressed because AI now handles much of the implementation. What remained was the ability to reason about architecture and direct the work, and some juniors had more of that ability than anyone realized.

Without objective measurement, this would have been invisible. The juniors would have stayed "junior" in everyone's perception. The data showed otherwise.

This wasn't threatening to the senior engineers — it was energizing. Several of them accelerated their own AI adoption after seeing the results. The competitive dynamic cut across seniority levels in a healthy way.

The Numbers

I'll share what happened to our aggregate team metrics during the rollout:

Our team's aggregate productivity nearly doubled from August to November 2025. That was the initial adoption curve — engineers getting comfortable with measurement and AI tools simultaneously.

December still showed increases, despite being holiday-heavy.

January 2026 saw another 50% jump. That was when we announced velocity scores would factor into annual performance reviews. The combination of objective measurement and real consequences accelerated output significantly.

We went from not knowing how productive our team was to having a precise, weekly picture of what shipped and how complex it was.

Practical Advice for Your Rollout

Based on what we learned, here's what I'd tell any leader introducing measurement:

Start transparent. Engineers should see their own scores from day one, with full breakdowns. If you're introducing measurement and the engineers can't see how it works, you've already lost trust.

Let engineers validate. Give them time to test the system against their intuition. Expect and welcome pushback. The engineers who push hardest are the ones who care most about fairness.

Don't tie to compensation immediately. Let the system build trust first. We waited months before connecting scores to reviews. By that point, engineers trusted the metric enough that the connection felt fair rather than threatening.

Use data to celebrate, not punish. When velocity drops, ask "what's blocking you?" not "why aren't you performing?" The first question builds trust. The second destroys it.

Share your own scores. If you commit code, show your numbers. Leading by example matters more than any messaging.

Give it time. Trust builds slowly. Our arc from skepticism to competition took about six weeks. Don't panic if the first two weeks are uncomfortable.

Be honest about limitations. The one-line fix problem is real. Not every dimension of engineering contribution is captured in code complexity. Acknowledging limitations builds more trust than pretending they don't exist.

Why This Matters

The journey from skepticism to competition isn't just a feel-good story. It demonstrates something important: engineers don't hate measurement. They hate bad measurement.

Give them a metric that's fair, transparent, and consistent — one that recognizes complex work and doesn't punish simple work — and they'll not only accept it. They'll compete on it. They'll celebrate each other's wins. They'll use it to push themselves.

That's not what happens with story points or commit counts. Those metrics breed cynicism because they deserve cynicism. Fair measurement breeds motivation.

The difference is whether the metric actually captures what engineers care about: the complexity and quality of their craft.


GitVelocity measures engineering velocity by scoring every merged PR using AI. Every score is transparent and explainable.

Get started to see your team's real velocity.

Written by Conrad Chu

Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.