· 6 min read · AI Measurement

Beyond Seat Counts: How to Measure Real AI Tool Adoption

License dashboards and surveys don't tell you if AI tools are working. Track engineering output instead — here's the framework with real team data.

Your company just bought fifty Cursor licenses. Or maybe you rolled out GitHub Copilot Enterprise. Or maybe you gave everyone Claude Pro accounts and said "go be more productive."

Three months later, someone — probably the CFO — asks: "Is it working?"

And you realize you have no idea.

You can pull usage data from the vendor dashboard. Forty-three out of fifty engineers logged in at least once. Twenty-eight used it last week. Average of 340 completions accepted per active user per month. These numbers look precise. They're also completely useless for answering the question that was actually asked.

The Activity Trap (Again)

We've seen this movie before. For decades, engineering leaders measured developer productivity with activity metrics — commits, lines of code, tickets closed — and the metrics told them nothing meaningful. I've written extensively about why this approach is broken.

Now we're making the exact same mistake with AI tools.

Tracking AI usage through acceptance rates, session counts, and completions per day is just activity metrics applied to a different tool. It tells you engineers are interacting with the AI. It doesn't tell you whether the AI is making them better.

Consider two engineers:

Engineer A accepts 500 Copilot suggestions per week. She uses AI constantly for boilerplate, test scaffolding, and autocomplete. Her acceptance rate is high. By any vendor dashboard metric, she's a power user.

Engineer B accepts 80 suggestions per week. He mostly uses AI for complex algorithmic work and architecture exploration, rejecting most suggestions and using the tool primarily as a thinking partner. By vendor metrics, he's barely engaged.

Which engineer is getting more value from the tool? You can't tell from usage metrics. The only way to know is to look at what they're actually producing.

What Matters: Output Before and After

The right question isn't "are engineers using AI tools?" It's "are engineers producing better output since adopting AI tools?"

This reframing changes everything about how you measure.

Instead of tracking acceptance rates, you track output complexity over time. Instead of monitoring session duration, you compare the architectural sophistication of merged PRs before and after AI adoption. Instead of counting suggestions accepted, you look at whether the same engineer is now tackling more difficult problems and shipping higher-quality solutions.

At Headline, this is exactly what we did. When our engineering team started using AI coding tools, we didn't track tool usage at all. We tracked what we'd already been tracking: the AI-scored complexity of every merged PR.

The results were unambiguous. Between August and November 2025, team output nearly doubled. Not because engineers were shipping more PRs — they weren't, at least not dramatically more. The PRs they were shipping were more complex. The average complexity score per PR increased, and the total complexity shipped per engineer per week went up significantly.

That's a measurement of AI impact. "43 out of 50 engineers logged in last month" is not.

Why Consistent Scoring Makes This Work

Here's why output-based measurement works for AI adoption and tool-based measurement doesn't: the scoring is indifferent to how the code was written.

When GitVelocity scores a PR, it evaluates the final code change — the diff between what existed before and what exists after. It doesn't know or care whether the code was written by hand, dictated to Copilot, generated by Claude, or carved in stone tablets and OCR'd into the repo. It evaluates what was built.

This means the scoring is automatically controlled for AI usage. An engineer who writes everything by hand and an engineer who uses AI extensively get evaluated on the same scale, using the same criteria. If the AI-assisted engineer produces higher-complexity work, the scores reflect that. If the AI-assisted engineer produces the same or lower-complexity work, the scores reflect that too.

You don't need a separate AI measurement framework. You don't need to instrument your AI tools or build custom dashboards. You just need a consistent measure of output quality, and the AI adoption impact reveals itself in the trendlines.
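To make the trendline idea concrete, here is a minimal sketch of what a tool-agnostic output measure looks like. The data, field layout, and the 1-10 score scale are invented for illustration and are not GitVelocity's actual schema; the point is simply that the input is a complexity score per merged PR, with no reference to how the code was written.

```python
from collections import defaultdict

# Hypothetical PR records: (engineer, merged_week, complexity_score).
# Nothing here encodes whether AI was involved — only the output is scored.
prs = [
    ("alice", 1, 3.0), ("alice", 2, 3.5), ("alice", 9, 6.0),
    ("bob",   1, 4.0), ("bob",   2, 4.5), ("bob",   9, 5.0),
]

def weekly_avg_complexity(prs):
    """Average complexity score of merged PRs per week."""
    by_week = defaultdict(list)
    for _engineer, week, score in prs:
        by_week[week].append(score)
    return {week: sum(scores) / len(scores) for week, scores in sorted(by_week.items())}

trend = weekly_avg_complexity(prs)
# If AI adoption happened around week 5, the before/after shift
# shows up here without instrumenting any AI tool.
```

The trend map is the entire measurement surface: if it bends upward after rollout, the tools are working; if it doesn't, no acceptance-rate dashboard will rescue the ROI story.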

The Surprising Findings

When you measure AI impact through output rather than usage, you discover things that vendor dashboards would never show you.

Finding one: adoption speed varies dramatically by seniority. We saw this at Headline and across portfolio companies. Junior engineers adopt AI-assisted workflows faster than seniors. This isn't because juniors are more open-minded — it's because they have fewer established patterns to disrupt. A senior engineer with ten years of muscle memory for how they write code has to actively unlearn habits. A junior engineer with two years of experience can adopt a new workflow without fighting old ones.

The practical implication: your AI adoption metrics should be segmented by experience level. A flat "adoption rate" across the team hides the most interesting dynamics.

Finding two: AI amplifies existing skill, but not linearly. The engineers who improved the most with AI tools weren't necessarily the strongest engineers on the team. They were the engineers who already had good architectural judgment but were bottlenecked by implementation speed. AI removed the implementation bottleneck, and their output exploded.

Engineers who struggled with architecture saw smaller gains. AI can generate code fast, but if you don't know what code to generate, speed doesn't help much.

Finding three: the "acceptance rate" metric is often inversely correlated with value. Engineers who accept fewer AI suggestions but use the tool for harder problems often extract more value than engineers who accept many suggestions for trivial completions. High acceptance rates frequently indicate the tool is being used for low-value autocomplete. Low acceptance rates sometimes indicate the tool is being used for high-value exploration where most suggestions are wrong but the few right ones are transformative.

A Framework That Works

If you need to report on AI tool ROI — and you will, because the CFO will ask — here's the framework I'd recommend. It's simpler than most AI measurement frameworks, and it actually works.

Step one: Establish an output baseline before AI adoption. Score your team's PR output for at least 8-12 weeks before rolling out AI tools. You need a stable baseline of per-engineer and per-team complexity scores. If you've already rolled out AI tools, you can use GitVelocity's historical backfill to score older PRs retroactively — the tool supports three-plus months of backfill.

Step two: Roll out AI tools and measure output, not usage. Don't track acceptance rates. Don't monitor session counts. Track the same output metrics you tracked during the baseline period. Watch for changes in per-engineer complexity scores, total team complexity output, and the distribution of scores across the team.

Step three: Segment by engineer. Team averages will improve. That's expected. The interesting question is who improved and by how much. Some engineers will see massive gains. Others will see little change. A few might temporarily see a decrease as they adjust their workflows. This segmentation tells you where to invest in training and support.

Step four: Look for the second-order effects. Beyond raw output, watch for changes in the type of work engineers tackle. Are they taking on harder problems? Are the complexity distributions shifting upward? Are junior engineers producing work that looks like senior output? These qualitative shifts matter as much as the quantitative gains.

Step five: Report impact in terms leadership understands. "Acceptance rate increased 20%" means nothing to a CFO. "Engineering output complexity increased 40% while headcount remained flat" means everything. Translate your AI adoption measurement into the language of business impact: more output per engineer, faster delivery of complex features, higher architectural quality per dollar spent.
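Steps one through three can be sketched in a few lines. The engineer names, score lists, and windows below are invented; the baseline and post-adoption windows would come from your own scoring history.

```python
from statistics import mean

# Hypothetical complexity scores per engineer, before and after AI rollout.
baseline = {"alice": [3.0, 3.5, 3.2], "bob": [4.0, 4.2], "carol": [2.5, 2.8]}
post     = {"alice": [5.8, 6.1, 5.5], "bob": [4.1, 4.3], "carol": [4.0, 4.4]}

def adoption_deltas(baseline, post):
    """Percent change in mean PR complexity per engineer after rollout."""
    return {
        eng: round(100 * (mean(post[eng]) - mean(baseline[eng])) / mean(baseline[eng]), 1)
        for eng in baseline
    }

deltas = adoption_deltas(baseline, post)
# Rank by gain: the spread across engineers — not the team average —
# tells you where to invest in training and support.
ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
```

The team average would look healthy here, but the per-engineer spread is the actionable signal: one engineer nearly doubled, one barely moved, and that difference is invisible in a flat adoption rate.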

Stop Measuring the Tool, Start Measuring the Work

The fundamental error in most AI measurement approaches is that they measure the tool instead of measuring the work.

It's like measuring sales team productivity by tracking how often they use Salesforce. Sure, CRM usage data is mildly interesting. But what you actually care about is revenue closed. If revenue went up, the CRM is working. If it didn't, the CRM isn't the problem — or maybe it is, but usage data won't tell you either way.

AI coding tools are the same. Usage is a leading indicator at best, noise at worst. Output is the only metric that answers the question leadership is actually asking: is this investment making our engineering team more productive?

The good news is that if you're already measuring engineering output with any rigor, measuring AI impact is trivial. You already have the baseline. You already have the scoring. All you need to do is draw the trendline and see whether it bends upward after adoption.

If you're not measuring output yet, AI adoption is a great catalyst to start. You need a baseline anyway. Might as well build the measurement infrastructure that serves both purposes.

GitVelocity measures engineering velocity by scoring every merged PR using AI. Because scores evaluate the code change itself — not how it was written — they automatically reveal whether AI tools are improving your team's output.

See how it works.

Written by Conrad Chu

Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.