
Agentic Coding Is Here. Your Metrics Haven't Caught Up.

AI agents now write, test, and iterate on code autonomously. Engineers are becoming orchestrators, not typists. Every existing metric is blind to this shift.

Something fundamental shifted in software development, and most engineering leaders haven't fully processed it yet.

Six months ago, AI coding tools were autocomplete on steroids — suggesting the next line, filling in boilerplate, completing function signatures. Useful, but incremental. The engineer was still the writer. The AI was a faster keyboard.

That's not what's happening now.

Tools like Claude Code, Cursor, and agentic workflows in VS Code don't just suggest code. They write entire features. They scaffold test suites. They refactor modules. They iterate on their own output based on error messages. The engineer describes the intent, reviews the result, and directs the next iteration.

Engineers are becoming orchestrators. And every metric we use to measure their work was designed for a world where they were typists.

What Agentic Coding Actually Means

Let me be specific, because "AI writes code" is vague enough to mean anything.

Agentic coding is when an AI system takes a high-level instruction and autonomously executes multiple steps to fulfill it. Not one line at a time — an entire workflow.

Here's what that looks like in practice:

  • "Add retry logic with exponential backoff to the payment webhook handler." The agent reads the existing code, identifies the handler, writes the retry logic, adds configuration for max retries and backoff intervals, writes unit tests, runs them, fixes failures, and presents the complete changeset.

  • "Refactor the user authentication module to use the new session service." The agent maps the dependency graph, updates imports, rewrites the integration points, updates tests, and flags breaking changes it can't resolve automatically.

  • "Create a new API endpoint for bulk user export with pagination and CSV download." The agent scaffolds the controller, service, DTO, adds validation, implements cursor-based pagination, generates the CSV transformation, writes integration tests, and updates the API documentation.

Each of these would have been a multi-day task for a strong mid-level engineer. With agentic tools, the implementation phase compresses to minutes. The engineer's time goes to problem decomposition, architecture decisions, code review, and directing the next task.
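The first task above, retry with exponential backoff, is the kind of changeset an agent produces. As a reference point for what "complete" looks like, here is a minimal sketch of that pattern; the `TransientError` class and `with_retries` wrapper are hypothetical names for illustration, not code from any specific tool.

```python
import random
import time

class TransientError(Exception):
    """Raised for failures worth retrying (e.g. a 5xx from the payment provider)."""

def with_retries(handler, max_retries=5, base_delay=0.5, max_delay=30.0):
    """Wrap a webhook handler so transient failures are retried with
    exponential backoff and jitter. All names here are illustrative."""
    def wrapped(event):
        for attempt in range(max_retries + 1):
            try:
                return handler(event)
            except TransientError:
                if attempt == max_retries:
                    raise  # out of retries: surface the failure to the caller
                # Exponential backoff: base, 2x, 4x, ... capped at max_delay,
                # with jitter so concurrent retries don't stampede the provider.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(delay * random.uniform(0.5, 1.5))
    return wrapped
```

The point isn't the fifteen lines themselves; it's that the agent also writes the tests, runs them, and fixes failures before the engineer ever reviews the diff.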

Why Every Existing Metric Breaks

Think about what this does to traditional measurement:

Lines of code? The agent wrote them. Measuring lines of code now measures how verbose your AI tool is, not how productive your engineer is.

Commit count? Agentic workflows often produce complete features in single commits. Fewer commits, dramatically more output. An engineer using agentic coding looks less productive by commit count.

Story points? A task estimated at 8 points during sprint planning might take 20 minutes with an agentic workflow. But it's still an 8-point task in terms of complexity. The estimation system assumes human implementation speed. That assumption is now wrong by an order of magnitude.

DORA metrics? Deployment frequency might go up. Lead time might shrink. But DORA can't tell you whether the engineer is shipping complex architectural work or AI-generated boilerplate. Process velocity without output quality measurement is noise.

PR count? Closer, but still blind. An engineer who ships three PRs using agentic coding — each containing what would have been a week of manual work — looks identical to one who ships three trivial config changes.

The metrics aren't just inaccurate. They're inversely correlated with actual productivity in an agentic workflow. The most productive engineers — the ones leveraging AI to ship the most complex work — can look the same as or worse than engineers doing simple tasks manually.

The New Differentiator: Orchestration, Not Typing

In the pre-AI world, a "fast" engineer was one who could hold complex systems in their head and translate that understanding to code quickly. Speed of typing and breadth of API knowledge mattered.

That skill set is depreciating. Not worthless — understanding systems deeply still matters for directing AI. But the competitive advantage has shifted.

The new differentiator is parallelization. How many workstreams can you manage simultaneously? How quickly can you move from reviewing one AI-generated changeset to directing the next? How effectively can you decompose a large project into tasks that AI can execute?

I've watched this play out across Headline's portfolio companies. The engineers who thrive with agentic tools aren't necessarily the ones who were fastest typists. They're the ones who think in systems, decompose problems cleanly, and review code efficiently.

And here's what surprised us: some junior engineers became dramatically more productive with agentic tools. The implementation skill gap between junior and senior shrank because AI handles the implementation. What remained was the ability to reason about architecture and direct work — skills that some juniors had but couldn't demonstrate when they were bottlenecked by their typing speed and API memorization.

Without objective output measurement, none of this was visible.

What Measurement Needs to Capture Now

If the engineer's role is shifting from writer to orchestrator, measurement needs to evaluate the outcome, not the process.

Measure the artifact, not the method. It doesn't matter whether code was typed by hand, pair-programmed with AI, or generated entirely by an agent. What matters is what merged to production and how complex it was.

Measure complexity, not volume. An agentic workflow can produce thousands of lines of boilerplate in minutes. That's not the same as producing a complex distributed systems design. Measurement needs to distinguish between AI-generated scaffolding and genuinely sophisticated engineering work.

Measure throughput over time. In the agentic era, weekly velocity — the total complexity of shipped work — is more meaningful than any single PR score. An engineer orchestrating AI across five workstreams simultaneously should be measured by their aggregate shipped output, not by any individual commit.

Be AI-agnostic. This is critical. The scoring system can't care whether AI was used. It scores what shipped. Period. This eliminates the impossible question of "how much of this was the AI vs. the human?" It doesn't matter. What matters is the outcome.

This is why we built GitVelocity to score the actual code diff of every merged PR across six dimensions: Scope, Architecture, Implementation, Risk, Quality, and Performance & Security. The score reflects the engineering complexity of the change — regardless of how it was produced.
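To make the aggregation concrete, here is a minimal sketch of "measure throughput over time": each merged PR carries six dimension scores, and weekly velocity is the total shipped complexity per ISO week. How the six dimensions combine is an assumption here (a plain average), not GitVelocity's actual formula.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# The six dimensions named in the article; the 0-10 scale is an assumption.
DIMENSIONS = ("scope", "architecture", "implementation",
              "risk", "quality", "performance_security")

def pr_score(dimension_scores):
    """Collapse per-dimension scores into one PR score.
    A simple average is an illustrative choice, not GitVelocity's method."""
    return mean(dimension_scores[d] for d in DIMENSIONS)

def weekly_velocity(merged_prs):
    """Sum PR scores by ISO week: aggregate shipped complexity per week.
    `merged_prs` is a list of (merge_date, dimension_scores) pairs."""
    totals = defaultdict(float)
    for merged_on, dims in merged_prs:
        year, week, _ = merged_on.isocalendar()
        totals[(year, week)] += pr_score(dims)
    return dict(totals)
```

Because the input is just merged diffs and their scores, the computation is identical whether a PR was typed by hand or orchestrated through an agent.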

The AI Adoption Signal

Complexity-based measurement gives leaders something they've never had: a natural signal for AI adoption effectiveness.

When an engineer starts using agentic tools effectively, the pattern in their velocity data is unmistakable:

  • Weeks 1-2: Velocity stays roughly flat. They're learning the tool, figuring out what it can and can't do.
  • Weeks 3-4: Velocity increases 20-40%. They've integrated AI into routine tasks. Boilerplate disappears from their workflow.
  • Month 2+: Velocity stabilizes at a new, higher baseline. They're shipping more work at similar or higher complexity levels.
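The step-up pattern above can be checked mechanically: compare recent weekly velocity against the pre-adoption baseline. The window sizes and what counts as a meaningful lift are illustrative assumptions, not a published threshold.

```python
from statistics import mean

def velocity_lift(weekly_scores, baseline_weeks=2, recent_weeks=2):
    """Ratio of recent weekly velocity to the pre-adoption baseline.

    Reading it against the pattern above: a lift near 1.0 looks like
    weeks 1-2 (learning), ~1.2-1.4 like weeks 3-4 (integration), and a
    sustained higher ratio like the new month-2+ baseline.
    """
    baseline = mean(weekly_scores[:baseline_weeks])
    recent = mean(weekly_scores[-recent_weeks:])
    return recent / baseline
```

For example, a weekly series of 10, 10, 13, 14, 18, 19 yields a lift of 1.85: the engineer is shipping nearly twice the complexity they did before adopting the tool.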

At Headline, we saw our team's aggregate productivity nearly double over four months. That wasn't because we mandated AI usage. It was because we could see the results — which engineers were leveraging AI effectively, which needed support, and which were already operating at a higher level.

Leaders keep asking me: "How do I know if my AI investment is working?" The answer isn't in seat utilization metrics from your tool vendor. It's in whether your team's shipped output actually increased.

The Orchestrator's Scorecard

Here's what a high-performing engineer looks like in the agentic era:

  • High aggregate weekly velocity. Not from one massive PR, but from consistently shipping complex work across multiple workstreams.
  • Sustained complexity per PR. They're not just shipping more — they're shipping work that's architecturally significant, not just AI-generated boilerplate.
  • Consistent output. The velocity curve is steady or climbing, not spiking randomly.
  • Breadth of contribution. Working across subsystems, not siloed in one corner of the codebase.

None of this is visible in commit counts, story points, or DORA dashboards. All of it is visible when you score the actual code that ships.

The Window Is Now

Agentic coding is here. It's not coming — it arrived. The engineering organizations that adapt their measurement systems to this reality will have a massive advantage: they'll know who's effective, where to invest in training, and how to allocate their teams.

The ones that don't will keep measuring ghosts — counting commits that mean nothing, estimating points that predict nothing, and tracking pipeline metrics that say nothing about what actually shipped.

The code is the code. It tells the truth. Your metrics should too.


GitVelocity measures engineering velocity by scoring every merged PR using AI — regardless of whether AI helped write it.

See how it works.

Written by Conrad Chu

Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.