Engineering Management Tools Are Getting an AI Overhaul
Engineering management tools evolved in waves: spreadsheets, Jira, DORA dashboards, AI analytics. Here's what the latest wave unlocks.
I've been building and using engineering management tools for years. Every generation promised to finally give engineering leaders the visibility they needed. Every generation got us partway there and then hit a wall.
We're in a new generation now — one powered by large language models — and for the first time, the wall is actually breaking. Let me walk through how we got here and what's genuinely different this time.
Wave One: Spreadsheets and Gut Feel
The earliest engineering "management tools" were spreadsheets and whiteboards. You tracked projects in a Google Sheet. You estimated timelines based on experience. You assessed team performance based on who seemed busy and who shipped on time.
This worked when teams were small enough that the engineering manager personally reviewed most code changes and had direct context on every project. At five engineers, you can hold the entire team's work in your head. At twenty, you can't.
The failure mode of Wave One was invisible: you didn't know what you didn't know. The manager who thought the team was doing fine was often wrong, but they had no data to contradict their intuition.
Wave Two: Project Tracking
Jira, Linear, Asana, and their predecessors brought structure. Work got decomposed into tickets. Tickets got estimated, assigned, and tracked through workflows. You could see a board, watch cards move across columns, and feel like you had visibility.
The problem was that project tracking tools measure process, not output. A ticket moving from "In Progress" to "Done" tells you a unit of work was completed, but nothing about the quality, complexity, or value of that work. A ticket for "Add loading spinner" and a ticket for "Redesign authentication architecture" both show up as one card in the Done column.
Story points were supposed to fix this by assigning relative complexity to tickets. In practice, story points became a currency of negotiation — teams learned to estimate generously so they could reliably "hit their velocity targets." The metric became the goal, and the goal became meaningless. I wrote about this dynamic in "Why Story Points Failed."
Wave Two gave us the illusion of measurement. It was better than nothing, but it measured the container (the ticket) rather than the contents (the actual code change).
Wave Three: DORA and Delivery Metrics
The DevOps Research and Assessment (DORA) framework brought rigor. Four metrics — deployment frequency, lead time for changes, change failure rate, and time to restore service — gave teams a shared vocabulary for measuring delivery capability.
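All four metrics reduce to arithmetic over deploy and incident timestamps. Here's a minimal sketch in Python, assuming a made-up record shape (the `commit`, `deploy`, `failed`, and `restored` field names are illustrative, not from any particular tool):

```python
from datetime import datetime, timedelta

# Hypothetical deploy records over a one-week window.
deploys = [
    {"commit": datetime(2024, 1, 1, 9), "deploy": datetime(2024, 1, 1, 15),
     "failed": False, "restored": None},
    {"commit": datetime(2024, 1, 2, 10), "deploy": datetime(2024, 1, 3, 11),
     "failed": True, "restored": datetime(2024, 1, 3, 13)},
    {"commit": datetime(2024, 1, 4, 8), "deploy": datetime(2024, 1, 4, 18),
     "failed": False, "restored": None},
]
window_days = 7

# Deployment frequency: deploys per day over the window.
deploy_frequency = len(deploys) / window_days

# Lead time for changes: mean commit-to-deploy latency.
lead_times = [d["deploy"] - d["commit"] for d in deploys]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change failure rate: share of deploys that caused a failure.
failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

# Time to restore service: mean failure-to-restore latency.
restores = [d["restored"] - d["deploy"] for d in deploys if d["failed"]]
mean_restore = sum(restores, timedelta()) / len(restores) if restores else None

print(deploy_frequency, mean_lead_time, failure_rate, mean_restore)
```

Notice what's missing from the inputs: nothing in these records describes what any deploy actually contained. That gap is the limitation discussed below.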
DORA was a genuine step forward. It shifted focus from activity (how many tickets did we close?) to delivery (how effectively do we ship changes?). Companies that adopted DORA metrics saw real improvements in their delivery processes.
But DORA has a fundamental limitation: it measures the pipeline, not the product. A team that deploys ten trivial config changes per day looks identical to a team that deploys ten complex feature releases per day. High deployment frequency says nothing about the value of what's being deployed.
DORA also operates at the team level. It can tell you that Team A ships faster than Team B, but it can't tell you why. Is it a tooling issue? A skill gap? An architectural bottleneck? A single person creating a review queue? DORA metrics can't answer these questions because they don't look at the work itself.
Wave Three gave us process metrics. Valuable, but incomplete.
Wave Four: AI-Powered Code Understanding
Here's where things get genuinely different. Large language models can do something no previous tool could: read a code diff and understand what it does.
Not pattern-match against style rules like a linter. Not count lines changed like a statistics tool. Actually understand the architectural significance of a change, the implementation sophistication, the risk profile, the quality of the approach.
This capability unlocks a fundamentally new category of engineering analytics. Instead of measuring proxies for work (tickets, deploys, cycle time), you can measure the work itself.
The shift looks like this:
From counting to evaluating. Previous tools counted things — commits, PRs, deploys, tickets. Counting is easy to automate but tells you nothing about substance. AI-powered tools evaluate the content of each change, scoring complexity across dimensions like scope, architecture, implementation quality, risk, and performance.
From team averages to individual visibility. DORA and its relatives produce team-level metrics. Useful for benchmarking, but they hide the variance within a team. When you can score individual contributions, you can see that your "average velocity" team actually has two engineers doing 80% of the complex work and three engineers shipping only trivial changes. That's actionable information that team averages bury.
From rule-based to contextual. Static analysis tools apply rules — this function is too long, this class has too many dependencies, this pattern violates the style guide. These rules are useful but rigid. An LLM evaluates code in context. It understands that a 200-line function might be perfectly reasonable if it's implementing a complex state machine, or that a seemingly simple three-line change might have significant architectural implications because of where it sits in the dependency graph.
From lagging to leading indicators. DORA metrics tell you what already happened. AI-powered analysis can identify patterns as they emerge — an engineer whose complexity scores are trending down, a team that's producing increasingly superficial changes, a codebase where architectural debt is accumulating. These are signals you can act on before they become problems.
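The leading-indicator idea is mechanically simple once you have per-change scores: fit a trend to each engineer's recent scores and surface sustained declines. A sketch, assuming hypothetical score histories and an illustrative threshold (neither the names nor the cutoff come from any real product):

```python
from statistics import mean

def trend_slope(scores):
    """Least-squares slope of a score series over equally spaced reviews.

    A sustained negative slope is the kind of leading indicator described
    above; it's a prompt for a conversation, not a conclusion.
    """
    n = len(scores)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Hypothetical per-PR complexity scores (1-10) for two engineers.
history = {
    "alice": [7, 8, 7, 8, 7, 8],   # stable
    "bob":   [8, 7, 6, 5, 4, 3],   # sliding
}

DECLINE_THRESHOLD = -0.5  # points per review; an illustrative cutoff
flagged = [name for name, s in history.items()
           if trend_slope(s) < DECLINE_THRESHOLD]
print(flagged)
```

A real system would need more care than this (seasonality, vacation gaps, small-sample noise), but the shape of the signal is the same: direction over time, visible before any quarterly review would catch it.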
What This Means in Practice
Let me make this concrete with a few scenarios I've seen play out.
Identifying hidden top performers. One of our portfolio companies at Headline discovered that a mid-level engineer was consistently producing the highest-complexity work on the team. She wasn't vocal in meetings and didn't self-promote. Traditional visibility would have overlooked her. AI-powered scoring made her output visible, and she was promoted within months.
Catching productivity slides early. Another team noticed that an engineer's complexity scores dropped sharply over a three-week period. The manager had a one-on-one conversation and learned the engineer was dealing with a personal issue and struggling with a new part of the codebase simultaneously. Early intervention — pairing them with a senior engineer on the unfamiliar codebase — resolved the problem before it became a performance issue.
Making the AI adoption case. When Headline's own engineering team started adopting AI coding tools, we could measure the impact directly. Complexity scores per engineer increased. Junior engineers started producing work that looked like senior-level output. The data was clear enough that it changed our hiring strategy across the portfolio.
Where This Goes Wrong
I want to be honest about the failure modes, because every wave of engineering management tools has had them.
The biggest risk is using AI-powered metrics for surveillance rather than insight. If engineers feel like they're being watched and graded by an algorithm, you'll get the same resistance that killed every previous measurement approach. The "engineers hate being measured" problem doesn't go away just because the measurement is better.
The solution is transparency. Engineers should see their own scores. They should understand how scores are calculated. And scores should be used primarily for development conversations, resource allocation, and process improvement — not as a blunt instrument for performance reviews.
The second risk is over-indexing on a single number. A complexity score is a useful signal, but it's not the complete picture of an engineer's value. Mentoring, architectural guidance, incident response, cross-team collaboration — these matter enormously and don't show up in PR scores. Good engineering leaders use AI-powered metrics as one input among many, not as a replacement for judgment.
The Compounding Advantage
What makes this wave different from the previous ones is that the underlying technology — LLMs — continues to improve rapidly. The scoring models get better at understanding code. The insights get more nuanced. The pattern detection gets more accurate.
Teams that adopt AI-powered engineering analytics now will build a historical dataset that becomes more valuable over time. Six months of scoring data lets you identify trends. A year lets you measure the impact of process changes. Two years lets you build genuinely predictive models for your specific organization.
The companies that wait will eventually adopt these tools too. But they'll be starting from scratch while their competitors have years of data and institutional knowledge about what the data means.
Every wave of engineering management tools was an improvement over the last. Jira was better than spreadsheets. DORA was better than story points. AI-powered code understanding is better than all of them — not because it replaces what came before, but because it finally measures the thing that matters: the actual work.
GitVelocity measures engineering velocity by scoring every merged PR using AI. It represents the next wave of engineering management tools — one that evaluates the substance of engineering work, not just the process around it.
Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.