7 min read · How It Works

How GitVelocity Scores Code: The Rubric Explained

Every merged PR gets a 0-100 complexity score. Here's exactly how the scoring works, why it matters, and what the numbers mean — no black boxes.

Every PR that merges to your main branch gets a complexity score from 0 to 100. Not a quality judgment. Not a performance grade. A measure of how much engineering complexity was involved in the change.

If you're going to be measured by a system, you deserve to understand it completely. No black boxes, no hand-waving. Here's exactly how it works.

Why Measure Complexity at All?

Engineering work is hard to measure. Lines of code rewards verbosity. Commit count rewards splitting work into tiny pieces. Story points are subjective estimates made before the work happens. None of these reflect what actually shipped.

The result is that engineering effort is often invisible. A developer who spends a week on a gnarly migration across 15 services looks the same on a dashboard as someone who bumped a dependency version — unless a human manager happens to notice.

PR scoring takes a different approach: look at the code that was actually written. An AI reads every pull request — the diff, the file structure, the test coverage — and evaluates it against a structured rubric. The output is a single number that captures the complexity of the change.

This makes engineering work visible. It gives teams a shared language for what they ship, and ensures that hard, impactful work gets recognized.

The Formula

The scoring has two stages:

Final Score = Base Score × Effort Scale Factor

The Base Score is the sum of six rubric categories (max 100) that measure what the PR does — scope, architecture, implementation, risk, quality, and performance/security.

The Effort Scale Factor (ESF) is a multiplier from 0.1× to 1.0× based on PR size. Larger changes require more effort to develop, test, and review, so they get a higher multiplier.
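The two-stage formula is simple enough to sketch in a few lines of Python. This is an illustrative sketch only — the function and dimension names are assumptions, not GitVelocity's actual implementation:

```python
# Illustrative sketch of the two-stage scoring formula; function and
# dimension names are assumptions, not GitVelocity's real code.

def final_score(dimensions: dict[str, int], esf: float) -> float:
    """Sum the six rubric dimensions (max 100), then scale by the ESF."""
    base = sum(dimensions.values())
    if not 0 <= base <= 100:
        raise ValueError("base score out of range")
    if not 0.1 <= esf <= 1.0:
        raise ValueError("ESF out of range")
    return round(base * esf, 1)

# A modest PR: base score 25, Small tier (0.40x)
scores = {"scope": 8, "architecture": 0, "implementation": 6,
          "risk": 4, "quality": 7, "perf_security": 0}
print(final_score(scores, 0.40))  # 10.0
```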

The Six Dimensions

The base score is a simple sum of six dimension scores. We chose addition over weighted formulas because it's easy to understand and verify — you can look at any score breakdown and immediately see where the points came from.

Scope (0-20 points)

Scope measures the breadth of the change. How many files were touched? How many subsystems were affected? Did the change cross architectural boundaries?

Score | What it looks like
0-3   | Single file, minimal change. A configuration update or typo fix.
4-7   | Localized change across 1-3 related files within one subsystem.
8-12  | Multiple related files across a subsystem. UI, API, and database touched.
13-17 | Cross-cutting change spanning multiple subsystems.
18-20 | System-wide impact. New services, breaking changes, or major refactors.

Key factors the AI considers: files modified (weighted by criticality), new APIs or endpoints, database migrations, external service integrations, and cross-team coordination requirements.

A typo fix is a 1. Adding a feature that touches the UI, API layer, service logic, and database schema is a 12-15. Building an entirely new service from scratch is 18-20.

Architecture (0-20 points)

Architecture measures structural impact. Did the PR introduce new abstractions? Change the dependency graph? Establish patterns that future code will follow?

Score | What it looks like
0     | No architectural changes — bug fixes, feature additions within existing patterns.
1-5   | Minimal impact. Internal reorganization.
6-10  | Internal refactoring with improved structure.
11-15 | New dependencies, service boundaries, or module abstractions.
16-20 | Major architectural shifts. New patterns, event-driven design, dependency overhaul.

A score of 0 is common and perfectly fine — most PRs work within existing architecture. High scores are reserved for work that genuinely changes how the system is structured. The AI considers service dependencies, critical path changes, decoupling improvements, and design pattern introductions.

Implementation (0-20 points)

Implementation measures algorithmic and logic complexity. How sophisticated is the business logic? Does the code handle concurrency, complex state machines, or intricate data transformations?

Score | What it looks like
0-5   | Simple CRUD, text changes, configuration. Straightforward data mapping.
6-10  | Business rules with branching logic. Validation with edge cases.
11-15 | Complex algorithms, concurrency, advanced patterns, batch processing.
16-20 | Performance-critical code, complex state management, distributed systems logic.

Adding a field to an API response is a 2. Implementing a discount engine with stacking rules and currency conversion is a 10-13. Building a concurrent job processor with batching, retry logic, and exponential backoff is 15-18.

Risk (0-20 points)

Risk measures deployment and operational complexity. How dangerous is this change to deploy? What's the blast radius? How hard is it to roll back?

Score | What it looks like
0-5   | Easily reversible, backward-compatible. Low blast radius.
6-10  | Some deployment complexity. Public API change, new external dependency.
11-15 | Migration required. Breaking change with migration path. Moderate blast radius.
16-20 | Core data model change. Multi-step deployment. High risk, hard to reverse.

Risk factors that increase the score: database schema changes (+3-5), authentication/security changes (+4-6), external API changes (+3-5). Risk mitigations that decrease it: feature flags (-2), documented rollback plans (-2), canary deployment strategy (-1).
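Taking the midpoint of each published range, the modifiers might combine like this. The base value, midpoints, and clamping behavior are assumptions for illustration — the article only gives the ranges:

```python
# Hypothetical combination of the risk modifiers listed above, using the
# midpoint of each published range; clamping to 0-20 is an assumption.

RISK_FACTORS = {"schema_change": 4, "auth_change": 5, "external_api_change": 4}
RISK_MITIGATIONS = {"feature_flag": 2, "rollback_plan": 2, "canary_deploy": 1}

def risk_score(base: int, factors: list[str], mitigations: list[str]) -> int:
    score = base + sum(RISK_FACTORS[f] for f in factors)
    score -= sum(RISK_MITIGATIONS[m] for m in mitigations)
    return max(0, min(20, score))  # keep within the 0-20 dimension range

# A schema migration shipped behind a feature flag with a rollback plan:
print(risk_score(6, ["schema_change"], ["feature_flag", "rollback_plan"]))  # 6
```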

Quality (0-15 points)

Quality measures the craftsmanship of the change. Test coverage, documentation, code clarity, and maintainability.

Score | What it looks like
0-3   | No tests. No documentation. Quick fix with tech debt.
4-7   | Unit tests for happy path. Basic inline comments.
8-11  | Comprehensive tests including edge cases. Integration tests. Clear docs.
12-15 | Exceptional coverage. E2E tests, contract tests, ADRs, load testing.

This dimension is intentionally capped at 15 rather than 20. Quality matters, but the rubric is weighted toward the complexity of the work itself. The AI evaluates test coverage, edge case handling, integration/E2E tests, API documentation, and migration guides.

Performance & Security (0-5 points)

This captures explicit optimization and hardening work. Not "the code runs fast" but "the engineer deliberately optimized performance or hardened security."

Score | What it looks like
0     | No explicit optimization. Framework defaults only.
1-2   | Basic optimizations, security awareness, input validation.
3-4   | Benchmarks, performance profiling, or security threat analysis.
5     | Comprehensive: defense in depth, threat modeling, load testing, rate limiting, monitoring.

Most PRs score 0-1 here, and that's fine. This rewards engineers who go beyond the defaults when the situation demands it.

The Effort Scale Factor

The ESF adjusts the base score for PR size. A brilliantly executed 5-line fix gets a lower final score than the same quality of work in a 500-line feature, because a larger change demands more development, testing, and review effort.

Tier   | Lines Changed | Multiplier
Nano   | ≤10           | 0.10×
Micro  | 11-50         | 0.25×
Small  | 51-150        | 0.40×
Medium | 151-400       | 0.60×
Large  | 401-800       | 0.80×
XL     | 801+          | 1.00×

There's one additional rule: the breadth bump. If a PR touches significantly more files than its line count would suggest (the file tier is 2+ levels above the line tier), the ESF is bumped up by one tier. This accounts for cross-cutting changes like renaming a widely-used function that touch many files with few lines per file. The bump is capped at +1 tier and can never reduce the ESF.

A Worked Example

A PR titled "Add user notification preferences API" with 280 lines across 9 files:

  • Sub-scores: Scope 14 + Architecture 12 + Implementation 15 + Risk 10 + Quality 8 + Perf 3 = 62 base
  • ESF: 280 lines → Medium tier → 0.60×
  • Final Score: 62 × 0.60 = 37.2
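As a sanity check, the worked example re-run as plain arithmetic (the dimension names below are illustrative):

```python
# Re-running the worked example above as plain arithmetic.
sub_scores = {"scope": 14, "architecture": 12, "implementation": 15,
              "risk": 10, "quality": 8, "perf_security": 3}
base = sum(sub_scores.values())   # 62
esf = 0.60                        # 280 lines fall in the Medium tier
print(round(base * esf, 1))       # 37.2
```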

Why AI Scoring Works

We use Claude to score every PR. The model receives the full diff, PR title, and description, then evaluates the change against each dimension.

The key properties:

No bias. No mood, no politics, no recency bias. The same rubric is applied to every PR, whether it was written by a junior engineer or a staff engineer.

Transparent. Every score comes with a breakdown showing points per dimension and the reasoning behind each. You can see exactly why your PR scored what it did.

Gaming-resistant. The AI reads the actual code. You can't claim complexity that isn't there. Splitting one PR into five doesn't increase your total score — the complexity is in the code, not the ticket count.

Language-agnostic. The rubric works across Python, TypeScript, Rust, Java, Go, Ruby, Swift, Kotlin, C#, and more. It scores code based on what shipped — regardless of whether AI assisted in writing it.

What the Score Is NOT

The score is not a quality judgment. A score of 8 doesn't mean the PR was bad — it means it was small. Typo fixes, config changes, and dependency bumps are important work. They keep the system running. They just aren't complex.

The score is not a performance grade. It measures the complexity of a single change, not the value of the engineer. An engineer who ships ten small PRs in a week (totaling 120 velocity points) might be more productive than one who ships a single 60-point PR.

The score is not a target to optimize for. Chasing high scores would mean avoiding small PRs, which is the opposite of good engineering practice. Small, focused PRs are better for code review, easier to debug, and lower risk. Scores are most useful in aggregate over time, not for any single PR.

The score is not useful in isolation. What matters is velocity over time — the aggregate complexity of work shipped per engineer, per team, per week. Any individual PR is just one data point.

What the Numbers Mean

Range  | What to expect
1-15   | Config tweaks, typo fixes, dependency bumps, minor refactors
16-30  | Focused bug fixes, small features, meaningful test additions
31-50  | Multi-component features, substantial refactors, new API endpoints
51-75  | New systems, complex integrations, architecture changes
76-100 | Large-scale rewrites, critical infrastructure, full features end-to-end

The distribution is intentionally right-skewed. Most work falls in the 10-40 range, and that's healthy. A team where every PR scores 70+ is probably under-decomposing their work.

The goal isn't to push scores higher. It's to make engineering work visible — so that hard, impactful work gets the recognition it deserves. For more on why traditional engineering measurement is broken and how complexity scoring addresses the gap, see our deep dive on measurement.


GitVelocity measures engineering velocity by scoring every merged PR using AI. See the full scoring guide for detailed rubric breakdowns.

Written by Conrad Chu

Conrad is CTO and Partner at Headline, where he leads data-driven investment across early stage and growth funds with over $4B in AUM. Before becoming an investor, he founded Munchery (raised $130M+) and held engineering and product leadership roles at IAC and Convio (IPO 2010). He and the Headline engineering team built GitVelocity to help engineering organizations roll out agentic coding and measure its impact.