Frequently Asked Questions
Is GitVelocity really free?
Yes. GitVelocity is completely free. You supply your own Anthropic API key, which means the expensive part -- AI inference -- runs on your account, not ours. All we operate are lightweight servers to orchestrate the scoring pipeline, so our costs stay minimal. We're Headline, a data-driven venture capital firm with our own engineering team. We built GitVelocity for ourselves and our portfolio companies, loved it, and decided to share it with everyone.
Why do I need to bring my own API key?
GitVelocity uses your Anthropic API key to analyze PR diffs. We only use Anthropic models because they produce the best scoring results in our testing. Your key is encrypted at rest and only decrypted at runtime when a score is being generated -- we never see it in the clear. Bring-your-own-key is also what keeps GitVelocity free: you cover the AI inference cost directly, so we do not need to charge for the product.
Do you store our source code?
No. We only read the diff of each merged pull request -- the same thing you would see in a PR review. We do not pull your full repository or model a graph of your code. The only things we store are the score, its dimensional breakdown, and a brief rationale. The diff is processed and immediately discarded.
Do you sell our data?
No. We're a regulated venture capital firm. We cannot and do not sell data or operate a side business. GitVelocity exists because our engineering team loves building it and sharing it with other engineering organizations.
Can engineers game the system?
GitVelocity is designed to be gaming-resistant. The AI reads actual code diffs and evaluates substance across six dimensions. It does not rely on superficial metrics like raw lines of code or commit counts. A few specifics on what stops the common tactics:
- Effective lines, not raw lines. The line count that feeds the Effort Scale Factor excludes vendor code, lockfiles, images and binary assets, generated data files, snapshots, documentation and pure-text files (.md, .txt, .rst, .adoc), pure formatting changes, and whole-file deletions. A Prettier pass, a lockfile bump, or a docs-only edit does not move the needle. (A sketch of this kind of filter follows the list.)
- Splitting PRs into smaller pieces reduces each individual score via the ESF. Total velocity stays roughly the same.
- Adding meaningless code or duplicated boilerplate does not lift Base Score dimensions. The AI evaluates substance, not volume.
- Combining unrelated changes into one PR does not reliably raise the Base Score either, because the AI assesses coherence and architectural impact, not surface-level scope.
- Quality is capped at 15 points and measures craftsmanship, not test volume. 500 lines of trivial tests score the same as 50 well-targeted ones.
- ESF is a multiplier, not an additive bonus. A low Base Score times a high ESF is still a low score.
- 18 anchored reference examples calibrate the AI's interpretation of each dimension, so a large-but-shallow PR cannot drift into high Base Score territory.
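To make the effective-lines rule concrete, here is a minimal sketch of the kind of filter described in the first bullet. Every name, extension list, and field below is an illustrative assumption, not GitVelocity's actual implementation.

```typescript
// Illustrative only: one way an effective-lines filter could work.
// The exclusions mirror the bullet above; nothing here is real GitVelocity code.
const EXCLUDED_EXTENSIONS = [".md", ".txt", ".rst", ".adoc", ".snap", ".png", ".lock"];
const EXCLUDED_PATHS = [/(^|\/)vendor\//, /(^|\/)node_modules\//, /package-lock\.json$/];

interface ChangedFile {
  path: string;
  linesChanged: number;
  isWholeFileDeletion: boolean;
  isPureFormatting: boolean; // e.g. a whitespace-only or Prettier-only diff
}

function effectiveLines(files: ChangedFile[]): number {
  return files
    .filter((f) => !EXCLUDED_EXTENSIONS.some((ext) => f.path.endsWith(ext)))
    .filter((f) => !EXCLUDED_PATHS.some((re) => re.test(f.path)))
    .filter((f) => !f.isWholeFileDeletion && !f.isPureFormatting)
    .reduce((total, f) => total + f.linesChanged, 0);
}
```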
The reliable path to higher scores is the same as ever: take on work that genuinely matters and ship it well.
How does test code affect my score?
Test code is handled differently at each stage of the scoring formula.
The Base Score is calculated from implementation code only. Test files and documentation files are excluded from the Base Score evaluation. The AI focuses on the complexity of what you built, not how much test code surrounds it.
The Quality dimension (capped at 15 out of 100 points) assesses test quality -- edge case coverage, test types, integration tests -- not test volume. Writing 500 lines of trivial tests scores the same as writing 50 well-targeted ones.
The Effort Scale Factor does include test files in its line count, because writing tests is real engineering effort. But the ESF is a multiplier on your Base Score, so inflating test volume without substantive implementation work produces marginal gains at best. A low Base Score multiplied by a higher ESF is still a low score.
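Putting the three stages together, the flow might look roughly like this. The helper names and signatures are assumptions for illustration; the only facts carried over from above are the split itself: tests excluded from the Base Score input, tests included in the ESF line count, and the final multiplication.

```typescript
// Illustrative sketch, not GitVelocity's real API. isTestFile,
// scoreImplementation, and esfFor are hypothetical helpers.
interface ChangedFile { path: string; linesChanged: number; }

declare function isTestFile(path: string): boolean;
declare function scoreImplementation(files: ChangedFile[]): number; // AI rubric, 0-100
declare function esfFor(lines: number, fileCount: number): number;  // tier multiplier

function finalScore(files: ChangedFile[]): number {
  // Stage 1: the Base Score is evaluated on implementation code only.
  const implementation = files.filter((f) => !isTestFile(f.path));
  const baseScore = scoreImplementation(implementation);

  // Stage 2: the ESF line count keeps test files -- writing tests is real effort.
  const totalLines = files.reduce((sum, f) => sum + f.linesChanged, 0);
  const esf = esfFor(totalLines, files.length);

  // Stage 3: ESF is a multiplier, so a low Base Score stays low.
  return baseScore * esf;
}
```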
Why did my small, complex change score lower than a larger refactor?
This one trips people up early. A tight 40-line bug fix with a clever algorithm lands at a 7, while a 600-line multi-file refactor lands at a 70. On the surface it looks backwards.
Here is what is going on. The final score is Base Score x Effort Scale Factor. The Base Score evaluates the substance of the change across six dimensions. The ESF is a multiplier tied to the scale of the change itself -- effective lines modified, with a file-count adjustment for cross-cutting work.
The 40-line fix may earn a strong Implementation score, but it sits in the Micro tier (11-50 lines, 0.25x). The 600-line refactor sits in the Large or XL tier (0.80x-1.00x), so even a modest Base Score gets multiplied into a much bigger final number. (Note: a pure documentation edit would collapse to the Nano tier, since .md and .txt files are excluded from the effective-lines count -- see the gaming question above. A mixed refactor that touches real code plus docs is what clears the larger tiers.)
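As a worked example, using the multipliers named above (the Base Scores below are invented for illustration):

```typescript
// Worked arithmetic for the example above. Only the tier multipliers
// (Micro 0.25x, Large/XL 0.80x-1.00x) come from the text; the Base
// Scores are hypothetical.
const bugFix   = { baseScore: 28, esf: 0.25 }; // 40 effective lines -> Micro tier
const refactor = { baseScore: 78, esf: 0.90 }; // 600 effective lines -> Large/XL range

console.log(bugFix.baseScore * bugFix.esf);     // 7
console.log(refactor.baseScore * refactor.esf); // 70.2, i.e. roughly 70
```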
We anchor on effort, not just complexity. Productivity is not only about tackling algorithmically hard or high-risk work. It is also about the time and effort the artifact itself represents. Set AI aside and imagine a human producing that same 600-line change by hand -- reading the code, writing the comments, making the edits. That takes real hours, and GitVelocity scores it accordingly. AI compresses the time it takes to produce the code, but the artifact and the effort it represents are unchanged.
So complexity and scale both count, and both show up in the final number.
How consistent are the scores?
The same PR scored multiple times lands within a 2-4 point range. That consistency comes from three things:
- 18 anchored reference examples that the AI uses as calibration points across the scoring spectrum
- Structured rubric evaluation across six defined dimensions with specific scoring criteria
- Deterministic prompt design that minimizes variance between scoring runs
That variance is tight enough to trust the scores for trend analysis and cross-team comparisons.
What about chained or stacked PRs?
GitVelocity only scores PRs that merge to your default branch (typically main). This has specific implications for stacked or chained PR workflows.
How it works: If you have a chain of PRs where PR A merges into PR B, PR B merges into PR C, and PR C finally merges into main, only that final merge to main gets scored. The intermediate merges (PR A into B, PR B into C) are not scored because they do not target the default branch.
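As a sketch, the eligibility rule reduces to a single predicate. The PullRequest shape here is an assumption, not the actual data model:

```typescript
// Hypothetical illustration of the rule above: a PR is scored only if
// it merged and its target (base) branch is the default branch.
interface PullRequest {
  merged: boolean;
  baseBranch: string; // the branch this PR merges into
}

function isScored(pr: PullRequest, defaultBranch = "main"): boolean {
  // In a chain A -> B -> C -> main, only the final merge qualifies.
  return pr.merged && pr.baseBranch === defaultBranch;
}
```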
Does that mean the intermediate PRs are lost? Yes, in terms of individual scoring. Only the merge to main is evaluated, and the combined score of that single large merge will not equal the sum of what the individual PRs would have scored had each merged directly to main.
Is that unfair? In practice, it does not meaningfully affect anyone's performance picture when you look at trends over time. Consider the math:
- Hitting 100 on a single PR is extremely rare. The highest score we have seen internally on a 1-2 week project was 81.
- Touching many components in a single PR does not automatically push the score toward 100. The AI evaluates coherence and depth, not breadth alone.
- An engineer who splits work into smaller PRs and merges each one directly to main will tend to accumulate more total points than someone who stacks the same work. Some might call this unfair, but the difference washes out over time.
- Month-over-month trends naturally smooth out these variations. Top performers and those needing support stand out regardless of which PR strategy they use.
Our recommendation: Run a historical backfill so you have enough data for trends to emerge. Day-to-day scores can be noisy -- a stacked PR workflow might undercount one week, while a burst of small merges might overcount another. Trends over weeks and months tell an accurate story regardless of how engineers structure their PRs.
Does GitVelocity only count shipped code?
Yes. Only merged pull requests are scored. Draft PRs, open PRs, and PRs closed without merging are not included. GitVelocity measures what actually ships to production, not what was attempted.
How does AI-generated code affect scores?
GitVelocity treats AI-generated code identically to human-written code, and that is intentional.
Code is code regardless of who or what wrote it. Engineering productivity today increasingly means how well you can shepherd AI-generated code to production: reviewing it, integrating it, catching the places where it is wrong, and shipping it with confidence. An engineer who uses AI to ship more complex work faster is more productive, and their scores reflect that.
We also do not try to detect whether a human or AI wrote a given line. Some tools do. We think it is the wrong thing to measure. What matters is the artifact that shipped to customers.
What languages does GitVelocity support?
GitVelocity is language-agnostic. The AI reads code structure and architectural patterns regardless of the language. Whether your team writes TypeScript, Python, Go, Rust, Java, or something else, PRs are scored against the same rubric.
Can I use GitVelocity with private repositories?
Yes. GitVelocity works with both public and private repositories on GitHub, Bitbucket, and GitLab. The integration requires read access to pull requests so we can analyze diffs when they merge. Code is analyzed in real time and is not stored after scoring completes.
What are your security standards?
We operate in both the US and Europe and hold ourselves to European regulatory standards -- the higher bar. All data is encrypted at rest and in transit. We follow CIS benchmarks, run regular penetration tests, and are compliant with GDPR and the EU Digital Operational Resilience Act (DORA). For full details, see our Security documentation and Privacy Policy.