AI Model Benchmark

GitVelocity uses Claude to score every merged PR on a 0-100 scale across six dimensions. We benchmarked three Anthropic models to find the best tradeoff between cost, accuracy, and consistency.

Models Tested

| Model | Input $/1M tokens | Output $/1M tokens | Extended Thinking | Notes |
| --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | $1.00 | $5.00 | No | Fastest, cheapest |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Yes | Best value (default) |
| Claude Opus 4.6 | $5.00 | $25.00 | Yes | Gold standard |

Relative cost per review: Haiku runs roughly 6x cheaper than Opus, and Sonnet roughly 1.7x cheaper than Opus (per-token prices alone give 5x and 1.7x; Opus also emits more output tokens per review).

Methodology

We selected 20 real pull requests spanning three languages (Rust, TypeScript, Ruby), five repositories, and a range of sizes from 1-line fixes to 1,400+ line features. Haiku and Sonnet each scored every PR three times (2 models x 20 PRs x 3 runs = 120 API calls). Opus 4.6 served as the baseline: each PR was scored once with Opus, and that score was the reference for all accuracy comparisons.
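
The methodology above can be sketched as a small harness. Everything here is a stand-in -- the model IDs are illustrative and `score_pr` abstracts GitVelocity's internal scoring call -- but the call counts match the benchmark:

```python
# Hypothetical benchmark harness mirroring the methodology above.
# `score_pr(model, pr)` is a stand-in for the real scoring pipeline.

CORPUS = [f"pr-{i}" for i in range(1, 21)]  # 20 real PRs
RUNS = 3                                    # repeated runs per PR

def run_benchmark(score_pr):
    """Return {model: {pr: [scores...]}} on the 0-100 scale."""
    results = {}
    # Haiku and Sonnet: 3 independent runs each (2 x 20 x 3 = 120 calls)
    for model in ("claude-haiku-4-5", "claude-sonnet-4-6"):
        results[model] = {
            pr: [score_pr(model, pr) for _ in range(RUNS)] for pr in CORPUS
        }
    # Opus: one score per PR, used as the accuracy baseline (20 calls)
    results["claude-opus-4-6"] = {
        pr: [score_pr("claude-opus-4-6", pr)] for pr in CORPUS
    }
    return results
```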

We measured three things:

  • Cost -- Actual token usage and USD cost per review
  • Accuracy -- How close each model's scores are to the Opus baseline, measured by Mean Absolute Deviation (MAD) and Pearson correlation
  • Stability -- How consistent scores are across repeated runs, measured by Coefficient of Variation (CV)

Key terms

  • MAD (Mean Absolute Deviation): The average absolute difference between a model's score and the Opus baseline. Lower is better. A MAD of 4.4 means the model is off by ~4.4 points on average on the 0-100 scale.
  • CV (Coefficient of Variation): Standard deviation divided by the mean, expressed as a percentage. Measures run-to-run consistency. A CV of 8.7% means that, across independent runs of the same PR, the standard deviation of the scores is about 8.7% of their mean.
  • Correlation (r): Pearson correlation coefficient. 1.0 means perfect agreement with the baseline. Values above 0.95 indicate near-identical ranking of PRs.
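
For reference, all three metrics can be computed in a few lines of standard-library Python. This is a sketch of the definitions above, not GitVelocity's exact analysis code:

```python
# Minimal implementations of the three benchmark metrics.
from statistics import mean, pstdev

def mad(scores, baseline):
    """Mean Absolute Deviation vs. the baseline scores (lower is better)."""
    return mean(abs(s - b) for s, b in zip(scores, baseline))

def cv(runs):
    """Coefficient of Variation: (population) stdev / mean, as a percentage."""
    return pstdev(runs) / mean(runs) * 100

def pearson_r(xs, ys):
    """Pearson correlation coefficient (1.0 = perfect agreement)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```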

Test Corpus

20 PRs selected for diversity:

| # | Description | Lines Changed | Files | Category |
| --- | --- | --- | --- | --- |
| 1 | Single-line key pruning fix | +1/-1 | 1 | Tiny fix |
| 2 | Add search index fields | +60/-14 | 3 | Small feature |
| 3 | Compute derived company metric | +314/-24 | 2 | Medium feature |
| 4 | Structured JSON API response | +138/-4 | 6 | Medium refactor |
| 5 | Real-time list updates | +1,020/-2 | 13 | Large feature |
| 6 | Config change (concurrency) | +2/-2 | 2 | Tiny config |
| 7 | 404 not found page | +97/-0 | 2 | Small feature |
| 8 | Integration branch support | +152/-54 | 9 | Medium feature |
| 9 | Full settings page | +923/-162 | 14 | Large feature |
| 10 | Multi-table data view with filtering | +1,447/-47 | 18 | XL feature |
| 11 | Timestamp overflow protection | +36/-4 | 2 | Small fix |
| 12 | S3 orphan cleanup race condition | +667/-111 | 10 | Large fix |
| 13 | Background job refactor | +436/-136 | 3 | Medium refactor |
| 14 | Email prompt tuning | +18/-5 | 1 | Small prompt |
| 15 | Crash fix (null thread ID) | +59/-12 | 2 | Small fix |
| 16 | Interactive checkpoint tools | +999/-5 | 18 | Large feature |
| 17 | Debug page text selection fix | +119/-4 | 1 | Small fix |
| 18 | Connection pool leak fix | +58/-37 | 6 | Medium fix |
| 19 | Remove SSE heartbeat | +56/-208 | 5 | Medium refactor |
| 20 | New API tool endpoint | +714/-1 | 4 | Large feature |

Coverage: 4 tiny, 6 small, 5 medium, 5 large/XL. Languages: Rust (5), TypeScript (11), Ruby (4). Types: features (10), fixes (6), refactors (3), config (1).

Results: Cost

| Model | Avg Input Tokens | Avg Output Tokens | Avg Cost/Review | Cost for 20 PRs x 3 Runs |
| --- | --- | --- | --- | --- |
| Haiku 4.5 | 14,800 | 3,056 | $0.030 | $1.80 |
| Sonnet 4.6 | 14,829 | 3,970 | $0.104 | $6.24 |
| Opus 4.6 (est.) | ~14,800 | ~4,000 | ~$0.17 | ~$10 |

Sonnet costs 3.5x more than Haiku per review. Opus is estimated at ~6x Haiku / ~1.7x Sonnet.
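The per-review figures follow directly from the pricing table at the top of this page. A quick check, using the published per-token prices (model keys here are just labels):

```python
# Recomputing average cost per review from per-token prices.
PRICES = {  # USD per 1M tokens: (input, output)
    "haiku-4.5":  (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.6":   (5.00, 25.00),
}

def cost_per_review(model, input_tokens, output_tokens):
    """Dollar cost of one review given its token usage."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Plugging in the average token counts from the table reproduces the $0.030, $0.104, and ~$0.17 figures.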

Cost at Scale

| Volume | Haiku | Sonnet | Opus |
| --- | --- | --- | --- |
| 100 PRs/month | $3 | $10 | $17 |
| 1,000 PRs/month | $30 | $104 | $174 |
| 10,000 PRs/month | $301 | $1,040 | $1,740 |

Results: Accuracy

How close each model's scores are to the Opus baseline (lower MAD is better):

| Model | Mean Score | MAD vs Opus | Max Deviation | Bias | Correlation (r) |
| --- | --- | --- | --- | --- | --- |
| Haiku 4.5 | 24.6 | 7.2 | 22.7 | +3.6 | 0.886 |
| Sonnet 4.6 | 25.4 | 4.4 | 11.0 | +4.4 | 0.990 |
| Opus 4.6 | 24.0 | 0 (baseline) | 0 | 0 | 1.0 |

Sonnet is significantly more accurate than Haiku -- 40% lower MAD, half the max deviation, and near-perfect correlation (0.990) with Opus. Both models trend slightly higher than Opus.

Accuracy by Sub-Score (MAD vs Opus)

| Model | Scope | Architecture | Implementation | Risk | Quality | Perf/Security |
| --- | --- | --- | --- | --- | --- | --- |
| Haiku 4.5 | 2.1 | 1.0 | 2.4 | 2.6 | 1.3 | 0.8 |
| Sonnet 4.6 | 1.0 | 1.9 | 1.6 | 1.5 | 1.2 | 0.6 |
| Opus 4.6 | 0 | 0 | 0 | 0 | 0 | 0 |

Sonnet is more accurate on five of the six dimensions, with the biggest improvements in Scope and Risk; Architecture is the one dimension where Haiku comes out ahead.

Accuracy by PR Size (MAD vs Opus)

| Model | Tiny (<50 lines) | Small (50-150) | Medium (150-500) | Large (500+) |
| --- | --- | --- | --- | --- |
| Haiku 4.5 | 0.2 | 5.9 | 7.5 | 11.6 |
| Sonnet 4.6 | 0.3 | 3.3 | 3.7 | 8.2 |
| Opus 4.6 | 0 (baseline) | 0 (baseline) | 0 (baseline) | 0 (baseline) |

Both models are excellent on tiny PRs. Accuracy degrades on larger PRs, but Sonnet holds up much better -- its MAD on large PRs (8.2) is comparable to Haiku's on medium PRs (7.5).

Accuracy by Language (MAD vs Opus)

| Model | Rust | TypeScript | Ruby |
| --- | --- | --- | --- |
| Haiku 4.5 | 4.0 | 7.6 | 10.4 |
| Sonnet 4.6 | 3.0 | 5.5 | 3.4 |
| Opus 4.6 | 0 (baseline) | 0 (baseline) | 0 (baseline) |

Haiku struggles most with Ruby (MAD 10.4), while Sonnet handles it well (3.4). Both models are most accurate on Rust code.

Results: Stability

How consistent scores are when the same PR is scored multiple times (lower CV is better):

| Model | Avg CV | Max CV | PRs with CV > 10% |
| --- | --- | --- | --- |
| Haiku 4.5 | 18.2% | 77.4% | 14/20 |
| Sonnet 4.6 | 8.7% | 35.4% | 7/20 |
| Opus 4.6 | N/A (single baseline score) | -- | -- |

Sonnet is 2x more stable than Haiku. Haiku's scores vary wildly across runs -- 14 of 20 PRs had >10% CV, with one reaching 77%. Sonnet's extended thinking produces much more consistent scoring.
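
Extended thinking is a request-time option in the Anthropic Messages API, enabled via the `thinking` parameter. A sketch of how a scoring request might differ between the two models -- the prompt text and token budget here are illustrative, not GitVelocity's actual configuration:

```python
# Building a Messages API request body. Extended thinking is enabled via
# the `thinking` parameter; prompt and budget values are illustrative.
def review_request(model, pr_diff, use_thinking):
    req = {
        "model": model,
        "max_tokens": 8000,  # must exceed the thinking budget when enabled
        "messages": [
            {"role": "user",
             "content": f"Score this PR across six dimensions:\n{pr_diff}"}
        ],
    }
    if use_thinking:
        # The model reasons privately before answering -- the behavior the
        # stability numbers above credit for Sonnet's lower CV.
        req["thinking"] = {"type": "enabled", "budget_tokens": 4000}
    return req
```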

Choosing a Model

Sonnet 4.6 is the default. It delivers near-Opus accuracy (r = 0.990, MAD 4.4) at roughly 1.7x lower per-review cost than Opus, with good run-to-run stability (8.7% avg CV). For most teams, this is the best balance of cost and quality.

Opus 4.6 is the gold standard. It is the most capable AI model available for programming today -- the same model many engineering teams use to write their best code -- and scores from the model that writes your code carry a weight that is hard to argue with. Some teams, depending on their culture and how they want to roll out scoring, prefer to launch with the best-in-class model: if your engineers care about being scored with the highest possible accuracy, Opus is the right choice. Sonnet, by contrast, can face a slight perception gap -- teams tend to use it less for coding, so it starts with less built-in trust.

Haiku 4.5 is not recommended. While it's the cheapest option at $0.03/review, the 18% average CV (with spikes up to 77%) means scores are too inconsistent to be useful. The same PR can receive wildly different scores across runs. We include the data here for transparency, but we do not recommend Haiku for production scoring.

You can change your model at any time in Settings > AI Configuration.

Benchmark conducted March 2026. 20 PRs; 140 scoring calls (120 across Haiku and Sonnet, plus 20 Opus baseline scores).