# AI Model Benchmark
GitVelocity uses Claude to score every merged PR on a 0-100 scale across six dimensions. We benchmarked three Anthropic models to find the best tradeoff between cost, accuracy, and consistency.
## Models Tested
| Model | Input $/1M tokens | Output $/1M tokens | Extended Thinking | Notes |
|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | No | Fastest, cheapest |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Yes | Best value (default) |
| Claude Opus 4.6 | $5.00 | $25.00 | Yes | Gold standard |
Relative per-review cost: Opus runs ~6x the cost of Haiku and ~1.7x the cost of Sonnet (see Results: Cost below).
## Methodology
We selected 20 real pull requests spanning three languages (Rust, TypeScript, Ruby), five repositories, and a range of sizes from 1-line fixes to 1,400+ line features. Haiku and Sonnet each scored every PR three times (120 API calls total). Opus 4.6 served as the baseline: it scored each PR once, and those scores are the reference for all accuracy comparisons.
We measured three things:
- Cost -- Actual token usage and USD cost per review
- Accuracy -- How close each model's scores are to the Opus baseline, measured by Mean Absolute Deviation (MAD) and Pearson correlation
- Stability -- How consistent scores are across repeated runs, measured by Coefficient of Variation (CV)
### Key terms
- MAD (Mean Absolute Deviation): The average absolute difference between a model's score and the Opus baseline. Lower is better. A MAD of 4.4 means the model is off by ~4.4 points on average on the 0-100 scale.
- CV (Coefficient of Variation): Standard deviation divided by the mean, expressed as a percentage. Measures run-to-run consistency. A CV of 8.7% means the standard deviation of scores across independent runs of the same PR is about 8.7% of the mean score.
- Correlation (r): Pearson correlation coefficient. 1.0 means perfect agreement with the baseline. Values above 0.95 indicate near-identical ranking of PRs.
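As a quick sketch of how these three metrics are computed (hypothetical scores, not the benchmark data; CV here uses the sample standard deviation):

```python
# Toy illustration of the benchmark metrics, using made-up scores on the
# 0-100 scale -- not the real benchmark data.
from statistics import mean, stdev

def mad(model_scores, baseline_scores):
    """Mean Absolute Deviation vs. the Opus baseline; lower is better."""
    return mean(abs(m - b) for m, b in zip(model_scores, baseline_scores))

def cv(run_scores):
    """Coefficient of Variation: sample stdev / mean, as a percentage."""
    return 100 * stdev(run_scores) / mean(run_scores)

def pearson_r(xs, ys):
    """Pearson correlation; 1.0 means perfect agreement with the baseline."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sum((x - mx) ** 2 for x in xs)
                  * sum((y - my) ** 2 for y in ys)) ** 0.5

sonnet   = [22, 35, 18, 41]   # hypothetical scores for four PRs
baseline = [20, 33, 21, 38]   # hypothetical Opus scores for the same PRs
mad(sonnet, baseline)         # 2.5 -> off by 2.5 points on average
cv([24, 26, 25])              # 4.0 -> three runs of one PR, 4% variation
```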
## Test Corpus
20 PRs selected for diversity:
| # | Description | Lines Changed | Files | Category |
|---|---|---|---|---|
| 1 | Single-line key pruning fix | +1/-1 | 1 | Tiny fix |
| 2 | Add search index fields | +60/-14 | 3 | Small feature |
| 3 | Compute derived company metric | +314/-24 | 2 | Medium feature |
| 4 | Structured JSON API response | +138/-4 | 6 | Medium refactor |
| 5 | Real-time list updates | +1,020/-2 | 13 | Large feature |
| 6 | Config change (concurrency) | +2/-2 | 2 | Tiny config |
| 7 | 404 not found page | +97/-0 | 2 | Small feature |
| 8 | Integration branch support | +152/-54 | 9 | Medium feature |
| 9 | Full settings page | +923/-162 | 14 | Large feature |
| 10 | Multi-table data view with filtering | +1,447/-47 | 18 | XL feature |
| 11 | Timestamp overflow protection | +36/-4 | 2 | Small fix |
| 12 | S3 orphan cleanup race condition | +667/-111 | 10 | Large fix |
| 13 | Background job refactor | +436/-136 | 3 | Medium refactor |
| 14 | Email prompt tuning | +18/-5 | 1 | Small prompt |
| 15 | Crash fix (null thread ID) | +59/-12 | 2 | Small fix |
| 16 | Interactive checkpoint tools | +999/-5 | 18 | Large feature |
| 17 | Debug page text selection fix | +119/-4 | 1 | Small fix |
| 18 | Connection pool leak fix | +58/-37 | 6 | Medium fix |
| 19 | Remove SSE heartbeat | +56/-208 | 5 | Medium refactor |
| 20 | New API tool endpoint | +714/-1 | 4 | Large feature |
Coverage: 4 tiny, 6 small, 5 medium, 5 large/XL. Languages: Rust (5), TypeScript (11), Ruby (4). Types: features (10), fixes (6), refactors (3), config (1).
## Results: Cost
| Model | Avg Input Tokens | Avg Output Tokens | Avg Cost/Review | Cost for 20 PRs x 3 runs |
|---|---|---|---|---|
| Haiku 4.5 | 14,800 | 3,056 | $0.030 | $1.80 |
| Sonnet 4.6 | 14,829 | 3,970 | $0.104 | $6.24 |
| Opus 4.6 (est.) | ~14,800 | ~4,000 | ~$0.17 | ~$10 |
Per review, Sonnet costs ~3.5x as much as Haiku; Opus is estimated at ~6x Haiku and ~1.7x Sonnet.
### Cost at Scale
| Volume | Haiku | Sonnet | Opus |
|---|---|---|---|
| 100 PRs/month | $3 | $10 | $17 |
| 1,000 PRs/month | $30 | $104 | $174 |
| 10,000 PRs/month | $301 | $1,040 | $1,740 |
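The figures in both cost tables follow directly from the measured token counts and list prices. A minimal Python sketch, assuming simple linear token pricing (an actual bill may differ with prompt caching or batching):

```python
# Sanity check on the cost tables above: per-review cost from average token
# counts and $/1M-token list prices, assuming simple linear pricing.

def cost_per_review(in_tokens, out_tokens, in_price, out_price):
    """USD cost of one review, given $/1M-token input and output rates."""
    return in_tokens * in_price / 1e6 + out_tokens * out_price / 1e6

haiku  = cost_per_review(14_800, 3_056, 1.00, 5.00)    # ~$0.030
sonnet = cost_per_review(14_829, 3_970, 3.00, 15.00)   # ~$0.104
opus   = cost_per_review(14_800, 4_000, 5.00, 25.00)   # ~$0.174 (est. tokens)

# Scaling to monthly volume reproduces the table above:
print(round(10_000 * haiku))   # 301
print(round(10_000 * sonnet))  # 1040
print(round(10_000 * opus))    # 1740
```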
## Results: Accuracy
How close each model's scores are to the Opus baseline (lower MAD is better):
| Model | Mean Score | MAD vs Opus | Max Deviation | Bias | Correlation (r) |
|---|---|---|---|---|---|
| Haiku 4.5 | 24.6 | 7.2 | 22.7 | +3.6 | 0.886 |
| Sonnet 4.6 | 25.4 | 4.4 | 11.0 | +4.4 | 0.990 |
| Opus 4.6 | 24.0 | 0 (baseline) | 0 | 0 | 1.0 |
Sonnet is significantly more accurate than Haiku -- 40% lower MAD, half the max deviation, and near-perfect correlation (0.990) with Opus. Both models trend slightly higher than Opus.
### Accuracy by Sub-Score (MAD vs Opus)
| Model | Scope | Architecture | Implementation | Risk | Quality | Perf/Security |
|---|---|---|---|---|---|---|
| Haiku 4.5 | 2.1 | 1.0 | 2.4 | 2.6 | 1.3 | 0.8 |
| Sonnet 4.6 | 1.0 | 1.9 | 1.6 | 1.5 | 1.2 | 0.6 |
| Opus 4.6 | 0 | 0 | 0 | 0 | 0 | 0 |
Sonnet is more accurate in five of the six dimensions (Haiku is slightly closer on Architecture), with the biggest improvements in Scope and Risk.
### Accuracy by PR Size (MAD vs Opus)
| Model | Tiny (<50 lines) | Small (50-150) | Medium (150-500) | Large (500+) |
|---|---|---|---|---|
| Haiku 4.5 | 0.2 | 5.9 | 7.5 | 11.6 |
| Sonnet 4.6 | 0.3 | 3.3 | 3.7 | 8.2 |
| Opus 4.6 | 0 (baseline) | 0 (baseline) | 0 (baseline) | 0 (baseline) |
Both models are excellent on tiny PRs. Accuracy degrades on larger PRs, but Sonnet holds up much better -- its MAD on large PRs (8.2) is comparable to Haiku's on medium PRs (7.5).
### Accuracy by Language (MAD vs Opus)
| Model | Rust | TypeScript | Ruby |
|---|---|---|---|
| Haiku 4.5 | 4.0 | 7.6 | 10.4 |
| Sonnet 4.6 | 3.0 | 5.5 | 3.4 |
| Opus 4.6 | 0 (baseline) | 0 (baseline) | 0 (baseline) |
Haiku struggles most with Ruby (MAD 10.4), while Sonnet handles it well (3.4). Both models are most accurate on Rust code.
## Results: Stability
How consistent scores are when the same PR is scored multiple times (lower CV is better):
| Model | Avg CV | Max CV | PRs with CV > 10% |
|---|---|---|---|
| Haiku 4.5 | 18.2% | 77.4% | 14/20 |
| Sonnet 4.6 | 8.7% | 35.4% | 7/20 |
| Opus 4.6 | N/A (single baseline score) | -- | -- |
Sonnet is roughly twice as stable as Haiku. Haiku's scores vary wildly across runs -- 14 of 20 PRs had >10% CV, with one reaching 77%. Sonnet's extended thinking produces much more consistent scoring.
## Choosing a Model
Sonnet 4.6 is the default. It delivers near-Opus accuracy (r = 0.990, MAD 4.4) at roughly 60% of Opus's per-review cost, with good run-to-run stability (8.7% avg CV). For most teams, this is the best balance of cost and quality.
Opus 4.6 is the gold standard. It is the most capable AI model available for programming today -- the same model many engineering teams use to write their best code -- so its scores carry a weight that is hard to argue with. Some teams, depending on their culture and how they plan to roll out scoring, prefer to start with the best-in-class model: if your engineers care about being scored with the highest possible accuracy, Opus is the right choice. There is also a natural alignment: when the model that writes your code is the one that evaluates it, trust comes built in. Sonnet can carry a slight perception gap simply because teams use it less for coding.
Haiku 4.5 is not recommended. While it's the cheapest option at $0.03/review, the 18% average CV (with spikes up to 77%) means scores are too inconsistent to be useful. The same PR can receive wildly different scores across runs. We include the data here for transparency, but we do not recommend Haiku for production scoring.
You can change your model at any time in Settings > AI Configuration.
Benchmark conducted March 2026. 20 PRs, 140 API calls (60 each for Haiku and Sonnet, plus 20 Opus baseline runs).