AI Model Scoring Benchmark

Purpose & Methodology

GitVelocity scores every pull request on a 0-100 scale across 6 sub-categories using a large language model — Anthropic Claude by default, with OpenRouter-hosted models available as alternatives. To choose and validate the scoring model, we benchmark candidates against a fixed reference model. Opus 4.7 is that reference — the gold standard — and this page compares Opus 4.8, Opus 4.6, Sonnet 4.6, Haiku 4.5, Kimi K2.6, GLM 5.1, and Qwen3.6 Plus against it on three dimensions:

  1. Cost - Token usage and USD cost per review
  2. Accuracy - Score deviation from the Opus 4.7 reference
  3. Stability - Variance across independent runs per model

Every model scored the same 20-pull-request corpus under identical conditions in a single benchmarking pass. Most models are scored with 3 independent runs per PR; Kimi K2.6 and GLM 5.1 are additionally scored with 6 runs per PR to test whether averaging more calls stabilizes the lower-cost models. Scores come only from this benchmark, never from your production data.

Which model scores your PRs?

By default, GitVelocity uses Claude Sonnet 4.6, our recommended choice for most teams. It offers the best balance of cost, speed, and scoring quality, and we prefer the Anthropic model. Opus 4.7 is the high-accuracy reference this benchmark calibrates against: it anchors the scale, but it is not the day-to-day default. The OpenRouter models (GLM, Kimi, Qwen) are opt-in, lower-cost alternatives that you can use with your own key.

Scoring

  • Total score (0-100): Composite of 6 sub-scores
  • Sub-scores: Scope, Architecture, Implementation, Risk, Quality, Perf/Security
  • Effort Scale Factor: Applied based on PR size

[Figure: Score breakdown popover showing the six sub-scores (Scope 14/20, Architecture 12/20, Implementation 13/20, Risk 10/20, Quality 9/15, Perf/Security 4/5), a base score of 62/100, and the Effort Scale Factor calculation underneath.]

The score breakdown popover any reviewer can open on a PR. This is the model output the benchmark measures against the Opus 4.7 reference.
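As a rough illustration of how the base score in the popover adds up, the sketch below sums the six sub-scores after checking each against its maximum. The maxima (20/20/20/20/15/5) are read off the popover example; the function name and dictionary layout are hypothetical, and the Effort Scale Factor is deliberately left out because its formula is not given on this page.

```python
# Sub-score maxima as read from the popover example (they sum to 100).
SUB_SCORE_MAX = {
    "scope": 20,
    "architecture": 20,
    "implementation": 20,
    "risk": 20,
    "quality": 15,
    "perf_security": 5,
}


def base_score(sub_scores):
    """Sum the six sub-scores after checking each one against its maximum."""
    if set(sub_scores) != set(SUB_SCORE_MAX):
        raise ValueError("expected exactly the six sub-score categories")
    for name, value in sub_scores.items():
        if not 0 <= value <= SUB_SCORE_MAX[name]:
            raise ValueError(f"{name}={value} is outside 0..{SUB_SCORE_MAX[name]}")
    return sum(sub_scores.values())


# Values from the popover example.
popover_example = {
    "scope": 14,
    "architecture": 12,
    "implementation": 13,
    "risk": 10,
    "quality": 9,
    "perf_security": 4,
}
print(base_score(popover_example))  # 62
```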

Statistical Measures

  • Mean Absolute Deviation (MAD): Average |model_avg - opus_4_7_avg| across PRs
  • Stability (CV): Coefficient of variation across a model's independent runs (stddev / mean)
  • Estimated CV of averaged score: raw CV / sqrt(run_count), used to estimate stability of an averaged final score
  • Correlation (r): Pearson correlation of per-PR average model scores vs Opus 4.7 averages
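The measures above can be sketched in a few lines of Python. This is a minimal illustration, not GitVelocity's implementation: the function names are hypothetical, the two score lists are the Opus 4.7 and Opus 4.8 columns copied from the Per-PR Score Comparison table later on this page, and the use of the sample standard deviation in the CV is an assumption (the page does not say whether sample or population stddev is used).

```python
import math
import statistics

# Per-PR average scores (PRs 1..20) copied from the Per-PR Score Comparison table.
OPUS_4_7 = [1.0, 9.3, 24.3, 10.3, 51.7, 1.0, 6.3, 14.7, 36.0, 40.7,
            6.0, 44.0, 36.7, 3.3, 6.7, 46.0, 6.3, 12.0, 16.0, 34.0]
OPUS_4_8 = [1.0, 11.3, 26.3, 14.3, 58.7, 1.7, 7.0, 14.3, 36.7, 56.7,
            6.3, 46.3, 42.7, 3.0, 7.3, 50.7, 6.3, 11.7, 21.0, 36.0]


def mad(model, reference):
    """Mean absolute deviation of per-PR averages from the reference."""
    return sum(abs(m - r) for m, r in zip(model, reference)) / len(reference)


def bias(model, reference):
    """Mean signed deviation; positive means the model over-scores."""
    return sum(m - r for m, r in zip(model, reference)) / len(reference)


def pearson_r(xs, ys):
    """Pearson correlation of two equally long score lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cv(run_scores):
    """Coefficient of variation across one PR's runs (sample stddev: an assumption)."""
    return statistics.stdev(run_scores) / statistics.mean(run_scores)


def cv_of_average(raw_cv, run_count):
    """Estimated CV of a score averaged over run_count independent runs."""
    return raw_cv / math.sqrt(run_count)


print(f"MAD  = {mad(OPUS_4_8, OPUS_4_7):.2f}")        # 2.75
print(f"bias = {bias(OPUS_4_8, OPUS_4_7):+.2f}")      # +2.65
print(f"r    = {pearson_r(OPUS_4_8, OPUS_4_7):.3f}")  # 0.988
```

Because the table values are rounded to one decimal, recomputed figures for other models may differ from the published ones in the last digit.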

Models Under Test

Model | Provider | Input ($/1M tokens) | Output ($/1M tokens) | Runs/PR | Notes
claude-haiku-4-5-20251001 | Anthropic | $1.00 | $5.00 | 3 | Fastest Anthropic model
claude-sonnet-4-6 | Anthropic | $3.00 | $15.00 | 3 | Mid-tier Claude
claude-opus-4-6 | Anthropic | $5.00 | $25.00 | 3 | Previous-generation Opus
claude-opus-4-7 | Anthropic | $5.00 | $25.00 | 3 | Gold-standard reference
claude-opus-4-8 | Anthropic | $5.00 | $25.00 | 3 | Newest Opus
moonshotai/kimi-k2.6 | OpenRouter | $0.7448 | $4.655 | 6 | Candidate, reasoning disabled
z-ai/glm-5.1 | OpenRouter | $1.05 | $3.50 | 6 | Candidate, reasoning disabled
qwen/qwen3.6-plus | OpenRouter | $0.325 | $1.95 | 3 | Candidate, reasoning disabled

Opus 4.8 uses Anthropic's published Opus standard pricing ($5 input / $25 output per 1M tokens), the same tier as Opus 4.6 and 4.7. OpenRouter prices for Kimi K2.6, GLM 5.1, and Qwen3.6 Plus are from model metadata checked on 2026-04-25.

Test Corpus

20 PRs selected for diversity across size, language, and complexity.

san-francisco (Rust) - 5 PRs

# | PR | Lines (+/-) | Files | Category
1 | #537 "Ensure attio keys are properly pruned" | +1/-1 | 1 | Tiny fix
2 | #548 "seed-investor-pedigree: add fields into ES" | +60/-14 | 3 | Small feature
3 | #549 "seed-investor-pedigree: compute company's seed" | +314/-24 | 2 | Medium feature
4 | #542 "Return structured JSON response from LLM list column API" | +138/-4 | 6 | Medium refactor
5 | #545 "real time eva-list updates for people index" | +1020/-2 | 13 | Large feature

gitvelocity (TypeScript/React/NestJS) - 5 PRs

# | PR | Lines (+/-) | Files | Category
6 | #174 "Increase review processor concurrency" | +2/-2 | 2 | Tiny config
7 | #169 "Add 404 not found page" | +97/-0 | 2 | Small feature
8 | #175 "Add integration branch support for backfill" | +152/-54 | 9 | Medium feature
9 | #176 "Add Settings > Usage page" | +923/-162 | 14 | Large feature
10 | #177 "Add backfill history view with per-PR tracking" | +1447/-47 | 18 | XL feature

gmail-integration (TypeScript) - 3 PRs

# | PR | Lines (+/-) | Files | Category
11 | #257 "Protect timestamptz casts from overflow" | +36/-4 | 2 | Small fix
12 | #261 "Prevent S3 orphan cleanup race condition" | +667/-111 | 10 | Large fix
13 | #260 "Move backfill logic to Sidekiq background" | +436/-136 | 3 | Medium refactor

skynet (TypeScript) - 3 PRs

# | PR | Lines (+/-) | Files | Category
14 | #302 "Humanize outreach email prompt" | +18/-5 | 1 | Small prompt
15 | #298 "Fix ObservationalMemory threadId crash" | +59/-12 | 2 | Small fix
16 | #295 "Add interactive checkpoint tools" | +999/-5 | 18 | Large feature

eva-web (Rails/React) - 4 PRs

# | PR | Lines (+/-) | Files | Category
17 | #4364 "Fix debug page text selection" | +119/-4 | 1 | Small fix
18 | #4369 "Fix ActiveRecord connection pool leaks" | +58/-37 | 6 | Medium fix
19 | #4374 "Eliminate persistent MCP SSE heartbeat" | +56/-208 | 5 | Medium refactor
20 | #4365 "Add get_company_people MCP tool" | +714/-1 | 4 | Large feature

Coverage:

  • Sizes: 3 tiny (<50 lines), 5 small (50-150), 6 medium (150-500), 6 large/XL (500+)
  • Languages: Rust (5), TypeScript (11), Ruby/Rails (4)
  • Types: features (9), fixes (6), refactors (3), config (1), prompt (1)

Results: Cost

Per-call cost on the same review input. The newer Opus models (4.7/4.8) are billed for about 37% more input tokens than Opus 4.6, Sonnet, and Haiku on the identical review (roughly 23,060 vs 16,800 tokens), which drives their higher cost.

Model | Calls | Avg Input Tokens | Avg Output Tokens | Avg Cost/Call | Total Benchmark Cost
Qwen3.6 Plus | 60 | 15,240 | 1,855 | $0.009 | $0.51
Kimi K2.6 | 120 | 14,107 | 2,263 | $0.021 | $2.52
GLM 5.1 | 120 | 14,163 | 1,762 | $0.021 | $2.52
Haiku 4.5 | 60 | 16,788 | 2,892 | $0.031 | $1.87
Sonnet 4.6 | 60 | 16,817 | 3,985 | $0.110 | $6.61
Opus 4.6 | 60 | 16,788 | 1,736 | $0.127 | $7.64
Opus 4.7 | 60 | 23,064 | 2,539 | $0.179 | $10.73
Opus 4.8 | 60 | 23,059 | 2,840 | $0.186 | $11.18

Qwen is the cheapest model tested, ~20x cheaper than Opus 4.7 per call. GLM and Kimi are ~8x cheaper. Opus 4.8 is the most expensive model tested (~4% over Opus 4.7), driven by higher output volume on a near-identical input.
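The per-call figures in the cost table can be reproduced from the average token counts and the per-1M-token prices listed under Models Under Test. The sketch below is illustrative only; the function name is hypothetical, and it assumes cost is simply tokens times list price with no caching or batch discounts.

```python
def cost_per_call(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """USD cost of one review call, given token counts and $/1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Average token counts from the cost table, prices from Models Under Test.
opus_4_7 = cost_per_call(23_064, 2_539, 5.00, 25.00)
opus_4_8 = cost_per_call(23_059, 2_840, 5.00, 25.00)
qwen = cost_per_call(15_240, 1_855, 0.325, 1.95)

print(f"Opus 4.7: ${opus_4_7:.3f}")  # $0.179
print(f"Opus 4.8: ${opus_4_8:.3f}")  # $0.186
print(f"Qwen:     ${qwen:.3f}")      # $0.009
```

Opus 4.7 and 4.8 see almost the same input (23,064 vs 23,059 tokens), so the gap between them comes from the output side: 2,840 vs 2,539 tokens at $25 per 1M.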

Results: Accuracy

Deviation from Opus 4.7 (lower is better). Per-PR model scores are the mean of each model's configured run count.

Model | Runs/PR | Mean Total Score | MAD vs Opus 4.7 | Max Deviation | Bias | Correlation (r)
Opus 4.7 | 3 | 20.3 | 0 (reference) | 0.00 | 0.00 | 1.000
Opus 4.8 | 3 | 23.0 | 2.75 | 16.00 | +2.65 | 0.988
GLM 5.1 | 6 | 21.5 | 2.92 | 10.83 | +1.22 | 0.975
Opus 4.6 | 3 | 22.0 | 3.18 | 13.33 | +1.68 | 0.966
Haiku 4.5 | 3 | 24.6 | 4.70 | 15.00 | +4.27 | 0.966
Sonnet 4.6 | 3 | 25.9 | 5.70 | 23.00 | +5.63 | 0.967
Kimi K2.6 | 6 | 26.1 | 5.93 | 20.50 | +5.80 | 0.979
Qwen3.6 Plus | 3 | 27.4 | 7.22 | 20.00 | +7.09 | 0.988

Opus 4.8 tracks Opus 4.7 most closely of any model — MAD 2.75 and r=0.988 — but scores about 2.6 points higher on average (a consistent upward bias). GLM 5.1 is the strongest non-Anthropic model (MAD 2.92, the lowest bias of the lower-cost models at +1.22). Opus 4.6, Sonnet, Haiku, Kimi, and Qwen all over-score relative to 4.7, with Qwen the furthest off. A high correlation alongside a high bias (Qwen: r=0.988, bias +7.09) means a model ranks PRs much like 4.7 but on a shifted scale.
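The last point, that correlation and bias measure different things, can be shown with a small sketch. The "shifted" model here is hypothetical: it takes the first five Opus 4.7 per-PR averages and adds a constant 7 points, so it ranks the PRs identically but sits on a higher scale.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long score lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First five Opus 4.7 per-PR averages from the Per-PR Score Comparison table.
reference = [1.0, 9.3, 24.3, 10.3, 51.7]
# Hypothetical model: identical ranking, constant +7 offset.
shifted = [score + 7.0 for score in reference]

r = pearson_r(reference, shifted)
bias = sum(s - x for s, x in zip(shifted, reference)) / len(reference)
mad = sum(abs(s - x) for s, x in zip(shifted, reference)) / len(reference)

print(f"r = {r:.3f}, bias = {bias:+.2f}, MAD = {mad:.2f}")
# r = 1.000, bias = +7.00, MAD = 7.00
```

Correlation is unchanged by a constant offset, so r alone cannot tell whether a model's absolute scores match the reference; MAD and bias are needed for that.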

Per-PR Score Comparison

Each cell is the mean of the model's configured run count.

PR# | Size | Lang | Opus 4.7 | Opus 4.8 | Opus 4.6 | GLM 5.1 | Kimi K2.6 | Sonnet 4.6 | Haiku 4.5 | Qwen3.6 Plus
1 | tiny | Rust | 1.0 | 1.0 | 1.0 | 1.2 | 1.0 | 1.3 | 1.0 | 1.0
2 | small | Rust | 9.3 | 11.3 | 10.7 | 12.0 | 11.2 | 12.0 | 24.3 | 14.3
3 | medium | Rust | 24.3 | 26.3 | 27.3 | 28.7 | 25.0 | 28.7 | 27.7 | 35.0
4 | medium | Rust | 10.3 | 14.3 | 19.7 | 14.0 | 15.0 | 13.3 | 14.0 | 14.3
5 | large | Rust | 51.7 | 58.7 | 54.3 | 49.2 | 58.7 | 62.0 | 49.3 | 63.3
6 | tiny | TS | 1.0 | 1.7 | 1.0 | 1.2 | 1.1 | 1.0 | 1.3 | 1.0
7 | small | TS | 6.3 | 7.0 | 5.7 | 5.8 | 8.1 | 7.7 | 7.7 | 7.0
8 | medium | TS | 14.7 | 14.3 | 18.3 | 16.7 | 23.8 | 20.3 | 20.7 | 23.0
9 | large | TS | 36.0 | 36.7 | 35.7 | 35.3 | 44.3 | 49.0 | 40.3 | 42.3
10 | xl | TS | 40.7 | 56.7 | 54.0 | 51.5 | 61.2 | 63.7 | 49.0 | 54.1
11 | small | TS | 6.0 | 6.3 | 6.0 | 7.7 | 7.0 | 6.3 | 7.7 | 9.3
12 | large | TS | 44.0 | 46.3 | 35.0 | 37.3 | 50.8 | 43.3 | 46.3 | 52.3
13 | medium | TS | 36.7 | 42.7 | 35.3 | 35.3 | 45.0 | 40.3 | 34.7 | 56.7
14 | tiny | TS | 3.3 | 3.0 | 3.0 | 4.2 | 2.1 | 3.7 | 4.3 | 2.0
15 | small | TS | 6.7 | 7.3 | 8.3 | 11.3 | 14.2 | 10.7 | 12.3 | 11.3
16 | large | TS | 46.0 | 50.7 | 54.7 | 45.5 | 59.3 | 66.3 | 55.7 | 60.0
17 | small | Ruby | 6.3 | 6.3 | 5.3 | 10.2 | 7.2 | 8.7 | 9.3 | 10.0
18 | medium | Ruby | 12.0 | 11.7 | 10.3 | 16.5 | 13.8 | 12.7 | 15.0 | 20.0
19 | medium | Ruby | 16.0 | 21.0 | 21.0 | 18.0 | 29.8 | 30.7 | 23.0 | 25.3
20 | large | Ruby | 34.0 | 36.0 | 33.3 | 29.2 | 43.8 | 37.3 | 48.0 | 45.7

Per-Sub-Score Accuracy (MAD vs Opus 4.7)

Model | Scope | Architecture | Implementation | Risk | Quality | Perf/Security
Opus 4.8 | 0.62 | 0.65 | 0.62 | 1.02 | 0.53 | 0.25
Opus 4.6 | 0.73 | 0.85 | 0.62 | 0.88 | 0.77 | 0.22
GLM 5.1 | 1.82 | 1.17 | 1.02 | 1.36 | 0.50 | 0.48
Sonnet 4.6 | 1.23 | 1.73 | 1.12 | 1.00 | 1.03 | 0.20
Kimi K2.6 | 1.39 | 1.59 | 2.08 | 1.36 | 1.06 | 0.53
Haiku 4.5 | 2.10 | 1.07 | 2.55 | 1.37 | 1.48 | 0.70
Qwen3.6 Plus | 2.43 | 2.28 | 2.98 | 1.93 | 1.72 | 0.73

Opus 4.8 and 4.6 are the tightest at the sub-score level; 4.8's largest gap is Risk (1.02). Qwen's largest gap is Implementation, where it most over-scores larger feature PRs.

Results: Stability

Coefficient of variation across each model's independent runs. Lower is better.

Model | Runs/PR | Avg CV (total_score) | Max CV | PRs with CV > 10% | Estimated CV of Averaged Score
Opus 4.7 | 3 | 6.3% | 29.8% | 3/20 | 3.7%
Opus 4.6 | 3 | 6.8% | 22.6% | 4/20 | 3.9%
Qwen3.6 Plus | 3 | 8.6% | 36.3% | 8/20 | 5.0%
Sonnet 4.6 | 3 | 10.0% | 35.4% | 6/20 | 5.8%
Opus 4.8 | 3 | 10.7% | 28.3% | 8/20 | 6.2%
GLM 5.1 | 6 | 15.6% | 36.3% | 13/20 | 6.4%
Kimi K2.6 | 6 | 16.1% | 39.8% | 16/20 | 6.6%
Haiku 4.5 | 3 | 19.9% | 68.8% | 15/20 | 11.5%

Opus 4.7 and 4.6 are the most reproducible models (~6-7% CV, ≤4/20 PRs above 10%). Opus 4.8 is noticeably noisier: its 10.7% average CV is roughly 1.7x that of Opus 4.7. Kimi and GLM are also noisy per call (16.1% and 15.6%); averaging 6 calls narrows their averaged-score noise to slightly above Sonnet's 3-run level (6.4-6.6% vs 5.8%) but does not make individual calls as steady as Opus. Haiku is the least stable model overall (19.9% average CV, 15/20 PRs above 10%).
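The "Estimated CV of Averaged Score" column follows from the raw CV / sqrt(run_count) rule defined under Statistical Measures. The sketch below reproduces two table entries and, as an extrapolation that is not a benchmark result, inverts the same rule to estimate how many runs a noisy model would need to reach a target. Both function names are hypothetical, and the rule assumes runs are independent.

```python
import math


def averaged_cv(raw_cv_pct, run_count):
    """Estimated CV (%) of a score averaged over run_count independent runs."""
    return raw_cv_pct / math.sqrt(run_count)


def runs_needed(raw_cv_pct, target_cv_pct):
    """Smallest run count whose averaged CV is at or below the target."""
    return math.ceil((raw_cv_pct / target_cv_pct) ** 2)


print(f"GLM 5.1, 6 runs:    {averaged_cv(15.6, 6):.1f}%")  # 6.4%
print(f"Sonnet 4.6, 3 runs: {averaged_cv(10.0, 3):.1f}%")  # 5.8%

# Extrapolation only: runs GLM would need to match Opus 4.7's 3.7% averaged CV.
print(runs_needed(15.6, 3.7))  # 18
```

Since noise shrinks only with the square root of the run count, halving the averaged CV takes four times as many calls.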

Stability by PR Size

Model | Tiny | Small | Medium | Large/XL
Opus 4.8 | 9.4% | 9.7% | 17.0% | 5.7%
Sonnet 4.6 | 16.1% | 12.9% | 10.7% | 4.1%
Qwen3.6 Plus | 0.0% | 11.7% | 10.5% | 8.5%
Kimi K2.6 | 13.3% | 25.7% | 15.1% | 10.6%
GLM 5.1 | 26.8% | 12.8% | 15.3% | 12.5%
Haiku 4.5 | 15.4% | 24.5% | 24.2% | 14.0%

Opus 4.8's instability concentrates on medium PRs (17.0% CV), while it is steady on large/XL PRs (5.7%). Its largest deviation from 4.7 falls in a different band: the By Size results show a MAD of 5.44 on large/XL PRs against 2.94 on medium. GLM's high tiny-PR CV is inflated by very small score denominators. Haiku is noisy across every size band.

Does 2x More Testing Stabilize Kimi/GLM?

First 3 runs vs the full 6-run averages for Kimi and GLM.

Model | Runs Averaged | MAD vs Opus 4.7 | Bias | r | Avg Raw CV | Est. CV of Averaged Score | Avg Cost per Averaged PR
Kimi K2.6 | 3 | 5.81 | +5.66 | 0.976 | 13.7% | 7.9% | $0.063
Kimi K2.6 | 6 | 5.93 | +5.80 | 0.979 | 16.1% | 6.6% | $0.126
GLM 5.1 | 3 | 3.57 | +1.53 | 0.960 | 12.9% | 7.5% | $0.063
GLM 5.1 | 6 | 2.92 | +1.22 | 0.975 | 15.6% | 6.4% | $0.126

Answer: averaging more calls helps GLM but not Kimi. GLM's MAD improves from 3.57 to 2.92 and its correlation from 0.960 to 0.975 with 6 runs. Kimi's MAD does not improve (5.81 → 5.93) because the extra runs do not reduce its large, consistent upward bias (+5.8). The averaged-score noise drops for both, but that does not fix a systematic bias.
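One way to see why averaging cannot rescue Kimi: the mean absolute deviation can never be smaller than the absolute value of the bias (the mean of absolute values is at least the absolute value of the mean). The bias therefore sets a floor on MAD, and only the gap between MAD and that floor can be attributed to scatter that averaging might reduce. The sketch below applies this to the 6-run rows of the table; the function and variable names are hypothetical.

```python
# (MAD vs Opus 4.7, bias) at 6 runs per PR, copied from the table above.
SIX_RUN_RESULTS = {
    "Kimi K2.6": (5.93, 5.80),
    "GLM 5.1": (2.92, 1.22),
}


def mad_floor_and_headroom(mad, bias):
    """Return (floor, headroom): |bias| bounds MAD from below; the rest is scatter."""
    floor = abs(bias)
    return floor, mad - floor


for model, (mad, bias) in SIX_RUN_RESULTS.items():
    floor, headroom = mad_floor_and_headroom(mad, bias)
    print(f"{model}: MAD floor {floor:.2f}, headroom {headroom:.2f}")
# Kimi K2.6: MAD floor 5.80, headroom 0.13
# GLM 5.1: MAD floor 1.22, headroom 1.70
```

Kimi's MAD already sits within 0.13 points of its bias floor, so extra runs have almost nothing left to remove; GLM still has about 1.7 points of scatter above its floor.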

Results: By Size

Accuracy (MAD vs Opus 4.7) by PR size category.

Model | Tiny (<50) | Small (50-150) | Medium (150-500) | Large/XL (500+)
Opus 4.8 | 0.33 | 0.73 | 2.94 | 5.44
Opus 4.6 | 0.11 | 0.93 | 4.00 | 5.78
GLM 5.1 | 0.39 | 2.67 | 2.97 | 4.33
Sonnet 4.6 | 0.22 | 2.13 | 5.33 | 11.78
Haiku 4.5 | 0.44 | 5.33 | 4.17 | 6.83
Kimi K2.6 | 0.43 | 2.58 | 6.42 | 10.97
Qwen3.6 Plus | 0.44 | 3.47 | 10.06 | 10.91

All models converge on tiny PRs. Opus 4.8 is tightest on small and medium PRs; GLM has the lowest large/XL MAD of any model (4.33), ahead of Opus 4.8 (5.44) and Opus 4.6 (5.78). The bigger over-scorers (Sonnet, Kimi, Qwen) drift most on large/XL work.

Results: By Language

Accuracy (MAD vs Opus 4.7) by primary language.

Model | Rust | TypeScript | Ruby/Rails
Opus 4.8 | 3.00 | 2.97 | 1.83
Opus 4.6 | 3.27 | 3.55 | 2.08
GLM 5.1 | 2.67 | 2.71 | 3.79
Sonnet 4.6 | 4.13 | 6.58 | 5.25
Haiku 4.5 | 4.87 | 3.88 | 6.75
Kimi K2.6 | 2.83 | 7.09 | 6.58
Qwen3.6 Plus | 6.27 | 7.32 | 8.17

Opus 4.8 and GLM 5.1 are the most language-balanced of the non-reference models. Kimi and Sonnet drift most on TypeScript; Qwen is weakest across the board.
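The per-language figures are the same MAD computed within each language group. As a sketch of that grouping, the code below recomputes the Opus 4.8 row from the Opus 4.7 and Opus 4.8 columns of the Per-PR Score Comparison table. The function name and row layout are hypothetical, and because the inputs are rounded to one decimal the results agree with the table only to within rounding.

```python
from collections import defaultdict

# (language, Opus 4.7 avg, Opus 4.8 avg) for PRs 1..20, from the per-PR table.
ROWS = [
    ("Rust", 1.0, 1.0), ("Rust", 9.3, 11.3), ("Rust", 24.3, 26.3),
    ("Rust", 10.3, 14.3), ("Rust", 51.7, 58.7),
    ("TypeScript", 1.0, 1.7), ("TypeScript", 6.3, 7.0),
    ("TypeScript", 14.7, 14.3), ("TypeScript", 36.0, 36.7),
    ("TypeScript", 40.7, 56.7), ("TypeScript", 6.0, 6.3),
    ("TypeScript", 44.0, 46.3), ("TypeScript", 36.7, 42.7),
    ("TypeScript", 3.3, 3.0), ("TypeScript", 6.7, 7.3),
    ("TypeScript", 46.0, 50.7),
    ("Ruby/Rails", 6.3, 6.3), ("Ruby/Rails", 12.0, 11.7),
    ("Ruby/Rails", 16.0, 21.0), ("Ruby/Rails", 34.0, 36.0),
]


def mad_by_group(rows):
    """Mean absolute deviation from the reference, computed per group label."""
    deviations = defaultdict(list)
    for group, reference, model in rows:
        deviations[group].append(abs(model - reference))
    return {group: sum(d) / len(d) for group, d in deviations.items()}


for language, value in mad_by_group(ROWS).items():
    print(f"{language}: {value:.3f}")
# Compare with the Opus 4.8 row above: Rust 3.00, TypeScript 2.97, Ruby/Rails 1.83.
```

The same function works for the By Size breakdown if the first element of each row is the size label instead of the language.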

Opus 4.7 vs the newest model, Opus 4.8

We also evaluated Anthropic's newest release, Opus 4.8, as a scoring model. It tracks Opus 4.7's rankings almost perfectly (r=0.988, tied with Qwen3.6 Plus for the highest correlation, and with the lowest MAD of any model on this corpus), but on the same reviews it:

  • scores about 2.6 points higher on average (a consistent upward bias),
  • varies more between repeated runs (10.7% CV vs 4.7's 6.3%, roughly 1.7x the average CV), and
  • costs the most of any model tested.

A scoring reference is most valuable when it is steady and reproducible, so Opus 4.7 remains GitVelocity's gold standard. Opus 4.8 is a strong, capable model — it simply scores a touch higher and less consistently than the reference, which is why we kept 4.7 as the anchor your scores are calibrated to.

Recommendations

  • Claude Sonnet 4.6 is the default and recommended model. It is the price-performance sweet spot, and we prefer the Anthropic model — it is what scores your PRs unless you opt into another model. (GLM 5.1 is cheaper and closer to the Opus 4.7 reference on this corpus, but it requires bringing your own OpenRouter key and is an opt-in alternative, not the default.)

  • Opus 4.7 is the scoring reference, not the default. It is the most reproducible of the Opus models and the anchor every other model is measured against — it defines the scale Sonnet and the rest are compared to.

  • GLM 5.1 is the best lower-cost option. MAD 2.92, r=0.975, and the lowest bias of the lower-cost models (+1.22) with 6-run averaging — at roughly 8x lower per-call cost than Opus 4.7. The tradeoff is noisier individual calls (15.6% raw CV), so it is best used with averaging.

  • Kimi K2.6 over-scores (+5.80 bias) and does not improve with more runs.

  • Qwen3.6 Plus is the cheapest but the least accurate here — it over-scores medium and large PRs and ranks last on deviation.

  • Averaging lower-cost calls is worthwhile only if you use the averaged score, and only helps models without a systematic bias (GLM, not Kimi).

Cost at Scale

Single-call cost projection:

Volume | Qwen | GLM | Kimi | Haiku | Sonnet | Opus 4.6 | Opus 4.7 | Opus 4.8
100 PRs/month | $1 | $2 | $2 | $3 | $11 | $13 | $18 | $19
1,000 PRs/month | $9 | $21 | $21 | $31 | $110 | $127 | $179 | $186
10,000 PRs/month | $85 | $210 | $210 | $312 | $1,102 | $1,273 | $1,788 | $1,863

Averaged-score cost projection:

Volume | 6x GLM | 6x Kimi | 3x Sonnet
100 PRs/month | $13 | $13 | $33
1,000 PRs/month | $126 | $126 | $331
10,000 PRs/month | $1,260 | $1,260 | $3,306
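Both projections are per-call cost multiplied by monthly PR volume, with the averaged-score variant also multiplied by the number of runs per PR. The sketch below reproduces the 1,000 PRs/month row from the average token counts in the cost table; the function names are hypothetical, and the projection assumes your PRs resemble the benchmark corpus in size. Recomputed values can differ from the tables by a dollar at higher volumes because the published token averages are rounded.

```python
def cost_per_call(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """USD cost of one review call, given token counts and $/1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cost(per_call_usd, prs_per_month, runs_per_pr=1):
    """Projected monthly spend when every PR is scored runs_per_pr times."""
    return per_call_usd * runs_per_pr * prs_per_month


# Average token counts from the cost table, prices from Models Under Test.
sonnet = cost_per_call(16_817, 3_985, 3.00, 15.00)
glm = cost_per_call(14_163, 1_762, 1.05, 3.50)

print(round(monthly_cost(sonnet, 1_000)))                # 110 (single call)
print(round(monthly_cost(sonnet, 1_000, runs_per_pr=3))) # 331 (3x Sonnet)
print(round(monthly_cost(glm, 1_000)))                   # 21  (single call)
print(round(monthly_cost(glm, 1_000, runs_per_pr=6)))    # 126 (6x GLM)
```

At these prices a 6-run GLM average still costs less than half of a 3-run Sonnet average.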

Lower-cost OpenRouter option (opt-in)

Our recommended default stays Claude Sonnet 4.6 — we prefer the Anthropic model, and it is what we suggest for most teams. If cost is your priority, GitVelocity also supports bring-your-own OpenRouter keys as an opt-in alternative. Both steps below happen on the same page under Settings → AI Scoring.

1. Connect your OpenRouter key in the API Keys card.

[Screenshot: Settings → AI Scoring → API Keys card with Anthropic and OpenRouter providers connected; redacted key suffixes are shown next to each provider with an Edit action.]

Click Edit on the OpenRouter row to paste your key (sk-or-...). Only the last four characters are ever shown back to you.

2. Switch the scoring model to GLM 5.1 (OpenRouter) in the Model Selection card.

[Screenshot: Settings → AI Scoring → Model Selection dropdown opened, showing Latest (Claude Sonnet 4.6), Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, and GLM 5.1 (OpenRouter) as the candidate options.]

PR reviews route through OpenRouter on your own key as soon as you pick GLM 5.1. The default remains Latest (currently Sonnet 4.6) if you don't change it.

Model | r vs Opus 4.7 | MAD | Bias | Single-call cost (1,000 PRs/month)
GLM 5.1 | 0.975 | 2.92 | +1.22 | $21
Kimi K2.6 | 0.979 | 5.93 | +5.80 | $21

GLM 5.1 lands closest to the Opus 4.7 gold standard of the OpenRouter options, with far lower bias than Kimi (+1.22 vs +5.80). The default model is unchanged (Latest / Claude Sonnet 4.6); GLM 5.1 is fully opt-in.

Appendix: Notes

The dataset behind this page comprises 600 scored results: the 20-PR corpus run across all eight models, with 3 runs each for six models and 6 runs each for Kimi and GLM (6 × 20 × 3 + 2 × 20 × 6 = 600). To keep scoring as repeatable as possible, models are queried at temperature 0 where the model supports it; Opus 4.7 and 4.8 are queried with their default sampling. Sonnet 4.6 is given an extended-thinking budget; the OpenRouter candidates are queried with reasoning disabled. All figures on this page come from this benchmark, not from production scores.