AI Model Scoring Benchmark

Purpose & Methodology

GitVelocity uses Claude to score PRs on a 0-100 scale across 6 sub-categories. This benchmark compares Opus 4.6, Sonnet 4.6, Haiku 4.5, Kimi K2.6, GLM 5.1, and Qwen3.6 Plus against the Opus 4.7 gold-standard baseline on three dimensions:

  1. Cost - Token usage and USD cost per review
  2. Accuracy - Score deviation from the Opus 4.7 baseline
  3. Stability - Variance across independent runs per model

Every model scored the same 20-PR corpus using buildClaudePrompt() with the default guideline. Haiku, Sonnet, Opus, and Qwen are scored with 3 independent runs per PR. Kimi K2.6 and GLM 5.1 are scored with 6 independent runs per PR to test whether cheaper OpenRouter models can be stabilized by averaging more calls. All results come from fresh benchmark runs with the same prompt inputs and the same scoring schema; no production-database scores are used as a baseline.

The benchmark runner supports OpenRouter-hosted candidate models. When testing non-Claude models, use the same 20-PR corpus, buildClaudePrompt() prompt, JSON extraction path, and Opus 4.7 baseline. Do not compare a single OpenRouter run against the three-run Claude averages.
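The JSON extraction path itself is not reproduced in this document. As a minimal sketch of what it implies, assuming the runner slices the first {...} object out of the raw reply (the function name and fallback behavior here are illustrative, not the runner's actual code):

  // Illustrative only: pull the first JSON object out of a reply that may
  // wrap it in prose; the real extraction path may differ.
  function extractScoreJson(reply: string): Record<string, unknown> | null {
    const start = reply.indexOf("{");
    const end = reply.lastIndexOf("}");
    if (start === -1 || end <= start) return null;
    try {
      return JSON.parse(reply.slice(start, end + 1)) as Record<string, unknown>;
    } catch {
      return null; // treat an unparseable reply as a failed run
    }
  }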

Scoring

  • Total score (0-100): Composite of 6 sub-scores
  • Sub-scores: Scope, Architecture, Implementation, Risk, Quality, Perf/Security
  • Effort Scale Factor: Applied per the guideline based on PR size

Statistical Measures

  • Mean Absolute Deviation (MAD): Average |model_avg - opus_4_7_avg| across PRs
  • Stability (CV): Coefficient of variation across a model's independent runs (stddev / mean)
  • Estimated CV of averaged score: raw CV / sqrt(run_count), used to estimate the stability of an averaged final score; see the sketch after this list
  • Correlation (r): Pearson correlation of per-PR average model scores vs Opus 4.7 averages
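A minimal TypeScript sketch of these four measures, assuming scores arrive as per-PR arrays of per-run values (the helper names and input shape are illustrative, not the benchmark runner's API):

  // Illustrative helpers for the measures above; not the runner's real API.
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const stddev = (xs: number[]) => {
    const m = mean(xs);
    return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
  };

  // MAD: average |model_avg - baseline_avg| across PRs.
  const mad = (model: number[][], baseline: number[][]) =>
    mean(model.map((runs, i) => Math.abs(mean(runs) - mean(baseline[i]))));

  // Raw CV: stddev / mean across one PR's runs, averaged over all PRs.
  const avgCv = (model: number[][]) =>
    mean(model.map((runs) => stddev(runs) / mean(runs)));

  // Estimated CV of an averaged score: raw CV shrinks by sqrt(run_count).
  const estAveragedCv = (rawCv: number, runCount: number) =>
    rawCv / Math.sqrt(runCount);

  // Pearson r between per-PR model averages and baseline averages.
  const pearson = (xs: number[], ys: number[]) => {
    const mx = mean(xs);
    const my = mean(ys);
    const cov = mean(xs.map((x, i) => (x - mx) * (ys[i] - my)));
    return cov / (stddev(xs) * stddev(ys));
  };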

Models Under Test

| Model | Provider | Input $/1M | Output $/1M | Runs/PR | Notes |
|---|---|---|---|---|---|
| claude-haiku-4-5-20251001 | Anthropic | $1.00 | $5.00 | 3 | Fastest Anthropic model |
| claude-sonnet-4-6 | Anthropic | $3.00 | $15.00 | 3 | Mid-tier Claude |
| claude-opus-4-6 | Anthropic | $5.00 | $25.00 | 3 | Previous gold standard |
| claude-opus-4-7 | Anthropic | $5.00 | $25.00 | 3 | Gold standard baseline |
| moonshotai/kimi-k2.6 | OpenRouter | $0.7448 | $4.655 | 6 | Candidate run with reasoning disabled |
| z-ai/glm-5.1 | OpenRouter | $1.05 | $3.50 | 6 | Candidate run with reasoning disabled |
| qwen/qwen3.6-plus | OpenRouter | $0.325 | $1.95 | 3 | Candidate run with reasoning disabled |

Running OpenRouter Candidates

Run the 3-run candidate pass with:

OPENROUTER_API_KEY=sk-or-... OPENROUTER_REASONING_EFFORT=none \
  npx ts-node backend/scripts/run-model-benchmark.ts \
  --config docs/benchmark-config.json \
  --models openrouter/kimi-k2.6,openrouter/glm-5.1,openrouter/qwen3.6-plus \
  --baseline claude-opus-4-7 \
  --resume

Run the 6-run Kimi/GLM averaging pass with:

OPENROUTER_API_KEY=sk-or-... OPENROUTER_REASONING_EFFORT=none \
  npx ts-node backend/scripts/run-model-benchmark.ts \
  --config docs/benchmark-config.json \
  --models openrouter/kimi-k2.6,openrouter/glm-5.1 \
  --baseline claude-opus-4-7 \
  --runs 6 \
  --resume

The aliases resolve to the OpenRouter model slugs moonshotai/kimi-k2.6, z-ai/glm-5.1, and qwen/qwen3.6-plus. Prices above are from OpenRouter model metadata for those three models, checked on 2026-04-25. OpenRouter's reasoning-token documentation lists reasoning.effort = "none" as disabling reasoning; that is the setting used for these candidate runs.
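In code terms the alias mapping amounts to something like the following sketch (the runner's actual lookup table may be structured differently):

  // Alias-to-slug mapping described above; structure is illustrative.
  const OPENROUTER_ALIASES: Record<string, string> = {
    "openrouter/kimi-k2.6": "moonshotai/kimi-k2.6",
    "openrouter/glm-5.1": "z-ai/glm-5.1",
    "openrouter/qwen3.6-plus": "qwen/qwen3.6-plus",
  };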

For a cheap live smoke test before a full run, add --pr-ids 1 --runs 1 --output /tmp/gitvelocity-openrouter-smoke.json.
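For example, exercising a single candidate model end to end:

OPENROUTER_API_KEY=sk-or-... OPENROUTER_REASONING_EFFORT=none \
  npx ts-node backend/scripts/run-model-benchmark.ts \
  --config docs/benchmark-config.json \
  --models openrouter/glm-5.1 \
  --baseline claude-opus-4-7 \
  --pr-ids 1 --runs 1 \
  --output /tmp/gitvelocity-openrouter-smoke.json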

deepseek/deepseek-v4-pro pricing and alias support are present in the runner, but DeepSeek is intentionally excluded from this saved result set: repeated provider 429 rate-limit errors led to the decision to skip it.

Test Corpus

20 PRs selected for diversity across size, language, and complexity.

san-francisco (Rust) - 5 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 1 | #537 "Ensure attio keys are properly pruned" | +1/-1 | 1 | Tiny fix |
| 2 | #548 "seed-investor-pedigree: add fields into ES" | +60/-14 | 3 | Small feature |
| 3 | #549 "seed-investor-pedigree: compute company's seed" | +314/-24 | 2 | Medium feature |
| 4 | #542 "Return structured JSON response from LLM list column API" | +138/-4 | 6 | Medium refactor |
| 5 | #545 "real time eva-list updates for people index" | +1020/-2 | 13 | Large feature |

gitvelocity (TypeScript/React/NestJS) - 5 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 6 | #174 "Increase review processor concurrency" | +2/-2 | 2 | Tiny config |
| 7 | #169 "Add 404 not found page" | +97/-0 | 2 | Small feature |
| 8 | #175 "Add integration branch support for backfill" | +152/-54 | 9 | Medium feature |
| 9 | #176 "Add Settings > Usage page" | +923/-162 | 14 | Large feature |
| 10 | #177 "Add backfill history view with per-PR tracking" | +1447/-47 | 18 | XL feature |

gmail-integration (TypeScript) - 3 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 11 | #257 "Protect timestamptz casts from overflow" | +36/-4 | 2 | Small fix |
| 12 | #261 "Prevent S3 orphan cleanup race condition" | +667/-111 | 10 | Large fix |
| 13 | #260 "Move backfill logic to Sidekiq background" | +436/-136 | 3 | Medium refactor |

skynet (TypeScript) - 3 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 14 | #302 "Humanize outreach email prompt" | +18/-5 | 1 | Small prompt |
| 15 | #298 "Fix ObservationalMemory threadId crash" | +59/-12 | 2 | Small fix |
| 16 | #295 "Add interactive checkpoint tools" | +999/-5 | 18 | Large feature |

eva-web (Rails/React) - 4 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 17 | #4364 "Fix debug page text selection" | +119/-4 | 1 | Small fix |
| 18 | #4369 "Fix ActiveRecord connection pool leaks" | +58/-37 | 6 | Medium fix |
| 19 | #4374 "Eliminate persistent MCP SSE heartbeat" | +56/-208 | 5 | Medium refactor |
| 20 | #4365 "Add get_company_people MCP tool" | +714/-1 | 4 | Large feature |

Coverage:

  • Sizes: 3 tiny (<50 lines), 5 small (50-150), 6 medium (150-500), 6 large/XL (500+)
  • Languages: Rust (5), TypeScript (11), Ruby/Rails (4)
  • Types: features (10), fixes (6), refactors (3), config (1)

Results: Cost

| Model | Calls | Avg Input Tokens | Avg Output Tokens | Avg Cost/Call | Total Benchmark Cost |
|---|---|---|---|---|---|
| Qwen3.6 Plus | 60 | 13,568 | 1,877 | $0.008 | $0.48 |
| GLM 5.1 | 120 | 12,523 | 1,590 | $0.019 | $2.25 |
| Kimi K2.6 | 120 | 12,491 | 2,259 | $0.020 | $2.38 |
| Haiku 4.5 | 60 | 14,800 | 3,056 | $0.030 | $1.80 |
| Sonnet 4.6 | 60 | 14,829 | 3,970 | $0.104 | $6.24 |
| Opus 4.6 | 60 | 14,951 | 1,806 | $0.120 | $7.19 |
| Opus 4.7 | 60 | 20,341 | 2,350 | $0.161 | $9.63 |

Qwen is the cheapest model tested, at roughly one-twentieth of Opus 4.7's per-call cost; GLM and Kimi come in at roughly one-eighth. Because Kimi and GLM were run 6 times per PR, their total benchmark cost is higher than the 3-run Haiku total but still far below the Sonnet and Opus runs.
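As a worked check on the per-call figures, cost per call is just token counts times per-million prices (the helper below is illustrative; the token counts and prices come from the tables above):

  // Cost per call = input_tokens * input_$/1M + output_tokens * output_$/1M.
  const costPerCall = (inTok: number, outTok: number, inPerM: number, outPerM: number) =>
    (inTok * inPerM + outTok * outPerM) / 1_000_000;

  costPerCall(20_341, 2_350, 5.0, 25.0);   // Opus 4.7: ~$0.161
  costPerCall(13_568, 1_877, 0.325, 1.95); // Qwen3.6 Plus: ~$0.008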

Results: Accuracy

Deviation from Opus 4.7 (lower is better). Per-PR model scores are averaged across each model's configured runs.

| Model | Runs/PR | Mean Total Score | MAD vs Opus 4.7 | Max Deviation | Bias | Correlation (r) |
|---|---|---|---|---|---|---|
| Opus 4.7 | 3 | 20.8 | 0 (baseline) | 0.00 | 0.00 | 1.000 |
| Opus 4.6 | 3 | 20.8 | 1.92 | 8.00 | +0.02 | 0.989 |
| GLM 5.1 | 6 | 21.4 | 2.85 | 8.00 | +0.55 | 0.983 |
| Kimi K2.6 | 6 | 24.1 | 3.62 | 10.83 | +3.25 | 0.984 |
| Sonnet 4.6 | 3 | 25.4 | 5.08 | 22.00 | +4.58 | 0.979 |
| Haiku 4.5 | 3 | 24.6 | 5.36 | 16.33 | +3.76 | 0.949 |
| Qwen3.6 Plus | 3 | 27.7 | 6.93 | 30.67 | +6.90 | 0.946 |

Opus 4.6 tracks Opus 4.7 most closely: near-zero bias, MAD under 2 points, and the strongest correlation of the non-baseline models. GLM 5.1 is the best OpenRouter candidate in this result set. With 6-run averaging it beats Sonnet and Haiku on MAD by a clear margin. Qwen3.6 Plus is extremely cheap but over-scores this corpus and ranks last on accuracy.

Per-PR Score Comparison

Each cell is the mean across that model's configured runs.

| PR# | Size | Lang | Opus 4.7 | Opus 4.6 | GLM 5.1 | Kimi K2.6 | Sonnet 4.6 | Haiku 4.5 | Qwen3.6 Plus |
|---|---|---|---|---|---|---|---|---|---|
| 1 | tiny | Rust | 1.0 | 1.0 | 1.2 | 1.0 | 1.3 | 1.0 | 1.0 |
| 2 | small | Rust | 8.7 | 10.7 | 13.0 | 10.2 | 12.3 | 11.3 | 14.0 |
| 3 | medium | Rust | 23.7 | 27.0 | 29.2 | 30.0 | 28.7 | 26.3 | 33.3 |
| 4 | medium | Rust | 10.3 | 12.0 | 15.3 | 17.7 | 13.0 | 18.3 | 13.3 |
| 5 | large | Rust | 58.7 | 56.7 | 50.7 | 59.2 | 60.7 | 52.0 | 62.7 |
| 6 | tiny | TS | 1.0 | 1.0 | 1.2 | 1.0 | 1.0 | 1.0 | 2.0 |
| 7 | small | TS | 6.0 | 6.0 | 5.8 | 6.7 | 7.3 | 7.0 | 6.7 |
| 8 | medium | TS | 15.7 | 18.3 | 15.0 | 19.0 | 20.0 | 24.5 | 17.0 |
| 9 | large | TS | 36.7 | 34.7 | 37.2 | 41.3 | 49.3 | 45.7 | 42.3 |
| 10 | xl | TS | 51.7 | 48.0 | 49.7 | 48.5 | 60.3 | 56.3 | 53.3 |
| 11 | small | TS | 6.0 | 6.3 | 7.5 | 6.5 | 7.0 | 7.0 | 7.3 |
| 12 | large | TS | 43.0 | 41.0 | 37.8 | 48.8 | 48.0 | 51.0 | 52.3 |
| 13 | medium | TS | 28.3 | 36.3 | 35.8 | 39.2 | 37.0 | 33.3 | 59.0 |
| 14 | tiny | TS | 2.0 | 2.3 | 3.2 | 1.8 | 2.7 | 1.3 | 1.7 |
| 15 | small | TS | 6.7 | 7.0 | 9.0 | 11.2 | 10.3 | 23.0 | 8.7 |
| 16 | large | TS | 47.0 | 47.3 | 42.2 | 55.2 | 69.0 | 38.3 | 72.3 |
| 17 | small | Ruby | 5.3 | 5.7 | 9.8 | 5.0 | 8.3 | 10.0 | 12.7 |
| 18 | medium | Ruby | 12.0 | 7.3 | 12.2 | 15.0 | 7.0 | 19.3 | 17.0 |
| 19 | medium | Ruby | 18.7 | 17.0 | 19.8 | 23.5 | 26.7 | 19.7 | 31.7 |
| 20 | large | Ruby | 34.0 | 31.0 | 31.8 | 40.7 | 38.0 | 45.0 | 46.0 |

Per-Sub-Score Accuracy (MAD vs Opus 4.7)

| Model | Scope | Architecture | Implementation | Risk | Quality | Perf/Security |
|---|---|---|---|---|---|---|
| Opus 4.6 | 0.58 | 0.55 | 0.57 | 0.87 | 0.72 | 0.22 |
| GLM 5.1 | 1.83 | 1.38 | 1.28 | 1.28 | 0.44 | 0.42 |
| Kimi K2.6 | 1.49 | 1.45 | 1.89 | 0.95 | 0.69 | 0.35 |
| Sonnet 4.6 | 1.10 | 1.63 | 1.28 | 1.00 | 0.88 | 0.20 |
| Haiku 4.5 | 2.22 | 0.92 | 2.07 | 1.43 | 0.98 | 0.62 |
| Qwen3.6 Plus | 2.05 | 2.23 | 3.07 | 1.92 | 1.43 | 0.55 |

Qwen's largest gap is Implementation, which is also where it most visibly over-scored larger feature PRs. GLM's biggest gap is Scope; Kimi's biggest gap is Implementation.

Results: Stability

Coefficient of variation across each model's independent runs. Lower is better.

| Model | Runs/PR | Avg CV (total_score) | Max CV | PRs with CV > 10% | Estimated CV of Averaged Score |
|---|---|---|---|---|---|
| Opus 4.6 | 3 | 4.6% | 20.2% | 2/20 | 2.6% |
| Opus 4.7 | 3 | 5.0% | 21.7% | 3/20 | 2.9% |
| Sonnet 4.6 | 3 | 8.7% | 35.4% | 7/20 | 5.0% |
| Qwen3.6 Plus | 3 | 10.0% | 28.3% | 8/20 | 5.8% |
| Kimi K2.6 | 6 | 13.5% | 32.4% | 14/20 | 5.5% |
| GLM 5.1 | 6 | 14.4% | 31.9% | 13/20 | 5.9% |
| Haiku 4.5 | 3 | 18.2% | 77.4% | 14/20 | 10.5% |

Kimi and GLM still have materially noisier individual calls than Sonnet or Opus. Averaging 6 calls narrows the expected noise of the final averaged score to roughly Sonnet's 3-run average (GLM: 14.4% / sqrt(6) ≈ 5.9%, versus Sonnet's 8.7% / sqrt(3) ≈ 5.0%), but it does not make individual OpenRouter calls as stable as Claude Opus.

Stability by PR Size

| Model | Tiny | Small | Medium | Large/XL |
|---|---|---|---|---|
| Kimi K2.6 | 6.8% | 13.1% | 21.1% | 9.8% |
| GLM 5.1 | 25.2% | 14.5% | 12.3% | 11.1% |
| Qwen3.6 Plus | 9.4% | 13.0% | 7.0% | 10.7% |
| Sonnet 4.6 | 17.7% | 9.4% | 9.2% | 2.9% |

The OpenRouter instability is not only a large-PR problem. Kimi is least stable on medium PRs in this corpus; GLM's high tiny-PR CV is inflated by very small score denominators, but it also remains noisier than Sonnet across small, medium, and large PRs.

Does 2x More Testing Stabilize Kimi/GLM?

This compares the first 3 runs against the final 6-run averages for Kimi and GLM.

| Model | Runs Averaged | MAD vs Opus 4.7 | Bias | r | Avg Raw CV | Est. CV of Averaged Score | Avg Cost per Averaged PR |
|---|---|---|---|---|---|---|---|
| Kimi K2.6 | 3 | 3.52 | +2.78 | 0.972 | 10.8% | 6.2% | $0.058 |
| Kimi K2.6 | 6 | 3.62 | +3.25 | 0.984 | 13.5% | 5.5% | $0.119 |
| GLM 5.1 | 3 | 3.47 | +0.30 | 0.970 | 11.0% | 6.3% | $0.056 |
| GLM 5.1 | 6 | 2.85 | +0.55 | 0.983 | 14.4% | 5.9% | $0.112 |

Per-PR drift between the first-3 and last-3 run averages:

| Model | Mean Abs First3-vs-Last3 Gap | Max Gap | PRs > 5 pts | Mean First3-to-Six Shift |
|---|---|---|---|---|
| Kimi K2.6 | 2.74 | 12.33 | 4/20 | 1.37 |
| GLM 5.1 | 1.73 | 9.00 | 1/20 | 0.87 |

Answer: yes, averaging more calls helps the final averaged score, but it is not a full substitute for a more stable model. The extra runs improved correlation for both Kimi and GLM and improved GLM's MAD from 3.47 to 2.85. Kimi's MAD did not improve, because the extra runs exposed more upward bias on some medium and large PRs.

Cost-wise, 6x GLM costs about $0.112 per scored PR and 6x Kimi costs about $0.119. That is still about 64% cheaper than a 3-run Sonnet average ($0.312), but it is slightly more expensive than a single Sonnet call ($0.104). If production uses one Sonnet call, a 2-call Kimi/GLM average is cheaper but should be expected to remain noisier than Sonnet.

Results: By Size

Accuracy (MAD vs Opus 4.7) broken down by PR size category.

| Model | Tiny (<50) | Small (50-150) | Medium (150-500) | Large/XL (500+) |
|---|---|---|---|---|
| Opus 4.6 | 0.11 | 0.60 | 3.67 | 2.17 |
| GLM 5.1 | 0.50 | 2.57 | 3.33 | 3.78 |
| Kimi K2.6 | 0.06 | 1.51 | 5.94 | 4.83 |
| Sonnet 4.6 | 0.33 | 2.53 | 5.61 | 9.06 |
| Haiku 4.5 | 0.22 | 5.13 | 5.48 | 8.00 |
| Qwen3.6 Plus | 0.44 | 3.33 | 10.44 | 9.67 |

All models converge on tiny PRs. Kimi is strong on small PRs but drifts on medium PRs; GLM is the best non-Opus model on medium and large PRs in this corpus. Qwen is weakest on medium and large PRs because it substantially over-scores several feature/refactor PRs.

Results: By Language

Accuracy (MAD vs Opus 4.7) broken down by primary language.

| Model | Rust | TypeScript | Ruby/Rails |
|---|---|---|---|
| Opus 4.6 | 1.80 | 1.79 | 2.42 |
| GLM 5.1 | 4.60 | 2.36 | 2.00 |
| Kimi K2.6 | 3.13 | 3.81 | 3.71 |
| Sonnet 4.6 | 2.73 | 6.18 | 5.00 |
| Haiku 4.5 | 4.00 | 5.75 | 6.00 |
| Qwen3.6 Plus | 4.40 | 7.21 | 9.33 |

Opus 4.6 is the most language-stable. GLM is strongest on TypeScript and Ruby/Rails but weaker on Rust; Kimi is more balanced than Qwen but still trails GLM on TypeScript and Ruby/Rails. Qwen's Ruby/Rails and medium/large PR over-scoring make it a poor fit for this scoring task without calibration.

Recommendations

  • Opus 4.7 remains the gold standard. It has near-identical stability to Opus 4.6 and is the reference target for this benchmark.

  • Opus 4.6 remains the best substitute if the per-review cost gap matters. At MAD 1.92 and r=0.989 against Opus 4.7, with essentially zero bias, it is the closest proxy available.

  • GLM 5.1 is the best OpenRouter candidate in this run. With 6-run averaging it reaches MAD 2.85, r=0.983, and only +0.55 bias. The tradeoff is noisy individual calls: 14.4% raw average CV.

  • Kimi K2.6 is promising but needs bias handling. Six-run averaging improves correlation, but Kimi still has +3.25 bias and higher medium-PR instability than GLM.

  • Qwen3.6 Plus is not recommended for this scoring task as-is. It is the cheapest model tested, but it over-scores too many medium/large PRs and ranks last on MAD.

  • Averaging cheap OpenRouter calls is viable only if you use the averaged score. Six GLM calls are still cheaper than three Sonnet calls, but single GLM/Kimi calls are not stable enough to replace Sonnet or Opus directly.

Cost at Scale

Single-call cost projection:

| Volume | Qwen | GLM | Kimi | Haiku | Sonnet | Opus 4.6 | Opus 4.7 |
|---|---|---|---|---|---|---|---|
| 100 PRs/month | $1 | $2 | $2 | $3 | $10 | $12 | $16 |
| 1,000 PRs/month | $8 | $19 | $20 | $30 | $104 | $120 | $161 |
| 10,000 PRs/month | $81 | $187 | $198 | $301 | $1,040 | $1,199 | $1,605 |

Averaged-score cost projection:

| Volume | 6x GLM | 6x Kimi | 3x Sonnet |
|---|---|---|---|
| 100 PRs/month | $11 | $12 | $31 |
| 1,000 PRs/month | $112 | $119 | $312 |
| 10,000 PRs/month | $1,123 | $1,189 | $3,121 |

Appendix: Raw Data

Full results are in docs/benchmark-results.json: 540 rows covering 20 PRs, i.e. 20 x (5 models x 3 runs + 2 models x 6 runs). The saved config includes modelRuns to record the per-model run counts.
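A rough sketch of the config shape that implies; only modelRuns is confirmed by this document, and the other field names are assumptions:

  // Hypothetical shape of docs/benchmark-config.json. Only modelRuns is
  // confirmed here; the other field names are illustrative assumptions.
  interface BenchmarkConfig {
    baseline: string;                  // e.g. "claude-opus-4-7"
    prIds: number[];                   // the 20-PR corpus
    modelRuns: Record<string, number>; // e.g. { "z-ai/glm-5.1": 6 }
  }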

All scores were collected via backend/scripts/run-model-benchmark.ts. Haiku 4.5 and Opus 4.6 use temperature: 0 without extended thinking, matching the production code path in claude.service.ts. Opus 4.7 omits the temperature parameter entirely because it rejects both the legacy thinking.type.enabled shape and the temperature field. Sonnet 4.6 opts in to extended thinking with a 12,384-token budget. OpenRouter candidate runs use the OpenRouter chat completions endpoint with the same user prompt, temperature: 0, max_tokens: 16384 by default, and OPENROUTER_REASONING_EFFORT=none.
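A minimal sketch of the OpenRouter request those settings imply, using OpenRouter's chat completions endpoint; the exact body run-model-benchmark.ts sends is not shown here, so treat field placement as an assumption:

  // Sketch of one OpenRouter scoring call with reasoning disabled. It mirrors
  // the settings described above but is not a copy of run-model-benchmark.ts.
  const prompt = "<output of buildClaudePrompt() for one PR>"; // placeholder
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "z-ai/glm-5.1",
      messages: [{ role: "user", content: prompt }],
      temperature: 0,
      max_tokens: 16384,
      reasoning: { effort: "none" }, // what OPENROUTER_REASONING_EFFORT=none maps to
    }),
  });
  const data: any = await res.json();
  const reply: string = data.choices[0].message.content;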

Benchmark run dates: Haiku and Sonnet on 2026-03-04; Opus 4.6 and Opus 4.7 on 2026-04-16; Kimi K2.6, GLM 5.1, and Qwen3.6 Plus (via OpenRouter) on 2026-04-25.