AI Model Scoring Benchmark
Purpose & Methodology
GitVelocity uses Claude to score PRs on a 0-100 scale across 6 sub-categories. This benchmark compares Opus 4.6, Sonnet 4.6, Haiku 4.5, Kimi K2.6, GLM 5.1, and Qwen3.6 Plus against the Opus 4.7 gold-standard baseline on three dimensions:
- Cost - Token usage and USD cost per review
- Accuracy - Score deviation from the Opus 4.7 baseline
- Stability - Variance across independent runs per model
Every model scored the same 20-PR corpus using buildClaudePrompt() with the default guideline. Haiku, Sonnet, both Opus versions, and Qwen are scored with 3 independent runs per PR; Kimi K2.6 and GLM 5.1 are scored with 6 independent runs per PR to test whether cheaper OpenRouter models can be stabilized by averaging more calls. All results are from fresh benchmark runs with the same prompt inputs and the same scoring schema; no production-database scores are used as a baseline.
The benchmark runner supports OpenRouter-hosted candidate models. When testing non-Claude models, use the same 20-PR corpus, buildClaudePrompt() prompt, JSON extraction path, and Opus 4.7 baseline. Do not compare a single OpenRouter run against the three-run Claude averages.
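In outline, the per-model flow is: for each PR, issue the configured number of independent calls with the same prompt, parse each JSON response, and average the totals. A simplified sketch follows; scoreOnce and the corpus argument are hypothetical stand-ins, not the runner's real internals.

```typescript
// Simplified sketch of the benchmark flow; scoreOnce() and corpus are
// illustrative stand-ins for the actual runner internals.
interface PrScore {
  prId: number;
  runs: number[];   // one total score per independent call
  average: number;  // per-PR mean used for all accuracy comparisons
}

async function benchmarkModel(
  model: string,
  runsPerPr: number,
  scoreOnce: (model: string, prId: number) => Promise<number>,
  corpus: number[],
): Promise<PrScore[]> {
  const results: PrScore[] = [];
  for (const prId of corpus) {
    const runs: number[] = [];
    for (let i = 0; i < runsPerPr; i++) {
      // Same prompt and JSON extraction path for every model and run.
      runs.push(await scoreOnce(model, prId));
    }
    const average = runs.reduce((a, b) => a + b, 0) / runs.length;
    results.push({ prId, runs, average });
  }
  return results;
}
```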
Scoring
- Total score (0-100): Composite of 6 sub-scores
- Sub-scores: Scope, Architecture, Implementation, Risk, Quality, Perf/Security
- Effort Scale Factor: Applied per the guideline based on PR size
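For orientation, the scored payload roughly corresponds to a shape like the one below; the field names are illustrative assumptions, not the exact production schema.

```typescript
// Illustrative shape only; field names are assumptions, not the exact
// schema GitVelocity stores.
interface PrReviewScore {
  totalScore: number;        // 0-100 composite of the six sub-scores
  subScores: {
    scope: number;
    architecture: number;
    implementation: number;
    risk: number;
    quality: number;
    perfSecurity: number;
  };
  effortScaleFactor: number; // applied per the guideline based on PR size
}
```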
Statistical Measures
- Mean Absolute Deviation (MAD): average `|model_avg - opus_4_7_avg|` across PRs
- Stability (CV): coefficient of variation across a model's independent runs (`stddev / mean`)
- Estimated CV of averaged score: `raw CV / sqrt(run_count)`, used to estimate the stability of an averaged final score
- Correlation (r): Pearson correlation of per-PR average model scores vs Opus 4.7 averages
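These measures reduce to a few lines of arithmetic. A minimal TypeScript sketch, independent of the benchmark runner's actual code:

```typescript
// Mean Absolute Deviation of per-PR model averages vs the Opus 4.7 baseline.
function meanAbsoluteDeviation(model: number[], baseline: number[]): number {
  const diffs = model.map((v, i) => Math.abs(v - baseline[i]));
  return diffs.reduce((a, b) => a + b, 0) / diffs.length;
}

// Coefficient of variation across one PR's independent runs (stddev / mean).
function coefficientOfVariation(runs: number[]): number {
  const mean = runs.reduce((a, b) => a + b, 0) / runs.length;
  const variance = runs.reduce((a, v) => a + (v - mean) ** 2, 0) / runs.length;
  return Math.sqrt(variance) / mean;
}

// Expected CV of an averaged score: raw CV shrinks by sqrt(run_count).
function estimatedAveragedCv(rawCv: number, runCount: number): number {
  return rawCv / Math.sqrt(runCount);
}

// Pearson correlation of per-PR model averages vs baseline averages.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}
```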
Models Under Test
| Model | Provider | Input $/1M | Output $/1M | Runs/PR | Notes |
|---|---|---|---|---|---|
| `claude-haiku-4-5-20251001` | Anthropic | $1.00 | $5.00 | 3 | Fastest Anthropic model |
| `claude-sonnet-4-6` | Anthropic | $3.00 | $15.00 | 3 | Mid-tier Claude |
| `claude-opus-4-6` | Anthropic | $5.00 | $25.00 | 3 | Previous gold standard |
| `claude-opus-4-7` | Anthropic | $5.00 | $25.00 | 3 | Gold standard baseline |
| `moonshotai/kimi-k2.6` | OpenRouter | $0.7448 | $4.655 | 6 | Candidate run with reasoning disabled |
| `z-ai/glm-5.1` | OpenRouter | $1.05 | $3.50 | 6 | Candidate run with reasoning disabled |
| `qwen/qwen3.6-plus` | OpenRouter | $0.325 | $1.95 | 3 | Candidate run with reasoning disabled |
Running OpenRouter Candidates
Run the 3-run candidate pass with:
```bash
OPENROUTER_API_KEY=sk-or-... OPENROUTER_REASONING_EFFORT=none \
npx ts-node backend/scripts/run-model-benchmark.ts \
  --config docs/benchmark-config.json \
  --models openrouter/kimi-k2.6,openrouter/glm-5.1,openrouter/qwen3.6-plus \
  --baseline claude-opus-4-7 \
  --resume
```
Run the 6-run Kimi/GLM averaging pass with:
```bash
OPENROUTER_API_KEY=sk-or-... OPENROUTER_REASONING_EFFORT=none \
npx ts-node backend/scripts/run-model-benchmark.ts \
  --config docs/benchmark-config.json \
  --models openrouter/kimi-k2.6,openrouter/glm-5.1 \
  --baseline claude-opus-4-7 \
  --runs 6 \
  --resume
```
The aliases resolve to the OpenRouter model slugs moonshotai/kimi-k2.6, z-ai/glm-5.1, and qwen/qwen3.6-plus. Prices above are from OpenRouter's model metadata for Kimi K2.6, GLM 5.1, and Qwen3.6 Plus, checked on 2026-04-25. OpenRouter's reasoning-token docs list `reasoning.effort = "none"` as disabling reasoning, which is the setting used for these candidate runs.
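Disabling reasoning amounts to one field on the chat-completions request. A minimal sketch of such a call, using the settings described in this report; the runner's actual request construction may differ.

```typescript
// Sketch of an OpenRouter call with reasoning disabled, mirroring the
// candidate-run settings in this report (temperature 0, max_tokens 16384,
// reasoning effort "none"). Not the runner's actual code.
async function scoreViaOpenRouter(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,                         // e.g. "z-ai/glm-5.1"
      messages: [{ role: "user", content: prompt }],
      temperature: 0,
      max_tokens: 16384,
      reasoning: { effort: "none" }, // maps OPENROUTER_REASONING_EFFORT=none
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content; // raw text; JSON extracted downstream
}
```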
For a cheap live smoke test before a full run, add `--pr-ids 1 --runs 1 --output /tmp/gitvelocity-openrouter-smoke.json`.
deepseek/deepseek-v4-pro pricing and alias support are present in the runner, but DeepSeek is intentionally excluded from this saved result set: repeated provider 429s led to the decision to skip it.
Test Corpus
20 PRs selected for diversity across size, language, and complexity.
san-francisco (Rust) - 5 PRs
| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 1 | #537 "Ensure attio keys are properly pruned" | +1/-1 | 1 | Tiny fix |
| 2 | #548 "seed-investor-pedigree: add fields into ES" | +60/-14 | 3 | Small feature |
| 3 | #549 "seed-investor-pedigree: compute company's seed" | +314/-24 | 2 | Medium feature |
| 4 | #542 "Return structured JSON response from LLM list column API" | +138/-4 | 6 | Medium refactor |
| 5 | #545 "real time eva-list updates for people index" | +1020/-2 | 13 | Large feature |
gitvelocity (TypeScript/React/NestJS) - 5 PRs
| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 6 | #174 "Increase review processor concurrency" | +2/-2 | 2 | Tiny config |
| 7 | #169 "Add 404 not found page" | +97/-0 | 2 | Small feature |
| 8 | #175 "Add integration branch support for backfill" | +152/-54 | 9 | Medium feature |
| 9 | #176 "Add Settings > Usage page" | +923/-162 | 14 | Large feature |
| 10 | #177 "Add backfill history view with per-PR tracking" | +1447/-47 | 18 | XL feature |
gmail-integration (TypeScript) - 3 PRs
| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 11 | #257 "Protect timestamptz casts from overflow" | +36/-4 | 2 | Small fix |
| 12 | #261 "Prevent S3 orphan cleanup race condition" | +667/-111 | 10 | Large fix |
| 13 | #260 "Move backfill logic to Sidekiq background" | +436/-136 | 3 | Medium refactor |
skynet (TypeScript) - 3 PRs
| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 14 | #302 "Humanize outreach email prompt" | +18/-5 | 1 | Small prompt |
| 15 | #298 "Fix ObservationalMemory threadId crash" | +59/-12 | 2 | Small fix |
| 16 | #295 "Add interactive checkpoint tools" | +999/-5 | 18 | Large feature |
eva-web (Rails/React) - 4 PRs
| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 17 | #4364 "Fix debug page text selection" | +119/-4 | 1 | Small fix |
| 18 | #4369 "Fix ActiveRecord connection pool leaks" | +58/-37 | 6 | Medium fix |
| 19 | #4374 "Eliminate persistent MCP SSE heartbeat" | +56/-208 | 5 | Medium refactor |
| 20 | #4365 "Add get_company_people MCP tool" | +714/-1 | 4 | Large feature |
Coverage:
- Sizes: 3 tiny (<50 lines), 5 small (50-150), 6 medium (150-500), 6 large/XL (500+)
- Languages: Rust (5), TypeScript (11), Ruby/Rails (4)
- Types: features (10), fixes (6), refactors (3), config (1)
Results: Cost
| Model | Calls | Avg Input Tokens | Avg Output Tokens | Avg Cost/Call | Total Benchmark Cost |
|---|---|---|---|---|---|
| Qwen3.6 Plus | 60 | 13,568 | 1,877 | $0.008 | $0.48 |
| GLM 5.1 | 120 | 12,523 | 1,590 | $0.019 | $2.25 |
| Kimi K2.6 | 120 | 12,491 | 2,259 | $0.020 | $2.38 |
| Haiku 4.5 | 60 | 14,800 | 3,056 | $0.030 | $1.80 |
| Sonnet 4.6 | 60 | 14,829 | 3,970 | $0.104 | $6.24 |
| Opus 4.6 | 60 | 14,951 | 1,806 | $0.120 | $7.19 |
| Opus 4.7 | 60 | 20,341 | 2,350 | $0.161 | $9.63 |
Qwen is the cheapest model tested, roughly 20x cheaper than Opus 4.7 per call. GLM and Kimi are roughly 8x cheaper per call. Because Kimi and GLM were run 6 times per PR, their total benchmark cost is higher than the 3-run Haiku cost but still far below the Claude Sonnet/Opus runs.
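The per-call figures follow directly from the token averages and the list prices; a quick sketch of the arithmetic:

```typescript
// Per-call cost from average token counts and $/1M list prices.
function costPerCall(
  inputTokens: number,
  outputTokens: number,
  inputPerMillion: number,
  outputPerMillion: number,
): number {
  return (inputTokens * inputPerMillion + outputTokens * outputPerMillion) / 1_000_000;
}

// Examples using the table above:
// Qwen3.6 Plus: costPerCall(13_568, 1_877, 0.325, 1.95)  ≈ $0.008
// Opus 4.7:     costPerCall(20_341, 2_350, 5.00, 25.00)  ≈ $0.161
```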
Results: Accuracy
Deviation from Opus 4.7 (lower is better). Per-PR model scores are the mean of each model's configured run count.
| Model | Runs/PR | Mean Total Score | MAD vs Opus 4.7 | Max Deviation | Bias | Correlation (r) |
|---|---|---|---|---|---|---|
| Opus 4.7 | 3 | 20.8 | 0 (baseline) | 0.00 | 0.00 | 1.000 |
| Opus 4.6 | 3 | 20.8 | 1.92 | 8.00 | +0.02 | 0.989 |
| GLM 5.1 | 6 | 21.4 | 2.85 | 8.00 | +0.55 | 0.983 |
| Kimi K2.6 | 6 | 24.1 | 3.62 | 10.83 | +3.25 | 0.984 |
| Sonnet 4.6 | 3 | 25.4 | 5.08 | 22.00 | +4.58 | 0.979 |
| Haiku 4.5 | 3 | 24.6 | 5.36 | 16.33 | +3.76 | 0.949 |
| Qwen3.6 Plus | 3 | 27.7 | 6.93 | 30.67 | +6.90 | 0.946 |
Opus 4.6 tracks Opus 4.7 most closely: near-zero bias, MAD under 2 points, and the strongest correlation of the non-baseline models. GLM 5.1 is the best OpenRouter candidate in this result set. With 6-run averaging it beats Sonnet and Haiku on MAD by a clear margin. Qwen3.6 Plus is extremely cheap but over-scores this corpus and ranks last on accuracy.
Per-PR Score Comparison
Each cell is the mean of the model's configured run count.
| PR# | Size | Lang | Opus 4.7 | Opus 4.6 | GLM 5.1 | Kimi K2.6 | Sonnet 4.6 | Haiku 4.5 | Qwen3.6 Plus |
|---|---|---|---|---|---|---|---|---|---|
| 1 | tiny | Rust | 1.0 | 1.0 | 1.2 | 1.0 | 1.3 | 1.0 | 1.0 |
| 2 | small | Rust | 8.7 | 10.7 | 13.0 | 10.2 | 12.3 | 11.3 | 14.0 |
| 3 | medium | Rust | 23.7 | 27.0 | 29.2 | 30.0 | 28.7 | 26.3 | 33.3 |
| 4 | medium | Rust | 10.3 | 12.0 | 15.3 | 17.7 | 13.0 | 18.3 | 13.3 |
| 5 | large | Rust | 58.7 | 56.7 | 50.7 | 59.2 | 60.7 | 52.0 | 62.7 |
| 6 | tiny | TS | 1.0 | 1.0 | 1.2 | 1.0 | 1.0 | 1.0 | 2.0 |
| 7 | small | TS | 6.0 | 6.0 | 5.8 | 6.7 | 7.3 | 7.0 | 6.7 |
| 8 | medium | TS | 15.7 | 18.3 | 15.0 | 19.0 | 20.0 | 24.5 | 17.0 |
| 9 | large | TS | 36.7 | 34.7 | 37.2 | 41.3 | 49.3 | 45.7 | 42.3 |
| 10 | xl | TS | 51.7 | 48.0 | 49.7 | 48.5 | 60.3 | 56.3 | 53.3 |
| 11 | small | TS | 6.0 | 6.3 | 7.5 | 6.5 | 7.0 | 7.0 | 7.3 |
| 12 | large | TS | 43.0 | 41.0 | 37.8 | 48.8 | 48.0 | 51.0 | 52.3 |
| 13 | medium | TS | 28.3 | 36.3 | 35.8 | 39.2 | 37.0 | 33.3 | 59.0 |
| 14 | tiny | TS | 2.0 | 2.3 | 3.2 | 1.8 | 2.7 | 1.3 | 1.7 |
| 15 | small | TS | 6.7 | 7.0 | 9.0 | 11.2 | 10.3 | 23.0 | 8.7 |
| 16 | large | TS | 47.0 | 47.3 | 42.2 | 55.2 | 69.0 | 38.3 | 72.3 |
| 17 | small | Ruby | 5.3 | 5.7 | 9.8 | 5.0 | 8.3 | 10.0 | 12.7 |
| 18 | medium | Ruby | 12.0 | 7.3 | 12.2 | 15.0 | 7.0 | 19.3 | 17.0 |
| 19 | medium | Ruby | 18.7 | 17.0 | 19.8 | 23.5 | 26.7 | 19.7 | 31.7 |
| 20 | large | Ruby | 34.0 | 31.0 | 31.8 | 40.7 | 38.0 | 45.0 | 46.0 |
Per-Sub-Score Accuracy (MAD vs Opus 4.7)
| Model | Scope | Architecture | Implementation | Risk | Quality | Perf/Security |
|---|---|---|---|---|---|---|
| Opus 4.6 | 0.58 | 0.55 | 0.57 | 0.87 | 0.72 | 0.22 |
| GLM 5.1 | 1.83 | 1.38 | 1.28 | 1.28 | 0.44 | 0.42 |
| Kimi K2.6 | 1.49 | 1.45 | 1.89 | 0.95 | 0.69 | 0.35 |
| Sonnet 4.6 | 1.10 | 1.63 | 1.28 | 1.00 | 0.88 | 0.20 |
| Haiku 4.5 | 2.22 | 0.92 | 2.07 | 1.43 | 0.98 | 0.62 |
| Qwen3.6 Plus | 2.05 | 2.23 | 3.07 | 1.92 | 1.43 | 0.55 |
Qwen's largest gap is Implementation, which is also where it most visibly over-scored larger feature PRs. GLM's biggest gap is Scope; Kimi's biggest gap is Implementation.
Results: Stability
Coefficient of variation across each model's independent runs. Lower is better.
| Model | Runs/PR | Avg CV (total_score) | Max CV | PRs with CV > 10% | Estimated CV of Averaged Score |
|---|---|---|---|---|---|
| Opus 4.6 | 3 | 4.6% | 20.2% | 2/20 | 2.6% |
| Opus 4.7 | 3 | 5.0% | 21.7% | 3/20 | 2.9% |
| Sonnet 4.6 | 3 | 8.7% | 35.4% | 7/20 | 5.0% |
| Qwen3.6 Plus | 3 | 10.0% | 28.3% | 8/20 | 5.8% |
| Kimi K2.6 | 6 | 13.5% | 32.4% | 14/20 | 5.5% |
| GLM 5.1 | 6 | 14.4% | 31.9% | 13/20 | 5.9% |
| Haiku 4.5 | 3 | 18.2% | 77.4% | 14/20 | 10.5% |
Kimi and GLM still have materially noisier individual calls than Sonnet or Opus. Averaging 6 calls narrows the expected noise of the final averaged score to roughly Sonnet's 3-run average, but it does not make individual OpenRouter calls as stable as Claude Opus.
Stability by PR Size
| Model | Tiny | Small | Medium | Large/XL |
|---|---|---|---|---|
| Kimi K2.6 | 6.8% | 13.1% | 21.1% | 9.8% |
| GLM 5.1 | 25.2% | 14.5% | 12.3% | 11.1% |
| Qwen3.6 Plus | 9.4% | 13.0% | 7.0% | 10.7% |
| Sonnet 4.6 | 17.7% | 9.4% | 9.2% | 2.9% |
The OpenRouter instability is not only a large-PR problem. Kimi is least stable on medium PRs in this corpus; GLM's high tiny-PR CV is inflated by very small score denominators, but it also remains noisier than Sonnet across small, medium, and large PRs.
Does 2x More Testing Stabilize Kimi/GLM?
This compares the first 3 runs against the final 6-run averages for Kimi and GLM.
| Model | Runs Averaged | MAD vs Opus 4.7 | Bias | r | Avg Raw CV | Est. CV of Averaged Score | Avg Cost per Averaged PR |
|---|---|---|---|---|---|---|---|
| Kimi K2.6 | 3 | 3.52 | +2.78 | 0.972 | 10.8% | 6.2% | $0.058 |
| Kimi K2.6 | 6 | 3.62 | +3.25 | 0.984 | 13.5% | 5.5% | $0.119 |
| GLM 5.1 | 3 | 3.47 | +0.30 | 0.970 | 11.0% | 6.3% | $0.056 |
| GLM 5.1 | 6 | 2.85 | +0.55 | 0.983 | 14.4% | 5.9% | $0.112 |
| Model | Mean Abs First3-vs-Last3 Gap | Max Gap | PRs > 5 pts | Mean First3-to-Six Shift |
|---|---|---|---|---|
| Kimi K2.6 | 2.74 | 12.33 | 4/20 | 1.37 |
| GLM 5.1 | 1.73 | 9.00 | 1/20 | 0.87 |
Answer: yes, averaging more calls helps the final averaged score, but it is not a full substitute for a more stable model. The extra runs improved correlation for both Kimi and GLM and improved GLM's MAD from 3.47 to 2.85. Kimi's MAD did not improve, because the extra runs exposed more upward bias on some medium and large PRs.
Cost-wise, 6x GLM costs about $0.112 per scored PR and 6x Kimi costs about $0.119. That is still about 64% cheaper than a 3-run Sonnet average ($0.312), but it is slightly more expensive than a single Sonnet call ($0.104). If production uses one Sonnet call, a 2-call Kimi/GLM average is cheaper but should be expected to remain noisier than Sonnet.
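For reference, both gap columns can be reproduced from the six per-PR runs; a small sketch follows (helper names are illustrative, not the runner's API).

```typescript
// Compare run subsets for one PR: first 3 runs, last 3 runs, and all 6.
// Illustrative only; the runner's own aggregation code may differ.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function runSubsetStats(runs: number[]) {
  const first3 = mean(runs.slice(0, 3));
  const last3 = mean(runs.slice(3, 6));
  const allSix = mean(runs);
  return {
    // "Mean Abs First3-vs-Last3 Gap" averages this across the 20 PRs.
    first3VsLast3Gap: Math.abs(first3 - last3),
    // "Mean First3-to-Six Shift": since the 6-run mean sits midway between
    // the two halves, this is half the gap above.
    first3ToSixShift: Math.abs(allSix - first3),
  };
}
```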
Results: By Size
Accuracy (MAD vs Opus 4.7) broken down by PR size category.
| Model | Tiny (<50) | Small (50-150) | Medium (150-500) | Large/XL (500+) |
|---|---|---|---|---|
| Opus 4.6 | 0.11 | 0.60 | 3.67 | 2.17 |
| GLM 5.1 | 0.50 | 2.57 | 3.33 | 3.78 |
| Kimi K2.6 | 0.06 | 1.51 | 5.94 | 4.83 |
| Sonnet 4.6 | 0.33 | 2.53 | 5.61 | 9.06 |
| Haiku 4.5 | 0.22 | 5.13 | 5.48 | 8.00 |
| Qwen3.6 Plus | 0.44 | 3.33 | 10.44 | 9.67 |
All models converge on tiny PRs. Kimi is strong on small PRs but drifts on medium PRs; GLM is the best non-Opus model on medium and large PRs in this corpus. Qwen is weakest on medium and large PRs because it substantially over-scores several feature/refactor PRs.
Results: By Language
Accuracy (MAD vs Opus 4.7) broken down by primary language.
| Model | Rust | TypeScript | Ruby/Rails |
|---|---|---|---|
| Opus 4.6 | 1.80 | 1.79 | 2.42 |
| GLM 5.1 | 4.60 | 2.36 | 2.00 |
| Kimi K2.6 | 3.13 | 3.81 | 3.71 |
| Sonnet 4.6 | 2.73 | 6.18 | 5.00 |
| Haiku 4.5 | 4.00 | 5.75 | 6.00 |
| Qwen3.6 Plus | 4.40 | 7.21 | 9.33 |
Opus 4.6 is the most language-stable. GLM is strongest on TypeScript and Ruby/Rails but weaker on Rust; Kimi is more balanced than Qwen but still trails GLM on TypeScript and Ruby/Rails. Qwen's Ruby/Rails and medium/large PR over-scoring make it a poor fit for this scoring task without calibration.
Recommendations
Opus 4.7 remains the gold standard. It has near-identical stability to Opus 4.6 and is the reference target for this benchmark.
Opus 4.6 remains the best substitute if the per-review cost gap matters. At MAD 1.92 and r=0.989 against Opus 4.7, with essentially zero bias, it is the closest proxy available.
GLM 5.1 is the best OpenRouter candidate in this run. With 6-run averaging it reaches MAD 2.85, r=0.983, and only +0.55 bias. The tradeoff is noisy individual calls: 14.4% raw average CV.
Kimi K2.6 is promising but needs bias handling. Six-run averaging improves correlation, but Kimi still has +3.25 bias and higher medium-PR instability than GLM.
Qwen3.6 Plus is not recommended for this scoring task as-is. It is the cheapest model tested, but it over-scores too many medium/large PRs and ranks last on MAD.
Averaging cheap OpenRouter calls is viable only if you use the averaged score. Six GLM calls are still cheaper than three Sonnet calls, but single GLM/Kimi calls are not stable enough to replace Sonnet or Opus directly.
Cost at Scale
Single-call cost projection:
| Volume | Qwen | GLM | Kimi | Haiku | Sonnet | Opus 4.6 | Opus 4.7 |
|---|---|---|---|---|---|---|---|
| 100 PRs/month | $1 | $2 | $2 | $3 | $10 | $12 | $16 |
| 1,000 PRs/month | $8 | $19 | $20 | $30 | $104 | $120 | $161 |
| 10,000 PRs/month | $81 | $187 | $198 | $301 | $1,040 | $1,199 | $1,605 |
Averaged-score cost projection:
| Volume | 6x GLM | 6x Kimi | 3x Sonnet |
|---|---|---|---|
| 100 PRs/month | $11 | $12 | $31 |
| 1,000 PRs/month | $112 | $119 | $312 |
| 10,000 PRs/month | $1,123 | $1,189 | $3,121 |
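Both projections are straight multiplication of per-call cost, calls per PR, and monthly volume; a small illustrative helper:

```typescript
// Monthly cost projection: per-call cost x calls per PR x PR volume.
function monthlyCost(costPerCall: number, callsPerPr: number, prsPerMonth: number): number {
  return costPerCall * callsPerPr * prsPerMonth;
}

// Examples matching the tables above:
// monthlyCost(0.0187, 6, 1_000) ≈ $112  (6x GLM)
// monthlyCost(0.104, 3, 1_000)  ≈ $312  (3x Sonnet)
```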
Appendix: Raw Data
Full results are in docs/benchmark-results.json (540 rows: 20 PRs × five models at 3 runs plus two models at 6 runs). The saved config includes modelRuns to record the per-model run counts.
All scores were collected via backend/scripts/run-model-benchmark.ts. Haiku 4.5 and Opus 4.6 use temperature: 0 without extended thinking, matching the production code path in claude.service.ts. Opus 4.7 omits the temperature parameter entirely, since that model rejects both the legacy thinking.type.enabled request shape and the temperature field. Sonnet 4.6 opts in to extended thinking with a 12,384-token budget. OpenRouter candidate runs use the OpenRouter chat-completions endpoint with the same user prompt, temperature: 0, max_tokens: 16384 by default, and OPENROUTER_REASONING_EFFORT=none.
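As a rough illustration of those per-model parameter differences (a sketch, not the actual claude.service.ts code; the max_tokens value here is an assumption, and the extended-thinking shape shown is the currently documented Sonnet form):

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Sketch only; the production path is claude.service.ts and
// run-model-benchmark.ts. max_tokens is illustrative.
const client = new Anthropic();

async function scoreWithClaude(model: string, prompt: string) {
  const base = {
    model,
    max_tokens: 16384, // illustrative cap, not taken from the benchmark config
    messages: [{ role: "user" as const, content: prompt }],
  };

  if (model.startsWith("claude-sonnet-4-6")) {
    // Sonnet 4.6: extended thinking with a 12,384-token budget.
    return client.messages.create({
      ...base,
      thinking: { type: "enabled", budget_tokens: 12384 },
    });
  }
  if (model.startsWith("claude-opus-4-7")) {
    // Opus 4.7: temperature omitted entirely (the model rejects it).
    return client.messages.create(base);
  }
  // Haiku 4.5 / Opus 4.6: temperature 0, no extended thinking.
  return client.messages.create({ ...base, temperature: 0 });
}
```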
Benchmark run dates: Haiku + Sonnet 2026-03-04, Opus 4.6 + Opus 4.7 2026-04-16, Kimi K2.6 + GLM 5.1 + Qwen3.6 Plus via OpenRouter on 2026-04-25.