AI Model Scoring Benchmark

Purpose & Methodology

GitVelocity uses Claude to score PRs on a 0-100 scale across 6 sub-categories. This benchmark compares Opus 4.6, Sonnet 4.6, Haiku 4.5, Kimi K2.6, GLM 5.1, and Qwen3.6 Plus against the Opus 4.7 gold-standard baseline on three dimensions:

  1. Cost - Token usage and USD cost per review
  2. Accuracy - Score deviation from the Opus 4.7 baseline
  3. Stability - Variance across independent runs per model

Every model scored the same 20-PR corpus using buildClaudePrompt() with the default guideline. Haiku, Sonnet, Opus, and Qwen are scored with 3 independent runs per PR. Kimi K2.6 and GLM 5.1 are scored with 6 independent runs per PR to test whether cheaper OpenRouter models can be stabilized by averaging more calls. All results come from fresh benchmark runs with the same prompt inputs and the same scoring schema; no production-database scores are used as a baseline.

The benchmark runner supports OpenRouter-hosted candidate models. When testing non-Claude models, use the same 20-PR corpus, buildClaudePrompt() prompt, JSON extraction path, and Opus 4.7 baseline. Do not compare a single OpenRouter run against the three-run Claude averages.
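The JSON extraction path itself is not reproduced in this document. As a minimal sketch of what it implies, assuming the runner slices the first {...} object out of the raw reply (the function name and fallback behavior here are illustrative, not the runner's actual code):

  // Illustrative only: pull the first JSON object out of a reply that may
  // wrap it in prose; the real extraction path may differ.
  function extractScoreJson(reply: string): Record<string, unknown> | null {
    const start = reply.indexOf("{");
    const end = reply.lastIndexOf("}");
    if (start === -1 || end <= start) return null;
    try {
      return JSON.parse(reply.slice(start, end + 1)) as Record<string, unknown>;
    } catch {
      return null; // treat an unparseable reply as a failed run
    }
  }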

Scoring

  • Total score (0-100): Composite of 6 sub-scores
  • Sub-scores: Scope, Architecture, Implementation, Risk, Quality, Perf/Security
  • Effort Scale Factor: Applied per the guideline based on PR size

Statistical Measures

  • Mean Absolute Deviation (MAD): Average |model_avg - opus_4_7_avg| across PRs
  • Stability (CV): Coefficient of variation across a model's independent runs (stddev / mean)
  • Estimated CV of averaged score: raw CV / sqrt(run_count), used to estimate the stability of an averaged final score; see the sketch after this list
  • Correlation (r): Pearson correlation of per-PR average model scores vs Opus 4.7 averages
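A minimal TypeScript sketch of these four measures, assuming scores arrive as per-PR arrays of per-run values (the helper names and input shape are illustrative, not the benchmark runner's API):

  // Illustrative helpers for the measures above; not the runner's real API.
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const stddev = (xs: number[]) => {
    const m = mean(xs);
    return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
  };

  // MAD: average |model_avg - baseline_avg| across PRs.
  const mad = (model: number[][], baseline: number[][]) =>
    mean(model.map((runs, i) => Math.abs(mean(runs) - mean(baseline[i]))));

  // Raw CV: stddev / mean across one PR's runs, averaged over all PRs.
  const avgCv = (model: number[][]) =>
    mean(model.map((runs) => stddev(runs) / mean(runs)));

  // Estimated CV of an averaged score: raw CV shrinks by sqrt(run_count).
  const estAveragedCv = (rawCv: number, runCount: number) =>
    rawCv / Math.sqrt(runCount);

  // Pearson r between per-PR model averages and baseline averages.
  const pearson = (xs: number[], ys: number[]) => {
    const mx = mean(xs);
    const my = mean(ys);
    const cov = mean(xs.map((x, i) => (x - mx) * (ys[i] - my)));
    return cov / (stddev(xs) * stddev(ys));
  };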

Models Under Test

| Model | Provider | Input $/1M | Output $/1M | Runs/PR | Notes |
|---|---|---|---|---|---|
| claude-haiku-4-5-20251001 | Anthropic | $1.00 | $5.00 | 3 | Fastest Anthropic model |
| claude-sonnet-4-6 | Anthropic | $3.00 | $15.00 | 3 | Mid-tier Claude |
| claude-opus-4-6 | Anthropic | $5.00 | $25.00 | 3 | Previous gold standard |
| claude-opus-4-7 | Anthropic | $5.00 | $25.00 | 3 | Gold standard baseline |
| moonshotai/kimi-k2.6 | OpenRouter | $0.7448 | $4.655 | 6 | Candidate run with reasoning disabled |
| z-ai/glm-5.1 | OpenRouter | $1.05 | $3.50 | 6 | Candidate run with reasoning disabled |
| qwen/qwen3.6-plus | OpenRouter | $0.325 | $1.95 | 3 | Candidate run with reasoning disabled |

Running OpenRouter Candidates

Run the 3-run candidate pass with:

OPENROUTER_API_KEY=sk-or-... OPENROUTER_REASONING_EFFORT=none \
  npx ts-node backend/scripts/run-model-benchmark.ts \
  --config docs/benchmark-config.json \
  --models openrouter/kimi-k2.6,openrouter/glm-5.1,openrouter/qwen3.6-plus \
  --baseline claude-opus-4-7 \
  --resume

Run the 6-run Kimi/GLM averaging pass with:

OPENROUTER_API_KEY=sk-or-... OPENROUTER_REASONING_EFFORT=none \
  npx ts-node backend/scripts/run-model-benchmark.ts \
  --config docs/benchmark-config.json \
  --models openrouter/kimi-k2.6,openrouter/glm-5.1 \
  --baseline claude-opus-4-7 \
  --runs 6 \
  --resume

The aliases resolve to the OpenRouter model slugs moonshotai/kimi-k2.6, z-ai/glm-5.1, and qwen/qwen3.6-plus. Prices above are from OpenRouter model metadata for those three models, checked on 2026-04-25. OpenRouter's reasoning-token documentation lists reasoning.effort = "none" as disabling reasoning; that is the setting used for these candidate runs.
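In code terms the alias mapping amounts to something like the following sketch (the runner's actual lookup table may be structured differently):

  // Alias-to-slug mapping described above; structure is illustrative.
  const OPENROUTER_ALIASES: Record<string, string> = {
    "openrouter/kimi-k2.6": "moonshotai/kimi-k2.6",
    "openrouter/glm-5.1": "z-ai/glm-5.1",
    "openrouter/qwen3.6-plus": "qwen/qwen3.6-plus",
  };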

For a cheap live smoke test before a full run, add --pr-ids 1 --runs 1 --output /tmp/gitvelocity-openrouter-smoke.json.
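For example, exercising a single candidate model end to end:

OPENROUTER_API_KEY=sk-or-... OPENROUTER_REASONING_EFFORT=none \
  npx ts-node backend/scripts/run-model-benchmark.ts \
  --config docs/benchmark-config.json \
  --models openrouter/glm-5.1 \
  --baseline claude-opus-4-7 \
  --pr-ids 1 --runs 1 \
  --output /tmp/gitvelocity-openrouter-smoke.json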

deepseek/deepseek-v4-pro pricing and alias support are present in the runner, but DeepSeek is intentionally excluded from this saved result set: repeated provider 429 rate-limit errors led to the decision to skip it.

Test Corpus

20 PRs selected for diversity across size, language, and complexity.

san-francisco (Rust) - 5 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 1 | #537 "Ensure attio keys are properly pruned" | +1/-1 | 1 | Tiny fix |
| 2 | #548 "seed-investor-pedigree: add fields into ES" | +60/-14 | 3 | Small feature |
| 3 | #549 "seed-investor-pedigree: compute company's seed" | +314/-24 | 2 | Medium feature |
| 4 | #542 "Return structured JSON response from LLM list column API" | +138/-4 | 6 | Medium refactor |
| 5 | #545 "real time eva-list updates for people index" | +1020/-2 | 13 | Large feature |

gitvelocity (TypeScript/React/NestJS) - 5 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 6 | #174 "Increase review processor concurrency" | +2/-2 | 2 | Tiny config |
| 7 | #169 "Add 404 not found page" | +97/-0 | 2 | Small feature |
| 8 | #175 "Add integration branch support for backfill" | +152/-54 | 9 | Medium feature |
| 9 | #176 "Add Settings > Usage page" | +923/-162 | 14 | Large feature |
| 10 | #177 "Add backfill history view with per-PR tracking" | +1447/-47 | 18 | XL feature |

gmail-integration (TypeScript) - 3 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 11 | #257 "Protect timestamptz casts from overflow" | +36/-4 | 2 | Small fix |
| 12 | #261 "Prevent S3 orphan cleanup race condition" | +667/-111 | 10 | Large fix |
| 13 | #260 "Move backfill logic to Sidekiq background" | +436/-136 | 3 | Medium refactor |

skynet (TypeScript) - 3 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 14 | #302 "Humanize outreach email prompt" | +18/-5 | 1 | Small prompt |
| 15 | #298 "Fix ObservationalMemory threadId crash" | +59/-12 | 2 | Small fix |
| 16 | #295 "Add interactive checkpoint tools" | +999/-5 | 18 | Large feature |

eva-web (Rails/React) - 4 PRs

| # | PR | Lines | Files | Category |
|---|---|---|---|---|
| 17 | #4364 "Fix debug page text selection" | +119/-4 | 1 | Small fix |
| 18 | #4369 "Fix ActiveRecord connection pool leaks" | +58/-37 | 6 | Medium fix |
| 19 | #4374 "Eliminate persistent MCP SSE heartbeat" | +56/-208 | 5 | Medium refactor |
| 20 | #4365 "Add get_company_people MCP tool" | +714/-1 | 4 | Large feature |

Coverage:

  • Sizes: 3 tiny (<50 lines), 5 small (50-150), 6 medium (150-500), 6 large/XL (500+)
  • Languages: Rust (5), TypeScript (11), Ruby/Rails (4)
  • Types: features (10), fixes (6), refactors (3), config (1)

Results: Cost

| Model | Calls | Avg Input Tokens | Avg Output Tokens | Avg Cost/Call | Total Benchmark Cost |
|---|---|---|---|---|---|
| Qwen3.6 Plus | 60 | 13,568 | 1,877 | $0.008 | $0.48 |
| GLM 5.1 | 120 | 12,523 | 1,590 | $0.019 | $2.25 |
| Kimi K2.6 | 120 | 12,491 | 2,259 | $0.020 | $2.38 |
| Haiku 4.5 | 60 | 14,800 | 3,056 | $0.030 | $1.80 |
| Sonnet 4.6 | 60 | 14,829 | 3,970 | $0.104 | $6.24 |
| Opus 4.6 | 60 | 14,951 | 1,806 | $0.120 | $7.19 |
| Opus 4.7 | 60 | 20,341 | 2,350 | $0.161 | $9.63 |

Qwen is the cheapest model tested, at roughly one-twentieth of Opus 4.7's per-call cost; GLM and Kimi come in at roughly one-eighth. Because Kimi and GLM were run 6 times per PR, their total benchmark cost is higher than the 3-run Haiku total but still far below the Sonnet and Opus runs.
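As a worked check on the per-call figures, cost per call is just token counts times per-million prices (the helper below is illustrative; the token counts and prices come from the tables above):

  // Cost per call = input_tokens * input_$/1M + output_tokens * output_$/1M.
  const costPerCall = (inTok: number, outTok: number, inPerM: number, outPerM: number) =>
    (inTok * inPerM + outTok * outPerM) / 1_000_000;

  costPerCall(20_341, 2_350, 5.0, 25.0);   // Opus 4.7: ~$0.161
  costPerCall(13_568, 1_877, 0.325, 1.95); // Qwen3.6 Plus: ~$0.008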

Results: Accuracy

Deviation from Opus 4.7 (lower is better). Per-PR model scores are averaged across each model's configured runs.

| Model | Runs/PR | Mean Total Score | MAD vs Opus 4.7 | Max Deviation | Bias | Correlation (r) |
|---|---|---|---|---|---|---|
| Opus 4.7 | 3 | 20.8 | 0 (baseline) | 0.00 | 0.00 | 1.000 |
| Opus 4.6 | 3 | 20.8 | 1.92 | 8.00 | +0.02 | 0.989 |
| GLM 5.1 | 6 | 21.4 | 2.85 | 8.00 | +0.55 | 0.983 |
| Kimi K2.6 | 6 | 24.1 | 3.62 | 10.83 | +3.25 | 0.984 |
| Sonnet 4.6 | 3 | 25.4 | 5.08 | 22.00 | +4.58 | 0.979 |
| Haiku 4.5 | 3 | 24.6 | 5.36 | 16.33 | +3.76 | 0.949 |
| Qwen3.6 Plus | 3 | 27.7 | 6.93 | 30.67 | +6.90 | 0.946 |

Opus 4.6 tracks Opus 4.7 most closely: near-zero bias, MAD under 2 points, and the strongest correlation of the non-baseline models. GLM 5.1 is the best OpenRouter candidate in this result set. With 6-run averaging it beats Sonnet and Haiku on MAD by a clear margin. Qwen3.6 Plus is extremely cheap but over-scores this corpus and ranks last on accuracy.

Per-PR Score Comparison

Each cell is the mean across that model's configured runs.

| PR# | Size | Lang | Opus 4.7 | Opus 4.6 | GLM 5.1 | Kimi K2.6 | Sonnet 4.6 | Haiku 4.5 | Qwen3.6 Plus |
|---|---|---|---|---|---|---|---|---|---|
| 1 | tiny | Rust | 1.0 | 1.0 | 1.2 | 1.0 | 1.3 | 1.0 | 1.0 |
| 2 | small | Rust | 8.7 | 10.7 | 13.0 | 10.2 | 12.3 | 11.3 | 14.0 |
| 3 | medium | Rust | 23.7 | 27.0 | 29.2 | 30.0 | 28.7 | 26.3 | 33.3 |
| 4 | medium | Rust | 10.3 | 12.0 | 15.3 | 17.7 | 13.0 | 18.3 | 13.3 |
| 5 | large | Rust | 58.7 | 56.7 | 50.7 | 59.2 | 60.7 | 52.0 | 62.7 |
| 6 | tiny | TS | 1.0 | 1.0 | 1.2 | 1.0 | 1.0 | 1.0 | 2.0 |
| 7 | small | TS | 6.0 | 6.0 | 5.8 | 6.7 | 7.3 | 7.0 | 6.7 |
| 8 | medium | TS | 15.7 | 18.3 | 15.0 | 19.0 | 20.0 | 24.5 | 17.0 |
| 9 | large | TS | 36.7 | 34.7 | 37.2 | 41.3 | 49.3 | 45.7 | 42.3 |
| 10 | xl | TS | 51.7 | 48.0 | 49.7 | 48.5 | 60.3 | 56.3 | 53.3 |
| 11 | small | TS | 6.0 | 6.3 | 7.5 | 6.5 | 7.0 | 7.0 | 7.3 |
| 12 | large | TS | 43.0 | 41.0 | 37.8 | 48.8 | 48.0 | 51.0 | 52.3 |
| 13 | medium | TS | 28.3 | 36.3 | 35.8 | 39.2 | 37.0 | 33.3 | 59.0 |
| 14 | tiny | TS | 2.0 | 2.3 | 3.2 | 1.8 | 2.7 | 1.3 | 1.7 |
| 15 | small | TS | 6.7 | 7.0 | 9.0 | 11.2 | 10.3 | 23.0 | 8.7 |
| 16 | large | TS | 47.0 | 47.3 | 42.2 | 55.2 | 69.0 | 38.3 | 72.3 |
| 17 | small | Ruby | 5.3 | 5.7 | 9.8 | 5.0 | 8.3 | 10.0 | 12.7 |
| 18 | medium | Ruby | 12.0 | 7.3 | 12.2 | 15.0 | 7.0 | 19.3 | 17.0 |
| 19 | medium | Ruby | 18.7 | 17.0 | 19.8 | 23.5 | 26.7 | 19.7 | 31.7 |
| 20 | large | Ruby | 34.0 | 31.0 | 31.8 | 40.7 | 38.0 | 45.0 | 46.0 |

Per-Sub-Score Accuracy (MAD vs Opus 4.7)

| Model | Scope | Architecture | Implementation | Risk | Quality | Perf/Security |
|---|---|---|---|---|---|---|
| Opus 4.6 | 0.58 | 0.55 | 0.57 | 0.87 | 0.72 | 0.22 |
| GLM 5.1 | 1.83 | 1.38 | 1.28 | 1.28 | 0.44 | 0.42 |
| Kimi K2.6 | 1.49 | 1.45 | 1.89 | 0.95 | 0.69 | 0.35 |
| Sonnet 4.6 | 1.10 | 1.63 | 1.28 | 1.00 | 0.88 | 0.20 |
| Haiku 4.5 | 2.22 | 0.92 | 2.07 | 1.43 | 0.98 | 0.62 |
| Qwen3.6 Plus | 2.05 | 2.23 | 3.07 | 1.92 | 1.43 | 0.55 |

Qwen's largest gap is Implementation, which is also where it most visibly over-scored larger feature PRs. GLM's biggest gap is Scope; Kimi's biggest gap is Implementation.

Results: Stability

Coefficient of variation across each model's independent runs. Lower is better.

| Model | Runs/PR | Avg CV (total_score) | Max CV | PRs with CV > 10% | Estimated CV of Averaged Score |
|---|---|---|---|---|---|
| Opus 4.6 | 3 | 4.6% | 20.2% | 2/20 | 2.6% |
| Opus 4.7 | 3 | 5.0% | 21.7% | 3/20 | 2.9% |
| Sonnet 4.6 | 3 | 8.7% | 35.4% | 7/20 | 5.0% |
| Qwen3.6 Plus | 3 | 10.0% | 28.3% | 8/20 | 5.8% |
| Kimi K2.6 | 6 | 13.5% | 32.4% | 14/20 | 5.5% |
| GLM 5.1 | 6 | 14.4% | 31.9% | 13/20 | 5.9% |
| Haiku 4.5 | 3 | 18.2% | 77.4% | 14/20 | 10.5% |

Kimi and GLM still have materially noisier individual calls than Sonnet or Opus. Averaging 6 calls narrows the expected noise of the final averaged score to roughly Sonnet's 3-run average (GLM: 14.4% / sqrt(6) ≈ 5.9%, versus Sonnet's 8.7% / sqrt(3) ≈ 5.0%), but it does not make individual OpenRouter calls as stable as Claude Opus.

Stability by PR Size

| Model | Tiny | Small | Medium | Large/XL |
|---|---|---|---|---|
| Kimi K2.6 | 6.8% | 13.1% | 21.1% | 9.8% |
| GLM 5.1 | 25.2% | 14.5% | 12.3% | 11.1% |
| Qwen3.6 Plus | 9.4% | 13.0% | 7.0% | 10.7% |
| Sonnet 4.6 | 17.7% | 9.4% | 9.2% | 2.9% |

The OpenRouter instability is not only a large-PR problem. Kimi is least stable on medium PRs in this corpus; GLM's high tiny-PR CV is inflated by very small score denominators, but it also remains noisier than Sonnet across small, medium, and large PRs.

Does 2x More Testing Stabilize Kimi/GLM?

This compares the first 3 runs against the final 6-run averages for Kimi and GLM.

| Model | Runs Averaged | MAD vs Opus 4.7 | Bias | r | Avg Raw CV | Est. CV of Averaged Score | Avg Cost per Averaged PR |
|---|---|---|---|---|---|---|---|
| Kimi K2.6 | 3 | 3.52 | +2.78 | 0.972 | 10.8% | 6.2% | $0.058 |
| Kimi K2.6 | 6 | 3.62 | +3.25 | 0.984 | 13.5% | 5.5% | $0.119 |
| GLM 5.1 | 3 | 3.47 | +0.30 | 0.970 | 11.0% | 6.3% | $0.056 |
| GLM 5.1 | 6 | 2.85 | +0.55 | 0.983 | 14.4% | 5.9% | $0.112 |

Per-PR drift between the first-3 and last-3 run averages:

| Model | Mean Abs First3-vs-Last3 Gap | Max Gap | PRs > 5 pts | Mean First3-to-Six Shift |
|---|---|---|---|---|
| Kimi K2.6 | 2.74 | 12.33 | 4/20 | 1.37 |
| GLM 5.1 | 1.73 | 9.00 | 1/20 | 0.87 |

Answer: yes, averaging more calls helps the final averaged score, but it is not a full substitute for a more stable model. The extra runs improved correlation for both Kimi and GLM and improved GLM's MAD from 3.47 to 2.85. Kimi's MAD did not improve, because the extra runs exposed more upward bias on some medium and large PRs.

Cost-wise, 6x GLM costs about $0.112 per scored PR and 6x Kimi costs about $0.119. That is still about 64% cheaper than a 3-run Sonnet average ($0.312), but it is slightly more expensive than a single Sonnet call ($0.104). If production uses one Sonnet call, a 2-call Kimi/GLM average is cheaper but should be expected to remain noisier than Sonnet.

Results: By Size

Accuracy (MAD vs Opus 4.7) broken down by PR size category.

| Model | Tiny (<50) | Small (50-150) | Medium (150-500) | Large/XL (500+) |
|---|---|---|---|---|
| Opus 4.6 | 0.11 | 0.60 | 3.67 | 2.17 |
| GLM 5.1 | 0.50 | 2.57 | 3.33 | 3.78 |
| Kimi K2.6 | 0.06 | 1.51 | 5.94 | 4.83 |
| Sonnet 4.6 | 0.33 | 2.53 | 5.61 | 9.06 |
| Haiku 4.5 | 0.22 | 5.13 | 5.48 | 8.00 |
| Qwen3.6 Plus | 0.44 | 3.33 | 10.44 | 9.67 |

All models converge on tiny PRs. Kimi is strong on small PRs but drifts on medium PRs; GLM is the best non-Opus model on medium and large PRs in this corpus. Qwen is weakest on medium and large PRs because it substantially over-scores several feature/refactor PRs.

Results: By Language

Accuracy (MAD vs Opus 4.7) broken down by primary language.

| Model | Rust | TypeScript | Ruby/Rails |
|---|---|---|---|
| Opus 4.6 | 1.80 | 1.79 | 2.42 |
| GLM 5.1 | 4.60 | 2.36 | 2.00 |
| Kimi K2.6 | 3.13 | 3.81 | 3.71 |
| Sonnet 4.6 | 2.73 | 6.18 | 5.00 |
| Haiku 4.5 | 4.00 | 5.75 | 6.00 |
| Qwen3.6 Plus | 4.40 | 7.21 | 9.33 |

Opus 4.6 is the most language-stable. GLM is strongest on TypeScript and Ruby/Rails but weaker on Rust; Kimi is more balanced than Qwen but still trails GLM on TypeScript and Ruby/Rails. Qwen's Ruby/Rails and medium/large PR over-scoring make it a poor fit for this scoring task without calibration.

Recommendations

  • Opus 4.7 remains the gold standard. It has near-identical stability to Opus 4.6 and is the reference target for this benchmark.

  • Opus 4.6 remains the best substitute if the per-review cost gap matters. At MAD 1.92 and r=0.989 against Opus 4.7, with essentially zero bias, it is the closest proxy available.

  • GLM 5.1 is the best OpenRouter candidate in this run. With 6-run averaging it reaches MAD 2.85, r=0.983, and only +0.55 bias. The tradeoff is noisy individual calls: 14.4% raw average CV.

  • Kimi K2.6 is promising but needs bias handling. Six-run averaging improves correlation, but Kimi still has +3.25 bias and higher medium-PR instability than GLM.

  • Qwen3.6 Plus is not recommended for this scoring task as-is. It is the cheapest model tested, but it over-scores too many medium/large PRs and ranks last on MAD.

  • Averaging cheap OpenRouter calls is viable only if you use the averaged score. Six GLM calls are still cheaper than three Sonnet calls, but single GLM/Kimi calls are not stable enough to replace Sonnet or Opus directly.

Cost at Scale

Single-call cost projection:

| Volume | Qwen | GLM | Kimi | Haiku | Sonnet | Opus 4.6 | Opus 4.7 |
|---|---|---|---|---|---|---|---|
| 100 PRs/month | $1 | $2 | $2 | $3 | $10 | $12 | $16 |
| 1,000 PRs/month | $8 | $19 | $20 | $30 | $104 | $120 | $161 |
| 10,000 PRs/month | $81 | $187 | $198 | $301 | $1,040 | $1,199 | $1,605 |

Averaged-score cost projection:

| Volume | 6x GLM | 6x Kimi | 3x Sonnet |
|---|---|---|---|
| 100 PRs/month | $11 | $12 | $31 |
| 1,000 PRs/month | $112 | $119 | $312 |
| 10,000 PRs/month | $1,123 | $1,189 | $3,121 |

Appendix: Raw Data

Full results are in docs/benchmark-results.json: 540 rows covering 20 PRs, i.e. 20 x (5 models x 3 runs + 2 models x 6 runs). The saved config includes modelRuns to record the per-model run counts.
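A rough sketch of the config shape that implies; only modelRuns is confirmed by this document, and the other field names are assumptions:

  // Hypothetical shape of docs/benchmark-config.json. Only modelRuns is
  // confirmed here; the other field names are illustrative assumptions.
  interface BenchmarkConfig {
    baseline: string;                  // e.g. "claude-opus-4-7"
    prIds: number[];                   // the 20-PR corpus
    modelRuns: Record<string, number>; // e.g. { "z-ai/glm-5.1": 6 }
  }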

All scores were collected via backend/scripts/run-model-benchmark.ts. Haiku 4.5 and Opus 4.6 use temperature: 0 without extended thinking, matching the production code path in claude.service.ts. Opus 4.7 omits the temperature parameter entirely because it rejects both the legacy thinking.type.enabled shape and the temperature field. Sonnet 4.6 opts in to extended thinking with a 12,384-token budget. OpenRouter candidate runs use the OpenRouter chat completions endpoint with the same user prompt, temperature: 0, max_tokens: 16384 by default, and OPENROUTER_REASONING_EFFORT=none.
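A minimal sketch of the OpenRouter request those settings imply, using OpenRouter's chat completions endpoint; the exact body run-model-benchmark.ts sends is not shown here, so treat field placement as an assumption:

  // Sketch of one OpenRouter scoring call with reasoning disabled. It mirrors
  // the settings described above but is not a copy of run-model-benchmark.ts.
  const prompt = "<output of buildClaudePrompt() for one PR>"; // placeholder
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "z-ai/glm-5.1",
      messages: [{ role: "user", content: prompt }],
      temperature: 0,
      max_tokens: 16384,
      reasoning: { effort: "none" }, // what OPENROUTER_REASONING_EFFORT=none maps to
    }),
  });
  const data: any = await res.json();
  const reply: string = data.choices[0].message.content;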

Benchmark run dates: Haiku and Sonnet on 2026-03-04; Opus 4.6 and Opus 4.7 on 2026-04-16; Kimi K2.6, GLM 5.1, and Qwen3.6 Plus (via OpenRouter) on 2026-04-25.