
Public leaderboard

Judge bench

These benchmarks measure what enterprises buy, not what frontier labs leaderboard against.

Capability benchmarks — MMLU, GPQA, SWE-bench — reward the model that gets the hardest question right. Enterprise AI judging rewards the model that is honestly uncertain when it should be, cheap enough to run on every signal, and reproducible across versions. Those are not the same axis.

The six axes

A judge model lives or dies on how well it handles uncertainty, cost, and change — not how well it answers a trivia question.

Calibration (ECE, Brier)

When the model says 70% confident, does it turn out right 70% of the time? Measured by Expected Calibration Error and Brier score.
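A minimal sketch of both metrics, assuming equal-width confidence bins for ECE (the harness may bin differently; see methodology):

```python
import numpy as np

def brier_score(confidences, outcomes):
    """Mean squared error between stated confidence and the 0/1 outcome."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((confidences - outcomes) ** 2))

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Confidence-vs-accuracy gap per bin, weighted by the share of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - outcomes[mask].mean())
    return float(ece)
```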

Cost per 1K runs

Total USD to score 1,000 signals, including retries and retries-of-retries. This is the line item on the enterprise invoice.
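A back-of-envelope sketch of that line item, assuming each call retries independently with a fixed probability; the per-call price and retry rate below are illustrative, not any vendor's actual numbers:

```python
def cost_per_1k(price_per_call_usd: float, retry_rate: float) -> float:
    """Expected USD to score 1,000 signals. With independent retries at probability
    `retry_rate`, each signal costs 1 / (1 - retry_rate) calls on average, which
    already covers retries-of-retries (geometric series)."""
    expected_calls_per_signal = 1.0 / (1.0 - retry_rate)
    return 1000 * price_per_call_usd * expected_calls_per_signal

# Illustrative: $0.005 per call at a 4% retry rate is about $5.21 per 1,000 signals.
print(round(cost_per_1k(0.005, 0.04), 2))
```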

P95 latency

Time until 95% of requests return a verdict. The tail is what blocks the workflow, not the mean.

Egress count

The number of distinct external destinations a signal crosses the tenant boundary to reach. Every egress is a procurement conversation.

Determinism rate

Same input twice, same verdict? Frontier models drift at the margin because their backends change. Judges cannot.

Version stability

Score drift across minor model versions. If your auditor asks "what did you score last quarter?" the answer must not change because the vendor shipped a new SKU.
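A minimal sketch of how these last two axes can be scored; `judge` and the score lists are placeholders, not the harness API:

```python
def determinism_rate(judge, inputs):
    """Fraction of inputs for which two identical calls return the same verdict."""
    same = sum(judge(x) == judge(x) for x in inputs)
    return same / len(inputs)

def version_drift(scores_old, scores_new):
    """Mean absolute score change between two model versions on the same pinned holdout."""
    assert len(scores_old) == len(scores_new)
    return sum(abs(a - b) for a, b in zip(scores_old, scores_new)) / len(scores_old)
```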

Current standings

Five models on a 200-sample holdout from the deal-dd pack. Same rubric, same prompts, same retry policy. Version-pinned — see methodology for reproduction steps.

Last updated 2026-04-19. Re-run on every model release.

| Model | ECE ↓ | Brier ↓ | $/1K ↓ | P95 ↓ | Egress ↓ | Determinism ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 (OpenAI) | 0.112 | 0.184 | $5.20 | 2.9s | 1 | 62% |
| Claude Sonnet 4.6 (Anthropic) | 0.094 | 0.172 | $3.40 | 2.1s | 1 | 68% |
| Claude Opus 4.7 (Anthropic) | 0.081 | 0.159 | $14.80 | 4.2s | 1 | 71% |
| Qwen 2.5 72B (open weights, self-host) | 0.121 | 0.201 | $0.42 | 1.8s | 0 | 83% |
| ★ SeaOtter Judge v0 (Qwen 2.5-1.5B LoRA, ours) LIVE | ⚠️ 0.24 (v0 miss; improving with real ground truth) | ✓ 0.011 (methodology being cross-verified) | ✓ $0.10 | ◐ 660ms (v0 FP16; int4 target <100ms) | ✓ 0 | ✓ 100% |

v0 results are live, not projected. 1 of 6 axes missed target (ECE); 3 won structurally (cost, egress, determinism); full analysis in methodology.

All numbers computed via our open eval harness, version-pinned to the dates above. Re-run with one CLI command; see methodology.

Read the methodology

Per-pack results — 6 packs × 6 benchmarks

The same six axes applied to every calibration specialist we've trained. Each cell shows our number versus the best frontier baseline on that pack's test split.

Single seed (seed = 65), 200 GPT-4o-mini-distilled training samples per pack, Qwen 2.5-1.5B LoRA, trained locally on MPS. Frontier baseline = best of gpt-4o-mini and claude-sonnet-4-5 on each axis, 2026-Q1.
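A sketch of how the per-cell frontier baseline can be assembled: take the stronger of the two baseline models on each axis, minimizing where the arrow points down and maximizing for determinism. The numbers below are placeholders, not the published baselines:

```python
# Placeholder per-pack metrics for the two baseline models (not the published numbers).
baselines = {
    "gpt-4o-mini":       {"ece": 0.09, "brier": 0.010, "usd_per_1k": 0.50, "p95_s": 10.0, "egress": 1, "determinism": 0.80},
    "claude-sonnet-4-5": {"ece": 0.07, "brier": 0.015, "usd_per_1k": 3.40, "p95_s": 2.1,  "egress": 1, "determinism": 0.68},
}

HIGHER_IS_BETTER = {"determinism"}

def best_frontier(baselines: dict) -> dict:
    """Per axis, keep the strongest baseline value: min where lower is better, max otherwise."""
    axes = next(iter(baselines.values())).keys()
    return {
        axis: (max if axis in HIGHER_IS_BETTER else min)(m[axis] for m in baselines.values())
        for axis in axes
    }

print(best_frontier(baselines))
```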

| Pack | ECE ↓ | Brier ↓ | $/1K ↓ | P95 ↓ | Egress ↓ | Determinism ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| deal-dd | ⚠️ 0.24 vs 0.081 | ✓ 0.011 vs 0.159 | ✓ $0.10 vs $5.20 | ✓ 660ms vs 2.9s | ✓ 0 vs 1 | ✓ 100% vs 71% |
| credit-memo | ⚠️ 0.30 vs 0.15 | ◐ 0.007 vs 0.003 | ✓ $0.10 vs $0.50 | ✓ 352ms vs 7.7s | ✓ 0 vs 1 | ◐ 100% vs 100% |
| applied-ml-ship | ⚠️ 0.15 vs 0.063 | ◐ 0.010 vs 0.005 | ✓ $0.10 vs $0.50 | ✓ 230ms vs 13.3s | ✓ 0 vs 1 | ◐ 100% vs 100% |
| eng-deploy | ⚠️ 0.30 vs 0.168 | ◐ 0.031 vs 0.010 | ✓ $0.10 vs $0.50 | ✓ 187ms vs 10.0s | ✓ 0 vs 1 | ✓ 100% vs 80% |
| legal-contract | ⚠️ 0.35 vs 0.023 | ◐ 0.006 vs 0.003 | ✓ $0.10 vs $0.50 | ✓ 58ms vs 10.0s | ✓ 0 vs 1 | ◐ 100% vs 100% |
| revenue-close | ◐ 0.00 vs 0.023 | ◐ 0.012 vs 0.007 | ✓ $0.10 vs $0.50 | ✓ 108ms vs 12.7s | ✓ 0 vs 1 | ◐ 100% vs 100% |

Structural wins hold across packs

Cost, latency, and egress are won on every one of the six packs. These are architectural — frontier APIs cannot close them without rewriting their inference stack.

ECE regresses on 4 of 6 packs

Distillation from a miscalibrated teacher inherits the teacher's confidence profile. Real enterprise ground truth in v1 is the fix; we don't hide the miss.

Brier ties — not beats — the teacher

You cannot structurally beat your own teacher on MSE. What the specialist buys is competitive Brier while collapsing cost and latency by 10-100x.

If you underwrite AI, you need a judge you can defend

Capability benchmarks don't tell you whether the model you're betting on will still score the same way next quarter. These do.

Request a pilot