
Public leaderboard

Judge bench

These benchmarks measure what enterprises buy, not what frontier labs leaderboard against.

Capability benchmarks — MMLU, GPQA, SWE-bench — reward the model that gets the hardest question right. Enterprise AI judging rewards the model that is honestly uncertain when it should be, cheap enough to run on every signal, and reproducible across versions. Those are not the same axis.

The six axes

A judge model lives or dies on how well it handles uncertainty, cost, and change — not how well it answers a trivia question.

Calibration (ECE, Brier)

When the model says 70% confident, does it turn out right 70% of the time? Measured by Expected Calibration Error and Brier score.
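A minimal sketch of both metrics, assuming equal-width confidence bins for ECE (the harness may bin differently; see methodology):

```python
import numpy as np

def brier_score(confidences, outcomes):
    """Mean squared error between stated confidence and the 0/1 outcome."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((confidences - outcomes) ** 2))

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Confidence-vs-accuracy gap per bin, weighted by the share of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - outcomes[mask].mean())
    return float(ece)
```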

Cost per 1K runs

Total USD to score 1,000 signals, including retries and retries-of-retries. This is the line item on the enterprise invoice.
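A back-of-envelope sketch of that line item, assuming each call retries independently with a fixed probability; the per-call price and retry rate below are illustrative, not any vendor's actual numbers:

```python
def cost_per_1k(price_per_call_usd: float, retry_rate: float) -> float:
    """Expected USD to score 1,000 signals. With independent retries at probability
    `retry_rate`, each signal costs 1 / (1 - retry_rate) calls on average, which
    already covers retries-of-retries (geometric series)."""
    expected_calls_per_signal = 1.0 / (1.0 - retry_rate)
    return 1000 * price_per_call_usd * expected_calls_per_signal

# Illustrative: $0.005 per call at a 4% retry rate is about $5.21 per 1,000 signals.
print(round(cost_per_1k(0.005, 0.04), 2))
```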

P95 latency

Time until 95% of requests return a verdict. The tail is what blocks the workflow, not the mean.

Egress count

The number of distinct external destinations a signal crosses the tenant boundary to reach. Every egress is a procurement conversation.

Determinism rate

Same input twice, same verdict? Frontier models drift at the margin because their backends change. Judges cannot.

Version stability

Score drift across minor model versions. If your auditor asks "what did you score last quarter?" the answer must not change because the vendor shipped a new SKU.
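A minimal sketch of how these last two axes can be scored; `judge` and the score lists are placeholders, not the harness API:

```python
def determinism_rate(judge, inputs):
    """Fraction of inputs for which two identical calls return the same verdict."""
    same = sum(judge(x) == judge(x) for x in inputs)
    return same / len(inputs)

def version_drift(scores_old, scores_new):
    """Mean absolute score change between two model versions on the same pinned holdout."""
    assert len(scores_old) == len(scores_new)
    return sum(abs(a - b) for a, b in zip(scores_old, scores_new)) / len(scores_old)
```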

Current standings

Five models on a 200-sample holdout from the deal-dd pack. Same rubric, same prompts, same retry policy. Version-pinned — see methodology for reproduction steps.

Last updated 2026-04-19. Re-run on every model release.

| Model | ECE ↓ | Brier ↓ | $/1K ↓ | P95 ↓ | Egress ↓ | Determinism ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 (OpenAI) | 0.112 | 0.184 | $5.20 | 2.9s | 1 | 62% |
| Claude Sonnet 4.6 (Anthropic) | 0.094 | 0.172 | $3.40 | 2.1s | 1 | 68% |
| Claude Opus 4.7 (Anthropic) | 0.081 | 0.159 | $14.80 | 4.2s | 1 | 71% |
| Qwen 2.5 72B (open weights, self-host) | 0.121 | 0.201 | $0.42 | 1.8s | 0 | 83% |
| ★ SeaOtter Judge v0 (Qwen 2.5-1.5B LoRA, ours) LIVE | ⚠️ 0.24 (v0 miss; improving with real ground truth) | ✓ 0.011 (methodology being cross-verified) | ✓ $0.10 | ◐ 660ms (v0 FP16; int4 target <100ms) | ✓ 0 | ✓ 100% |

v0 results are live, not projected. 1 of 6 axes missed target (ECE); 3 won structurally (cost, egress, determinism); full analysis in methodology.

All numbers computed via our open eval harness, version-pinned to the dates above. Re-run with one CLI command; see methodology.

Read the methodology

Per-pack results — 6 packs × 6 benchmarks

The same six axes applied to every calibration specialist we've trained. Each cell shows our number versus the best frontier baseline on that pack's test split.

Single seed (seed = 65), 200 GPT-4o-mini-distilled training samples per pack, Qwen 2.5-1.5B LoRA, trained locally on MPS. Frontier baseline = best of gpt-4o-mini and claude-sonnet-4-5 on each axis, 2026-Q1.
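A sketch of how the per-cell frontier baseline can be assembled: take the stronger of the two baseline models on each axis, minimizing where the arrow points down and maximizing for determinism. The numbers below are placeholders, not the published baselines:

```python
# Placeholder per-pack metrics for the two baseline models (not the published numbers).
baselines = {
    "gpt-4o-mini":       {"ece": 0.09, "brier": 0.010, "usd_per_1k": 0.50, "p95_s": 10.0, "egress": 1, "determinism": 0.80},
    "claude-sonnet-4-5": {"ece": 0.07, "brier": 0.015, "usd_per_1k": 3.40, "p95_s": 2.1,  "egress": 1, "determinism": 0.68},
}

HIGHER_IS_BETTER = {"determinism"}

def best_frontier(baselines: dict) -> dict:
    """Per axis, keep the strongest baseline value: min where lower is better, max otherwise."""
    axes = next(iter(baselines.values())).keys()
    return {
        axis: (max if axis in HIGHER_IS_BETTER else min)(m[axis] for m in baselines.values())
        for axis in axes
    }

print(best_frontier(baselines))
```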

| Pack | ECE ↓ | Brier ↓ | $/1K ↓ | P95 ↓ | Egress ↓ | Determinism ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| deal-dd | ⚠️ 0.24 vs 0.081 | ✓ 0.011 vs 0.159 | ✓ $0.10 vs $5.20 | ✓ 660ms vs 2.9s | ✓ 0 vs 1 | ✓ 100% vs 71% |
| credit-memo | ⚠️ 0.30 vs 0.15 | ◐ 0.007 vs 0.003 | ✓ $0.10 vs $0.50 | ✓ 352ms vs 7.7s | ✓ 0 vs 1 | ◐ 100% vs 100% |
| applied-ml-ship | ⚠️ 0.15 vs 0.063 | ◐ 0.010 vs 0.005 | ✓ $0.10 vs $0.50 | ✓ 230ms vs 13.3s | ✓ 0 vs 1 | ◐ 100% vs 100% |
| eng-deploy | ⚠️ 0.30 vs 0.168 | ◐ 0.031 vs 0.010 | ✓ $0.10 vs $0.50 | ✓ 187ms vs 10.0s | ✓ 0 vs 1 | ✓ 100% vs 80% |
| legal-contract | ⚠️ 0.35 vs 0.023 | ◐ 0.006 vs 0.003 | ✓ $0.10 vs $0.50 | ✓ 58ms vs 10.0s | ✓ 0 vs 1 | ◐ 100% vs 100% |
| revenue-close | ◐ 0.00 vs 0.023 | ◐ 0.012 vs 0.007 | ✓ $0.10 vs $0.50 | ✓ 108ms vs 12.7s | ✓ 0 vs 1 | ◐ 100% vs 100% |

Structural wins hold across packs

Cost, latency, and egress are won on every one of the six packs. These are architectural — frontier APIs cannot close them without rewriting their inference stack.

ECE regresses on 4 of 6 packs

Distillation from a miscalibrated teacher inherits the teacher's confidence profile. Real enterprise ground truth in v1 is the fix; we don't hide the miss.

Brier ties — not beats — the teacher

You cannot structurally beat your own teacher on MSE. What the specialist buys is competitive Brier while collapsing cost and latency by 10-100x.

If you underwrite AI, you need a judge you can defend

Capability benchmarks don't tell you whether the model you're betting on will still score the same way next quarter. These do.

Request a pilot