Public leaderboard
These benchmarks measure what enterprises buy, not what frontier labs compete on.
Capability benchmarks — MMLU, GPQA, SWE-bench — reward the model that gets the hardest question right. Enterprise AI judging rewards the model that is honestly uncertain when it should be, cheap enough to run on every signal, and reproducible across versions. Those are not the same axes.
A judge model lives or dies on how well it handles uncertainty, cost, and change — not how well it answers a trivia question.
- **Calibration (ECE, Brier):** When the model says 70% confident, does it turn out right 70% of the time? Measured by Expected Calibration Error and Brier score (see the sketch after this list).
- **Cost ($/1K):** Total USD to score 1,000 signals, including retries and retries-of-retries. This is the line item on the enterprise invoice.
- **Latency (P95):** Time until 95% of requests return a verdict. The tail is what blocks the workflow, not the mean.
- **Egress:** How many distinct external destinations the signal leaves the tenant boundary for. Every egress is a procurement conversation.
- **Determinism:** Same input twice, same verdict? Frontier models drift at the margin because their backends change. Judges cannot.
- **Version stability:** Score drift across minor model versions. If your auditor asks "what did you score last quarter?", the answer must not change because the vendor shipped a new SKU.
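For concreteness, here is a minimal sketch of how these axes can be computed from one scored run. The function name, the record schema (stated confidence, ground-truth correctness, per-call cost, latency, and verdicts from two identical passes), and the 10-bin equal-width ECE are our illustrative assumptions, not the harness's actual internals.

```python
import numpy as np

def score_axes(records, n_bins=10):
    """Compute the leaderboard axes from one eval run.

    Assumed record schema: confidence in [0, 1], correct (0/1),
    cost_usd per call (retries folded in), latency_s, and the
    verdicts from two identical passes over the same input.
    """
    conf = np.array([r["confidence"] for r in records])
    correct = np.array([r["correct"] for r in records], dtype=float)

    # Brier score: mean squared error between stated confidence and outcome.
    brier = float(np.mean((conf - correct) ** 2))

    # ECE: bin by confidence, compare each bin's mean confidence to its
    # empirical accuracy, weight by the fraction of samples in the bin.
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())

    # Cost per 1,000 signals, retries already included in per-call cost.
    cost_per_1k = 1000 * float(np.mean([r["cost_usd"] for r in records]))

    # P95 latency: the tail that blocks the workflow.
    p95 = float(np.percentile([r["latency_s"] for r in records], 95))

    # Determinism: fraction of inputs where two identical calls agree.
    determinism = float(np.mean(
        [r["verdict_a"] == r["verdict_b"] for r in records]
    ))

    return {"ece": ece, "brier": brier, "usd_per_1k": cost_per_1k,
            "p95_s": p95, "determinism": determinism}
```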
Five models on a 200-sample holdout from the deal-dd pack. Same rubric, same prompts, same retry policy. Version-pinned — see methodology for reproduction steps.
Last updated 2026-04-19. Re-run on every model release.
| Model | ECE ↓ | Brier ↓ | $/1K ↓ | P95 ↓ | Egress ↓ | Determinism ↑ |
|---|---|---|---|---|---|---|
| GPT-5 (OpenAI) | 0.112 | 0.184 | $5.20 | 2.9s | 1 | 62% |
| Claude Sonnet 4.6 (Anthropic) | 0.094 | 0.172 | $3.40 | 2.1s | 1 | 68% |
| Claude Opus 4.7 (Anthropic) | 0.081 | 0.159 | $14.80 | 4.2s | 1 | 71% |
| Qwen 2.5 72B (open weights, self-host) | 0.121 | 0.201 | $0.42 | 1.8s | 0 | 83% |
| SeaOtter Judge v0 (Qwen 2.5-1.5B LoRA, ours) · live | 0.24 ¹ | 0.011 ² | $0.10 | 660ms ³ | 0 | 100% |

¹ v0 miss — improving with real ground truth. ² Methodology being cross-verified. ³ v0 FP16; int4 target <100ms.
v0 results are live, not projected. 1 of 6 axes missed target (ECE); 3 won structurally (cost, egress, determinism); full analysis in methodology.
All numbers computed via our open eval harness, version-pinned to the dates above. Re-run with one CLI command; see methodology.
The same six axes applied to every calibration specialist we've trained. Each cell shows our number versus the best frontier baseline on that pack's test split.
Single seed (65), 200 GPT-4o-mini-distilled training samples per pack, Qwen 2.5-1.5B LoRA, MPS local training. Frontier baseline = best of gpt-4o-mini and claude-sonnet-4-5 on each axis, 2026-Q1.
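For readers who want to picture the setup, here is a minimal sketch of the kind of LoRA fine-tune this describes, using Hugging Face transformers and peft on Apple MPS. The rank, alpha, target modules, and model revision are illustrative assumptions, not our shipped training recipe; the per-pack results follow the sketch.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

torch.manual_seed(65)          # the single seed pinned above
device = torch.device("mps")   # local Apple-silicon training

base = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16
).to(device)

# LoRA adapter; rank/alpha/targets here are illustrative, not the real config.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# The training loop over the 200 distilled samples per pack would go here,
# e.g. via transformers.Trainer or a plain optimizer loop.
```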
| Pack | ECE ↓ | Brier ↓ | $/1K ↓ | P95 ↓ | Egress ↓ | Determinism ↑ |
|---|---|---|---|---|---|---|
| deal-dd | 0.24 vs 0.081 | 0.011 vs 0.159 | $0.10 vs $5.20 | 660ms vs 2.9s | 0 vs 1 | 100% vs 71% |
| credit-memo | 0.30 vs 0.15 | 0.007 vs 0.003 | $0.10 vs $0.50 | 352ms vs 7.7s | 0 vs 1 | 100% vs 100% |
| applied-ml-ship | 0.15 vs 0.063 | 0.010 vs 0.005 | $0.10 vs $0.50 | 230ms vs 13.3s | 0 vs 1 | 100% vs 100% |
| eng-deploy | 0.30 vs 0.168 | 0.031 vs 0.010 | $0.10 vs $0.50 | 187ms vs 10.0s | 0 vs 1 | 100% vs 80% |
| legal-contract | 0.35 vs 0.023 | 0.006 vs 0.003 | $0.10 vs $0.50 | 58ms vs 10.0s | 0 vs 1 | 100% vs 100% |
| revenue-close | 0.00 vs 0.023 | 0.012 vs 0.007 | $0.10 vs $0.50 | 108ms vs 12.7s | 0 vs 1 | 100% vs 100% |
Structural wins hold across packs
Cost, latency, and egress are won on all six packs. These advantages are architectural: frontier APIs cannot close them without rewriting their inference stack.
ECE regresses on 4 of 6 packs
Distillation from a miscalibrated teacher inherits the teacher's confidence profile. Real enterprise ground truth in v1 is the fix; we don't hide the miss.
Brier ties — not beats — the teacher
You cannot structurally beat your own teacher on MSE: Brier is mean squared error on predicted probabilities, so a student distilled toward its teacher's probabilities converges to the teacher's Brier rather than below it (a worked example follows). What the specialist buys is competitive Brier while collapsing cost and latency by 10-100x.
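A toy numeric check of that claim, using made-up labels and probabilities rather than anything from our eval:

```python
import numpy as np

labels = np.array([1, 0, 1, 1, 0], dtype=float)
teacher = np.array([0.9, 0.2, 0.6, 0.8, 0.3])  # teacher's probabilities
student = teacher.copy()                        # perfect distillation target

brier = lambda p, y: float(np.mean((p - y) ** 2))

# A student trained to reproduce the teacher's probabilities reproduces the
# teacher's Brier exactly; beating it requires a signal the teacher lacks,
# i.e. real ground truth rather than distilled labels.
assert brier(student, labels) == brier(teacher, labels)
print(brier(teacher, labels))  # 0.068
```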
Capability benchmarks don't tell you whether the model you're betting on will still score the same way next quarter. These do.
Request a pilot