White paper
How the judge bench works: what we measure, how calibration is computed, and why frontier models structurally lose on the axes enterprises actually pay for.
Six axes, chosen because they are what shows up in an enterprise procurement checklist — and because they resist being solved by scaling the base model.
The gap between a model's stated confidence and its realized accuracy. Reported as Expected Calibration Error (ECE) and Brier score against a held-out synthetic/distilled test set.
Enterprises care about "when is this wrong" more than "how often is this right." A 95% accurate model that cannot tell you which 5% is wrong is unusable in regulated work.
End-to-end USD to score 1,000 signals at steady state, including tokenization, retries, and the tokens consumed by the rubric itself. Measured on the same rubric across all models.
Judges are run on every signal, not once per deal. Cost compounds linearly with volume.
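To make the cost axis concrete, here is a minimal sketch of the per-1,000-signal cost computation; the token counts, prices, and retry rate below are illustrative assumptions, not the bench's measured values.

def cost_per_1k_signals(
    rubric_tokens: int,        # prompt tokens consumed by the rubric itself
    signal_tokens: int,        # prompt tokens for the signal being scored
    verdict_tokens: int,       # completion tokens in the returned verdict
    usd_per_1m_input: float,   # vendor input-token price
    usd_per_1m_output: float,  # vendor output-token price
    retry_rate: float = 0.05,  # assumed fraction of calls retried end to end
) -> float:
    input_cost = (rubric_tokens + signal_tokens) * usd_per_1m_input / 1e6
    output_cost = verdict_tokens * usd_per_1m_output / 1e6
    per_call = (input_cost + output_cost) * (1 + retry_rate)
    return per_call * 1_000

# Example with made-up numbers: 1,200 rubric tokens + 800 signal tokens in,
# a 10-token verdict out, at $2.50 / $10.00 per million tokens.
print(cost_per_1k_signals(1200, 800, 10, 2.50, 10.00))   # ~ $5.36 per 1K signals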
Time from request dispatch to verdict returned, at the 95th percentile. Measured on concurrent production-like load.
The tail is what blocks a workflow. Mean latency hides the procurement-killing outliers.
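A minimal sketch of how the p95 number can be taken under concurrent load; score_signal stands in for the actual judge call and is an assumption, not the bench's harness.

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def p95_latency(score_signal, signals, concurrency: int = 32) -> float:
    # Time each call from request dispatch to verdict returned.
    def timed(signal):
        start = time.perf_counter()
        score_signal(signal)
        return time.perf_counter() - start

    # Concurrent, production-like load rather than one call at a time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, signals))

    # Report the tail, not the mean.
    return float(np.percentile(latencies, 95))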
Number of distinct external destinations (domains, API tenants, sub-processors) the raw signal traverses before a verdict is returned.
Every egress is a DPA, a vendor review, and a line in the audit log. Egress = 0 is a category of its own.
Fraction of (input, prompt) pairs that return an identical verdict across repeated calls. Temperature-0 is necessary but not sufficient; backend routing changes break it.
Regulated workflows must be reproducible for audit. Non-determinism is not a quirk, it is a compliance defect.
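A minimal sketch of the determinism measurement, assuming a judge(signal, prompt) callable that returns the verdict string and k repeated calls per pair.

def determinism_rate(judge, pairs, k: int = 5) -> float:
    # Fraction of (input, prompt) pairs whose verdict is identical across k calls.
    stable = 0
    for signal, prompt in pairs:
        verdicts = {judge(signal, prompt) for _ in range(k)}
        stable += int(len(verdicts) == 1)
    return stable / len(pairs)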
Mean absolute score drift when the underlying model is upgraded to a newer minor version. Measured on a frozen rubric and frozen test set.
Frontier vendors ship new SKUs on their own cadence. Enterprise judges must not silently re-score last quarter's deals.
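A minimal sketch of the drift measurement, assuming a score_with(model_version, sample) helper that runs the frozen rubric on the frozen test set.

def version_drift(score_with, frozen_test_set, old_version: str, new_version: str) -> float:
    # Mean absolute score change when only the model version changes.
    old_scores = [score_with(old_version, sample) for sample in frozen_test_set]
    new_scores = [score_with(new_version, sample) for sample in frozen_test_set]
    return sum(abs(new - old) for new, old in zip(new_scores, old_scores)) / len(frozen_test_set)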
Capability benchmarks are built for the question "which lab has the smartest model this quarter." They are useful, but they are not what enterprises actually sign procurement contracts for.
Enterprise AI judging is a different job. The buyer is an audit committee, a compliance officer, or a risk carrier. They are not asking "can it pass the bar exam." They are asking "when this thing is wrong, will I know, will I be able to defend the score six months later, and will it cost more than the decision it informs."
These six axes come straight off that procurement checklist. They are not the ones frontier labs optimize for, and that is exactly why they are the ones worth measuring.
Calibration is the only axis with non-obvious math. The rest are stopwatch and invoice questions. Here is the full procedure.
Step 1
Build a held-out version-pinned test set
200 samples per pack, currently GPT-4o-mini-distilled synthetic enterprise workflow scenarios reviewed against domain rubrics. Test sets are version-pinned and never used for training or rubric tuning.
Step 2
Elicit a confidence score per verdict
Every judge call returns both a verdict and a [0,1] confidence. For frontier models without a native confidence channel, we use the log-probability of the verdict token, calibrated once per model via isotonic regression on a disjoint dev split.
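A minimal sketch of that per-model calibration step, using scikit-learn's isotonic regression; the function names and data shapes are illustrative, not the production code.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_confidence_calibrator(dev_verdict_logprobs, dev_correct):
    # Fit once per model on a disjoint dev split: map the raw probability of
    # the verdict token to the observed 0/1 correctness.
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(np.exp(dev_verdict_logprobs), np.asarray(dev_correct, dtype=float))
    return calibrator

def calibrated_confidence(calibrator, verdict_logprob: float) -> float:
    # At eval time: calibrated [0, 1] confidence for a single verdict.
    return float(calibrator.predict([np.exp(verdict_logprob)])[0])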
Step 3
Compute ECE and Brier
ECE: bin predictions into 10 equal-width confidence buckets, take the absolute gap between mean confidence and mean accuracy in each bucket, weight by bucket size. Brier: mean squared error between confidence and the 0/1 correctness indicator.
Step 4
Report both — never just accuracy
Accuracy alone is the wrong metric for a judge. A 60%-accurate judge that is well-calibrated is more useful than a 90%-accurate judge that is overconfident on the 10% it gets wrong.
ECE formula
ECE = Σ_b (|B_b| / N) · |acc(B_b) − conf(B_b)|
Lower is better. A perfectly calibrated model scores 0; a 50/50 coin flip claiming 95% confidence scores near 0.45.
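A minimal sketch of both metrics as defined above, where confidence is the calibrated [0, 1] confidence per verdict and correct is the 0/1 correctness indicator.

import numpy as np

def ece_and_brier(confidence, correct, n_bins: int = 10):
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Equal-width buckets; the last bucket is closed at 1.0.
        in_bin = (confidence >= lo) & ((confidence < hi) if hi < 1.0 else (confidence <= hi))
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by bucket size |B_b| / N

    brier = float(np.mean((confidence - correct) ** 2))
    return float(ece), brier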
Capability-better does not imply judging-better. The architecture of a frontier API is the wrong shape for four of our six axes — not because the labs are bad at ML, but because their product surface is optimized for a different buyer.
A frontier model served by a hyperscaler is, by definition, egress-positive. For workflows where the signal cannot leave the tenant boundary, this is an immediate disqualification — before any accuracy question is asked.
Hyperscale inference routes requests across heterogeneous hardware for cost reasons. The same prompt at temperature 0 returns different tokens depending on which GPU answered. A specialist judge running on pinned hardware does not have this problem.
Frontier labs ship new minor versions monthly. "Latest" is their product. For a judge, "frozen" is the product. Those incentives do not align, and no API flag fixes it.
A 500B generalist running a 10-token judging task still pays the 500B forward pass. A 7B specialist trained on the same rubric pays ~1/50th of that. Over millions of judge calls, the gap is the business.
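As a rough sanity check on that ratio, here is the parameter-count arithmetic only; serving overhead, batching, and pricing margins are where the ~1/50th cost figure above lives, so this sketch shows order of magnitude, not the exact number.

# Back-of-envelope only: a dense decoder spends roughly 2 * parameter_count
# FLOPs per generated token, so per-verdict compute tracks the parameter ratio.
generalist_params = 500e9
specialist_params = 7e9
tokens_per_verdict = 10

ratio = (2 * generalist_params * tokens_per_verdict) / (2 * specialist_params * tokens_per_verdict)
print(ratio)   # ~71x less compute per verdict for the specialist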
None of this means frontier models are bad. It means they are optimized for a job that is not judging. The judge market is downstream of that mismatch.
Judge v0 is a Qwen 2.5-1.5B LoRA trained locally in 8 minutes on MPS against 500 GPT-5-teacher-scored samples. Zero cash cost. The results are a mixed scorecard — and that is the point. We are publishing what we actually observed, not the target slide.
3 structural wins even at 1.5B parameters: cost, privacy (egress), determinism
At v0 we already clear the three axes that are architecture-level, not scale-level: $0.10 per 1K runs (50× cheaper than GPT-5), 0 external egress (runs in-tenant), and 100% determinism on repeated runs. None of these improve with a bigger base model on the frontier side — they are structurally unavailable to a hosted multi-tenant API. Validated at the smallest plausible model size.
ECE miss (0.24 vs frontier 0.10–0.15): distillation from GPT-5 teacher scaled its calibration errors
Our v0 teacher labels came from GPT-5 itself. Karpathy's warning applies directly: distilling a teacher's outputs inherits and amplifies the teacher's calibration errors rather than fixing them. Expected. Fix: replace teacher-generated labels with expert-validated ground truth from Private Pilots. Gate: N ≥ 1000 human corrections before we retrain. On track per the `ground_truth_events` pipeline.
Latency 660ms → target <100ms requires int4 quantization
v0 is served in FP16 on a single MPS device. P95 of 660ms already beats every hosted frontier judge, but we are not yet at the <100ms target that makes per-signal scoring economical at scale. Straightforward path: int4 quantization of the LoRA-merged weights. Deferred to v1 — the engineering is routine, the ground-truth work is not.
Brier 0.011 looks exceptional — methodology being cross-verified
v0's Brier of 0.011 is roughly 10–20× better than the frontier reference band (~0.15–0.25). That gap is large enough that we suspect a methodology difference between our harness and the reference numbers we are comparing against (score range, normalization, correctness indicator). We are cross-verifying against the eval-harness pipeline; the published number may update. We would rather correct it publicly than leave a too-good-to-be-true result unchecked.
This section exists because the honest scorecard is the scorecard. A model that wins 3 structural axes at 1.5B parameters validates the architecture; the ECE miss tells us exactly what data to collect next.
Every number on the judge-bench leaderboard is produced by a single CLI tool that takes a model adapter and a pack ID, runs the 200-sample holdout, and emits a signed JSON result with the eval-harness version and the test-set SHA.
The harness itself is open. The test sets are versioned. Anyone — competitor, customer, auditor — can reproduce the leaderboard. When we update a number, the prior run is archived, not overwritten.
Example invocation
python -m src.judge.eval_harness --model gpt-5 --pack deal-dd
Projected numbers for SeaOtter Judge v0 are clearly marked as such until the eval harness produces real ones. We publish target-vs-achieved on release, not just the target.
The judge bench is a live artifact. New numbers replace old numbers; projected rows become measured rows. If a frontier lab ships a version that closes the gap on any of the six axes, the leaderboard will say so the same day.