We ran 10 frontier models from 8 providers across 300 demanding commercial & cross-border legal tasks — contract drafting, M&A, disputes, regulatory, cross-jurisdiction structuring — spanning 51 practice areas and 20+ jurisdictions, merged with every prior run of this benchmark. Same prompts, same rubric, no vibes.
Every model is collapsed to its provider brand and merged across all three datasets we've run, then normalized to a 0–100 legal-quality index and weighted by number of evaluations. Model versions churn; the brand signal is stable.
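The report doesn't spell out the exact normalization, so the following is only a minimal sketch of how such an index could be computed. It assumes each run's mean score is rescaled by its rubric maximum and that the provider figure is the evaluation-weighted mean. The function name `provider_index`, the provider names, and all numbers are hypothetical.

```python
from collections import defaultdict

def provider_index(runs):
    """runs: iterable of (provider, mean_score, max_score, n_evaluations).

    Returns {provider: index on a 0-100 scale}, weighted by evaluation count.
    Assumption: normalization is simply mean_score / max_score * 100.
    """
    weighted_sum = defaultdict(float)
    eval_count = defaultdict(int)
    for provider, mean_score, max_score, n in runs:
        weighted_sum[provider] += (mean_score / max_score) * 100 * n
        eval_count[provider] += n
    return {p: weighted_sum[p] / eval_count[p] for p in weighted_sum}

# Hypothetical inputs, not results from the benchmark.
runs = [
    ("ProviderA", 28.0, 35, 300),  # newer model, 300 evaluations
    ("ProviderA", 21.0, 35, 100),  # older model from a prior dataset
    ("ProviderB", 24.5, 35, 300),
]
index = provider_index(runs)
```

With these made-up inputs, ProviderA's larger run dominates its smaller, weaker one, which is the point of weighting by evaluations.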
| # | Provider | Index /100 | Evaluations | Datasets | Strength |
|---|---|---|---|---|---|
300 prompts × 10 models, scored on Quality (10) · Accuracy (10) · Speed (5, measured latency) · Style (5) · Creativity (5) = Total /35.
| # | Model | Total /35 | Qual | Acc | Spd | Style | Crea | Wins | Halluc | Latency | $/task |
|---|---|---|---|---|---|---|---|---|---|---|---|
Each dot is a model: accuracy (is it right) across, quality (is it usable) up. The best sit top-right. Colour = provider.
The headline failure mode in legal AI: confidently citing law that doesn't hold. Share of answers flagged for hallucinated or misapplied citations, by provider (lower is better).
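As a rough illustration of the metric, the share can be computed from per-answer flags. This sketch assumes one boolean flag per answer (hallucinated or misapplied citation); the function name `hallucination_share` and the sample data are hypothetical.

```python
from collections import Counter

def hallucination_share(answers):
    """answers: iterable of (provider, flagged) pairs.

    Returns {provider: percentage of that provider's answers that were flagged}.
    """
    total = Counter()
    flagged = Counter()
    for provider, is_flagged in answers:
        total[provider] += 1
        flagged[provider] += int(is_flagged)
    return {p: 100 * flagged[p] / total[p] for p in total}

# Hypothetical flags, not benchmark data.
answers = [
    ("ProviderA", True), ("ProviderA", False),
    ("ProviderA", False), ("ProviderA", False),
    ("ProviderB", False), ("ProviderB", True),
]
shares = hallucination_share(answers)
```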
Quality vs response time. The highest-quality models are often the slowest; for a client-facing product the target is top-left rather than the usual top-right: high score, low latency.
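One way to make the trade-off concrete is to keep only the models that no other model beats on both axes (a Pareto frontier). This is a sketch of that idea, not something the report says it computes; the model names and numbers are hypothetical.

```python
def pareto_frontier(models):
    """models: list of (name, quality, latency_seconds).

    Returns the names of models that are not dominated, i.e. no other model
    is at least as good on both axes and strictly better on one
    (higher quality is better, lower latency is better).
    """
    frontier = []
    for name, quality, latency in models:
        dominated = any(
            (q2 >= quality and lat2 <= latency)
            and (q2 > quality or lat2 < latency)
            for _, q2, lat2 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical models, not benchmark entries.
models = [
    ("fast-ok", 24.0, 4.0),
    ("slow-best", 30.0, 40.0),
    ("slow-worse", 26.0, 45.0),  # beaten by slow-best on both axes
    ("mid", 27.0, 12.0),
]
frontier = pareto_frontier(models)
```

Anything off the frontier is slower and weaker than some alternative, so a client-facing product would only choose among the remaining models.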
No single model wins everywhere. Full model × practice-area matrix (avg /35, 51 areas). Greener = stronger; the per-row leader is boxed. This is the routing argument in one grid.
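The routing argument can be sketched in a few lines: given a model × practice-area matrix, pick the per-row leader and send each task to it. The matrix layout, the function name `build_routing_table`, and the scores below are assumptions for illustration only.

```python
def build_routing_table(matrix):
    """matrix: {practice_area: {model: average score out of 35}}.

    Returns {practice_area: best-scoring model}, i.e. the per-row leader.
    """
    return {area: max(scores, key=scores.get) for area, scores in matrix.items()}

# Hypothetical averages, not benchmark data.
matrix = {
    "M&A": {"model-x": 29.1, "model-y": 27.4},
    "Disputes": {"model-x": 25.0, "model-y": 28.3},
    "Regulatory": {"model-x": 26.2, "model-y": 26.0},
}
routes = build_routing_table(matrix)
```

A production router would likely also weigh latency and cost per task, which this sketch ignores.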
A cautionary result. Our first run capped answers at 1,200 tokens; reasoning models returned empty answers and verbose models were truncated and penalised. Re-running uniformly at 6,000 tokens moved scores — but not uniformly. Most models gained 2–3 points; two got worse with more room, because length exposed more bad citations. Output caps quietly rig most public leaderboards.
| Model | 1,200-tok | 6,000-tok | Δ |
|---|---|---|---|
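The Δ column is simply the 6,000-token score minus the 1,200-token score per model. A minimal sketch with hypothetical model names and scores, splitting models into those that gained and those that got worse with more room:

```python
def cap_deltas(scores_1200, scores_6000):
    """Both arguments: {model: total score out of 35}.

    Returns {model: delta} for models present in both runs.
    """
    return {
        model: round(scores_6000[model] - scores_1200[model], 2)
        for model in scores_1200
        if model in scores_6000
    }

# Hypothetical scores, not benchmark data.
scores_1200 = {"model-x": 25.0, "model-y": 23.5}
scores_6000 = {"model-x": 27.5, "model-y": 22.0}

deltas = cap_deltas(scores_1200, scores_6000)
gained = [m for m, d in deltas.items() if d > 0]
worse = [m for m, d in deltas.items() if d < 0]
```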
Built to be reproducible. Same system prompt and temperature 0 for every model; answers capped at 6,000 tokens so reasoning models aren't starved and verbose models aren't truncated.
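A sketch of what "same settings for every model" can look like in a runner. The released runner may be structured quite differently; `RunConfig`, `request_params`, the parameter keys, and the placeholder system prompt are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    system_prompt: str
    temperature: float = 0.0       # temperature 0 for every model, per the report
    max_output_tokens: int = 6000  # uniform output cap, per the report

def request_params(config, model, prompt):
    """Build one request; only `model` and `prompt` vary between calls."""
    return {
        "model": model,
        "system": config.system_prompt,
        "prompt": prompt,
        "temperature": config.temperature,
        "max_tokens": config.max_output_tokens,
    }

# Placeholder prompt text, not the benchmark's actual system prompt.
config = RunConfig(system_prompt="<shared system prompt>")
a = request_params(config, "model-x", "Draft an indemnity clause.")
b = request_params(config, "model-y", "Draft an indemnity clause.")
```

Freezing the dataclass means a per-model override cannot slip in by accident mid-run.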
| Dimension | Scale | What is scored |
|---|---|---|
| Quality | 1–10 | depth, completeness, actionability |
| Accuracy | 1–10 | legal correctness; −5 hallucinated cite, −3 wrong jurisdiction |
| Speed | 1–5 | measured latency rank (not judged) |
| Style | 1–5 | professional structure |
| Creativity | 1–5 | non-obvious / cross-jurisdiction issues |
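To show how the rubric adds up to a total out of 35, here is a minimal sketch. Several details are assumptions not stated in the report: that each penalty applies once per answer, that accuracy is floored at 1, and that speed points map linearly from latency rank (fastest gets 5, slowest gets 1). All names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Judged:
    quality: int     # 1-10
    accuracy: int    # 1-10, before penalties
    style: int       # 1-5
    creativity: int  # 1-5
    hallucinated_cite: bool = False
    wrong_jurisdiction: bool = False

def accuracy_after_penalties(j):
    score = j.accuracy
    if j.hallucinated_cite:
        score -= 5  # penalty from the rubric
    if j.wrong_jurisdiction:
        score -= 3  # penalty from the rubric
    return max(1, score)  # assumption: floor at the scale minimum

def speed_points(latency_s, all_latencies):
    """Rank-based 1-5 score (assumed linear mapping over latency rank)."""
    ordered = sorted(all_latencies)
    if len(ordered) == 1:
        return 5
    rank = ordered.index(latency_s)  # 0 = fastest
    return 5 - round(4 * rank / (len(ordered) - 1))

def total_score(j, latency_s, all_latencies):
    return (j.quality + accuracy_after_penalties(j)
            + speed_points(latency_s, all_latencies)
            + j.style + j.creativity)

# Hypothetical answer: strong draft, but one hallucinated citation.
latencies = [3.0, 8.0, 20.0, 41.0, 60.0]
answer = Judged(quality=8, accuracy=9, style=4, creativity=3, hallucinated_cite=True)
score = total_score(answer, 8.0, latencies)
```

In this made-up case a single hallucinated citation drops accuracy from 9 to 4, which costs more than the gap between the fastest and slowest speed scores.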
Prompts are original, specific legal tasks (named parties, amounts, dates, statutes) engineered to expose failure modes — hallucinated case law, wrong-jurisdiction defaults, missed conflicts of law. They deliberately mix jurisdictions; most are difficulty 4–5 of 5. Categories tested:
Everything is open. Download the raw results, the prompts, and the runner.
⬇ results.json (full scores) · ⬇ report_data.json (this page's data) · ⬇ comparison.md (leaderboard)

This report covers commercial & cross-border legal work. Two companion benchmarks cover the other angles: