Commercial & Cross-Border Legal AI · Independent · Reproducible · Published by HAQQ

Which AI is actually reliable
on real legal work?

We ran 10 frontier models from 8 providers across 300 demanding commercial & cross-border legal tasks — contract drafting, M&A, disputes, regulatory, cross-jurisdiction structuring — spanning 51 practice areas and 20+ jurisdictions, merged with every prior run of this benchmark. Same prompts, same rubric, no vibes.

Last updated 2026-06-07 · judged by anthropic/claude-sonnet-4.6 · temperature 0
10 models tested
8 providers
300 legal prompts
51 practice areas
20+ jurisdictions
3,394 total evaluations*
*across all three datasets ever run (this fresh run + the original ChatHub /35 runs + the cross-jurisdiction runs), merged by provider brand.
One finding outweighs the ranking. Across this run, 24% of all answers cited or applied law that doesn't say what the model claimed. The most accurate model still hallucinates. No frontier model is safe to ship a legal answer unverified — which is the entire argument for a citation-verification layer on top of any model.

01 Provider leaderboard

Every model collapsed to its provider brand and merged across all three datasets we've run, normalized to a 0–100 legal-quality index and weighted by number of evaluations. The version churns; the brand signal is stable.

# · Provider · Index /100 · Evaluations · Datasets · Strength
Blended index mixes a /35 rubric and a 0–10 rubric across runs — directional, not absolute. Brands with fewer datasets (thinner data) noted.
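A minimal sketch of how such a blended index could be computed, assuming each dataset contributes a mean score on its native rubric plus an evaluation count. The `RunScore` type, its field names, and the simple divide-by-maximum normalization are illustrative assumptions, not the published runner.

```python
from dataclasses import dataclass


@dataclass
class RunScore:
    provider: str      # provider brand the model collapses to (hypothetical field)
    mean_score: float  # average score on that dataset's native rubric
    scale_max: float   # 35 for the /35 rubric, 10 for the 0-10 rubric
    n_evals: int       # number of evaluations behind the mean


def blended_index(runs: list[RunScore]) -> dict[str, float]:
    """Evaluation-weighted mean of scores rescaled to 0-100, per provider."""
    weighted: dict[str, float] = {}
    counts: dict[str, int] = {}
    for r in runs:
        # Assumption: rescale by dividing by the rubric maximum.
        normalized = 100.0 * r.mean_score / r.scale_max
        weighted[r.provider] = weighted.get(r.provider, 0.0) + normalized * r.n_evals
        counts[r.provider] = counts.get(r.provider, 0) + r.n_evals
    return {p: weighted[p] / counts[p] for p in weighted}
```

Weighting by evaluation count means a brand's largest dataset dominates its index, which is one reason the blended figure is directional rather than absolute.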

02 Model leaderboard (this run)

300 prompts × 10 models, scored on Quality (10) · Accuracy (10) · Speed (5, measured latency) · Style (5) · Creativity (5) = Total /35.

# · Model · Total /35 · Qual · Acc · Spd · Style · Crea · Wins · Halluc · Latency · $/task

03 Reliability vs usefulness

Each dot is a model: accuracy (is it right) on the horizontal axis, quality (is it usable) on the vertical axis. The best sit top-right. Colour = provider.

04 The citation gap

The headline failure mode in legal AI: confidently citing law that doesn't hold. Share of answers flagged for hallucinated or misapplied citations, by provider (lower is better).
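The per-provider share can be sketched as a simple flagged-over-total ratio. This assumes each judged answer is a record with a provider name and a boolean flag; the keys `provider` and `citation_flagged` are hypothetical and may not match the fields in the published results.

```python
from collections import defaultdict


def citation_flag_rate(answers: list[dict]) -> dict[str, float]:
    """Share of answers flagged for hallucinated or misapplied citations, per provider."""
    totals: dict[str, int] = defaultdict(int)
    flagged: dict[str, int] = defaultdict(int)
    for answer in answers:
        provider = answer["provider"]          # hypothetical key
        totals[provider] += 1
        if answer["citation_flagged"]:         # hypothetical key
            flagged[provider] += 1
    # Lower is better; a provider with no flagged answers gets 0.0.
    return {p: flagged[p] / totals[p] for p in totals}
```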

What the judge actually caught

05 The latency tax

Quality vs response time. The highest-quality models are often the slowest. For a client-facing product the target is a high score at low latency.
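One way to make the trade-off concrete is a Pareto front: keep only the models that no faster model matches or beats on score. This is a sketch under the assumption that each model reduces to one (score, latency) pair; the function name and input shape are illustrative.

```python
def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return non-dominated models, fastest first.

    models maps a model name to (score, latency_seconds).
    A model is kept only if it scores higher than every faster model.
    """
    front: list[str] = []
    best_score = float("-inf")
    # Sort by latency ascending; on equal latency, higher score first.
    ordered = sorted(models.items(), key=lambda kv: (kv[1][1], -kv[1][0]))
    for name, (score, _latency) in ordered:
        if score > best_score:
            front.append(name)
            best_score = score
    return front
```

Anything off the front is both slower and no better than some alternative, so it is hard to justify in a latency-sensitive product.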

06 Drill down by practice area

No single model wins everywhere. Full model × practice-area matrix (avg /35, 51 areas). Greener = stronger; the per-row leader is boxed. This is the routing argument in one grid.

≥30 · 27–30 · 24–27 · 21–24 · <21
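The routing argument can be sketched directly from the matrix: pick the top-scoring model per practice area and bucket each average into the legend bands. The nested-dict layout and the lower-inclusive band boundaries are assumptions; the legend itself does not say which side of 27, 24 or 21 a boundary value falls on.

```python
def band(avg: float) -> str:
    """Map an average /35 score to a legend band (lower bound inclusive, assumed)."""
    if avg >= 30:
        return "≥30"
    if avg >= 27:
        return "27–30"
    if avg >= 24:
        return "24–27"
    if avg >= 21:
        return "21–24"
    return "<21"


def routing_table(matrix: dict[str, dict[str, float]]) -> dict[str, str]:
    """For each practice area, return the model with the highest average score.

    matrix maps practice area -> {model name -> average /35 score}.
    """
    return {area: max(scores, key=scores.get) for area, scores in matrix.items()}
```

A router built this way would send each incoming task to the per-row leader for its practice area rather than to one global winner.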

07 Benchmark integrity

A cautionary result. Our first run capped answers at 1,200 tokens; reasoning models returned empty answers and verbose models were truncated and penalised. Re-running uniformly at 6,000 tokens moved scores — but not uniformly. Most models gained 2–3 points; two got worse with more room, because length exposed more bad citations. Output caps quietly rig most public leaderboards.

Model · 1,200-tok · 6,000-tok · Δ
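The Δ column is just the per-model difference between the two runs. A small sketch, assuming each run is a mapping from model name to average total score; the function names are illustrative.

```python
def cap_deltas(low_cap: dict[str, float], high_cap: dict[str, float]) -> dict[str, float]:
    """Score change per model when moving from the 1,200-token to the 6,000-token cap."""
    return {
        model: round(high_cap[model] - low_cap[model], 2)
        for model in low_cap
        if model in high_cap  # only compare models present in both runs
    }


def regressions(deltas: dict[str, float]) -> list[str]:
    """Models that scored worse with the larger output cap."""
    return sorted(model for model, delta in deltas.items() if delta < 0)
```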

08 Methodology

Built to be reproducible. Same system prompt and temperature 0 for every model; answers capped at 6,000 tokens so reasoning models aren't starved and verbose models aren't truncated.
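A sketch of what "same settings for every model" can look like in a runner: one frozen config, copied into every request so only the model slug varies. The `RunConfig` type and the request dictionary shape are hypothetical and not tied to any particular provider API; the system prompt text is not reproduced here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    system_prompt: str             # identical for every model
    temperature: float = 0.0       # temperature 0, as stated above
    max_output_tokens: int = 6000  # uniform answer cap, as stated above


def build_requests(config: RunConfig, model_slugs: list[str], prompt: str) -> list[dict]:
    """Build one request per model; everything except the slug is shared."""
    shared = asdict(config)
    return [{"model": slug, "prompt": prompt, **shared} for slug in model_slugs]
```

Freezing the config makes it harder to change a setting for one model mid-run by accident.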

Scoring rubric (/35)
Quality · 1–10 · depth, completeness, actionability
Accuracy · 1–10 · legal correctness; −5 hallucinated cite, −3 wrong jurisdiction
Speed · 1–5 · measured latency rank (not judged)
Style · 1–5 · professional structure
Creativity · 1–5 · non-obvious / cross-jurisdiction issues
Quality/Accuracy/Style/Creativity by LLM judge (anthropic/claude-sonnet-4.6). Speed computed from real response latency.
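The rubric arithmetic can be sketched as below. Two details are assumptions, since the rubric does not spell them out: that the citation and jurisdiction penalties are subtracted from the Accuracy score and floored at 1, and that latency rank maps to Speed points in equal-sized bands with the fastest models earning 5.

```python
def accuracy_points(base: int, hallucinated_cites: int = 0, wrong_jurisdiction: int = 0) -> int:
    """Apply -5 per hallucinated cite and -3 per wrong jurisdiction, clamped to 1-10."""
    penalized = base - 5 * hallucinated_cites - 3 * wrong_jurisdiction
    return max(1, min(10, penalized))


def speed_points(latencies: dict[str, float]) -> dict[str, int]:
    """Map latency rank to 1-5 points (assumed banding: fastest fifth gets 5)."""
    ordered = sorted(latencies, key=latencies.get)  # fastest first
    n = len(ordered)
    return {model: 5 - (rank * 5) // n for rank, model in enumerate(ordered)}


def total_score(quality: int, accuracy: int, speed: int, style: int, creativity: int) -> int:
    """Sum the five components into the /35 total."""
    return quality + accuracy + speed + style + creativity
```

Under these assumptions a single hallucinated citation can halve an otherwise strong Accuracy score, which is why the accuracy column separates models more sharply than the quality column.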
Models & routing slugs
Prompt design

Prompts are original, specific legal tasks (named parties, amounts, dates, statutes) engineered to expose failure modes — hallucinated case law, wrong-jurisdiction defaults, missed conflicts of law. They deliberately mix jurisdictions; most are difficulty 4–5 of 5. Categories tested:

09 Limitations

  • Single run, temperature 0. Models are non-deterministic; absolute scores move run-to-run. Rankings are more reliable than the decimals. Treat the top cluster as co-leaders.
  • The judge is an LLM (anthropic/claude-sonnet-4.6, a Claude model) — possible self-preference. Mitigating evidence: a non-Claude model leads on accuracy and a non-Claude model is in the top cluster, so the judge is not blindly pro-Claude. A second-model re-judge is the next hardening step.
  • Blended brand index merges two scoring rubrics across three datasets — directional, for brand-level comparison only.
  • Budget-capped run. This pass covers 300 of 300 staged prompts; the rest run on the next top-up. Older brands appear via historical data only where noted.

10 Data & reproducibility

Everything is open. Download the raw results, the prompts, and the runner.

results.json (full scores) · report_data.json (this page's data) · comparison.md (leaderboard)

11 Part of the HAQQ benchmark series

This report covers commercial & cross-border legal work. Two companion benchmarks cover the other angles: