Commercial & Cross-Border Legal AI · Independent · Reproducible · Published by HAQQ

Which AI is actually reliable
on real legal work?

We ran 10 frontier models from 8 providers across 300 demanding commercial & cross-border legal tasks — contract drafting, M&A, disputes, regulatory, cross-jurisdiction structuring — spanning 51 practice areas and 20+ jurisdictions, merged with every prior run of this benchmark. Same prompts, same rubric, no vibes.

Last updated 2026-06-07 · judged by anthropic/claude-sonnet-4.6 · temperature 0

10 models tested

8 providers

300 legal prompts

51 practice areas

20+ jurisdictions

3,394 total evaluations*

*across all three datasets ever run (this fresh run + the original ChatHub /35 runs + the cross-jurisdiction runs), merged by provider brand.

One finding outweighs the ranking. Across this run, 24% of all answers cited or applied law that doesn't say what the model claimed. The most accurate model still hallucinates. No frontier model is safe to ship a legal answer unverified — which is the entire argument for a citation-verification layer on top of any model.

01 Provider leaderboard

Every model is collapsed to its provider brand and merged across all three datasets we've run, then normalized to a 0–100 legal-quality index and weighted by number of evaluations. Model versions churn; the brand signal is stable.

#	Provider	Index /100	Evaluations	Datasets	Strength

The blended index mixes a /35 rubric and a 0–10 rubric across runs, so it is directional, not absolute. Brands with fewer datasets (thinner data) are noted.
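One way the blended index could be computed, as a minimal sketch: each dataset's mean score is rescaled to 0–100 by dividing by its rubric ceiling (35 or 10), then the rescaled means are averaged per provider, weighted by evaluation count. The `RunScore` type, the `blended_index` function, and the choice to rescale by the ceiling alone are assumptions for illustration, not the benchmark's published code. The numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class RunScore:
    provider: str
    mean_score: float  # mean score on the dataset's own rubric
    max_score: float   # rubric ceiling: 35 for the /35 rubric, 10 for the 0-10 rubric
    n_evals: int       # number of evaluations behind the mean


def blended_index(runs):
    """Return {provider: index on a 0-100 scale}, weighted by evaluation count."""
    totals = {}
    for r in runs:
        normalized = 100.0 * r.mean_score / r.max_score  # assumption: rescale by ceiling only
        weighted_sum, n = totals.get(r.provider, (0.0, 0))
        totals[r.provider] = (weighted_sum + normalized * r.n_evals, n + r.n_evals)
    return {p: ws / n for p, (ws, n) in totals.items() if n > 0}


# Illustrative numbers only, not results from this benchmark.
runs = [
    RunScore("ProviderA", 28.0, 35, 300),
    RunScore("ProviderA", 7.0, 10, 100),
    RunScore("ProviderB", 21.0, 35, 300),
]
index = blended_index(runs)
for provider, value in sorted(index.items(), key=lambda kv: -kv[1]):
    print(f"{provider}: {value:.1f}")
```

Weighting by evaluation count keeps a small historical dataset from dominating a brand's index, which is one reason thin-data brands are better flagged than ranked tightly.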

02 Model leaderboard (this run)

300 prompts × 10 models, scored on Quality (10) · Accuracy (10) · Speed (5, measured latency) · Style (5) · Creativity (5) = Total /35.

#	Model	Total /35	Qual	Acc	Spd	Style	Crea	Wins	Halluc	Latency	$/task

03 Reliability vs usefulness

Each dot is a model, with accuracy (is it right?) on the horizontal axis and quality (is it usable?) on the vertical axis. The best models sit top-right. Colour = provider.

04 The citation gap

The headline failure mode in legal AI: confidently citing law that doesn't hold. Share of answers flagged for hallucinated or misapplied citations, by provider (lower is better).
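The per-provider share can be sketched as a simple ratio of flagged answers to total answers. The `(provider, flagged)` record shape and the `hallucination_share` name are hypothetical. The actual schema of the results file is not shown here, and the sample records are invented.

```python
from collections import defaultdict


def hallucination_share(evaluations):
    """evaluations: iterable of (provider, flagged) pairs.

    Returns {provider: fraction of answers flagged for a hallucinated
    or misapplied citation}. Lower is better.
    """
    counts = defaultdict(lambda: [0, 0])  # provider -> [flagged, total]
    for provider, flagged in evaluations:
        counts[provider][0] += int(bool(flagged))
        counts[provider][1] += 1
    return {p: flagged / total for p, (flagged, total) in counts.items()}


# Illustrative records only, not benchmark data.
sample = [
    ("ProviderA", True), ("ProviderA", False),
    ("ProviderA", False), ("ProviderA", False),
    ("ProviderB", True), ("ProviderB", True),
]
shares = hallucination_share(sample)
for provider, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{provider}: {share:.0%}")
```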

What the judge actually caught

05 The latency tax

Quality vs response time. The highest-quality models are often the slowest. For a client-facing product, the target is a high score at low latency.

06 Drill down by practice area

No single model wins everywhere. Full model × practice-area matrix (avg /35, 51 areas). Greener = stronger; the per-row leader is boxed. This is the routing argument in one grid.

≥30 · 27–30 · 24–27 · 21–24 · <21
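The routing argument can be sketched in a few lines: given a model × practice-area matrix of average scores, pick the per-row leader for each area. The matrix shape, the `route_by_practice_area` name, and the sample scores are assumptions for illustration. A production router would presumably also weigh latency, cost, and hallucination rate.

```python
def route_by_practice_area(matrix):
    """matrix: {practice_area: {model: avg_score_out_of_35}}.

    Returns {practice_area: model with the highest average score}.
    On a tie, the model listed first in the row wins.
    """
    return {
        area: max(scores, key=scores.get)
        for area, scores in matrix.items()
        if scores  # skip areas with no scores
    }


# Illustrative scores only, not benchmark data.
matrix = {
    "M&A": {"model-a": 29.1, "model-b": 27.4},
    "Disputes": {"model-a": 25.0, "model-b": 28.3},
    "Regulatory": {"model-a": 26.2, "model-b": 26.2},
}
routes = route_by_practice_area(matrix)
print(routes)
```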

07 Benchmark integrity

A cautionary result. Our first run capped answers at 1,200 tokens; reasoning models returned empty answers and verbose models were truncated and penalised. Re-running uniformly at 6,000 tokens moved scores — but not uniformly. Most models gained 2–3 points; two got worse with more room, because length exposed more bad citations. Output caps quietly rig most public leaderboards.

Model	1,200-tok	6,000-tok	Δ
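The Δ column is the 6,000-token score minus the 1,200-token score for the same model. A minimal sketch of that comparison, which also lists the models that scored lower with the larger cap, follows. The function names and the sample scores are hypothetical, not taken from the benchmark's runner.

```python
def cap_deltas(low_cap, high_cap):
    """Return {model: high_cap_score - low_cap_score} for models in both runs."""
    return {m: round(high_cap[m] - low_cap[m], 2) for m in low_cap if m in high_cap}


def regressed(deltas):
    """Models whose score fell when the output cap was raised."""
    return sorted(m for m, d in deltas.items() if d < 0)


# Illustrative scores only, not benchmark data.
scores_1200 = {"model-a": 22.0, "model-b": 25.5, "model-c": 0.0}
scores_6000 = {"model-a": 24.5, "model-b": 24.0, "model-c": 26.0}

deltas = cap_deltas(scores_1200, scores_6000)
print(deltas)
print("got worse with more room:", regressed(deltas))
```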

08 Methodology

Built to be reproducible. Same system prompt and temperature 0 for every model; answers capped at 6,000 tokens so reasoning models aren't starved and verbose models aren't truncated.

Scoring rubric (/35)

Quality	1–10	depth, completeness, actionability
Accuracy	1–10	legal correctness; −5 hallucinated cite, −3 wrong jurisdiction
Speed	1–5	measured latency rank (not judged)
Style	1–5	professional structure
Creativity	1–5	non-obvious / cross-jurisdiction issues

Quality/Accuracy/Style/Creativity by LLM judge (anthropic/claude-sonnet-4.6). Speed computed from real response latency.
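The rubric can be sketched as follows. Several details are assumptions because the page does not specify them: that each accuracy penalty applies once per answer rather than per citation, that the penalised score is floored at 1, and that latency rank is mapped to 1–5 in equal-sized buckets. All function names are hypothetical.

```python
RANGES = {
    "quality": (1, 10),
    "accuracy": (1, 10),
    "speed": (1, 5),
    "style": (1, 5),
    "creativity": (1, 5),
}


def penalized_accuracy(base, hallucinated_cite=False, wrong_jurisdiction=False):
    """Apply the rubric's accuracy penalties (assumed: once each, floored at 1)."""
    score = base
    if hallucinated_cite:
        score -= 5
    if wrong_jurisdiction:
        score -= 3
    return max(1, min(10, score))


def speed_scores(latencies):
    """Map {model: seconds} to {model: 1..5} by latency rank, fastest = 5.

    Assumption: ranks are split into five equal-sized buckets.
    """
    ranked = sorted(latencies, key=latencies.get)  # fastest first
    n = len(ranked)
    return {model: 5 - (i * 5) // n for i, model in enumerate(ranked)}


def total_score(**components):
    """Sum the five rubric components into a total out of 35."""
    if set(components) != set(RANGES):
        raise ValueError(f"expected components {sorted(RANGES)}")
    for name, value in components.items():
        lo, hi = RANGES[name]
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside {lo}-{hi}")
    return sum(components.values())


# Illustrative values only, not benchmark data.
speed = speed_scores({"model-a": 4.2, "model-b": 11.8, "model-c": 27.5})
total = total_score(
    quality=8,
    accuracy=penalized_accuracy(9, hallucinated_cite=True),
    speed=speed["model-b"],
    style=4,
    creativity=3,
)
print(speed, total)
```

Computing Speed from measured latency rather than asking the judge keeps one component of the total independent of the LLM judge.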

Models & routing slugs

Prompt design

Prompts are original, specific legal tasks (named parties, amounts, dates, statutes) engineered to expose failure modes — hallucinated case law, wrong-jurisdiction defaults, missed conflicts of law. They deliberately mix jurisdictions; most are difficulty 4–5 of 5. Categories tested:

09 Limitations

Single run, temperature 0. Models are non-deterministic; absolute scores move run-to-run. Rankings are more reliable than the decimals. Treat the top cluster as co-leaders.
The judge is an LLM (anthropic/claude-sonnet-4.6, a Claude model) — possible self-preference. Mitigating evidence: a non-Claude model leads on accuracy and a non-Claude model is in the top cluster, so the judge is not blindly pro-Claude. A second-model re-judge is the next hardening step.
Blended brand index merges two scoring rubrics across three datasets — directional, for brand-level comparison only.
Budget-capped run. This pass covers all 300 staged prompts, so nothing is deferred to the next top-up. Older brands appear via historical data only where noted.

10 Data & reproducibility

Everything is open. Download the raw results, the prompts, and the runner.

results.json (full scores)
report_data.json (this page's data)
comparison.md (leaderboard)

11 Part of the HAQQ benchmark series

This report covers commercial & cross-border legal work. Two companion benchmarks cover the other angles:

Consumer / common-law →

3 models on 100 real r/legaladvice questions. Pass rates 78–88%; the weak spot is appropriate caveats.

Civil-law / MENA (HAQQ-LAB) →

Jurisdiction adherence on UAE·DIFC·KSA·LB·EG·QA. Ungoverned 0% → governed 100%.
