# Legal LLM Benchmark — complete run

- Prompts: 300 | Models: 10 | Judge: anthropic/claude-sonnet-4.6 | temperature 0, single run
- Speed (Spd) = measured-latency rank (5 = fastest). Quality (Qual), Accuracy (Acc), Style (Sty), and Creativity (Crea) are LLM-judged.
- Caveat: results from a single run are directional only. The judge is a Claude model that is also one of the ranked models, so self-preference bias is possible.

## Leaderboard

| # | Model | Total/35 | Qual/10 | Acc/10 | Spd/5 | Sty/5 | Crea/5 | Wins | Top3 | Halluc | Answered |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4.8 | **30.02** | 8.95 | 8.39 | 2.84 | 4.98 | 4.86 | 130 | 265 | 17 | 300/300 |
| 2 | grok-4.3 | **28.98** | 8.06 | 7.56 | 4.69 | 4.80 | 3.87 | 61 | 176 | 51 | 300/300 |
| 3 | gemini-3.1-pro | **28.66** | 8.46 | 7.59 | 3.54 | 4.92 | 4.14 | 34 | 152 | 46 | 291/300 |
| 4 | claude-sonnet-4.6 | **27.77** | 8.75 | 7.73 | 1.75 | 4.88 | 4.66 | 5 | 63 | 72 | 300/300 |
| 5 | gpt-5.5 | **27.27** | 8.53 | 8.41 | 1.14 | 4.93 | 4.27 | 1 | 35 | 9 | 300/300 |
| 6 | qwen3.7-max | **26.69** | 8.39 | 7.45 | 1.72 | 4.79 | 4.34 | 2 | 29 | 52 | 300/300 |
| 7 | deepseek-v3.2 | **26.24** | 8.11 | 7.15 | 2.29 | 4.85 | 3.84 | 1 | 15 | 80 | 300/300 |
| 8 | o3 | **24.38** | 6.99 | 5.89 | 4.08 | 3.98 | 3.44 | 66 | 164 | 96 | 300/300 |
| 9 | mistral-large | **21.88** | 6.58 | 4.74 | 3.08 | 4.20 | 3.27 | 0 | 1 | 191 | 300/300 |
| 10 | llama-4-maverick | **20.01** | 4.84 | 4.97 | 4.90 | 3.05 | 2.25 | 0 | 0 | 111 | 300/300 |
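
The Total/35 column appears to be the plain sum of the five component columns (10 + 10 + 5 + 5 + 5 = 35). The report does not state the formula, so treat this as an inference from the numbers. The Python sketch below recomputes each total from the published components and flags any row that disagrees by more than rounding error. The names `ROWS`, `total_score`, and `mismatches` are illustrative and not part of the benchmark code, and the 0.03 tolerance assumes every published value was independently rounded to two decimals.

```python
# Component scores copied from the leaderboard, in column order:
# (Qual/10, Acc/10, Spd/5, Sty/5, Crea/5), followed by the reported Total/35.
ROWS = {
    "claude-opus-4.8":   ((8.95, 8.39, 2.84, 4.98, 4.86), 30.02),
    "grok-4.3":          ((8.06, 7.56, 4.69, 4.80, 3.87), 28.98),
    "gemini-3.1-pro":    ((8.46, 7.59, 3.54, 4.92, 4.14), 28.66),
    "claude-sonnet-4.6": ((8.75, 7.73, 1.75, 4.88, 4.66), 27.77),
    "gpt-5.5":           ((8.53, 8.41, 1.14, 4.93, 4.27), 27.27),
    "qwen3.7-max":       ((8.39, 7.45, 1.72, 4.79, 4.34), 26.69),
    "deepseek-v3.2":     ((8.11, 7.15, 2.29, 4.85, 3.84), 26.24),
    "o3":                ((6.99, 5.89, 4.08, 3.98, 3.44), 24.38),
    "mistral-large":     ((6.58, 4.74, 3.08, 4.20, 3.27), 21.88),
    "llama-4-maverick":  ((4.84, 4.97, 4.90, 3.05, 2.25), 20.01),
}


def total_score(components):
    """Assumed formula: unweighted sum of the five component scores."""
    return round(sum(components), 2)


def mismatches(rows, tolerance=0.03):
    """Return rows whose reported total differs from the recomputed sum.

    The tolerance is an assumption: six values each rounded to two
    decimals can shift the difference by up to 0.03.
    """
    return {
        name: (total_score(components), reported)
        for name, (components, reported) in rows.items()
        if abs(sum(components) - reported) > tolerance
    }


if __name__ == "__main__":
    for name, (components, reported) in ROWS.items():
        print(f"{name:<20} recomputed={total_score(components):.2f} reported={reported:.2f}")
    print("rows outside tolerance:", mismatches(ROWS))
```

Under this assumption the composite weights Quality and Accuracy twice as heavily as Speed, Style, and Creativity, purely because of their larger scales.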