Each score is the panel mean across four frontier evaluator models. Every model receives the same rubric and the same set of fifteen labs, and must argue its own scores in writing. Where the panel disagrees is usually the most interesting story, so the table is sorted by average spread, widest disagreement first.
| Company | Panel mean | Avg spread |
|---|---|---|
| 🇺🇸 Apple | 7.65 | 2.91 |
| 🇫🇷 Mistral AI | 7.36 | 2.77 |
| 🇨🇳 DeepSeek | 7.71 | 2.45 |
| 🇺🇸 Meta | 8.46 | 2.35 |
| 🇺🇸 xAI | 7.94 | 2.35 |
| 🇨🇳 Zhipu AI | 7.35 | 2.34 |
| 🇨🇳 Moonshot AI | 7.36 | 2.26 |
| 🇨🇳 ByteDance | 8.01 | 2.12 |
| 🇨🇳 Baidu | 7.63 | 1.95 |
| 🇺🇸 Amazon | 8.16 | 1.84 |
| 🇨🇳 Alibaba | 8.23 | 1.72 |
| 🇺🇸 Microsoft | 8.60 | 1.55 |
| 🇺🇸 OpenAI | 9.10 | 0.86 |
| 🇺🇸 Anthropic | 8.82 | 0.71 |
| | 9.32 | 0.46 |
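
The two numeric columns above can be reproduced from raw per-evaluator scores along the following lines. This is only a sketch under stated assumptions: the text does not define "Avg spread", so the code assumes it is the max-minus-min gap across the four evaluators on each rubric criterion, averaged over criteria. The evaluator names, criterion names, and score values are all hypothetical placeholders, not data from the panel.

```python
from statistics import mean

# Hypothetical raw scores for ONE lab: scores[evaluator][criterion].
# Names and values are invented for illustration only.
scores = {
    "evaluator_a": {"criterion_1": 8.0, "criterion_2": 7.0},
    "evaluator_b": {"criterion_1": 9.0, "criterion_2": 6.0},
    "evaluator_c": {"criterion_1": 7.0, "criterion_2": 8.0},
    "evaluator_d": {"criterion_1": 8.0, "criterion_2": 7.0},
}


def panel_mean(scores):
    """Average each evaluator's criteria into one overall score,
    then average those overall scores across the panel."""
    per_evaluator = [mean(list(crit.values())) for crit in scores.values()]
    return mean(per_evaluator)


def avg_spread(scores):
    """ASSUMED definition: for each criterion, take the gap between the
    highest and lowest evaluator score, then average the gaps."""
    criteria = list(next(iter(scores.values())).keys())
    gaps = []
    for criterion in criteria:
        values = [crit[criterion] for crit in scores.values()]
        gaps.append(max(values) - min(values))
    return mean(gaps)


print(f"panel mean: {panel_mean(scores):.2f}")
print(f"avg spread: {avg_spread(scores):.2f}")
```

In this toy input every evaluator lands on the same overall score, yet the per-criterion gaps are two points wide. That is the pattern the spread column is meant to surface: a panel can agree on the headline number while disagreeing on how it got there.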