Each score is the panel mean across four frontier evaluator models. Every evaluator is given the same rubric and the same set of fifteen labs, and must argue its scores in writing. Where the panel disagrees is usually the most interesting story, so the table is sorted by average spread, widest disagreement first.
| Company | Panel mean | Avg spread |
|---|---|---|
| 🇺🇸 xAI | 7.23 | 2.93 |
| 🇨🇳 Zhipu AI | 6.24 | 2.74 |
| 🇺🇸 Apple | 6.80 | 2.59 |
| 🇺🇸 Meta | 8.07 | 2.53 |
| 🇨🇳 DeepSeek | 7.21 | 2.47 |
| 🇨🇳 Moonshot AI | 6.50 | 2.46 |
| 🇫🇷 Mistral AI | 7.05 | 1.91 |
| 🇨🇳 Alibaba | 7.87 | 1.62 |
| 🇨🇳 Baidu | 7.25 | 1.62 |
| 🇺🇸 Amazon | 7.77 | 1.61 |
| 🇨🇳 ByteDance | 7.82 | 1.60 |
| 🇺🇸 OpenAI | 8.85 | 1.42 |
| 🇺🇸 Microsoft | 8.35 | 1.39 |
| 🇺🇸 Anthropic | 8.38 | 1.31 |
| — | 9.46 | 0.60 |
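
For concreteness, here is a minimal sketch of how the two columns above might be computed from raw panel scores. It assumes each of the four evaluators returns a single 0–10 score per lab and that "Avg spread" is the max-minus-min gap across the panel; the actual spread definition isn't stated here, and all names in the snippet are illustrative.

```python
# Sketch of the aggregation behind the table: per-lab panel mean and spread.
# Assumes four scores per lab and spread = max - min (an assumption).
from statistics import mean


def aggregate(panel_scores: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Return (lab, panel mean, spread) rows, sorted widest disagreement first."""
    rows = [
        (lab, round(mean(scores), 2), round(max(scores) - min(scores), 2))
        for lab, scores in panel_scores.items()
    ]
    # Sort by spread descending, matching the ordering of the table above.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical scores from the four evaluators for one lab.
print(aggregate({"Example Lab": [8.0, 9.0, 7.5, 8.5]}))
# [('Example Lab', 8.25, 1.5)]
```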