Avg: Overall average.
| Model | CE | C1 | C2 | L1 | L2 | GP | HC | Avg |
|---|---|---|---|---|---|---|---|---|
| Closed-Source | ||||||||
| o3-mini-high 🧠 | 83.8 | 82.0 | 81.7 | 80.6 | 79.2 | 77.1 | 73.3 | 79.7 |
| Gemini-2.0-Flash-T 🧠 | 81.3 | 71.4 | 71.0 | 69.5 | 66.4 | 61.9 | 59.2 | 68.7 |
| Gemini-2.0-Flash | 58.9 | 56.8 | 55.4 | 54.0 | 51.3 | 43.7 | 41.2 | 51.6 |
| Gemini-1.5-Pro | 59.8 | 54.3 | 53.2 | 52.4 | 50.2 | 43.9 | 41.2 | 50.7 |
| Gemini-1.5-Flash | 60.7 | 49.7 | 47.4 | 45.5 | 42.9 | 36.9 | 36.0 | 45.6 |
| GPT-4o | 61.7 | 50.1 | 46.2 | 43.8 | 42.3 | 35.0 | 33.0 | 44.6 |
| GPT-4o-mini | 49.5 | 42.2 | 42.8 | 41.5 | 39.8 | 31.4 | 30.0 | 40.4 |
| Gemini-1.5-Flash-8B | 40.2 | 35.1 | 33.9 | 31.2 | 29.3 | 21.8 | 20.6 | 31.6 |
| Open-Source > 8B | ||||||||
| DeepSeek-R1 🧠 | 85.0 | 77.3 | 75.9 | 74.7 | 72.7 | 69.7 | 69.0 | 74.9 |
| DeepSeek-V3 | 66.4 | 54.3 | 52.1 | 50.7 | 48.3 | 40.8 | 36.5 | 49.9 |
| Phi-4-14B * | 66.4 | 51.9 | 48.1 | 46.0 | 43.7 | 37.0 | 32.2 | 46.5 |
| Phi-4-14B | 59.8 | 50.4 | 45.9 | 43.6 | 41.1 | 33.6 | 30.0 | 43.5 |
| Qwen2.5-72B | 53.3 | 48.4 | 45.2 | 43.3 | 41.4 | 34.4 | 29.6 | 42.2 |
| QwQ-32B 🧠 | 56.1 | 43.5 | 40.0 | 37.3 | 34.4 | 25.4 | 23.2 | 37.1 |
| LLaMA-3.3-70B | 44.9 | 41.5 | 39.7 | 37.3 | 35.7 | 26.3 | 26.2 | 35.9 |
| DeepSeek-R1-Qwen 🧠 | 44.2 | 38.7 | 38.0 | 36.3 | 33.4 | 25.8 | 19.8 | 33.7 |
| Open-Source ≤ 8B (Math-Specialized) | ||||||||
| Qwen2.5-Math-7B * 🔧 | 53.3 | 47.9 | 48.3 | 47.3 | 46.9 | 39.8 | 34.1 | 45.4 |
| Qwen2.5-Math-7B 🔧 | 43.9 | 45.4 | 44.1 | 42.2 | 41.2 | 33.6 | 31.3 | 40.2 |
| NuminaMath-7B * 🔧 | 43.0 | 38.8 | 38.0 | 36.7 | 35.3 | 26.7 | 24.5 | 34.7 |
| Qwen2.5-Math-7B * | 40.7 | 37.0 | 38.3 | 36.7 | 35.6 | 27.5 | 26.1 | 34.6 |
| Qwen2.5-Math-7B | 40.2 | 36.5 | 37.8 | 36.2 | 35.1 | 26.9 | 24.9 | 33.9 |
| NuminaMath-7B 🔧 | 39.2 | 27.6 | 31.1 | 29.4 | 28.3 | 19.8 | 18.0 | 27.7 |
| NuminaMath-7B | 28.8 | 25.6 | 24.9 | 24.1 | 23.1 | 25.4 | 22.4 | 24.9 |
| Mathstral-7B * | 35.5 | 27.2 | 26.1 | 23.6 | 21.8 | 16.7 | 12.4 | 23.3 |
| NuminaMath-7B * | 31.8 | 25.2 | 25.4 | 24.2 | 22.7 | 13.1 | 9.0 | 21.6 |
| DeepSeek-Math-7B * 🔧 | 23.4 | 24.4 | 23.7 | 22.9 | 21.3 | 15.1 | 14.2 | 20.7 |
| Mathstral-7B | 27.1 | 22.0 | 23.4 | 21.3 | 20.1 | 12.2 | 11.2 | 19.6 |
| DeepSeek-Math-7B * | 21.3 | 21.6 | 21.9 | 20.7 | 19.6 | 13.2 | 10.2 | 18.4 |
| DeepSeek-Math-7B 🔧 | 21.1 | 21.4 | 21.7 | 20.5 | 19.3 | 12.8 | 9.8 | 18.1 |
| DeepSeek-Math-7B | 20.6 | 21.0 | 21.4 | 20.1 | 18.9 | 12.5 | 9.4 | 17.7 |
| ToRA-7B * 🔧 | 12.2 | 11.6 | 12.1 | 11.5 | 11.1 | 7.6 | 6.4 | 10.4 |
| ToRA-7B 🔧 | 6.5 | 11.1 | 12.4 | 11.8 | 11.3 | 9.3 | 7.7 | 10.0 |
Note: 🧠 marks reasoning-focused models; * marks results reported as maj@8 instead of pass@1; 🔧 marks tool-integrated reasoning (TIR) mode.
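
The two scoring modes in the note can be illustrated with a minimal sketch. It assumes that pass@1 scores a single sampled answer and that maj@8 takes a majority vote over eight sampled final answers. The function names, the exact-string answer comparison, and the sample data are hypothetical and are not the benchmark's actual grader.

```python
from collections import Counter


def pass_at_1(samples, reference):
    """Score only the first sampled answer (assumed meaning of pass@1)."""
    return float(samples[0] == reference)


def maj_at_k(samples, reference, k=8):
    """Score the most frequent answer among the first k samples.

    Ties are resolved in favour of the answer seen first, which is how
    Counter.most_common orders elements with equal counts.
    """
    votes = Counter(samples[:k])
    top_answer, _ = votes.most_common(1)[0]
    return float(top_answer == reference)


# Hypothetical final answers from eight samples for one problem.
samples = ["12", "7", "7", "12", "7", "3", "7", "12"]
reference = "7"

print(pass_at_1(samples, reference))  # first sample is "12", so this is 0.0
print(maj_at_k(samples, reference))   # "7" has the most votes, so this is 1.0
```

Under these assumptions, majority voting can recover a problem that a single sample misses, which is one reason a starred row may score above its unstarred counterpart.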