MathGames

Can Large Language Models Win the International Mathematical Games?

University of Bologna, Italy
EMNLP 2025 Main Track

Introduction

Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization, clearly defined skill requirements, and—crucially—were not designed to assess human performance in international competitions.

To address these limitations, we introduce MathGames, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championship. Spanning seven age groups and a skill-based taxonomy, MathGames enables a structured evaluation of LLMs' mathematical and logical reasoning abilities.

Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants—even 11-year-olds consistently outperform some of the strongest models—highlighting the need for advancements. Further, our detailed error analysis offers valuable insights to guide future research.

MathGames Dataset

Overview

MathGames is a carefully designed benchmark for evaluating the mathematical and logical reasoning abilities of foundation models. It comprises 2,183 high-quality, playful-style problems (1,389 textual and 794 multimodal) spanning seven age categories, all posed in an open-ended format (i.e., without multiple-choice answers).

The dataset is available for download on Hugging Face Datasets.
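
For quick experimentation, the snippet below is a minimal sketch of how the dataset could be loaded with the Hugging Face datasets library. The repository ID, split name, and field names are illustrative assumptions (they are not stated on this page); check the dataset card on Hugging Face for the actual identifiers.

from datasets import load_dataset

# Hypothetical repository ID; see the Hugging Face dataset card for the real one.
REPO_ID = "disi-unibo-nlp/math-games"

dataset = load_dataset(REPO_ID)      # downloads and caches all available splits
example = dataset["train"][0]        # split name is assumed for illustration

# Field names below are assumed for illustration.
print(example.get("problem"))        # problem statement (text)
print(example.get("answer"))         # gold open-ended answer
print(example.get("age_group"))      # one of the seven age categories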

Leaderboard on MathGames

Textual Problems

CE, C1, C2, L1, L2, GP, HC: accuracy (%) per age group. Avg: overall average.

| Model | CE | C1 | C2 | L1 | L2 | GP | HC | Avg |
|---|---|---|---|---|---|---|---|---|
| Closed-Source | | | | | | | | |
| o3-mini-high 🧠 | 83.8 | 82.0 | 81.7 | 80.6 | 79.2 | 77.1 | 73.3 | 79.7 |
| Gemini-2.0-Flash-T 🧠 | 81.3 | 71.4 | 71.0 | 69.5 | 66.4 | 61.9 | 59.2 | 68.7 |
| Gemini-2.0-Flash | 58.9 | 56.8 | 55.4 | 54.0 | 51.3 | 43.7 | 41.2 | 51.6 |
| Gemini-1.5-Pro | 59.8 | 54.3 | 53.2 | 52.4 | 50.2 | 43.9 | 41.2 | 50.7 |
| Gemini-1.5-Flash | 60.7 | 49.7 | 47.4 | 45.5 | 42.9 | 36.9 | 36.0 | 45.6 |
| GPT-4o | 61.7 | 50.1 | 46.2 | 43.8 | 42.3 | 35.0 | 33.0 | 44.6 |
| GPT-4o-mini | 49.5 | 42.2 | 42.8 | 41.5 | 39.8 | 31.4 | 30.0 | 40.4 |
| Gemini-1.5-Flash-8B | 40.2 | 35.1 | 33.9 | 31.2 | 29.3 | 21.8 | 20.6 | 31.6 |
| Open-Source > 8B | | | | | | | | |
| DeepSeek-R1 🧠 | 85.0 | 77.3 | 75.9 | 74.7 | 72.7 | 69.7 | 69.0 | 74.9 |
| DeepSeek-V3 | 66.4 | 54.3 | 52.1 | 50.7 | 48.3 | 40.8 | 36.5 | 49.9 |
| Phi-4-14B * | 66.4 | 51.9 | 48.1 | 46.0 | 43.7 | 37.0 | 32.2 | 46.5 |
| Phi-4-14B | 59.8 | 50.4 | 45.9 | 43.6 | 41.1 | 33.6 | 30.0 | 43.5 |
| Qwen2.5-72B | 53.3 | 48.4 | 45.2 | 43.3 | 41.4 | 34.4 | 29.6 | 42.2 |
| QwQ-32B 🧠 | 56.1 | 43.5 | 40.0 | 37.3 | 34.4 | 25.4 | 23.2 | 37.1 |
| LLaMA-3.3-70B | 44.9 | 41.5 | 39.7 | 37.3 | 35.7 | 26.3 | 26.2 | 35.9 |
| DeepSeek-R1-Qwen 🧠 | 44.2 | 38.7 | 38.0 | 36.3 | 33.4 | 25.8 | 19.8 | 33.7 |
| Open-Source ≤ 8B (Math-Specialized) | | | | | | | | |
| Qwen2.5-Math-7B * 🔧 | 53.3 | 47.9 | 48.3 | 47.3 | 46.9 | 39.8 | 34.1 | 45.4 |
| Qwen2.5-Math-7B 🔧 | 43.9 | 45.4 | 44.1 | 42.2 | 41.2 | 33.6 | 31.3 | 40.2 |
| NuminaMath-7B * 🔧 | 43.0 | 38.8 | 38.0 | 36.7 | 35.3 | 26.7 | 24.5 | 34.7 |
| Qwen2.5-Math-7B * | 40.7 | 37.0 | 38.3 | 36.7 | 35.6 | 27.5 | 26.1 | 34.6 |
| Qwen2.5-Math-7B | 40.2 | 36.5 | 37.8 | 36.2 | 35.1 | 26.9 | 24.9 | 33.9 |
| NuminaMath-7B 🔧 | 39.2 | 27.6 | 31.1 | 29.4 | 28.3 | 19.8 | 18.0 | 27.7 |
| NuminaMath-7B | 28.8 | 25.6 | 24.9 | 24.1 | 23.1 | 25.4 | 22.4 | 24.9 |
| Mathstral-7B * | 35.5 | 27.2 | 26.1 | 23.6 | 21.8 | 16.7 | 12.4 | 23.3 |
| NuminaMath-7B * | 31.8 | 25.2 | 25.4 | 24.2 | 22.7 | 13.1 | 9.0 | 21.6 |
| DeepSeek-Math-7B * 🔧 | 23.4 | 24.4 | 23.7 | 22.9 | 21.3 | 15.1 | 14.2 | 20.7 |
| Mathstral-7B | 27.1 | 22.0 | 23.4 | 21.3 | 20.1 | 12.2 | 11.2 | 19.6 |
| DeepSeek-Math-7B * | 21.3 | 21.6 | 21.9 | 20.7 | 19.6 | 13.2 | 10.2 | 18.4 |
| DeepSeek-Math-7B 🔧 | 21.1 | 21.4 | 21.7 | 20.5 | 19.3 | 12.8 | 9.8 | 18.1 |
| DeepSeek-Math-7B | 20.6 | 21.0 | 21.4 | 20.1 | 18.9 | 12.5 | 9.4 | 17.7 |
| ToRA-7B * 🔧 | 12.2 | 11.6 | 12.1 | 11.5 | 11.1 | 7.6 | 6.4 | 10.4 |
| ToRA-7B 🔧 | 6.5 | 11.1 | 12.4 | 11.8 | 11.3 | 9.3 | 7.7 | 10.0 |

Note: 🧠 reasoning-focused model; * majority voting over 8 samples (maj@8) instead of pass@1; 🔧 tool-integrated reasoning (TIR) mode.
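
As a side note on the scoring legend above, here is a minimal sketch of how pass@1 and maj@8 are typically computed; the helper names and the simple string normalization are illustrative assumptions, not the paper's exact evaluation code.

from collections import Counter

def pass_at_1(prediction: str, gold: str) -> bool:
    # pass@1: score a single sampled answer against the gold answer.
    return prediction.strip() == gold.strip()

def maj_at_8(predictions: list[str], gold: str) -> bool:
    # maj@8: sample 8 answers, keep the most frequent one, then compare it to gold.
    assert len(predictions) == 8
    normalized = [p.strip() for p in predictions]
    majority_answer, _ = Counter(normalized).most_common(1)[0]
    return majority_answer == gold.strip()

# Example: 8 sampled answers for a problem whose gold answer is "42".
samples = ["42", "41", "42", "42", "7", "42", "41", "42"]
print(maj_at_8(samples, "42"))  # True: "42" wins the majority vote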

Multimodal Problems

CE, C1, C2, L1, L2, GP, HC: accuracy (%) per age group. Avg: overall average.

| Model | CE | C1 | C2 | L1 | L2 | GP | HC | Avg |
|---|---|---|---|---|---|---|---|---|
| Closed-Source | | | | | | | | |
| Gemini-2.0-Flash-T 🧠 | 38.3 | 29.3 | 32.2 | 31.5 | 31.3 | 25.4 | 25.1 | 30.4 |
| Gemini-1.5-Pro | 30.4 | 25.5 | 24.2 | 21.3 | 20.4 | 18.4 | 15.3 | 22.2 |
| Gemini-1.5-Flash | 27.0 | 19.4 | 16.1 | 15.0 | 15.6 | 12.7 | 14.2 | 17.1 |
| GPT-4o | 25.2 | 20.4 | 17.8 | 14.8 | 12.9 | 10.2 | 10.9 | 16.0 |
| GPT-4o-mini | 23.5 | 18.5 | 16.1 | 13.4 | 12.1 | 10.2 | 11.5 | 15.0 |
| Gemini-1.5-Flash-8B | 18.3 | 14.3 | 12.3 | 11.3 | 11.3 | 9.2 | 10.9 | 12.5 |
| Open-Source > 8B | | | | | | | | |
| InternVL-2.5-38B-MPO | 19.1 | 21.0 | 19.7 | 17.1 | 16.4 | 12.7 | 12.0 | 16.9 |
| InternVL-2.5-38B | 14.8 | 14.7 | 13.6 | 11.5 | 9.9 | 7.4 | 6.6 | 11.2 |
| QVQ-72B 🧠 | 20.0 | 11.8 | 8.9 | 7.5 | 7.1 | 6.7 | 6.6 | 9.8 |
| Qwen2-VL-72B | 14.8 | 12.5 | 11.2 | 8.8 | 7.5 | 6.0 | 3.8 | 9.2 |
| Pixtral-12B * | 12.2 | 6.4 | 5.9 | 5.2 | 4.8 | 5.3 | 4.4 | 6.3 |
| Pixtral-12B | 11.3 | 8.9 | 6.1 | 4.0 | 2.6 | 4.2 | 3.3 | 5.8 |
| Open-Source ≤ 8B | | | | | | | | |
| Phi-3.5-4.2B * | 24.4 | 11.2 | 11.4 | 11.1 | 10.3 | 7.4 | 7.1 | 11.5 |
| Qwen2-VL-7B * | 13.0 | 11.5 | 10.8 | 10.2 | 9.3 | 9.2 | 7.7 | 10.2 |
| Qwen2-VL-7B | 13.9 | 9.2 | 10.0 | 8.8 | 7.9 | 4.6 | 4.9 | 8.5 |
| InternVL-2.5-8B * | 14.8 | 9.6 | 5.7 | 4.8 | 6.3 | 5.7 | 8.2 | 7.9 |
| InternVL-2.5-8B * | 11.3 | 9.6 | 7.8 | 6.3 | 5.9 | 3.5 | 3.3 | 6.8 |
| Phi-3.5-4.2B | 5.2 | 7.0 | 6.8 | 6.5 | 6.5 | 7.4 | 4.9 | 6.3 |

Note: 🧠 reasoning-focused model; * majority voting over 8 samples (maj@8) instead of pass@1.

Error Analysis

BibTeX

@inproceedings{cocchieri-etal-2025-mathgames,
    title = "Can Large Language Models Win the International Mathematical Games?",
    author = "Cocchieri, Alessio  and
      Ragazzi, Luca  and
      Tagliavini, Giuseppe  and
      Tordi, Lorenzo  and
      Carbonaro, Antonella  and
      Moro, Gianluca",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    abstract = "Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization, clearly defined skill requirements, and—crucially—were not designed to assess human performance in international competitions. To address these limitations, we introduce MathGames, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championships. Spanning seven age groups and a skill-based taxonomy, MathGames enables a structured evaluation of LLMs' mathematical and logical reasoning abilities. Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants—even 11-year-olds consistently outperform some of the strongest models—highlighting the need for advancements. Further, our detailed error analysis offers valuable insights to guide future research. The data is publicly available at https://disi-unibo-nlp.github.io/math-games/."
  }