Can Large Language Models Win the International Mathematical Games?

LLM accuracy across skill categories and age groups. Top: performance on text-only problems. Bottom: performance on image-based problems (accuracy scale 0-70 for better visualization). Skills include Arithmetic (Ari), Logic (Log), Geometry (Geo), Combinatorics (Com), Algebra (Alg), and Pattern Recognition (Pat).

Introduction

Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization and clearly defined skill requirements and, crucially, were not designed to assess human performance in international competitions.

To address these limitations, we introduce MathGames, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championship. Spanning seven age groups and a skill-based taxonomy, MathGames enables a structured evaluation of LLMs' mathematical and logical reasoning abilities.

Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants: even 11-year-olds consistently outperform some of the strongest models, which highlights the need for advances in LLM mathematical reasoning. In addition, our detailed error analysis offers insights to guide future research.

Overview

MathGames is a carefully designed benchmark for evaluating the mathematical and logical reasoning abilities of foundation models on 2,183 high-quality, playful-style problems (1,389 textual and 794 multimodal) spanning different age categories, all in an open-ended format (i.e., without multiple-choice answers).

You can download the dataset from Hugging Face.

Statistics of MathGames, including problem count and word lengths. The overall count is not the sum of category-specific counts due to overlapping problems.

Textual Problems

CE, C1, C2, L1, L2, GP, HC: Accuracy (%) in each age group.
Avg: Overall average.

Model	CE	C1	C2	L1	L2	GP	HC	Avg
Closed-Source
o3-mini-high 🧠	83.8	82.0	81.7	80.6	79.2	77.1	73.3	79.7
Gemini-2.0-Flash-T 🧠	81.3	71.4	71.0	69.5	66.4	61.9	59.2	68.7
Gemini-2.0-Flash	58.9	56.8	55.4	54.0	51.3	43.7	41.2	51.6
Gemini-1.5-Pro	59.8	54.3	53.2	52.4	50.2	43.9	41.2	50.7
Gemini-1.5-Flash	60.7	49.7	47.4	45.5	42.9	36.9	36.0	45.6
GPT-4o	61.7	50.1	46.2	43.8	42.3	35.0	33.0	44.6
GPT-4o-mini	49.5	42.2	42.8	41.5	39.8	31.4	30.0	40.4
Gemini-1.5-Flash-8B	40.2	35.1	33.9	31.2	29.3	21.8	20.6	31.6
Open-Source > 8B
DeepSeek-R1 🧠	85.0	77.3	75.9	74.7	72.7	69.7	69.0	74.9
DeepSeek-V3	66.4	54.3	52.1	50.7	48.3	40.8	36.5	49.9
Phi-4-14B *	66.4	51.9	48.1	46.0	43.7	37.0	32.2	46.5
Phi-4-14B	59.8	50.4	45.9	43.6	41.1	33.6	30.0	43.5
Qwen2.5-72B	53.3	48.4	45.2	43.3	41.4	34.4	29.6	42.2
QwQ-32B 🧠	56.1	43.5	40.0	37.3	34.4	25.4	23.2	37.1
LLaMA-3.3-70B	44.9	41.5	39.7	37.3	35.7	26.3	26.2	35.9
DeepSeek-R1-Qwen 🧠	44.2	38.7	38.0	36.3	33.4	25.8	19.8	33.7
Open-Source ≤ 8B (Math-Specialized)
Qwen2.5-Math-7B * 🔧	53.3	47.9	48.3	47.3	46.9	39.8	34.1	45.4
Qwen2.5-Math-7B 🔧	43.9	45.4	44.1	42.2	41.2	33.6	31.3	40.2
NuminaMath-7B * 🔧	43.0	38.8	38.0	36.7	35.3	26.7	24.5	34.7
Qwen2.5-Math-7B *	40.7	37.0	38.3	36.7	35.6	27.5	26.1	34.6
Qwen2.5-Math-7B	40.2	36.5	37.8	36.2	35.1	26.9	24.9	33.9
NuminaMath-7B 🔧	39.2	27.6	31.1	29.4	28.3	19.8	18.0	27.7
NuminaMath-7B	28.8	25.6	24.9	24.1	23.1	25.4	22.4	24.9
Mathstral-7B *	35.5	27.2	26.1	23.6	21.8	16.7	12.4	23.3
NuminaMath-7B *	31.8	25.2	25.4	24.2	22.7	13.1	9.0	21.6
DeepSeek-Math-7B * 🔧	23.4	24.4	23.7	22.9	21.3	15.1	14.2	20.7
Mathstral-7B	27.1	22.0	23.4	21.3	20.1	12.2	11.2	19.6
DeepSeek-Math-7B *	21.3	21.6	21.9	20.7	19.6	13.2	10.2	18.4
DeepSeek-Math-7B 🔧	21.1	21.4	21.7	20.5	19.3	12.8	9.8	18.1
DeepSeek-Math-7B	20.6	21.0	21.4	20.1	18.9	12.5	9.4	17.7
ToRA-7B * 🔧	12.2	11.6	12.1	11.5	11.1	7.6	6.4	10.4
ToRA-7B 🔧	6.5	11.1	12.4	11.8	11.3	9.3	7.7	10.0

Note: 🧠 Reasoning-focused model; * maj@8 (majority vote over 8 sampled answers) instead of pass@1 (a single sampled answer); 🔧 TIR (tool-integrated reasoning) mode.
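
The difference between pass@1 and maj@8 scoring can be illustrated with a minimal sketch. It assumes answers are compared by exact match after a simple normalization; the helper names (`normalize`, `pass_at_1`, `maj_at_k`) and the sample answers are hypothetical and are not taken from the MathGames evaluation code.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def pass_at_1(sampled_answer: str, reference: str) -> bool:
    """pass@1: a single sampled answer is compared with the reference."""
    return normalize(sampled_answer) == normalize(reference)


def maj_at_k(sampled_answers: list[str], reference: str) -> bool:
    """maj@k: the most frequent of k sampled answers is compared with the reference."""
    votes = Counter(normalize(a) for a in sampled_answers)
    majority_answer, _ = votes.most_common(1)[0]
    return majority_answer == normalize(reference)


# Eight illustrative samples for one open-ended problem whose reference answer is "12".
samples = ["12", "12", "15", "12", "9", "12", "15", "12"]
print(maj_at_k(samples, "12"))      # majority answer "12" is correct
print(pass_at_1(samples[2], "12"))  # the single sample "15" is wrong
```

Majority voting can recover a correct answer even when individual samples are wrong, which is one reason the starred rows are not directly comparable with the pass@1 rows.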

Multimodal Problems

CE, C1, C2, L1, L2, GP, HC: Accuracy (%) in each age group.
Avg: Overall average.

Model	CE	C1	C2	L1	L2	GP	HC	Avg
Closed-Source
Gemini-2.0-Flash-T 🧠	38.3	29.3	32.2	31.5	31.3	25.4	25.1	30.4
Gemini-1.5-Pro	30.4	25.5	24.2	21.3	20.4	18.4	15.3	22.2
Gemini-1.5-Flash	27.0	19.4	16.1	15.0	15.6	12.7	14.2	17.1
GPT-4o	25.2	20.4	17.8	14.8	12.9	10.2	10.9	16.0
GPT-4o-mini	23.5	18.5	16.1	13.4	12.1	10.2	11.5	15.0
Gemini-1.5-Flash-8B	18.3	14.3	12.3	11.3	11.3	9.2	10.9	12.5
Open-Source > 8B
InternVL-2.5-38B-MPO	19.1	21.0	19.7	17.1	16.4	12.7	12.0	16.9
InternVL-2.5-38B	14.8	14.7	13.6	11.5	9.9	7.4	6.6	11.2
QVQ-72B 🧠	20.0	11.8	8.9	7.5	7.1	6.7	6.6	9.8
Qwen2-VL-72B	14.8	12.5	11.2	8.8	7.5	6.0	3.8	9.2
Pixtral-12B *	12.2	6.4	5.9	5.2	4.8	5.3	4.4	6.3
Pixtral-12B	11.3	8.9	6.1	4.0	2.6	4.2	3.3	5.8
Open-Source ≤ 8B
Phi-3.5-4.2B *	24.4	11.2	11.4	11.1	10.3	7.4	7.1	11.5
Qwen2-VL-7B *	13.0	11.5	10.8	10.2	9.3	9.2	7.7	10.2
Qwen2-VL-7B	13.9	9.2	10.0	8.8	7.9	4.6	4.9	8.5
InternVL-2.5-8B *	14.8	9.6	5.7	4.8	6.3	5.7	8.2	7.9
InternVL-2.5-8B *	11.3	9.6	7.8	6.3	5.9	3.5	3.3	6.8
Phi-3.5-4.2B	5.2	7.0	6.8	6.5	6.5	7.4	4.9	6.3

Note: 🧠 Reasoning-focused model; * maj@8 (majority vote over 8 sampled answers) instead of pass@1 (a single sampled answer).
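
For readers who want to sanity-check the Avg column, the sketch below recomputes it as the unweighted mean of the seven age-group accuracies for three rows copied from the multimodal table. That Avg is computed this way is an assumption inferred from the numbers, not a definition stated on this page, and the names `ROWS` and `macro_average` are illustrative.

```python
# Age-group accuracies (CE, C1, C2, L1, L2, GP, HC) and the reported Avg,
# copied from three rows of the multimodal table.
ROWS = {
    "Gemini-2.0-Flash-T": ([38.3, 29.3, 32.2, 31.5, 31.3, 25.4, 25.1], 30.4),
    "Gemini-1.5-Pro": ([30.4, 25.5, 24.2, 21.3, 20.4, 18.4, 15.3], 22.2),
    "GPT-4o": ([25.2, 20.4, 17.8, 14.8, 12.9, 10.2, 10.9], 16.0),
}


def macro_average(group_accuracies: list[float]) -> float:
    """Unweighted mean over age groups, rounded to one decimal place."""
    return round(sum(group_accuracies) / len(group_accuracies), 1)


for model, (groups, reported) in ROWS.items():
    print(f"{model}: recomputed {macro_average(groups)}, reported {reported}")
```

For these three rows the recomputed value agrees with the reported Avg. Because each age group counts equally under this reading, the Avg would not be weighted by the number of problems in each group.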

Examples of errors made by Gemini models in text-only problems within MathGames.

Examples of errors made by OpenAI models in multimodal problems within MathGames.

BibTeX

@inproceedings{cocchieri-etal-2025-large,
    title = "Can Large Language Models Win the International Mathematical Games?",
    author = "Cocchieri, Alessio  and
      Ragazzi, Luca  and
      Tagliavini, Giuseppe  and
      Tordi, Lorenzo  and
      Carbonaro, Antonella  and
      Moro, Gianluca",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.488/",
    doi = "10.18653/v1/2025.emnlp-main.488",
    pages = "9645--9671",
    ISBN = "979-8-89176-332-6",
    abstract = "Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization, clearly defined skill requirements, and{---}crucially{---}were not designed to assess human performance in international competitions. To address these limitations, we introduce MathGames, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championships. Spanning seven age groups and a skill-based taxonomy, MathGames enables a structured evaluation of LLMs' mathematical and logical reasoning abilities. Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants{---}even 11-year-olds consistently outperform some of the strongest models{---}highlighting the need for advancements. Further, our detailed error analysis offers valuable insights to guide future research. The data is publicly available at https://disi-unibo-nlp.github.io/math-games."
}
