Phunny

Testing LLM Generalization through Humor

University of Bologna, Italy
ACL 2025 Main Track

Introduction

Humor requires creativity and contextual understanding, making it a hallmark of human intelligence and a showcase of adaptability across linguistic scenarios. While recent advances in large language models (LLMs) demonstrate strong reasoning on various benchmarks, it remains unclear whether they truly adapt to new tasks as humans do (i.e., generalize) or merely replicate memorized content.

To explore this, we introduce Phunny, a new humor-based question-answering benchmark designed to assess LLMs' reasoning through carefully crafted puns. Our dataset is manually curated to ensure novelty and minimize data contamination, providing a robust evaluation of LLMs' linguistic comprehension.

Experiments on pun comprehension, resolution, and generation reveal that most LLMs struggle with generalization, even on simple tasks, consistently underperforming the human baseline. Additionally, our detailed error analysis provides valuable insights to guide future research.

Leaderboard on Phunny

Pun Comprehension

CPA (Coherent Pun Accuracy): Correctly recognizing coherent (true) puns.
MPA (Misleading Pun Accuracy): Correctly recognizing misleading (false) puns.
MPA and MPA+ distinguish whether the original pun's subject has been substituted with a dissimilar or a similar item, respectively.
| # | Model | CPA | MPA | MPA+ | Avg. |
|---|-------|-----|-----|------|------|
| 1 | o3-mini ⭐ 🧠 | 78.3 | 6.0 | 3.4 | 4.7 |
| 2 | Gemini-2.0-Flash-Think ⭐ 🧠 | 71.1 | 6.9 | 24.6 | 15.8 |
| 3 | LLaMA-3.3 (70B) | 70.0 | 29.4 | 29.1 | 29.3 |
| 4 | GPT-4o ⭐ | 64.9 | 14.9 | 17.4 | 16.2 |
| 5 | Phi-4 (14B) | 64.6 | 9.4 | 13.7 | 11.6 |
| 6 | Gemini-2.0-Flash ⭐ | 44.6 | 44.9 | 35.7 | 40.3 |
| 7 | GPT-4o-mini ⭐ | 36.3 | 10.9 | 11.7 | 23.6 |
| 8 | LLaMA-3.1 (8B) | 29.7 | 0.0 | 0.3 | 0.2 |
| 9 | Phi-3.5 (3B) | 14.9 | 48.3 | 47.1 | 47.7 |
| 10 | Humans 👤 | 87.9 | 87.3 | 94.4 | 90.9 |

Note: ⭐ Closed-source models. 🧠 Reasoning-focused models.

Pun Resolution

ACC (Accuracy): Correctly resolving the pun.
VPA (Valid Prefix Accuracy): The response starts with the question's subject.
EWA (Existing Word Accuracy): The answer is a real, existing word. (A minimal sketch of these string-level checks is given after the table below.)
| # | Model | ACC | VPA | EWA |
|---|-------|-----|-----|-----|
| 1 | o3-mini ⭐ 🧠 | 93.9 | 98.0 | 99.1 |
| 2 | GPT-4o ⭐ | 79.9 | 84.6 | 96.2 |
| 3 | Gemini-2.0-Flash-Think ⭐ 🧠 | 70.6 | 80.2 | 88.1 |
| 4 | Gemini-2.0-Flash ⭐ | 69.5 | 75.9 | 87.8 |
| 5 | LLaMA-3.3 (70B) | 67.3 | 77.4 | 83.4 |
| 6 | GPT-4o-mini ⭐ | 64.5 | 73.0 | 89.5 |
| 7 | Phi-4 (14B) | 53.9 | 69.3 | 80.5 |
| 8 | LLaMA-3.1 (8B) | 27.9 | 31.7 | 96.8 |
| 9 | Phi-3.5 (3B) | 22.4 | 27.3 | 78.5 |
| 10 | Humans 👤 | 85.7 | 95.1 | 100.0 |

Note: ⭐ Closed-source models. 🧠 Reasoning-focused models.
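
The VPA and EWA metrics defined above reduce to simple string-level tests. The following is a minimal Python sketch of how such checks could be implemented; the lower-casing, the prefix rule, and the use of NLTK's word list for the real-word check are illustrative assumptions, not the evaluation code behind the leaderboard.

```python
# Illustrative sketch of the VPA / EWA string-level checks for pun resolution.
# NOTE: an assumption of how these checks could work, not the official evaluation code.
from nltk.corpus import words  # requires a one-time nltk.download("words")

ENGLISH_WORDS = {w.lower() for w in words.words()}

def valid_prefix(answer: str, subject: str) -> bool:
    """VPA: the answer starts with the question's subject."""
    return answer.strip().lower().startswith(subject.strip().lower())

def existing_word(answer: str) -> bool:
    """EWA: the answer is a real English word (here, per the NLTK word list)."""
    return answer.strip().lower() in ENGLISH_WORDS

# Example using the pun from the paper's title:
# "What do you call a dog that is incontrovertibly true?" -> "Dogma"
print(valid_prefix("Dogma", "dog"))  # True
print(existing_word("Dogma"))        # True, assuming "dogma" is in the word list
```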

Pun Generation

Constrained ACC: Accuracy in the constrained pun generation task (where both meanings must be explicitly used).
Free ACC: Accuracy in the free-form pun generation task (open-ended).
CS: Semantic correctness of the pun meaning.
CA: Acceptability of the generated pun as a natural sentence.
| # | Model | Constr. ACC | Free ACC | Free CS | Free CA |
|---|-------|-------------|----------|---------|---------|
| 1 | o3-mini ⭐ 🧠 | 93.5 | 100.0 | 38.0 | 52.0 |
| 2 | GPT-4o ⭐ | 85.3 | 46.0 | 60.0 | 88.0 |
| 3 | Gemini-2.0-Flash ⭐ | 80.1 | 40.0 | 48.0 | 78.0 |
| 4 | Gemini-2.0-Flash-Think ⭐ 🧠 | 66.7 | 36.0 | 58.0 | 82.0 |
| 5 | LLaMA-3.3 (70B) | 25.5 | 15.0 | 15.0 | 57.5 |
| 6 | GPT-4o-mini ⭐ | 41.7 | 24.0 | 36.0 | 84.0 |
| 7 | Phi-4 (14B) | 15.4 | 6.0 | 34.0 | 76.0 |
| 8 | LLaMA-3.1 (8B) | 13.1 | 20.5 | 59.0 | 89.7 |
| 9 | Phi-3.5 (3B) | 4.7 | 95.4 | 2.3 | 7.0 |
| 10 | Humans 👤 | 88.7 | 92.8 | 82.3 | 95.2 |

Note: ⭐ Closed-source models. 🧠 Reasoning-focused models.

Phunny Dataset

Overview

Phunny is a meticulously curated benchmark of 350 hand-crafted, structured puns. It is built around a novel type of pun designed to evaluate the generalization capabilities of models in a question-answering (QA) setting.

The dataset can be downloaded from Hugging Face.
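
Assuming the dataset is published on the Hugging Face Hub, it can be loaded with the `datasets` library as sketched below; the repository ID and split name are placeholders to be replaced with the ones listed on the dataset page.

```python
# Hypothetical loading snippet; replace "ORG/Phunny" with the actual
# repository ID shown on the Hugging Face dataset page.
from datasets import load_dataset

phunny = load_dataset("ORG/Phunny")  # placeholder repository ID
print(phunny)              # inspect the available splits and features
print(phunny["train"][0])  # split name is an assumption; adjust as needed
```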

Error Analysis

BibTeX

@inproceedings{cocchieri-etal-2025-what,
    title = "What do you call a dog that is incontrovertibly true? Dogma: Testing LLM Generalization through Humor",
    author = "Cocchieri, Alessio  and
      Ragazzi, Luca  and
      Italiani, Paolo  and
      Tagliavini, Giuseppe and
      Moro, Gianluca",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics"
}