Phunny

Testing LLM Generalization through Humor

University of Bologna, Italy
ACL 2025 Main Track

Introduction

Humor requires creativity and contextual understanding, making it a hallmark of human intelligence and a showcase of adaptability across linguistic scenarios. While recent advances in large language models (LLMs) demonstrate strong reasoning on various benchmarks, it remains unclear whether they truly adapt to new tasks as humans do (i.e., generalize) or merely replicate memorized content.

To explore this, we introduce Phunny, a new humor-based question-answering benchmark designed to assess LLMs' reasoning through carefully crafted puns. Our dataset is manually curated to ensure novelty and minimize data contamination, providing a robust evaluation of LLMs' linguistic comprehension.

Experiments on pun comprehension, resolution, and generation reveal that most LLMs struggle with generalization, even on simple tasks, consistently underperforming the human baseline. Additionally, our detailed error analysis provides valuable insights to guide future research.

Leaderboard on Phunny

Pun Comprehension

CPA (Coherent Pun Accuracy): Correctly recognizing coherent (true) puns.
MPA (Misleading Pun Accuracy): Correctly recognizing misleading (false) puns.
MPA and MPA+ distinguish whether the original pun's subject has been substituted with a dissimilar or a similar item, respectively.
| # | Model | CPA | MPA | MPA+ | Avg. |
|---|-------|-----|-----|------|------|
| 1 | o3-mini ⭐ 🧠 | 78.3 | 6.0 | 3.4 | 4.7 |
| 2 | Gemini-2.0-Flash-Think ⭐ 🧠 | 71.1 | 6.9 | 24.6 | 15.8 |
| 3 | LLaMA-3.3 (70B) | 70.0 | 29.4 | 29.1 | 29.3 |
| 4 | GPT-4o ⭐ | 64.9 | 14.9 | 17.4 | 16.2 |
| 5 | Phi-4 (14B) | 64.6 | 9.4 | 13.7 | 11.6 |
| 6 | Gemini-2.0-Flash ⭐ | 44.6 | 44.9 | 35.7 | 40.3 |
| 7 | GPT-4o-mini ⭐ | 36.3 | 10.9 | 11.7 | 23.6 |
| 8 | LLaMA-3.1 (8B) | 29.7 | 0.0 | 0.3 | 0.2 |
| 9 | Phi-3.5 (3B) | 14.9 | 48.3 | 47.1 | 47.7 |
| 10 | Humans 👤 | 87.9 | 87.3 | 94.4 | 90.9 |

Note: ⭐ Closed-source models. 🧠 Reasoning-focused models.

Pun Resolution

ACC (Accuracy): Correctly resolving the pun.
VPA (Valid Prefix Accuracy): The response starts with the question's subject.
EWA (Existing Word Accuracy): The answer is a real, existing word. (A minimal sketch of these string-level checks is given after the table below.)
| # | Model | ACC | VPA | EWA |
|---|-------|-----|-----|-----|
| 1 | o3-mini ⭐ 🧠 | 93.9 | 98.0 | 99.1 |
| 2 | GPT-4o ⭐ | 79.9 | 84.6 | 96.2 |
| 3 | Gemini-2.0-Flash-Think ⭐ 🧠 | 70.6 | 80.2 | 88.1 |
| 4 | Gemini-2.0-Flash ⭐ | 69.5 | 75.9 | 87.8 |
| 5 | LLaMA-3.3 (70B) | 67.3 | 77.4 | 83.4 |
| 6 | GPT-4o-mini ⭐ | 64.5 | 73.0 | 89.5 |
| 7 | Phi-4 (14B) | 53.9 | 69.3 | 80.5 |
| 8 | LLaMA-3.1 (8B) | 27.9 | 31.7 | 96.8 |
| 9 | Phi-3.5 (3B) | 22.4 | 27.3 | 78.5 |
| 10 | Humans 👤 | 85.7 | 95.1 | 100.0 |

Note: ⭐ Closed-source models. 🧠 Reasoning-focused models.
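
The VPA and EWA metrics defined above reduce to simple string-level tests. The following is a minimal Python sketch of how such checks could be implemented; the lower-casing, the prefix rule, and the use of NLTK's word list for the real-word check are illustrative assumptions, not the evaluation code behind the leaderboard.

```python
# Illustrative sketch of the VPA / EWA string-level checks for pun resolution.
# NOTE: an assumption of how these checks could work, not the official evaluation code.
from nltk.corpus import words  # requires a one-time nltk.download("words")

ENGLISH_WORDS = {w.lower() for w in words.words()}

def valid_prefix(answer: str, subject: str) -> bool:
    """VPA: the answer starts with the question's subject."""
    return answer.strip().lower().startswith(subject.strip().lower())

def existing_word(answer: str) -> bool:
    """EWA: the answer is a real English word (here, per the NLTK word list)."""
    return answer.strip().lower() in ENGLISH_WORDS

# Example using the pun from the paper's title:
# "What do you call a dog that is incontrovertibly true?" -> "Dogma"
print(valid_prefix("Dogma", "dog"))  # True
print(existing_word("Dogma"))        # True, assuming "dogma" is in the word list
```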

Pun Generation

Constrained ACC: Accuracy in the constrained pun generation task (where both meanings must be explicitly used).
Free ACC: Accuracy in the free-form pun generation task (open-ended).
CS: Semantic correctness of the pun meaning.
CA: Acceptability of the generated pun as a natural sentence.
| # | Model | Constr. ACC | Free ACC | Free CS | Free CA |
|---|-------|-------------|----------|---------|---------|
| 1 | o3-mini ⭐ 🧠 | 93.5 | 100.0 | 38.0 | 52.0 |
| 2 | GPT-4o ⭐ | 85.3 | 46.0 | 60.0 | 88.0 |
| 3 | Gemini-2.0-Flash ⭐ | 80.1 | 40.0 | 48.0 | 78.0 |
| 4 | Gemini-2.0-Flash-Think ⭐ 🧠 | 66.7 | 36.0 | 58.0 | 82.0 |
| 5 | LLaMA-3.3 (70B) | 25.5 | 15.0 | 15.0 | 57.5 |
| 6 | GPT-4o-mini ⭐ | 41.7 | 24.0 | 36.0 | 84.0 |
| 7 | Phi-4 (14B) | 15.4 | 6.0 | 34.0 | 76.0 |
| 8 | LLaMA-3.1 (8B) | 13.1 | 20.5 | 59.0 | 89.7 |
| 9 | Phi-3.5 (3B) | 4.7 | 95.4 | 2.3 | 7.0 |
| 10 | Humans 👤 | 88.7 | 92.8 | 82.3 | 95.2 |

Note: ⭐ Closed-source models. 🧠 Reasoning-focused models.

Phunny Dataset

Overview

Phunny is a meticulously curated benchmark of 350 hand-crafted, structured puns. It is built around a novel type of pun designed to evaluate the generalization capabilities of models in a question-answering (QA) setting.

The dataset can be downloaded from Hugging Face.
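
Assuming the dataset is published on the Hugging Face Hub, it can be loaded with the `datasets` library as sketched below; the repository ID and split name are placeholders to be replaced with the ones listed on the dataset page.

```python
# Hypothetical loading snippet; replace "ORG/Phunny" with the actual
# repository ID shown on the Hugging Face dataset page.
from datasets import load_dataset

phunny = load_dataset("ORG/Phunny")  # placeholder repository ID
print(phunny)              # inspect the available splits and features
print(phunny["train"][0])  # split name is an assumption; adjust as needed
```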

Error Analysis

BibTeX

@inproceedings{cocchieri-etal-2025-what,
    title = "What do you call a dog that is incontrovertibly true? Dogma: Testing LLM Generalization through Humor",
    author = "Cocchieri, Alessio  and
      Ragazzi, Luca  and
      Italiani, Paolo  and
      Tagliavini, Giuseppe and
      Moro, Gianluca",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics"
}