EACL 2026

Published: Jan 3, 2026

We are delighted to announce that we will be at EACL 2026 with two long papers in the Main and Findings tracks! Catch us in Rabat, Morocco, to learn more about medical reliability and multimodal hate speech detection.


ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?

by A. Cocchieri, L. Ragazzi, G. Tagliavini, and G. Moro

Medical multiple-choice question answering (MCQA) benchmarks report near-human accuracy, with some approaching saturation and fueling claims of clinical readiness. Yet a single accuracy score is a poor proxy for competence: models that change answers under minor perturbations cannot be considered reliable. We argue that reliability underpins accuracy: only consistent predictions make correctness meaningful. To address this, we release ReMedQA, a benchmark suite that augments three standard medical MCQA datasets with open-answer variants and systematically perturbed items. Building on this design, we introduce ReAcc and ReCon, two reliability metrics: ReAcc measures the proportion of questions answered correctly across all variations, while ReCon measures the proportion answered consistently regardless of correctness. Our evaluation shows that high MCQA accuracy masks low reliability: models remain sensitive to format and perturbation changes, and domain specialization offers no robustness gain. MCQA further underestimates smaller models while inflating large ones that exploit structural cues, with some producing correct answers without seeing the question. These findings show that, despite near-saturated accuracy, we are not yet done with medical multiple-choice benchmarks.

  • The paper will be available soon!
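To make the two metrics concrete, here is a minimal sketch of how ReAcc and ReCon could be computed from per-question predictions, following only the definitions given in the abstract; the function names and data layout are our own illustrative assumptions, not the paper's released code.

```python
def re_acc(predictions, gold):
    """ReAcc (as defined in the abstract): fraction of questions whose
    answer is correct across *all* variations of that question."""
    return sum(
        all(p == g for p in variants)
        for variants, g in zip(predictions, gold)
    ) / len(gold)


def re_con(predictions):
    """ReCon (as defined in the abstract): fraction of questions answered
    identically across all variations, regardless of correctness."""
    return sum(len(set(variants)) == 1 for variants in predictions) / len(predictions)


# Toy example: 3 questions, each asked in 3 variants (e.g. reordered
# options, open-answer rephrasing). Letters stand for chosen options.
preds = [
    ["B", "B", "B"],  # consistent and correct
    ["A", "C", "A"],  # inconsistent
    ["D", "D", "D"],  # consistent but wrong
]
gold = ["B", "A", "C"]

print(re_acc(preds, gold))  # 1/3: only the first question is always correct
print(re_con(preds))        # 2/3: first and third are answered consistently
```

The toy example illustrates the gap the paper highlights: a model can look consistent (ReCon) while being confidently wrong, and plain accuracy on a single variant would miss both failure modes.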


MemeWeaver: Inter-Meme Graph Reasoning for Sexism and Misogyny Detection

by P. Italiani, D. G. Gómez, L. Ragazzi, G. Moro, and P. Rosso

Women are twice as likely as men to face online harassment due to their gender. Despite recent advances in multimodal content moderation, most approaches still overlook the social dynamics behind this phenomenon, where perpetrators reinforce prejudices and group identity within like-minded communities. Graph-based methods offer a promising way to capture such interactions, yet existing solutions remain limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning. In this work, we present MemeWeaver, an end-to-end trainable multimodal framework for detecting sexism and misogyny through a novel inter-meme graph reasoning mechanism. We systematically evaluate multiple visual-textual fusion strategies and show that our approach consistently outperforms state-of-the-art baselines on the MAMI and EXIST benchmarks, while achieving faster training convergence. Further analyses reveal that the learned graph structure captures semantically meaningful patterns, offering valuable insights into the relational nature of online hate.

  • The paper will be available soon!