Published: Apr 7, 2026
We are proud to share that we will be at ACL 2026 with two long papers in the Main Track! Catch us in San Diego, California, to learn more about how LLMs handle medical uncertainty and about legal question answering.
LLMs (Almost) Never Abstain Under Medical Uncertainty
by A. Cocchieri, L. Ragazzi, G. Tagliavini, and G. Moro
Medical multiple-choice question answering (MCQA) benchmarks implicitly assume that large language models (LLMs) should always commit to an answer. However, in clinical practice, uncertainty is pervasive and abstaining is often the only safe action. We introduce MedQAbstain, a benchmark explicitly designed to evaluate medical abstention under uncertainty. MedQAbstain repurposes standard medical MCQA datasets by removing the gold answer and introducing an explicit "I abstain" option, framed as a safety-critical decision with clinical consequences. The benchmark supports systematic analysis across abstention regimes, distractor complexity, and input modalities, and elicits self-reported model confidence to study calibration. Across all settings, we find that state-of-the-art LLMs systematically overcommit, rarely abstaining even when the question itself is hidden. These results reveal a fundamental mismatch between LLM behavior and clinical norms, highlighting abstention as a critical but overlooked dimension of medical decision-making evaluation.
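As a rough illustration of the repurposing step described above, here is a minimal sketch of how a standard MCQA item might be turned into an abstention item. This is our own illustrative code, not the authors' released pipeline; the function name and item schema are hypothetical.

```python
# Illustrative sketch (not the authors' code): repurpose a standard MCQA item
# by removing the gold answer and appending an explicit "I abstain" option.
# With the gold answer gone, abstention becomes the only safe choice.

ABSTAIN_OPTION = "I abstain"

def make_abstention_item(question: str, options: dict[str, str], gold_key: str) -> dict:
    """Return a copy of the item without the gold answer, plus an abstain option."""
    remaining = {k: v for k, v in options.items() if k != gold_key}
    labels = "ABCDE"
    # Relabel the remaining distractors and append the abstain option last.
    relabeled = {labels[i]: text for i, text in enumerate(remaining.values())}
    relabeled[labels[len(relabeled)]] = ABSTAIN_OPTION
    return {
        "question": question,
        "options": relabeled,
        "gold": labels[len(relabeled) - 1],  # abstaining is now the correct answer
    }

# Toy usage example:
item = make_abstention_item(
    question="Which drug is first-line for condition X?",
    options={"A": "Drug 1", "B": "Drug 2", "C": "Drug 3", "D": "Drug 4"},
    gold_key="B",
)
print(item["options"])  # {'A': 'Drug 1', 'B': 'Drug 3', 'C': 'Drug 4', 'D': 'I abstain'}
print(item["gold"])     # 'D'
```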
-
The paper will be available soon!
Sycophants in the Courtroom: Are LLMs Fragile to Juridical Authority and Evolving Legal Standards?
by L. Molfetta, A. Cocchieri, L. Ragazzi, I. Bartolini, M. Patella, and G. Moro
In medicine, claims persist insofar as they withstand empirical verification against a stable biological reality; in law, by contrast, truth is contingent, defined by jurisdiction, temporal validity, and the hierarchy of authoritative sources. The recent success of Large Language Models (LLMs) on medical licensing examinations has encouraged an expectation of comparable legal competence. This analogy, however, obscures a critical distinction between domains. Unlike in medicine, legal performance often depends less on inference than on determining when external authority is applicable, valid, and non-contradictory. We introduce a comparative diagnostic framework that evaluates legal reasoning against medical baselines along four axes spanning knowledge recall, grounding, confidence, and robustness to format changes. Evaluating models on a newly introduced benchmark that explicitly encodes temporal validity and normative relationships, we uncover a sharp domain asymmetry. While medical models reliably benefit from verified sources, legal LLMs struggle to assess when retrieved citations are useful or misleading, exhibiting overconfidence in perturbed contexts and sensitivity to superficial formatting cues. Notably, increased model scale amplifies this tendency, revealing a mismatch between instruction following and epistemic reliability in law. These findings show that current LLMs treat law as unstructured text rather than binding precedent, motivating the definition of new jurisdiction-aware benchmarks that penalize uncritical reliance on spurious context.
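To give a flavor of the "superficial formatting cues" the abstract refers to, the sketch below renders the same legal question under a presentation-only perturbation (shuffled option order, alternate labels). It is a hypothetical illustration of the kind of robustness check involved, not the benchmark's actual perturbation code.

```python
# Illustrative sketch (not the authors' code): a presentation-only perturbation
# of an MCQA prompt. The content is unchanged; only formatting differs.

import random

def perturb_format(question: str, options: list[str], seed: int = 0) -> str:
    """Render the same item with shuffled option order and alternate labels."""
    rng = random.Random(seed)
    shuffled = options[:]
    rng.shuffle(shuffled)
    labels = ["(i)", "(ii)", "(iii)", "(iv)", "(v)"]
    lines = [question] + [f"{labels[i]} {opt}" for i, opt in enumerate(shuffled)]
    return "\n".join(lines)

# Toy usage example: same question, two surface forms.
q = "Which statute governs claim Y in jurisdiction Z?"
opts = ["Statute 1", "Statute 2", "Statute 3", "Statute 4"]
print(perturb_format(q, opts, seed=1))
print(perturb_format(q, opts, seed=2))
```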
-
The paper will be available soon!