0 publications

For previous works, please refer to this link.

Conference Proceedings

LLMs (Almost) Never Abstain Under Medical Uncertainty

Alessio Cocchieri, Luca Ragazzi, Giuseppe Tagliavini, Gianluca Moro

ACL 2026 Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Medical multiple-choice question answering (MCQA) benchmarks implicitly assume that large language models (LLMs) should always commit to an answer. However, in clinical practice, uncertainty is pervasive and abstaining is often the only safe action. We introduce MedQAbstain, a benchmark explicitly designed to evaluate medical abstention under uncertainty. MedQAbstain repurposes standard medical MCQA datasets by removing the gold answer and introducing an explicit "I abstain" option, framed as a safety-critical decision with clinical consequences. The benchmark supports systematic analysis across abstention regimes, distractor complexity, and input modalities, and elicits self-reported model confidence to study calibration. Across all settings, we find that state-of-the-art LLMs systematically overcommit, rarely abstaining even when the question itself is hidden. These results reveal a fundamental mismatch between LLM behavior and clinical norms, highlighting abstention as a critical but overlooked dimension of medical decision-making evaluation. More

Medical multiple-choice question answering (MCQA) benchmarks implicitly assume that large language models (LLMs) should always commit to an answer. However, in clinical practice, uncertainty is pervasive and abstaining is often the only safe action. We introduce MedQAbstain, a benchmark explicitly designed to evaluate medical abstention under uncertainty. MedQAbstain repurposes standard medical MCQA datasets by removing the gold answer and introducing an explicit "I abstain" option, framed as a safety-critical decision with clinical consequences. The benchmark supports systematic analysis across abstention regimes, distractor complexity, and input modalities, and elicits self-reported model confidence to study calibration. Across all settings, we find that state-of-the-art LLMs systematically overcommit, rarely abstaining even when the question itself is hidden. These results reveal a fundamental mismatch between LLM behavior and clinical norms, highlighting abstention as a critical but overlooked dimension of medical decision-making evaluation. Less
Sycophants in the Courtroom: Are LLMs Fragile to Juridical Authority and Evolving Legal Standards?

Lorenzo Molfetta, Alessio Cocchieri, Luca Ragazzi, Ilaria Bartolini, Marco Patella, Gianluca Moro

ACL 2026 Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In medicine, claims persist insofar as they withstand empirical verification against a stable biological reality; in law, by contrast, truth is contingent, defined by jurisdiction, temporal validity, and the hierarchy of authoritative sources. The recent success of Large Language Models (LLMs) on medical licensing examinations has encouraged an expectation of comparable legal competence. This analogy, however, obscures a critical distinction between domains. Unlike in medicine, legal performance often depends less on inference than on determining when external authority is applicable, valid, and non-contradictory. We introduce a comparative diagnostic framework that evaluates legal reasoning against medical baselines along four axes spanning knowledge recall, grounding, confidence, and robustness to format changes. Evaluating models on a newly introduced benchmark that explicitly encodes temporal validity and normative relationships, we uncover a sharp domain asymmetry. While medical models reliably benefit from verified sources, legal LLMs struggle to assess when retrieved citations are useful or misleading, exhibiting overconfidence in perturbed contexts and sensitivity to superficial formatting cues. Notably, increased model scale amplifies this tendency, revealing a mismatch between instruction following and epistemic reliability in law. These findings show that current LLMs treat law as unstructured text rather than binding precedent, motivating the definition of new jurisdiction-aware benchmarks that penalize uncritical reliance on spurious context. More

In medicine, claims persist insofar as they withstand empirical verification against a stable biological reality; in law, by contrast, truth is contingent, defined by jurisdiction, temporal validity, and the hierarchy of authoritative sources. The recent success of Large Language Models (LLMs) on medical licensing examinations has encouraged an expectation of comparable legal competence. This analogy, however, obscures a critical distinction between domains. Unlike in medicine, legal performance often depends less on inference than on determining when external authority is applicable, valid, and non-contradictory. We introduce a comparative diagnostic framework that evaluates legal reasoning against medical baselines along four axes spanning knowledge recall, grounding, confidence, and robustness to format changes. Evaluating models on a newly introduced benchmark that explicitly encodes temporal validity and normative relationships, we uncover a sharp domain asymmetry. While medical models reliably benefit from verified sources, legal LLMs struggle to assess when retrieved citations are useful or misleading, exhibiting overconfidence in perturbed contexts and sensitivity to superficial formatting cues. Notably, increased model scale amplifies this tendency, revealing a mismatch between instruction following and epistemic reliability in law. These findings show that current LLMs treat law as unstructured text rather than binding precedent, motivating the definition of new jurisdiction-aware benchmarks that penalize uncritical reliance on spurious context. Less
ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?

Alessio Cocchieri, Luca Ragazzi, Giuseppe Tagliavini, Gianluca Moro

EACL 2026 Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Try!

Code

Read

Medical multiple-choice question answering (MCQA) benchmarks report near-human accuracy, with some approaching saturation and fueling claims of clinical readiness. Yet a single accuracy score is a poor proxy for competence: models that change answers under minor perturbations cannot be considered reliable. We argue that reliability underpins accuracy–only consistent predictions make correctness meaningful. To address this, we release ReMedQA, a benchmark suite that augments three standard medical MCQA datasets with open-answer variants and systematically perturbed items. Building on this design, we introduce ReAcc and ReCon, two reliability metrics: ReAcc measures the proportion of questions answered correctly across all variations, while ReCon measures the proportion answered consistently regardless of correctness. Our evaluation shows that high MCQA accuracy masks low reliability: models remain sensitive to format and perturbation changes, and domain specialization offers no robustness gain. MCQA further underestimates smaller models while inflating large ones that exploit structural cues–with some producing correct answers without seeing the question. These findings show that, despite near-saturated accuracy, we are not yet done with medical multiple-choice benchmarks. More

Medical multiple-choice question answering (MCQA) benchmarks report near-human accuracy, with some approaching saturation and fueling claims of clinical readiness. Yet a single accuracy score is a poor proxy for competence: models that change answers under minor perturbations cannot be considered reliable. We argue that reliability underpins accuracy–only consistent predictions make correctness meaningful. To address this, we release ReMedQA, a benchmark suite that augments three standard medical MCQA datasets with open-answer variants and systematically perturbed items. Building on this design, we introduce ReAcc and ReCon, two reliability metrics: ReAcc measures the proportion of questions answered correctly across all variations, while ReCon measures the proportion answered consistently regardless of correctness. Our evaluation shows that high MCQA accuracy masks low reliability: models remain sensitive to format and perturbation changes, and domain specialization offers no robustness gain. MCQA further underestimates smaller models while inflating large ones that exploit structural cues–with some producing correct answers without seeing the question. These findings show that, despite near-saturated accuracy, we are not yet done with medical multiple-choice benchmarks. Less
MemeWeaver: Inter-Meme Graph Reasoning for Sexism and Misogyny Detection

Paolo Italiani, David Gimeno-Gómez, Luca Ragazzi, Gianluca Moro, Paolo Rosso

EACL 2026 Findings of the Association for Computational Linguistics: EACL 2026

Code

Read

Women are twice as likely as men to face online harassment due to their gender. Despite recent advances in multimodal content moderation, most approaches still overlook the social dynamics behind this phenomenon, where perpetrators reinforce prejudices and group identity within like-minded communities. Graph-based methods offer a promising way to capture such interactions, yet existing solutions remain limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning. In this work, we present MemeWeaver, an end-to-end trainable multimodal framework for detecting sexism and misogyny through a novel inter-meme graph reasoning mechanism. We systematically evaluate multiple visual-textual fusion strategies and show that our approach consistently outperforms state-of-the-art baselines on the MAMI and EXIST benchmarks, while achieving faster training convergence. Further analyses reveal that the learned graph structure captures semantically meaningful patterns, offering valuable insights into the relational nature of online hate. More

Women are twice as likely as men to face online harassment due to their gender. Despite recent advances in multimodal content moderation, most approaches still overlook the social dynamics behind this phenomenon, where perpetrators reinforce prejudices and group identity within like-minded communities. Graph-based methods offer a promising way to capture such interactions, yet existing solutions remain limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning. In this work, we present MemeWeaver, an end-to-end trainable multimodal framework for detecting sexism and misogyny through a novel inter-meme graph reasoning mechanism. We systematically evaluate multiple visual-textual fusion strategies and show that our approach consistently outperforms state-of-the-art baselines on the MAMI and EXIST benchmarks, while achieving faster training convergence. Further analyses reveal that the learned graph structure captures semantically meaningful patterns, offering valuable insights into the relational nature of online hate. Less
Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro

AAAI 2026 Proceedings of the 40th AAAI Conference on Artificial Intelligence (AAAI 2026)

Read

Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks—predominantly boxes with numeric identifiers—before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. On these premises, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization up to 11 percentage points. More

Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks—predominantly boxes with numeric identifiers—before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. On these premises, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization up to 11 percentage points. Less
Can Large Language Models Win the International Mathematical Games?

Alessio Cocchieri, Luca Ragazzi, Giuseppe Tagliavini, Lorenzo Tordi, Antonella Carbonaro, Gianluca Moro

EMNLP 2025 Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Try!

Code

Read

Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization, clearly defined skill requirements, and—crucially—were not designed to assess human performance in international competitions. To address these limitations, we introduce MathGames, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championships. Spanning seven age groups and a skill-based taxonomy, MathGames enables a structured evaluation of LLMs’ mathematical and logical reasoning abilities. Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants—even 11-year-olds consistently outperform some of the strongest models—highlighting the need for advancements. Further, our detailed error analysis offers valuable insights to guide future research. The data is publicly available at https://disi-unibo-nlp.github.io/math-games. More

Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization, clearly defined skill requirements, and—crucially—were not designed to assess human performance in international competitions. To address these limitations, we introduce MathGames, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championships. Spanning seven age groups and a skill-based taxonomy, MathGames enables a structured evaluation of LLMs’ mathematical and logical reasoning abilities. Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants—even 11-year-olds consistently outperform some of the strongest models—highlighting the need for advancements. Further, our detailed error analysis offers valuable insights to guide future research. The data is publicly available at https://disi-unibo-nlp.github.io/math-games. Less
PORTS: Preference-Optimized Retrievers for Tool Selection with Large Language Models

Lorenzo Molfetta, Giacomo Frisoni, Nicolò Monaldini, Gianluca Moro

EMNLP 2025 Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Code

Read

Integrating external tools with Large Language Models (LLMs) has emerged as a promising paradigm for accomplishing complex tasks. Since LLMs still struggle to effectively manage large tool collections, researchers have begun exploring retrieval-based methods to pre-select the most relevant options, addressing input length and latency constraints. However, existing retrievers are often misaligned with tool-calling LLMs due to their separate training processes. This paper presents PORTS, a novel odds ratio preference optimization method for training retrievers aimed at tool selection. Using a perplexity-inspired preference signal from a frozen LLM, our approach fine-tunes a retriever to find helpful tools by optimizing the correlation between the selection probabilities and the downstream performances while jointly enforcing a contrastive semantic loss between documentation strings. The versatility of PORTS and its ability to significantly improve tool selection accuracy are demonstrated through extensive experiments on six datasets, two encoder models, and three LLMs with diverse prior knowledge. With low computational demands, our alignment process facilitates generalization to new queries and tools, proving valuable for practical applications with evolving toolsets. More

Integrating external tools with Large Language Models (LLMs) has emerged as a promising paradigm for accomplishing complex tasks. Since LLMs still struggle to effectively manage large tool collections, researchers have begun exploring retrieval-based methods to pre-select the most relevant options, addressing input length and latency constraints. However, existing retrievers are often misaligned with tool-calling LLMs due to their separate training processes. This paper presents PORTS, a novel odds ratio preference optimization method for training retrievers aimed at tool selection. Using a perplexity-inspired preference signal from a frozen LLM, our approach fine-tunes a retriever to find helpful tools by optimizing the correlation between the selection probabilities and the downstream performances while jointly enforcing a contrastive semantic loss between documentation strings. The versatility of PORTS and its ability to significantly improve tool selection accuracy are demonstrated through extensive experiments on six datasets, two encoder models, and three LLMs with diverse prior knowledge. With low computational demands, our alignment process facilitates generalization to new queries and tools, proving valuable for practical applications with evolving toolsets. Less
FEAST: Retrieval-Augmented Multi-Hierarchical Food Classification for the FoodEx2 System

Lorenzo Molfetta, Alessio Cocchieri, Stefano Fantazzini, Giacomo Frisoni, Luca Ragazzi, Gianluca Moro

ECAI 2025 Proceedings of the 28th European Conference on Artificial Intelligence

Read

Hierarchical text classification (HTC) and extreme multi-label classification (XML) tasks face compounded challenges from complex label interdependencies, data sparsity, and extreme output dimensions. These challenges are exemplified in the European Food Safety Authority's FoodEx2 system—a standardized food classification framework essential for food consumption monitoring and contaminant exposure assessment across Europe. FoodEx2 coding transforms natural language food descriptions into a set of codes from multiple standardized hierarchies, but faces implementation barriers due to its complex structure. Given a food description (e.g., 'organic yogurt'), the system identifies its base term ('yogurt'), all the applicable facet categories (e.g., 'production method'), and then, every relevant facet descriptors to each category (e.g., 'organic production'). While existing approaches perform adequately on well-balanced and semantically dense hierarchies, no work has been applied on the practical constraints imposed by the FoodEx2 system. The limited literature addressing such real-world scenarios further compounds these challenges. We propose FEAST (Food Embedding And Semantic Taxonomy), a novel retrieval-augmented framework that decomposes the FoodEx2 classification challenge into a three-stage approach: (1) base term identification, (2) multi-label facet prediction, and (3) facet descriptor assignment. By leveraging the system's hierarchical structure to guide training and performing deep metric learning, FEAST learns discriminative embeddings that mitigate data sparsity and improve generalization on rare and fine-grained labels. Evaluated on the multilingual FoodEx2 benchmark, FEAST outperforms the prior European's CNN baseline F1 scores by 12—38% on rare classes. More

Hierarchical text classification (HTC) and extreme multi-label classification (XML) tasks face compounded challenges from complex label interdependencies, data sparsity, and extreme output dimensions. These challenges are exemplified in the European Food Safety Authority's FoodEx2 system—a standardized food classification framework essential for food consumption monitoring and contaminant exposure assessment across Europe. FoodEx2 coding transforms natural language food descriptions into a set of codes from multiple standardized hierarchies, but faces implementation barriers due to its complex structure. Given a food description (e.g., 'organic yogurt'), the system identifies its base term ('yogurt'), all the applicable facet categories (e.g., 'production method'), and then, every relevant facet descriptors to each category (e.g., 'organic production'). While existing approaches perform adequately on well-balanced and semantically dense hierarchies, no work has been applied on the practical constraints imposed by the FoodEx2 system. The limited literature addressing such real-world scenarios further compounds these challenges. We propose FEAST (Food Embedding And Semantic Taxonomy), a novel retrieval-augmented framework that decomposes the FoodEx2 classification challenge into a three-stage approach: (1) base term identification, (2) multi-label facet prediction, and (3) facet descriptor assignment. By leveraging the system's hierarchical structure to guide training and performing deep metric learning, FEAST learns discriminative embeddings that mitigate data sparsity and improve generalization on rare and fine-grained labels. Evaluated on the multilingual FoodEx2 benchmark, FEAST outperforms the prior European's CNN baseline F1 scores by 12—38% on rare classes. Less
Magic Mirror on the Wall, Which is the Fairest Prompt of All? A Survey on Automatic Prompt Learning

Stefano Fantazzini, Giacomo Frisoni, Gianluca Moro, Luca Ragazzi, Mario Ciccioni, Claudio Sartori

ECAI 2025 Proceedings of the 28th European Conference on Artificial Intelligence

Read

Prompts direct the behavior of a model by conditioning its outputs on carefully designed instructions and examples, similar to setting the trajectory of an arrow before release. More broadly, prompt learning is the research area that aims to solve downstream tasks by directly leveraging the knowledge acquired by language models at pre-training time, removing the need for expensive fine-tuning stages with potentially different objective functions. While manual prompt engineering has enabled both small and large language models to achieve superhuman performance on numerous benchmarks, it remains a labor-intensive and suboptimal process. Recently, the field has shifted towards automating the search for prompts that effectively elicit the desired model responses. This survey presents the first systematic review of prompt learning for pre-trained language models operating on text inputs, with a particular focus on automatic methods. We critically analyze existing publications and organize them into a novel taxonomy, describing key aspects for practical usage. We finally discuss promising directions for future research. Our curated repository of annotated papers, continuously updated, is available at https://github.com/disi-unibo-nlp/awesome-prompt-learning. More

Prompts direct the behavior of a model by conditioning its outputs on carefully designed instructions and examples, similar to setting the trajectory of an arrow before release. More broadly, prompt learning is the research area that aims to solve downstream tasks by directly leveraging the knowledge acquired by language models at pre-training time, removing the need for expensive fine-tuning stages with potentially different objective functions. While manual prompt engineering has enabled both small and large language models to achieve superhuman performance on numerous benchmarks, it remains a labor-intensive and suboptimal process. Recently, the field has shifted towards automating the search for prompts that effectively elicit the desired model responses. This survey presents the first systematic review of prompt learning for pre-trained language models operating on text inputs, with a particular focus on automatic methods. We critically analyze existing publications and organize them into a novel taxonomy, describing key aspects for practical usage. We finally discuss promising directions for future research. Our curated repository of annotated papers, continuously updated, is available at https://github.com/disi-unibo-nlp/awesome-prompt-learning. Less
Predicting Protein Functions with Ensemble Deep Learning and Protein Language Models

Giacomo Frisoni, Marcello Fuschi, Gianluca Moro

ECAI 2025 Proceedings of the 28th European Conference on Artificial Intelligence (Demonstrations Track)

Read

Understanding protein functions enables deciphering cellular mechanisms and improving healthcare outcomes, from disease diagnosis to targeted therapy. We present GOMix, an ensemble learning method for predicting the functions of newly discovered proteins, packaged within an easy-to-use web application. By combining seven complementary base predictors---including sequence homology and protein language models, GOMix achieves competitive or state-of-the-art performance in the CAFA-3 challenge. Unlike existing solutions, GOMix is entirely open-source, modular, and computationally low-resource. The code is publicly available at https://github.com/disi-unibo-nlp/gomix (MIT License). More

Understanding protein functions enables deciphering cellular mechanisms and improving healthcare outcomes, from disease diagnosis to targeted therapy. We present GOMix, an ensemble learning method for predicting the functions of newly discovered proteins, packaged within an easy-to-use web application. By combining seven complementary base predictors---including sequence homology and protein language models, GOMix achieves competitive or state-of-the-art performance in the CAFA-3 challenge. Unlike existing solutions, GOMix is entirely open-source, modular, and computationally low-resource. The code is publicly available at https://github.com/disi-unibo-nlp/gomix (MIT License). Less
"What do you call a dog that is incontrovertibly true? Dogma": Testing LLM Generalization through Humor

Alessio Cocchieri, Luca Ragazzi, Paolo Italiani, Giuseppe Tagliavini, Gianluca Moro

ACL 2025 Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Try!

Code

Cite

Read

Humor, requiring creativity and contextual understanding, is a hallmark of human intelligence, showcasing adaptability across linguistic scenarios. While recent advances in large language models (LLMs) demonstrate strong reasoning on various benchmarks, it remains unclear whether they truly adapt to new tasks like humans (i.e., generalize) or merely replicate memorized content. To explore this, we introduce Phunny, a new humor-based question-answering benchmark designed to assess LLMs' reasoning through carefully crafted puns. Our dataset is manually curated to ensure novelty and minimize data contamination, providing a robust evaluation of LLMs' linguistic comprehension. Experiments on pun comprehension, resolution, and generation reveal that most LLMs struggle with generalization, even on simple tasks, consistently underperforming the human baseline. Additionally, our detailed error analysis provides valuable insights to guide future research. The data is available at https://anonymous.4open.science/r/phunny/. More

Humor, requiring creativity and contextual understanding, is a hallmark of human intelligence, showcasing adaptability across linguistic scenarios. While recent advances in large language models (LLMs) demonstrate strong reasoning on various benchmarks, it remains unclear whether they truly adapt to new tasks like humans (i.e., generalize) or merely replicate memorized content. To explore this, we introduce Phunny, a new humor-based question-answering benchmark designed to assess LLMs' reasoning through carefully crafted puns. Our dataset is manually curated to ensure novelty and minimize data contamination, providing a robust evaluation of LLMs' linguistic comprehension. Experiments on pun comprehension, resolution, and generation reveal that most LLMs struggle with generalization, even on simple tasks, consistently underperforming the human baseline. Additionally, our detailed error analysis provides valuable insights to guide future research. The data is available at https://anonymous.4open.science/r/phunny/. Less
ZeroNER: Fueling Zero-Shot Named Entity Recognition via Entity Type Descriptions

Alessio Cocchieri, Marcos Martínez Galindo, Giacomo Frisoni, Gianluca Moro, Claudio Sartori, Giuseppe Tagliavini

ACL 2025 Findings of the Association for Computational Linguistics: ACL 2025

Cite

Read

In real-world Named Entity Recognition (NER), annotation scarcity and the challenge of unseen entity types make zero-shot learning essential. While Large Language Models (LLMs) possess vast parametric knowledge, they fall short in cost-effectiveness compared to specialized encoders. Current zero-shot methods often rely solely on entity type names, overlooking both the critical role of descriptions in resolving definition ambiguities and the issue of type leakage during pretraining. In this work, we introduce ZeroNER, a description-driven framework that enhances zero-shot NER in low-resource settings. By leveraging entity type descriptions through cross-attention, ZeroNER enables a BERT-based student model to identify any entity type without additional training. Evaluated on three real-world zero-shot benchmarks under a rigorous hard zero-shot setting, ZeroNER consistently outperforms several LLMs by up to 15% in F1 score and surpasses alternative lightweight methods that rely solely on type names. Furthermore, our findings reveal that many LLMs significantly benefit from using type descriptions, underscoring their potential in zero-shot NER. More

In real-world Named Entity Recognition (NER), annotation scarcity and the challenge of unseen entity types make zero-shot learning essential. While Large Language Models (LLMs) possess vast parametric knowledge, they fall short in cost-effectiveness compared to specialized encoders. Current zero-shot methods often rely solely on entity type names, overlooking both the critical role of descriptions in resolving definition ambiguities and the issue of type leakage during pretraining. In this work, we introduce ZeroNER, a description-driven framework that enhances zero-shot NER in low-resource settings. By leveraging entity type descriptions through cross-attention, ZeroNER enables a BERT-based student model to identify any entity type without additional training. Evaluated on three real-world zero-shot benchmarks under a rigorous hard zero-shot setting, ZeroNER consistently outperforms several LLMs by up to 15% in F1 score and surpasses alternative lightweight methods that rely solely on type names. Furthermore, our findings reveal that many LLMs significantly benefit from using type descriptions, underscoring their potential in zero-shot NER. Less
Neuro-Symbolic Artificial Intelligence: A Task-Directed Survey in the Black-Box Era

Giovanni Pio Delvecchio, Lorenzo Molfetta, Gianluca Moro

IJCAI 2025 Proceedings of the 34th International Joint Conference on Artificial Intelligence

Code

Cite

Read

The integration of symbolic computing with neural networks has intrigued researchers since the first theorizations of Artificial intelligence (AI). The ability of Neuro-Symbolic (NeSy) methods to infer or exploit behavioral schema has been widely considered as one of the possible proxies for human-level intelligence. However, the limited semantic generalizability and the challenges in declining complex domains with pre-defined patterns and rules hinder their practical implementation in real-world scenarios. The unprecedented results achieved by connectionist systems since the last AI breakthrough in 2017 have raised questions about the competitiveness of NeSy solutions, with particular emphasis on the Natural Language Processing and Computer Vision fields. This survey examines task-specific advancements in the NeSy domain to explore how incorporating symbolic systems can enhance explainability and reasoning capabilities. Our findings are meant to serve as a resource for researchers exploring explainable NeSy methodologies for real-life tasks and applications. Reproducibility details and in-depth comments on each surveyed research work are made available at https://github.com/disi-unibo-nlp/task-oriented-neuro-symbolic.git. More

The integration of symbolic computing with neural networks has intrigued researchers since the first theorizations of Artificial intelligence (AI). The ability of Neuro-Symbolic (NeSy) methods to infer or exploit behavioral schema has been widely considered as one of the possible proxies for human-level intelligence. However, the limited semantic generalizability and the challenges in declining complex domains with pre-defined patterns and rules hinder their practical implementation in real-world scenarios. The unprecedented results achieved by connectionist systems since the last AI breakthrough in 2017 have raised questions about the competitiveness of NeSy solutions, with particular emphasis on the Natural Language Processing and Computer Vision fields. This survey examines task-specific advancements in the NeSy domain to explore how incorporating symbolic systems can enhance explainability and reasoning capabilities. Our findings are meant to serve as a resource for researchers exploring explainable NeSy methodologies for real-life tasks and applications. Reproducibility details and in-depth comments on each surveyed research work are made available at https://github.com/disi-unibo-nlp/task-oriented-neuro-symbolic.git. Less
OpenBioNER: Lightweight Open-Domain Biomedical Named Entity Recognition Through Entity Type Description

Alessio Cocchieri, Giacomo Frisoni, Marcos Martinez Galindo, Gianluca Moro, Giuseppe Tagliavini, Francesco Candoli

NAACL 2025 Findings of the Association for Computational Linguistics: NAACL 2025

Code

Cite

Read

Biomedical Named Entity Recognition (BioNER) faces significant challenges in real-world applications due to limited annotated data and the constant emergence of new entity types, making zero-shot learning capabilities crucial. While Large Language Models (LLMs) possess extensive domain knowledge necessary for specialized fields like biomedicine, their computational costs often make them impractical. To address these challenges, we introduce OpenBioNER, a lightweight BERT-based cross-encoder architecture that can identify any biomedical entity using only its description, eliminating the need for retraining on new, unseen entity types. Through comprehensive evaluation on established biomedical benchmarks, we demonstrate that OpenBioNER surpasses state-of-the-art baselines, including specialized 7B NER LLMs and GPT-4o, achieving up to 10% higher F1 scores while using 110M parameters only. Moreover, OpenBioNER outperforms existing small-scale models that match textual spans with entity types rather than descriptions, both in terms of accuracy and computational efficiency. More

Biomedical Named Entity Recognition (BioNER) faces significant challenges in real-world applications due to limited annotated data and the constant emergence of new entity types, making zero-shot learning capabilities crucial. While Large Language Models (LLMs) possess extensive domain knowledge necessary for specialized fields like biomedicine, their computational costs often make them impractical. To address these challenges, we introduce OpenBioNER, a lightweight BERT-based cross-encoder architecture that can identify any biomedical entity using only its description, eliminating the need for retraining on new, unseen entity types. Through comprehensive evaluation on established biomedical benchmarks, we demonstrate that OpenBioNER surpasses state-of-the-art baselines, including specialized 7B NER LLMs and GPT-4o, achieving up to 10% higher F1 scores while using 110M parameters only. Moreover, OpenBioNER outperforms existing small-scale models that match textual spans with entity types rather than descriptions, both in terms of accuracy and computational efficiency. Less
Unknown Claims: Generation of Fact-Checking Training Examples from Unstructured and Structured Data

Jean-Flavien Bussotti, Luca Ragazzi, Giacomo Frisoni, Gianluca Moro, Paolo Papotti

EMNLP 2024 Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Code

Cite

Read

Computational fact-checking (FC) relies on supervised models to verify claims based on given evidence, requiring a resource-intensive process to annotate large volumes of training data. We introduce Unown, a novel framework that generates training instances for FC systems automatically using both textual and tabular content. Unown selects relevant evidence and generates supporting and refuting claims with advanced negation artifacts. Designed to be flexible, Unown accommodates various strategies for evidence selection and claim generation, offering unparalleled adaptability. We comprehensively evaluate Unown on both text-only and table+text benchmarks, including Feverous, SciFact, and MMFC, a new multi-modal FC dataset. Our results prove that Unown examples are of comparable quality to expert-labeled data, even enabling models to achieve up to 5% higher accuracy. The code, data, and models are available at https://github.com/disi-unibo-nlp/unown More

Computational fact-checking (FC) relies on supervised models to verify claims based on given evidence, requiring a resource-intensive process to annotate large volumes of training data. We introduce Unown, a novel framework that generates training instances for FC systems automatically using both textual and tabular content. Unown selects relevant evidence and generates supporting and refuting claims with advanced negation artifacts. Designed to be flexible, Unown accommodates various strategies for evidence selection and claim generation, offering unparalleled adaptability. We comprehensively evaluate Unown on both text-only and table+text benchmarks, including Feverous, SciFact, and MMFC, a new multi-modal FC dataset. Our results prove that Unown examples are of comparable quality to expert-labeled data, even enabling models to achieve up to 5% higher accuracy. The code, data, and models are available at https://github.com/disi-unibo-nlp/unown Less
Diagnosing the Optimal Prompt Trick: A Case Study on the Effectiveness of Prompt Engineering in Medical Question Answering

Francesco Zangrillo, Stefano Lodi, Gianluca Moro

SDPS 2024 International Conference of the Society for Design and Process Science on Advances and Challenges of Applying AI/GenAI

Read

This study examines the challenges of identifying a universally effective prompt for medical question answering with large language models (LLMs). The complexity of medical terminology and the need for multi-step reasoning make this task particularly difficult, as even minor prompt variations can lead to significant differences in model responses. The research compares various prompting strategies, such as in-context learning, chain-of-thought reasoning, and self-consistency, demonstrating that the effectiveness of a prompt is influenced by the interaction between the prompt itself, the model's architecture, and the specific characteristics of the medical query. Remarkably, one of the smallest LLMs showed a 25-point accuracy improvement in one case and almost obtained medical license in another, using the same inference technique, underscoring the importance of adaptive prompt strategies tailored to the input and model. More

This study examines the challenges of identifying a universally effective prompt for medical question answering with large language models (LLMs). The complexity of medical terminology and the need for multi-step reasoning make this task particularly difficult, as even minor prompt variations can lead to significant differences in model responses. The research compares various prompting strategies, such as in-context learning, chain-of-thought reasoning, and self-consistency, demonstrating that the effectiveness of a prompt is influenced by the interaction between the prompt itself, the model's architecture, and the specific characteristics of the medical query. Remarkably, one of the smallest LLMs showed a 25-point accuracy improvement in one case and almost obtained medical license in another, using the same inference technique, underscoring the importance of adaptive prompt strategies tailored to the input and model. Less
What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization

Luca Ragazzi, Paolo Italiani, Gianluca Moro, Mattia Panni

ACL 2024 Findings of the Association for Computational Linguistics: ACL 2024

Code

Cite

Read

Scientific document summarization (SDS) aims to condense complex and long articles in both technical and plain-language terms to facilitate the accessibility and dissemination of scientific findings. Existing datasets lack source heterogeneity, hindering effective model training and generalizability. First, we introduce SciLay, a novel dataset that includes documents from multiple natural science journals with expert-authored technical and lay summaries. Second, we propose PrunePert, a new transformer-based model that incorporates a differentiable perturbed top-k encoder layer to prune irrelevant tokens in end-to-end learning. Experimental results show that our model achieves a nearly 2x speed-up compared to a state-of-the-art linear transformer, remaining comparable in effectiveness. Additional examinations underscore the importance of employing a training dataset that includes different sources to enhance the generalizability of the models. Code is available at https://github.com/disi-unibo-nlp/sci-lay. More

Scientific document summarization (SDS) aims to condense complex and long articles in both technical and plain-language terms to facilitate the accessibility and dissemination of scientific findings. Existing datasets lack source heterogeneity, hindering effective model training and generalizability. First, we introduce SciLay, a novel dataset that includes documents from multiple natural science journals with expert-authored technical and lay summaries. Second, we propose PrunePert, a new transformer-based model that incorporates a differentiable perturbed top-k encoder layer to prune irrelevant tokens in end-to-end learning. Experimental results show that our model achieves a nearly 2x speed-up compared to a state-of-the-art linear transformer, remaining comparable in effectiveness. Additional examinations underscore the importance of employing a training dataset that includes different sources to enhance the generalizability of the models. Code is available at https://github.com/disi-unibo-nlp/sci-lay. Less
To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

Giacomo Frisoni, Alessio Cocchieri, Alex Presepi, Gianluca Moro, Zaiqiao Meng

ACL 2024 Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Code

Cite

Read

Medical open-domain question answering demands substantial access to specialized knowledge. Recent efforts have sought to decouple knowledge from model parameters, counteracting architectural scaling and allowing for training on common low-resource hardware. The retrieve-then-read paradigm has become ubiquitous, with model predictions grounded on relevant knowledge pieces from external repositories such as PubMed, textbooks, and UMLS. An alternative path, still under-explored but made possible by the advent of domain-specific large language models, entails constructing artificial contexts through prompting. As a result, 'to generate or to retrieve' is the modern equivalent of Hamlet’s dilemma. This paper presents MEDGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU, incorporating a practical perspective by assuming a maximum of 24GB VRAM. MEDGENIE sets a new state-of-the- art (SOTA) in the open-book setting of each testbed, even allowing a small-scale reader to outcompete zero-shot closed-book 175B baselines while using up to 706× fewer parameters. Overall, our findings reveal that generated passages are more effective than retrieved counterparts in attaining higher accuracy. More

Medical open-domain question answering demands substantial access to specialized knowledge. Recent efforts have sought to decouple knowledge from model parameters, counteracting architectural scaling and allowing for training on common low-resource hardware. The retrieve-then-read paradigm has become ubiquitous, with model predictions grounded on relevant knowledge pieces from external repositories such as PubMed, textbooks, and UMLS. An alternative path, still under-explored but made possible by the advent of domain-specific large language models, entails constructing artificial contexts through prompting. As a result, 'to generate or to retrieve' is the modern equivalent of Hamlet’s dilemma. This paper presents MEDGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU, incorporating a practical perspective by assuming a maximum of 24GB VRAM. MEDGENIE sets a new state-of-the- art (SOTA) in the open-book setting of each testbed, even allowing a small-scale reader to outcompete zero-shot closed-book 175B baselines while using up to 706× fewer parameters. Overall, our findings reveal that generated passages are more effective than retrieved counterparts in attaining higher accuracy. Less
Revelio: Interpretable Long-Form Question Answering

Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli, Fabian Vincenzi, Davide Freddi

ICLR 2024 Proceedings of The Second Tiny Papers Track at ICLR 2024

Cite

Read

The black-box architecture of pretrained language models (PLMs) hinders the interpretability of lengthy responses in long-form question answering (LFQA). Prior studies use knowledge graphs (KGs) to enhance output transparency, but mostly focus on non-generative or short-form QA. We present Revelio, a new layer that maps PLM's inner working onto a KG walk. Tests on two LFQA datasets show that Revelio supports PLM-generated answers with reasoning paths presented as rationales while retaining performance and time akin to their vanilla counterparts. More

The black-box architecture of pretrained language models (PLMs) hinders the interpretability of lengthy responses in long-form question answering (LFQA). Prior studies use knowledge graphs (KGs) to enhance output transparency, but mostly focus on non-generative or short-form QA. We present Revelio, a new layer that maps PLM's inner working onto a KG walk. Tests on two LFQA datasets show that Revelio supports PLM-generated answers with reasoning paths presented as rationales while retaining performance and time akin to their vanilla counterparts. Less
Retrieve-and-Rank End-to-End Summarization of Biomedical Studies

Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli, Lorenzo Molfetta

SISAP 2023 Proceedings of the 16th International Conference on Similarity Search and Applications

Cite

Read

An arduous biomedical task involves condensing evidence derived from multiple interrelated studies, given a context as input, to generate reviews or provide answers autonomously. We named this task context-aware multi-document summarization (CA-MDS). Existing state-of-the-art (SOTA) solutions require truncation of the input due to the high memory demands, resulting in the loss of meaningful content. To address this issue effectively, we propose a novel approach called RAMSES, which employs a retrieve-and-rank technique for end-to-end summarization. The model acquires the ability to (i) index each document by modeling its semantic features, (ii) retrieve the most relevant ones, and (iii) generate a summary via token probability marginalization. To facilitate the evaluation, we introduce a new dataset, FAQSUMC19, which includes the synthesizing of multiple supporting papers to answer questions related to Covid-19. Our experimental findings demonstrate that RAMSES achieves notably superior ROUGE scores compared to state-of-the-art methodologies, including the establishment of a new SOTA for the generation of systematic literature reviews using MS2. Quality observation through human evaluation indicates that our model produces more informative responses than previous leading approaches. More

An arduous biomedical task involves condensing evidence derived from multiple interrelated studies, given a context as input, to generate reviews or provide answers autonomously. We named this task context-aware multi-document summarization (CA-MDS). Existing state-of-the-art (SOTA) solutions require truncation of the input due to the high memory demands, resulting in the loss of meaningful content. To address this issue effectively, we propose a novel approach called RAMSES, which employs a retrieve-and-rank technique for end-to-end summarization. The model acquires the ability to (i) index each document by modeling its semantic features, (ii) retrieve the most relevant ones, and (iii) generate a summary via token probability marginalization. To facilitate the evaluation, we introduce a new dataset, FAQSUMC19, which includes the synthesizing of multiple supporting papers to answer questions related to Covid-19. Our experimental findings demonstrate that RAMSES achieves notably superior ROUGE scores compared to state-of-the-art methodologies, including the establishment of a new SOTA for the generation of systematic literature reviews using MS2. Quality observation through human evaluation indicates that our model produces more informative responses than previous leading approaches. Less
Graph-based Summarization of Extracted Essential Knowledge for Low-Resource Scenarios

Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli

ECAI 2023 Proceedings of the 26th European Conference on Artificial Intelligence

Cite

Read

Although current summarization models can process increasingly long text sequences, they still struggle to capture salient related information spread across the lengthy size of inputs with few labeled training instances. Today’s research still relies on standard input truncation without considering graph-based modeling of multiple semantic units to summarize only crucial facets. This paper proposes G-SEEK, a graph-based summarization of extracted essential knowledge. By representing the long source with a heterogeneous graph, our method extracts and provides salient sentences to an abstractive summarization model to generate the summary. Experimental results in low-resource scenarios, distinguished by data scarcity, reveal that G-SEEK consistently improves both the long- and multi-document summarization performance and accuracy across several datasets. More

Although current summarization models can process increasingly long text sequences, they still struggle to capture salient related information spread across the lengthy size of inputs with few labeled training instances. Today’s research still relies on standard input truncation without considering graph-based modeling of multiple semantic units to summarize only crucial facets. This paper proposes G-SEEK, a graph-based summarization of extracted essential knowledge. By representing the long source with a heterogeneous graph, our method extracts and provides salient sentences to an abstractive summarization model to generate the summary. Experimental results in low-resource scenarios, distinguished by data scarcity, reveal that G-SEEK consistently improves both the long- and multi-document summarization performance and accuracy across several datasets. Less
Cogito Ergo Summ: Abstractive Summarization of Biomedical Papers via Semantic Parsing Graphs and Consistency Rewards

Giacomo Frisoni, Paolo Italiani, Stefano Salvatori, Gianluca Moro

AAAI 2023 Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI 2023)

Code

Cite

Read

The automatic synthesis of biomedical publications catalyzes a profound research interest elicited by literature congestion. Current sequence-to-sequence models mainly rely on the lexical surface and seldom consider the deep semantic interconnections between the entities mentioned in the source document. Such superficiality translates into fabricated, poorly informative, redundant, and near-extractive summaries that severely restrict their real-world application in biomedicine, where the specialized jargon and the convoluted facts further emphasize task complexity. To fill this gap, we argue that the summarizer should acquire semantic interpretation over input, exploiting structured and unambiguous representations to capture and conserve the most relevant parts of the text content. This paper presents CogitoErgoSumm, the first framework for biomedical abstractive summarization equipping large pre-trained language models with rich semantic graphs. Precisely, we infuse graphs from two complementary semantic parsing techniques with different goals and granularities—Event Extraction and Abstract Meaning Representation, also designing a reward signal to maximize information content preservation through reinforcement learning. Extensive quantitative and qualitative evaluations on the CDSR dataset show that our solution achieves competitive performance according to multiple metrics, despite using 2.5x fewer parameters. Results and ablation studies indicate that our joint text-graph model generates more enlightening, readable, and consistent summaries. Code available at: https://github.com/disi-unibo-nlp/cogito-ergo-summ. More

The automatic synthesis of biomedical publications catalyzes a profound research interest elicited by literature congestion. Current sequence-to-sequence models mainly rely on the lexical surface and seldom consider the deep semantic interconnections between the entities mentioned in the source document. Such superficiality translates into fabricated, poorly informative, redundant, and near-extractive summaries that severely restrict their real-world application in biomedicine, where the specialized jargon and the convoluted facts further emphasize task complexity. To fill this gap, we argue that the summarizer should acquire semantic interpretation over input, exploiting structured and unambiguous representations to capture and conserve the most relevant parts of the text content. This paper presents CogitoErgoSumm, the first framework for biomedical abstractive summarization equipping large pre-trained language models with rich semantic graphs. Precisely, we infuse graphs from two complementary semantic parsing techniques with different goals and granularities—Event Extraction and Abstract Meaning Representation, also designing a reward signal to maximize information content preservation through reinforcement learning. Extensive quantitative and qualitative evaluations on the CDSR dataset show that our solution achieves competitive performance according to multiple metrics, despite using 2.5x fewer parameters. Results and ablation studies indicate that our joint text-graph model generates more enlightening, readable, and consistent summaries. Code available at: https://github.com/disi-unibo-nlp/cogito-ergo-summ. Less
Carburacy: Summarization Models Tuning and Comparison in Eco-Sustainable Regimes with a Novel Carbon-Aware Accuracy

Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli

AAAI 2023 Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI 2023)

Cite

Read

Generative transformer-based models have reached cutting-edge performance in long document summarization. Nevertheless, this task is witnessing a paradigm shift in developing ever-increasingly computationally-hungry solutions, focusing on effectiveness while ignoring the economic, environmental, and social costs of yielding such results. Accordingly, such extensive resources impact climate change and raise barriers to small and medium organizations distinguished by low-resource regimes of hardware and data. As a result, this unsustainable trend has lifted many concerns in the community, which directs the primary efforts on the proposal of tools to monitor models' energy costs. Despite their importance, no evaluation measure considering models' eco-sustainability exists yet. In this work, we propose Carburacy, the first carbon-aware accuracy measure that captures both model effectiveness and eco-sustainability. We perform a comprehensive benchmark for long document summarization, comparing multiple state-of-the-art quadratic and linear transformers on several datasets under eco-sustainable regimes. Finally, thanks to Carburacy, we found optimal combinations of hyperparameters that let models be competitive in effectiveness with significantly lower costs. More

Generative transformer-based models have reached cutting-edge performance in long document summarization. Nevertheless, this task is witnessing a paradigm shift in developing ever-increasingly computationally-hungry solutions, focusing on effectiveness while ignoring the economic, environmental, and social costs of yielding such results. Accordingly, such extensive resources impact climate change and raise barriers to small and medium organizations distinguished by low-resource regimes of hardware and data. As a result, this unsustainable trend has lifted many concerns in the community, which directs the primary efforts on the proposal of tools to monitor models' energy costs. Despite their importance, no evaluation measure considering models' eco-sustainability exists yet. In this work, we propose Carburacy, the first carbon-aware accuracy measure that captures both model effectiveness and eco-sustainability. We perform a comprehensive benchmark for long document summarization, comparing multiple state-of-the-art quadratic and linear transformers on several datasets under eco-sustainable regimes. Finally, thanks to Carburacy, we found optimal combinations of hyperparameters that let models be competitive in effectiveness with significantly lower costs. Less
BioReader: a Retrieval-Enhanced Text-to-Text Transformer for Biomedical Literature

Giacomo Frisoni, Miki Mizutani, Gianluca Moro, Lorenzo Valgimigli

EMNLP 2022 Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Code

Cite

Read

The latest batch of research has equipped language models with the ability to attend over relevant and factual information from non-parametric external sources, drawing a complementary path to architectural scaling. Besides mastering language, exploiting and contextualizing the latent world knowledge is crucial in complex domains like biomedicine. However, most works in the field rely on general-purpose models supported by databases like Wikipedia and Books. We introduce BioReader, the first retrieval-enhanced text-to-text model for biomedical natural language processing. Our domain-specific T5-based solution augments the input prompt by fetching and assembling relevant scientific literature chunks from a neural database with ≈60 million tokens centered on PubMed. We fine-tune and evaluate BioReader on a broad array of downstream tasks, significantly outperforming several state-of-the-art methods despite using up to 3x fewer parameters. In tandem with extensive ablation studies, we show that domain knowledge can be easily altered or supplemented to make the model generate correct predictions bypassing the retraining step and thus addressing the literature overload issue. More

The latest batch of research has equipped language models with the ability to attend over relevant and factual information from non-parametric external sources, drawing a complementary path to architectural scaling. Besides mastering language, exploiting and contextualizing the latent world knowledge is crucial in complex domains like biomedicine. However, most works in the field rely on general-purpose models supported by databases like Wikipedia and Books. We introduce BioReader, the first retrieval-enhanced text-to-text model for biomedical natural language processing. Our domain-specific T5-based solution augments the input prompt by fetching and assembling relevant scientific literature chunks from a neural database with ≈60 million tokens centered on PubMed. We fine-tune and evaluate BioReader on a broad array of downstream tasks, significantly outperforming several state-of-the-art methods despite using up to 3x fewer parameters. In tandem with extensive ablation studies, we show that domain knowledge can be easily altered or supplemented to make the model generate correct predictions bypassing the retraining step and thus addressing the literature overload issue. Less
Deep Vision-Language Model for Efficient Multi-modal Similarity Search in Fashion Retrieval

Gianluca Moro, Stefano Salvatori

SISAP 2022 Proceedings of the 15th International Conference on Similarity Search and Applications

Cite

Read

Fashion multi-modal retrieval has been recently addressed using vision-and-language transformers. However, these models cannot scale in training time and memory requirements due to the quadratic attention mechanism. Moreover, they design the retrieval as a classification task, assigning a similarity score to pairs of text and images in input. Each query is thus resolved inefficiently by pairing it, at runtime, with every text or image in the entire dataset, precluding the scalability to large-scale datasets. We propose a novel approach for efficient multi-modal retrieval in the fashion domain that combines self-supervised pretraining with linear attention and deep metric learning to create a latent space where spatial proximity among instances translates into a semantic similarity score. Unlike existing contributions, our approach separately embeds text and images, decoupling them and allowing to collocate and search in the space, after training, even for new images with missing text and vice versa. Experiments show that with a single 12 GB GPU, our solution outperforms, both in efficacy and efficiency, existing state-of-the-art contributions on the FashionGen dataset. Our architecture also enables the adoption of multidimensional indices, with which retrieval scales in logarithmic time up to millions, and potentially billions, of text and images. More

Fashion multi-modal retrieval has been recently addressed using vision-and-language transformers. However, these models cannot scale in training time and memory requirements due to the quadratic attention mechanism. Moreover, they design the retrieval as a classification task, assigning a similarity score to pairs of text and images in input. Each query is thus resolved inefficiently by pairing it, at runtime, with every text or image in the entire dataset, precluding the scalability to large-scale datasets. We propose a novel approach for efficient multi-modal retrieval in the fashion domain that combines self-supervised pretraining with linear attention and deep metric learning to create a latent space where spatial proximity among instances translates into a semantic similarity score. Unlike existing contributions, our approach separately embeds text and images, decoupling them and allowing to collocate and search in the space, after training, even for new images with missing text and vice versa. Experiments show that with a single 12 GB GPU, our solution outperforms, both in efficacy and efficiency, existing state-of-the-art contributions on the FashionGen dataset. Our architecture also enables the adoption of multidimensional indices, with which retrieval scales in logarithmic time up to millions, and potentially billions, of text and images. Less
Self-supervised Information Retrieval Trained from Self-generated Sets of Queries and Relevant Documents

Gianluca Moro, Lorenzo Valgimigli, Alex Rossi, Cristiano Casadei, Andrea Montefiori

SISAP 2022 Proceedings of the 15th International Conference on Similarity Search and Applications

Cite

Read

Large corpora of textual data such as scientific papers, patents, legal documents, reviews, etc., represent precious unstructured knowledge that needs semantic information retrieval engines to be extracted. Current best information retrieval solutions use supervised deep learning approaches, requiring large labelled training sets of queries and corresponding relevant documents, often unavailable, or their preparation is economically infeasible for most organizations. In this work, we present a new self-supervised method to train a neural solution to model and efficiently search large corpora of documents against arbitrary queries without requiring labelled dataset of queries and associated relevant papers. The core points of our self-supervised approach are (i) a method to self-generate the training set of queries and their relevant documents from the corpus itself, without any kind of human supervision, (ii) a deep metric learning approach to model their semantic space of relationships, and (iii) the incorporation of a multi-dimensional index for this neural semantic space over which running queries efficiently. To better stress the performance of the approach, we applied it to a totally unsupervised corpus with complex contents of over half a million Italian legal documents. More

Large corpora of textual data such as scientific papers, patents, legal documents, reviews, etc., represent precious unstructured knowledge that needs semantic information retrieval engines to be extracted. Current best information retrieval solutions use supervised deep learning approaches, requiring large labelled training sets of queries and corresponding relevant documents, often unavailable, or their preparation is economically infeasible for most organizations. In this work, we present a new self-supervised method to train a neural solution to model and efficiently search large corpora of documents against arbitrary queries without requiring labelled dataset of queries and associated relevant papers. The core points of our self-supervised approach are (i) a method to self-generate the training set of queries and their relevant documents from the corpus itself, without any kind of human supervision, (ii) a deep metric learning approach to model their semantic space of relationships, and (iii) the incorporation of a multi-dimensional index for this neural semantic space over which running queries efficiently. To better stress the performance of the approach, we applied it to a totally unsupervised corpus with complex contents of over half a million Italian legal documents. Less
Text-to-Text Extraction and Verbalization of Biomedical Event Graphs

Giacomo Frisoni, Gianluca Moro, Lorenzo Balzani

COLING 2022 Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022)

Code

Cite

Read

Biomedical events represent complex, graphical, and semantically rich interactions expressed in the scientific literature. Almost all contributions in the event realm orbit around semantic parsing, usually employing discriminative architectures and cumbersome multi-step pipelines limited to a small number of target interaction types. We present the first lightweight framework to solve both event extraction and event verbalization with a unified text-to-text approach, allowing us to fuse all the resources so far designed for different tasks. To this end, we present a new event graph linearization technique and release highly comprehensive event-text paired datasets, covering more than 150 event types from multiple biology subareas (English language). By streamlining parsing and generation to translations, we propose baseline transformer model results according to multiple biomedical text mining benchmarks and NLG metrics. Our extractive models achieve greater state-of-the-art performance than single-task competitors and show promising capabilities for the controlled generation of coherent natural language utterances from structured data. More

Biomedical events represent complex, graphical, and semantically rich interactions expressed in the scientific literature. Almost all contributions in the event realm orbit around semantic parsing, usually employing discriminative architectures and cumbersome multi-step pipelines limited to a small number of target interaction types. We present the first lightweight framework to solve both event extraction and event verbalization with a unified text-to-text approach, allowing us to fuse all the resources so far designed for different tasks. To this end, we present a new event graph linearization technique and release highly comprehensive event-text paired datasets, covering more than 150 event types from multiple biology subareas (English language). By streamlining parsing and generation to translations, we propose baseline transformer model results according to multiple biomedical text mining benchmarks and NLG metrics. Our extractive models achieve greater state-of-the-art performance than single-task competitors and show promising capabilities for the controlled generation of coherent natural language utterances from structured data. Less
Enhancing Biomedical Scientific Reviews Summarization with Graph-based Factual Evidence Extracted from Papers

Giacomo Frisoni, Paolo Italiani, Francesco Boschi, Gianluca Moro

DATA 2022 Proceedings of the 11th International Conference on Data Science, Technology and Applications

Cite

Read

Best Student Paper Award

Combining structured knowledge and neural language models to tackle natural language processing tasks is a recent research trend that catalyzes community attention. This integration holds a lot of potential in document summarization, especially in the biomedical domain, where the jargon and the complex facts make the overarching information truly hard to interpret. In this context, graph construction via semantic parsing plays a crucial role in unambiguously capturing the most relevant parts of a document. However, current works are limited to extracting open-domain triples, failing to model real-world n-ary and nested biomedical interactions accurately. To alleviate this issue, we present EASumm, the first framework for biomedical abstractive summarization enhanced by event graph extraction (i.e., graphical representations of medical evidence learned from scientific text), relying on dual text-graph encoders. Extensive evaluations on the CDSR dataset corroborate the importance of explicit event structures, with better or comparable performance than previous state-of-the-art systems. Finally, we offer some hints to guide future research in the field. More

Combining structured knowledge and neural language models to tackle natural language processing tasks is a recent research trend that catalyzes community attention. This integration holds a lot of potential in document summarization, especially in the biomedical domain, where the jargon and the complex facts make the overarching information truly hard to interpret. In this context, graph construction via semantic parsing plays a crucial role in unambiguously capturing the most relevant parts of a document. However, current works are limited to extracting open-domain triples, failing to model real-world n-ary and nested biomedical interactions accurately. To alleviate this issue, we present EASumm, the first framework for biomedical abstractive summarization enhanced by event graph extraction (i.e., graphical representations of medical evidence learned from scientific text), relying on dual text-graph encoders. Extensive evaluations on the CDSR dataset corroborate the importance of explicit event structures, with better or comparable performance than previous state-of-the-art systems. Finally, we offer some hints to guide future research in the field. Less
Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature

Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli, Davide Freddi

ACL 2022 Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cite

Read

Although current state-of-the-art Transformer-based solutions succeeded in a wide range for single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potential summary-relevant contents, which is unacceptable in the medical domain where each information can be vital. Others leverage linear model approximations to apply multi-input concatenation, worsening the results because all information is considered, even if it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN) trained to discriminate critical information from a cluster of topic-related medical documents and generate a multi-document summary via token probability marginalization. Results prove we outperform the previous state-of-the-art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and prove the importance of each module of our method. More

Although current state-of-the-art Transformer-based solutions succeeded in a wide range for single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potential summary-relevant contents, which is unacceptable in the medical domain where each information can be vital. Others leverage linear model approximations to apply multi-input concatenation, worsening the results because all information is considered, even if it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN) trained to discriminate critical information from a cluster of topic-related medical documents and generate a multi-document summary via token probability marginalization. Results prove we outperform the previous state-of-the-art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and prove the importance of each module of our method. Less
Semantic Self-Segmentation for Abstractive Summarization of Long Documents in Low-Resource Regimes

Gianluca Moro, Luca Ragazzi

AAAI 2022 Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 2022)

Try!

Code

Cite

Read

The quadratic memory complexity of transformers prevents long document summarization in low computational resource scenarios. State-of-the-art models need to apply input truncation, thus discarding and ignoring potential summary-relevant contents, leading to a performance drop. Furthermore, this loss is generally destructive for semantic text analytics in high-impact domains such as the legal one. In this paper, we propose a novel semantic self-segmentation (Se3) approach for long document summarization to address the critical problems of low-resource regimes, namely to process inputs longer than the GPU memory capacity and produce accurate summaries despite the availability of only a few dozens of training instances. Se3 segments a long input into semantically coherent chunks, allowing transformers to summarize very long documents without truncation by summarizing each chunk and concatenating the results. Experimental outcomes show the approach significantly improves the performance of abstractive summarization transformers, even with just a dozen of labeled data, achieving new state-of-the-art results on two legal datasets of different domains and contents. Finally, we report ablation studies to evaluate each contribution of the components of our method to the performance gain. More

The quadratic memory complexity of transformers prevents long document summarization in low computational resource scenarios. State-of-the-art models need to apply input truncation, thus discarding and ignoring potential summary-relevant contents, leading to a performance drop. Furthermore, this loss is generally destructive for semantic text analytics in high-impact domains such as the legal one. In this paper, we propose a novel semantic self-segmentation (Se3) approach for long document summarization to address the critical problems of low-resource regimes, namely to process inputs longer than the GPU memory capacity and produce accurate summaries despite the availability of only a few dozens of training instances. Se3 segments a long input into semantically coherent chunks, allowing transformers to summarize very long documents without truncation by summarizing each chunk and concatenating the results. Experimental outcomes show the approach significantly improves the performance of abstractive summarization transformers, even with just a dozen of labeled data, achieving new state-of-the-art results on two legal datasets of different domains and contents. Finally, we report ablation studies to evaluate each contribution of the components of our method to the performance gain. Less
Phenomena Explanation from Text: Unsupervised Learning of Interpretable and Statistically Significant Knowledge

Giacomo Frisoni, Gianluca Moro

DATA (Revised Selected Papers) 2020 Proceedings of the 9th International Conference on Data Management Technologies and Applications

Code

Cite

Read

Learning knowledge from text is becoming increasingly important as the amount of unstructured content on the Web rapidly grows. Despite recent breakthroughs in natural language understanding, the explanation of phenomena from textual documents is still a difficult and poorly addressed problem. Additionally, current NLP solutions often require labeled data, are domain-dependent, and based on black box models. In this paper, we introduce POIROT, a new descriptive text mining methodology for phenomena explanation from documents corpora. POIROT is designed to provide accurate and interpretable results in unsupervised settings, quantifying them based on their statistical significance. We evaluated POIROT on a medical case study, with the aim of learning the “voice of patients” from short social posts. Taking Esophageal Achalasia as a reference, we automatically derived scientific correlations with 79% F1-measure score and built useful explanations of the patients’ viewpoint on topics such as symptoms, treatments, drugs, and foods. We make the source code and experiment details publicly available (https://github.com/unibodatascience/POIROT). More

Learning knowledge from text is becoming increasingly important as the amount of unstructured content on the Web rapidly grows. Despite recent breakthroughs in natural language understanding, the explanation of phenomena from textual documents is still a difficult and poorly addressed problem. Additionally, current NLP solutions often require labeled data, are domain-dependent, and based on black box models. In this paper, we introduce POIROT, a new descriptive text mining methodology for phenomena explanation from documents corpora. POIROT is designed to provide accurate and interpretable results in unsupervised settings, quantifying them based on their statistical significance. We evaluated POIROT on a medical case study, with the aim of learning the “voice of patients” from short social posts. Taking Esophageal Achalasia as a reference, we automatically derived scientific correlations with 79% F1-measure score and built useful explanations of the patients’ viewpoint on topics such as symptoms, treatments, drugs, and foods. We make the source code and experiment details publicly available (https://github.com/unibodatascience/POIROT). Less
Unsupervised Descriptive Text Mining for Knowledge Graph Learning

Giacomo Frisoni, Gianluca Moro, Antonella Carbonaro

KDIR 2020 Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management

Cite

Read

The use of knowledge graphs (KGs) in advanced applications is constantly growing, as a consequence of their ability to model large collections of semantically interconnected data. The extraction of relational facts from plain text is currently one of the main approaches for the construction and expansion of KGs. In this paper, we introduce a novel unsupervised and automatic technique of KG learning from corpora of short unstructured and unlabeled texts. Our approach is unique in that it starts from raw textual data and comes to: i) identify a set of relevant domain-dependent terms; ii) extract aggregate and statistically significant semantic relationships between terms, documents, and classes; iii) represent the accurate probabilistic knowledge as a KG; iv) extend and integrate the KG according to the Linked Open Data vision. The proposed solution is easily transferable to many domains and languages as long as the data are available. As a case study, we demonstrate how it is possible to automatically learn a KG representing the knowledge contained within the conversational messages shared on social networks such as Facebook by patients with rare diseases, and the impact this can have on creating resources aimed to capture the 'voice of patients'. More

The use of knowledge graphs (KGs) in advanced applications is constantly growing, as a consequence of their ability to model large collections of semantically interconnected data. The extraction of relational facts from plain text is currently one of the main approaches for the construction and expansion of KGs. In this paper, we introduce a novel unsupervised and automatic technique of KG learning from corpora of short unstructured and unlabeled texts. Our approach is unique in that it starts from raw textual data and comes to: i) identify a set of relevant domain-dependent terms; ii) extract aggregate and statistically significant semantic relationships between terms, documents, and classes; iii) represent the accurate probabilistic knowledge as a KG; iv) extend and integrate the KG according to the Linked Open Data vision. The proposed solution is easily transferable to many domains and languages as long as the data are available. As a case study, we demonstrate how it is possible to automatically learn a KG representing the knowledge contained within the conversational messages shared on social networks such as Facebook by patients with rare diseases, and the impact this can have on creating resources aimed to capture the 'voice of patients'. Less
Learning Interpretable and Statistically Significant Knowledge from Unlabeled Corpora of Social Text Messages: A Novel Methodology of Descriptive Text Mining

Giacomo Frisoni, Gianluca Moro, Antonella Carbonaro

DATA 2020 Proceedings of the 9th International Conference on Data Science, Technology and Applications

Cite

Read

Best Paper Award

Though the strong evolution of knowledge learning models has characterized the last few years, the explanation of a phenomenon from text documents, called descriptive text mining, is still a difficult and poorly addressed problem. The need to work with unlabeled data, explainable approaches, unsupervised, and domain-independent solutions further increases the complexity of this task. Currently, existing techniques only partially solve the problem and have several limitations. In this paper, we propose a novel methodology of descriptive text mining, capable of offering accurate explanations in unsupervised settings and of quantifying the results based on their statistical significance. Considering the strong growth of patient communities on social platforms such as Facebook, we demonstrate the effectiveness of the contribution by taking the short social posts related to Esophageal Achalasia as a typical case study. Specifically, the methodology produces useful explanations about the experiences of patients and caregivers. Starting directly from the unlabeled patient's posts, we derive correct scientific correlations among symptoms, drugs, treatments, foods, and so on. More

Though the strong evolution of knowledge learning models has characterized the last few years, the explanation of a phenomenon from text documents, called descriptive text mining, is still a difficult and poorly addressed problem. The need to work with unlabeled data, explainable approaches, unsupervised, and domain-independent solutions further increases the complexity of this task. Currently, existing techniques only partially solve the problem and have several limitations. In this paper, we propose a novel methodology of descriptive text mining, capable of offering accurate explanations in unsupervised settings and of quantifying the results based on their statistical significance. Considering the strong growth of patient communities on social platforms such as Facebook, we demonstrate the effectiveness of the contribution by taking the short social posts related to Esophageal Achalasia as a typical case study. Specifically, the methodology produces useful explanations about the experiences of patients and caregivers. Starting directly from the unlabeled patient's posts, we derive correct scientific correlations among symptoms, drugs, treatments, foods, and so on. Less
Towards Rare Disease Knowledge Graph Learning from Social Posts of Patients

Giacomo Frisoni, Gianluca Moro, Antonella Carbonaro

RIIFORUM 2020 Research and Innovation Forum 2020 - Disruptive Technologies in Times of Change

Cite

Read

Rare diseases pose particular challenges to patients, families, caregivers, clinicians and researchers. Due to the scarce availability of information and their disintegration, in recent years we are witnessing a strong growth of patient communities on social platforms such as Facebook. Although the data generated in this context are of high value, the currently existing ontologies and resources tend to ignore them. The work presented in this paper studies how to extract knowledge from the large availability of unstructured text generated by the users over time, in order to represent it in an organized way and to make logical reasoning above. Starting from the awareness of the need to integrate different methodologies in complex domains, the research shows a combined use of Text Mining and Semantic Web techniques. In particular, we describe the basis of a novel approach for Knowledge Graph Learning with the aim of introducing a patient-centered vision into the world of Linked Open Data. By identifying and representing correlations between concepts of interest, we show how it is possible to answer patients’ questions and provide them with an additional tool for decision making. The outlined contribute minimizes costs through automatic data retrieval and increases the productivity of investigators. More

Rare diseases pose particular challenges to patients, families, caregivers, clinicians and researchers. Due to the scarce availability of information and their disintegration, in recent years we are witnessing a strong growth of patient communities on social platforms such as Facebook. Although the data generated in this context are of high value, the currently existing ontologies and resources tend to ignore them. The work presented in this paper studies how to extract knowledge from the large availability of unstructured text generated by the users over time, in order to represent it in an organized way and to make logical reasoning above. Starting from the awareness of the need to integrate different methodologies in complex domains, the research shows a combined use of Text Mining and Semantic Web techniques. In particular, we describe the basis of a novel approach for Knowledge Graph Learning with the aim of introducing a patient-centered vision into the world of Linked Open Data. By identifying and representing correlations between concepts of interest, we show how it is possible to answer patients’ questions and provide them with an additional tool for decision making. The outlined contribute minimizes costs through automatic data retrieval and increases the productivity of investigators. Less

Journal Articles

Java Academic Benchmark: Exam-Based Evaluation of LLMs on Object-Oriented Programming

Alessio Cocchieri, Luca Ragazzi, Gianluca Aguzzi, Giacomo Frisoni, Lorenzo Molfetta, Gianluca Moro, Mirko Viroli

JSS 2026 Journal of Systems and Software

Current code generation benchmarks largely overlook object-oriented programming (OOP) skills, leaving open whether large language models (LLMs) can effectively apply OOP principles to structured programming tasks. The dominance and permissiveness of Python in most benchmarks further obscure model weaknesses in core OOP principles. We introduce the Java Academic Benchmark (JAB), based on 103 authentic Java exams collected over a decade at a major university, together with 506 expert-written JUnit tests, to rigorously evaluate advanced Java programming competence with a strong focus on OOP. To complement execution-based scoring, we propose KODE, an LLM-as-a-Judge framework that assesses OOP adherence across four pedagogical dimensions. We evaluate 27 LLMs under two resolution strategies: single-attempt (one-shot) and agentic (iterative refinement with compiler and test feedback). Results reveal that, under our evaluation protocol, larger closed models match or surpass bachelor-level OOP students—especially under agentic resolution—while smaller open models lag behind. JAB's class-level design enables fine-grained error analysis, exposing recurring misconceptions. More

Current code generation benchmarks largely overlook object-oriented programming (OOP) skills, leaving open whether large language models (LLMs) can effectively apply OOP principles to structured programming tasks. The dominance and permissiveness of Python in most benchmarks further obscure model weaknesses in core OOP principles. We introduce the Java Academic Benchmark (JAB), based on 103 authentic Java exams collected over a decade at a major university, together with 506 expert-written JUnit tests, to rigorously evaluate advanced Java programming competence with a strong focus on OOP. To complement execution-based scoring, we propose KODE, an LLM-as-a-Judge framework that assesses OOP adherence across four pedagogical dimensions. We evaluate 27 LLMs under two resolution strategies: single-attempt (one-shot) and agentic (iterative refinement with compiler and test feedback). Results reveal that, under our evaluation protocol, larger closed models match or surpass bachelor-level OOP students—especially under agentic resolution—while smaller open models lag behind. JAB's class-level design enables fine-grained error analysis, exposing recurring misconceptions. Less
COMMA: A Multi-Task and Multi-Lingual Dataset of Constitutional Verdicts

Luca Ragazzi, Giacomo Frisoni, Gianluca Moro, Paolo Italiani, Lorenzo Molfetta, Veronika Folin

AI&Law 2026 Artificial Intelligence and Law

Read

Transformer-based language models have sparked a revolutionary change in Legal NLP, endowing lawyers with unparalleled tools to effectively navigate, understand, and draft large volumes of text. However, the dearth of large-scale datasets from authoritative sources hampers further progress. The available resources are primarily single-task, English-only, and written in layman's terms. To bridge this gap, we introduce COMMA, a multi-task and multi-lingual archive of 14K verdicts drawn from the Constitutional Court of the Italian Republic, grounded in a non-common law system. Documents in COMMA diverge from ordinary legal manuscripts as they address fundamental principles and rights, involve technical jargon, exhibit an articulated structure, are diachronic, have extended length, and demand more significant expertise and interpretation. By embracing 4 widespread languages, COMMA tackles a panoply of necessity-driven tasks: multi-granular abstractive summarization, decision generation, article retrieval, and ruling classification. We systematically benchmark a catalog of language models in both few-shot and full settings, uncovering substantial headroom for improvement. We contribute to the new era of Legal NLP systems by openly releasing COMMA and best-performing models. More

Transformer-based language models have sparked a revolutionary change in Legal NLP, endowing lawyers with unparalleled tools to effectively navigate, understand, and draft large volumes of text. However, the dearth of large-scale datasets from authoritative sources hampers further progress. The available resources are primarily single-task, English-only, and written in layman's terms. To bridge this gap, we introduce COMMA, a multi-task and multi-lingual archive of 14K verdicts drawn from the Constitutional Court of the Italian Republic, grounded in a non-common law system. Documents in COMMA diverge from ordinary legal manuscripts as they address fundamental principles and rights, involve technical jargon, exhibit an articulated structure, are diachronic, have extended length, and demand more significant expertise and interpretation. By embracing 4 widespread languages, COMMA tackles a panoply of necessity-driven tasks: multi-granular abstractive summarization, decision generation, article retrieval, and ruling classification. We systematically benchmark a catalog of language models in both few-shot and full settings, uncovering substantial headroom for improvement. We contribute to the new era of Legal NLP systems by openly releasing COMMA and best-performing models. Less
OpenBioNER-v2: A Suite of Lightweight Models for Zero-Shot Medical Named Entity Recognition via Type Descriptions

Alessio Cocchieri, Giacomo Frisoni, Francesco Zangrillo, Luca Ragazzi, Marcos Martínez Galindo, Giuseppe Tagliavini, Gianluca Moro

ESWA 2026 Expert Systems with Applications

Read

Named entity recognition (NER) in medicine is challenging due to specialized terminology, inconsistent annotation guidelines, and the continuous emergence of new entity types—requiring models that can adapt to unseen targets. Large language models (LLMs) exhibit strong generalization but are impractical for scalable deployment, whereas recent encoder-only approaches leverage entity names for zero-shot inference but struggle with disambiguation in complex domains. We introduce OpenBioNER-v2, a family of lightweight transformer encoders (15M-110M parameters) designed for zero-shot recognition of biomedical and clinical entities by conditioning on natural language descriptions of target types. Our cross-encoder architecture jointly models input text and entity-type descriptions, enabling semantic matches. Pretrained on LLM-generated silver annotations and multi-view descriptions covering thousands of medical types, OpenBioNER-v2 achieves state-of-the-art results across 11 benchmarks—including a new dataset for personal de-identification. Variants with <56M parameters outperform both large and small language models, such as UniversalNER and GliNER. Ablation studies reveal effective strategies for formulating descriptions. All data, code, and model checkpoints are publicly released under open-science principles. More

Named entity recognition (NER) in medicine is challenging due to specialized terminology, inconsistent annotation guidelines, and the continuous emergence of new entity types—requiring models that can adapt to unseen targets. Large language models (LLMs) exhibit strong generalization but are impractical for scalable deployment, whereas recent encoder-only approaches leverage entity names for zero-shot inference but struggle with disambiguation in complex domains. We introduce OpenBioNER-v2, a family of lightweight transformer encoders (15M-110M parameters) designed for zero-shot recognition of biomedical and clinical entities by conditioning on natural language descriptions of target types. Our cross-encoder architecture jointly models input text and entity-type descriptions, enabling semantic matches. Pretrained on LLM-generated silver annotations and multi-view descriptions covering thousands of medical types, OpenBioNER-v2 achieves state-of-the-art results across 11 benchmarks—including a new dataset for personal de-identification. Variants with <56M parameters outperform both large and small language models, such as UniversalNER and GliNER. Ablation studies reveal effective strategies for formulating descriptions. All data, code, and model checkpoints are publicly released under open-science principles. Less
Clash-of-Leges: A Bilingual Dataset for Conflict Detection and Explanation in Statutory Law

Paolo Italiani, Gianluca Moro, Luca Ragazzi

ESWA 2025 Expert Systems with Applications

Cite

Read

Legal conflicts between statutes or constitutional articles present a significant challenge in maintaining consistency and coherence within legal systems. Addressing these conflicts requires extensive human expertise, making the process labor-intensive and time-consuming. In this paper, we introduce Clash-of-Leges, a novel multilingual dataset derived from rulings by the Constitutional Court of the Italian Republic, designed to aid the automation of conflict detection and explanation between legal articles. We identify three key tasks: Conflict Classification, which determines whether two legal articles are in conflict; Conflict Explanation Generation, which provides detailed explanations for identified conflicts; and Reference Retrieval, which sources relevant legal bases and precedents to substantiate interpretations. These tasks are intended to facilitate the development of AI models that can automatically identify and explain contradictions between legal provisions. More

Legal conflicts between statutes or constitutional articles present a significant challenge in maintaining consistency and coherence within legal systems. Addressing these conflicts requires extensive human expertise, making the process labor-intensive and time-consuming. In this paper, we introduce Clash-of-Leges, a novel multilingual dataset derived from rulings by the Constitutional Court of the Italian Republic, designed to aid the automation of conflict detection and explanation between legal articles. We identify three key tasks: Conflict Classification, which determines whether two legal articles are in conflict; Conflict Explanation Generation, which provides detailed explanations for identified conflicts; and Reference Retrieval, which sources relevant legal bases and precedents to substantiate interpretations. These tasks are intended to facilitate the development of AI models that can automatically identify and explain contradictions between legal provisions. Less
Read Between the Tokens: Differentiable Text Pruning via Perturbed Top-k Selection

Paolo Italiani, Luca Ragazzi, Gianluca Moro

IEEE TASLP 2025 IEEE Transactions on Audio, Speech and Language Processing

Transformer-based pretrained language models (PLMs) face scalability issues due to their computational expense, which increases with the length of the input sequence, and often struggle to maintain focus on relevant information. To mitigate this, we introduce PrunePert, a novel model featuring a learnable mechanism that identifies and removes uninformative tokens from the context. By doing so, our method not only addresses performance concerns but also enhances interpretability, offering valuable insights into the tokens utilized in the model's decision-making process. Specifically, our approach employs a differentiable perturbed top-k token selection module within the transformer layers to prune a user-defined percentage of tokens. It can be integrated with any downstream PLMs, allowing them to be trained end-to-end using backpropagation. We demonstrate the application of PrunePert in text summarization and classification tasks, utilizing both encoder-decoder PLMs and contemporary decoder-only large language models. Notably, our findings reveal that models equipped with PrunePert achieve up to 2x higher throughput and exhibit comparable performance in text summarization, while demonstrating superior performance in text classification tasks. Code is available at https://anonymous.4open.science/r/llm_pruning-6B58/. More

Transformer-based pretrained language models (PLMs) face scalability issues due to their computational expense, which increases with the length of the input sequence, and often struggle to maintain focus on relevant information. To mitigate this, we introduce PrunePert, a novel model featuring a learnable mechanism that identifies and removes uninformative tokens from the context. By doing so, our method not only addresses performance concerns but also enhances interpretability, offering valuable insights into the tokens utilized in the model's decision-making process. Specifically, our approach employs a differentiable perturbed top-k token selection module within the transformer layers to prune a user-defined percentage of tokens. It can be integrated with any downstream PLMs, allowing them to be trained end-to-end using backpropagation. We demonstrate the application of PrunePert in text summarization and classification tasks, utilizing both encoder-decoder PLMs and contemporary decoder-only large language models. Notably, our findings reveal that models equipped with PrunePert achieve up to 2x higher throughput and exhibit comparable performance in text summarization, while demonstrating superior performance in text classification tasks. Code is available at https://anonymous.4open.science/r/llm_pruning-6B58/. Less
Abstractive Summarization through the Prism of Decoding Strategies

Giacomo Frisoni, Luca Ragazzi, David Cohen, Gianluca Moro, Antonella Carbonaro, Claudio Sartori

Neural Networks 2025 Neural Networks

Read

In natural language generation, abstractive summarization (AS) is advancing rapidly due to transformer-based language models (LMs). Although decoding strategies significantly influence generated summaries, their significance is often overlooked. Given the abundance of token selection heuristics and associated hyperparameters, the community needs guidance to make well-informed decisions based on the specific task and target metrics. To address this gap, we conduct a comparative assessment of the effectiveness and efficiency of decoding-time techniques for short, long, and multi-document AS. We explore over 3,500 combinations involving three widely used million-scale autoregressive encoder-decoder LMs, two billion-scale decoder-only LMs, six datasets, and nine decoding settings. Our findings highlight that optimized decoding choices can lead to substantial performance improvements. Alongside human evaluation, we quantitatively measure effects using ten automatic metrics, covering dimensions such as semantic similarity, factuality, compression, redundancy, and carbon footprint. To set the stage for differentiable selection and optimization of decoding options, we introduce PRISM, a first-of-its-kind dataset that pairs AS gold input-output examples with our LM predictions across a diverse range of decoding options. More

In natural language generation, abstractive summarization (AS) is advancing rapidly due to transformer-based language models (LMs). Although decoding strategies significantly influence generated summaries, their significance is often overlooked. Given the abundance of token selection heuristics and associated hyperparameters, the community needs guidance to make well-informed decisions based on the specific task and target metrics. To address this gap, we conduct a comparative assessment of the effectiveness and efficiency of decoding-time techniques for short, long, and multi-document AS. We explore over 3,500 combinations involving three widely used million-scale autoregressive encoder-decoder LMs, two billion-scale decoder-only LMs, six datasets, and nine decoding settings. Our findings highlight that optimized decoding choices can lead to substantial performance improvements. Alongside human evaluation, we quantitatively measure effects using ten automatic metrics, covering dimensions such as semantic similarity, factuality, compression, redundancy, and carbon footprint. To set the stage for differentiable selection and optimization of decoding options, we introduce PRISM, a first-of-its-kind dataset that pairs AS gold input-output examples with our LM predictions across a diverse range of decoding options. Less
Legal Lay Summarization: Exploring Methods and Data Generation with Large Language Models

Gianluca Moro, Leonardo David Matteo Magnani, Luca Ragazzi

AIR 2025 Artificial Intelligence Review

Cite

Read

This paper explores advancements in Natural Language Processing (NLP) for legal lay summarization by systematically analyzing existing methodologies, datasets, and research findings. We review current literature, highlighting key challenges such as data scarcity and the complexity of legal language. A primary contribution of this study is the development of LegalEase, a specialized dataset designed to improve model training for summarizing legal documents in layman’s terms. Our findings demonstrate that subdomain-specific datasets within the legal domain outperform general legal datasets in enhancing NLP model performance for generating accurate and comprehensible legal summaries. The insights and methodologies presented provide a foundation for future research in legal lay summarization. More

This paper explores advancements in Natural Language Processing (NLP) for legal lay summarization by systematically analyzing existing methodologies, datasets, and research findings. We review current literature, highlighting key challenges such as data scarcity and the complexity of legal language. A primary contribution of this study is the development of LegalEase, a specialized dataset designed to improve model training for summarizing legal documents in layman’s terms. Our findings demonstrate that subdomain-specific datasets within the legal domain outperform general legal datasets in enhancing NLP model performance for generating accurate and comprehensible legal summaries. The insights and methodologies presented provide a foundation for future research in legal lay summarization. Less
Enhancing Legal Question Answering with Data Generation and Knowledge Distillation from Large Language Models

Paolo Italiani, Gianluca Moro, Luca Ragazzi

AI&Law 2025 Artificial Intelligence and Law

Code

Read

Legal question answering (LQA) relies on supervised methods to automatically handle law-related queries. These solutions require a significant amount of carefully annotated data for training, which makes the process very costly. Although large language models (LLMs) show promise in zero-shot QA, their computational demands limit their practical use, making specialized small language models (SLMs) more favorable. Furthermore, the growing interest in synthetic data generation has recently surged, spurred by the impressive generation capabilities of LLMs. This paper presents Ace-Attorney, an LLM distillation approach devised to develop LQA data and supervised models without human annotation. Given a textual prompt, a frozen LLM generates artificial examples that are used as knowledge to train a student SLM with an order of magnitude fewer parameters. Taking into account a realistic retrieval-based scenario to fetch the correct document for answer generation, we propose Selective Generative Paradigm, a novel approach designed to improve retrieval efficacy. Extensive experiments demonstrate the effectiveness and efficiency of distilled models on Syn-LeQA, our human-free synthetic dataset, and a public expert-annotated corpus. Notably, by using only a few dozen training samples, our best SLM achieves LLM-comparable performance with ≈1200% less CO2 emissions. More

Legal question answering (LQA) relies on supervised methods to automatically handle law-related queries. These solutions require a significant amount of carefully annotated data for training, which makes the process very costly. Although large language models (LLMs) show promise in zero-shot QA, their computational demands limit their practical use, making specialized small language models (SLMs) more favorable. Furthermore, the growing interest in synthetic data generation has recently surged, spurred by the impressive generation capabilities of LLMs. This paper presents Ace-Attorney, an LLM distillation approach devised to develop LQA data and supervised models without human annotation. Given a textual prompt, a frozen LLM generates artificial examples that are used as knowledge to train a student SLM with an order of magnitude fewer parameters. Taking into account a realistic retrieval-based scenario to fetch the correct document for answer generation, we propose Selective Generative Paradigm, a novel approach designed to improve retrieval efficacy. Extensive experiments demonstrate the effectiveness and efficiency of distilled models on Syn-LeQA, our human-free synthetic dataset, and a public expert-annotated corpus. Notably, by using only a few dozen training samples, our best SLM achieves LLM-comparable performance with ≈1200% less CO2 emissions. Less
Cross-Document Distillation via Graph-based Summarization of Extracted Essential Knowledge

Luca Ragazzi, Gianluca Moro, Lorenzo Valgimigli, Riccardo Fiorani

IEEE TASLP 2025 IEEE Transactions on Audio, Speech and Language Processing

Read

Abstractive multi-document summarization aims to generate a comprehensive summary that encapsulates crucial content derived from multiple input documents. Despite the proficiency exhibited by language models in text summarization, challenges persist in capturing and aggregating salient information dispersed across a cluster of lengthy sources. To accommodate more input, existing solutions prioritize sparse attention mechanisms, relying on sequence truncation without incorporating graph-based modeling of multiple semantic units to locate essential facets. Furthermore, the limited availability of training examples adversely impacts performance, thereby compromising summarization quality in real-world few-shot scenarios. In this paper, we present G-Seek-2, a graph-enhanced approach designed to distill multiple topic-related documents by pinpointing and processing solely the pertinent information. We use a heterogeneous graph to model the input cluster, interconnecting various encoded entities via informative semantic edges. Then, a graph neural network locates the most salient sentences that are provided to a language model to generate the summary. We extensively evaluate G-Seek-2 across seven datasets spanning various domains—including news articles, lawsuits, government reports, and scientific texts—under few-shot settings with a limited training sample size of only 100 examples. The experimental findings demonstrate that our model consistently outperforms advanced summarization baselines, achieving improvements as measured by syntactic and semantic metrics. More

Abstractive multi-document summarization aims to generate a comprehensive summary that encapsulates crucial content derived from multiple input documents. Despite the proficiency exhibited by language models in text summarization, challenges persist in capturing and aggregating salient information dispersed across a cluster of lengthy sources. To accommodate more input, existing solutions prioritize sparse attention mechanisms, relying on sequence truncation without incorporating graph-based modeling of multiple semantic units to locate essential facets. Furthermore, the limited availability of training examples adversely impacts performance, thereby compromising summarization quality in real-world few-shot scenarios. In this paper, we present G-Seek-2, a graph-enhanced approach designed to distill multiple topic-related documents by pinpointing and processing solely the pertinent information. We use a heterogeneous graph to model the input cluster, interconnecting various encoded entities via informative semantic edges. Then, a graph neural network locates the most salient sentences that are provided to a language model to generate the summary. We extensively evaluate G-Seek-2 across seven datasets spanning various domains—including news articles, lawsuits, government reports, and scientific texts—under few-shot settings with a limited training sample size of only 100 examples. The experimental findings demonstrate that our model consistently outperforms advanced summarization baselines, achieving improvements as measured by syntactic and semantic metrics. Less
LAWSUIT: a LArge expert-Written SUmmarization dataset of ITalian constitutional court verdicts

Luca Ragazzi, Gianluca Moro, Stefano Guidi, Giacomo Frisoni

AI&Law 2024 Artificial Intelligence and Law

Try!

Cite

Read

Large-scale public datasets are vital for driving the progress of abstractive summarization, especially in law, where documents have highly specialized jargon. However, the available resources are English-centered, limiting research advancements in other languages. This paper introduces LAWSUIT, a collection of 14K Italian legal verdicts with expert-authored abstractive maxims drawn from the Constitutional Court of the Italian Republic. LAWSUIT presents an arduous task with lengthy source texts and evenly distributed salient content. We offer extensive experiments with sequence-to-sequence and segmentation-based approaches, revealing that the latter achieve better results in full and few-shot settings. We openly release LAWSUIT to foster the development and automation of real-world legal applications. More

Large-scale public datasets are vital for driving the progress of abstractive summarization, especially in law, where documents have highly specialized jargon. However, the available resources are English-centered, limiting research advancements in other languages. This paper introduces LAWSUIT, a collection of 14K Italian legal verdicts with expert-authored abstractive maxims drawn from the Constitutional Court of the Italian Republic. LAWSUIT presents an arduous task with lengthy source texts and evenly distributed salient content. We offer extensive experiments with sequence-to-sequence and segmentation-based approaches, revealing that the latter achieve better results in full and few-shot settings. We openly release LAWSUIT to foster the development and automation of real-world legal applications. Less
Multi-Language Transfer Learning for Low-Resource Legal Case Summarization

Gianluca Moro, Nicola Piscaglia, Luca Ragazzi, Paolo Italiani

AI&Law 2024 Artificial Intelligence and Law

Cite

Read

Analyzing and evaluating legal case reports are labor-intensive tasks for judges and lawyers, who usually base their decisions on report abstracts, legal principles, and commonsense reasoning. Thus, summarizing legal documents is time-consuming and requires excellent human expertise. Moreover, public legal corpora of specific languages are almost unavailable. This paper proposes a transfer learning approach with extractive and abstractive techniques to cope with the lack of labeled legal summarization datasets, namely a low-resource scenario. In particular, we conducted extensive multi- and cross-language experiments. The proposed work outperforms the state-of-the-art results of extractive summarization on the Australian Legal Case Reports dataset and sets a new baseline for abstractive summarization. Finally, syntactic and semantic metrics assessments have been carried out to evaluate the accuracy and the factual consistency of the machine-generated legal summaries. More

Analyzing and evaluating legal case reports are labor-intensive tasks for judges and lawyers, who usually base their decisions on report abstracts, legal principles, and commonsense reasoning. Thus, summarizing legal documents is time-consuming and requires excellent human expertise. Moreover, public legal corpora of specific languages are almost unavailable. This paper proposes a transfer learning approach with extractive and abstractive techniques to cope with the lack of labeled legal summarization datasets, namely a low-resource scenario. In particular, we conducted extensive multi- and cross-language experiments. The proposed work outperforms the state-of-the-art results of extractive summarization on the Australian Legal Case Reports dataset and sets a new baseline for abstractive summarization. Finally, syntactic and semantic metrics assessments have been carried out to evaluate the accuracy and the factual consistency of the machine-generated legal summaries. Less
Evidence, my Dear Watson: Abstractive Dialogue Summarization on Learnable Relevant Utterances

Paolo Italiani, Giacomo Frisoni, Gianluca Moro, Antonella Carbonaro, Claudio Sartori

Neurocomputing 2023 Neurocomputing

Cite

Read

Abstractive dialogue summarization requires distilling and rephrasing key information from noisy multi-speaker documents. Combining pre-trained language models with input augmentation techniques has recently led to significant research progress. However, existing solutions still struggle to select relevant chat segments, primarily relying on open-domain and unsupervised annotators not tailored to the actual needs of the summarization task. In this paper, we propose DearWatson, a task-aware utterance-level annotation framework for improving the effectiveness and interpretability of pre-trained dialogue summarization models. Precisely, we learn relevant utterances in the source document and mark them with special tags, that then act as supporting evidence for the generated summary. Quantitative experiments are conducted on two datasets made up of real-life messenger conversations. The results show that DearWatson allows model attention to focus on salient tokens, achieving new state-of-the-art results in three evaluation metrics, including semantic and factuality measures. Human evaluation proves the superiority of our solution in semantic consistency and recall. Finally, extensive ablation studies confirm each module’s importance, also exploring different annotation strategies and parameter-efficient fine-tuning of large generative language models. More

Abstractive dialogue summarization requires distilling and rephrasing key information from noisy multi-speaker documents. Combining pre-trained language models with input augmentation techniques has recently led to significant research progress. However, existing solutions still struggle to select relevant chat segments, primarily relying on open-domain and unsupervised annotators not tailored to the actual needs of the summarization task. In this paper, we propose DearWatson, a task-aware utterance-level annotation framework for improving the effectiveness and interpretability of pre-trained dialogue summarization models. Precisely, we learn relevant utterances in the source document and mark them with special tags, that then act as supporting evidence for the generated summary. Quantitative experiments are conducted on two datasets made up of real-life messenger conversations. The results show that DearWatson allows model attention to focus on salient tokens, achieving new state-of-the-art results in three evaluation metrics, including semantic and factuality measures. Human evaluation proves the superiority of our solution in semantic consistency and recall. Finally, extensive ablation studies confirm each module’s importance, also exploring different annotation strategies and parameter-efficient fine-tuning of large generative language models. Less
Graph-Enhanced Biomedical Abstractive Summarization via Factual Evidence Extraction

Giacomo Frisoni, Paolo Italiani, Gianluca Moro, Ilaria Bartolini, Marco Antonio Boschetti, Antonella Carbonaro

SN Computer Science 2023 SN Computer Science

Code

Cite

Read

Infusing structured semantic representations into language models is a rising research trend underpinning many natural language processing tasks that require understanding and reasoning capabilities. Decoupling factual non-ambiguous concept units from the lexical surface holds great potential in abstractive summarization, especially in the biomedical domain, where fact selection and rephrasing are made more difficult by specialized jargon and hard factuality constraints. Nevertheless, current graph-augmented contributions rely on extractive binary relations, failing to model real-world n-ary and nested biomedical interactions mentioned in the text. To alleviate this issue, we present EASumm, the first framework for biomedical abstractive summarization empowered by event extraction, namely graph-based representations of relevant medical evidence derived from the source scientific document. By relying on dual text-graph encoders, we prove the promising role of explicit event structures, achieving better or comparable performance than previous state-of-the-art models on the CDSR dataset. We conduct extensive ablation studies, including a wide experimentation of graph representation learning techniques. Finally, we offer some hints to guide future research in the field. More

Infusing structured semantic representations into language models is a rising research trend underpinning many natural language processing tasks that require understanding and reasoning capabilities. Decoupling factual non-ambiguous concept units from the lexical surface holds great potential in abstractive summarization, especially in the biomedical domain, where fact selection and rephrasing are made more difficult by specialized jargon and hard factuality constraints. Nevertheless, current graph-augmented contributions rely on extractive binary relations, failing to model real-world n-ary and nested biomedical interactions mentioned in the text. To alleviate this issue, we present EASumm, the first framework for biomedical abstractive summarization empowered by event extraction, namely graph-based representations of relevant medical evidence derived from the source scientific document. By relying on dual text-graph encoders, we prove the promising role of explicit event structures, achieving better or comparable performance than previous state-of-the-art models on the CDSR dataset. We conduct extensive ablation studies, including a wide experimentation of graph representation learning techniques. Finally, we offer some hints to guide future research in the field. Less
Align-Then-Abstract Representation Learning for Low-Resource Summarization

Gianluca Moro, Luca Ragazzi

Neurocomputing 2023 Neurocomputing

Code

Cite

Read

Generative transformer-based models have achieved state-of-the-art performance in text summarization. Nevertheless, they still struggle in real-world scenarios with long documents when trained in low-resource settings of a few dozen labeled training instances, namely in low-resource summarization (LRS). This paper bridges the gap by addressing two key research challenges when summarizing long documents, i.e., long-input processing and document representation, in one coherent model trained for LRS. Specifically, our novel align-then-abstract representation learning model (Athena) jointly trains a segmenter and a summarizer by maximizing the alignment between the chunk-target pairs in output from the text segmentation. Extensive experiments reveal that Athena outperforms the current state-of-the-art approaches in LRS on multiple long document summarization datasets from different domains. More

Generative transformer-based models have achieved state-of-the-art performance in text summarization. Nevertheless, they still struggle in real-world scenarios with long documents when trained in low-resource settings of a few dozen labeled training instances, namely in low-resource summarization (LRS). This paper bridges the gap by addressing two key research challenges when summarizing long documents, i.e., long-input processing and document representation, in one coherent model trained for LRS. Specifically, our novel align-then-abstract representation learning model (Athena) jointly trains a segmenter and a summarizer by maximizing the alignment between the chunk-target pairs in output from the text segmentation. Extensive experiments reveal that Athena outperforms the current state-of-the-art approaches in LRS on multiple long document summarization datasets from different domains. Less
Efficient Memory-Enhanced Transformer for Long-Document Summarization in Low-Resource Regimes

Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli, Giacomo Frisoni, Claudio Sartori, Gustavo Marfia

Sensors 2023 Sensors

Cite

Read

Long document summarization poses obstacles to current generative transformer-based models because of the broad context to process and understand. Indeed, detecting long-range dependencies is still challenging for today’s state-of-the-art solutions, usually requiring model expansion at the cost of an unsustainable demand for computing and memory capacities. This paper introduces Emma, a novel efficient memory-enhanced transformer-based architecture. By segmenting a lengthy input into multiple text fragments, our model stores and compares the current chunk with previous ones, gaining the capability to read and comprehend the entire context over the whole document with a fixed amount of GPU memory. This method enables the model to deal with theoretically infinitely long documents, using less than 18 and 13 GB of memory for training and inference, respectively. We conducted extensive performance analyses and demonstrate that Emma achieved competitive results on two datasets of different domains while consuming significantly less GPU memory than competitors do, even in low-resource settings. More

Long document summarization poses obstacles to current generative transformer-based models because of the broad context to process and understand. Indeed, detecting long-range dependencies is still challenging for today’s state-of-the-art solutions, usually requiring model expansion at the cost of an unsustainable demand for computing and memory capacities. This paper introduces Emma, a novel efficient memory-enhanced transformer-based architecture. By segmenting a lengthy input into multiple text fragments, our model stores and compares the current chunk with previous ones, gaining the capability to read and comprehend the entire context over the whole document with a fixed amount of GPU memory. This method enables the model to deal with theoretically infinitely long documents, using less than 18 and 13 GB of memory for training and inference, respectively. We conducted extensive performance analyses and demonstrate that Emma achieved competitive results on two datasets of different domains while consuming significantly less GPU memory than competitors do, even in low-resource settings. Less
Efficient Text-Image Semantic Search: a Multi-modal Vision-Language Approach for Fashion Retrieval

Gianluca Moro, Stefano Salvatori, Giacomo Frisoni

Neurocomputing 2023 Neurocomputing

Try!

Cite

Read

In this paper, we address the problem of multi-modal retrieval of fashion products. State-of-the-art (SOTA) works proposed in literature use vision-and-language transformers to assign similarity scores to joint text-image pairs, then used for sorting the results during a retrieval phase. However, this approach is inefficient since it requires coupling a query with every record in the dataset and computing a forward pass for each sample at runtime, precluding scalability to large-scale datasets. We thus propose a solution that overcomes the above limitation by combining transformers and deep metric learning to create a latent space where texts and images are separately embedded, and their spatial proximity translates into semantic similarity. Our architecture does not use convolutional neural networks to process images, allowing us to test different levels of image-processing details and metric learning losses. We vastly improve retrieval accuracy results on the FashionGen benchmark (+18.71% and +9.22% Rank@1 on Image-to-Text and Text-to-Image, respectively) while being up to 512x faster. Finally, we analyze the speed-up obtainable by different approximate nearest neighbor retrieval strategies—an optimization precluded to current SOTA contributions. We release our solution as a web application available at https://disi-unibo-nlp.github.io/projects/fashion_retrieval/. More

In this paper, we address the problem of multi-modal retrieval of fashion products. State-of-the-art (SOTA) works proposed in literature use vision-and-language transformers to assign similarity scores to joint text-image pairs, then used for sorting the results during a retrieval phase. However, this approach is inefficient since it requires coupling a query with every record in the dataset and computing a forward pass for each sample at runtime, precluding scalability to large-scale datasets. We thus propose a solution that overcomes the above limitation by combining transformers and deep metric learning to create a latent space where texts and images are separately embedded, and their spatial proximity translates into semantic similarity. Our architecture does not use convolutional neural networks to process images, allowing us to test different levels of image-processing details and metric learning losses. We vastly improve retrieval accuracy results on the FashionGen benchmark (+18.71% and +9.22% Rank@1 on Image-to-Text and Text-to-Image, respectively) while being up to 512x faster. Finally, we analyze the speed-up obtainable by different approximate nearest neighbor retrieval strategies—an optimization precluded to current SOTA contributions. We release our solution as a web application available at https://disi-unibo-nlp.github.io/projects/fashion_retrieval/. Less
Comprehensive Analysis of Knowledge Graph Embedding Techniques Benchmarked on Link Prediction

Ilaria Ferrari, Giacomo Frisoni, Paolo Italiani, Gianluca Moro, Claudio Sartori

Electronics 2022 Electronics

Code

Cite

Read

In knowledge graph representation learning, link prediction is among the most popular and influential tasks. Its surge in popularity has resulted in a panoply of orthogonal embedding-based methods projecting entities and relations into low-dimensional continuous vectors. To further enrich the research space, the community witnessed a prolific development of evaluation benchmarks with a variety of structures and domains. Therefore, researchers and practitioners face an unprecedented challenge in effectively identifying the best solution to their needs. To this end, we propose the most comprehensive and up-to-date study to systematically assess the effectiveness and efficiency of embedding models for knowledge graph completion. We compare 13 models on six datasets with different sizes, domains, and relational properties, covering translational, semantic matching, and neural network-based encoders. A fine-grained evaluation is conducted to compare each technique head-to-head in terms of standard metrics, training and evaluation times, memory consumption, carbon footprint, and space geometry. Our results demonstrate the high dependence between performance and graph types, identifying the best options for each scenario. Among all the encoding strategies, the new generation of translational models emerges as the most promising, bringing out the best and most consistent results across all the datasets and evaluation criteria. More

In knowledge graph representation learning, link prediction is among the most popular and influential tasks. Its surge in popularity has resulted in a panoply of orthogonal embedding-based methods projecting entities and relations into low-dimensional continuous vectors. To further enrich the research space, the community witnessed a prolific development of evaluation benchmarks with a variety of structures and domains. Therefore, researchers and practitioners face an unprecedented challenge in effectively identifying the best solution to their needs. To this end, we propose the most comprehensive and up-to-date study to systematically assess the effectiveness and efficiency of embedding models for knowledge graph completion. We compare 13 models on six datasets with different sizes, domains, and relational properties, covering translational, semantic matching, and neural network-based encoders. A fine-grained evaluation is conducted to compare each technique head-to-head in terms of standard metrics, training and evaluation times, memory consumption, carbon footprint, and space geometry. Our results demonstrate the high dependence between performance and graph types, identifying the best options for each scenario. Among all the encoding strategies, the new generation of translational models emerges as the most promising, bringing out the best and most consistent results across all the datasets and evaluation criteria. Less
Human Being Detection from UWB NLOS Signals: Accuracy and Generality of Advanced Machine Learning Models

Gianluca Moro, Federico Di Luca, Davide Dardari, Giacomo Frisoni

Sensors 2022 Sensors

Code

Cite

Read

This paper studies the problem of detecting human beings in non-line-of-sight (NLOS) conditions using an ultra-wideband radar. We perform an extensive measurement campaign in realistic environments, considering different body orientations, the obstacles’ materials, and radar–obstacle distances. We examine two main scenarios according to the radar position: (i) placed on top of a mobile cart; (ii) handheld at different heights. We empirically analyze and compare several input representations and machine learning (ML) methods—supervised and unsupervised, symbolic and non-symbolic—according to both their accuracy in detecting NLOS human beings and their adaptability to unseen cases. Our study proves the effectiveness and flexibility of modern ML techniques, avoiding environment-specific configurations and benefiting from knowledge transference. Unlike traditional TLC approaches, ML allows for generalization, overcoming limits due to unknown or only partially known observation models and insufficient labeled data, which usually occur in emergencies or in the presence of time/cost constraints. More

This paper studies the problem of detecting human beings in non-line-of-sight (NLOS) conditions using an ultra-wideband radar. We perform an extensive measurement campaign in realistic environments, considering different body orientations, the obstacles’ materials, and radar–obstacle distances. We examine two main scenarios according to the radar position: (i) placed on top of a mobile cart; (ii) handheld at different heights. We empirically analyze and compare several input representations and machine learning (ML) methods—supervised and unsupervised, symbolic and non-symbolic—according to both their accuracy in detecting NLOS human beings and their adaptability to unseen cases. Our study proves the effectiveness and flexibility of modern ML techniques, avoiding environment-specific configurations and benefiting from knowledge transference. Unlike traditional TLC approaches, ML allows for generalization, overcoming limits due to unknown or only partially known observation models and insufficient labeled data, which usually occur in emergencies or in the presence of time/cost constraints. Less
Unsupervised Event Graph Representation and Similarity Learning on Biomedical Literature

Giacomo Frisoni, Gianluca Moro, Giulio Carlassare, Antonella Carbonaro

Sensors 2021 Sensors

Code

Cite

Read

The automatic extraction of biomedical events from the scientific literature has drawn keen interest in the last several years, recognizing complex and semantically rich graphical interactions otherwise buried in texts. However, very few works revolve around learning embeddings or similarity metrics for event graphs. This gap leaves biological relations unlinked and prevents the application of machine learning techniques to promote discoveries. Taking advantage of recent deep graph kernel solutions and pre-trained language models, we propose Deep Divergence Event Graph Kernels (DDEGK), an unsupervised inductive method to map events into low-dimensional vectors, preserving their structural and semantic similarities. Unlike most other systems, DDEGK operates at a graph level and does not require task-specific labels, feature engineering, or known correspondences between nodes. To this end, our solution compares events against a small set of anchor ones, trains cross-graph attention networks for drawing pairwise alignments (bolstering interpretability), and employs transformer-based models to encode continuous attributes. Extensive experiments have been done on nine biomedical datasets. We show that our learned event representations can be effectively employed in tasks such as graph classification, clustering, and visualization, also facilitating downstream semantic textual similarity. Empirical results demonstrate that DDEGK significantly outperforms other state-of-the-art methods. More

The automatic extraction of biomedical events from the scientific literature has drawn keen interest in the last several years, recognizing complex and semantically rich graphical interactions otherwise buried in texts. However, very few works revolve around learning embeddings or similarity metrics for event graphs. This gap leaves biological relations unlinked and prevents the application of machine learning techniques to promote discoveries. Taking advantage of recent deep graph kernel solutions and pre-trained language models, we propose Deep Divergence Event Graph Kernels (DDEGK), an unsupervised inductive method to map events into low-dimensional vectors, preserving their structural and semantic similarities. Unlike most other systems, DDEGK operates at a graph level and does not require task-specific labels, feature engineering, or known correspondences between nodes. To this end, our solution compares events against a small set of anchor ones, trains cross-graph attention networks for drawing pairwise alignments (bolstering interpretability), and employs transformer-based models to encode continuous attributes. Extensive experiments have been done on nine biomedical datasets. We show that our learned event representations can be effectively employed in tasks such as graph classification, clustering, and visualization, also facilitating downstream semantic textual similarity. Empirical results demonstrate that DDEGK significantly outperforms other state-of-the-art methods. Less
A Survey on Event Extraction for Natural Language Understanding: Riding the Biomedical Literature Wave

Giacomo Frisoni, Gianluca Moro, Antonella Carbonaro

IEEE Access 2021 IEEE Access

Cite

Read

Motivation: The scientific literature embeds an enormous amount of relational knowledge, encompassing interactions between biomedical entities, like proteins, drugs, and symptoms. To cope with the ever-increasing number of publications, researchers are experiencing a surge of interest in extracting valuable, structured, concise, and unambiguous information from plain texts. With the development of deep learning, the granularity of information extraction is evolving from entities and pairwise relations to events. Events can model complex interactions involving multiple participants having a specific semantic role, also handling nested and overlapping definitions. After being studied for years, automatic event extraction is on the road to significantly impact biology in a wide range of applications, from knowledge base enrichment to the formulation of new research hypotheses. Results: This paper provides a comprehensive and up-to-date survey on the link between event extraction and natural language understanding, focusing on the biomedical domain. First, we establish a flexible event definition, summarizing the terminological efforts conducted in various areas. Second, we present the event extraction task, the related challenges, and the available annotated corpora. Third, we deeply explore the most representative methods and present an analysis of the current state-of-the-art, accompanied by performance discussion. To help researchers navigate the avalanche of event extraction works, we provide a detailed taxonomy for classifying the contributions proposed by the community. Fourth, we compare solutions applied in biomedicine with those evaluated in other domains, identifying research opportunities and providing insights for strategies not yet explored. Finally, we discuss applications and our envisions about future perspectives, moving the needle on explainability and knowledge injection. More

Motivation: The scientific literature embeds an enormous amount of relational knowledge, encompassing interactions between biomedical entities, like proteins, drugs, and symptoms. To cope with the ever-increasing number of publications, researchers are experiencing a surge of interest in extracting valuable, structured, concise, and unambiguous information from plain texts. With the development of deep learning, the granularity of information extraction is evolving from entities and pairwise relations to events. Events can model complex interactions involving multiple participants having a specific semantic role, also handling nested and overlapping definitions. After being studied for years, automatic event extraction is on the road to significantly impact biology in a wide range of applications, from knowledge base enrichment to the formulation of new research hypotheses. Results: This paper provides a comprehensive and up-to-date survey on the link between event extraction and natural language understanding, focusing on the biomedical domain. First, we establish a flexible event definition, summarizing the terminological efforts conducted in various areas. Second, we present the event extraction task, the related challenges, and the available annotated corpora. Third, we deeply explore the most representative methods and present an analysis of the current state-of-the-art, accompanied by performance discussion. To help researchers navigate the avalanche of event extraction works, we provide a detailed taxonomy for classifying the contributions proposed by the community. Fourth, we compare solutions applied in biomedicine with those evaluated in other domains, identifying research opportunities and providing insights for strategies not yet explored. Finally, we discuss applications and our envisions about future perspectives, moving the needle on explainability and knowledge injection. Less
Efficient Self-Supervised Metric Information Retrieval: A Bibliography Based Method Applied to COVID Literature

Gianluca Moro, Lorenzo Valgimigli

Sensors 2021 Sensors

Cite

Read

The literature on coronaviruses counts more than 300,000 publications. Finding relevant papers concerning arbitrary queries is essential to discovery helpful knowledge. Current best information retrieval (IR) use deep learning approaches and need supervised training sets with labeled data, namely to know a priori the queries and their corresponding relevant papers. Creating such labeled datasets is time-expensive and requires prominent experts’ efforts, resources insufficiently available under a pandemic time pressure. We present a new self-supervised solution, called SUBLIMER, that does not require labels to learn to search on corpora of scientific papers for most relevant against arbitrary queries. SUBLIMER is a novel efficient IR engine trained on the unsupervised COVID-19 Open Research Dataset (CORD19), using deep metric learning. The core point of our self-supervised approach is that it uses no labels, but exploits the bibliography citations from papers to create a latent space where their spatial proximity is a metric of semantic similarity; for this reason, it can also be applied to other domains of papers corpora. SUBLIMER, despite is self-supervised, outperforms the Precision@5 (P@5) and Bpref of the state-of-the-art competitors on CORD19, which, differently from our approach, require both labeled datasets and a number of trainable parameters that is an order of magnitude higher than our. More

The literature on coronaviruses counts more than 300,000 publications. Finding relevant papers concerning arbitrary queries is essential to discovery helpful knowledge. Current best information retrieval (IR) use deep learning approaches and need supervised training sets with labeled data, namely to know a priori the queries and their corresponding relevant papers. Creating such labeled datasets is time-expensive and requires prominent experts’ efforts, resources insufficiently available under a pandemic time pressure. We present a new self-supervised solution, called SUBLIMER, that does not require labels to learn to search on corpora of scientific papers for most relevant against arbitrary queries. SUBLIMER is a novel efficient IR engine trained on the unsupervised COVID-19 Open Research Dataset (CORD19), using deep metric learning. The core point of our self-supervised approach is that it uses no labels, but exploits the bibliography citations from papers to create a latent space where their spatial proximity is a metric of semantic similarity; for this reason, it can also be applied to other domains of papers corpora. SUBLIMER, despite is self-supervised, outperforms the Precision@5 (P@5) and Bpref of the state-of-the-art competitors on CORD19, which, differently from our approach, require both labeled datasets and a number of trainable parameters that is an order of magnitude higher than our. Less

Publications

Conference Proceedings

LLMs (Almost) Never Abstain Under Medical Uncertainty

Sycophants in the Courtroom: Are LLMs Fragile to Juridical Authority and Evolving Legal Standards?

ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?

MemeWeaver: Inter-Meme Graph Reasoning for Sexism and Misogyny Detection

Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

Can Large Language Models Win the International Mathematical Games?

PORTS: Preference-Optimized Retrievers for Tool Selection with Large Language Models

FEAST: Retrieval-Augmented Multi-Hierarchical Food Classification for the FoodEx2 System

Magic Mirror on the Wall, Which is the Fairest Prompt of All? A Survey on Automatic Prompt Learning

Predicting Protein Functions with Ensemble Deep Learning and Protein Language Models

"What do you call a dog that is incontrovertibly true? Dogma": Testing LLM Generalization through Humor

ZeroNER: Fueling Zero-Shot Named Entity Recognition via Entity Type Descriptions

Neuro-Symbolic Artificial Intelligence: A Task-Directed Survey in the Black-Box Era

OpenBioNER: Lightweight Open-Domain Biomedical Named Entity Recognition Through Entity Type Description

Unknown Claims: Generation of Fact-Checking Training Examples from Unstructured and Structured Data

Diagnosing the Optimal Prompt Trick: A Case Study on the Effectiveness of Prompt Engineering in Medical Question Answering

What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization

To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

Revelio: Interpretable Long-Form Question Answering

Retrieve-and-Rank End-to-End Summarization of Biomedical Studies

Graph-based Summarization of Extracted Essential Knowledge for Low-Resource Scenarios

Cogito Ergo Summ: Abstractive Summarization of Biomedical Papers via Semantic Parsing Graphs and Consistency Rewards

Carburacy: Summarization Models Tuning and Comparison in Eco-Sustainable Regimes with a Novel Carbon-Aware Accuracy

BioReader: a Retrieval-Enhanced Text-to-Text Transformer for Biomedical Literature

Deep Vision-Language Model for Efficient Multi-modal Similarity Search in Fashion Retrieval

Self-supervised Information Retrieval Trained from Self-generated Sets of Queries and Relevant Documents

Text-to-Text Extraction and Verbalization of Biomedical Event Graphs

Enhancing Biomedical Scientific Reviews Summarization with Graph-based Factual Evidence Extracted from Papers

Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature

Semantic Self-Segmentation for Abstractive Summarization of Long Documents in Low-Resource Regimes

Phenomena Explanation from Text: Unsupervised Learning of Interpretable and Statistically Significant Knowledge

Unsupervised Descriptive Text Mining for Knowledge Graph Learning

Learning Interpretable and Statistically Significant Knowledge from Unlabeled Corpora of Social Text Messages: A Novel Methodology of Descriptive Text Mining

Towards Rare Disease Knowledge Graph Learning from Social Posts of Patients

Journal Articles

Java Academic Benchmark: Exam-Based Evaluation of LLMs on Object-Oriented Programming

COMMA: A Multi-Task and Multi-Lingual Dataset of Constitutional Verdicts

OpenBioNER-v2: A Suite of Lightweight Models for Zero-Shot Medical Named Entity Recognition via Type Descriptions

Clash-of-Leges: A Bilingual Dataset for Conflict Detection and Explanation in Statutory Law

Read Between the Tokens: Differentiable Text Pruning via Perturbed Top-k Selection

Abstractive Summarization through the Prism of Decoding Strategies

Legal Lay Summarization: Exploring Methods and Data Generation with Large Language Models

Enhancing Legal Question Answering with Data Generation and Knowledge Distillation from Large Language Models

Cross-Document Distillation via Graph-based Summarization of Extracted Essential Knowledge

LAWSUIT: a LArge expert-Written SUmmarization dataset of ITalian constitutional court verdicts

Multi-Language Transfer Learning for Low-Resource Legal Case Summarization

Evidence, my Dear Watson: Abstractive Dialogue Summarization on Learnable Relevant Utterances

Graph-Enhanced Biomedical Abstractive Summarization via Factual Evidence Extraction

Align-Then-Abstract Representation Learning for Low-Resource Summarization

Efficient Memory-Enhanced Transformer for Long-Document Summarization in Low-Resource Regimes

Efficient Text-Image Semantic Search: a Multi-modal Vision-Language Approach for Fashion Retrieval

Comprehensive Analysis of Knowledge Graph Embedding Techniques Benchmarked on Link Prediction

Human Being Detection from UWB NLOS Signals: Accuracy and Generality of Advanced Machine Learning Models

Unsupervised Event Graph Representation and Similarity Learning on Biomedical Literature

A Survey on Event Extraction for Natural Language Understanding: Riding the Biomedical Literature Wave

Efficient Self-Supervised Metric Information Retrieval: A Bibliography Based Method Applied to COVID Literature