Published: May 17, 2024
We are proud to announce that our group will be at ACL 2024 with two accepted papers, one in the Main Track and one in Findings!
Here’s the list of the papers:
- To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering
- What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization
To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering
by G. Frisoni, A. Cocchieri, A. Presepi, G. Moro and Z. Meng
Medical open-domain question answering demands substantial access to specialized knowledge. Recent efforts have sought to decouple knowledge from model parameters, counteracting architectural scaling and allowing for training on common low-resource hardware. The retrieve-then-read paradigm has become ubiquitous, with model predictions grounded in relevant knowledge pieces from external repositories such as PubMed, textbooks, and UMLS. An alternative path, still under-explored but made possible by the advent of domain-specific large language models, entails constructing artificial contexts through prompting. As a result, 'to generate or to retrieve' is the modern equivalent of Hamlet’s dilemma. This paper presents MEDGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU, incorporating a practical perspective by assuming a maximum of 24GB VRAM. MEDGENIE sets a new state-of-the-art (SOTA) in the open-book setting of each testbed, allowing even a small-scale reader to outcompete zero-shot closed-book 175B baselines while using up to 706× fewer parameters. Overall, our findings reveal that generated passages are more effective than retrieved counterparts in attaining higher accuracy.
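For readers curious how generate-then-read differs from retrieve-then-read in practice, here is a minimal, hypothetical sketch: a generator LLM is prompted to write an artificial context for the question, and a small reader answers the multiple-choice question grounded in that generated passage instead of retrieved documents. The model names, prompts, and example question below are illustrative assumptions, not the paper's actual setup.

```python
# Minimal generate-then-read sketch (illustrative; the models and prompts
# are placeholders, not MEDGENIE's actual configuration).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")                    # stand-in for a domain LLM
reader = pipeline("text2text-generation", model="google/flan-t5-small")  # small-scale reader

question = "Which vitamin deficiency causes scurvy?"
options = "A) Vitamin A  B) Vitamin B12  C) Vitamin C  D) Vitamin D"

# Step 1 (generate): prompt the generator for an artificial context,
# replacing the retrieval step of retrieve-then-read.
ctx_prompt = f"Write a short medical background passage about: {question}\n"
context = generator(ctx_prompt, max_new_tokens=80, do_sample=False)[0]["generated_text"]

# Step 2 (read): the reader answers conditioned on the generated context.
read_prompt = (
    f"Context: {context}\n\nQuestion: {question}\nOptions: {options}\n"
    "Answer with the letter of the correct option."
)
print(reader(read_prompt, max_new_tokens=5)[0]["generated_text"])
```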
What Are You Token About? Differentiable Perturbed Top-k Token Selection for Scientific Document Summarization
by L. Ragazzi, P. Italiani, G. Moro and M. Panni
Scientific document summarization (SDS) aims to condense long, complex articles into both technical and plain-language summaries to facilitate the accessibility and dissemination of scientific findings. Existing datasets lack source heterogeneity, hindering effective model training and generalizability. First, we introduce SciLay, a novel dataset that includes documents from multiple natural science journals with expert-authored technical and lay summaries. Second, we propose PrunePert, a new transformer-based model that incorporates a differentiable perturbed top-k encoder layer to prune irrelevant tokens during end-to-end learning. Experimental results show that our model achieves a nearly 2× speed-up over a state-of-the-art linear transformer while remaining comparable in effectiveness. Additional examinations underscore the importance of employing a training dataset that includes different sources to enhance the generalizability of the models. Code is available at https://anonymous.4open.science/r/sci-lay-2511/.
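The abstract does not detail the perturbed top-k layer, but differentiable top-k selection is commonly built on the perturbed-optimizers trick (Berthet et al., 2020): perturb the token scores with Gaussian noise, take hard top-k masks, average them into a smooth selection mask, and estimate gradients from the same noise samples. The PyTorch sketch below illustrates that general technique under this assumption; the class name and hyperparameters are hypothetical, not PrunePert's exact layer.

```python
# Sketch of differentiable perturbed top-k token selection in the spirit of
# perturbed optimizers (Berthet et al., 2020); an assumption about the
# mechanism, not the paper's actual PrunePert implementation.
import torch

class PerturbedTopK(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, k, n_samples, sigma):
        # scores: (batch, seq_len) relevance score per token.
        noise = torch.randn(n_samples, *scores.shape, device=scores.device)
        perturbed = scores.unsqueeze(0) + sigma * noise             # (s, b, n)
        topk_idx = perturbed.topk(k, dim=-1).indices
        hard = torch.zeros_like(perturbed).scatter_(-1, topk_idx, 1.0)
        ctx.save_for_backward(hard, noise)
        ctx.sigma = sigma
        return hard.mean(dim=0)                                     # soft mask in [0, 1]

    @staticmethod
    def backward(ctx, grad_out):
        hard, noise = ctx.saved_tensors
        # Monte Carlo vector-Jacobian product: each sample's (mask . grad_out)
        # weights its noise draw; average over samples and divide by sigma.
        inner = (hard * grad_out.unsqueeze(0)).sum(dim=-1, keepdim=True)
        grad_scores = (inner * noise).mean(dim=0) / ctx.sigma
        return grad_scores, None, None, None

# Toy usage: gradients flow back into the token scores.
scores = torch.randn(2, 16, requires_grad=True)     # hypothetical per-token scores
mask = PerturbedTopK.apply(scores, 4, 100, 0.05)    # smoothed ~top-4 selection mask
mask.sum().backward()                               # scores.grad is now populated
# In an encoder, hidden states could be softly pruned as: hidden * mask.unsqueeze(-1)
```

In an end-to-end model, the scores would come from a learned scoring head over the encoder's hidden states, so the selection of which tokens to keep is trained jointly with the summarizer.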