Introduction
Biomedical publications contain the latest research on prominent health-related topics, ranging from common illnesses to global pandemics. This can often result in their content being of interest to a wide variety of audiences including researchers, medical professionals, journalists, and even members of the public. However, the highly technical and specialist language used within such articles typically makes it difficult for non-expert audiences to understand their contents.
The BioLaySumm shared task focuses on the abstractive summarization of biomedical articles, with an emphasis on catering to non-expert audiences through the generation of summaries that are more readable, contain more background information, and use less technical terminology (i.e., "lay summaries").
This is the 3rd iteration of BioLaySumm, following the success of the 2nd edition of the task at BioNLP 2024 [1], which attracted over 200 submissions across 53 different teams, and the 1st edition at BioNLP 2023 [2], which attracted 56 submissions across 20 different teams. In this edition, hosted by the BioNLP workshop at ACL 2025, we aim to build on last year’s task by introducing a new task: radiology report generation with layman’s terms, extending the shared task to a new domain and to multi-modal inputs. Additionally, we update our evaluation protocol and encourage participants to explore unified approaches that tackle both the text-only task and the multi-modal task, helping to further advance the state of the art in lay summarization.
Important Dates
- First call for participation: January 31st, 2025
- Releasing of task data (training, validation, and test): February 15th, 2025
- System submission deadline: May 6th, 2025
- System papers due date: May 15th, 2025
- Notification of acceptance: June 17th, 2025
- Camera-ready system papers due: July 1st, 2025
- BioNLP Workshop Date: July 31 - August 1, 2025
Note that all deadlines are 23:59:59 AoE (UTC-12).
Task Definition
Task 1: Lay Summarization
Subtask 1.1: Plain Lay Summarization
Given an article’s abstract and main text as input, the goal is for participants to train a model (or models) to generate the lay summary. Two separate datasets (derived from biomedical journals, PLOS and eLife) are provided for model training and will be used for evaluation. For the final evaluation, submissions will be ranked on the average performance across both datasets.
Note that submissions can be generated from either 2 separate summarization models (i.e., one trained on each dataset) or a single unified model (i.e., trained on both datasets). Participants will be required to indicate which approach was taken for each submission.
Subtask 1.2: Lay Summarisation with External Knowledge
Due to differences in the intended audience, the source article may not always contain all the information required by a lay audience (e.g., background information needed for understanding the topic, concept definitions, etc.). This knowledge gap can be filled through the introduction of relevant external information.
This subtask is a constrained version of Subtask 1.1 and follows an identical setup, except that participants are required to make use of external knowledge in some capacity. For example, participants could perform manual data augmentation or Retrieval-Augmented Generation (RAG). Participants will be required to indicate which approach was taken (i.e., online or offline) for each submission.
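As an illustration of the RAG route, the sketch below retrieves the glossary entries most similar to an article and prepends them to the summarizer input. The glossary contents, function names, and the surrounding pipeline are hypothetical; any retriever (dense, BM25, or web search) and any knowledge source would satisfy the external-knowledge requirement.

```python
# A minimal retrieval-augmented generation (RAG) sketch for Subtask 1.2.
# The glossary and function names are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical external knowledge source (e.g., encyclopedia or UMLS definitions).
glossary = [
    "Apoptosis: a controlled process by which cells die in an orderly way.",
    "Gene expression: the process by which the information in a gene is used to make proteins.",
    "Antibody: a protein made by the immune system that recognises foreign substances.",
]

vectorizer = TfidfVectorizer().fit(glossary)
glossary_vectors = vectorizer.transform(glossary)

def retrieve_background(article_text: str, k: int = 2) -> list[str]:
    """Return the k glossary entries most similar to the article."""
    scores = cosine_similarity(vectorizer.transform([article_text]), glossary_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [glossary[i] for i in top]

def build_augmented_input(article_text: str) -> str:
    """Prepend retrieved background so the summarizer can explain key concepts."""
    background = "\n".join(retrieve_background(article_text))
    return f"Background knowledge:\n{background}\n\nArticle:\n{article_text}"
```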
Task 2: Radiology Report Generation with Layman’s Terms
Subtask 2.1: Radiology Report Translation
The goal of this subtask is to build models that translate professional radiology reports into layman’s terms. This has demonstrated benefits with real-world impact, e.g., improving patient understanding. It also allows participants who do not wish to take part in our multi-modal setting to contribute to the field of Radiology Report Generation (RRG).
Participants will be asked to submit summaries generated by their trained model. We will provide pairs of professional reports and their layman’s-terms versions from several datasets, in two settings: Open-i, PadChest, and BIMCV-COVID19, with or without MIMIC-CXR.
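To make the setup concrete, below is a minimal fine-tuning sketch for this subtask, assuming the released pairs expose fields such as "report" (professional wording) and "layman" (plain-language version); the field names, example pair, and choice of a small encoder-decoder model are assumptions, not the official data schema or baseline.

```python
# A minimal seq2seq fine-tuning sketch for Subtask 2.1 (report -> layman's terms).
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-base"  # illustrative choice, not the official baseline
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder pair; in practice these come from Open-i / PadChest / BIMCV-COVID19.
train_pairs = Dataset.from_dict({
    "report": ["Mild cardiomegaly. No focal consolidation or pleural effusion."],
    "layman": ["The heart looks slightly enlarged. The lungs look clear, with no signs of infection or fluid."],
})

def preprocess(batch):
    # Tokenize the professional report as the input and the layman version as the target.
    inputs = tokenizer(batch["report"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["layman"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = train_pairs.map(preprocess, batched=True, remove_columns=train_pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="rrg-layman", num_train_epochs=3,
                                  per_device_train_batch_size=8, learning_rate=3e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```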
Subtask 2.2: Multi-modal Radiology Report Translation
Participants are expected to train an end-to-end image-to-text model, e.g., a multi-modal large language model (MLLM), for this task.
Two settings are provided: training on three datasets and training on four datasets. Submissions under these two settings will be evaluated separately, although we encourage participants to obtain access to the MIMIC-CXR dataset to push the boundary of performance.
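As a starting point for the multi-modal setting, the sketch below runs an off-the-shelf LLaVA checkpoint on a single radiograph; the checkpoint name, prompt wording, and image path are illustrative assumptions rather than the official baseline configuration.

```python
# A minimal image-to-text sketch for Subtask 2.2 using an off-the-shelf LLaVA checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chest_xray.png")  # hypothetical input radiograph
prompt = ("USER: <image>\nDescribe the findings in this chest X-ray in plain, "
          "non-technical language that a patient could understand. ASSISTANT:")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```

In practice, participants would fine-tune such a model on the provided image-report pairs rather than relying on zero-shot prompting alone.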
Note: We’re excited to have you join us! You can choose to take on one subtask or multiple — it’s entirely up to you!
Datasets
For Task 1: Lay Summarisation
The data used for this task is based on the PLOS and eLife datasets published in [3]. Each dataset consists of biomedical research articles (including their technical abstracts) and their expert-written lay summaries. The lay summaries of each dataset also exhibit numerous notable differences in their characteristics - for more details, please refer to [3].
PLOS is the larger of the two datasets, containing 24,773 instances for training and 1,376 for validation. eLife contains 4,346 instances for training and 241 for validation. The datasets can be downloaded from scientific_lay_summarisation.
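A minimal loading sketch with the Hugging Face datasets library is shown below; the Hub identifier and configuration names are assumptions based on the linked resource, and the concatenation illustrates the unified-model option described in Subtask 1.1.

```python
# A minimal sketch of loading both lay summarization datasets (identifier assumed).
from datasets import load_dataset, concatenate_datasets

plos = load_dataset("tomasg25/scientific_lay_summarisation", "plos")
elife = load_dataset("tomasg25/scientific_lay_summarisation", "elife")

# Option 1: two separate models, one trained on each dataset.
plos_train, elife_train = plos["train"], elife["train"]

# Option 2: a single unified model trained on the concatenation of both.
unified_train = concatenate_datasets([plos["train"], elife["train"]])
print(len(plos_train), len(elife_train), len(unified_train))
```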
For Task 2: Radiology Report Generation with Layman’s Terms
Four layman’s-terms datasets (PadChest, BIMCV-COVID19, Open-i, and MIMIC-CXR) will be provided for model training and used for evaluation. We will mainly rely on the non-MIMIC images because participants can access them easily. We allow the use of MIMIC-CXR data to achieve better results, and submissions trained on three datasets and on four datasets will be evaluated separately.
Evaluation
An automatic evaluation will be conducted upon submission of test-set predictions using CodaLab. After the test phase is complete, we will perform a manual system ranking process that involves normalising and averaging the metric scores.
Automatic Metrics
For Task 1, we will evaluate generated summaries across four aspects: Relevance, Readability, Factuality [8][9], and LLM-based metrics. Each evaluation aspect will be composed of one or more automatic metrics (a sketch of several of these metrics follows the list below):
- Relevance - ROUGE (1, 2, and L), BLEU, METEOR, BERTScore, semantic scores (e.g., based on GritLM)
- Readability - Flesch-Kincaid Grade Level (FKGL) and Dale-Chall Readability Score (DCRS), Coleman-Liau Index (CLI), and LENS
- Factuality - AlignScore, SummaC
- LLM-based metrics - DeepSeek
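The sketch below shows how several of the listed metrics can be computed with the evaluate and textstat packages; the exact configurations used in the official evaluation scripts may differ.

```python
# A minimal sketch of relevance and readability metrics for Task 1.
import evaluate
import textstat

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["The study found that a common gut bacterium helps digest fibre."]
references = ["Researchers showed that a widespread gut microbe aids fibre digestion."]

# Relevance: overlap- and embedding-based comparison with the reference lay summary.
relevance = rouge.compute(predictions=predictions, references=references)
bert = bertscore.compute(predictions=predictions, references=references, lang="en")

# Readability: computed on the generated summary alone (lower grade = more readable).
fkgl = textstat.flesch_kincaid_grade(predictions[0])
dcrs = textstat.dale_chall_readability_score(predictions[0])
cli = textstat.coleman_liau_index(predictions[0])

print(relevance["rougeL"], sum(bert["f1"]) / len(bert["f1"]), fkgl, dcrs, cli)
```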
For Task 2, we will evaluate generated summaries across three aspects: Relevance, Clinical, and LLM-based metrics. Each evaluation aspect will be composed of one or more automatic metrics:
- Relevance - ROUGE (1, 2, and L), BLEU, METEOR, BERTScore, semantic scores (e.g., based on GritLM)
- Clinical metrics - CheXbert-F1, RadGraph-F1, RadCliQ
- LLM-based metrics - DeepSeek
System Ranking
We will rank submissions based on each of these evaluation aspects independently. This will be done after the test phase has ended by applying min-max normalization to the scores of each metric, before averaging across metrics within each evaluation aspect. A ranking for each evaluation aspect will be computed as well as an overall ranking, which will be based on the best average score across all related aspects.
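The following sketch illustrates this normalization-and-averaging procedure for a single aspect; the metric names and scores are made up for illustration.

```python
# A minimal sketch of the ranking protocol: min-max normalize each metric across
# submissions, then average within an aspect.
import numpy as np

# rows = submissions, columns = metrics within one aspect (e.g., Relevance).
scores = np.array([
    [0.42, 0.18, 0.86],   # team A: ROUGE-L, BLEU, BERTScore
    [0.47, 0.22, 0.88],   # team B
    [0.39, 0.15, 0.84],   # team C
])

# Min-max normalize each metric (column) to [0, 1] across submissions.
normalized = (scores - scores.min(axis=0)) / (scores.max(axis=0) - scores.min(axis=0))

# Average normalized metrics within the aspect to obtain the aspect score.
aspect_score = normalized.mean(axis=1)
ranking = aspect_score.argsort()[::-1]  # best submission first
print(aspect_score, ranking)
```

For metrics where lower values indicate better performance (e.g., the readability scores), we assume the normalized values are inverted before averaging; the official scripts define the exact treatment.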
Baseline
For the text-only tasks (Task 1 and Subtask 2.1), Llama 3 8B / Qwen2.5 7B will be used as the primary baselines.
For the multi-modal task (Subtask 2.2), we will use a fine-tuned LLaVA as the fine-tuned baseline, and a stronger model (e.g., Qwen-VL 72B) as the off-the-shelf baseline.
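For reference, a zero-shot sketch of a text-only baseline is given below, using Hugging Face transformers; the checkpoint, prompt, and decoding settings are illustrative assumptions and may differ from the official baseline (which may also be fine-tuned).

```python
# A minimal zero-shot lay summarization sketch with an instruction-tuned LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # or "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def lay_summarize(abstract: str, max_new_tokens: int = 400) -> str:
    """Generate a plain-language summary of a technical abstract."""
    messages = [
        {"role": "system", "content": "You write plain-language summaries of biomedical research for a lay audience."},
        {"role": "user", "content": f"Summarize the following abstract in simple, non-technical language:\n\n{abstract}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```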
Promising Research Directions
- Retrieval-Augmented Generation (RAG) - Recent work has shown that the incorporation of external knowledge derived from graphs can help to improve the quality of generated summaries [4]. The automatic identification and retrieval of relevant textual knowledge is likely to provide similar benefits, and we encourage participants to explore this direction.
- Controllable Lay Summarisation - The lay summaries of PLOS and eLife exhibit numerous notable differences in their characteristics, including their length, readability, and abstractiveness [3]. We encourage participants to explore the use of controllable generation techniques to produce summaries that are tailored to the characteristics of each dataset and enable models to cater to the needs of different audiences (e.g., [7]); a minimal sketch of one such approach is given after this list.
- LLMs for Data Augmentation - In the first edition of BioLaySumm [2], we found that the use of large language models (LLMs) proved beneficial not only for summary generation [5], but also for data augmentation [6]. We encourage participants to explore this direction further, and to consider potential new ways to use LLMs for both summary generation and data augmentation.
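Below is a minimal sketch of one controllable-generation idea referenced above: prepending a dataset-specific control tag to each source article so that a single unified model learns to produce PLOS-style or eLife-style summaries on demand. The tag strings and field names are illustrative assumptions.

```python
# A minimal control-tag sketch for controllable lay summarisation.
def add_control_tag(example: dict, dataset_name: str) -> dict:
    """Mark each training example with its target summary style."""
    example["article"] = f"<{dataset_name}> " + example["article"]
    return example

# During training, tag each dataset before mixing them, e.g.:
#   plos_train.map(lambda ex: add_control_tag(ex, "plos"))
#   elife_train.map(lambda ex: add_control_tag(ex, "elife"))
# At inference, choosing the tag selects the desired style:
source = "<elife> " + "Full article text ..."
```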
System Paper Submission
All participating teams are invited to submit system papers that, pending review, will be published as part of the BioNLP Workshop proceedings.
Submission - Participants can submit a system paper at the following SoftConf link (track “ST_2”) anytime before the deadline on 20/05/2025 (23:59:59, Anywhere on Earth timezone): https://softconf.com/acl2025/BioNLP2025-ST/
Format - System papers should follow the ACL 2025 short paper format (i.e., 4 pages, with unlimited pages for appendices and references), as described on the call for papers. The Overleaf template for this can be found here.
Paper titles should adopt the format: “{TEAM_NAME} at BioLaySumm:” followed by a descriptive title of the proposed approach. Papers should be submitted in non-anonymised format (i.e., with author names included).
References - We ask participants to ensure the following citations are included in their system papers:
-
BioLaySumm 2025 Overview Paper - the following citation should be used when referring to the shared task in general (note, this is a temporary example that may be subject to change in the future):
@inproceedings{goldsack-etal-2025-biolaysumm,
title = "Overview of the BioLaySumm 2025 Shared Task on the Lay Summarization of Biomedical Research Articles",
author = "Goldsack, Tomas and
Lin, Chenghua",
booktitle = "The 24th Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks",
month = aug,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
}
Participants may also want to include the overview papers from the last two years when referring to the shared task:
@inproceedings{goldsack-etal-2024-overview,
title = "Overview of the {B}io{L}ay{S}umm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles",
author = "Goldsack, Tomas and
Scarton, Carolina and
Shardlow, Matthew and
Lin, Chenghua",
editor = "Demner-Fushman, Dina and
Ananiadou, Sophia and
Miwa, Makoto and
Roberts, Kirk and
Tsujii, Junichi",
booktitle = "Proceedings of the 23rd Workshop on Biomedical Natural Language Processing",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.bionlp-1.10/",
doi = "10.18653/v1/2024.bionlp-1.10",
pages = "122--131",
}
@inproceedings{goldsack-etal-2023-biolaysumm,
title = "Overview of the BioLaySumm 2023 Shared Task on Lay Summarization of Biomedical Research Articles",
author = "Goldsack, Tomas and
Luo, Zheheng and
Xie, Qianqian and
Scarton, Carolina and
Shardlow, Matthew and
Ananiadou, Sophia and
Lin, Chenghua",
booktitle = "The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.bionlp-1.44",
doi = "10.18653/v1/2023.bionlp-1.44",
pages = "468--477",
}
- Task datasets - the following citation should be used when referring to the task datasets:
@article{zhao2024x,
title={X-ray Made Simple: Radiology Report Generation and Evaluation with Layman's Terms},
author={Zhao, Kun and Xiao, Chenghao and Tang, Chen and Yang, Bohao and Ye, Kai and Moubayed, Noura Al and Zhan, Liang and Lin, Chenghua},
journal={arXiv preprint arXiv:2406.17911},
year={2024}
}
@inproceedings{goldsack-etal-2022-making,
title = "Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature",
author = "Goldsack, Tomas and
Zhang, Zhihao and
Lin, Chenghua and
Scarton, Carolina",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.724",
pages = "10589--10604",
}
Reviewer Nomination - Similar to other shared task campaigns (e.g., SemEval), we require that at least one author per paper also acts as a reviewer for our shared task papers. Please nominate the reviewer from your submission using this form. If you do not nominate a reviewer, the corresponding author(s) will be automatically selected.
Organizers
- Kun Zhao, University of Pittsburgh
- Prof. Liang Zhan, University of Pittsburgh
- Chenghao Xiao, University of Durham
- Prof. Noura Al Moubayed, University of Durham
- Kejing Yin, Hong Kong Baptist University
- Sixing Yan, Hong Kong Baptist University
- Zijian Lei, Hong Kong Baptist University
- Prof. William CHEUNG, Hong Kong Baptist University
- Dr. Qianqian Xie, Yale University
- Zheheng Luo, University of Manchester
- Prof. Sophia Ananiadou, University of Manchester
- Tomas Goldsack, University of Sheffield
- Siwei Wu, University of Manchester
- Xiao Wang, University of Manchester
- Prof. Chenghua Lin, University of Manchester
References
[1] Tomas Goldsack, Carolina Scarton, Matthew Shardlow, and Chenghua Lin. 2024. Overview of the BioLaySumm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 122–131, Bangkok, Thailand. Association for Computational Linguistics.
[2] Tomas Goldsack, Zheheng Luo, Qianqian Xie, Carolina Scarton, Sophia Ananiadou, Chenghua Lin. 2023. Overview of the BioLaySumm 2023 Shared Task on Lay Summarization of Biomedical Research Articles. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 468–477, Toronto, Canada. Association for Computational Linguistics.
[3] Tomas Goldsack, Zhihao Zhang, Chenghua Lin, Carolina Scarton. 2022. Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10589–10604, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
[4] Tomas Goldsack, Zhihao Zhang, Chen Tang, Carolina Scarton, and Chenghua Lin. 2023. Enhancing Biomedical Lay Summarisation with External Knowledge Graphs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8016–8032, Singapore. Association for Computational Linguistics.
[5] Oisín Turbitt, Robert Bevan, and Mouhamad Aboshokor. 2023. MDC at BioLaySumm Task 1: Evaluating GPT Models for Biomedical Lay Summarization. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 611–619, Toronto, Canada. Association for Computational Linguistics.
[6] Mong Yuan Sim, Xiang Dai, Maciej Rybinski, and Sarvnaz Karimi. 2023. CSIRO Data61 Team at BioLaySumm Task 1: Lay Summarisation of Biomedical Research Articles Using Generative Models. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 629–635, Toronto, Canada. Association for Computational Linguistics.
[7] Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. 2022. Readability Controllable Biomedical Document Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4667–4680, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
[8] Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. 2024. Factual consistency evaluation of summarization in the era of large language models. Expert Systems with Applications, Volume 254, 124456. https://doi.org/10.1016/j.eswa.2024.124456
[9] Jennifer A. Bishop, Sophia Ananiadou, and Qianqian Xie. 2024. LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10777–10789, Torino, Italia. ELRA and ICCL.