BioLaySumm 2024

Shared Task: Lay Summarization of Biomedical Research Articles @ BioNLP Workshop, ACL 2024


Biomedical publications contain the latest research on prominent health-related topics, ranging from common illnesses to global pandemics. As a result, their content is often of interest to a wide variety of audiences, including researchers, medical professionals, journalists, and even members of the public. However, the highly technical and specialist language used within such articles typically makes it difficult for non-expert audiences to understand their contents.

The BioLaySumm shared task focuses on the abstractive summarization of biomedical articles, with an emphasis on catering to non-expert audiences by generating summaries that are more readable, contain more background information, and use less technical terminology (i.e., “lay summaries”).

This is the 2nd iteration of BioLaySumm, following the success of the 1st edition of the task at BioNLP 2023 [1], which attracted 56 submissions across 20 different teams. In this edition, which is to be hosted by the BioNLP workshop at ACL 2024, we aim to build on last year’s task by introducing a new test set, updating our evaluation protocol, and encouraging participants to explore novel approaches that will help to further advance the state of the art in Lay Summarization. Accordingly, we will not only offer a prize of £100 to the team with the top-ranking submission, but also a second prize of £50 to the team that proposes the most innovative approach (as decided by the task organisers). For inspiration, we provide some ideas of promising research directions below.


Important Dates

Note that all deadlines are 23:59:59 AoE (UTC-12).

Join our Google Group for updates and discussion on the shared task! If you have any questions, please ask in the Google Group or email us.

Registration and Submission

CodaBench page:

Task Definition: Lay Summarization

Given an article’s abstract and main text as input, the goal is for participants to train a model (or models) to generate the lay summary. Two separate datasets (derived from biomedical journals, PLOS and eLife) are provided for model training and will be used for evaluation. For the final evaluation, submissions will be ranked on the average performance across both datasets.

Note that submissions can be generated either by two separate summarization models (i.e., one trained on each dataset) or by a single unified model (i.e., trained on both datasets). Participants will be required to indicate which approach was taken for each submission.


The data used for the task is based on the PLOS and eLife datasets published in [2]. Each dataset consists of biomedical research articles (including their technical abstracts) and their expert-written lay summaries. The lay summaries of the two datasets also differ notably in their characteristics; for more details, please refer to [2].

PLOS is the larger of the two datasets, containing 24,773 instances for training and 1,376 for validation. eLife contains 4,346 instances for training and 241 for validation.

The test data is composed of 142 PLOS articles and 142 eLife articles. Note: These test splits are different from those published in [2] and used for the 1st edition of BioLaySumm [1].

All task data is provided via the competition CodaBench page under the “Files” tab.


We will evaluate generated summaries across three aspects: Relevance, Readability, and Factuality. Each evaluation aspect will be composed of multiple automatic metrics:

The scores presented for each metric will be the average of those calculated independently for the generated lay summaries of PLOS and eLife. The aim is to maximize the scores for Relevance metrics, Factuality metrics, and the LENS (Readability) metric and minimize scores for all other Readability metrics.
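The averaging step can be sketched as follows. This is only a minimal illustration, assuming per-dataset scores are held in plain dictionaries; the metric names other than LENS and all numeric values are made-up placeholders, not task results.

```python
# Hypothetical per-dataset scores. Only LENS is named in the task description;
# the other metric names and every value here are illustrative placeholders.
plos_scores = {"relevance_metric": 0.40, "LENS": 60.0, "other_readability_metric": 12.0}
elife_scores = {"relevance_metric": 0.50, "LENS": 70.0, "other_readability_metric": 10.0}


def average_across_datasets(plos, elife):
    """Average each metric over the two datasets, weighting each dataset equally."""
    if plos.keys() != elife.keys():
        raise ValueError("both datasets must report the same metrics")
    return {name: (plos[name] + elife[name]) / 2 for name in plos}


reported = average_across_datasets(plos_scores, elife_scores)
print(reported)
```

In this sketch, a higher averaged value is better for the Relevance placeholder and for LENS, while a lower value is better for the other Readability placeholder.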

We will rank submissions based on each of these evaluation aspects independently. This will be done after the test phase has ended by applying min-max normalization to the scores of each metric, before averaging across metrics within each evaluation aspect. A ranking for each evaluation aspect will be computed as well as an overall ranking, which will be based on the best average score across all three aspects.
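The ranking procedure described above can be sketched roughly as follows. This is not the organisers' evaluation code: the metric names (apart from LENS), the team names, and all scores are hypothetical, and two details are assumptions made for illustration, namely that metrics to be minimized are flipped as 1 minus the normalized value, and that a metric on which all submissions tie normalizes to 0.

```python
from statistics import mean

# Hypothetical metric names (only LENS appears in the task description).
# True means higher is better; False means lower is better.
ASPECTS = {
    "relevance": {"rel_a": True, "rel_b": True},
    "readability": {"read_a": False, "LENS": True},
    "factuality": {"fact_a": True},
}

# Made-up scores for three imaginary submissions (already averaged over PLOS and eLife).
SUBMISSIONS = {
    "team_a": {"rel_a": 0.50, "rel_b": 0.90, "read_a": 10.0, "LENS": 70.0, "fact_a": 0.80},
    "team_b": {"rel_a": 0.40, "rel_b": 0.80, "read_a": 12.0, "LENS": 60.0, "fact_a": 0.90},
    "team_c": {"rel_a": 0.30, "rel_b": 0.85, "read_a": 14.0, "LENS": 50.0, "fact_a": 0.70},
}


def min_max(values):
    """Rescale a {team: score} mapping to the range [0, 1] across submissions."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption: if every submission ties on a metric, all receive 0.
        return {team: 0.0 for team in values}
    return {team: (v - lo) / (hi - lo) for team, v in values.items()}


def aspect_scores(submissions, aspects):
    """Normalize each metric, then average the metrics within each aspect."""
    scores = {team: {} for team in submissions}
    for aspect, metrics in aspects.items():
        per_metric = []
        for metric, higher_is_better in metrics.items():
            norm = min_max({team: s[metric] for team, s in submissions.items()})
            if not higher_is_better:
                # Assumption: flip minimized metrics so that 1.0 is always best.
                norm = {team: 1.0 - v for team, v in norm.items()}
            per_metric.append(norm)
        for team in submissions:
            scores[team][aspect] = mean(n[team] for n in per_metric)
    return scores


def ranking(score_by_team):
    """Return team names ordered from best to worst score."""
    return sorted(score_by_team, key=score_by_team.get, reverse=True)


scores = aspect_scores(SUBMISSIONS, ASPECTS)
overall = {team: mean(by_aspect.values()) for team, by_aspect in scores.items()}
factuality_ranking = ranking({team: s["factuality"] for team, s in scores.items()})
overall_ranking = ranking(overall)
print(factuality_ranking, overall_ranking)
```

Because normalization is computed across all submissions, a team's normalized score depends on the other entries, which is consistent with the ranking being computed only after the test phase has ended.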

The evaluation scripts (configured for use on the validation split) can be found here.

A worked example of the described process is provided here.

Promising Research Directions



[1] Tomas Goldsack, Zheheng Luo, Qianqian Xie, Carolina Scarton, Sophia Ananiadou, and Chenghua Lin. 2023. Overview of the BioLaySumm 2023 Shared Task on Lay Summarization of Biomedical Research Articles. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 468–477, Toronto, Canada. Association for Computational Linguistics.

[2] Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton. 2022. Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10589–10604, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

[3] Tomas Goldsack, Zhihao Zhang, Chen Tang, Carolina Scarton, and Chenghua Lin. 2023. Enhancing Biomedical Lay Summarisation with External Knowledge Graphs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8016–8032, Singapore. Association for Computational Linguistics.

[4] Oisín Turbitt, Robert Bevan, and Mouhamad Aboshokor. 2023. MDC at BioLaySumm Task 1: Evaluating GPT Models for Biomedical Lay Summarization. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 611–619, Toronto, Canada. Association for Computational Linguistics.

[5] Mong Yuan Sim, Xiang Dai, Maciej Rybinski, and Sarvnaz Karimi. 2023. CSIRO Data61 Team at BioLaySumm Task 1: Lay Summarisation of Biomedical Research Articles Using Generative Models. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 629–635, Toronto, Canada. Association for Computational Linguistics.

[6] Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. 2022. Readability Controllable Biomedical Document Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4667–4680, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.