
TL;DR

Scientific Abstract Simplification (SAS) rewrites complex scientific abstracts into simpler, more comprehensible versions. Our goal is to make scientific knowledge accessible to everyone. If you have already tried our baseline model (sas_baseline), you will find that the current model outperforms it on all evaluation metrics. Feel free to try it with the Hosted Inference API to the right: select one of the provided examples or paste in your own scientific abstract. Just make sure to prepend your text with the instruction "summarize, simplify, and contextualize: " (note the trailing space). For local use, see the Usage section.

Project Description

Open science has significantly lowered the barriers to accessing scientific papers. However, obtainable research does not equal accessible knowledge: many people would rather rely on short social-media narratives than work through a scientific paper. This preference is understandable, since humans tend to favor narratives over dry, technical prose. So why not "translate" intricate scientific abstracts into simpler, more accessible narratives? Several prestigious journals have already taken steps in this direction. For instance, PNAS asks authors to submit a Significance Statement understandable to an "undergraduate-educated scientist," and Science prefaces each paper with an editor's abstract that gives a quick overview of its salient points.

In this project, we use AI to rewrite scientific abstracts into easily understandable scientific narratives. To that end, we curated two new datasets: one containing PNAS abstract-significance pairs and one containing editor abstracts from Science. We fine-tune a Transformer model (a variant known as Flan-T5) for the task of simplifying scientific abstracts. The model is first fine-tuned with multiple discrete instructions by combining four relevant tasks in a challenge-proportional manner (a strategy we call Multi-Instruction Pretuning), and then fine-tuned further on the abstract-significance corpus alone. The resulting model generates lay summaries that outperform both models fine-tuned solely on the abstract-significance corpus and models fine-tuned with conventional task mixtures. We hope this work fosters a broader understanding of scientific research, enabling a larger audience to benefit from open science.

Usage

Use the code below to get started with the model. Remember to prepend the INSTRUCTION for best performance.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The instruction must be prepended to every input (note the trailing space).
INSTRUCTION = "summarize, simplify, and contextualize: "

tokenizer = AutoTokenizer.from_pretrained("haining/scientific_abstract_simplification")
model = AutoModelForSeq2SeqLM.from_pretrained("haining/scientific_abstract_simplification")

input_text = "The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making."

# Tokenize the instruction plus abstract, padding/truncating to the model's input budget.
encoding = tokenizer(INSTRUCTION + input_text,
                     max_length=672,
                     padding='max_length',
                     truncation=True,
                     return_tensors='pt')

# Generate with nucleus (top-p) sampling, matching the settings used in evaluation.
with torch.no_grad():
    decoded_ids = model.generate(input_ids=encoding['input_ids'],
                                 attention_mask=encoding['attention_mask'],
                                 max_length=512,
                                 top_p=0.9,
                                 do_sample=True)
print(tokenizer.decode(decoded_ids[0], skip_special_tokens=True))

Training

Data

| Corpus | # Training/Dev/Test Samples | # Training Tokens (source, target) | # Validation Tokens (source, target) | # Test Tokens (source, target) | Note |
|---|---|---|---|---|---|
| Scientific Abstract-Significance | 3,030/200/200 | 707,071, 375,433 | 45,697, 24,901 | 46,985, 24,426 | - |
| Editor Abstract | 732/91/92 | 154,808, 194,721 | 19,675, 24,421 | 19,539, 24,332 | - |
| Wiki Auto | 28,364/1,000/1,000 | 18,239,990, 12,547,272 | 643,157, 444,034 | 642,549, 444,883 | We used the ACL version, taken from Hugging Face Datasets. The validation and test samples are split from the corpus and kept frozen. |
| CNN/DailyMail | 287,113/13,368/11,490 | - | - | - | We used the 2.0 version, taken from Hugging Face Datasets. |

Setup

We fine-tuned the base model (flan-t5-large) on multiple relevant tasks with the standard language-modeling loss. During training, the source text of each task is prepended with a task-specific instruction and mapped to the corresponding target text. For example, "simplify: " is prepended to a Wikipedia passage, and the whole input is fed into the model to align with the corresponding Simple Wikipedia passage. The tuning process has two steps: multi-instruction pretuning over the task mixture below, followed by retuning on the abstract-significance corpus alone.

| Task | Corpus | Instruction | Optimal samples |
|---|---|---|---|
| Scientific Abstract Simplification | Scientific Abstract-Significance | "summarize, simplify, and contextualize: " | 39,200 |
| Recontextualization | Editor Abstract | "contextualize: " | 2,200 |
| Simplification | Wiki Auto | "simplify: " | 57,000 |
| Summarization | CNN/DailyMail | "summarize: " | 165,000 |
| Total | Challenge-proportional Mixing | n/a | 263,400 |
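The "Optimal samples" column determines how many examples each task contributes to the mixed pretuning pool. A minimal sketch of this challenge-proportional mixing, assuming each corpus is a list of (source, target) pairs; the function name and the sampling-with-replacement detail are our illustration, not the exact training code:

```python
import random

def mix_challenge_proportional(corpora, optimal_samples, seed=42):
    """Build a mixed training pool in which each task contributes its
    'optimal' number of examples, each prefixed with its instruction.

    corpora:         dict mapping instruction -> list of (source, target) pairs
    optimal_samples: dict mapping instruction -> number of examples to draw
    """
    rng = random.Random(seed)
    mixed = []
    for instruction, examples in corpora.items():
        # Sample with replacement so small corpora (e.g. Editor Abstract)
        # can still reach their target share of the mixture.
        for _ in range(optimal_samples[instruction]):
            source, target = rng.choice(examples)
            mixed.append((instruction + source, target))
    rng.shuffle(mixed)
    return mixed
```

With the counts from the table (39,200 + 2,200 + 57,000 + 165,000), this yields the 263,400-example pretuning mixture.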

The multi-instruction tuning and the retuning took roughly 63 hours and 8 hours, respectively, on two NVIDIA RTX A5000 GPUs (24 GB memory each). We saved the checkpoint with the lowest validation loss for inference. We used the AdamW optimizer with a learning rate of 3e-5 and the fully sharded data parallel strategy across both training stages. The batch size was 1.

Evaluation

The model is evaluated on the SAS test set using SacreBLEU, METEOR, BERTScore, ROUGE, SARI, and ARI.

Metrics

<details> <summary> Click to expand </summary>

Implementations of SacreBLEU, BERTScore, ROUGE, METEOR, and SARI are from Hugging Face evaluate v0.3.0. ARI is from py-readability-metrics v1.4.5.

</details>
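For reference, ARI is a simple surface-level readability formula over characters, words, and sentences. A self-contained sketch (our own heuristic tokenization; py-readability-metrics handles edge cases more carefully):

```python
import re

def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43.

    Characters count letters and digits only; sentences are split on runs
    of ., !, or ? -- a rough heuristic for illustration.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        raise ValueError("text contains no words")
    chars = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43
```

Lower ARI means easier text, which is why the table below reports ARI with a downward arrow.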

Results

We tested our model on the SAS test set (200 samples), generating 10 lay summaries per abstract with top-p sampling (p = 0.9). Mean performance over the resulting 2,000 generations is reported below.

| Metric | SAS |
|---|---|
| SacreBLEU↑ | 25.60 |
| BERT Score F1↑ | 90.14 |
| ROUGE-1↑ | 52.28 |
| ROUGE-2↑ | 29.61 |
| ROUGE-L↑ | 38.02 |
| METEOR↑ | 43.75 |
| SARI↑ | 51.96 |
| ARI↓ | 17.04 |
Notes: 1. Some generated texts are too short (fewer than 100 words) to yield a meaningful ARI, so we concatenated every five adjacent texts and computed ARI on the resulting 400 longer texts (instead of the original 2,000). 2. BERT Score, ROUGE, and METEOR are multiplied by 100.
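The concatenation step in note 1 can be sketched as follows (the function name is illustrative):

```python
def concat_adjacent(texts, group_size=5):
    """Join each run of `group_size` adjacent texts into one longer text,
    so readability scores like ARI are computed on passages long enough
    to be meaningful."""
    return [" ".join(texts[i:i + group_size])
            for i in range(0, len(texts), group_size)]
```

Applied to the 2,000 generations, this yields the 400 longer texts whose mean ARI is reported in the table.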

Contact

Please contact us for any questions or suggestions.

Disclaimer

This model is designed to make scientific abstracts more accessible. Its outputs should not be relied upon for any purpose outside of this scope. There is no guarantee that the generated text accurately reflects the research it is based on. When making important decisions, it is recommended to seek the advice of human experts or consult the original papers.

Acknowledgement

This research is supported by the Institute of Museum and Library Services (IMLS) RE-246450-OLS-20.