text2text generation

TL;DR

Our full model is out!🎉🎉🎉 It leverages the power of multi-instruction finetuning and beats the baseline by a margin. Use the full model unless the goal is comparison.

Scientific Abstract Simplification rewrites hard-to-read scientific abstracts😵 into simpler yet relevant scientific stories😇. We hope our model can make scientific knowledge accessible for everyone🤗.

Try it now with the Hosted inference API on the right. You can choose an existing example or paste in any (perhaps jargon-heavy) abstract. Remember to prepend the instruction to the abstract ("summarize, simplify, and contextualize: "; note the whitespace after the colon). For local use, see the Usage section.

Model Details

Model Description

Open science has significantly lowered the barriers to scientific papers. However, reachable research does not mean accessible knowledge. Scientific papers are usually replete with jargon and hard to read. A lay audience would rather trust little stories on social media than read scientific papers. They are not to blame: we humans like stories. So why don't we "translate" arcane scientific abstracts into simpler yet relevant scientific stories? Some renowned journals have already taken accessibility into consideration. For example, PNAS asks authors to submit Significance Statements targeting "an undergraduate-educated scientist." Science also includes an editor's abstract for a quick dive.

We therefore propose to rewrite scientific abstracts into understandable scientific stories using AI. To this end, we introduce a new corpus comprising PNAS abstract-significance pairs. We finetune an encoder-decoder Transformer model (a variant of Flan-T5) with the corpus. Our baseline model (SAS-baseline) shows promising capacity in simplifying and summarizing scientific abstracts. We hope our work can pave the last mile of scientific understanding and let people better enjoy the fruits of open science.

As an ongoing effort, we are working on re-contextualizing abstracts for better storytelling and on avoiding certain jargon tokens at inference time for better readability.

<!-- We hypothesize the last mile of scientific understanding is cognitive. -->

Usage

Use the code below to get started with the model. Remember to prepend the INSTRUCTION for best performance.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
INSTRUCTION = "summarize, simplify, and contextualize: "

tokenizer = AutoTokenizer.from_pretrained("haining/sas_baseline")

model = AutoModelForSeq2SeqLM.from_pretrained("haining/sas_baseline")

input_text = "The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making."

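# Tokenize the instruction-prefixed abstract; inputs are padded or truncated to 672 tokens.
encoding = tokenizer(INSTRUCTION + input_text,
                     max_length=672,
                     padding='max_length',
                     truncation=True,
                     return_tensors='pt')
# Generate with nucleus (top-p) sampling, so the output varies between runs.
decoded_ids = model.generate(input_ids=encoding['input_ids'],
                             attention_mask=encoding['attention_mask'],
                             max_length=512,
                             top_p=.9,
                             do_sample=True)

# Decode the first (and only) generated sequence back into text.
print(tokenizer.decode(decoded_ids[0], skip_special_tokens=True))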
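
When simplifying several abstracts at once, the same steps can be wrapped in a small helper. The sketch below is only an illustration: `build_prompts` and `simplify_batch` are hypothetical names, not part of the released code, and the helper assumes `model` and `tokenizer` were loaded as shown above. Unlike the single-abstract example, it pads each batch to its longest prompt rather than to 672 tokens.

```python
INSTRUCTION = "summarize, simplify, and contextualize: "


def build_prompts(abstracts):
    """Prepend the task instruction to each abstract, stripping stray whitespace."""
    return [INSTRUCTION + abstract.strip() for abstract in abstracts]


def simplify_batch(abstracts, model, tokenizer, max_input_len=672, max_output_len=512):
    """Generate one lay summary per abstract (hypothetical helper, not the authors' code)."""
    encoding = tokenizer(build_prompts(abstracts),
                         max_length=max_input_len,
                         padding=True,  # pad only to the longest prompt in the batch
                         truncation=True,
                         return_tensors="pt")
    output_ids = model.generate(input_ids=encoding["input_ids"],
                                attention_mask=encoding["attention_mask"],
                                max_length=max_output_len,
                                top_p=0.9,
                                do_sample=True)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Calling `simplify_batch([abstract_1, abstract_2], model, tokenizer)` would return a list with one generated text per abstract, in input order.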

Training

Data

For SAS-baseline, we finetuned a Flan-T5 model on the Scientific Abstract-Significance (SAS) corpus.

| Scientific Abstract-Significance | # Training/Dev/Test Samples | # Training Tokens | # Validation Tokens | # Test Tokens | Automated Readability Index (std.) |
|---|---|---|---|---|---|
| Abstract | 3030/200/200 | 707,071 | 45,697 | 46,985 | 18.68 (2.85) |
| Significance | 3030/200/200 | 375,433 | 24,901 | 24,426 | 17.89 (3.05) |

Setup

We finetuned the base model with a standard language modeling objective: the abstracts are sources and the significance statements are targets. We inform the model with a task-specific prefix ("summarize, simplify, and contextualize: ") during training. Training took roughly 9 hours on two NVIDIA RTX A5000 GPUs (24GB memory each). We saved the checkpoint with the lowest validation loss for inference. We used the AdamW optimizer and a learning rate of 3e-5 with a fully sharded data parallel strategy. The model (~780M parameters) was trained on Nov. 20, 2022. Note that the significance statements are generally easier to read than the abstracts (mean ARI of 17.89 vs. 18.68, where lower is easier), but not by a large margin. Our upcoming SAS-full model will leverage more corpora for scientific (re)contextualization, summarization, and simplification.
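
To make the setup concrete, here is a minimal sketch of two of the steps described above: building a source/target pair with the task prefix, and picking the checkpoint with the lowest validation loss. The function names, checkpoint names, and loss values are hypothetical and made up for illustration; this is not the actual training script.

```python
PREFIX = "summarize, simplify, and contextualize: "


def make_example(abstract, significance):
    """Build one training pair: prefixed abstract as source, significance statement as target."""
    return {"source": PREFIX + abstract.strip(), "target": significance.strip()}


def best_checkpoint(val_losses):
    """Return the checkpoint name with the lowest validation loss.

    `val_losses` maps a checkpoint name to its validation loss.
    """
    return min(val_losses, key=val_losses.get)


# Illustrative values only; these are not the losses observed during training.
example = make_example("We study X in detail.", "X matters because of Y.")
chosen = best_checkpoint({"step-1000": 2.31, "step-2000": 2.12, "step-3000": 2.18})
```

In this toy run, `chosen` is the checkpoint whose validation loss is smallest among the three made-up entries.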

Evaluation

The model is evaluated on the SAS test set using the following metrics.

Metrics

Implementations of SacreBLEU, BERTScore, ROUGE, METEOR, and SARI are from Hugging Face evaluate v.0.3.0. ARI is from py-readability-metrics v.1.4.5.
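
For readers unfamiliar with ARI, the standard formula is 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43, where lower scores indicate easier text. The sketch below applies that formula with deliberately crude regex-based word and sentence splitting, which is an assumption made for illustration; it is not the py-readability-metrics implementation, so its scores will not exactly match the reported numbers.

```python
import re


def automated_readability_index(text):
    """Approximate ARI using naive sentence and word splitting (illustrative only)."""
    # Naive sentence split on terminal punctuation; this also splits decimals such as "0.9".
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters/digits, optionally joined by hyphens or apostrophes.
    words = re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    # Count only letters and digits as characters.
    characters = sum(ch.isalnum() for word in words for ch in word)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Longer words and longer sentences both push the score up, which is why jargon-heavy abstracts tend to score higher than plain-language text.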

Results

We tested our model on the SAS test set (200 samples). We generated 10 lay summaries from each sample's abstract, using top-p sampling with p=0.9. The mean performance is reported below.

| Metrics | SAS-baseline |
|---|---|
| SacreBLEU↑ | 18.43 |
| BERTScore F1↑ | 89.31 |
| ROUGE-1↑ | 48.14 |
| ROUGE-2↑ | 22.96 |
| ROUGE-L↑ | 32.29 |
| METEOR↑ | 39.04 |
| SARI↑ | 46.68 |
| ARI↓ | 17.27 |
Note: 1. Some generated texts are too short (fewer than 100 words) to calculate a meaningful ARI. We therefore concatenated every five adjacent texts and computed ARI on the resulting 400 longer texts (instead of the original 2,000 texts). 2. BERTScore, ROUGE, and METEOR are multiplied by 100.
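
The grouping described in Note 1 can be sketched as follows. This is a minimal illustration, assuming the texts are joined with a single space; `concatenate_in_groups` is a hypothetical name and the placeholder strings stand in for real generations.

```python
def concatenate_in_groups(texts, group_size=5):
    """Join each run of `group_size` adjacent texts into one longer text."""
    return [" ".join(texts[i:i + group_size])
            for i in range(0, len(texts), group_size)]


# 200 test samples x 10 generations each = 2,000 texts (placeholders here).
generated = [f"text{i}" for i in range(2000)]
longer_texts = concatenate_in_groups(generated)
```

With 2,000 generated texts and groups of five, this yields the 400 longer texts on which ARI is computed. If the texts are ordered by sample, each group of five would come from the same abstract.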

Contact

Please contact us for any questions or suggestions.

Disclaimer

This model was created to make scientific abstracts more accessible. Its outputs should not be used or trusted outside of that scope. There is no guarantee that the generated text is perfectly aligned with the research. Consult human experts or the original papers when a decision is critical.

Acknowledgement

This research is supported by the Institute of Museum and Library Services (IMLS) RE-246450-OLS-20.