long-t5-tglobal-base-16384 + BookSum

Summarize long text and get a SparkNotes-esque summary of arbitrary topics!

Generalizes reasonably well to academic and narrative text.
A simple example/use case on ASR is here.
Example notebook in Colab (click on the icon above).

Cheeky Proof-of-Concept

A summary of the infamous navy seals copypasta:

The narrator tells us that he's graduated from the Navy seals and has been involved in many secret raids. He's also one of the best snipers in the entire U.S. military. He promises to "wipe you out with precision" when they meet again.

Contents

Model description
How-To in Python
Intended uses & limitations
Training and evaluation data
FAQ
Training procedure
Citation info

Model description

A fine-tuned version of google/long-t5-tglobal-base on the kmfoda/booksum dataset:

30+ epochs of fine-tuning from the base model on V100/A100 GPUs
Training used 16384 token input / 1024 max output

Read the paper by Guo et al. here: LongT5: Efficient Text-To-Text Transformer for Long Sequences

How-To in Python

Install/update transformers: pip install -U transformers

Summarize text with pipeline:

import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)
long_text = "Here is a lot of text I don't want to read. Replace me"

result = summarizer(long_text)
print(result[0]["summary_text"])

For higher-quality results, pass beam-search generation parameters (for example num_beams or no_repeat_ngram_size) when calling summarizer.
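As a rough sketch of what passing generation parameters can look like: the keyword names below are standard `generate()` arguments in transformers, but the specific values and the `summarize` helper are illustrative assumptions, not the settings tuned for this model.

```python
# Illustrative beam-search settings; the values are assumptions,
# not the configuration used to produce this model's reported results.
GEN_KWARGS = {
    "max_length": 256,          # upper bound on summary length, in tokens
    "min_length": 8,            # lower bound on summary length, in tokens
    "num_beams": 4,             # beam search width
    "no_repeat_ngram_size": 3,  # block repeated trigrams
    "repetition_penalty": 2.5,  # penalize tokens that were already generated
    "early_stopping": True,     # stop once enough beams have finished
}


def summarize(summarizer, text, **overrides):
    """Call a summarization pipeline with the defaults above (hypothetical helper).

    `summarizer` is any callable that behaves like the pipeline object
    created earlier; `overrides` replace individual default settings.
    """
    params = {**GEN_KWARGS, **overrides}
    result = summarizer(text, **params)
    return result[0]["summary_text"]
```

With the `summarizer` pipeline from the example above, `summarize(summarizer, long_text, num_beams=8)` would widen the beam while keeping the other settings. Wider beams and longer outputs cost more memory and time, so the values are worth adjusting to the hardware at hand.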

Intended uses & limitations

The current checkpoint is fairly well converged but will be updated if further improvements can be made.
Compare performance to LED-base trained on the same dataset (API generation parameters are the same).
While this model seems to improve on factual consistency, do not treat summaries as foolproof; check anything that seems odd.

Training and evaluation data

kmfoda/booksum dataset on HuggingFace - read the original paper here. Summaries longer than 1024 LongT5 tokens were filtered out to prevent the model from learning to generate "partial" summaries.
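The length filter described above can be sketched roughly as follows. This is not the preprocessing code used for training; the helper name, the `summary_text` field name, and the injected `count_tokens` callable are assumptions made for illustration.

```python
def filter_by_token_count(examples, count_tokens, max_tokens=1024, key="summary_text"):
    """Keep only examples whose summary has at most `max_tokens` tokens.

    `examples` is a list of dicts, `count_tokens` maps a string to its
    token count, and `key` names the summary field (assumed here).
    """
    return [ex for ex in examples if count_tokens(ex[key]) <= max_tokens]


# With a real tokenizer, the counter could be built like this (assumption:
# the LongT5 tokenizer is loaded through transformers' AutoTokenizer):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
#   count_tokens = lambda text: len(tok(text)["input_ids"])
```

Passing the counter in as a callable keeps the filter independent of any particular tokenizer and makes explicit that the limit is measured in tokens rather than characters.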

FAQ

How to run inference over a very long (30k+ tokens) document in batches?

See summarize.py in the code for my hf space Document Summarization :)

You can also use the same code to split a document into batches of 4096, etc., and run over those with the model. This is useful in situations where CUDA memory is limited.
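A minimal sketch of the batching idea, assuming the document has already been tokenized into a list of token ids. It is a stand-in for illustration, not the code in summarize.py; the function name and the `stride` overlap parameter are hypothetical.

```python
def chunk_token_ids(token_ids, batch_length=4096, stride=0):
    """Split a list of token ids into windows of at most `batch_length`.

    `stride` is the number of tokens shared between consecutive windows,
    which can help preserve context across chunk boundaries.
    """
    if batch_length <= 0:
        raise ValueError("batch_length must be positive")
    if not 0 <= stride < batch_length:
        raise ValueError("stride must satisfy 0 <= stride < batch_length")

    step = batch_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + batch_length])
        if start + batch_length >= len(token_ids):
            break  # the last window already reaches the end of the document
    return chunks


# Possible usage with a transformers tokenizer `tok` and the `summarizer`
# pipeline from the How-To section:
#   ids = tok(long_text, add_special_tokens=False)["input_ids"]
#   parts = [tok.decode(c, skip_special_tokens=True) for c in chunk_token_ids(ids)]
#   summaries = [summarizer(p)[0]["summary_text"] for p in parts]
```

Each chunk is summarized independently, so the result is a list of partial summaries rather than one summary of the whole document; a smaller `batch_length` lowers peak CUDA memory at the cost of less context per chunk.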

How to fine-tune further?

See train with a script and the summarization scripts.

This model was originally tuned on Google Colab with a heavily modified variant of the longformer training notebook; the key enabler was DeepSpeed. You can try this as an alternate route to fine-tuning the model without using the command line.

Are there simpler ways to run this?

Yes. For exactly this reason, I created a Python package utility called textsum; you can use it to load models and summarize things in a few lines of code.

pip install textsum

Use textsum in python with this model:

from textsum.summarize import Summarizer

summarizer = Summarizer(
    model_name_or_path="pszemraj/long-t5-tglobal-base-16384-book-summary"
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")

This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo application.

For details, explanations, and documentation, see the README (linked above) or the wiki.

Training procedure

Updates:

July 22, 2022: Updated to a fairly converged checkpoint
July 3, 2022: Added a new version with several epochs of additional general training that is more performant.

Training hyperparameters

NOTE: early checkpoints of this model were trained on a "smaller" subsection of the dataset because it was mistakenly filtered to summaries of at most 1024 characters. The filter was subsequently corrected to 1024 tokens, and the model was then trained further for 10+ epochs.

The following hyperparameters were used during the most recent training round*:

learning_rate: 0.0005
train_batch_size: 1
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 128
total_train_batch_size: 128
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.01
num_epochs: 2

* Prior training sessions used roughly similar parameters; multiple sessions were required as this takes eons to train
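To make the scheduler settings concrete, here is a small sketch of a cosine schedule with linear warmup using the learning_rate and lr_scheduler_warmup_ratio listed above. The total step count is not stated in this card, so `total_steps` is a hypothetical input, and rounding the warmup length up is an assumption.

```python
import math


def cosine_lr(step, total_steps, base_lr=5e-4, warmup_ratio=0.01):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounding is an assumption
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 100 optimizer steps: print the rate at a few points.
for s in (0, 1, 50, 100):
    print(s, cosine_lr(s, total_steps=100))
```

The rate peaks at the end of warmup and then decays smoothly, which means most of each training round runs below the nominal 0.0005.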

Framework versions

Transformers 4.20.1
Pytorch 1.10.0+cu113
Datasets 2.3.2
Tokenizers 0.12.1

Citation info

If you find pszemraj/long-t5-tglobal-base-16384-book-summary useful in your work, please consider citing this model :)

@misc {peter_szemraj_2022,
	author       = { {Peter Szemraj} },
	title        = { long-t5-tglobal-base-16384-book-summary (Revision 4b12bce) },
	year         = 2022,
	url          = { https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary },
	doi          = { 10.57967/hf/0100 },
	publisher    = { Hugging Face }
}