summarization summary booksum long-document long-form

long-t5-tglobal-base-16384 + BookSum

<a href="https://colab.research.google.com/gist/pszemraj/d9a0495861776168fd5cdcd7731bc4ee/example-long-t5-tglobal-base-16384-book-summary.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

Summarize long text and get a SparkNotes-esque summary of arbitrary topics!

Cheeky Proof-of-Concept

A summary of the infamous navy seals copypasta:

The narrator tells us that he's graduated from the Navy seals and has been involved in many secret raids. He's also one of the best snipers in the entire U.S. military. He promises to "wipe you out with precision" when they meet again.


Contents

<!-- TOC -->

<!-- /TOC -->


Model description

A fine-tuned version of google/long-t5-tglobal-base on the kmfoda/booksum dataset:

Read the paper by Guo et al. here: LongT5: Efficient Text-To-Text Transformer for Long Sequences

How-To in Python

Install/update transformers pip install -U transformers

Summarize text with pipeline:

import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)
long_text = "Here is a lot of text I don't want to read. Replace me"

result = summarizer(long_text)
print(result[0]["summary_text"])

Pass other parameters related to beam search textgen when calling summarizer to get even higher quality results.

Intended uses & limitations

Training and evaluation data

kmfoda/booksum dataset on HuggingFace - read the original paper here. Summaries longer than 1024 LongT5 tokens were filtered out to prevent the model from learning to generate "partial" summaries.


FAQ

How to run inference over a very long (30k+ tokens) document in batches?

See summarize.py in the code for my hf space Document Summarization :)

You can also use the same code to split a document into batches of 4096, etc., and run over those with the model. This is useful in situations where CUDA memory is limited.

How to fine-tune further?

See train with a script and the summarization scripts.

This model was originally tuned on Google Colab with a heavily modified variant of the longformer training notebook, key enabler being deepspeed. You can try this as an alternate route to fine-tuning the model without using the command line.

Are there simpler ways to run this?

For this reason, I created a Python package utility. It's called textsum, and you can use it to load models and summarize things in a few lines of code.

pip install textsum

Use textsum in python with this model:

from textsum.summarize import Summarizer

summarizer = Summarizer(
    model_name_or_path="pszemraj/long-t5-tglobal-base-16384-book-summary"
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")

This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo application.

For details, explanations, and documentation, see the README (linked above) or the wiki.


Training procedure

Updates:

Training hyperparameters

NOTE: early checkpoints of this model were trained on a "smaller" subsection of the dataset as it was filtered for summaries of 1024 characters. This was subsequently caught and adjusted to 1024 tokens and then trained further for 10+ epochs.

The following hyperparameters were used during the most recent training round*:

* Prior training sessions used roughly similar parameters; multiple sessions were required as this takes eons to train

Framework versions

Citation info

If you find pszemraj/long-t5-tglobal-base-16384-book-summary useful in your work, please consider citing this model :)

@misc {peter_szemraj_2022,
	author       = { {Peter Szemraj} },
	title        = { long-t5-tglobal-base-16384-book-summary (Revision 4b12bce) },
	year         = 2022,
	url          = { https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary },
	doi          = { 10.57967/hf/0100 },
	publisher    = { Hugging Face }
}