led-large-book-summary

This model is a fine-tuned version of allenai/led-large-16384 on the BookSum dataset (kmfoda/booksum). It aims to generalize well and be useful in summarizing lengthy text for both academic and everyday purposes.

Handles up to 16,384 tokens input
See the Colab demo linked above or try the demo on Spaces

Note: Due to inference API timeout constraints, outputs may be truncated before the fully summary is returned (try python or the demo)

Basic Usage

To improve summary quality, use encoder_no_repeat_ngram_size=3 when calling the pipeline object. This setting encourages the model to utilize new vocabulary and construct an abstractive summary.

Load the model into a pipeline object:

import torch
from transformers import pipeline

hf_name = 'pszemraj/led-large-book-summary'

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)

Feed the text into the pipeline object:

wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)

Important: For optimal summary quality, use the global attention mask when decoding, as demonstrated in this community notebook, see the definition of generate_answer(batch).

If you're facing computing constraints, consider using the base version pszemraj/led-base-book-summary.

Training Information

Data

The model was fine-tuned on the booksum dataset. During training, the chapterwas the input col, while the summary_text was the output.

Procedure

Fine-tuning was run on the BookSum dataset across 13+ epochs. Notably, the final four epochs combined the training and validation sets as 'train' to enhance generalization.

Hyperparameters

The training process involved different settings across stages:

Initial Three Epochs: Low learning rate (5e-05), batch size of 1, 4 gradient accumulation steps, and a linear learning rate scheduler.
In-between Epochs: Learning rate reduced to 4e-05, increased batch size to 2, 16 gradient accumulation steps, and switched to a cosine learning rate scheduler with a 0.05 warmup ratio.
Final Two Epochs: Further reduced learning rate (2e-05), batch size reverted to 1, maintained gradient accumulation steps at 16, and continued with a cosine learning rate scheduler, albeit with a lower warmup ratio (0.03).

Versions

Transformers 4.19.2
Pytorch 1.11.0+cu113
Datasets 2.2.2
Tokenizers 0.12.1

Simplified Usage with TextSum

To streamline the process of using this and other models, I've developed a Python package utility named textsum. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.

Install TextSum:

pip install textsum

Then use it in Python with this model:

from textsum.summarize import Summarizer

model_name = "pszemraj/led-large-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # tokens to batch summarize at a time, up to 16384
)
long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")

Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a demo/web UI.

For detailed explanations and documentation, check the README or the wiki

Related Models

Check out these other related models, also trained on the BookSum dataset:

LED-large continued - experiment with further fine-tuning
Long-T5-tglobal-base
BigBird-Pegasus-Large-K
Pegasus-X-Large
Long-T5-tglobal-XL

There are also other variants on other datasets etc on my hf profile, feel free to try them out :)