summarization led summary longformer booksum long-document long-form

led-large-book-summary

<a href="https://colab.research.google.com/gist/pszemraj/3eba944ddc9fc9a4a1bfb21e83b57620/summarization-token-batching.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

This model is a fine-tuned version of allenai/led-large-16384 on the BookSum dataset (kmfoda/booksum). It aims to generalize well and be useful in summarizing lengthy text for both academic and everyday purposes.

Note: Due to inference API timeout constraints, outputs may be truncated before the fully summary is returned (try python or the demo)


Basic Usage

To improve summary quality, use encoder_no_repeat_ngram_size=3 when calling the pipeline object. This setting encourages the model to utilize new vocabulary and construct an abstractive summary.

Load the model into a pipeline object:

import torch
from transformers import pipeline

hf_name = 'pszemraj/led-large-book-summary'

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)

Feed the text into the pipeline object:

wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)

Important: For optimal summary quality, use the global attention mask when decoding, as demonstrated in this community notebook, see the definition of generate_answer(batch).

If you're facing computing constraints, consider using the base version pszemraj/led-base-book-summary.


Training Information

Data

The model was fine-tuned on the booksum dataset. During training, the chapterwas the input col, while the summary_text was the output.

Procedure

Fine-tuning was run on the BookSum dataset across 13+ epochs. Notably, the final four epochs combined the training and validation sets as 'train' to enhance generalization.

Hyperparameters

The training process involved different settings across stages:

Versions


Simplified Usage with TextSum

To streamline the process of using this and other models, I've developed a Python package utility named textsum. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.

Install TextSum:

pip install textsum

Then use it in Python with this model:

from textsum.summarize import Summarizer

model_name = "pszemraj/led-large-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # tokens to batch summarize at a time, up to 16384
)
long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")

Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a demo/web UI.

For detailed explanations and documentation, check the README or the wiki


Related Models

Check out these other related models, also trained on the BookSum dataset:

There are also other variants on other datasets etc on my hf profile, feel free to try them out :)