Longformer
longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096. It was introduced in the paper Longformer: The Long-Document Transformer (Beltagy et al., 2020; see the BibTeX entry below) and first released in the allenai/longformer repository. Longformer uses a combination of sliding-window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
Model description
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. Longformer addresses this with an attention mechanism that scales linearly with sequence length; the paper also introduces Longformer-Encoder-Decoder (LED), a Longformer variant for long-document generative sequence-to-sequence tasks, and demonstrates its effectiveness on the arXiv summarization dataset.
- Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
- The original Transformer model has a self-attention component with O(n^2) time and memory complexity where n is the input sequence length. To address this challenge, we sparsify the full self-attention matrix according to an “attention pattern” specifying pairs of input locations attending to one another. Unlike the full self-attention, our proposed attention pattern scales linearly with the input sequence, making it efficient for longer sequences. This section discusses the design and implementation of this attention pattern.
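The snippet below is a minimal usage sketch of that attention pattern through the Hugging Face transformers library, assuming transformers and torch are installed and using the allenai/longformer-base-4096 checkpoint this card describes; which tokens receive global attention is task-specific, and putting it on the first token here is only an illustration.

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Up to 4,096 tokens per sequence; shorter inputs are fine.
text = "Long document text ... " * 500
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Sliding-window (local) attention is used everywhere by default; mark the
# task-specific tokens (here, the first/<s> token) for global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)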
Dataset and Task
For language modeling, to compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009). For finetuned tasks: WikiHop, TriviaQA, HotpotQA, OntoNotes, IMDB, and Hyperpartisan news.
We evaluate on text8 and enwik8, both of which contain 100M characters from Wikipedia, split into 90M, 5M, and 5M characters for train, dev, and test.
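The snippet below is a minimal sketch of that 90M/5M/5M character split, assuming text8 has already been downloaded and decompressed to a local file named "text8"; the file path and reading code are illustrative, not the paper's data pipeline.

# Hypothetical local path to the decompressed text8 file (~100M characters).
with open("text8", "r", encoding="utf-8") as f:
    data = f.read()

train = data[:90_000_000]            # first 90M characters for training
dev = data[90_000_000:95_000_000]    # next 5M characters for development
test = data[95_000_000:100_000_000]  # final 5M characters for testing
print(len(train), len(dev), len(test))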
Tokenizer with Vocabulary size
To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa’s wordpiece tokenizer. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.
NOTE: A similar strategy was used for all tasks, and the vocabulary size matches RoBERTa's vocabulary plus the handful of added special tokens.
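The snippet below is a minimal sketch of adding those special tokens and resizing the embedding matrix with the Hugging Face transformers API; the checkpoint name is the one this card describes, while registering the tokens via "additional_special_tokens" is an assumption for illustration.

from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Register the task-specific markers as special tokens.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[q]", "[/q]", "[ent]", "[/ent]"]}
)

# Grow the embedding matrix; the new rows are randomly initialized, matching
# the "randomly initialized before task finetuning" setup described above.
model.resize_token_embeddings(len(tokenizer))
print(num_added, len(tokenizer))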
Computational Resources
Character Level Language Modelling: We ran the small model experiments on 4 RTX8000 GPUs for 16 days. For the large model, we ran experiments on 8 RTX8000 GPUs for 13 days.
For wikihop: All models were trained on a single RTX8000 GPU, with Longformer-base taking about a day for 5 epochs.
For TriviaQA: We ran our experiments on 32GB V100 GPUs. Small model takes 1 day to train on 4 GPUs, while large model takes 1 day on 8 GPUs.
For Hotpot QA: Our experiments are done on RTX8000 GPUs and training each epoch takes approximately half a day on 4 GPUs.
Text Classification: Experiments were done on a single RTX8000 GPU.
Pretraining Objective
We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.
Any biases present in the pretraining data will also affect all fine-tuned versions of this model.
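The snippet below sketches the MLM objective using the generic transformers masked-LM collator; the 15% masking probability is the collator's default rather than a number stated here, and the short input string stands in for a long pretraining document.

from transformers import (
    DataCollatorForLanguageModeling,
    LongformerForMaskedLM,
    LongformerTokenizer,
)

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# Randomly mask tokens; the model is trained to recover them.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
encoding = tokenizer("A long document to pretrain on ...", return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])

outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])
print(outputs.loss)  # masked language modeling loss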
Training Setup
- We train two model sizes, a base model and a large model. Both models are trained for 65K gradient updates with sequence length 4,096 and batch size 64 (4,096 × 64 = 2^18 tokens per batch), a maximum learning rate of 3e-5, a linear warmup of 500 steps, followed by a power-3 polynomial decay (a minimal scheduler sketch follows this list). The rest of the hyperparameters are the same as RoBERTa. [For MLM pretraining]
- Hyperparameters for the best-performing model for character-level language modeling are given in the paper.
- Hyperparameters of the QA models are given in the paper. All models use a similar scheduler with linear warmup and decay.
- [For coreference resolution] The maximum sequence length was 384 for RoBERTa-base, chosen after three trials from [256, 384, 512] using the default hyperparameters in the original implementation; for Longformer-base the sequence length was 4,096. Hyperparameter searches were minimal and consisted of grid searches of the RoBERTa LR in [1e-5, 2e-5, 3e-5] and the task LR in [1e-4, 2e-4, 3e-4] for both RoBERTa and Longformer for a fair comparison. The best configuration for Longformer-base was RoBERTa LR = 1e-5 and task LR = 1e-4. All other hyperparameters were the same as in the original implementation.
- [For text classification] We used the Adam optimizer with batch size 32 and linear warmup and decay, with warmup steps equal to 0.1 of the total training steps. For both IMDB and Hyperpartisan news we did a grid search over LRs [3e-5, 5e-5] and epochs [10, 15, 20] and found LR 3e-5 with 15 epochs to work best.
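As referenced in the first bullet above, the snippet below is a minimal sketch of the MLM pretraining schedule (peak LR 3e-5, 500 warmup steps, power-3 polynomial decay over 65K updates); the AdamW optimizer and the stand-in module are illustrative assumptions, not the authors' training code.

import torch
from transformers import get_polynomial_decay_schedule_with_warmup

module = torch.nn.Linear(768, 768)  # stand-in for the Longformer parameters
optimizer = torch.optim.AdamW(module.parameters(), lr=3e-5)

scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,       # linear warmup
    num_training_steps=65_000,  # total gradient updates
    power=3.0,                  # power-3 polynomial decay
)

for step in range(1_000):  # training loop elided; only the schedule is shown
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())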
Training procedure
Preprocessing
"For WikiHop: To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa’s wordpiece tokenizer. Then we concatenate the question and answer candi- dates with special tokens as [q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent]. The contexts are also concatenated using RoBERTa’s doc- ument delimiter tokens as separators: </s> context1 </s> ... </s> contextM </s>. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.
For TriviaQA: Similar to WikiHop, we tokenize the question and the document using RoBERTa's tokenizer, then form the input as [s] question [/s] document [/s]. We truncate the document at 4,096 wordpieces to avoid very slow training.
For HotpotQA: Similar to WikiHop and TriviaQA, to prepare the data for input to Longformer, we concatenate the question and then all 10 paragraphs in one long context. We particularly use the following input format with special tokens: "[CLS] [q] question [/q] <t> title1 </t> sent1,1 [s] sent1,2 [s] ... <t> title2 </t> sent2,1 [s] sent2,2 [s] ..." where [q], [/q], <t>, </t>, [s], [p] are special tokens representing question start and end, paragraph title start and end, and sentence, respectively. The special tokens were added to the Longformer vocabulary and randomly initialized before task finetuning.
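The helper below is a minimal sketch of the WikiHop-style concatenation described above; the function name and the toy question, candidates, and contexts are made up for illustration.

def build_wikihop_input(question, candidates, contexts):
    # [q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent]
    q_part = f"[q] {question} [/q]"
    cand_part = " ".join(f"[ent] {c} [/ent]" for c in candidates)
    # </s> context1 </s> ... </s> contextM </s>
    ctx_part = "</s> " + " </s> ".join(contexts) + " </s>"
    return f"{q_part} {cand_part} {ctx_part}"

example = build_wikihop_input(
    question="which country is the subject from",
    candidates=["united states", "canada"],
    contexts=["First supporting paragraph ...", "Second supporting paragraph ..."],
)
print(example)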
Experiment
- Character-level language modeling: a) To compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009).
b) Tab. 2 and 3 summarize evaluation results on text8 and enwik8 datasets. We achieve a new state-of-the-art on both text8 and enwik8 using the small models with BPC of 1.10 and 1.00 on text8 and enwik8 respectively, demonstrating the effectiveness of our model.
- Pretraining: a) We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.
b) Table 5: MLM BPC for RoBERTa and various pretrained Longformer configurations.
- WikiHop: Instances in WikiHop consist of: a question, answer candidates (ranging from two candidates to 79 candidates), supporting contexts (ranging from three paragraphs to 63 paragraphs), and the correct answer. The dataset does not provide any intermediate annotation for the multihop reasoning chains, requiring models to instead infer them from the indirect answer supervision.
- TriviaQA: TriviaQA has more than 100K question, answer, document triplets for training. Documents are Wikipedia articles, and answers are named entities mentioned in the article. The span that answers the question is not annotated, but it is found using simple text matching.
- HotpotQA: The HotpotQA dataset involves answering questions from a set of 10 paragraphs from 10 different Wikipedia articles, where 2 paragraphs are relevant to the question and the rest are distractors. It includes 2 tasks of answer span extraction and evidence sentence identification. Our model for HotpotQA combines both answer span extraction and evidence extraction in one joint model.
- Coreference model: The coreference model is a straightforward adaptation of the coarse-to-fine BERT-based model from Joshi et al. (2019).
- Text classification: For classification, following BERT, we used a simple binary cross entropy loss on top of the first [CLS] token, with the addition of global attention to [CLS] (a minimal sketch follows this list).
- Evaluation metrics for finetuned tasks: Results are reported on the development sets, comparing Longformer-base with RoBERTa-base on QA, coreference resolution, and document classification. TriviaQA and Hyperpartisan use F1, WikiHop and IMDB use accuracy, HotpotQA uses joint F1, and OntoNotes uses average F1.
- Summarization: a) We evaluate LED on the summarization task using the arXiv summarization dataset (Cohan et al.), which focuses on long document summarization in the scientific domain.
b) Table 11: Summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset. Metrics from left to right are ROUGE-1, ROUGE-2, and ROUGE-L.
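As referenced in the text classification bullet above, the snippet below is a minimal sketch of a binary cross-entropy head on the first token with global attention on that token; the linear head, label, and example review are illustrative, not the authors' exact implementation.

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = LongformerModel.from_pretrained("allenai/longformer-base-4096")
classifier = torch.nn.Linear(encoder.config.hidden_size, 1)
loss_fn = torch.nn.BCEWithLogitsLoss()

inputs = tokenizer("An example movie review ...", return_tensors="pt",
                   truncation=True, max_length=4096)
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # global attention on the first ([CLS]/<s>) token

hidden = encoder(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
).last_hidden_state

logit = classifier(hidden[:, 0])              # first-token representation
loss = loss_fn(logit, torch.tensor([[1.0]]))  # 1.0 = positive class (e.g. IMDB)
print(loss)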
Ablation
Ablation study for WikiHop on the development set. All results use Longformer-base, fine-tuned for five epochs with identical hyperparameters except where noted. Longformer benefits from longer sequences, global attention, separate projection matrices for global attention, MLM pretraining, and longer training. In addition, when configured as in RoBERTa-base (seqlen: 512 and n^2 attention), Longformer performs slightly worse than RoBERTa-base, confirming that performance gains are not due to additional pretraining. Performance drops slightly when using the RoBERTa model with only the additional position embeddings unfrozen (i.e., without Longformer's MLM pretraining), showing that Longformer can learn to use long-range context in task-specific fine-tuning with large training datasets such as WikiHop.
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-2004-05150,
author = {Iz Beltagy and
Matthew E. Peters and
Arman Cohan},
title = {Longformer: The Long-Document Transformer},
journal = {CoRR},
volume = {abs/2004.05150},
year = {2020},
url = {http://arxiv.org/abs/2004.05150},
archivePrefix = {arXiv},
eprint = {2004.05150},
timestamp = {Wed, 22 Apr 2020 14:29:36 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2004-05150.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}