Brief Review — Longformer: The Long-Document Transformer

Longformer, Global Attention and Attention Within Sliding Window

Sik-Ho Tsang
Oct 29, 2022
With Longformer, time and memory scale linearly with sequence length

Longformer: The Long-Document Transformer, by Allen Institute for Artificial Intelligence,
2020 arXiv v2, Over 1300 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, Transformer

  • The proposed Longformer introduces an attention mechanism that combines local windowed attention with task-motivated global attention. As a result, time and memory scale linearly with sequence length, as shown above.
  • In addition, Longformer-Encoder-Decoder (LED) is proposed for summarization.


  1. Long-Document Transformer (Longformer)
  2. Longformer Results
  3. Longformer-Encoder-Decoder (LED) & Its Results

1. Long-Document Transformer (Longformer)

1.1. Attention Variants

Comparing the full self-attention pattern and the configuration of attention patterns in the Longformer
  • (a) Full Attention: The original Transformer model has a self-attention component with O(n²) time and memory complexity, where n is the input sequence length.
  • The cost comes from the QK^T product in the attention operation, which produces an n×n score matrix. This becomes a problem when the input sequence becomes very long, since the matrix consumes a large amount of memory.
  • (b) Sliding Window: This attention pattern employs a fixed-size window attention surrounding each token. Given a fixed window size w, each token attends to (1/2)×w tokens on each side.
  • The computation complexity of this pattern is O(n×w), which scales linearly with input sequence length n.
  • (c) Dilated Sliding Window: To further increase the receptive field without increasing computation, the sliding window can be “dilated”, i.e. the window has gaps of size d. Assuming a fixed dilation d and window size w for all l layers, the receptive field is l×d×w, which can reach tens of thousands of tokens even for small values of d.
  • Using different dilation configurations per head improves performance: some heads without dilation focus on local context, while others with dilation focus on longer context.
  • (d) Global Attention: Global attention can be added at a few pre-selected input locations. The figure shows an example of sliding window attention combined with global attention at a few tokens at custom locations.
  • Since the number of such tokens is small, the complexity of the combined local and global attention is still O(n).
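
To make patterns (b) to (d) concrete, here is a minimal sketch that builds the attention mask as a plain Python list of lists. This is only an illustration under my own assumptions, not the authors' implementation: the function names `longformer_mask` and `count_attended` are hypothetical, and a real implementation would never materialize the full n×n matrix, since that would defeat the purpose.

```python
def longformer_mask(n, w, dilation=1, global_idx=()):
    """Return an n x n boolean matrix; mask[i][j] is True if query i may attend to key j.

    Hypothetical helper for illustration only.
    n: sequence length, w: window size (w // 2 tokens on each side),
    dilation: step between attended positions (1 means no dilation),
    global_idx: positions that receive global attention.
    """
    half = w // 2
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Local (optionally dilated) window: the token itself plus w // 2 positions per side.
        for k in range(-half, half + 1):
            j = i + k * dilation
            if 0 <= j < n:
                mask[i][j] = True
    for g in global_idx:
        if not 0 <= g < n:
            raise ValueError("global index out of range")
        for j in range(n):
            mask[g][j] = True  # the global token attends to every token
            mask[j][g] = True  # every token attends to the global token
    return mask


def count_attended(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)
```

For n = 8 and w = 2, the sliding window mask keeps 22 of the 64 query-key pairs, and making token 0 global raises that to 34. The count grows with n×w rather than n², which is the linear scaling described above.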

1.2. Attention Patterns

  • Following Sukhbaatar et al. (2019), differing window sizes are used across the layers. In particular, small window sizes are used for the lower layers and window sizes are increased as we move to higher layers.

This allows the top layers to learn higher-level representations of the entire sequence while the lower layers capture local information. In addition, it provides a balance between efficiency and performance.
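
As a rough sketch of how the receptive field grows with depth: each sliding window layer lets a token reach about (1/2)×w×d tokens further on each side, so the top-layer receptive field is approximately the sum of w×d over the layers, which reduces to l×d×w when w and d are fixed. The function name `receptive_field` and all window sizes below are my own illustrative assumptions, not the paper's configuration.

```python
def receptive_field(window_sizes, dilations=None):
    """Approximate top-layer receptive field, in tokens, of stacked sliding window layers.

    Hypothetical helper: sums w * d per layer and ignores the +1 for the center token.
    """
    if dilations is None:
        dilations = [1] * len(window_sizes)
    if len(dilations) != len(window_sizes):
        raise ValueError("need one dilation per layer")
    return sum(w * d for w, d in zip(window_sizes, dilations))


# Fixed window for all layers: l * w (illustrative values).
print(receptive_field([512] * 12))             # 12 layers, w = 512
# Same stack with dilation d = 2 everywhere: l * d * w.
print(receptive_field([512] * 12, [2] * 12))
# Window sizes increasing with depth (made-up sizes).
print(receptive_field([32, 64, 128, 256]))
```

With small windows at the bottom and larger ones at the top, most of the reach is contributed by the upper layers, while the lower layers stay cheap and local.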

  • Dilated sliding windows are not used for the lower layers, so that they keep their full capacity to learn the immediate local context. For the higher layers, a small amount of increasing dilation is used on only 2 heads.

This gives the model the ability to directly attend to distant tokens without sacrificing local context.

2. Longformer Results

Small model BPC on text8 & enwik8
  • A staged training strategy is used, where Longformer is trained on short sequences first and then on progressively longer sequences, stage by stage.
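
A staged schedule of this kind can be sketched as follows. As far as I understand the paper, each subsequent phase doubles the sequence length and the window size and halves the learning rate; the function name `staged_schedule` and the starting values in the example are placeholders of mine, not the paper's exact settings.

```python
def staged_schedule(num_phases, seq_len, window, lr):
    """Build a list of per-phase settings (hypothetical helper).

    Each phase doubles the sequence length and window size
    and halves the learning rate relative to the previous one.
    """
    phases = []
    for phase in range(1, num_phases + 1):
        phases.append({"phase": phase, "seq_len": seq_len, "window": window, "lr": lr})
        seq_len *= 2
        window *= 2
        lr /= 2
    return phases


# Placeholder starting values, for illustration only.
for p in staged_schedule(num_phases=3, seq_len=2048, window=32, lr=1e-4):
    print(p)
```

Starting short is cheaper and gives the model many gradient updates on local context before the expensive long-sequence phases.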

Longformer sets a new state of the art on both text8 and enwik8 with the small models, reaching BPC of 1.10 on text8 and 1.00 on enwik8, which demonstrates the effectiveness of the proposed Longformer.
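
For readers unfamiliar with the metric: BPC (bits per character) is the average negative log2 probability the model assigns to the correct next character, so lower is better. A minimal sketch, with function names that are my own:

```python
import math


def bits_per_character(nll_nats_per_char):
    """Convert an average cross-entropy loss in nats per character to bits per character."""
    return nll_nats_per_char / math.log(2)


def bpc_from_probs(probs):
    """BPC from the probabilities a model assigned to each correct character."""
    return -sum(math.log2(p) for p in probs) / len(probs)


# A model that always gives the correct character probability 1/2 has BPC 1.0.
print(bpc_from_probs([0.5, 0.5, 0.5]))
```

In other words, a BPC of 1.00 means the model assigns the correct next character a probability of about 1/2 on (geometric) average.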

Performance of large models on enwik8

Longformer outperforms the comparable Transformer-XL, matches the performance of the comparable Sparse Transformer, and matches or slightly underperforms recent models that have more than twice the number of parameters.

Summary of finetuning results on QA, coreference resolution, and document classification

Longformer consistently outperforms the RoBERTa baseline.

Leaderboard results of Longformer-large at time of submission (May 2020)

Longformer-large achieves new state-of-the-art results on WikiHop and TriviaQA by large margins (3.6 and 4 points respectively), and for HotpotQA, it underperforms the current state-of-the-art (Fang et al., 2020) by a point.

3. Longformer-Encoder-Decoder (LED) & Its Results

  • A Longformer-Encoder-Decoder (LED) is proposed. It has both encoder and decoder Transformer stacks, but instead of full self-attention in the encoder, it uses the efficient local+global attention pattern of Longformer.
  • The decoder uses full attention over all encoded tokens and over previously decoded locations.
  • Since pre-training LED is expensive, LED parameters are initialized from BART, and LED follows BART’s exact architecture in terms of number of layers and hidden sizes.
  • There are two variants, LED-base and LED-large, which have 6 and 12 layers respectively in both the encoder and decoder stacks.
Summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset.
  • LED-large 16K is evaluated on the arXiv summarization task. This model is merely initialized from BART, with no additional pre-training.

LED achieves state-of-the-art results on arXiv, slightly outperforming BigBird (Zaheer et al., 2020).
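
To see why only the encoder needs the sparse pattern, here is a back-of-the-envelope count of attention score entries per head and layer. This is a rough sketch under my own assumptions: the function name `led_attention_entries`, the window size of 1024, and the summary length of 256 are illustrative, and the few global attention tokens are ignored.

```python
def led_attention_entries(src_len, tgt_len, window):
    """Count query-key score entries per head and layer (hypothetical helper)."""
    return {
        # Sliding window: each source token sees at most window + 1 keys (upper bound).
        "encoder_local": src_len * min(src_len, window + 1),
        # What full encoder self-attention would need, for comparison.
        "encoder_full": src_len * src_len,
        # Causal decoder self-attention over previously decoded locations.
        "decoder_self": tgt_len * (tgt_len + 1) // 2,
        # Decoder attending to all encoded tokens.
        "decoder_cross": tgt_len * src_len,
    }


# 16K input tokens; the window and summary length are illustrative assumptions.
for name, count in led_attention_entries(16384, 256, 1024).items():
    print(f"{name}: {count:,}")
```

Because summaries are much shorter than the source documents, the decoder's full attention stays cheap, while full self-attention in the encoder would dominate the cost.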


[2020 arXiv v2] [Longformer]
Longformer: The Long-Document Transformer
