Brief Review — Longformer: The Long-Document Transformer

Longformer, Global Attention and Attention Within Sliding Window

With Longformer, time and memory scale linearly with sequence length
  • The proposed Longformer introduces an attention mechanism that combines a local windowed attention with a task-motivated global attention. Thus, time and memory scale linearly with sequence length, as shown above (a small scaling sketch follows below).
  • In addition, a Longformer-Encoder-Decoder (LED) variant is also proposed for summarization.
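To make the linear-vs-quadratic claim concrete, here is a back-of-the-envelope sketch (illustrative only, not the paper's implementation; the window size and number of global tokens are hypothetical) that counts how many query-key pairs each scheme has to score:

```python
# Back-of-the-envelope comparison (illustrative only): count the query-key
# pairs each scheme must score. The window size and number of global tokens
# are hypothetical values, not the paper's exact configuration.

def full_attention_cells(n: int) -> int:
    # Every token attends to every token: O(n^2).
    return n * n

def longformer_attention_cells(n: int, window: int = 512, n_global: int = 2) -> int:
    # Each token attends to `window` local neighbours: O(n * w), plus a few
    # global tokens that attend to, and are attended by, every token.
    return n * window + 2 * n_global * n

for n in (1_024, 4_096, 16_384):
    print(f"n={n:6d}  full={full_attention_cells(n):>14,}  "
          f"local+global={longformer_attention_cells(n):>12,}")
```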

Outline

  1. Long-Document Transformer (Longformer)
  2. Longformer Results
  3. Longformer-Encoder-Decoder (LED) & Its Results

1. Long-Document Transformer (Longformer)

1.1. Attention Variants

Comparing the full self-attention pattern and the configuration of attention patterns in the Longformer
  • (a) Full self-attention: the original Transformer model has a self-attention component with O(n²) time and memory complexity, where n is the input sequence length.
  • The attention operation computes QK^T, an n×n matrix of attention scores. This becomes a problem when sequences are long, since it consumes a large amount of memory.
  • (b) Sliding Window: This attention pattern employs a fixed-size window attention surrounding each token. Given a fixed window size w, each token attends to (1/2)×w tokens on each side.
  • The computation complexity of this pattern is O(n×w), which scales linearly with input sequence length n.
  • (c) Dilated Sliding Window: To further increase the receptive field without increasing computation, the sliding window can be “dilated”, i.e., the window has gaps of size dilation d. Assuming a fixed d and w for all layers, the receptive field is l×d×w (where l is the number of layers), which can reach tens of thousands of tokens even for small values of d.
  • Using different dilation configurations per head improves performance by allowing some heads without dilation to focus on local context, while others with dilation focus on longer context.
  • (d) Global Attention: can be added at a few pre-selected input locations. The figure shows an example of sliding window attention with global attention at a few tokens at custom locations.
  • Since the number of such tokens is small relative to n, the complexity of the combined local and global attention is still O(n) (a mask-construction sketch of these patterns follows this list).
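Patterns (b)-(d) can be pictured as a boolean mask over query-key pairs. The following is a minimal illustrative sketch with hypothetical window, dilation, and global-token settings; a real Longformer implementation never materializes the dense n×n matrix, so this is only for visualizing which positions may attend to each other:

```python
import numpy as np

# Illustrative sketch of patterns (b)-(d): build a dense boolean mask just to
# visualize which query-key pairs are allowed. A real Longformer never
# materializes the full n x n matrix; window size, dilation and global
# positions below are hypothetical values.

def longformer_mask(n, window=4, dilation=1, global_positions=(0,)):
    mask = np.zeros((n, n), dtype=bool)
    half = window // 2
    for i in range(n):
        # (b)/(c): sliding window of size `window`, optionally dilated so
        # attended positions have gaps of size `dilation`.
        for step in range(-half, half + 1):
            j = i + step * dilation
            if 0 <= j < n:
                mask[i, j] = True
    # (d): global attention -- selected tokens attend to all tokens,
    # and all tokens attend to them (symmetric).
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

print(longformer_mask(n=16, window=4, dilation=2).astype(int))  # 1 = allowed
```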

1.2. Attention Patterns

  • Following Sukhbaatar et al. (2019), different window sizes are used across the layers. In particular, small window sizes are used for the lower layers, and window sizes are increased as we move to higher layers.
  • Dilated sliding windows are not used for the lower layers; for the higher layers, a small amount of increasing dilation is used on only 2 heads (a per-layer configuration sketch follows below).
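As an illustration of this layer-wise scheme, the sketch below builds a hypothetical configuration; the concrete window sizes, layer count, and head count are made-up values, not the paper's exact settings:

```python
# Hypothetical per-layer configuration in the spirit of Sec. 1.2: small
# windows at lower layers, larger windows higher up, and dilation only on
# 2 heads in the top layers. The concrete numbers are made up for
# illustration; they are not the paper's exact settings.

num_layers, num_heads = 12, 8

window_sizes = [32 * 2 ** min(layer // 2, 4) for layer in range(num_layers)]
# -> [32, 32, 64, 64, 128, 128, 256, 256, 512, 512, 512, 512]

dilations = []
for layer in range(num_layers):
    if layer < num_layers - 2:
        dilations.append([1] * num_heads)                  # no dilation in lower layers
    else:
        dilations.append([2] * 2 + [1] * (num_heads - 2))  # dilation on 2 heads only

print(window_sizes)
print(dilations[-1])   # e.g. [2, 2, 1, 1, 1, 1, 1, 1]
```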

2. Longformer Results

Small model BPC on text8 & enwik8
  • A stage-based training strategy is used, where Longformer is trained stage-by-stage from short sequences to long sequences, as sketched below.
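A minimal sketch of such a staged schedule is shown below; the starting values, scaling factors, number of stages, and the train_one_stage helper are hypothetical placeholders, not the paper's exact training recipe:

```python
# Sketch of a staged training schedule: start with short sequences and small
# windows, then scale up stage by stage. The starting values, scaling factors
# and number of stages are hypothetical placeholders, and train_one_stage()
# is a hypothetical helper, not part of any library.

seq_len, window, lr = 2_048, 64, 3e-5
for stage in range(5):
    print(f"stage {stage}: seq_len={seq_len}, window={window}, lr={lr:.1e}")
    # train_one_stage(model, seq_len=seq_len, attention_window=window, lr=lr)
    seq_len, window, lr = seq_len * 2, window * 2, lr / 2
```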
Performance of large models on enwik8
Summary of finetuning results on QA, coreference resolution, and document classification
Leaderboard results of Longformer-large at time of submission (May 2020)

3. Longformer-Encoder-Decoder (LED) & Its Results

  • A Longformer-Encoder-Decoder (LED) is proposed that has both encoder and decoder Transformer stacks, but instead of the full self-attention in the encoder, it uses the efficient local+global attention pattern of the Longformer.
  • The decoder uses full self-attention over the entire encoded sequence and over the previously decoded positions.
  • Since pre-training LED is expensive, LED parameters are initialized from BART, and it follows BART's exact architecture in terms of number of layers and hidden sizes.
  • LED-base and LED-large respectively have 6 and 12 layers in both the encoder and decoder stacks (a usage sketch follows this list).
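For readers who want to try LED, the sketch below assumes the Hugging Face transformers implementation and the public allenai/led-base-16384 checkpoint; it is an illustrative usage example, not the authors' training code, and the generation settings are arbitrary:

```python
# A minimal usage sketch, assuming the Hugging Face `transformers` LED
# implementation and the public "allenai/led-base-16384" checkpoint
# (LED initialized from BART, as described above). Illustrative only;
# generation settings are arbitrary, check the library docs for details.
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

long_document = "..."  # e.g. the full body of an arXiv paper
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=16384)

# Local windowed attention everywhere, plus global attention on the first
# (<s>) token so it can aggregate information from the whole document.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(inputs["input_ids"],
                             global_attention_mask=global_attention_mask,
                             num_beams=4, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```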
Summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset.
  • LED-large with a 16K-token input achieves state-of-the-art results on the arXiv summarization task. This model is merely initialized from BART, with no additional pre-training.

Reference

[2020 arXiv] [Longformer] Longformer: The Long-Document Transformer (Iz Beltagy, Matthew E. Peters, Arman Cohan, arXiv:2004.05150)
