Brief Review — Linformer: Self-Attention with Linear Complexity

The self-attention mechanism can be approximated by a low-rank matrix

Sik-Ho Tsang
3 min read · Jul 31, 2024

Linformer: Self-Attention with Linear Complexity
Linformer, by Facebook AI
2020 arXiv v3, Over 1500 Citations (Sik-Ho Tsang @ Medium)

Language Model (LM)
2007 … 2022 [GLM] [Switch Transformers] [WideNet] [MoEBERT] [X-MoE] [sMLP] [LinkBERT, BioLinkBERT] [AlphaCode] [Block-wise Dynamic Quantization] 2023 [ERNIE-Code] [Grouped-Query Attention (GQA)]
==== My Other Paper Readings Are Also Over Here ====

  • The standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length.
  • In this paper, it is found that the self-attention mechanism can be approximated by a low-rank matrix. The resulting linear Transformer, the Linformer, performs on par with standard Transformer models while being much more memory- and time-efficient.

Outline

  1. Linformer
  2. Results

1. Linformer

1.1. Linear Self-Attention

Linear self-attention in Linformer
  • The main idea of the proposed linear self-attention (Figure 2) is to add two linear projection matrices Ei, Fi when computing key and value.
  • The original (n×d)-dimensional key and value layers are first projected into (k×d)-dimensional projected key and value layers.
  • Then, an (n×k)-dimensional context mapping matrix P̄ is computed using scaled dot-product attention; the full head-level formula is written out after this list.
  • The above operations only require O(nk) time and space complexity.
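
Putting the two projection bullets together, the head-level computation from the paper can be written, up to notation, as:

```latex
% Linformer attention head: E_i, F_i are (k x n) projections applied
% along the sequence dimension of the keys and values.
\[
\mathrm{head}_i
  = \underbrace{\mathrm{softmax}\!\left(
      \frac{Q W_i^{Q}\,\bigl(E_i K W_i^{K}\bigr)^{\top}}{\sqrt{d_k}}
    \right)}_{\bar{P}\;:\;n \times k}
  \cdot
  \underbrace{\bigl(F_i V W_i^{V}\bigr)}_{k \times d}
\]
```

Computing P̄ and multiplying it by the projected values each scale as O(nk) in the sequence length, which is the complexity quoted above.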

Thus, if k ≪ n, then the memory and time consumption can be significantly reduced.
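
As a concrete illustration, here is a minimal PyTorch-style sketch of one such head. The class name, tensor shapes, and the random initialisation of E and F are my own illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn


class LinearSelfAttentionHead(nn.Module):
    """One Linformer-style head: keys and values are projected from
    sequence length n down to k before the softmax (illustrative sketch)."""

    def __init__(self, d_model: int, d_head: int, n: int, k: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.k_proj = nn.Linear(d_model, d_head)
        self.v_proj = nn.Linear(d_model, d_head)
        # E, F: (k x n) projections applied along the sequence dimension.
        self.E = nn.Parameter(torch.randn(k, n) / n ** 0.5)
        self.F = nn.Parameter(torch.randn(k, n) / n ** 0.5)
        self.scale = d_head ** -0.5

    def forward(self, x):                                # x: (batch, n, d_model)
        q = self.q_proj(x)                               # (batch, n, d_head)
        k_low = self.E @ self.k_proj(x)                  # (batch, k, d_head)
        v_low = self.F @ self.v_proj(x)                  # (batch, k, d_head)
        scores = q @ k_low.transpose(-2, -1) * self.scale
        p_bar = torch.softmax(scores, dim=-1)            # (batch, n, k) context mapping
        return p_bar @ v_low                             # (batch, n, d_head): O(n*k), not O(n^2)
```

With n = 512 and k = 128 (the setting quoted in the results below), the softmax is taken over a 512×128 score matrix instead of the 512×512 one in standard attention.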

  • In Figure 2 (top right), the inference speed of Linformer and standard Transformer is plotted versus sequence length, while holding the total number of tokens fixed.

1.2. Parameter Sharing Between Projections

  • One can share parameters for the linear projection matrices Ei, Fi across heads and layers. In particular, there are 3 levels of sharing (a short code sketch follows this list):
  • Headwise sharing: Ei = E and Fi = F across all heads i.
  • Key-value sharing: Ei = Fi = E for the key and value projections, across all heads i.
  • Layerwise sharing: A single projection matrix E is used across all layers, for all heads, and for both key and value.
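
A minimal sketch of how the three levels could be wired up, assuming one (k×n) matrix per shared slot. The helper name, the mode strings, and the initialisation are hypothetical, not the paper's implementation:

```python
import torch
import torch.nn as nn


def make_projections(n_layers: int, n_heads: int, n: int, k: int, mode: str):
    """Return a [layer][head] grid of (E, F) projection pairs under the
    three Linformer sharing strategies (illustrative sketch)."""
    def new_proj():
        return nn.Parameter(torch.randn(k, n) / n ** 0.5)

    if mode == "layerwise":
        E = new_proj()                       # one matrix for all layers, heads, keys and values
        return [[(E, E)] * n_heads for _ in range(n_layers)]
    if mode == "key_value":
        layer_E = [new_proj() for _ in range(n_layers)]      # one shared matrix per layer
        return [[(E, E)] * n_heads for E in layer_E]
    if mode == "headwise":
        layer_EF = [(new_proj(), new_proj()) for _ in range(n_layers)]  # separate E, F per layer
        return [[(E, F)] * n_heads for (E, F) in layer_EF]
    raise ValueError(f"unknown sharing mode: {mode}")
```

Layerwise sharing adds only a single k×n matrix of extra parameters for the whole model, and, per the results below, it is also the most accurate of the three strategies.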

2. Results

  • RoBERTa is used.
  • BookCorpus plus English Wikipedia are used as the pretraining set, with the masked-language-modeling (MLM) objective.
  • Then, the pretrained model is fine-tuned on downstream tasks.
  • The Linformer model (n = 512, k = 128) has downstream performance comparable to the RoBERTa model, and in fact even slightly outperforms it at k = 256.
  • Moreover, although the layerwise strategy shares a single projection matrix across the entire model, it exhibits the best accuracy of all three parameter-sharing strategies.

Even with n = 512 and k = 128, Linformer has 1.5× faster inference time and allows for a 1.7× larger maximum batch size than the Transformer.

  • As sequence length increases, the inference-time speed-up and memory savings are even more dramatic.


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.