Brief Review — Linformer: Self-Attention with Linear Complexity
Self-attention mechanism can be approximated by a low-rank matrix
Linformer: Self-Attention with Linear Complexity
Linformer, by Facebook AI
2020 arXiv v3, Over 1500 Citations (Sik-Ho Tsang @ Medium)
- The standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length.
- In this paper, it is found that the self-attention mechanism can be approximated by a low-rank matrix. The resulting linear Transformer, the Linformer, performs on par with standard Transformer models while being much more memory- and time-efficient.
Outline
- Linformer
- Results
1. Linformer
1.1. Linear Self-Attention
- The main idea of the proposed linear self-attention (Figure 2) is to add two linear projection matrices Ei, Fi when computing key and value.
- The original (n×d)-dimensional key and value layers are first projected into (k×d)-dimensional projected key and value layers.
- Then, an (n×k)-dimensional context mapping matrix ~P is computed using scaled dot-product attention between the query layer Q and the projected key layer: ~P = softmax(Q(EiK)^T / √d). The head output is then ~P(FiV), which is (n×d)-dimensional.
- The above operations require only O(nk) time and space complexity (a minimal sketch is given after this list).
Thus, if k << n, the memory and time consumption can be significantly reduced.
- In Figure 2 (top right), the inference speed of Linformer and standard Transformer is plotted versus sequence length, while holding the total number of tokens fixed.
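For concreteness, below is a minimal single-head PyTorch sketch of this linear self-attention. The class and variable names are my own, not from the paper or its released code; the actual Linformer uses multi-head attention plus the parameter-sharing options described in the next subsection.

```python
import math
import torch
import torch.nn as nn


class LinearSelfAttention(nn.Module):
    """Single-head Linformer-style attention: O(n*k) instead of O(n*n)."""

    def __init__(self, d_model: int, n: int, k: int):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)  # query projection W^Q
        self.wk = nn.Linear(d_model, d_model, bias=False)  # key projection W^K
        self.wv = nn.Linear(d_model, d_model, bias=False)  # value projection W^V
        # E and F (Ei, Fi in the text) project the length dimension n down to k.
        self.E = nn.Linear(n, k, bias=False)
        self.F = nn.Linear(n, k, bias=False)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        q = self.wq(x)                               # (batch, n, d)
        k_proj = self.E(self.wk(x).transpose(1, 2))  # (batch, d, k): projected keys
        v_proj = self.F(self.wv(x).transpose(1, 2))  # (batch, d, k): projected values
        p = torch.softmax(q @ k_proj * self.scale, dim=-1)  # ~P: (batch, n, k)
        return p @ v_proj.transpose(1, 2)            # (batch, n, d)


# Usage: the attention map is n x k = 512 x 128 rather than n x n = 512 x 512.
x = torch.randn(2, 512, 64)
attn = LinearSelfAttention(d_model=64, n=512, k=128)
print(attn(x).shape)  # torch.Size([2, 512, 64])
```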
1.2. Parameter Sharing Between Projections
- Parameter sharing between projections: One can share parameters for the linear projection matrices Ei, Fi across layers and heads. In particular, there can be three levels of sharing (contrasted in the sketch after this list):
- Headwise sharing: Ei = E and Fi = F across all heads i.
- Key-value sharing: in addition to headwise sharing, the key and value projections are shared, i.e., Ei = Fi = E within each layer.
- Layerwise sharing: A single projection matrix E is used across all layers, for all heads, and for both key and value.
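As a rough illustration (the variable names and the 12-layer, 12-head configuration are my own assumptions for a RoBERTa-base-sized model), the sketch below shows how each sharing level shrinks the number of distinct projection matrices:

```python
import torch.nn as nn

n, k = 512, 128
num_layers, num_heads = 12, 12

def make_proj() -> nn.Linear:
    # One (n -> k) length-projection matrix, as in the sketch above.
    return nn.Linear(n, k, bias=False)

# No sharing: separate Ei and Fi per head and per layer -> 2 * 12 * 12 = 288 matrices.
E = [[make_proj() for _ in range(num_heads)] for _ in range(num_layers)]
F = [[make_proj() for _ in range(num_heads)] for _ in range(num_layers)]

# Headwise sharing: one E and one F per layer, reused by all heads -> 2 * 12 = 24 matrices.
E_per_layer = [make_proj() for _ in range(num_layers)]
F_per_layer = [make_proj() for _ in range(num_layers)]

# Key-value sharing: Ei = Fi = E within each layer -> 12 matrices.
EF_per_layer = [make_proj() for _ in range(num_layers)]

# Layerwise sharing: a single projection for all layers, heads, keys and values -> 1 matrix.
EF_shared = make_proj()
```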
2. Results
- RoBERTa is used.
- BookCorpus plus English Wikipedia are used as the pretraining set, with the masked-language-modeling (MLM) objective.
- Then, the pretrained model is fine-tuned on downstream tasks.
- The Linformer model (n = 512, k = 128) has downstream performance comparable to the RoBERTa model, and in fact slightly outperforms it at k = 256.
- Moreover, although the layerwise sharing strategy shares a single projection matrix across the entire model, it achieves the best accuracy of the three parameter-sharing strategies.
Even with n = 512 and k = 128, Linformer has 1.5× faster inference time and allows for a 1.7× larger maximum batch size than the Transformer.
- As sequence length increases, the inference-time speed-up and memory savings are even more dramatic.