# Brief Review — Linformer: Self-Attention with Linear Complexity

## Self-attention mechanism can be approximated by a low-rank matrix

Linformer: Self-Attention with Linear Complexity, 2020 arXiv v3, over 1500 citations (Sik-Ho Tsang @ Medium)


• The standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length n.
• In this paper, it is found that the self-attention mechanism can be approximated by a low-rank matrix. The resulting linear Transformer, the Linformer, performs on par with standard Transformer models while being much more memory- and time-efficient.

1. Linformer
2. Results

# 1. Linformer

## 1.1. Linear Self-Attention

• The main idea of the proposed linear self-attention (Figure 2) is to add two (k×n) linear projection matrices Ei, Fi when computing the key and value.
• The original (n×d)-dimensional key and value layers, K·WiK and V·WiV, are first projected into (k×d)-dimensional projected key and value layers, Ei·K·WiK and Fi·V·WiV.
• Then, an (n×k)-dimensional context mapping matrix ~P is computed using scaled dot-product attention:

head_i = softmax(Q·WiQ · (Ei·K·WiK)ᵀ / √d) · (Fi·V·WiV)

where the softmax term is the (n×k) matrix ~P, and its product with the (k×d) projected values gives the (n×d) output.

• The above operations only require O(nk) time and space complexity.

Thus, if k ≪ n, the time and memory consumption can be significantly reduced.

• In Figure 2 (top right), the inference speed of Linformer and standard Transformer is plotted versus sequence length, while holding the total number of tokens fixed.
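The projected attention above can be sketched in a few lines of NumPy. This is a minimal single-head illustration (random matrices, no learned weights WiQ, WiK, WiV); names follow the paper's notation, and the 1/√n scaling of E, F is just an initialization choice for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (n, d); E, F: (k, n) projections along the sequence axis.
    Returns an (n, d) output in O(nk) time instead of O(n^2)."""
    d = Q.shape[-1]
    K_proj = E @ K                            # (k, d) projected keys
    V_proj = F @ V                            # (k, d) projected values
    P = softmax(Q @ K_proj.T / np.sqrt(d))    # (n, k) context mapping ~P
    return P @ V_proj                         # (n, d) output

n, d, k = 512, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (512, 64)
```

Note that the softmax is taken over a k-dimensional axis rather than an n-dimensional one, which is where the savings come from.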

## 1.2. Parameter Sharing Between Projections

• Parameter sharing between projections: One can share parameters for the linear projection matrices Ei, Fi across layers and heads. In particular, there can be 3 levels of sharing:
• Headwise sharing: Ei = E and Fi = F across all heads i.
• Key-value sharing: Ei = Fi = E for each key-value projection matrix, across all heads i.
• Layerwise sharing: A single projection matrix E is used across all layers, for all heads, and for both key and value.
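To make the savings of each strategy concrete, the snippet below counts the extra projection parameters under each sharing level. The layer/head counts are illustrative (a RoBERTa-base-sized stack is assumed), not taken from the paper.

```python
# Parameter count of the (k x n) E/F projection matrices under each
# sharing strategy. layers/heads are assumed, RoBERTa-base-like values.
n, k = 512, 128
layers, heads = 12, 12

per_matrix  = n * k                             # one (k x n) projection
no_sharing  = layers * heads * 2 * per_matrix   # separate Ei, Fi everywhere
headwise    = layers * 2 * per_matrix           # one (E, F) pair per layer
key_value   = layers * per_matrix               # Ei = Fi = E per layer
layerwise   = per_matrix                        # single E for the whole model

print(no_sharing, headwise, key_value, layerwise)
```

Layerwise sharing reduces the added parameters by two orders of magnitude relative to unshared projections, yet (per the results below) loses no accuracy.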

# 2. Results

• RoBERTa is used.
• BookCorpus plus English Wikipedia are used as the pretraining set, with the masked-language-modeling (MLM) objective.
• Then, the pretrained model is fine-tuned on downstream tasks.
• Linformer model (n = 512; k = 128) has comparable downstream performance to the RoBERTa model, and in fact even slightly outperforms it at k = 256.
• Moreover, although the layerwise sharing strategy shares a single projection matrix across the entire model, it exhibits the best accuracy of the three parameter-sharing strategies.

Even with n = 512 and k = 128, Linformer has 1.5× faster inference time and allows for a 1.7× larger maximum batch size than the Transformer.

• As sequence length increases, the inference-time speed-up and memory savings are even more dramatic.
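Why the gains grow with sequence length follows directly from the complexity counts: standard attention costs on the order of n²·d operations while Linformer costs n·k·d, so with k fixed the FLOP ratio grows like n/k. A rough back-of-the-envelope check (illustrative only; real wall-clock speedups also depend on implementation overheads):

```python
# FLOP ratio of standard attention (~n^2 * d) to Linformer (~n * k * d)
# at fixed k = 128: the d factor cancels and the ratio is simply n / k.
k = 128
ratios = {n: n / k for n in (512, 4096, 65536)}
for n, r in ratios.items():
    print(f"n={n:6d}  standard/Linformer FLOP ratio ~ {r:.0f}x")
```

At n = 512 the theoretical ratio is only 4×, which is consistent with the modest 1.5× measured speedup above; at much longer sequences the gap becomes dramatic.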
