Brief Review — Linformer: Self-Attention with Linear Complexity

The self-attention mechanism can be approximated by a low-rank matrix

Sik-Ho Tsang
3 min read · Jul 31, 2024

Linformer: Self-Attention with Linear Complexity
Linformer, by Facebook AI
2020 arXiv v3, Over 1500 Citations (Sik-Ho Tsang @ Medium)

Language Model (LM)
2007 … 2022 [GLM] [Switch Transformers] [WideNet] [MoEBERT] [X-MoE] [sMLP] [LinkBERT, BioLinkBERT] [AlphaCode] [Block-wise Dynamic Quantization] 2023 [ERNIE-Code] [Grouped-Query Attention (GQA)]
==== My Other Paper Readings Are Also Over Here ====

  • The standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length.
  • In this paper, it is found that the self-attention mechanism can be approximated by a low-rank matrix. The resulting linear Transformer, the Linformer, performs on par with standard Transformer models while being much more memory- and time-efficient.

Outline

  1. Linformer
  2. Results

1. Linformer

1.1. Linear Self-Attention

Linear self-attention in Linformer
  • The main idea of the proposed linear self-attention (Figure 2) is to add two linear projection matrices Ei, Fi when computing key and value.
  • The original (n×d)-dimensional key and value layers are first projected into (k×d)-dimensional projected key and value layers.
  • Then, an (n×k)-dimensional context mapping matrix P̄ is computed using scaled dot-product attention; the full head-level formula is written out after this list.
  • The above operations only require O(nk) time and space complexity.
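
Putting the two projection bullets together, the head-level computation from the paper can be written, up to notation, as:

```latex
% Linformer attention head: E_i, F_i are (k x n) projections applied
% along the sequence dimension of the keys and values.
\[
\mathrm{head}_i
  = \underbrace{\mathrm{softmax}\!\left(
      \frac{Q W_i^{Q}\,\bigl(E_i K W_i^{K}\bigr)^{\top}}{\sqrt{d_k}}
    \right)}_{\bar{P}\;:\;n \times k}
  \cdot
  \underbrace{\bigl(F_i V W_i^{V}\bigr)}_{k \times d}
\]
```

Computing P̄ and multiplying it by the projected values each scale as O(nk) in the sequence length, which is the complexity quoted above.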

Thus, if k ≪ n, then the memory and time consumption can be significantly reduced.
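
As a concrete illustration, here is a minimal PyTorch-style sketch of one such head. The class name, tensor shapes, and the random initialisation of E and F are my own illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn


class LinearSelfAttentionHead(nn.Module):
    """One Linformer-style head: keys and values are projected from
    sequence length n down to k before the softmax (illustrative sketch)."""

    def __init__(self, d_model: int, d_head: int, n: int, k: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.k_proj = nn.Linear(d_model, d_head)
        self.v_proj = nn.Linear(d_model, d_head)
        # E, F: (k x n) projections applied along the sequence dimension.
        self.E = nn.Parameter(torch.randn(k, n) / n ** 0.5)
        self.F = nn.Parameter(torch.randn(k, n) / n ** 0.5)
        self.scale = d_head ** -0.5

    def forward(self, x):                                # x: (batch, n, d_model)
        q = self.q_proj(x)                               # (batch, n, d_head)
        k_low = self.E @ self.k_proj(x)                  # (batch, k, d_head)
        v_low = self.F @ self.v_proj(x)                  # (batch, k, d_head)
        scores = q @ k_low.transpose(-2, -1) * self.scale
        p_bar = torch.softmax(scores, dim=-1)            # (batch, n, k) context mapping
        return p_bar @ v_low                             # (batch, n, d_head): O(n*k), not O(n^2)
```

With n = 512 and k = 128 (the setting quoted in the results below), the softmax is taken over a 512×128 score matrix instead of the 512×512 one in standard attention.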

  • In Figure 2 (top right), the inference speed of Linformer and standard Transformer is plotted versus sequence length, while holding the total number of tokens fixed.

1.2. Parameter Sharing Between Projections

  • One can share parameters for the linear projection matrices Ei, Fi across heads and layers. In particular, there are 3 levels of sharing (a short code sketch follows this list):
  • Headwise sharing: Ei = E and Fi = F across all heads i.
  • Key-value sharing: Ei = Fi = E for the key and value projections, across all heads i.
  • Layerwise sharing: A single projection matrix E is used across all layers, for all heads, and for both key and value.
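
A minimal sketch of how the three levels could be wired up, assuming one (k×n) matrix per shared slot. The helper name, the mode strings, and the initialisation are hypothetical, not the paper's implementation:

```python
import torch
import torch.nn as nn


def make_projections(n_layers: int, n_heads: int, n: int, k: int, mode: str):
    """Return a [layer][head] grid of (E, F) projection pairs under the
    three Linformer sharing strategies (illustrative sketch)."""
    def new_proj():
        return nn.Parameter(torch.randn(k, n) / n ** 0.5)

    if mode == "layerwise":
        E = new_proj()                       # one matrix for all layers, heads, keys and values
        return [[(E, E)] * n_heads for _ in range(n_layers)]
    if mode == "key_value":
        layer_E = [new_proj() for _ in range(n_layers)]      # one shared matrix per layer
        return [[(E, E)] * n_heads for E in layer_E]
    if mode == "headwise":
        layer_EF = [(new_proj(), new_proj()) for _ in range(n_layers)]  # separate E, F per layer
        return [[(E, F)] * n_heads for (E, F) in layer_EF]
    raise ValueError(f"unknown sharing mode: {mode}")
```

Layerwise sharing adds only a single k×n matrix of extra parameters for the whole model, and, per the results below, it is also the most accurate of the three strategies.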

2. Results

  • RoBERTa is used.
  • BookCorpus plus English Wikipedia are used as the pretraining set, with the masked-language-modeling (MLM) objective.
  • Then, the pretrained model is fine-tuned on downstream tasks.
  • The Linformer model (n = 512, k = 128) has downstream performance comparable to the RoBERTa model, and in fact even slightly outperforms it at k = 256.
  • Moreover, although the layerwise strategy shares a single projection matrix across the entire model, it exhibits the best accuracy of all three parameter-sharing strategies.

Even with n = 512 and k = 128, Linformer has 1.5× faster inference time and allows for a 1.7× larger maximum batch size than the Transformer.

  • As sequence length increases, the inference-time speed-up and memory savings are even more dramatic.


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.