# Brief Review — Linformer: Self-Attention with Linear Complexity

## Self-attention mechanism can be approximated by a low-rank matrix

Linformer: Self-Attention with Linear Complexity, by Facebook AI

Linformer, 2020 arXiv v3, Over 1500 Citations (Sik-Ho Tsang @ Medium)


==== My Other Paper Readings Are Also Over Here ====

- The **standard self-attention** mechanism of the Transformer uses **O(*n*²)** time and space with respect to sequence length *n*.
- In this paper, it is found that the **self-attention mechanism can be approximated by a low-rank matrix**. The resulting linear Transformer, the **Linformer**, performs **on par with the standard Transformer** while using only linear time and space.
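Below is a minimal sketch (not the authors' code) of how the low-rank claim can be probed: form the (*n*×*n*) context mapping matrix *P* for one head and inspect how quickly its singular-value spectrum decays. The random *Q* and *K* are only there to make the snippet runnable; the paper's spectrum analysis is done on attention matrices of pretrained Transformer/RoBERTa models, where the decay is much more pronounced.

```python
# Sketch (not the paper's code): probe how low-rank a softmax attention matrix is
# by looking at its singular-value spectrum. Q, K are random just to run the snippet;
# the paper analyzes attention matrices of pretrained Transformer/RoBERTa models.
import torch

n, d = 512, 64                                   # sequence length, head dimension
Q, K = torch.randn(n, d), torch.randn(n, d)

P = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)    # (n, n) context mapping matrix

s = torch.linalg.svdvals(P)                      # singular values, descending order
energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
rank_99 = int((energy < 0.99).sum()) + 1
print(f"rank capturing 99% of the spectral energy: {rank_99} / {n}")
```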

# Outline

1. **Linformer**
2. **Results**

# 1. Linformer

## 1.1. Linear Self-Attention

- The main idea of the **proposed linear self-attention** (Figure 2) is to **add two linear projection matrices** *Ei*, *Fi* when computing key and value.
- The original (*n*×*d*)-dimensional key and value layers are **first projected into (*k*×*d*)-dimensional projected key and value layers**.
- Then, an (*n*×*k*)-dimensional **context mapping matrix** *P̄* is **computed using scaled dot-product attention**.

- The above operations only require **O(*nk*)** time and space complexity (a minimal code sketch follows this list).

Thus, if *k* << *n*, then the memory and space consumption can be significantly reduced.

- In **Figure 2 (top right)**, **the inference speed of Linformer and the standard Transformer is plotted versus sequence length**, while holding the total number of tokens fixed.
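The sketch below implements one head of this linear self-attention under the paper's notation (*E*, *F* are *k*×*n* matrices applied along the sequence dimension of keys and values); the module structure, initialization, and shapes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal single-head sketch of Linformer's linear self-attention.
# E, F are (k x n) projections applied to keys and values, as in the paper;
# everything else (nn.Linear layers, init) is illustrative.
import torch
import torch.nn as nn

class LinearSelfAttentionHead(nn.Module):
    def __init__(self, d_model: int, n: int, k: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.E = nn.Parameter(torch.randn(k, n) / n ** 0.5)  # key projection   (k x n)
        self.F = nn.Parameter(torch.randn(k, n) / n ** 0.5)  # value projection (k x n)
        self.scale = d_model ** -0.5

    def forward(self, x):                       # x: (batch, n, d_model)
        Q = self.q_proj(x)                      # (batch, n, d)
        K = self.E @ self.k_proj(x)             # (batch, k, d): project along sequence dim
        V = self.F @ self.v_proj(x)             # (batch, k, d)
        P = torch.softmax(Q @ K.transpose(-2, -1) * self.scale, dim=-1)  # (batch, n, k)
        return P @ V                            # (batch, n, d) at O(n*k) cost

x = torch.randn(2, 512, 64)                     # batch of 2, n = 512, d = 64
out = LinearSelfAttentionHead(d_model=64, n=512, k=128)(x)
print(out.shape)                                # torch.Size([2, 512, 64])
```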

## 1.2. Parameter Sharing Between Projections

- Parameter sharing between projections: One can share parameters for the linear projection matrices *Ei*, *Fi* across layers and heads. In particular, there are **3 levels of sharing**:
  - **Headwise sharing**: *Ei* = *E* and *Fi* = *F* across all heads *i*.
  - **Key-value sharing**: *Ei* = *Fi* = *E* for each key-value projection matrix, across all heads *i*.
  - **Layerwise sharing**: A single projection matrix *E* is used across all layers, for all heads, and for both key and value.
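A small sketch of how the three sharing levels could be wired; the helper `make_projections` and its arguments are hypothetical, only the sharing rules themselves follow the list above.

```python
# Illustrative sketch of the three sharing levels for the (k x n) projection matrices,
# assuming num_layers Transformer layers with num_heads heads each.
import torch
import torch.nn as nn

def make_projections(n, k, num_layers, num_heads, sharing="headwise"):
    """Return proj[layer][head] -> (E, F) according to the chosen sharing level."""
    def new(): return nn.Parameter(torch.randn(k, n) / n ** 0.5)

    if sharing == "headwise":          # E_i = E, F_i = F shared across heads (per layer)
        per_layer = [(new(), new()) for _ in range(num_layers)]
        return [[per_layer[l]] * num_heads for l in range(num_layers)]
    if sharing == "key-value":         # E_i = F_i = E shared across heads (per layer)
        per_layer = [new() for _ in range(num_layers)]
        return [[(per_layer[l], per_layer[l])] * num_heads for l in range(num_layers)]
    if sharing == "layerwise":         # one E for all layers, heads, keys and values
        E = new()
        return [[(E, E)] * num_heads for _ in range(num_layers)]
    raise ValueError(sharing)

proj = make_projections(n=512, k=128, num_layers=12, num_heads=12, sharing="layerwise")
E, F = proj[3][7]                      # layer 3, head 7 reuse the same single matrix
print(E is F, E is proj[0][0][0])      # True True
```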

# 2. Results

- As in **RoBERTa**, **BookCorpus** plus **English Wikipedia** are used as the pretraining set, with the **masked-language-modeling (MLM) objective**.
- Then, the pretrained model is fine-tuned on downstream tasks.

- The **Linformer model (*n* = 512, *k* = 128) has comparable downstream performance to the RoBERTa model**, and in fact **even slightly outperforms it at *k* = 256**.
- Moreover, it is noted that although the Linformer's layerwise sharing strategy shares a single projection matrix across the entire model, it actually exhibits the best accuracy of all three parameter sharing strategies.

Even with *n* = 512 and *k* = 128, Linformer has 1.5× faster inference time and allows for a 1.7× larger maximum batch size than the Transformer.

- As sequence length increases, the inference-time speed-up and memory savings are even more dramatic.
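A back-of-the-envelope check of why the savings grow with sequence length (my own arithmetic, not figures from the paper): the standard attention map stores *n*×*n* entries per head, while Linformer's context mapping stores only *n*×*k*, so the ratio grows linearly in *n* at fixed *k*.

```python
# Rough arithmetic (not from the paper): per-head attention map size,
# standard softmax attention (n x n) vs. Linformer's context mapping (n x k).
k = 128
for n in [512, 4096, 65536]:
    ratio = (n * n) / (n * k)                   # = n / k
    print(f"n = {n:>6}, k = {k}: attention map is {ratio:.0f}x smaller with Linformer")
# n = 512 -> 4x, n = 4096 -> 32x, n = 65536 -> 512x
```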