# Brief Review — Rethinking Attention with Performers

## Performers, Using FAVOR+, Approximate Full Softmax

Rethinking Attention with Performers, by Google, University of Cambridge, DeepMind, and Alan Turing Institute

Performers, 2021 ICLR, Over 500 Citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Transformer

A **conventional Transformer** has **quadratic space and time complexity** due to the softmax attention within the self-attention module. **Performers**, a Transformer variant, are proposed to **approximate softmax attention kernels**, leveraging a novel **Fast Attention Via positive Orthogonal Random features approach (FAVOR+)** to efficiently **model kernelizable attention mechanisms** beyond softmax.

# Outline

1. **Preliminaries**
2. **Performer**
3. **Results**

# 1. Preliminaries

## 1.1. Conventional Transformer

- As mentioned above, a **conventional Transformer** has **quadratic space and time complexity** due to the **softmax attention** matrix *A*, which has one entry per query-key pair.

- Several lines of work have been proposed to address this issue, as outlined in the following subsections.
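For reference, the bidirectional softmax attention that the paper starts from can be written as below. This is a transcription in the paper's notation rather than something stated in this post: *L* is the sequence length, *d* the hidden dimension, and 1_L the all-ones vector of length *L*.

```latex
\mathrm{Att}_{\leftrightarrow}(Q, K, V) = D^{-1} A V, \qquad
A = \exp\!\left( Q K^{\top} / \sqrt{d} \right), \qquad
D = \mathrm{diag}\!\left( A \mathbf{1}_L \right), \qquad
Q, K, V \in \mathbb{R}^{L \times d}
```

Because *A* is an *L*×*L* matrix, computing it takes O(*L*²*d*) time and storing it takes O(*L*²) space, which is where the quadratic cost comes from.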

## 1.2. Standard Sparsification Techniques

**Left**: **Local attention** instead of global attention, only **attending to nearby tokens**. **Right**: Graph-based attention, **attending to neighbors** only.
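Local attention can be pictured as a banded mask over the attention matrix. The snippet below is a minimal sketch, assuming a symmetric window of hypothetical size `window` (the post does not specify one); `local_attention_mask` is an illustrative helper name, not taken from any library.

```python
def local_attention_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


# Hypothetical toy setting: 6 tokens, each attending to itself and one neighbor per side.
mask = local_attention_mask(seq_len=6, window=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

kept = sum(sum(row) for row in mask)
print(f"{kept} of {6 * 6} attention entries kept")
```

The number of kept entries grows linearly with the sequence length for a fixed window, but the pattern is fixed in advance, which is the kind of structural constraint the Performer avoids.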

# 2. Performer

- The conventional Transformer self-attention module has *Q*, *K*, *V*. Within it, *Q* and *K* generate *A*, which then interacts with *V*.
- Here, the **matrix** *A* is approximated by lower-rank randomized matrices *Q*′ and *K*′ using a novel **Fast Attention Via positive Orthogonal Random features approach (FAVOR+)**.
- FAVOR+ works for attention blocks using **matrices** *A* whose entry (*i*, *j*) is a kernel evaluated on the query-key pair (*qi*, *kj*),

- with *qi*/*kj* standing for the *i*th/*j*th query/key row-vector in *Q*/*K*, and the **kernel** K defined as the expected inner product of *Φ*(*x*) and *Φ*(*y*) for a (usually randomized) mapping *Φ*.

- *Q*′ and *K*′ are then the matrices with **rows given as** *Φ*(*qi*) and *Φ*(*ki*), respectively.
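In the paper's notation, the definitions in the list above can be written compactly as follows. This is a transcription from the Performer paper, not from this post; *r* (the output dimension of the mapping, written φ below) and the all-ones vector 1_L are symbols assumed from there.

```latex
\begin{aligned}
A(i, j) &= \mathrm{K}(q_i^{\top}, k_j^{\top}), \qquad
\mathrm{K}(x, y) = \mathbb{E}\left[ \phi(x)^{\top} \phi(y) \right], \\
\widehat{\mathrm{Att}}_{\leftrightarrow}(Q, K, V)
  &= \widehat{D}^{-1} \left( Q' \left( (K')^{\top} V \right) \right), \qquad
\widehat{D} = \mathrm{diag}\left( Q' \left( (K')^{\top} \mathbf{1}_L \right) \right), \qquad
Q', K' \in \mathbb{R}^{L \times r}
\end{aligned}
```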

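Here, *Âtt*↔ stands for the **approximate attention**, computed as *Q*′((*K*′)ᵀ*V*) with row-wise normalization. The brackets indicate the **order of computations**: (*K*′)ᵀ*V* is evaluated first, so the *L*×*L* matrix *A* is never formed explicitly.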

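- *Φ* is built from functions *f*1, …, *fl*, a function *h*, and vectors that are either deterministic (*ωi*) or random (*ω*1, …, *ωm* iid ~ *D* for some distribution *D* ∈ *P*(*R^d*), such as Gaussian): every *fj* is applied to every projection of the input onto an *ωi*, and the concatenated result is scaled by *h*.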
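The general form of the mapping, transcribed from the paper (with *m* the number of vectors ω and *l* the number of functions *f*), is:

```latex
\phi(x) = \frac{h(x)}{\sqrt{m}} \left(
  f_1(\omega_1^{\top} x), \ldots, f_1(\omega_m^{\top} x), \;\ldots,\;
  f_l(\omega_1^{\top} x), \ldots, f_l(\omega_m^{\top} x)
\right)
```

For the softmax kernel, the paper's positive random features correspond to the choice *h*(*x*) = exp(−‖*x*‖²/2), *l* = 1, *f*1 = exp, with *D* a standard Gaussian.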

With such a *Φ*, an **efficient attention mechanism** is formed.

With the concept of low-rank approximation/matrix factorization/matrix decomposition, the space and time complexity become linear in the sequence length *L* rather than quadratic.

- (Please feel free to read the paper for mathematical proofs.)
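The mechanism described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it uses i.i.d. Gaussian projections and omits the orthogonalization step of FAVOR+ as well as any numerical stabilization, and all function names and toy sizes are hypothetical. It compares exact softmax attention with the approximation computed in the bracket order *Q*′((*K*′)ᵀ*V*).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_attention(q, k, v):
    """Exact D^-1 A V with A = exp(Q K^T / sqrt(d)); materializes the L x L matrix A."""
    d = len(q[0])
    a = [[math.exp(sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d)) for kj in k]
         for qi in q]
    return [[x / sum(a_row) for x in av_row] for a_row, av_row in zip(a, matmul(a, v))]


def positive_features(x, omegas):
    """phi(x) = exp(-|x|^2 / 2) / sqrt(m) * (exp(w_1 . x), ..., exp(w_m . x))."""
    h = math.exp(-sum(t * t for t in x) / 2.0) / math.sqrt(len(omegas))
    return [h * math.exp(sum(w_t * t for w_t, t in zip(w, x))) for w in omegas]


def favor_attention(q, k, v, omegas):
    """Approximate attention computed as Q'((K')^T V), never forming an L x L matrix."""
    scale = len(q[0]) ** -0.25  # splits the 1/sqrt(d) factor between queries and keys
    q_prime = [positive_features([t * scale for t in row], omegas) for row in q]
    k_prime = [positive_features([t * scale for t in row], omegas) for row in k]
    k_prime_t = transpose(k_prime)               # m x L
    kv = matmul(k_prime_t, v)                    # m x d, computed first (inner bracket)
    k_sum = [sum(row) for row in k_prime_t]      # (K')^T 1_L, length m
    numerator = matmul(q_prime, kv)              # L x d
    normalizer = [sum(a * b for a, b in zip(row, k_sum)) for row in q_prime]
    return [[x / z for x in row] for row, z in zip(numerator, normalizer)]


random.seed(0)
seq_len, dim, num_features = 6, 4, 256  # hypothetical toy sizes
q = [[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(seq_len)]
k = [[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(seq_len)]
v = [[random.gauss(0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
omegas = [[random.gauss(0, 1.0) for _ in range(dim)] for _ in range(num_features)]

exact = softmax_attention(q, k, v)
approx = favor_attention(q, k, v, omegas)
max_err = max(abs(e - a) for er, ar in zip(exact, approx) for e, a in zip(er, ar))
print(f"max abs difference with {num_features} random features: {max_err:.4f}")
```

The largest intermediate in `favor_attention` is of size *L*×*m* or *m*×*d*, so the cost grows linearly with the sequence length for a fixed number of features. Because every feature is positive, each output row is still a weighted average of the rows of *V*, as in exact softmax attention.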

# 3. Results

## 3.1. NLP Datasets

**“X” (OPT)** denotes the **maximum possible speedup** achievable, when attention simply returns the *V*-matrix.

The Performer reaches nearly linear time and sub-quadratic memory consumption (since the explicit O(*L*²) attention matrix is not stored). In fact, compared with “X”, the Performer achieves nearly the optimal speedup and memory efficiency possible.
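To get a feel for why not storing the attention matrix matters, the following back-of-the-envelope sketch compares buffer sizes. It assumes a single attention head, float32 values, and hypothetical sizes of 256 random features and a per-head dimension of 64. It counts only the buffers named in the comments and ignores activations and framework overhead, so the numbers are illustrative rather than measured.

```python
BYTES_PER_FLOAT32 = 4


def attention_matrix_bytes(seq_len):
    """Bytes needed to store one explicit L x L attention matrix."""
    return seq_len * seq_len * BYTES_PER_FLOAT32


def favor_buffer_bytes(seq_len, num_features, head_dim):
    """Bytes for Q' and K' (L x r each), (K')^T V (r x d) and (K')^T 1_L (r)."""
    floats = 2 * seq_len * num_features + num_features * head_dim + num_features
    return floats * BYTES_PER_FLOAT32


# Hypothetical sizes: r = 256 random features, d = 64 per-head dimension.
for seq_len in (1024, 8192, 65536):
    full = attention_matrix_bytes(seq_len) / 2**20
    favor = favor_buffer_bytes(seq_len, num_features=256, head_dim=64) / 2**20
    print(f"L={seq_len:>6}: explicit A = {full:9.1f} MiB, FAVOR+ buffers = {favor:7.1f} MiB")
```

Doubling the sequence length quadruples the first column but only roughly doubles the second.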

## 3.2. Protein Sequence Dataset

- A **36-layer model** is trained using **protein sequences** from the Jan. 2019 release of TrEMBL.
- **Reformer and Linformer significantly drop in accuracy** on the protein dataset.

Performer-ReLU (taking *f* = ReLU) achieves the highest accuracy in both (U) and (B) cases. (U: Unidirectional, B: Bidirectional)

- A protein benchmark is tried for **predicting interactions among groups of proteins** by **concatenating protein sequences** from TrEMBL to **length** *L* = 8192.
- A regular **Transformer** **overloads memory** even at a batch size of 1 per chip.

The smaller Transformer (*n*layer = 3) is quickly bounded at 19%, while the Performer is able to train continuously to 24%.

## 3.3. ImageNet64 (Image Generation)

Performer/6-layers matches the Reformer/12-layers, while the Performer/12-layers matches the Reformer/24-layers.

- Depending on **hardware (TPU or GPU)**, it is also found that the **Performer can be 2× faster than the Reformer** via Jax optimizations for the (U) setting.

- Performer enables the Transformer to be **applied to much longer sequences without constraints** on the structure of the attention matrix, which can help **advance biology and medicine (e.g., very long protein sequences)**.

## References

[2021 ICLR] [Performer] Rethinking Attention with Performers

[Google AI Blog] Rethinking Attention with Performers
