Brief Review — Rethinking Attention with Performers

Performers, Using FAVOR+, Approximate Full Softmax

4 min readNov 26, 2022

--

Rethinking Attention with Performers,
Performers, by Google, University of Cambridge, DeepMind, and Alan Turing Institute, 2021 ICLR, Over 500 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Transformer

Conventional Transformer has a quadratic space and time complexity due to the softmax attention within the self-attention module.
Performers, a Transformer variant, is proposed to approximate softmax attention kernels, which leverages a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+) to efficiently model kernelizable attention mechanisms beyond softmax.

Outline

Preliminaries
Performer
Results

1. Preliminaries

1.1. Conventional Transformer

As mentioned above, conventional Transformer has a quadratic space and time complexity due to the softmax attention A:

Some research works are suggested to solve the above issue below.

1.2. Standard Sparsification Techniques

**Standard sparsification techniques**. **Left:** Example of a sparsity pattern, where tokens attend only to other nearby tokens. **Right**: In Graph Attention Networks, tokens attend only to their neighbors in the graph, which should have higher relevance than other nodes.

Left: Local attention instead of global attention, only attending to nearby tokens.
Right: Graph-based attention, attending to neighbors only.

2. Performer

***LHS:*** ***The standard attention matrix****, which contains all similarity scores for every pair of entries, formed by a softmax operation on the query and keys, denoted by* q and k. ***RHS:*** *The standard attention matrix can be* ***approximated via lower-rank randomized matrices Q′ and K′*** with rows encoding potentially randomized nonlinear functions of the original queries/keys. For the regular softmax-attention, the transformation is very compact and involves an exponential function as well as random Gaussian projections.

Conventional Transformer self-attention module has Q, K, V. Within it, Q and K generates A then interacts with V.
Here, the matrix A is approximated by lower-rank randomized matrices Q′ and K′ a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+).
FAVOR+ works for attention blocks using matrices A of the form:

with qi/kj standing for the ith/jth query/key row-vector in Q/K and kernel K defined for the (usually randomized) mapping Φ:

For Q’, K’ with rows given as Φ(qi) and Φ(ki) respectively.

Here ^Att↔ stands for the approximate attention and brackets in the figure below indicate the order of computations:

By taking Φ of the following form for functions f1, …, fl, function h and deterministic vectors ωi or ω1, …, ωm, iid ~ D for some distribution D∈P(R^d) (such as Gaussian):

Efficient attention mechanism is formed:

***Left: Standard unidirectional attention*** *requires* ***masking the attention matrix to obtain its lower-triangular part***. ***Right:*** ***Unbiased approximation on the LHS*** can be obtained via a prefix-sum mechanism, where the prefix-sum of the outer-products of random feature maps for keys and value vectors is built on the fly and left-multiplied by query random feature vector to obtain the new row in the resulting matrix. (Figure from *Google AI*)

**Approximation of the regular attention mechanism AV** (before D^(-1)-renormalization) via (random) feature maps. Dashed-blocks indicate order of computation with corresponding time complexities attached.

With the concept of low-rank approximation/matrix factorization/matrix decomposition, then the space and time complexity becomes much more linear.

(Please feel free to read the paper for mathematical proofs.)

3. Results

3.1. NLP Datasets

**Comparison of** **Transformer** **and Performer in terms of forward and backward pass speed and maximum L allowed**

“X” (OPT) denotes the maximum possible speedup achievable, when attention simply returns the V-matrix.

The Performer reaches nearly linear time and sub-quadratic memory consumption (since the explicit O(L2) attention matrix is not stored).
In fact, by comparing “X”, the Performer achieves nearly optimal speedup and memory efficiency possible.

3.2. Protein Sequence Dataset

**Train = Dashed, Validation = Solid, on Protein Dataset**

A 36-layer model is trained using protein sequences from the Jan. 2019 release of TrEMBL.
Reformer and Linformer significantly drop in accuracy on the protein dataset.

Performer-ReLU (taking f=ReLU) achieves the highest accuracy in both (U) and (B) cases. (U: Unidirectional, B: Bidirectional)

**Train = Dashed, Validation = Solid, on TrEMBL-Concat**

A protein benchmark is tried for predicting interactions among groups of proteins by concatenating protein sequences to length L=8192 from TrEMBL.
A regular Transformer overloads memory even at a batch size of 1 per chip.

The smaller Transformer (nlayer = 3) is quickly bounded at 19%, while the Performer is able to train continuously to 24%.

3.3. ImageNet64 (Image Generation)

**Train = Dashed, Validation = Solid, on ImageNet64**

Performer/6-layers matches the Reformer/12-layers, while the Performer/12-layers matches the Reformer/24-layers.

Depending on hardware (TPU or GPU), it is also found that the Performer can be 2× faster than the Reformer via Jax optimizations for the (U) setting.

Performer enables the Transformer to be applied to much longer sequences without constraints on the structure of the attention matrix to advance the biology and medicine (e.g.: very long protein sequence).

References

[2021 ICLR] [Performer]
Rethinking Attention with Performers

[Google AI Blog]
Rethinking Attention with Performers

2.1. Language Model / Sequence Model

(Some are not related to NLP, but I just group them here)

1991 … 2020 [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT] [BART] [Longformer] [ELECTRA] [Megatron-LM] [SpanBERT] 2021 [Performer]

Brief Review — Rethinking Attention with Performers

Performers, Using FAVOR+, Approximate Full Softmax

Outline

1. Preliminaries

1.1. Conventional Transformer

1.2. Standard Sparsification Techniques

2. Performer

3. Results

3.1. NLP Datasets

3.2. Protein Sequence Dataset

3.3. ImageNet64 (Image Generation)

References

2.1. Language Model / Sequence Model

My Other Previous Paper Readings

Written by Sik-Ho Tsang

Responses (1)