# Brief Review — Rethinking Attention with Performers

## Performers, Using FAVOR+, Approximate Full Softmax

---

Rethinking Attention with Performers (Performer), by Google, University of Cambridge, DeepMind, and Alan Turing Institute, 2021 ICLR, Over 500 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Transformer

• The conventional Transformer has quadratic space and time complexity due to the softmax attention within the self-attention module.
• Performer, a Transformer variant, is proposed to approximate softmax attention kernels. It leverages a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+) to efficiently model kernelizable attention mechanisms beyond softmax.

# Outline

1. Preliminaries
2. Performer
3. Results

# 1. Preliminaries

## 1.1. Conventional Transformer

• As mentioned above, the conventional Transformer has quadratic space and time complexity due to the softmax attention matrix A (reconstructed after this list).
• Several research works have been proposed to address this issue, as described below.
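
As a reminder, the standard (bidirectional) softmax attention can be written as follows (a reconstruction in the paper's notation, with sequence length L and query/key dimension d):

```latex
\mathrm{Att}_{\leftrightarrow}(Q, K, V) = D^{-1} A V, \qquad
A = \exp\!\left(QK^{\top}/\sqrt{d}\right), \qquad
D = \mathrm{diag}\!\left(A \mathbf{1}_L\right)
```

Here exp is applied elementwise and 1_L is the all-ones vector of length L; materializing A explicitly is what costs O(L²) time and space.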

## 1.2. Standard Sparsification Techniques

• Left: local attention instead of global attention, where each token attends only to nearby tokens (see the sketch after this list).
• Right: graph-based attention, where each token attends only to its neighbors in a graph.
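
As a concrete illustration of local attention, here is a minimal NumPy sketch of a sliding-window attention mask; the window size and the helper name `local_attention_mask` are illustrative assumptions, not the exact scheme of any particular sparse-attention paper.

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may only attend to positions within `window` of i."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Example: with a window of 2, each token attends to at most 5 neighbours,
# so the number of nonzero attention entries grows linearly with seq_len.
mask = local_attention_mask(seq_len=8, window=2)
print(mask.astype(int))
```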

# 2. Performer

• The conventional Transformer self-attention module has Q, K, V. Within it, Q and K generate the attention matrix A, which then interacts with V.
• Here, the matrix A is approximated using lower-rank randomized matrices Q′ and K′ via the novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+).
• FAVOR+ works for attention blocks using matrices A of the form reconstructed after this list,
• with q_i/k_j standing for the i-th/j-th query/key row-vector in Q/K, and the kernel K defined through a (usually randomized) mapping Φ,
• and with Q′, K′ having rows given by Φ(q_i)^⊤ and Φ(k_i)^⊤ respectively.
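
The kernelized form of A referenced above can be reconstructed as follows (in the paper's notation, with a randomized feature map Φ mapping R^d to R^r for some r > 0):

```latex
A(i, j) = \mathrm{K}(q_i, k_j), \qquad
\mathrm{K}(x, y) = \mathbb{E}\!\left[\Phi(x)^{\top} \Phi(y)\right]
```

Then Q′, K′ ∈ R^{L×r} are formed with rows Φ(q_i)^⊤ and Φ(k_i)^⊤ respectively, so that A ≈ Q′(K′)^⊤.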

Here Att^(hat)_↔ stands for the approximate attention, and the brackets in the decomposition below indicate the order of computations:
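
The approximate attention and the bracketing can be written out as follows (reconstructed in the paper's notation):

```latex
\widehat{\mathrm{Att}}_{\leftrightarrow}(Q, K, V)
= \widehat{D}^{-1}\!\left(Q'\left((K')^{\top} V\right)\right), \qquad
\widehat{D} = \mathrm{diag}\!\left(Q'\left((K')^{\top} \mathbf{1}_L\right)\right)
```

Computing (K′)^⊤V first avoids ever forming the L×L matrix Q′(K′)^⊤, which is the source of the linear complexity.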

• Φ is taken to be of the form reconstructed after this list, for functions f_1, …, f_l, a function h, and vectors ω_1, …, ω_m that are either deterministic or drawn iid ∼ D for some distribution D ∈ P(R^d) (such as a Gaussian).
• With this choice, an efficient attention mechanism is formed.
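
The generic form of Φ referenced above can be reconstructed as follows (following the paper's notation):

```latex
\Phi(x) = \frac{h(x)}{\sqrt{m}}
\big(
 f_1(\omega_1^{\top} x), \ldots, f_1(\omega_m^{\top} x),
 \ldots,
 f_l(\omega_1^{\top} x), \ldots, f_l(\omega_m^{\top} x)
\big)
```

For instance, the positive random features used to approximate the softmax kernel take h(x) = exp(−‖x‖²/2), l = 1 and f_1 = exp, which keeps all features non-negative and stabilizes the estimator.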

With this low-rank approximation (matrix factorization/decomposition), the space and time complexity becomes linear in the sequence length L rather than quadratic.
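
A minimal NumPy sketch of this computation, using the positive softmax-kernel features described above, is shown below; the function names, toy dimensions, and plain iid Gaussian ω's (no orthogonalization) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax_positive_features(x, omega):
    """Positive random features for the softmax kernel:
    phi(x) = exp(omega @ x - ||x||^2 / 2) / sqrt(m)."""
    m = omega.shape[0]
    return np.exp(x @ omega.T - np.sum(x**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

def favor_attention(Q, K, V, omega):
    """Approximate softmax attention in O(L * m * d) time and memory,
    never materializing the L x L attention matrix."""
    q_prime = softmax_positive_features(Q, omega)   # (L, m)
    k_prime = softmax_positive_features(K, omega)   # (L, m)
    kv = k_prime.T @ V                               # (m, d) -- computed first, as the brackets indicate
    normalizer = q_prime @ k_prime.sum(axis=0)       # (L,)   -- D_hat = diag(Q'((K')^T 1_L))
    return (q_prime @ kv) / normalizer[:, None]

# Toy comparison against exact softmax attention (the 1/sqrt(d) scaling is folded into Q and K).
rng = np.random.default_rng(0)
L, d, m = 16, 8, 256
Q = rng.normal(size=(L, d)) / d ** 0.25
K = rng.normal(size=(L, d)) / d ** 0.25
V = rng.normal(size=(L, d))
omega = rng.normal(size=(m, d))                      # iid Gaussian features
A = np.exp(Q @ K.T)
exact = (A / A.sum(axis=1, keepdims=True)) @ V
approx = favor_attention(Q, K, V, omega)
print("max abs error:", np.max(np.abs(exact - approx)))
```

With m fixed and L growing, the (K′)^⊤V-first ordering is exactly what keeps the cost linear in L.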

# 3. Results

## 3.1. NLP Datasets

• “X” (OPT) denotes the maximum possible speedup achievable, obtained when attention simply returns the V-matrix.

The Performer reaches nearly linear time and sub-quadratic memory consumption (since the explicit O(L²) attention matrix is not stored).

In fact, comparing against “X”, the Performer achieves nearly the optimal speedup and memory efficiency possible.

## 3.2. Protein Sequence Dataset

• A 36-layer model is trained using protein sequences from the Jan. 2019 release of TrEMBL.
• Reformer and Linformer significantly drop in accuracy on the protein dataset.

Performer-ReLU (taking f=ReLU) achieves the highest accuracy in both (U) and (B) cases. (U: Unidirectional, B: Bidirectional)
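
For reference, Performer-ReLU is an instance of the generic Φ above with f = ReLU (assuming h ≡ 1 and l = 1), giving a generalized attention kernel rather than a softmax approximation:

```latex
\Phi(x) = \frac{1}{\sqrt{m}}\,\mathrm{ReLU}\!\left(Wx\right), \qquad
W = (\omega_1, \ldots, \omega_m)^{\top}
```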

• A harder protein benchmark is attempted for predicting interactions among groups of proteins, by concatenating protein sequences from TrEMBL to length L=8192.
• A regular Transformer overloads memory even at a batch size of 1 per chip.

The smaller Transformer (n_layer = 3) quickly plateaus at around 19% accuracy, while the Performer continues to train, reaching around 24%.

## 3.3. ImageNet64 (Image Generation)

Performer/6-layers matches the Reformer/12-layers, while the Performer/12-layers matches the Reformer/24-layers.

• Depending on hardware (TPU or GPU), it is also found that the Performer can be 2× faster than the Reformer via Jax optimizations for the (U) setting.
• The Performer enables Transformers to be applied to much longer sequences without constraints on the structure of the attention matrix, which can advance biology and medicine (e.g., very long protein sequences).

## References

[2021 ICLR] [Performer]
Rethinking Attention with Performers