Brief Review — Rethinking Attention with Performers

Performers, Using FAVOR+, Approximate Full Softmax

Sik-Ho Tsang
4 min readNov 26, 2022

Rethinking Attention with Performers,
, by Google, University of Cambridge, DeepMind, and Alan Turing Institute, 2021 ICLR, Over 500 Citations (

@ Medium)
Natural Language Processing, NLP, Transformer

  • Conventional Transformer has a quadratic space and time complexity due to the softmax attention within the self-attention module.
  • Performers, a Transformer variant, is proposed to approximate softmax attention kernels, which leverages a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+) to efficiently model kernelizable attention mechanisms beyond softmax.


  1. Preliminaries
  2. Performer
  3. Results

1. Preliminaries

1.1. Conventional Transformer

  • As mentioned above, conventional Transformer has a quadratic space and time complexity due to the softmax attention A:
  • Some research works are suggested to solve the above issue below.

1.2. Standard Sparsification Techniques

Standard sparsification techniques. Left: Example of a sparsity pattern, where tokens attend only to other nearby tokens. Right: In Graph Attention Networks, tokens attend only to their neighbors in the graph, which should have higher relevance than other nodes.
  • Left: Local attention instead of global attention, only attending to nearby tokens.
  • Right: Graph-based attention, attending to neighbors only.

2. Performer

LHS: The standard attention matrix, which contains all similarity scores for every pair of entries, formed by a softmax operation on the query and keys, denoted by q and k. RHS: The standard attention matrix can be approximated via lower-rank randomized matrices Q′ and K′ with rows encoding potentially randomized nonlinear functions of the original queries/keys. For the regular softmax-attention, the transformation is very compact and involves an exponential function as well as random Gaussian projections.
  • Conventional Transformer self-attention module has Q, K, V. Within it, Q and K generates A then interacts with V.
  • Here, the matrix A is approximated by lower-rank randomized matrices Q′ and Ka novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+).
  • FAVOR+ works for attention blocks using matrices A of the form:
  • with qi/kj standing for the ith/jth query/key row-vector in Q/K and kernel K defined for the (usually randomized) mapping Φ:
  • For Q’, K with rows given as Φ(qi) and Φ(ki) respectively.

Here ^Att↔ stands for the approximate attention and brackets in the figure below indicate the order of computations:

  • By taking Φ of the following form for functions f1, …, fl, function h and deterministic vectors ωi or ω1, …, ωm, iid ~ D for some distribution DP(R^d) (such as Gaussian):
  • Efficient attention mechanism is formed:
Left: Standard unidirectional attention requires masking the attention matrix to obtain its lower-triangular part. Right: Unbiased approximation on the LHS can be obtained via a prefix-sum mechanism, where the prefix-sum of the outer-products of random feature maps for keys and value vectors is built on the fly and left-multiplied by query random feature vector to obtain the new row in the resulting matrix. (Figure from Google AI)
Approximation of the regular attention mechanism AV (before D^(-1)-renormalization) via (random) feature maps. Dashed-blocks indicate order of computation with corresponding time complexities attached.

With the concept of low-rank approximation/matrix factorization/matrix decomposition, then the space and time complexity becomes much more linear.

  • (Please feel free to read the paper for mathematical proofs.)

3. Results

3.1. NLP Datasets

Comparison of Transformer and Performer in terms of forward and backward pass speed and maximum L allowed
  • “X” (OPT) denotes the maximum possible speedup achievable, when attention simply returns the V-matrix.

The Performer reaches nearly linear time and sub-quadratic memory consumption (since the explicit O(L2) attention matrix is not stored).

In fact, by comparing “X”, the Performer achieves nearly optimal speedup and memory efficiency possible.

3.2. Protein Sequence Dataset

Train = Dashed, Validation = Solid, on Protein Dataset
  • A 36-layer model is trained using protein sequences from the Jan. 2019 release of TrEMBL.
  • Reformer and Linformer significantly drop in accuracy on the protein dataset.

Performer-ReLU (taking f=ReLU) achieves the highest accuracy in both (U) and (B) cases. (U: Unidirectional, B: Bidirectional)

Train = Dashed, Validation = Solid, on TrEMBL-Concat
  • A protein benchmark is tried for predicting interactions among groups of proteins by concatenating protein sequences to length L=8192 from TrEMBL.
  • A regular Transformer overloads memory even at a batch size of 1 per chip.

The smaller Transformer (nlayer = 3) is quickly bounded at  19%, while the Performer is able to train continuously to 24%.

3.3. ImageNet64 (Image Generation)

Train = Dashed, Validation = Solid, on ImageNet64

Performer/6-layers matches the Reformer/12-layers, while the Performer/12-layers matches the Reformer/24-layers.

  • Depending on hardware (TPU or GPU), it is also found that the Performer can be 2× faster than the Reformer via Jax optimizations for the (U) setting.
  • Performer enables the Transformer to be applied to much longer sequences without constraints on the structure of the attention matrix to advance the biology and medicine (e.g.: very long protein sequence).


[2021 ICLR] [Performer]
Rethinking Attention with Performers

[Google AI Blog]
Rethinking Attention with Performers

2.1. Language Model / Sequence Model

(Some are not related to NLP, but I just group them here)

19912020 [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT] [BART] [Longformer] [ELECTRA] [Megatron-LM] [SpanBERT] 2021 [Performer]

My Other Previous Paper Readings



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.