Brief Review — Rethinking Attention with Performers

Performers, Using FAVOR+, Approximate Full Softmax

  • Conventional Transformer has a quadratic space and time complexity due to the softmax attention within the self-attention module.
  • Performers, a Transformer variant, is proposed to approximate softmax attention kernels, which leverages a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+) to efficiently model kernelizable attention mechanisms beyond softmax.

Outline

  1. Preliminaries
  2. Performer
  3. Results

1. Preliminaries

1.1. Conventional Transformer

  • As mentioned above, conventional Transformer has a quadratic space and time complexity due to the softmax attention A:
  • Some research works are suggested to solve the above issue below.

1.2. Standard Sparsification Techniques

Standard sparsification techniques. Left: Example of a sparsity pattern, where tokens attend only to other nearby tokens. Right: In Graph Attention Networks, tokens attend only to their neighbors in the graph, which should have higher relevance than other nodes.
  • Left: Local attention instead of global attention, only attending to nearby tokens.
  • Right: Graph-based attention, attending to neighbors only.

2. Performer

LHS: The standard attention matrix, which contains all similarity scores for every pair of entries, formed by a softmax operation on the query and keys, denoted by q and k. RHS: The standard attention matrix can be approximated via lower-rank randomized matrices Q′ and K′ with rows encoding potentially randomized nonlinear functions of the original queries/keys. For the regular softmax-attention, the transformation is very compact and involves an exponential function as well as random Gaussian projections.
  • Conventional Transformer self-attention module has Q, K, V. Within it, Q and K generates A then interacts with V.
  • Here, the matrix A is approximated by lower-rank randomized matrices Q′ and Ka novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+).
  • FAVOR+ works for attention blocks using matrices A of the form:
  • with qi/kj standing for the ith/jth query/key row-vector in Q/K and kernel K defined for the (usually randomized) mapping Φ:
  • For Q’, K with rows given as Φ(qi) and Φ(ki) respectively.
  • By taking Φ of the following form for functions f1, …, fl, function h and deterministic vectors ωi or ω1, …, ωm, iid ~ D for some distribution DP(R^d) (such as Gaussian):
  • Efficient attention mechanism is formed:
Left: Standard unidirectional attention requires masking the attention matrix to obtain its lower-triangular part. Right: Unbiased approximation on the LHS can be obtained via a prefix-sum mechanism, where the prefix-sum of the outer-products of random feature maps for keys and value vectors is built on the fly and left-multiplied by query random feature vector to obtain the new row in the resulting matrix. (Figure from Google AI)
Approximation of the regular attention mechanism AV (before D^(-1)-renormalization) via (random) feature maps. Dashed-blocks indicate order of computation with corresponding time complexities attached.
  • (Please feel free to read the paper for mathematical proofs.)

3. Results

3.1. NLP Datasets

Comparison of Transformer and Performer in terms of forward and backward pass speed and maximum L allowed
  • “X” (OPT) denotes the maximum possible speedup achievable, when attention simply returns the V-matrix.

3.2. Protein Sequence Dataset

Train = Dashed, Validation = Solid, on Protein Dataset
  • A 36-layer model is trained using protein sequences from the Jan. 2019 release of TrEMBL.
  • Reformer and Linformer significantly drop in accuracy on the protein dataset.
Train = Dashed, Validation = Solid, on TrEMBL-Concat
  • A protein benchmark is tried for predicting interactions among groups of proteins by concatenating protein sequences to length L=8192 from TrEMBL.
  • A regular Transformer overloads memory even at a batch size of 1 per chip.

3.3. ImageNet64 (Image Generation)

Train = Dashed, Validation = Solid, on ImageNet64
  • Depending on hardware (TPU or GPU), it is also found that the Performer can be 2× faster than the Reformer via Jax optimizations for the (U) setting.
  • Performer enables the Transformer to be applied to much longer sequences without constraints on the structure of the attention matrix to advance the biology and medicine (e.g.: very long protein sequence).

References

[2021 ICLR] [Performer]
Rethinking Attention with Performers

2.1. Language Model / Sequence Model

(Some are not related to NLP, but I just group them here)

My Other Previous Paper Readings

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store