Brief Review — Rethinking Attention with Performers
Rethinking Attention with Performers,
Performers, by Google, University of Cambridge, DeepMind, and Alan Turing Institute, 2021 ICLR, Over 500 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Transformer
- Conventional Transformer has a quadratic space and time complexity due to the softmax attention within the self-attention module.
- Performers, a Transformer variant, is proposed to approximate softmax attention kernels, which leverages a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+) to efficiently model kernelizable attention mechanisms beyond softmax.
1.1. Conventional Transformer
- As mentioned above, conventional Transformer has a quadratic space and time complexity due to the softmax attention A:
- Some research works are suggested to solve the above issue below.
1.2. Standard Sparsification Techniques
- Left: Local attention instead of global attention, only attending to nearby tokens.
- Right: Graph-based attention, attending to neighbors only.
- Conventional Transformer self-attention module has Q, K, V. Within it, Q and K generates A then interacts with V.
- Here, the matrix A is approximated by lower-rank randomized matrices Q′ and K′ a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+).
- FAVOR+ works for attention blocks using matrices A of the form:
- with qi/kj standing for the ith/jth query/key row-vector in Q/K and kernel K defined for the (usually randomized) mapping Φ:
- For Q’, K’ with rows given as Φ(qi) and Φ(ki) respectively.
Here ^Att↔ stands for the approximate attention and brackets in the figure below indicate the order of computations:
- By taking Φ of the following form for functions f1, …, fl, function h and deterministic vectors ωi or ω1, …, ωm, iid ~ D for some distribution D∈P(R^d) (such as Gaussian):
- Efficient attention mechanism is formed:
With the concept of low-rank approximation/matrix factorization/matrix decomposition, then the space and time complexity becomes much more linear.
- (Please feel free to read the paper for mathematical proofs.)
3.1. NLP Datasets
- “X” (OPT) denotes the maximum possible speedup achievable, when attention simply returns the V-matrix.
The Performer reaches nearly linear time and sub-quadratic memory consumption (since the explicit O(L2) attention matrix is not stored).
In fact, by comparing “X”, the Performer achieves nearly optimal speedup and memory efficiency possible.
3.2. Protein Sequence Dataset
- A 36-layer model is trained using protein sequences from the Jan. 2019 release of TrEMBL.
- Reformer and Linformer significantly drop in accuracy on the protein dataset.
- A protein benchmark is tried for predicting interactions among groups of proteins by concatenating protein sequences to length L=8192 from TrEMBL.
- A regular Transformer overloads memory even at a batch size of 1 per chip.
The smaller Transformer (nlayer = 3) is quickly bounded at 19%, while the Performer is able to train continuously to 24%.
3.3. ImageNet64 (Image Generation)
Performer/6-layers matches the Reformer/12-layers, while the Performer/12-layers matches the Reformer/24-layers.
- Depending on hardware (TPU or GPU), it is also found that the Performer can be 2× faster than the Reformer via Jax optimizations for the (U) setting.
- Performer enables the Transformer to be applied to much longer sequences without constraints on the structure of the attention matrix to advance the biology and medicine (e.g.: very long protein sequence).
[2021 ICLR] [Performer]
Rethinking Attention with Performers
[Google AI Blog]
Rethinking Attention with Performers
2.1. Language Model / Sequence Model
(Some are not related to NLP, but I just group them here)