Review — Long-Short Transformer: Efficient Transformers for Language and Vision
Transformer-LS, Long-Range & Short-Range Attentions for Language Modelling & Image Classification
Long-Short Transformer: Efficient Transformers for Language and Vision,
Transformer-LS, by University of Maryland, NVIDIA, Arizona State University, California Institute of Technology,
2021 NeurIPS, Over 40 Citations (Sik-Ho Tsang @ Medium)
NLP, LM, Image Classification, Transformer, Vision Transformer, ViT
- Long-Short Transformer (Transformer-LS) is proposed: an efficient self-attention mechanism that models long sequences with linear complexity for both language and vision tasks.
- It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
- A dual normalization strategy is used to account for the scale mismatch between the two attention mechanisms.
- Transformer Preliminaries
- Long-Short Transformer (Transformer-LS)
- Language Task Results
- Vision Task Results
1. Transformer Preliminaries
- Multi-head attention is a core component of the Transformer:
- where Q, K, and V are the query, key, and value embeddings, respectively.
- For each head, self-attention is computed via scaled dot-product: the softmax of the scaled query-key product is multiplied with the values.
- (Please feel free to read Transformer if you are interested.)
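As a refresher, the standard scaled dot-product attention described above can be sketched in a few lines of NumPy (a minimal single-head version; the shapes and seed here are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 8, 16  # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)  # shape (n, d)
```

Note that the `scores` matrix is n×n, which is exactly the quadratic cost Transformer-LS sets out to avoid.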
2. Long-Short Transformer (Transformer-LS)
2.1. Short-term Attention via Segment-wise Sliding Window (Left)
- The input sequence is divided into disjoint segments of length w for efficiency. All tokens within a segment attend to all tokens within their home segment, as well as w/2 consecutive tokens on the left and right of the home segment (zero-padding when necessary), resulting in an attention span over a total of 2w key-value pairs.
- This segment-wise sliding window attention is faster than the per-token sliding window attention where each token attends to itself and w tokens to its left and right.
This simple yet effective sliding window attention captures fine-grained local correlations.
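The segment-wise window can be expressed as a boolean attention mask. Below is a minimal NumPy sketch of such a mask (my own illustration, not code from the paper): each query in segment i//w may attend to its home segment plus w/2 positions on each side, i.e. at most 2w keys.

```python
import numpy as np

def segment_sliding_mask(n, w):
    # mask[i, j] is True if query i may attend to key j.
    idx = np.arange(n)
    seg = idx // w                 # home segment index of each position
    lo = seg * w - w // 2          # left edge of the attention span
    hi = (seg + 1) * w + w // 2    # right edge (exclusive)
    j = idx[None, :]
    return (j >= lo[:, None]) & (j < hi[:, None])

mask = segment_sliding_mask(n=12, w=4)
# Each row covers at most 2w = 8 keys (fewer at the sequence boundaries,
# where zero-padding would be used in practice).
```

Because the mask is block-structured per segment, the computation can be batched segment by segment, which is why it is faster than a per-token sliding window.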
2.2. Long-range Attention via Dynamic Projections (Middle)
- The dynamic low-rank projection is used.
- The full (n×n) attention matrix can be decomposed into the product of two matrices with r columns or rows.
- The dynamic projection matrix Pi and the key-value embeddings ‾Ki, ‾Vi are defined:
- where W^P_i is a learnable parameter matrix, and the softmax normalizes the projection weights over all n tokens along the first dimension, which stabilizes training.
- The long-range attention for each head becomes:
- The computational complexity is O(rn).
- The input sequence is first divided into equal-length segments with length l, and dynamic projection is applied to extract ‾Ki, ‾Vi from each segment.
- For autoregressive models, future tokens must not be attended to. The long-range attention of Q_t, attending only to the K_{i,t}, V_{i,t} computed from non-future segments, becomes:
- The dynamic low-rank projection is applied to each segment only once in parallel, preserving the linear complexity and the high training speed.
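The dynamic projection above can be sketched as follows (a minimal NumPy illustration of the bidirectional case; the random weights stand in for the learned W^P, and the shapes are made up for the example):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, r = 64, 32, 8            # sequence length, head dim, projection rank
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
W_P = rng.standard_normal((d, r)) / np.sqrt(d)  # learnable in practice

# Dynamic projection: the weights depend on the content of K itself,
# normalized over the n tokens (first dimension) for training stability.
P = softmax(K @ W_P, axis=0)   # (n, r)
K_bar = P.T @ K                # (r, d) compressed keys
V_bar = P.T @ V                # (r, d) compressed values

# Long-range attention now costs O(rn) instead of O(n^2):
Q = rng.standard_normal((n, d))
weights = softmax(Q @ K_bar.T / np.sqrt(d), axis=-1)  # (n, r)
out = weights @ V_bar                                 # (n, d)
```

Since r is a small constant, the n×r score matrix replaces the n×n one, giving the linear complexity stated above.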
2.3. Aggregating Long-range and Short-term Attentions (Right)
- The global low-rank projected keys and values are ‾Ki, ‾Vi, and the local keys and values are ~Kt, ~Vt within the local window of position t for the query Qt.
- where [.;.] denotes concatenating the matrices along the first dimension.
- However, there is a scale mismatch between the initial norms of the local window keys ~K_t W^K_i and the global low-rank keys ‾K_i.
2.4. DualLN: Normalization Strategy
- Therefore, a normalization strategy (DualLN) is proposed to align the norms and improve the effectiveness of the aggregation.
- Two sets of Layer Normalization (LN) are added after the key and value projections for the local window and global low-rank attentions, so that their scales are aligned at initialization, while the network can still learn to re-weight the norms during training:
- The Transformer-LS models trained with DualLN have consistently lower validation loss than the models without DualLN.
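A minimal NumPy sketch of the DualLN idea (my own illustration: LayerNorm here omits the learned gain and bias, and the norm ratio between the two key sets is exaggerated to make the scale mismatch visible):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-row LayerNorm without learned affine parameters, for illustration.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
r, w, d = 8, 16, 32
K_bar = 0.1 * rng.standard_normal((r, d))  # global low-rank keys (small norm)
K_loc = 5.0 * rng.standard_normal((w, d))  # local window keys (large norm)

# DualLN: a separate LayerNorm per branch aligns the two scales
# before the keys are concatenated along the first dimension.
K_all = np.concatenate([layer_norm(K_bar), layer_norm(K_loc)], axis=0)
```

After the two LNs, every row of `K_all` has (near-)unit variance, so the softmax in the aggregated attention is not dominated by whichever branch happened to start with larger norms.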
3. Language Task Results
3.1. Bidirectional Modeling on Long Range Arena and IMDb
- Transformer-LS (best) uses the best configuration of w=8 and r=32 for each task.
- The models are trained on the original, clean training sets and only their test sets are perturbed.
Dynamic projection is more robust against location changes.
3.2. Autoregressive Language Modeling
- The proposed smaller 12-layer and larger 30-layer models are Pre-LN Transformers with the same width and depth as Longformer.
- The proposed method has achieved state-of-the-art results. On text8, Transformer-LS achieves a test BPC of 1.09 with the smaller model. On enwik8, the proposed smaller model achieves a test BPC of 0.99, and outperforms the state-of-the-art models with comparable number of parameters.
- The proposed larger model obtains a test BPC of 0.97, on par with the Compressive Transformer with 2× the parameters.
- The results are consistently better than Longformer, which is trained on longer sequences with 5 stages and 48 GB of GPU memory.
- The Transformer-LS model is much more memory- and compute-efficient than full attention.
4. Vision Task Results
- CvT and ViL, state-of-the-art Vision Transformer architectures, are used as the backbones, and their attention mechanisms are replaced with the long-short term attention, denoted as CvT-LS and ViL-LS.
- With long-short term attention, the training can be easily scaled to higher resolution, and the performance of CvT-LS and ViL-LS also improves.
[2021 NeurIPS] [Transformer-LS]
Long-Short Transformer: Efficient Transformers for Language and Vision