Review — Long-Short Transformer: Efficient Transformers for Language and Vision

Transformer-LS, Long-Range & Short-Range Attentions for Language Modelling & Image Classification

Sik-Ho Tsang
6 min read · Feb 6, 2023

Long-Short Transformer: Efficient Transformers for Language and Vision,
Transformer-LS, by University of Maryland, NVIDIA, Arizona State University, California Institute of Technology,
2021 NeurIPS, Over 40 Citations (Sik-Ho Tsang @ Medium)
NLP, LM, Image Classification, Transformer, Vision Transformer, ViT

  • Long-Short Transformer (Transformer-LS) is proposed: an efficient self-attention mechanism for modeling long sequences with linear complexity, for both language and vision tasks.
  • It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
  • A dual normalization strategy is used to account for the scale mismatch between the two attention mechanisms.


  1. Transformer Preliminaries
  2. Long-Short Transformer (Transformer-LS)
  3. Language Task Results
  4. Vision Task Results

1. Transformer Preliminaries

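  • Multi-head attention is a core component of the Transformer:
    $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(H_1, H_2, \dots, H_h)\, W^O$$
  • where Q, K, V are the query, key and value embeddings respectively, and h is the number of heads.
  • For each head, self-attention is performed through matrix multiplication and softmax:
    $$H_i = \mathrm{softmax}\left[\frac{Q W_i^Q \left(K W_i^K\right)^{\top}}{\sqrt{d_k}}\right] V W_i^V$$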
  • (Please feel free to read Transformer if you are interested.)
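
The multi-head attention described above can be sketched in NumPy. This is a minimal illustration, not the paper's code: the function names, the toy sizes, and the random weights are all assumptions made for the example, and d_k is taken to be d / h.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V, W_q, W_k, W_v):
    # Project the embeddings, then apply scaled dot-product attention.
    q, k, v = Q @ W_q, K @ W_k, V @ W_v          # each (n, d_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (n, n) full attention matrix
    return softmax(scores, axis=-1) @ v          # (n, d_k)

def multi_head_attention(Q, K, V, head_weights, W_o):
    # Concatenate the per-head outputs and mix them with W_o.
    heads = [attention_head(Q, K, V, *w) for w in head_weights]
    return np.concatenate(heads, axis=-1) @ W_o  # (n, d)

# Toy sizes (hypothetical): n tokens, model width d, h heads.
rng = np.random.default_rng(0)
n, d, h = 6, 8, 2
d_k = d // h
X = rng.normal(size=(n, d))
head_weights = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d))
out = multi_head_attention(X, X, X, head_weights, W_o)
print(out.shape)  # (6, 8)
```

The (n, n) score matrix inside `attention_head` is the quadratic cost that Transformer-LS is designed to avoid.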

2. Long-Short Transformer (Transformer-LS)

Long-short term attention of a single attention head. Left: short-term attention. Middle: long-range attention. Right: aggregation of the two.

2.1. Short-term Attention via Segment-wise Sliding Window (Left)

  • The input sequence is divided into disjoint segments of length w for efficiency. All tokens within a segment attend to all tokens within their home segment, as well as w/2 consecutive tokens on the left and right side of the home segment (zero-padding when necessary), resulting in an attention span over a total of 2w key-value pairs.
  • This segment-wise sliding window attention is faster than the per-token sliding window attention where each token attends to itself and w tokens to its left and right.

This simple yet effective sliding window attention captures fine-grained local correlations.
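
The attention pattern of the segment-wise sliding window can be sketched as a boolean mask. This is only an illustration under some assumptions: `segment_window_mask` and `local_attention` are hypothetical names, the setting is bidirectional, positions that fall outside the sequence are dropped rather than zero-padded, and the dense (n, n) mask is used for readability (an efficient implementation would gather only the 2w keys of each segment).

```python
import numpy as np

def segment_window_mask(n, w):
    """mask[t, s] is True if query t may attend to key s."""
    assert n % w == 0 and w % 2 == 0
    pos = np.arange(n)
    seg_start = (pos // w) * w        # start of each query's home segment
    lo = seg_start - w // 2           # w/2 extra keys on the left
    hi = seg_start + w + w // 2       # w/2 extra keys on the right (exclusive)
    keys = pos[None, :]
    return (keys >= lo[:, None]) & (keys < hi[:, None])

def local_attention(q, k, v, w):
    # Dense masked attention; only entries inside the window get weight.
    n, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    scores = np.where(segment_window_mask(n, w), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d_k, w = 16, 4, 4
q, k, v = (rng.normal(size=(n, d_k)) for _ in range(3))
out = local_attention(q, k, v, w)
mask = segment_window_mask(n, w)
# Interior queries see 2w keys; queries in the first segment see fewer here.
print(mask[4].sum(), mask[0].sum())  # 8 6
```

All queries in the same segment share one key-value window, which is what lets the computation be batched per segment instead of per token.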

2.2. Long-range Attention via Dynamic Projections (Middle)

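  • A dynamic low-rank projection is used: the projection matrix is computed from the input sequence itself rather than being fixed.
  • The full (n×n) attention matrix can be decomposed into the product of two matrices with r columns or rows, where r ≪ n.
  • The dynamic projection matrix $P_i$ and the low-rank key-value embeddings $\bar{K}_i$, $\bar{V}_i$ are defined as:
    $$P_i = \mathrm{softmax}\left(K W_i^P\right), \qquad \bar{K}_i = P_i^{\top} K W_i^K, \qquad \bar{V}_i = P_i^{\top} V W_i^V$$
  • where $W_i^P$ are learnable parameters, and the softmax normalizes the projection weights on the first dimension over all n tokens, which stabilizes training.
  • The long-range attention for each head becomes:
    $$\bar{H}_i = \mathrm{softmax}\left[\frac{Q W_i^Q \bar{K}_i^{\top}}{\sqrt{d_k}}\right] \bar{V}_i$$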
  • The computational complexity is O(rn).
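  • For autoregressive models, future tokens must not be used. The input sequence is therefore first divided into equal-length segments of length l, and dynamic projection is applied to each segment to extract $\bar{K}_i$, $\bar{V}_i$; each token attends only to the projections of segments that contain none of its future tokens.
  • Denoting the projected keys and values visible at position t as $\bar{K}_{i,t}$, $\bar{V}_{i,t}$, the long-range attention of $Q_t$ becomes:
    $$\bar{H}_{i,t} = \mathrm{softmax}\left[\frac{Q_t W_i^Q \bar{K}_{i,t}^{\top}}{\sqrt{d_k}}\right] \bar{V}_{i,t}$$
  • The dynamic low-rank projection is applied to each segment only once in parallel, preserving the linear complexity and the high training speed.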
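
The bidirectional long-range attention described above can be sketched as follows. This is a minimal sketch under stated assumptions: the projection weights are taken as P = softmax(K W_p) normalized over the n tokens, and the names `long_range_attention` and `W_p` as well as the toy sizes are hypothetical. The segment-wise autoregressive variant is not covered.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def long_range_attention(Q, K, V, W_q, W_k, W_v, W_p):
    # Projection weights depend on the input K, hence "dynamic".
    P = softmax(K @ W_p, axis=0)                 # (n, r), each column sums to 1 over tokens
    K_bar = P.T @ (K @ W_k)                      # (r, d_k) compressed keys
    V_bar = P.T @ (V @ W_v)                      # (r, d_k) compressed values
    q = Q @ W_q                                  # (n, d_k)
    scores = q @ K_bar.T / np.sqrt(q.shape[-1])  # (n, r) instead of (n, n)
    return softmax(scores, axis=-1) @ V_bar      # (n, d_k)

rng = np.random.default_rng(0)
n, d, d_k, r = 64, 16, 8, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
W_p = rng.normal(size=(d, r))
out = long_range_attention(X, X, X, W_q, W_k, W_v, W_p)
print(out.shape)  # (64, 8)
```

Every matrix product here has at most n·r score entries, which is where the O(rn) complexity comes from when r is fixed.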

2.3. Aggregating Long-range and Short-term Attentions (Right)

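  • The global low-rank projected keys and values are $\bar{K}_i$, $\bar{V}_i$, and the local keys and values are $\tilde{K}_t$, $\tilde{V}_t$ within the local window of position t for the query $Q_t$. The aggregated attention is:
    $$H_{i,t} = \mathrm{softmax}\left[\frac{Q_t W_i^Q \left[\tilde{K}_t W_i^K ;\, \bar{K}_i\right]^{\top}}{\sqrt{d_k}}\right] \left[\tilde{V}_t W_i^V ;\, \bar{V}_i\right]$$
  • where [.;.] denotes concatenating the matrices along the first dimension.
  • However, there is a scale mismatch between the initial norms of $\tilde{K}_t W_i^K$ and $\bar{K}_i$.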
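
The concatenation described above can be sketched for a single query position. This is an illustrative sketch only: `aggregated_attention` is a hypothetical name, the local and global keys and values are random stand-ins that are assumed to be already projected, and no normalization is applied yet.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregated_attention(q_t, K_local, V_local, K_bar, V_bar):
    # Stack local-window and global low-rank entries along the first dimension,
    # then let the query attend to both with a single softmax.
    keys = np.concatenate([K_local, K_bar], axis=0)      # (2w + r, d_k)
    values = np.concatenate([V_local, V_bar], axis=0)    # (2w + r, d_k)
    scores = keys @ q_t / np.sqrt(q_t.shape[-1])         # (2w + r,)
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(0)
w, r, d_k = 4, 3, 8
q_t = rng.normal(size=d_k)
K_local, V_local = rng.normal(size=(2 * w, d_k)), rng.normal(size=(2 * w, d_k))
K_bar, V_bar = rng.normal(size=(r, d_k)), rng.normal(size=(r, d_k))
h_t, weights = aggregated_attention(q_t, K_local, V_local, K_bar, V_bar)
print(h_t.shape, weights.shape)  # (8,) (11,)
```

Because one softmax is shared by the 2w local and r global entries, whichever group has larger key norms tends to dominate the attention weights, which is why the scale of the two groups matters.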

2.4. DualLN: Normalization Strategy

  • Therefore, a normalization strategy (DualLN) is proposed to align the norms and improve the effectiveness of the aggregation.
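  • Two sets of Layer Normalization (LN) are added after the key and value projections, one for the local window and one for the global low-rank attention, so that their scales are aligned at initialization, but the network can still learn to re-weight the norms after training:
    $$H_{i,t} = \mathrm{softmax}\left[\frac{Q_t W_i^Q \left[\mathrm{LN}_L(\tilde{K}_t W_i^K) ;\, \mathrm{LN}_G(\bar{K}_i)\right]^{\top}}{\sqrt{d_k}}\right] \left[\mathrm{LN}_L(\tilde{V}_t W_i^V) ;\, \mathrm{LN}_G(\bar{V}_i)\right]$$
  • where $\mathrm{LN}_L$ and $\mathrm{LN}_G$ denote the Layer Normalizations for the local and global attentions respectively.
Left: Ratios of the average ℓ2 norms of the local window to global low-rank key/value embeddings at initialization. Right: The validation loss of Transformer-LS with and without DualLN on enwik8 and text8.
  • The Transformer-LS models trained with DualLN have consistently lower validation loss than the models without DualLN.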
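
The effect of the two Layer Normalizations on the norms can be illustrated with a toy experiment. This is a sketch under simplifying assumptions, not the paper's setup: keys are drawn from a standard normal and treated as already projected, the projection weights use a small hypothetical initialization scale (0.1), and the LN gains and biases are at their initial values of 1 and 0. The helper names are made up for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row over the feature dimension, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def mean_l2(x):
    # Average l2 norm of the rows.
    return np.linalg.norm(x, axis=-1).mean()

rng = np.random.default_rng(0)
n, d_k, r, w = 256, 16, 4, 8
K = rng.normal(size=(n, d_k))
W_p = 0.1 * rng.normal(size=(d_k, r))   # assumed small initialization

# Global keys: softmax-weighted averages over all n tokens.
logits = K @ W_p
P = np.exp(logits - logits.max(axis=0, keepdims=True))
P = P / P.sum(axis=0, keepdims=True)
K_bar = P.T @ K                          # (r, d_k)
K_local = K[: 2 * w]                     # (2w, d_k) one local window

ratio_before = mean_l2(K_local) / mean_l2(K_bar)

# Separate LN parameters for the local (L) and global (G) branches.
gamma_l, beta_l = np.ones(d_k), np.zeros(d_k)
gamma_g, beta_g = np.ones(d_k), np.zeros(d_k)
ratio_after = mean_l2(layer_norm(K_local, gamma_l, beta_l)) / mean_l2(
    layer_norm(K_bar, gamma_g, beta_g)
)
print(ratio_before, ratio_after)
```

In this toy setting the global keys are averages of many tokens, so their norms start out much smaller than those of the local keys; after the two normalizations the ratio is close to 1, while the separate gains still allow the network to re-weight each branch during training.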

3. Language Task Results

3.1. Bidirectional Modeling on Long Range Arena and IMDb

Accuracy (%) and FLOPs (G) on Long Range Arena (LRA), with the model configs annotated.

Transformer-LS (best) with the best configurations of w=8, r=32 for each task.

Comparing the robustness of the models under test-time insertions and deletions.
  • The models are trained on the original, clean training sets and only their test sets are perturbed.

Dynamic projection is more robust against location changes.

Comparing the results of pretrained language models fine-tuned on IMDb.

The base model outperforms Longformer-base, and the large model achieves improvements over RoBERTa-large.

3.2. Autoregressive Language Modeling

BPC (↓) of smaller models on enwik8 and text8 (left), and larger models on enwik8 (right).
  • The proposed smaller 12-layer and larger 30-layer models are Pre-LN Transformers with the same width and depth as Longformer.
  • The proposed method achieves state-of-the-art results. On text8, Transformer-LS achieves a test BPC of 1.09 with the smaller model. On enwik8, the proposed smaller model achieves a test BPC of 0.99, and outperforms the state-of-the-art models with a comparable number of parameters.
  • The proposed larger model obtains a test BPC of 0.97, on par with the Compressive Transformer [10] with 2× parameters.

The results are consistently better than Longformer, which is trained on longer sequences with 5 stages and 48 GB of GPU memory.

Running time and memory consumption of Transformer-XL (full attention) and Transformer-LS on Char-LM.

The Transformer-LS model is much more memory- and compute-efficient than full attention.

4. Vision Task Results

4.1. ImageNet

Test accuracies on ImageNet, ImageNet-ReaL, and ImageNet-V2 of models trained on ImageNet-1K.
  • CvT and ViL, state-of-the-art Vision Transformer architectures, are used as the backbones and their attention mechanisms are replaced with the long-short term attention, denoted as CvT-LS and ViL-size-LS.
  • With long-short term attention, the training can be easily scaled to higher resolution, and the performance of CvT-LS and ViL-LS also improves.

CvT-LS-17 achieves a better result than CvT-21 at resolution 224 using fewer parameters and FLOPs, and the CvT-LS-21S model further improves on the CvT-LS-21 model.

The best model with CvT (CvT-LS-21 at 448²) achieves 0.3% higher accuracy than the best reported result of CvT while using the same amount of parameters and 76% of its FLOPs.

4.2. Robustness

Robustness evaluation on various ImageNet datasets.

CvT using Transformer-LS significantly outperforms the CNN-based method (ResNet-50). Compared to DeiT, the proposed models also achieve favorable improvements.


[2021 NeurIPS] [Transformer-LS]
Long-Short Transformer: Efficient Transformers for Language and Vision
