Review — Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformer-XL: A Transformer That Uses Memory

Transformer-XL (Image from Language Modelling for Source Code with Transformer-XL)
  • Transformer-XL enables learning dependencies beyond a fixed length without disrupting temporal coherence.
  • A novel relative positional encoding scheme is also proposed.


  1. Vanilla Character Transformer Model T64
  2. Proposed Transformer-XL
  3. Experimental Results

1. Vanilla Character Transformer Model T64

Illustration of the vanilla model with a segment length 4
  • A simple solution would be to process the entire context sequence, but this is usually infeasible with the limited resources available in practice.
  • In T64, a crude approximation is proposed to split the entire corpus into shorter segments of manageable sizes, and only train the model within each segment, ignoring all contextual information from previous segments.
  • The above figure shows T64, a vanilla model, with a segment length 4.
  • (Please feel free to read T64 if interested.)
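The vanilla scheme above can be sketched in a few lines: the corpus is cut into disjoint fixed-length segments and each segment is modelled in isolation, so no information flows across segment boundaries. This is an illustrative sketch, not the paper's code; the function name and tail-dropping behaviour are assumptions.

```python
# Sketch of the vanilla fixed-length training scheme: split the corpus
# into non-overlapping segments and model each one independently.

def split_into_segments(token_ids, segment_length):
    """Split a token sequence into non-overlapping fixed-length segments,
    dropping the ragged tail (as many implementations do)."""
    n_full = len(token_ids) // segment_length
    return [token_ids[i * segment_length:(i + 1) * segment_length]
            for i in range(n_full)]

corpus = list(range(10))          # stand-in for a tokenised corpus
segments = split_into_segments(corpus, segment_length=4)
print(segments)                   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because each segment starts from scratch, the first tokens of every segment have almost no usable context — the context-fragmentation problem that Transformer-XL addresses.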

2. Proposed Transformer-XL

Illustration of the Transformer-XL model with a segment length 4

2.1. Segment-Level Recurrence with State Reuse

  • As shown above, during training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context (green lines) when the model processes the next new segment.
  • Formally, let the two consecutive segments of length L be s_τ = [x_{τ,1}, …, x_{τ,L}] and s_{τ+1} = [x_{τ+1,1}, …, x_{τ+1,L}].
  • Denoting the n-th layer hidden state sequence produced for the τ-th segment s_τ by h^n_τ, the n-th layer hidden state for segment s_{τ+1} is produced (schematically) as follows:

    h̃^{n−1}_{τ+1} = [SG(h^{n−1}_τ) ∘ h^{n−1}_{τ+1}]
    q^n_{τ+1} = h^{n−1}_{τ+1} W_q^T,  k^n_{τ+1} = h̃^{n−1}_{τ+1} W_k^T,  v^n_{τ+1} = h̃^{n−1}_{τ+1} W_v^T
    h^n_{τ+1} = Transformer-Layer(q^n_{τ+1}, k^n_{τ+1}, v^n_{τ+1})

  • where the function SG(·) stands for stop-gradient, and [· ∘ ·] is the concatenation of the two hidden sequences at the (n−1)-th layer along the length dimension.
  • q, k, v are the query, key, and value respectively; note that the query is computed from the current segment only, while the keys and values use the extended context.
  • Transformer-Layer is one standard Transformer layer producing the n-th hidden layer.
  • Thus, a predefined length-M cache of old hidden states, possibly spanning multiple segments, can be kept; it is referred to as the memory m^n_τ.
  • In the experiments, M is set to the segment length during training and increased by multiple times during evaluation.
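A minimal NumPy sketch of the state-reuse mechanism is given below. The cached states play the role of SG(h^{n−1}_τ): they are reused as extra context for keys and values, but no gradient would flow into them. The function name and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of segment-level recurrence with state reuse: concatenate the
# cached states with the current segment's hidden states to form the
# extended context, then keep only the newest M positions as memory.

def extend_and_update_memory(memory, h_prev_layer, mem_len):
    """Build the extended context [memory ; current states] that keys and
    values attend over, and return the updated length-M memory."""
    extended = np.concatenate([memory, h_prev_layer], axis=0)  # [M+L, d]
    new_memory = extended[-mem_len:]                           # keep last M
    return extended, new_memory

d_model, seg_len, mem_len = 8, 4, 4
memory = np.zeros((mem_len, d_model))            # empty cache at the start
segment_states = np.random.randn(seg_len, d_model)

extended, memory = extend_and_update_memory(memory, segment_states, mem_len)
# keys/values attend over M + L = 8 positions; queries cover only L = 4
print(extended.shape, memory.shape)  # (8, 8) (4, 8)
```

Because the memory itself can contain states that already attended over an earlier memory, the effective context grows linearly with depth times segment length, well beyond a single segment.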

2.2. Relative Positional Encodings

  • Positional encoding is used in the standard Transformer. But when hidden states are reused across segments, a problem arises with absolute positional encoding. Schematically, h_{τ+1} = f(h_τ, E_{s_{τ+1}} + U_{1:L}) and h_τ = f(h_{τ−1}, E_{s_τ} + U_{1:L}),
  • where E_{s_τ} is the word embedding sequence of s_τ, U_{1:L} is the absolute positional encoding, and f represents a transformation function.
  • Notice that both E_{s_τ} and E_{s_{τ+1}} are associated with the same positional encoding U_{1:L}. As a result, the model has no information to distinguish the positional difference between x_{τ,j} and x_{τ+1,j} for any j.
  • In the standard Transformer, the attention score between query q_i and key vector k_j within the same segment can be decomposed as:

    A^abs_{i,j} = E_{x_i}^T W_q^T W_k E_{x_j}  (a)
                + E_{x_i}^T W_q^T W_k U_j      (b)
                + U_i^T W_q^T W_k E_{x_j}      (c)
                + U_i^T W_q^T W_k U_j          (d)

  • Following the idea of relying only on relative positional information, the above equation is modified and re-parameterized into the four terms:

    A^rel_{i,j} = E_{x_i}^T W_q^T W_{k,E} E_{x_j}  (a)
                + E_{x_i}^T W_q^T W_{k,R} R_{i−j}  (b)
                + u^T W_{k,E} E_{x_j}              (c)
                + v^T W_{k,R} R_{i−j}              (d)
  1. The first change is to replace all appearances of the absolute positional embedding Uj for computing key vectors in terms (b) and (d) with its relative counterpart R_{i−j}.
  2. Secondly, a trainable parameter u is introduced to replace the query term in term (c); similarly, a trainable parameter v is introduced in term (d). Since the query vector is then the same for all query positions, the attentive bias towards different words remains the same regardless of the query position.
  3. Finally, the two weight matrices Wk,E and Wk,R are deliberately separated for producing the content-based key vectors and location-based key vectors respectively.
  • The relative positional embedding R adopts the sinusoid formulation, with no learnable parameters.
  • The computational procedure for a N-layer Transformer-XL with a single attention head is:
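The four-term relative attention score can be demonstrated with a toy NumPy sketch. All shapes, names, and the single-head scalar-score framing here are illustrative assumptions, not the paper's released code.

```python
import numpy as np

# Toy sketch of the relative attention score:
#   A_rel[i, j] = (a) content-content + (b) content-position
#               + (c) global content bias u + (d) global position bias v

def sinusoid(pos, d):
    """Sinusoidal embedding for a (possibly negative) relative offset pos."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    emb = np.zeros(d)
    emb[0::2] = np.sin(pos * inv_freq)
    emb[1::2] = np.cos(pos * inv_freq)
    return emb

def rel_attention_score(e_i, e_j, r_ij, Wq, WkE, WkR, u, v):
    q = Wq @ e_i
    return (q @ (WkE @ e_j)        # (a) content-based addressing
            + q @ (WkR @ r_ij)     # (b) content-dependent positional bias
            + u @ (WkE @ e_j)      # (c) global content bias
            + v @ (WkR @ r_ij))    # (d) global positional bias

d = 8
rng = np.random.default_rng(0)
Wq, WkE, WkR = (rng.standard_normal((d, d)) for _ in range(3))
u, v = rng.standard_normal(d), rng.standard_normal(d)
e_i, e_j = rng.standard_normal(d), rng.standard_normal(d)

# a single scalar attention logit for relative offset i - j = 3
score = rel_attention_score(e_i, e_j, sinusoid(3, d), Wq, WkE, WkR, u, v)
print(float(score))
```

Note how the score depends on the offset i − j only through R_{i−j}, so the same parameters apply regardless of where the query sits — which is what allows cached states from previous segments to be attended to coherently.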

3. Experimental Results

3.1. SOTA Comparison on WikiText-103

Comparison with state-of-the-art results on WikiText-103
  • Attention length is set to 384 during training and 1600 during evaluation.

3.2. SOTA Comparison on enwik8

Comparison with state-of-the-art results on enwik8
  • By increasing the model size, 18-layer and 24-layer Transformer-XLs are trained with attention length set to 784 during training and 3,800 during evaluation.
  • Different from T64, Transformer-XL does not need any auxiliary losses.

3.3. SOTA Comparison on text8

Comparison with state-of-the-art results on text8
  • The best model and the same hyper-parameters from enwik8 are simply adapted to text8 without further tuning.

3.4. SOTA Comparison on One Billion Word

Comparison with state-of-the-art results on One Billion Word
  • Transformer-XL dramatically improves the single-model SoTA from 23.7 to 21.8.

3.5. SOTA Comparison on Penn Treebank

Comparison with state-of-the-art results on Penn Treebank
  • Variational dropout and weight averaging are applied to Transformer-XL, similar to AWD-LSTM.
  • (There are ablation studies in the paper. Please feel free to read the paper.)

