Review — Attention Is All You Need (Transformer)

Using Transformer, Attention is Drawn, Long-Range Dependencies are Considered, Outperforms ByteNet, Deep-Att, GNMT, and ConvS2S

Attention Is All You Need (Figure from


  1. Transformer: Model Architecture
  2. Multi-Head Attention
  3. Applications of Attention in Transformer
  4. Position-wise Feed-Forward Networks
  5. Other Details
  6. Experimental Results

1. Transformer: Model Architecture

Transformer: Model Architecture

1.1. Framework

  • Left: The encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z= (z1, …, zn).
  • Right: Given z, the decoder then generates an output sequence (y1, …, ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

1.2. Encoder

  • That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)).
  • All sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel=512.

1.3. Decoder

  • The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent (future) positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

2. Multi-Head Attention

Multi-Head Attention

2.1. Scaled Dot-Product Attention

Scaled Dot-Product Attention (1-Head, Mask layer is optional, it is only used at decoder)

2.1.1. Procedures

2.1.2. Reasons of Using Dot Product Attention over Additive Attention

  • The two most commonly used attention functions are additive attention, and dot-product (multiplicative) attention.
  • Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.
  • For large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√(dk).

2.2. Multi-Head Attention

Multi-Head Attention
  • In this model, h=8 parallel attention layers, or heads.
  • For each of these, dk=dv=dmodel/h=64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

3. Applications of Attention in Transformer

Left: Encoder-Decoder Attention, Middle: Self-Attention at Encoder, Right: Masked Self-Attention at Decoder
  1. The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
  2. Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. To prevent leftward information flow in the decoder to preserve the auto-regressive property, the scaled dot-product attention is modified by masking out (setting to -∞) all values in the input of the softmax which correspond to illegal connections.

4. Position-wise Feed-Forward Networks

Position-wise Feed-Forward Networks
  • The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality dff = 2048, which is a bottleneck structure.

5. Other Details

5.1. Embeddings and Softmax

  • In Transformer, the same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation.
  • In the embedding layers, those weights are multiplied by √(dmodel).

5.2. Positional Encoding

Positional Encoding at Encoder (Left) and Decoder (Right)
  • “Positional encodings” are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed.

5.3. Why Attention

Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types
  1. The amount of computation that can be parallelized.
  2. The path length between long-range dependencies in the network.
  • In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d.
  • A single convolutional layer with kernel width k<n does not connect all pairs of input and output positions. Doing so requires a stack of O(n=k) convolutional layers in the case of contiguous kernels, or O(logk(n)) in the case of dilated convolutions.

6. Experimental Results

6.1. Datasets

  • WMT 2014 English-German dataset consists of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens.
  • WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

6.2. SOTA Comparison

English-to-German and English-to-French newstest2014 tests

6.3. Model Variations

Variations on the Transformer architecture on the English-to-German translation development set, newstest2013 (Unlisted values are identical to those of the base model)
  • (B): Reducing the attention key size dk hurts model quality.
  • (C) and (D): Bigger models are better, and dropout is very helpful in avoiding over-fitting.
  • (E): The sinusoidal positional encoding is replaced with learned positional embeddings in [9], and nearly identical results to the base model is observed.

6.4. English Constituency Parsing

English Constituency Parsing on Wall Street Journal (WSJ)
  • A 4-layer transformer with dmodel=1024 is trained on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences.

6.5. Attention Visualization

An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6
Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution
  • Bottom: Isolated attentions from just the word ‘its’ for attention heads 5 and 6.


[2017 NeurIPS] [Transformer]
Attention Is All You Need

Natural Language Processing (NLP)

Language Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att] 2017 [ConvS2S] [Transformer]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]

My Other Previous Paper Readings



PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store