Review — Attention Is All You Need (Transformer)

Using the Transformer, attention captures long-range dependencies, and the model outperforms ByteNet, Deep-Att, GNMT, and ConvS2S

Attention Is All You Need


1. Transformer: Model Architecture

Transformer: Model Architecture

1.1. Framework

1.2. Encoder

1.3. Decoder

2. Multi-Head Attention

Multi-Head Attention

2.1. Scaled Dot-Product Attention

Scaled Dot-Product Attention (single head; the mask layer is optional and is used only in the decoder)

2.1.1. Procedures

2.1.2. Reasons for Using Dot-Product Attention over Additive Attention
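The procedure in 2.1.1 can be sketched in NumPy; the function name and array shapes below are illustrative, not from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Masked positions (mask == False) get a large negative score,
        # so softmax assigns them ~0 weight (used in the decoder)
        scores = np.where(mask, scores, -1e9)
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with d_k and pushing the softmax into regions of vanishing gradient, which is the paper's motivation for scaled dot-product over plain dot-product attention.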

2.2. Multi-Head Attention

Multi-Head Attention
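A minimal NumPy sketch of the multi-head mechanism, assuming the caller supplies one projection matrix per head (the names Wq, Wk, Wv, Wo are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # Wq/Wk/Wv: one (d_model, d_k) projection per head; Wo: (h * d_v, d_model)
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot-product, per head
        heads.append(A @ V)
    # Concatenate all heads, then project back to d_model
    return np.concatenate(heads, axis=-1) @ Wo
```

Each head attends in its own projected subspace; in the base model the paper uses h = 8 heads with d_k = d_v = d_model / h = 64.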

3. Applications of Attention in Transformer

Left: Encoder-Decoder Attention; Middle: Self-Attention in the Encoder; Right: Masked Self-Attention in the Decoder

4. Position-wise Feed-Forward Networks

Position-wise Feed-Forward Networks
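The position-wise FFN is two linear transformations with a ReLU in between, applied identically and independently at every position; a sketch with illustrative parameter names:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, with the same weights at every position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

In the base model the outer dimension is d_model = 512 and the inner dimension is d_ff = 2048.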

5. Other Details

5.1. Embeddings and Softmax

5.2. Positional Encoding

Positional Encoding at Encoder (Left) and Decoder (Right)
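The sinusoidal encoding PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) can be computed as follows (this sketch assumes an even d_model):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]          # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]      # the 2i values of the formula
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe
```

Each dimension is a sinusoid of a different wavelength, so the model can attend by relative position, and the encoding extrapolates to sequence lengths longer than those seen in training.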

5.3. Why Attention

Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types
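For reference, the main rows of that table in the paper (n = sequence length, d = representation dimension, k = convolution kernel width):

```text
Layer Type       Complexity per Layer   Sequential Ops   Maximum Path Length
Self-Attention   O(n^2 * d)             O(1)             O(1)
Recurrent        O(n * d^2)             O(n)             O(n)
Convolutional    O(k * n * d^2)         O(1)             O(log_k(n))
```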

6. Experimental Results

6.1. Datasets

6.2. SOTA Comparison

English-to-German and English-to-French newstest2014 tests

6.3. Model Variations

Variations on the Transformer architecture, evaluated on the English-to-German translation development set, newstest2013 (unlisted values are identical to those of the base model)

6.4. English Constituency Parsing

English Constituency Parsing on Wall Street Journal (WSJ)

6.5. Attention Visualization

An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6
Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution

