Review — Convolutional Sequence to Sequence Learning (ConvS2S)

ConvS2S, a Fully Convolutional Network, Outperforms GNMT

  • An architecture is proposed that is based entirely on convolutional neural networks (CNNs).
  • Computations can be fully parallelized during training.
  • Gated linear units (GLU) ease gradient propagation.
  • Each decoder layer is equipped with a separate attention module.

Outline

  1. ConvS2S: Network Architecture
  2. Experimental Results

1. ConvS2S: Network Architecture

Convolutional Sequence to Sequence Learning (ConvS2S) Network Architecture
  • The top part of the figure is the encoder; the bottom part is the decoder.
  • The encoder processes an input sequence x=(x1, …, xm) of m elements and returns state representations z=(z1, …, zm).
  • The decoder takes z and generates the output sequence y=(y1, …, yn) left to right, one element at a time.
  • The architecture is described below part by part.

1.1. Position Embeddings

Position Embeddings e at Encoder
  • Input elements x=(x1,…,xm) are embedded in distributional space as w=(w1,…,wm).
  • The absolute positions of input elements p=(p1,…,pm) are embedded.
  • Both w and p are combined to obtain input element representations e=(w1+p1,…,wm+pm). Thus, position-dependent word embedding is used.
Position Embeddings g at Decoder
  • Similarly, at the decoder, position embeddings g are used, as shown above.
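The combination of word and position embeddings above can be sketched with NumPy (the table sizes and names here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, max_len, dim = 100, 50, 8   # illustrative sizes
W = rng.normal(size=(vocab_size, dim))  # word embedding table (w)
P = rng.normal(size=(max_len, dim))     # absolute position embedding table (p)

tokens = np.array([4, 17, 3, 42])       # a toy input sequence x
positions = np.arange(len(tokens))      # absolute positions 0..m-1

# e = (w1+p1, ..., wm+pm): position-dependent word embeddings
e = W[tokens] + P[positions]
```

Each row of e depends on both the word identity and where it sits in the sequence, which is how the model recovers order information without recurrence.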

1.2. Convolutions and Residual Connections

1D Convolution at Encoder
1D Convolution at Decoder
  • Both the encoder and decoder networks share a simple block structure.
  • Each block/layer contains a one-dimensional convolution followed by a non-linearity.
  • At the decoder, the asymmetric triangular shape indicates a causal convolution: future words are not used.
  • Stacking several blocks on top of each other increases the number of input elements represented in a state. For instance, stacking 6 blocks with kernel width k=5 results in an input field of 25 elements, i.e. each output depends on 25 inputs.
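The growth of the input field can be checked with a small helper (a sketch; each width-k convolution with stride 1 and no dilation widens the field by k−1):

```python
def input_field(num_blocks: int, k: int) -> int:
    """Number of input elements seen by one output state after
    stacking num_blocks width-k convolutions (stride 1, no dilation)."""
    field = 1
    for _ in range(num_blocks):
        field += k - 1  # each stacked block widens the field by k-1
    return field

print(input_field(6, 5))  # 6 blocks, kernel width 5 -> 25
```

This matches the example above: deeper stacks trade a few narrow convolutions for a wide effective context.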

1.3. Gated Linear Unit (GLU)

Gated Linear Unit (GLU) at Encoder, (Output is z)
Gated Linear Unit (GLU) at Decoder, (Output is h)
  • The output of each convolution is split into two equal parts A and B, which go through the Gated Linear Unit (GLU):
      v([A; B]) = A ⊗ σ(B)
  • where ⊗ is the point-wise multiplication and σ is the sigmoid function; σ(B) acts as a gate controlling which parts of A are passed to the next layer.
  • (For GLU, please feel free to read GCNN if interested.)
  • The output of GLU at the encoder is z.
  • The output of GLU at the decoder is h.
  • In addition, residual connections are added from the input of each convolution to the output of the block (recall that v is the GLU):
      h_i^l = v(W^l [h_{i-k/2}^{l-1}, …, h_{i+k/2}^{l-1}] + b_w^l) + h_i^{l-1}
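A minimal NumPy sketch of the GLU and the residual connection (the shapes and the pretend convolution output are illustrative):

```python
import numpy as np

def glu(conv_out):
    """Split the convolution output into A and B along the channel
    axis and compute A * sigmoid(B)."""
    A, B = np.split(conv_out, 2, axis=-1)
    return A * (1.0 / (1.0 + np.exp(-B)))

def block(conv_out, block_input):
    # residual connection from the block input to the GLU output
    return glu(conv_out) + block_input

x = np.ones((4, 8))           # block input: 4 positions, 8 channels
conv_out = np.zeros((4, 16))  # pretend convolution output (2*8 channels)
out = block(conv_out, x)      # here glu(...) is 0, so out equals x
```

Note the convolution must produce twice the hidden size, since half of its channels are consumed by the gate.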

1.4. Multi-Step Attention

Multi-Step Attention
  • Before computing the attention, the current decoder state h_i^l is combined with an embedding of the previous target element g_i to obtain the decoder summary d_i^l (no corresponding blocks in the figure):
      d_i^l = W_d^l h_i^l + b_d^l + g_i
  • The dot product of the decoder summary d and each encoder output z is computed (center array in blue and yellow colors).
  • The attention weight a_ij^l is obtained by applying a softmax over these dot products (output of the center array):
      a_ij^l = exp(d_i^l · z_j^u) / Σ_t exp(d_i^l · z_t^u)
  • Finally, the conditional input c_i^l is calculated as the attention-weighted sum of (z + e):
      c_i^l = Σ_j a_ij^l (z_j^u + e_j)
  • Recall that z_j^u is the output of the last encoder layer u and e_j is the input embedding at the encoder. Encoder outputs z_j^u represent potentially large input contexts, while e_j provides point information about a specific input element that is useful when making a prediction.
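Putting the steps above together, one attention step can be sketched in NumPy (W_d, b_d and all sizes here are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, dim = 5, 3, 8            # source length, target length, hidden size

h = rng.normal(size=(n, dim))  # decoder states h^l
g = rng.normal(size=(n, dim))  # embeddings of previous target elements
z = rng.normal(size=(m, dim))  # encoder outputs z^u
e = rng.normal(size=(m, dim))  # encoder input embeddings
W_d = rng.normal(size=(dim, dim))
b_d = np.zeros(dim)

d = h @ W_d + b_d + g              # decoder summaries d^l
scores = d @ z.T                   # dot products d . z, shape (n, m)
a = np.exp(scores)
a /= a.sum(axis=1, keepdims=True)  # softmax over source positions
c = a @ (z + e)                    # conditional input: weighted sum of z + e
```

Because every decoder layer has its own W_d and attention weights, each layer can look at different parts of the source, which is what "multi-step" refers to.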

1.5. Output

Predicted output y
  • Once c_i^l has been computed, it is simply added to the output of the corresponding decoder layer h_i^l. The top decoder output h_i^L is then linearly projected and passed through a softmax to obtain a distribution over the next target element:
      p(y_{i+1} | y_1, …, y_i, x) = softmax(W_o h_i^L + b_o)
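The final step, adding the conditional input to the decoder state and predicting the next word, can be sketched as follows (W_o and b_o are illustrative names for the output projection, and the sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10, 8

c = rng.normal(size=dim)           # conditional input c^l_i from attention
h = rng.normal(size=dim)           # decoder layer output h^l_i
h_top = c + h                      # attention output added to decoder state

W_o = rng.normal(size=(dim, vocab))
b_o = np.zeros(vocab)

logits = h_top @ W_o + b_o
p = np.exp(logits - logits.max())  # numerically stable softmax
p /= p.sum()                       # distribution over the next target word
```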

1.6. Others

  • Normalization is performed to scale the outputs of residual blocks as well as the attention so that the variance of activations is preserved; for instance, the sum of an input and its residual is multiplied by √0.5 to halve its variance.
  • Careful weight initialization is done, motivated by the same variance-preservation argument.
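The variance-preserving idea can be illustrated numerically: summing two independent unit-variance activations doubles the variance, and multiplying by √0.5 restores it (a sketch of the principle, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)    # input activations, variance ~1
r = rng.normal(size=100_000)    # residual branch output, variance ~1

summed = x + r                  # variance ~2 (independent terms add)
scaled = np.sqrt(0.5) * summed  # variance scaled back to ~1
```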

2. Experimental Results

2.1. Single Model

Accuracy on WMT tasks compared to previous work.
  • On WMT’16 English-Romanian, ConvS2S has 20 layers in the encoder and 20 layers in the decoder, both using kernels of width 3 and hidden size 512 throughout.
  • On WMT’14 English to German translation, the proposed ConvS2S encoder has 15 layers and the decoder has 15 layers, both with 512 hidden units in the first ten layers and 768 units in the subsequent three layers, all using kernel width 3. The final two layers have 2048 units which are just linear mappings with a single input.
  • On WMT’14 English-French translation, ConvS2S uses slightly different settings with different numbers of hidden units.

2.2. Ensemble Model

Accuracy of ensembles with eight models

