Sik-Ho Tsang

Nov 26, 2021

Review: Language Modeling with Gated Convolutional Networks (GCNN/GLU)

Gated Convolutional Networks (GCNN) Using Gated Linear Unit (GLU)

  • A finite context approach through stacked convolutions is proposed, which can be more efficient since they allow parallelization over sequential tokens.
  • A novel simplified gating mechanism, Gated Linear Unit (GLU), is proposed.


  1. Gated Convolutional Networks (GCNN): Network Architecture
  2. Experimental Results

1. Gated Convolutional Networks (GCNN): Network Architecture

Gated Convolutional Networks (GCNN): Network Architecture
  • The architecture above is described part by part below, from input to output.

1.1. Motivations of Using CNN over RNN

  • A recurrent neural network (RNN) must wait for the previous hidden state before it can compute the next one, which makes it difficult to parallelize.
  • The proposed approach uses a CNN, which convolves the inputs with a function f to obtain H = f∗w. It therefore has no temporal dependencies, so it is easier to parallelize over the individual words of a sentence.

1.2. Word Embedding as Input

Word Embedding as Input
  • Words are represented by a vector embedding stored in a lookup table D ∈ R^(|V|×e), where |V| is the number of words in the vocabulary and e is the embedding size. The input to the model is a sequence of words w_0, …, w_N, which are represented by the word embeddings E = [D_w0, …, D_wN].
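The lookup described above can be sketched in a few lines. This is only an illustration: the toy vocabulary, the embedding size and the placeholder values in D are assumptions made for the example, not values from the paper.

```python
# Hypothetical toy vocabulary; a real model has a far larger |V|.
vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3}
e = 3  # embedding size (illustrative)

# D has |V| rows and e columns. The values are arbitrary placeholders;
# in a trained model they are learned parameters.
D = [[0.1 * (row + 1) * (col + 1) for col in range(e)]
     for row in range(len(vocab))]

def embed(words, vocab, D):
    """Return E = [D[w_0], ..., D[w_N]], one embedding vector per word."""
    return [D[vocab[w]] for w in words]

E = embed(["<s>", "the", "cat"], vocab, D)
```

Here E is a list of N+1 vectors of size e, which is the input that the first convolutional layer receives.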

1.3. Gated Linear Unit (GLU)

Gated Linear Unit (GLU)
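  • The hidden layers h_0, …, h_L are computed as:
        h_l(X) = (X ∗ W + b) ⊗ σ(X ∗ V + c)
  • where X is the input of layer h_l (either the word embeddings or the output of the previous layer), W, b, V and c are learned parameters, σ is the sigmoid function and ⊗ is the element-wise product between matrices.
  • In other words, the linear projection X ∗ W + b is modulated by the gates σ(X ∗ V + c).
  • When convolving inputs, care is needed that h_i does not contain information from future words. To prevent this, the beginning of the sequence is zero-padded with k−1 elements, where k is the kernel width, so that each output position only sees the current and previous words.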
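A minimal pure-Python sketch of one gated layer may help make this concrete. It assumes the usual GLU form, a linear convolution multiplied element-wise by a sigmoid-gated convolution, and it left-pads the input with k−1 zero vectors so that no position sees future words. The function names and the list-of-lists layout are choices made for this illustration only; a real implementation would use a tensor library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def causal_conv(X, W, b):
    """Convolve over time so that position t only sees inputs t-(k-1)..t.

    X: N x m input, W: k x m x n kernel, b: n biases. Returns N x n.
    """
    k, m, n = len(W), len(W[0]), len(W[0][0])
    # Zero-pad the beginning of the sequence with k-1 vectors.
    padded = [[0.0] * m for _ in range(k - 1)] + X
    out = []
    for t in range(len(X)):
        window = padded[t:t + k]  # rows t-(k-1) .. t of the original X
        out.append([
            b[o] + sum(window[j][i] * W[j][i][o]
                       for j in range(k) for i in range(m))
            for o in range(n)
        ])
    return out

def glu_layer(X, W, b, V, c):
    """Linear path (X*W + b) multiplied element-wise by sigmoid(X*V + c)."""
    linear = causal_conv(X, W, b)
    gate = causal_conv(X, V, c)
    return [[a * sigmoid(g) for a, g in zip(row_a, row_g)]
            for row_a, row_g in zip(linear, gate)]
```

Because every output row depends only on a fixed window of the input, all positions t can in principle be computed in parallel, which is the advantage over an RNN mentioned in Section 1.1.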

1.4. Stacking GLU

Stacking GLU
  • Stacking multiple layers on top of the input E gives a representation of the context for each word: H = h_L ∘ … ∘ h_0(E).
  • The convolution and the gated linear unit are wrapped in a pre-activation residual block (Pre-Activation ResNet).
  • The blocks have a bottleneck structure for computational efficiency and each block has up to 5 layers.
  • (Please feel free to read Pre-Activation ResNet if interested.)
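The composition and the residual wrapping can be sketched as below, together with the context size that a stack of causal convolutions covers. This is a simplified illustration: the helper names are hypothetical, the kernel widths in the usage note are not the paper's configurations, and the bottleneck and pre-activation details are left out.

```python
def stack(layers, E):
    """Apply h_0 first and h_L last: H = h_L(...h_0(E))."""
    H = E
    for layer in layers:
        H = layer(H)
    return H

def residual(block):
    """Wrap a block so that it computes X + block(X).

    The block must preserve the shape of its input.
    """
    def wrapped(X):
        Y = block(X)
        return [[x + y for x, y in zip(row_x, row_y)]
                for row_x, row_y in zip(X, Y)]
    return wrapped

def context_size(kernel_widths):
    """Input positions (current word included) visible to the top layer.

    Each causal layer of width k adds k-1 earlier positions.
    """
    return 1 + sum(k - 1 for k in kernel_widths)
```

For example, three stacked causal layers of width 4 give `context_size([4, 4, 4]) == 10`, so the context grows linearly with depth.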

1.5. Softmax

  • The simplest choice to obtain model predictions is to use a softmax layer, but it is computationally inefficient for large vocabularies.
  • Instead, adaptive softmax (Grave et al., 2016a) is used, which assigns higher capacity to very frequent words and lower capacity to rare words.
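The idea of treating frequent and rare words differently can be illustrated with a heavily simplified two-level softmax. This is a sketch under stated assumptions, not the method of Grave et al.: it uses a single tail cluster, and it omits the reduced projection size that gives rare words lower capacity. The function names are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def two_level_probs(head_scores, tail_scores):
    """Probabilities over frequent words followed by rare words.

    head_scores: one score per frequent word, then one score for the
                 tail cluster as a whole.
    tail_scores: one score per rare word inside the tail cluster.
    """
    head = softmax(head_scores)
    tail = softmax(tail_scores)
    p_cluster = head[-1]
    # p(rare word) = p(tail cluster) * p(word | tail cluster)
    return head[:-1] + [p_cluster * p for p in tail]
```

The saving comes from the fact that the tail scores only need to be computed when a rare word is involved, while most tokens in a corpus are frequent words handled by the small head.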

1.6. GCNN Variants

GCNN Variants. The residual building blocks are shown in brackets with the format [k, n]. “B” denotes bottleneck architectures.
  • Gradient clipping is used, so that overly large gradients are clipped before the update.
  • Weight normalization is used, which normalizes the weights of some layers.
  • Both techniques speed up convergence.
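Both techniques are short enough to sketch. The version below assumes clipping by gradient norm and the standard weight normalization form w = g · v / ‖v‖; the threshold and vectors in the usage note are illustrative and not the paper's settings.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def weight_norm(v, g):
    """Return w = g * v / ||v||: direction from v, length from the scalar g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]
```

For instance, `clip_by_norm([3.0, 4.0], 1.0)` rescales a gradient of norm 5 down to norm 1 while keeping its direction, and `weight_norm([3.0, 4.0], 2.0)` yields a weight vector of length 2.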

2. Experimental Results

2.1. Google Billion Word Dataset

Results on the Google Billion Word test set
  • GCNN outperforms comparable LSTM results on the Google Billion Word dataset.

2.2. WikiText-103 Dataset

Results for single models on the WikiText-103 dataset
  • Each input sequence is an entire Wikipedia article rather than an individual sentence, which increases the average length to 4000 words.

2.3. Other Studies

Comparison of full softmax and the adaptive softmax approximation
Learning curves on WikiText-103 (left) and Google Billion Word (right)
Processing speed in tokens/s at test time
  • Throughput can be maximized by processing many sentences in parallel to amortize sequential operations.
  • In contrast, responsiveness is the speed of processing the input sequentially, one token at a time.
Test perplexity as a function of context for Google Billion Word (left) and Wiki-103 (right)
  • Models with bigger context achieve better results, but the improvements diminish quickly beyond a context of 20.
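For reference, perplexity is the exponential of the average negative log-likelihood that the model assigns to the test tokens, so lower is better. A minimal sketch, assuming the per-token probabilities are already available as a list (the function name is chosen for this example):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns probability 0.25 to every token has a perplexity of 4, which is the same as guessing uniformly among four words.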
Learning curves on Google Billion Word for models with varying degrees of non-linearity
Effect of weight normalization and gradient clipping on Google Billion Word

