Review — ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

From BERT to ALBERT

Machine performance on the RACE challenge (SAT-like reading comprehension). The random-guess baseline score is 25.0; the maximum possible score is 95.0.
  • Increasing model size becomes harder due to GPU/TPU memory limitations and longer training times.
  • ALBERT, A Lite BERT, proposes two parameter reduction techniques to lower memory consumption and increase the training speed of BERT.
  • A self-supervised loss is also used that focuses on modeling inter-sentence coherence, which is shown to consistently help downstream tasks with multi-sentence inputs.

Outline

  1. Factorized Embedding Parameterization
  2. Cross-Layer Parameter Sharing
  3. Inter-Sentence Coherence Loss
  4. ALBERT Model Variants
  5. Experimental Results

1. Factorized Embedding Parameterization

1.1. Basic Model Architecture

  • The backbone of the ALBERT architecture is similar to BERT in that it uses a Transformer encoder with GELU nonlinearities.
  • Following BERT's notation conventions, the vocabulary embedding size is denoted as E, the number of encoder layers as L, and the hidden size as H.
  • Following BERT, the feed-forward/filter size is set to 4H and the number of attention heads to H/64 (see the configuration sketch below).
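As a quick illustration of these conventions, here is a minimal configuration sketch (the class and field names are my own illustrative choices, not from the official ALBERT implementation), deriving the feed-forward size and head count from H:

# Minimal sketch of the notation above; names are illustrative, not from the
# official ALBERT code.
from dataclasses import dataclass

@dataclass
class EncoderConfig:
    vocab_size: int      # V
    embedding_size: int  # E
    num_layers: int      # L
    hidden_size: int     # H

    @property
    def ffn_size(self) -> int:
        return 4 * self.hidden_size        # feed-forward/filter size = 4H

    @property
    def num_attention_heads(self) -> int:
        return self.hidden_size // 64      # number of attention heads = H/64

# Example: an ALBERT-base-like configuration (L=12, H=768, E=128).
config = EncoderConfig(vocab_size=30000, embedding_size=128,
                       num_layers=12, hidden_size=768)
print(config.ffn_size, config.num_attention_heads)   # 3072 12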

1.2. Embedding Factorization

  • In BERT, the WordPiece embedding size E is tied to the hidden layer size H, i.e., E ≡ H, which is sub-optimal: increasing H also increases the size of the embedding matrix, which has size V×E.
  • A more efficient use of the total model parameters is to have H ≫ E.
  • A factorization of the embedding parameters is proposed in ALBERT, decomposing them into two smaller matrices.
  • By using this decomposition, the embedding parameters are reduced from O(V×H) to O(V×E + E×H). This parameter reduction is significant when H ≫ E (a minimal sketch follows this list).
  • (A similar factorization concept is also used in CNNs, such as the factorized convolutions in Inception-v3 and the depthwise separable convolutions in MobileNetV1.)
  • (A somewhat similar concept is also used in matrix factorization for recommendation systems.)
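The sketch below illustrates the factorization in PyTorch (the sizes V=30000, E=128, H=768 and the module names are my own illustrative choices, not the official implementation): tokens are first embedded into a small E-dimensional space and then projected up to the H-dimensional hidden space.

# Illustrative sketch of factorized embedding parameterization; assumed sizes
# V=30000, E=128, H=768 (not the official ALBERT implementation).
import torch
import torch.nn as nn

V, E, H = 30000, 128, 768

# BERT-style: a single V x H embedding matrix.
bert_style = nn.Embedding(V, H)            # V*H = 23,040,000 parameters

# ALBERT-style: a V x E embedding followed by an E x H projection.
albert_embed = nn.Embedding(V, E)          # V*E =  3,840,000 parameters
albert_proj = nn.Linear(E, H, bias=False)  # E*H =     98,304 parameters

token_ids = torch.randint(0, V, (2, 16))          # (batch, seq_len)
hidden = albert_proj(albert_embed(token_ids))     # (2, 16, H)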

2. Cross-Layer Parameter Sharing

Attention network (equation from the Transformer): Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V
Feed-forward network (equation from the Transformer): FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
  • In a Transformer layer, there are an attention network and a feed-forward network (FFN).
  • There are multiple ways to share parameters across layers, e.g., sharing only FFN parameters or sharing only attention parameters; ALBERT's default is to share all parameters across layers (a minimal sketch follows below).
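Here is a minimal sketch of the all-shared strategy, using PyTorch's built-in TransformerEncoderLayer purely as a stand-in for an ALBERT layer (this is my illustration, not the paper's encoder): one set of layer weights is applied repeatedly, so the parameter count does not grow with depth.

# Illustrative sketch of cross-layer parameter sharing (all-shared strategy).
# torch.nn.TransformerEncoderLayer is used only as a stand-in for an ALBERT layer.
import torch
import torch.nn as nn

H, num_heads, ffn_size, L = 768, 12, 3072, 12

# A single set of layer parameters, reused for all L layers.
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=num_heads,
                                          dim_feedforward=ffn_size,
                                          activation="gelu", batch_first=True)

def encode(hidden_states: torch.Tensor, num_layers: int = L) -> torch.Tensor:
    # Apply the same layer (same weights) num_layers times.
    for _ in range(num_layers):
        hidden_states = shared_layer(hidden_states)
    return hidden_states

x = torch.randn(2, 16, H)   # (batch, seq_len, H)
out = encode(x)             # parameter count is independent of depth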

3. Inter-Sentence Coherence Loss

  • In addition to the masked language modeling (MLM) loss, BERT uses an additional loss called next-sentence prediction (NSP), a binary classification loss for predicting whether two segments appear consecutively in the original text.
  • However, subsequent studies found NSP’s impact to be unreliable and decided to eliminate it. It is conjectured that the main reason behind NSP’s ineffectiveness is its lack of difficulty as a task, compared to MLM.
  • ALBERT maintains that inter-sentence modeling is an important aspect of language understanding, but a loss based primarily on coherence is proposed.
  • That is, for ALBERT, a sentence-order prediction (SOP) loss is used, which avoids topic prediction and instead focuses on modeling inter-sentence coherence (a construction sketch is given below).
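The sketch below shows how SOP training pairs can be constructed, based on the paper's description (the function is my own illustration, not the paper's data pipeline): positives are two consecutive segments in their original order, and negatives are the same two segments with their order swapped.

# Illustrative construction of sentence-order prediction (SOP) examples.
# Positives: consecutive segments in original order; negatives: same segments swapped.
import random

def make_sop_example(segment_a: str, segment_b: str) -> tuple[str, str, int]:
    """segment_a and segment_b are consecutive segments from the same document."""
    if random.random() < 0.5:
        return segment_a, segment_b, 1   # label 1: correct order (positive)
    return segment_b, segment_a, 0       # label 0: swapped order (negative)

# Unlike NSP, both segments always come from the same document, so the model
# cannot solve the task through topic prediction alone.
example = make_sop_example("He went to the store.", "He bought a gallon of milk.")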

4. ALBERT Model Variants

The configurations of the main BERT and ALBERT models analyzed in this paper
  • ALBERT models have far fewer parameters than the corresponding BERT models.
  • For example, ALBERT-large has about 18× fewer parameters than BERT-large: 18M versus 334M.
  • An ALBERT-xlarge configuration with H=2048 has only 60M parameters, and an ALBERT-xxlarge configuration with H=4096 has 233M parameters, i.e., around 70% of BERT-large’s parameters (a rough breakdown of where the savings come from is sketched after this list).
  • Note that for ALBERT-xxlarge, a 12-layer network is used because a 24-layer network (with the same configuration) obtains similar results but is computationally more expensive.
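As a rough back-of-envelope check (my own approximation, ignoring biases, layer norms, position/segment embeddings, and the pooler), the ALBERT-large count of about 18M can be recovered from one shared Transformer layer plus the factorized embeddings:

# Rough, approximate parameter count for ALBERT-large (H=1024, E=128, V≈30k).
# My own estimate; ignores biases, layer norms, position/segment embeddings, pooler.
V, E, H = 30000, 128, 1024

embeddings   = V * E + E * H              # factorized token embeddings
attention    = 4 * H * H                  # Q, K, V and output projection matrices
feed_forward = 2 * H * (4 * H)            # two feed-forward matrices (H -> 4H -> H)
shared_layer = attention + feed_forward   # shared across all 24 layers

total = embeddings + shared_layer
print(f"{total / 1e6:.1f}M")              # ~16.6M, in the ballpark of the reported 18M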

5. Experimental Results

5.1. Overall Comparison between BERT and ALBERT

Dev set results for models pretrained over BOOKCORPUS and Wikipedia for 125k steps

5.2. Factorized Embedding Parameterization

The effect of vocabulary embedding size on the performance of ALBERT-base
  • Based on these results, an embedding size of E=128 is used in all subsequent settings.

5.3. Cross-Layer Parameter Sharing

The effect of cross-layer parameter-sharing strategies, ALBERT-base configuration
  • The all-shared strategy is chosen as the default.

5.4. Sentence Order Prediction (SOP)

The effect of sentence-prediction loss, NSP vs. SOP, on intrinsic and downstream tasks
  • Three settings are tried using an ALBERT-base configuration: none (XLNet- and RoBERTa-style), NSP (BERT-style), and SOP (ALBERT-style).

5.5. Same Amount of Training Time

The effect of controlling for training time, BERT-large vs ALBERT-xxlarge configurations
  • A BERT-large model trained for 400k steps (34h of training) is compared against an ALBERT-xxlarge model trained for 125k steps (32h of training), which takes roughly the same amount of time.

5.6. Additional Data & Dropout

The effects of adding data and removing Dropout during training
The effect of additional training data using the ALBERT-base configuration
  • The performance on the downstream tasks is also improved.
The effect of removing Dropout, measured for an ALBERT-xxlarge configuration
  • Removing Dropout also helps the downstream tasks.

5.7. Current SOTA on NLU Tasks

State-of-the-art results on the GLUE benchmark
State-of-the-art results on the SQuAD and RACE benchmarks
  • The single-model ALBERT configuration incorporates the best-performing settings discussed: an ALBERT-xxlarge configuration using combined MLM and SOP losses, and no Dropout.
  • The RACE result appears to be a particularly strong improvement: a jump of +17.4% absolute points over BERT, +7.6% over XLNet, +6.2% over RoBERTa, and +5.3% over DCMN+, an ensemble of multiple models specifically designed for reading comprehension tasks.
