Review — DeLighT: Deep and Light-weight Transformer

DeLighT, Parameter Reduction for Transformers

  • Within each block, a deep and light-weight transformation (the DeLighT transformation) is used, forming the DeLighT block.
  • Across blocks, block-wise scaling is used for shallower and narrower DeLighT blocks near the input, and wider and deeper DeLighT blocks near the output.

Outline

  1. DeLighT Block
  2. Block-wise Scaling
  3. Experimental Results

1. DeLighT Block

(a, b) Block-wise comparison between the standard Transformer block and the DeLighT block. (c, d) Comparison of the DeFINE transformation with the DeLighT transformation.

1.1. Conceptual Idea

  • Similar to DeFINE, DeLighT transformation uses group linear transformations (GLTs) because they learn local representations by deriving the output from a specific part of the input and are more efficient than linear transformations.
  • To learn global representations, the DeLighT transformation shares information between different groups in the group linear transformation using feature shuffling.
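
A minimal sketch of a group linear transformation followed by feature shuffling, assuming PyTorch; the names GroupLinear and shuffle_features are illustrative and not taken from the official DeLighT code:

```python
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Split the input into `groups` chunks and apply an independent linear
    layer to each chunk (local representations)."""
    def __init__(self, in_features, out_features, groups):
        super().__init__()
        assert in_features % groups == 0 and out_features % groups == 0
        self.groups = groups
        self.layers = nn.ModuleList(
            [nn.Linear(in_features // groups, out_features // groups)
             for _ in range(groups)]
        )

    def forward(self, x):                      # x: (batch, in_features)
        chunks = x.chunk(self.groups, dim=-1)  # one chunk per group
        return torch.cat([f(c) for f, c in zip(self.layers, chunks)], dim=-1)

def shuffle_features(x, groups):
    """Feature shuffling: interleave features across groups so the next GLT
    sees information from every group (global representations)."""
    b, d = x.shape
    return x.view(b, groups, d // groups).transpose(1, 2).reshape(b, d)

x = torch.randn(2, 64)
y = GroupLinear(64, 128, groups=4)(x)
y = shuffle_features(y, groups=4)
print(y.shape)  # torch.Size([2, 128])
```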

1.2. DeLighT Transformation

Example illustrating the expansion phase in the DeLighT transformation that uses GLTs, feature shuffling, and an input mixer connection
  • In the expansion phase, the DeLighT transformation linearly projects the dm-dimensional input to a higher-dimensional space, dmax = wm·dm, using ⌈N/2⌉ GLT layers.
  • In the reduction phase, the DeLighT transformation projects the dmax-dimensional vector down to a do-dimensional space using the remaining N − ⌈N/2⌉ GLT layers.
  • Mathematically, the output Yl at each GLT layer l is: Yl = F(X, Wl, bl, gl) if l = 1, and Yl = F(H(X, Yl-1), Wl, bl, gl) otherwise, where Wl, bl, and gl denote the weights, biases, and number of groups at layer l.
  • The function F splits its input into gl groups and linearly transforms each group Xi with weights Wli and bias bli to produce the group output Yli.
  • The outputs of the groups Yli are then concatenated to produce the output Yl.
  • The function H first shuffles the output features across the groups of Yl-1 and then combines them with the input X using the input mixer connection from DeFINE.
  • The number of groups at the l-th GLT is computed as gl = min(2^(l-1), gmax) during the expansion phase, and the group counts are mirrored during the reduction phase.
  • In this paper, gmax=⌈dm/32⌉ so that each group has at least 32 input elements.
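
The schedule of group counts and widths across the N GLT layers can be sketched as below; the doubling-then-mirroring group rule follows the description above, while the linear per-layer width ramp (and the helper name delight_schedule) is an illustrative assumption rather than the exact widths of the official implementation:

```python
import math

def delight_schedule(d_m, w_m, d_o, N):
    g_max = math.ceil(d_m / 32)            # each group sees >= 32 input features
    d_max = w_m * d_m                      # peak width of the expansion phase
    n_exp = math.ceil(N / 2)               # expansion layers
    n_red = N - n_exp                      # reduction layers
    # Group counts: double per layer (capped at g_max), then mirror.
    exp_groups = [min(2 ** (l - 1), g_max) for l in range(1, n_exp + 1)]
    red_groups = list(reversed(exp_groups))[-n_red:] if n_red else []
    # Widths (approximate): ramp d_m -> d_max during expansion, d_max -> d_o during reduction.
    exp_widths = [round(d_m + (d_max - d_m) * l / n_exp) for l in range(1, n_exp + 1)]
    red_widths = [round(d_max + (d_o - d_max) * l / n_red) for l in range(1, n_red + 1)]
    return exp_groups + red_groups, exp_widths + red_widths

groups, widths = delight_schedule(d_m=256, w_m=2, d_o=128, N=8)
print(groups)  # [1, 2, 4, 8, 8, 4, 2, 1]
print(widths)  # [320, 384, 448, 512, 416, 320, 224, 128]
```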

1.3. DeLighT Block

  • The DeLighT transformation is integrated into the Transformer block.
  • DeLighT attention is standard scaled dot-product attention computed in a do-dimensional space: Attention(K, Q, V) = softmax(QK^T/√do)V.
  • Here do < dm, where dm is the dimension used by the standard Transformer attention module. In the experiments, do = dm/2, so 2× fewer multiplication-addition operations are required as compared to the Transformer architecture.
  • For the light-weight FFN, the first layer reduces the dimensionality of the input from dm to dm/r, while the second layer expands it from dm/r back to dm, where r is the reduction factor.
  • In the experiments, r = 4. Since the standard Transformer FFN instead expands from dm to 4dm, the light-weight FFN has 16× fewer FFN parameters (a quick check is sketched below).
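
A quick back-of-the-envelope check of the 16× figure (counting weight matrices only and ignoring biases; illustrative, not the paper's code):

```python
def standard_ffn_params(d_m):
    # Standard Transformer FFN: d_m -> 4*d_m -> d_m.
    return d_m * (4 * d_m) + (4 * d_m) * d_m

def light_ffn_params(d_m, r=4):
    # Light-weight FFN: d_m -> d_m / r -> d_m.
    return d_m * (d_m // r) + (d_m // r) * d_m

d_m = 512
print(standard_ffn_params(d_m) / light_ffn_params(d_m))  # 16.0
```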

2. Block-wise Scaling

Block-wise scaling efficiently allocates parameters and operations across blocks, leading to shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output.
  • Simply scaling model width and depth allocates parameters uniformly across blocks, which may lead to learning redundant parameters.
  • For the b-th DeLighT block (b = 0, …, B−1), the number of GLTs Nb and the width multiplier wbm are computed as: Nb = Nmin + (Nmax − Nmin)·b/(B−1) and wbm = wm + (Nmax − Nmin)·b/(Nmin·(B−1)), so blocks near the input are shallower and narrower while blocks near the output are deeper and wider (a small sketch follows).
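
A small sketch of this schedule, evaluated at the WMT’14 setting used later in Section 3.1 (the helper name blockwise_scaling and the rounding of Nb to an integer are assumptions made here for illustration):

```python
def blockwise_scaling(w_m, N_min, N_max, B):
    configs = []
    for b in range(B):
        N_b = N_min + (N_max - N_min) * b / (B - 1)          # GLTs in block b
        w_b = w_m + (N_max - N_min) * b / (N_min * (B - 1))  # width multiplier
        configs.append((round(N_b), round(w_b, 2)))
    return configs

# WMT'14 En-De / En-Fr setting: w_m=2, N_min=4, N_max=8, B=N_max=8.
for b, (N_b, w_b) in enumerate(blockwise_scaling(w_m=2, N_min=4, N_max=8, B=8)):
    print(f"block {b}: N_b={N_b}, w_m^b={w_b}")  # shallower/narrower -> deeper/wider
```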

3. Experimental Results

3.1. Machine Translation

  • wm = 2, Nmin = 4, and Nmax = 8 for WMT’16 En-Ro, WMT’14 En-De, and WMT’14 En-Fr, resulting in 222-layer-deep DeLighT networks.
  • wm = 1, Nmin = 3, and Nmax = 9 for IWSLT’14 De-En, resulting in a 289-layer-deep network.
  • For simplicity, B=Nmax.
Comparison with baseline Transformers on machine translation corpora.
DeLighT networks are deep, lightweight and efficient as compared to Transformers on the WMT’14 En-Fr dataset.
Comparison of DeLighT with Transformers and Evolved Transformers at two different settings, on the WMT’14 En-De corpus.
Comparison with state-of-the-art methods on machine translation corpora.
DeLighT requires less regularization as compared to baseline Transformers (Dataset: WMT’14 En-De).
  • DeLighT delivers similar performance to baseline Transformers, but with fewer parameters and less regularization.
Scaling up DeLighT models.

3.2. Language Modeling

  • wm=2, Nmin=4, and Nmax=12, B=Nmax.
Results on the WikiText-103 dataset.

3.3. Computational Complexity

Comparison with baseline Transformers in terms of training speed and memory consumption.
  • Dedicated CUDA kernels are implemented for the grouping and ungrouping functions in GLTs (sketched below). With these changes, the training time and GPU memory consumption of DeLighT are reduced by about 4 hours and 3 GB, respectively.
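
For reference, grouping and ungrouping in a GLT are essentially tensor reshapes; a minimal PyTorch sketch of what these operations do (illustrative only; the paper replaces such naive implementations with dedicated CUDA kernels):

```python
import torch

def group(x, g):
    """(batch, d) -> (batch, g, d // g): split the features into g groups."""
    b, d = x.shape
    return x.view(b, g, d // g)

def ungroup(x):
    """(batch, g, d_g) -> (batch, g * d_g): concatenate the groups back."""
    b, g, d_g = x.shape
    return x.reshape(b, g * d_g)

x = torch.randn(2, 64)
assert torch.equal(ungroup(group(x, 4)), x)  # round trip recovers the input
```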

Reference

[2021 ICLR] [DeLighT]
DeLighT: Deep and Light-weight Transformer

