# Review — DeLighT: Deep and Light-weight Transformer

## DeLighT, Parameter Reduction for Transformers

• Within each Transformer block, a deep and light-weight transformation, the DeLighT transformation, is used inside the DeLighT block.
• Across blocks, block-wise scaling is used so that DeLighT blocks are shallower and narrower near the input, and deeper and wider near the output.

# Outline

1. DeLighT Block
2. Block-wise Scaling
3. Experimental Results

# 1. DeLighT Block

## 1.1. Conceptual Idea

• Similar to DeFINE, the DeLighT transformation uses group linear transformations (GLTs), which learn local representations by deriving each part of the output from a specific part of the input and are more efficient than full linear transformations.
• To learn global representations, the DeLighT transformation shares information between the different groups of the GLT using feature shuffling, as sketched below.
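
• To make this concrete, here is a minimal PyTorch sketch of a GLT followed by channel-shuffle-style feature shuffling. The class and function names are illustrative, not the paper's code:

```python
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Group linear transformation (GLT) sketch: the input features are
    split into `groups` chunks and each chunk gets its own small linear
    map, so each output group only sees a local slice of the input."""
    def __init__(self, d_in, d_out, groups):
        super().__init__()
        assert d_in % groups == 0 and d_out % groups == 0
        self.groups = groups
        self.weight = nn.Parameter(
            torch.randn(groups, d_in // groups, d_out // groups) * 0.02)
        self.bias = nn.Parameter(torch.zeros(groups, 1, d_out // groups))

    def forward(self, x):                               # x: (batch, d_in)
        b = x.size(0)
        x = x.view(b, self.groups, -1).transpose(0, 1)  # (groups, batch, d_in/groups)
        y = torch.bmm(x, self.weight) + self.bias       # per-group linear map
        return y.transpose(0, 1).reshape(b, -1)         # concatenate group outputs

def feature_shuffle(y, groups):
    """Interleave features across groups (channel-shuffle style) so the
    next GLT layer can mix information between groups."""
    b, d = y.shape
    return y.view(b, groups, d // groups).transpose(1, 2).reshape(b, d)

x = torch.randn(2, 128)
y = feature_shuffle(GroupLinear(128, 256, groups=4)(x), groups=4)
```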

## 1.2. DeLighT Transformation

• In the expansion phase, the DeLighT transformation projects the dm-dimensional input to a high-dimensional space, dmax = wm·dm, linearly using ⌈N/2⌉ layers.
• In the reduction phase, the DeLighT transformation projects the dmax-dimensional vector to a do-dimensional space using the remaining N − ⌈N/2⌉ GLT layers.
• Mathematically, the output Y^l at the l-th GLT layer is:

Y^l = F(X, W^l, b^l, g^l) if l = 1
Y^l = F(H(X, Y^(l−1)), W^l, b^l, g^l) otherwise

• The function F splits the input into g^l groups and linearly transforms each group X_i with weights W^l_i and bias b^l_i to produce the output Y^l_i.
• The outputs of each group Y^l_i are then concatenated to produce the output Y^l.
• The function H first shuffles the output of each group in Y^(l−1) and then combines it with the input X using the input mixer connection of DeFINE.
• The number of groups at the l-th GLT is computed as (see the sketch below):

g^l = min(2^(l−1), gmax) for 1 ≤ l ≤ ⌈N/2⌉, with the reduction-phase layers mirroring the expansion-phase values

• In this paper, gmax = ⌈dm/32⌉ so that each group has at least 32 input elements.
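
• A minimal sketch of the per-layer schedule, assuming a linear interpolation of the intermediate widths and a symmetric mirroring of the group counts (both assumptions; the function name is illustrative):

```python
import math

def delight_transformation_config(d_m, d_o, N, w_m):
    """Per-layer widths and group counts of the DeLighT transformation:
    expand d_m -> d_max = w_m * d_m over ceil(N/2) GLT layers, then
    reduce to d_o over the remaining N - ceil(N/2) layers."""
    d_max = int(w_m * d_m)
    g_max = math.ceil(d_m / 32)            # each group sees >= 32 inputs
    half = math.ceil(N / 2)
    dims, groups = [], []
    for l in range(1, N + 1):
        if l <= half:                      # expansion phase
            d = d_m + (d_max - d_m) * l // half
            g = min(2 ** (l - 1), g_max)   # g_l = min(2^(l-1), g_max)
        else:                              # reduction phase
            d = d_max - (d_max - d_o) * (l - half) // (N - half)
            g = groups[N - l]              # mirror the expansion-phase counts
        dims.append(d)
        groups.append(g)
    return dims, groups

# e.g. d_m=256, d_o=128, N=8, w_m=2: widths rise to 512 and fall back to 128,
# group counts go 1, 2, 4, 8, 8, 4, 2, 1
print(delight_transformation_config(256, 128, 8, 2))
```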

## 1.3. DeLighT Block

• The DeLighT transformation is integrated into the Transformer block.
• DeLighT attention is:

Attention(K, Q, V) = softmax(Q·K^T / √do)·V

• where do < dm, and dm is the model dimension of the standard Transformer. In the experiments, do = dm/2, so 2× fewer multiplication-addition operations are required in attention as compared to the standard Transformer architecture.
• For the light-weight FFN, the first layer reduces the dimensionality of the input from dm to dm/r while the second layer expands it from dm/r back to dm, where r is the reduction factor.
• In the experiments, r = 4. Thus the light-weight FFN reduces the number of parameters in the FFN by 16× (a standard FFN holds about 8·dm² parameters versus about 2·dm²/r here), as checked in the sketch below.
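
• A quick PyTorch check of the 16× figure (the model width and the activation are illustrative; only r = 4 comes from the paper):

```python
import torch.nn as nn

d_m, r = 512, 4   # illustrative model width; r = 4 as in the experiments

# Light-weight FFN: reduce d_m -> d_m/r, then expand back to d_m.
light_ffn = nn.Sequential(
    nn.Linear(d_m, d_m // r), nn.ReLU(), nn.Linear(d_m // r, d_m))

# Standard Transformer FFN expands to 4*d_m instead of reducing.
standard_ffn = nn.Sequential(
    nn.Linear(d_m, 4 * d_m), nn.ReLU(), nn.Linear(4 * d_m, d_m))

n_light = sum(p.numel() for p in light_ffn.parameters())
n_standard = sum(p.numel() for p in standard_ffn.parameters())
print(n_standard / n_light)   # ~16x: roughly 8*d_m^2 vs 2*d_m^2/r parameters
```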

# 2. Block-wise Scaling

• Simply scaling model width and depth allocates parameters uniformly across blocks, which may lead to learning redundant parameters.
• For the b-th DeLighT block (0 ≤ b ≤ B−1), the number of GLTs N^b and the width multiplier w^b_m are computed as:

N^b = Nmin + (Nmax − Nmin)·b/(B−1)
w^b_m = wm + (Nmax − Nmin)·b/(Nmin·(B−1))
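
• A minimal sketch of this schedule (the rounding of N^b to an integer is an assumption):

```python
def blockwise_scaling(B, N_min, N_max, w_m):
    """Per-block depth and width multiplier under block-wise scaling:
    block b = 0 (input side) is the shallowest and narrowest, block
    b = B-1 (output side) the deepest and widest."""
    configs = []
    for b in range(B):
        N_b = N_min + (N_max - N_min) * b / (B - 1)
        w_m_b = w_m + (N_max - N_min) * b / (N_min * (B - 1))
        configs.append((round(N_b), round(w_m_b, 2)))
    return configs

# e.g. the WMT setting from Section 3.1: w_m=2, N_min=4, N_max=8, B=N_max=8
print(blockwise_scaling(B=8, N_min=4, N_max=8, w_m=2))
```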

# 3. Experimental Results

## 3.1. Machine Translation

• wm=2, Nmin=4, and Nmax=8 for WMT’16 En-Ro, WMT’14 En-De, and WMT’14 En-Fr; resulting in 222-layer-deep DeLighT networks.
• wm=1, Nmin=3, and Nmax=9 for IWSLT’14 De-En; resulting in a 289-layer-deep network.
• For simplicity, B=Nmax.
• DeLighT delivers similar performance to baseline Transformers, but with fewer parameters and less regularization.

## 3.2. Language Modeling

• wm=2, Nmin=4, and Nmax=12, B=Nmax.

## 3.3. Computational Complexity

• Dedicated CUDA kernels for the grouping and ungrouping functions in GLTs are implemented. With these changes, the training time and GPU memory consumption of DeLighT are reduced by about 4 hours and 3 GB, respectively.

## Reference

[2021 ICLR] [DeLighT]
DeLighT: Deep and Light-weight Transformer
