# Review — DeLighT: Deep and Light-weight Transformer

## DeLighT, Parameter Reduction for Transformers


DeLighT: Deep and Light-weight Transformer, DeLighT, by University of Washington, Facebook AI Research, and Allen Institute for AI, 2021 ICLR, Over 60 Citations (Sik-Ho Tsang @ Medium)

NLP, LM, NMT, Transformer

- Within each Transformer block, a **deep and lightweight transformation** is applied using the **DeLighT block**.
- Across blocks, **block-wise scaling** is used for **shallower and narrower DeLighT blocks near the input**, and **wider and deeper DeLighT blocks near the output.**

# Outline

1. **DeLighT Block**
2. **Block-wise Scaling**
3. **Experimental Results**

# 1. DeLighT Block

## 1.1. Conceptual Idea

The DeLighT transformation maps a *dm*-dimensional input vector into a high-dimensional space (expansion) and then reduces it down to a *do*-dimensional output vector (reduction) using *N* layers of group transformations.

- **Similar to DeFINE**, the DeLighT transformation uses **group linear transformations (GLTs)** because they learn **local representations by deriving the output from a specific part of the input** and are **more efficient** than linear transformations.
- To learn **global representations**, the DeLighT transformation shares **information between different groups** in the **group linear transformation using feature shuffling.**
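To make these two ingredients concrete, the following is a minimal pure-Python sketch of a single group linear transformation and a feature shuffle. The function names `group_linear` and `feature_shuffle`, and the interleaving order used for the shuffle, are illustrative assumptions and are not taken from the paper or its code.

```python
def group_linear(x, weights, biases):
    """Apply one group linear transformation (GLT) to a feature vector.

    x        : list of d_in floats
    weights  : one (d_in/g) x (d_out/g) matrix per group (list of lists)
    biases   : one list of d_out/g floats per group
    Each group only sees its own slice of the input (a local representation).
    """
    g = len(weights)
    size = len(x) // g                      # input features per group
    out = []
    for k in range(g):
        chunk = x[k * size:(k + 1) * size]  # slice belonging to group k
        W, b = weights[k], biases[k]
        for j in range(len(b)):
            out.append(sum(chunk[i] * W[i][j] for i in range(size)) + b[j])
    return out                              # group outputs, concatenated


def feature_shuffle(y, g):
    """Interleave the features of g groups so the next GLT mixes groups."""
    size = len(y) // g
    return [y[k * size + i] for i in range(size) for k in range(g)]


def glt_weight_count(d_in, d_out, g):
    """Weights in a GLT: g blocks of (d_in/g) x (d_out/g), biases ignored."""
    return d_in * d_out // g
```

With `g` groups, a GLT needs `g` times fewer weights than a dense linear layer of the same size, which is where the efficiency comes from; the shuffle is what lets information cross group boundaries in the following layer.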

## 1.2. DeLighT Transformation

Formally, the DeLighT transformation is controlled by five configuration parameters: (1) number of GLT layers *N*, (2) width multiplier *wm*, (3) input dimension *dm*, (4) output dimension *do*, and (5) maximum groups *gmax* in a GLT.

- In the expansion phase, the DeLighT transformation projects the *dm*-dimensional input to a **high-dimensional space**, *dmax* = *wm* · *dm*, linearly using ⌈*N*/2⌉ layers.
- In the reduction phase, the DeLighT transformation projects the *dmax*-dimensional vector to a *do*-dimensional space using the remaining *N* − ⌈*N*/2⌉ GLT layers.
- Mathematically, the output *Y* at each GLT layer *l* is defined through two functions, *F* and *H*:
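Putting the roles of *F* and *H* described in this section into one expression, the layer-wise output can be sketched as follows. The argument order and the symbol $g^{l}$ for the number of groups at layer $l$ are notational assumptions:

```latex
Y^{l} =
\begin{cases}
\mathcal{F}\left(X,\; W^{l},\; b^{l},\; g^{l}\right), & l = 1 \\[4pt]
\mathcal{F}\left(\mathcal{H}\left(X,\; Y^{l-1}\right),\; W^{l},\; b^{l},\; g^{l}\right), & \text{otherwise}
\end{cases}
```

The first layer works directly on the input $X$; every later layer works on the shuffled previous output mixed with $X$.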

- The function *F* splits its input into groups and then linearly **transforms each** *Xi* with weights *Wli* and bias *bli* to **produce output** *Yli*.
- **The outputs of each group** *Yli* are then **concatenated** to produce the **output** *Yl*.
- The function *H* first **shuffles** the output of each group in *Yl*-1 and then **combines** it with the input *X* using the **input mixer** connection in DeFINE.
- The **number of groups at the** *l*-th GLT is computed per layer and capped at *gmax*.

- In this paper, *gmax* = ⌈*dm*/32⌉.
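A per-layer group schedule can be sketched as below. Two things are assumptions here and should be checked against the paper: that the group count doubles with each expansion layer (capped at *gmax*), and that the reduction phase reuses the expansion schedule in mirrored order. The function name `groups_per_layer` is hypothetical.

```python
import math


def groups_per_layer(n_layers, d_m):
    """Return an assumed group count for GLT layers 1..n_layers."""
    g_max = math.ceil(d_m / 32)        # cap on groups, as stated in the text
    half = math.ceil(n_layers / 2)     # number of expansion layers
    # Expansion phase (assumption): groups double per layer, capped at g_max.
    expansion = [min(2 ** (l - 1), g_max) for l in range(1, half + 1)]
    # Reduction phase (assumption): mirror the expansion schedule.
    reduction = expansion[:n_layers - half][::-1]
    return expansion + reduction


schedule = groups_per_layer(8, 512)    # N = 8, d_m = 512, so g_max = 16
```

Under these assumptions, early and late layers use few groups (closer to a dense layer), while the middle, widest layers use the most groups and therefore save the most parameters.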

## 1.3. DeLighT Block

- The DeLighT transformation is **integrated** into the Transformer block.
- DeLighT attention operates on the *do*-dimensional output of the DeLighT transformation.
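A sketch of the attention in the usual scaled dot-product form, with the scaling assumed to use the reduced dimension $d_o$ in place of $d_m$:

```latex
\mathrm{Attention}(K, Q, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_o}}\right) V
```

Here $K$, $Q$, and $V$ are assumed to be linear projections of the $d_o$-dimensional output of the DeLighT transformation.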

- Here *do* < *dm*, where *dm* is the dimension of the standard Transformer module. In the experiment, *do* = *dm*/2, so **2× fewer multiplication-addition operations** are required as compared to the Transformer architecture.
- For the **light-weight FFN**, the **first layer** reduces the dimensionality of the input **from** *dm* to *dm*/*r*, while the **second layer** expands the dimensionality **from** *dm*/*r* to *dm*, where *r* is the **reduction factor**.
- In the experiment, *r* = 4. Thus, the light-weight FFN **reduces the number of parameters in the FFN by 16×**.
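The two ratios quoted above can be checked with a little arithmetic. The sketch below assumes that the standard Transformer FFN expands *dm* to 4·*dm* (the usual configuration), ignores biases, and counts only the query-key product and the weighted sum over values for attention. The helper names and the choice of *dm* = 512 and sequence length 30 are illustrative.

```python
def ffn_weight_count(d_m, d_hidden):
    """Weights in a two-layer FFN d_m -> d_hidden -> d_m (biases ignored)."""
    return d_m * d_hidden + d_hidden * d_m


def attention_madds(seq_len, dim):
    """Multiply-adds for the query-key product plus the weighted sum over values."""
    return 2 * seq_len * seq_len * dim


d_m, r = 512, 4
standard_ffn = ffn_weight_count(d_m, 4 * d_m)   # assumed 4x expansion
light_ffn = ffn_weight_count(d_m, d_m // r)     # reduce by r, then expand back
ffn_ratio = standard_ffn / light_ffn

# Attention computed at d_o = d_m / 2 instead of d_m.
attn_ratio = attention_madds(30, d_m) / attention_madds(30, d_m // 2)
```

The factor of 16 comes from two factors of 4: the hidden layer is 4× narrower than *dm* instead of 4× wider.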

The DeLighT block stacks (1) a DeLighT transformation with *N* GLTs, (2) three parallel linear layers for key, query, and value, (3) a projection layer, and (4) two linear layers (the light-weight FFN). The depth of the DeLighT block is *N*+4, where 4 is the depth of a standard Transformer block.

# 2. Block-wise Scaling

**Simply scaling model width and depth** allocates parameters **uniformly across blocks**, which may lead to learning **redundant parameters**.

Block-wise scaling is introduced to create a network with variably-sized DeLighT blocks, allocating shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output.

- For the *b*-th DeLighT block, **the number of GLTs** *Nb* and **the width multiplier** *wbm* are set as a function of the block index *b*.

With this scaling, each DeLighT block has a different depth and width.
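A minimal sketch of such a schedule, assuming the depth is interpolated linearly from *Nmin* at the first block to *Nmax* at the last block, and the width multiplier grows by the same increment divided by *Nmin*. Both the interpolation form and the rounding of the depth to the nearest integer are assumptions, and `block_wise_scaling` is a hypothetical helper name.

```python
def block_wise_scaling(n_min, n_max, w_m, num_blocks):
    """Return an assumed (depth, width multiplier) pair for each block b.

    Assumed form, for 0 <= b <= B - 1:
        N_b = N_min + (N_max - N_min) * b / (B - 1)
        w_b = w_m   + (N_max - N_min) * b / (N_min * (B - 1))
    """
    configs = []
    for b in range(num_blocks):
        frac = b / (num_blocks - 1) if num_blocks > 1 else 0.0
        depth = n_min + (n_max - n_min) * frac
        width = w_m + (n_max - n_min) * frac / n_min
        configs.append((round(depth), width))   # rounding is a choice made here
    return configs


# Example with w_m = 2, N_min = 4, N_max = 8 and B = N_max blocks.
configs = block_wise_scaling(n_min=4, n_max=8, w_m=2, num_blocks=8)
depths = [d for d, _ in configs]
widths = [w for _, w in configs]
```

Blocks near the input get *Nmin* GLTs and the base width multiplier, while blocks near the output get *Nmax* GLTs and a larger multiplier, so parameters are concentrated toward the output side of the network.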

# 3. Experimental Results

## 3.1. Machine Translation

- *wm*=2, *Nmin*=4, and *Nmax*=8 for WMT’16 En-Ro, WMT’14 En-De, and WMT’14 En-Fr, resulting in **222-layer** deep DeLighT networks.
- *wm*=1, *Nmin*=3, and *Nmax*=9 for IWSLT’14 De-En, resulting in a **289-layer** deep network.
- For simplicity, *B*=*Nmax*.

DeLighT delivers better performance with fewer parameters than Transformers, across different corpora.

For example, on the WMT’14 En-Fr dataset, DeLighT is 3.7× deeper than Transformers and improves its BLEU score by 1.3 points, yet with 13 million fewer parameters and 3 billion fewer operations.

For small models (< 10M parameters), DeLighT models deliver better performance, and to attain the same performance as these models, DeLighT models require fewer parameters.

DeLighT delivers similar or better performance than existing methods.

- DeLighT delivers **similar performance** to baseline Transformers, but with fewer parameters and **less regularization**.

DeLighT models improve with an increase in network parameters, suggesting their ability to learn representations across different corpora, including low-resource ones.

## 3.2. Language Modeling

- *wm*=2, *Nmin*=4, and *Nmax*=12, with *B*=*Nmax*.

DeLighT delivers better performance than state-of-the-art methods (including Transformer-XL).

## 3.3. Computational Complexity

The Transformer and DeLighT models took about 37 and 23 hours for training and consumed about 12.5 GB and 14.5 GB of GPU memory, respectively (R1 vs. R2).

**Dedicated CUDA kernels** for the grouping and ungrouping functions in GLTs are implemented. With these changes, the **training time** and **GPU** memory consumption of DeLighT are **reduced by about 4 hours and 3 GB**, respectively.

## Reference

[2021 ICLR] [DeLighT]

DeLighT: Deep and Light-weight Transformer
