# Review — DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling

## DeFINE, Parameter Reduction for Input Token Embedding

---

DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling (DeFINE), by University of Washington and Allen Institute for AI, 2020 ICLR, Over 10 Citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Language Model, LM, Neural Machine Translation, NMT, Transformer, Transformer-XL

- A **hierarchical structure** with novel **skip-connections** is proposed, which allows **for the use of low-dimensional input and output layers**, **reducing total parameters and training time** while delivering similar or better performance versus existing methods.
- DeFINE can be **incorporated easily into new or existing sequence models.**

# Outline

1. **Hierarchical Group Transformation (HGT)**
2. **DeFINE Unit**
3. **Results**

# 1. Hierarchical Group Transformation (HGT)

## 1.1. Motivations & Overall Idea

- **Most NLP research** uses a **shallow network** to learn a good approximation for **token embeddings**. **DeFINE** is an effective way of **learning deep token representations** in high-dimensional space with a **minimum of additional parameters.**
- The proposed method is based on a **Map-Expand-Reduce (MER) principle**: it **first maps** an input token to a low-dimensional embedding vector, then **transforms it to a high-dimensional space** using a computationally efficient **hierarchical group transformation (HGT).**
- The resultant vector is then **transformed to a low-dimensional space.**
- A new connectivity pattern establishes a **direct link between the input and output layers, promoting feature reuse** and **improving gradient flow.**

## 1.2. **Map-Expand-Reduce (MER)**

- The **first** step in MER, **Map**, is similar to standard sequence models: every input token in the **vocabulary** *V* is mapped to a fixed-dimensional vector *ei* of size *n*×1. However, in this paper, *n* is small (say 64 or 128, compared to typical dimensions of 400 or more).
- The **next** step, **Expand**, takes *ei* as an **input** and **applies a hierarchical group transformation (HGT)** to produce a very **high-dimensional vector** ^*ei* of size *k*×1, where *k* >> *n*.
- The **last** step, **Reduce**, **projects the vector** ^*ei* to a lower-dimensional space to produce the **final embedding vector** *eo* of size *m*×1 for a given input token.
- The dimensions of *eo* can be matched to contextual representation models, such as LSTMs or Transformers, allowing DeFINE to serve as an input layer for these models.
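The three MER steps can be sketched in a few lines of NumPy. The dimensions below are illustrative, and a single dense projection stands in for the Expand step (the actual Expand step is the HGT of Section 1.3):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: vocab V, map dim n, expand dim k, reduce dim m.
V, n, k, m = 10000, 64, 1024, 400

# Map: look up a small n-dimensional embedding e_i for each token.
embedding = rng.standard_normal((V, n))

def map_step(token_id):
    return embedding[token_id]        # e_i, shape (n,)

# Expand: stand-in for HGT — a single projection up to k dimensions.
W_expand = rng.standard_normal((n, k))

def expand_step(e_i):
    return e_i @ W_expand             # ^e_i, shape (k,), k >> n

# Reduce: project back down to the model dimension m.
W_reduce = rng.standard_normal((k, m))

def reduce_step(e_i_hat):
    return e_i_hat @ W_reduce         # e_o, shape (m,)

e_o = reduce_step(expand_step(map_step(42)))
print(e_o.shape)  # (400,)
```

Because *n* is small, the V×n look-up table is far cheaper than a standard V×m embedding matrix, which is where the parameter savings come from.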

## 1.3. **Hierarchical Group Transformation (HGT)**

- HGT comprises **a stack of** *N* **layers**.
- HGT **starts with** *gmax* **groups** at the first layer and **then subsequently decreases the number of groups by a factor of 2 at each level.**
- (**Group linear transformations (GLT)**, originally introduced to improve the efficiency of the LSTM, **also sparsify the connections** in a fully connected layer, as shown above. However, the outputs of a certain group are only derived from a small fraction of the input, thus learning **weak representations**.)
- Formally, in HGT, the transformation **from** *ei* **to** ^*ei* at the *l*-th layer is:

^*ei* at layer *l* = *FG*(*ei*, *W1*, *g1*) if *l* = 1, and *FG*(^*ei* from layer *l*−1, *Wl*, *gl*) otherwise,

- where *Wl* are the **weights** learned at the *l*-th layer, and *FG* is a **group transformation function**. **Group transformation splits the input into** *g* **groups**, each of which is processed independently using a linear transformation. The outputs of these groups are then **concatenated to produce the final output**.
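A minimal NumPy sketch of the group transformation *FG* and of the HGT group schedule is given below; the layer widths and *gmax* = 4 are hypothetical choices for illustration:

```python
import numpy as np

def group_linear(x, weights):
    """F_G: split x into len(weights) groups, transform each group
    independently with its own matrix, then concatenate the outputs."""
    chunks = np.split(x, len(weights))
    return np.concatenate([c @ W for c, W in zip(chunks, weights)])

rng = np.random.default_rng(0)

# One GLT layer mapping 64 -> 128 with g = 4 groups uses 4 * (16 * 32)
# = 2048 weights, versus 64 * 128 = 8192 for a dense layer.
x = rng.standard_normal(64)
W4 = [rng.standard_normal((16, 32)) for _ in range(4)]
y = group_linear(x, W4)
print(y.shape)  # (128,)

# HGT schedule: start with g_max groups, halve the group count each
# layer while widening the representation (dims are illustrative).
h = x
for g in (4, 2, 1):                              # g_max -> ... -> 1
    d_in, d_out = h.shape[0], h.shape[0] * 2
    Ws = [rng.standard_normal((d_in // g, d_out // g)) for _ in range(g)]
    h = group_linear(h, Ws)
print(h.shape)  # (512,)
```

Since later layers use fewer, wider groups, each output coordinate eventually depends on the whole input, which is what lets HGT avoid the weak representations of a fixed-group GLT stack.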

# 2. **DeFINE Unit**

- The DeFINE unit is composed of HGT transformations.
- A **simple new skip-connection** is used that establishes **a direct link between any layer in HGT and the input** *ei*, as above.
- The input and the output are **chunked into** *gl* **groups** using a **split layer**. The **chunked input and output vectors** are then **mixed**.

- This mechanism promotes input feature reuse efficiently. Additionally, it establishes a direct link with the input *ei*, allowing gradients to flow back to the input via multiple paths and resulting in improved performance.
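The split-and-mix skip-connection can be sketched as follows; the group-wise interleaving shown here is a simplified reading of the mechanism, with illustrative sizes:

```python
import numpy as np

def split_and_mix(e_i, h, g):
    """Skip-connection sketch: chunk the input e_i and a layer's output h
    into g groups each, then concatenate input chunk j with output chunk j,
    so every group of the next layer sees part of the raw input."""
    in_chunks = np.split(e_i, g)
    out_chunks = np.split(h, g)
    return np.concatenate([np.concatenate([a, b])
                           for a, b in zip(in_chunks, out_chunks)])

rng = np.random.default_rng(0)
e_i = rng.standard_normal(64)      # raw input embedding
h = rng.standard_normal(256)       # output of some HGT layer
mixed = split_and_mix(e_i, h, g=4)
print(mixed.shape)  # (320,)
```

Because each mixed group contains a slice of *ei* verbatim, gradients reach the input directly from every layer, not only through the HGT stack.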

- The **mapping** between the input token and the output of the DeFINE unit (*eo*) can be **cached using a look-up table**, resulting in a mechanism that allows **skipping the computations of the DeFINE unit at inference time.**
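Since the token-to-*eo* mapping is fixed once training ends, the whole unit collapses to a table look-up at inference. A hypothetical sketch of this caching, with a stand-in for the trained DeFINE unit:

```python
import numpy as np

class CachedEmbedding:
    """Inference-time caching: precompute e_o for every token once,
    then skip the DeFINE computation entirely."""
    def __init__(self, define_fn, vocab_size):
        # Run the (expensive) token -> e_o function once per token.
        self.table = np.stack([define_fn(t) for t in range(vocab_size)])

    def __call__(self, token_id):
        return self.table[token_id]   # plain O(1) look-up

# Usage with a stand-in for the trained DeFINE unit (hypothetical):
rng = np.random.default_rng(0)
def fake_define(token_id):
    return rng.standard_normal(8)     # pretend this is Map-Expand-Reduce

cache = CachedEmbedding(fake_define, vocab_size=100)
e_o = cache(3)
print(e_o.shape)  # (8,)
```

The price is memory: the cached table has the same V×m footprint as a standard embedding matrix, so the savings apply to training-time parameters, not inference-time storage.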

- This figure summarizes **different architectures** with **different settings**.

# 3. Results

## 3.1. LSTM Models

- (a): The proposed method further **improves performance by about 3 points while learning only 1.25% (or 0.4 million) more parameters**.

- (b): The depth of DeFINE is scaled from 3 to 11 layers. The performance improves by a further 6 points, delivering competitive performance to existing RNN-based methods with fewer parameters (e.g., 1/3 as many parameters as Merity et al. (2018a)).

- (c): The proposed method **improves the performance of AWD-LSTM by 4 points** while simultaneously **reducing the number of parameters by 4 million.**

## 3.2. Transformer Models

- The proposed method is able to attain **similar performance to Dai et al. (2019)** while learning **10M fewer parameters**.

- Transformer-XL with DeFINE is able to achieve comparable perplexity to a standard Transformer-XL with projective embeddings while using significantly fewer parameters.

## 3.3. Machine Translation

- OpenNMT is used for Transformer model training.

- DeFINE improves the performance of the Transformer model without checkpoint averaging by 2% while simultaneously reducing the total number of parameters by 26%, suggesting that DeFINE is effective.

## 3.4. Further Analyses & Ablations

- DeFINE is able to approximate the standard embedding matrix efficiently.
- DeFINE embeddings can be compressed similarly to standard embeddings without loss of performance.
- Left: HGT improves perplexity by about 5 points while learning a similar number of parameters as GLT.
- Right: Furthermore, when a direct connection is used, the performance further improves by 2.9 points.
- For the same value of *k*, the performance of the language model improves with the increase in the depth *N*. However, when we scale the width *k* for a fixed value of depth *N*, the performance does not improve.
- Left: The proposed skip-connections are more effective.
- Right: The performance with and without this reduction step is similar; however, a model without the reduction step learns more parameters.

## Reference

[2020 ICLR] [DeFINE] DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling
