Brief Review — DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa, First Single Model to Surpass Human Performance on SuperGLUE

Sik-Ho Tsang
6 min read · Jan 21, 2023
Performance on SuperGLUE leaderboard

DeBERTa: Decoding-enhanced BERT with Disentangled Attention,
DeBERTa, by Microsoft Dynamics 365 AI, Microsoft Research
2021 ICLR, Over 700 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, LM, Transformer, BERT

  • DeBERTa (Decoding-enhanced BERT with disentangled attention) is proposed, which improves the BERT and RoBERTa models using two novel techniques.
  • Disentangled attention mechanism: Each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices based on their contents and relative positions.
  • Enhanced mask decoder: Absolute positions are incorporated in the decoding layer to predict the masked tokens during model pre-training.
  • In addition, a new virtual adversarial training method is used for fine-tuning to improve models’ generalization.

Outline

  1. DeBERTa Disentangled Attention Mechanism
  2. DeBERTa Absolute Position Incorporation in Decoding Layer
  3. DeBERTa Scale-invariant Fine-Tuning (SiFT)
  4. Experimental Results

1. DeBERTa Disentangled Attention Mechanism

1.1. Idea

  • For a token at position i in a sequence, it is represented using two vectors, {Hi} and {Pi|j}, which encode its content and its relative position with respect to the token at position j, respectively.
  • The calculation of the cross attention score between tokens i and j can be decomposed into four components as:
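Restating this decomposition from the paper in LaTeX for reference:

```latex
A_{i,j}
= \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^{\top}
= \underbrace{H_i H_j^{\top}}_{\text{content-to-content}}
+ \underbrace{H_i P_{j|i}^{\top}}_{\text{content-to-position}}
+ \underbrace{P_{i|j} H_j^{\top}}_{\text{position-to-content}}
+ \underbrace{P_{i|j} P_{j|i}^{\top}}_{\text{position-to-position}}
```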

It is argued that the position-to-content term is also important, since the attention weight of a word pair depends not only on their contents but also on their relative positions.

1.2. Disentangled Attention Mechanism

The model architecture of DeBERTa (Figure from authors’ slides)
  • Taking single-head attention as an example, the standard self-attention operation is extended to a disentangled form; both are restated in the sketch after this list.
  • Qc, Kc and Vc are the projected content vectors generated using projection matrices Wq,c, Wk,c and Wv,c.
  • P represents the relative position embedding vectors shared across all layers (i.e., staying fixed during forward propagation), and Qr and Kr are the projected relative position vectors generated using projection matrices Wq,r and Wk,r, respectively.
  • Denote k as the maximum relative distance and δ(i, j) ∈ [0, 2k) as the relative distance from token i to token j.
  • The disentangled self-attention with relative position bias sums the content-to-content, content-to-position, and position-to-content terms.
  • The scaling factor of 1/√(3d) is important for stabilizing model training, especially for large-scale Pre-trained Language Models (PLMs).
  • There is an efficient implementation in which no memory needs to be allocated to store a relative position embedding for each query, reducing the space complexity to O(kd) (for storing Kr and Qr). (Please read the paper directly for more details.)
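For reference, the equations referred to above can be restated from the paper in LaTeX (H denotes the input hidden states, d the hidden dimension, and the superscripts c and r mark content and relative-position projections):

```latex
% Standard single-head self-attention
Q = H W_q, \quad K = H W_k, \quad V = H W_v, \quad
A = \frac{Q K^{\top}}{\sqrt{d}}, \quad
H_o = \mathrm{softmax}(A)\, V

% Relative distance from token i to token j (maximum relative distance k)
\delta(i, j) =
\begin{cases}
0         & \text{for } i - j \le -k \\
2k - 1    & \text{for } i - j \ge k  \\
i - j + k & \text{otherwise}
\end{cases}

% Disentangled self-attention with relative position bias
\tilde{A}_{i,j} =
\underbrace{Q^{c}_{i} {K^{c}_{j}}^{\top}}_{\text{content-to-content}}
+ \underbrace{Q^{c}_{i} {K^{r}_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
+ \underbrace{K^{c}_{j} {Q^{r}_{\delta(j,i)}}^{\top}}_{\text{position-to-content}},
\qquad
H_o = \mathrm{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V^{c}
```

The √(3d) (rather than √d) scaling reflects that three score terms are summed instead of one.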

2. DeBERTa Absolute Position Incorporation in Decoding Layer

Enhanced Mask Decoder. (Figure from authors’ slides)
  • Consider the sentence “a new store opened beside the new mall” with the words “store” and “mall” masked for prediction. Using only the local context (e.g., relative positions and surrounding words) is insufficient for the model to distinguish store and mall in this sentence, since both words follow the word “new” with the same relative positions.
  • The model needs to take absolute positions into account as complementary information.
  • The BERT model incorporates absolute positions in the input layer.

In DeBERTa, absolute positions are incorporated right after all the Transformer layers but before the softmax layer for masked token prediction, as above.
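As a minimal sketch of this placement idea only (module and parameter names below are hypothetical, and the authors' actual Enhanced Mask Decoder uses attention-style decoding layers rather than a plain addition), the injection point could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class SimplifiedMaskDecoder(nn.Module):
    """Minimal sketch of the placement idea only, NOT the authors' exact Enhanced
    Mask Decoder: absolute positions are injected after the relative-position
    Transformer body and right before the masked-token softmax."""

    def __init__(self, hidden_size: int, vocab_size: int, max_positions: int = 512):
        super().__init__()
        # Hypothetical names; included just to show where absolute positions enter.
        self.abs_pos_embed = nn.Embedding(max_positions, hidden_size)
        self.transform = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.GELU(), nn.LayerNorm(hidden_size)
        )
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # softmax applied in the loss

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) from the last Transformer layer,
        # which itself used only relative position information.
        positions = torch.arange(hidden_states.size(1), device=hidden_states.device)
        h = hidden_states + self.abs_pos_embed(positions)   # absolute positions enter here
        return self.mlm_head(self.transform(h))             # logits for masked-token prediction
```

BERT, by contrast, adds absolute position embeddings to the token embeddings at the input layer, as noted above.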

3. DeBERTa Scale-invariant Fine-Tuning (SiFT)

Performance on GLUE Dev (Figure from authors’ slides)
  • Virtual adversarial training is a regularization method for improving models’ generalization. It does so by improving a model’s robustness to adversarial examples, which are created by making small perturbations to the input. The model is regularized so that, when given a task-specific example, it produces the same output distribution as it produces on an adversarial perturbation of that example.
  • However, the value ranges (norms) of the embedding vectors vary among different words and models. The variance gets larger for bigger models.
  • The SiFT algorithm is proposed, which improves training stability by applying the perturbations to the normalized word embeddings.
  • The normalization substantially improves the performance of the fine-tuned models. The improvement is more prominent for larger DeBERTa models.
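As a rough illustration only (not the authors' code; `epsilon` and the gradient tensor `grad` are assumed to come from a standard virtual adversarial training loop), the normalize-then-perturb step could be sketched in PyTorch as:

```python
import torch
import torch.nn.functional as F

def sift_perturbed_embeddings(word_embeds: torch.Tensor,
                              grad: torch.Tensor,
                              epsilon: float = 1e-2) -> torch.Tensor:
    """Sketch of the scale-invariant step: perturb LayerNorm-normalized embeddings,
    so the perturbation size no longer depends on the (varying) embedding norms.

    word_embeds: (batch, seq_len, hidden) word embeddings of a task-specific example
    grad:        gradient of the task loss w.r.t. the normalized embeddings
    """
    # 1. Normalize so perturbations have a comparable scale across words and model sizes.
    normalized = F.layer_norm(word_embeds, word_embeds.shape[-1:])
    # 2. Perturb along the normalized gradient direction, as in virtual adversarial training.
    perturbation = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    return normalized + perturbation
```

The model is then encouraged to produce the same output distribution on the perturbed embeddings as on the clean ones, e.g. via a KL-divergence regularization term.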

4. Experimental Results

4.1. Large Models

Comparison results on the GLUE development set.

Compared to BERT and RoBERTa, DeBERTa performs consistently better across all the tasks. Meanwhile, DeBERTa outperforms XLNet in six out of eight tasks.

  • DeBERTa also outperforms other SOTA PLMs, i.e., ELECTRA-large and XLNet-large, in terms of average GLUE score.
Results on MNLI in/out-domain, SQuAD v1.1, SQuAD v2.0, RACE, ReCoRD, SWAG, CoNLL 2003 NER development set.

Compared to the previous SOTA PLMs with a similar model size (i.e., BERT, RoBERTa, XLNet, ALBERT-large, and Megatron-336M), DeBERTa shows superior performance in all seven tasks.

4.2. Base Models

Results on MNLI in/out-domain (m/mm), SQuAD v1.1 and v2.0 development set.

Across all three tasks, DeBERTa consistently outperforms RoBERTa and XLNet, and by a larger margin than in the large-model comparison.

4.3. Ablation Study

Ablation study of the DeBERTa base model.

Removing any one component in DeBERTa results in a clear performance drop. Similarly, removing either content-to-position or position-to-content attention leads to inferior performance on all the benchmarks. As expected, removing two components results in an even more substantial loss in performance.

4.4. Scale Up to 1.5 Billion Parameters

SuperGLUE test set results scored using the SuperGLUE evaluation server.
  • A larger version of DeBERTa, with 1.5 billion parameters, 48 layers, a hidden size of 1,536, and 24 attention heads, denoted as DeBERTa-1.5B, is built with some optimizations (details in the paper).

The single DeBERTa-1.5B model surpasses human performance on SuperGLUE for the first time in terms of macro-average score (89.9 versus 89.8) as of December 29, 2020, and the ensemble DeBERTa model sits atop the SuperGLUE benchmark rankings as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).

  • Compared to T5, which consists of 11 billion parameters, the 1.5-billion-parameter DeBERTa is much more energy-efficient to train and maintain, and it is easier to compress and deploy to applications in various settings.

