Review: MT-DNN (NLP)

Multi-Task Learning for Multi-Task Deep Neural Network

Sik-Ho Tsang
4 min read · Mar 18, 2022

Multi-Task Deep Neural Networks for Natural Language Understanding
MT-DNN, by Microsoft Research and Microsoft Dynamics 365 AI
2019 ACL, Over 700 Citations (Sik-Ho Tsang @ Medium)
Language Model, Natural Language Processing, NLP, BERT

  • It is easier for a person who knows how to ski to learn skating than for one who does not.
  • Multi-Task Learning (MTL) is inspired by such human learning activities. MT-DNN applies MTL on top of BERT to further improve performance.


  1. MT-DNN: Multi-Task Learning
  2. Experimental Results

1. MT-DNN: Multi-Task Learning

Architecture of the MT-DNN model for representation learning
  • The lower layers are shared across all tasks, while the top layers represent task-specific outputs.
  • l1: The input X, a word sequence (either a sentence or a pair of sentences packed together), is first represented as a sequence of embedding vectors, one for each word.
  • l2: Then, the transformer encoder captures the contextual information for each word via self-attention, and generates a sequence of contextual embeddings in l2. This is the shared semantic representation that is trained by the proposed multi-task objectives.
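The shared-bottom/task-specific-top layout described above can be sketched in PyTorch. This is a minimal toy sketch, not the paper's implementation: the shared encoder stands in for BERT (the real model initializes it from BERT_LARGE), and all sizes, vocabulary, and task names are illustrative.

```python
import torch
import torch.nn as nn

class MTDNN(nn.Module):
    """Toy sketch of the MT-DNN layout: shared lower layers (lexicon +
    Transformer encoder) feeding task-specific output heads."""
    def __init__(self, vocab=1000, hidden=64, num_classes=3):
        super().__init__()
        # Shared lower layers (in the paper: BERT's lexicon + Transformer encoder).
        self.shared_encoder = nn.Sequential(
            nn.Embedding(vocab, hidden),  # stand-in lexicon encoder (l1)
            nn.TransformerEncoder(        # stand-in Transformer encoder (l2)
                nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
                num_layers=2),
        )
        # Task-specific top layers, one head per task type.
        self.heads = nn.ModuleDict({
            "classification": nn.Linear(hidden, num_classes),
            "similarity":     nn.Linear(hidden, 1),
            "ranking":        nn.Linear(hidden, 1),
        })

    def forward(self, token_ids, task):
        ctx = self.shared_encoder(token_ids)  # contextual embeddings C (l2)
        cls = ctx[:, 0]                       # [CLS] token representation
        return self.heads[task](cls)

model = MTDNN()
logits = model(torch.randint(0, 1000, (2, 16)), task="classification")
print(logits.shape)  # torch.Size([2, 3])
```

Only the heads differ per task; every update through any head also trains the shared encoder, which is what makes the representation multi-task.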

1.1. Lexicon Encoder (l1)

  • The input X={x1, …, xm} is a sequence of tokens of length m.
  • Following BERT, the first token x1 is always the [CLS] token. If X is packed by a sentence pair (X1, X2), the two sentences are separated with a special token [SEP].

The lexicon encoder maps X into a sequence of input embedding vectors, one for each token.

  • (Please feel free to read BERT.)
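The [CLS]/[SEP] packing can be illustrated with a small helper. This is a hedged sketch of the convention only: tokens here are whitespace-split words for readability, whereas the real lexicon encoder uses WordPiece tokenization, and `pack_input` is an illustrative name, not an API from the paper.

```python
# Sketch of how a single sentence or a sentence pair (X1, X2) is packed
# for the lexicon encoder, following the BERT convention described above.
def pack_input(sentence1, sentence2=None):
    tokens = ["[CLS]"] + sentence1.split()   # x1 is always [CLS]
    if sentence2 is not None:
        # sentence pairs are separated (and terminated) by [SEP]
        tokens += ["[SEP]"] + sentence2.split() + ["[SEP]"]
    return tokens

print(pack_input("the cat sat", "on the mat"))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'on', 'the', 'mat', '[SEP]']
```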

1.2. Transformer Encoder (l2)

  • A multilayer bidirectional Transformer encoder is used to map the input representation vectors (l1) into a sequence of contextual embedding vectors C.

But unlike BERT, MT-DNN learns the representation using multi-task objectives, in addition to pre-training.

1.3. Multi-Task Learning (MTL)

1.3.1. Classification

  • For the classification tasks (i.e., single-sentence or pairwise text classification), the cross-entropy loss is used as the objective: -Σ_c 1(X, c) log(P_r(c|X)),
  • where 1(X, c) is the binary indicator (0 or 1) of whether class label c is the correct classification for X, P_r(c|X) is the softmax output of the task-specific layer, and W_SST is the task-specific parameter matrix.
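A minimal sketch of this objective, assuming x is the [CLS] embedding and that the task-specific layer is a softmax over W_SST · x; all tensor sizes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 64)                # [CLS] embeddings for a batch
W_sst = torch.randn(3, 64)            # task-specific parameter matrix
labels = torch.tensor([0, 2, 1, 0])   # gold class labels

logits = x @ W_sst.T                  # scores before softmax: W_SST . x
# cross_entropy = softmax + negative log likelihood of the gold class,
# i.e. -sum_c 1(X, c) log P_r(c|X)
loss = F.cross_entropy(logits, labels)
print(loss.item() > 0)  # True
```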

1.3.2. Text Similarity

  • For the text similarity tasks, such as STS-B, where each sentence pair is annotated with a real-valued score y, the mean squared error is used: (y - Sim(X1, X2))²,
  • where Sim(X1, X2) is the similarity score computed by the task-specific layer.
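A sketch of the regression objective, with made-up scores in place of the model's Sim(X1, X2) outputs:

```python
import torch

# Illustrative similarity predictions Sim(X1, X2) and gold scores y;
# the objective is the mean squared error between them.
sim_scores = torch.tensor([2.5, 4.0, 1.0])  # model outputs (illustrative)
y = torch.tensor([3.0, 4.2, 0.5])           # annotated real-valued scores

loss = torch.mean((y - sim_scores) ** 2)    # (y - Sim(X1, X2))^2
print(round(loss.item(), 4))
```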

1.3.3. Relevance Ranking

  • For the relevance ranking tasks, given a query Q, a list of candidate answers A is obtained, which contains a positive example A+ (the candidate that includes the correct answer) and |A|-1 negative examples.
  • The negative log likelihood of the positive example given the query is minimized across the training data: -Σ_(Q, A+) log P_r(A+|Q), with P_r(A|Q) = exp(γ·Rel(Q, A)) / Σ_(A'∈A) exp(γ·Rel(Q, A')),
  • where Rel(.) is the relevance score, and γ=1.
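This amounts to a softmax over the candidates' relevance scores followed by the negative log likelihood of the positive one. A sketch with made-up Rel(Q, A) scores:

```python
import torch
import torch.nn.functional as F

gamma = 1.0                                        # as in the paper
rel_scores = torch.tensor([2.0, 0.5, -1.0, 0.0])   # Rel(Q, A), illustrative
pos_index = 0                                      # A+ is the first candidate

# P_r(A|Q): softmax of gamma * Rel(Q, A) over the candidate list
log_probs = F.log_softmax(gamma * rel_scores, dim=0)
loss = -log_probs[pos_index]                       # -log P_r(A+|Q)
print(loss.item() > 0)  # True
```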

During training, each mini-batch contains samples from a single task; in every epoch, the mini-batches of all tasks are packed together and shuffled.

For each mini-batch, the loss of its own task is minimized, so the model alternates between the different task-specific objectives.
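The training schedule can be sketched as follows. This is a simplified illustration of the batching idea only; `mtl_epoch`, `update_fn`, and the task names are hypothetical, and the real update applies the matching loss from the sections above.

```python
import random

def mtl_epoch(batches_by_task, update_fn, seed=0):
    """One epoch of multi-task training: pool every task's mini-batches,
    shuffle them, and update with each batch's own task objective."""
    batches = [(task, b) for task, bs in batches_by_task.items() for b in bs]
    random.Random(seed).shuffle(batches)  # interleave tasks within the epoch
    for task, batch in batches:
        update_fn(task, batch)            # apply that task's specific loss

seen = []
mtl_epoch(
    {"classification": [[1, 2], [3]], "ranking": [[4]]},
    update_fn=lambda task, batch: seen.append(task),
)
print(sorted(seen))  # ['classification', 'classification', 'ranking']
```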

2. Experimental Results

2.1. GLUE

GLUE test set results scored using the GLUE evaluation server
  • MT-DNN: The pre-trained BERTLARGE is used to initialize its shared layers, the model is refined via MTL on all GLUE tasks, and the model is fine-tuned for each GLUE task using task-specific data.

MT-DNN outperforms all existing systems on all tasks except WNLI, creating new state-of-the-art results on eight GLUE tasks and pushing the benchmark to 82.7%, a 2.2% absolute improvement over BERTLARGE.

  • MT-DNN without fine-tuning still outperforms BERTLARGE consistently on all tasks but CoLA.
GLUE dev set results
  • ST-DNN stands for Single-Task DNN. It uses the same model architecture as MT-DNN. But its shared layers are the pre-trained BERT model without being refined via MTL. ST-DNN is then fine-tuned for each GLUE task using task-specific data.

On all four tasks (MNLI, QQP, RTE and MRPC), ST-DNN outperforms BERT. That ST-DNN significantly outperforms BERT clearly demonstrates the importance of problem formulation.

2.2. SNLI and SciTail

Results on the SNLI and SciTail dataset

MT-DNNLARGE generates new state-of-the-art results on both datasets, pushing the benchmarks to 91.6% on SNLI (1.5% absolute improvement) and 95.0% on SciTail (6.7% absolute improvement), respectively.


[2019 ACL] [MT-DNN]
Multi-Task Deep Neural Networks for Natural Language Understanding

Natural Language Processing (NLP)

Language/Sequence Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GloVe] [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] 2017 [TagLM] [CoVe] [MoE] 2018 [GLUE] [T-DMCA] [GPT] [ELMo] 2019 [T64] [Transformer-XL] [BERT] [RoBERTa] [GPT-2] [DistilBERT] [MT-DNN]

My Other Previous Paper Readings
