Review — T5: Text-to-Text Transfer Transformer

Language Model where Input: Text, Output: Text

Sik-Ho Tsang
Apr 23, 2022

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, T5, by Google
2020 JMLR, Over 3000 Citations (Sik-Ho Tsang @ Medium)
Language Model, Natural Language Processing, NLP, Transformer

  • A unified framework that converts all text-based language problems into a text-to-text format.
  • By combining insights from scaling with the new C4 (“Colossal Clean Crawled Corpus”) dataset, state-of-the-art results are achieved.
  • (The paper runs to 67 pages on arXiv, so only some key points are covered here. Please feel free to read the paper directly for more details. For a quick read, see Sections 1, 2 and 6.)


  1. T5: Text-to-Text Framework
  2. C4: Colossal Clean Crawled Corpus
  3. Ablation Experiments for Model
  4. Ablation Experiments for Dataset
  5. Ablation Experiments for Training Strategies
  6. SOTA Comparisons

1. T5: Text-to-Text Framework


1.1. Unified Input & Output Format

  • T5 means “Text-to-Text Transfer Transformer”: Every task considered — including translation, question answering, and classification — is cast as feeding the T5 model text as input and training it to generate some target text.
  • Translation: To translate the sentence “That is good.” from English to German, the model is fed the sequence “translate English to German: That is good.” and is trained to output “Das ist gut.”
  • Text classification: The model simply predicts a single word corresponding to the target label. For example, on the MNLI benchmark, the goal is to predict whether a premise implies (“entailment”), contradicts (“contradiction”), or neither (“neutral”) a hypothesis. The input sequence becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.”. The only valid outputs are “entailment”, “neutral”, or “contradiction”; any other output is treated as a wrong prediction.
  • Regression: In STS-B, the goal is to predict a similarity score between 1 and 5. Scores are rounded to the nearest increment of 0.2 and the result is predicted as text (e.g. 2.57 becomes “2.6”).
  • (The paper details how each task is converted to this format; please feel free to read the paper directly if interested.)
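
The text-to-text conversion described above can be sketched in a few lines of Python. This is only an illustration: the helper names (translation_example, mnli_input, stsb_target) are hypothetical, and the prefixes simply mirror the examples quoted in this section.

```python
import math


def translation_example(source, target, src_lang="English", tgt_lang="German"):
    """Build an (input text, target text) pair for a translation task."""
    return f"translate {src_lang} to {tgt_lang}: {source}", target


def mnli_input(premise, hypothesis):
    """Build the input text for MNLI; the target is the label word itself."""
    return f"mnli premise: {premise} hypothesis: {hypothesis}"


def stsb_target(score):
    """Round a similarity score to the nearest 0.2 and render it as text."""
    return f"{math.floor(score * 5 + 0.5) / 5:.1f}"


print(translation_example("That is good.", "Das ist gut."))
print(mnli_input("I hate pigeons.",
                 "My feelings towards pigeons are filled with animosity."))
print(stsb_target(2.57))
```

Because every task ends up as a pair of strings, the same model, loss and decoding procedure can be reused across tasks.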

1.2. Encoder-Decoder Transformer Model

  • T5 uses an encoder-decoder Transformer implementation that closely follows the original Transformer, with the differences below.
  • (Please feel free to read about the Transformer if interested.)
  • A simplified layer normalization is used, in which the activations are only rescaled and no additive bias is applied. After layer normalization, a residual skip connection, originating from ResNet, adds each subcomponent’s input to its output.
  • Instead of using a fixed embedding for each position, relative position embeddings (Shaw NAACL’18) produce a different learned embedding according to the offset (distance) between the “key” and “query” being compared in the self-attention mechanism.
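
The two modifications above can be illustrated in plain Python. The sketch assumes that “only rescaled” means dividing the activations by their root mean square and multiplying by a learned scale, with no mean subtraction and no bias. The function names, the eps value and the clipping distance max_distance are illustrative choices, not values taken from the paper.

```python
import math


def simplified_layer_norm(x, scale, eps=1e-6):
    """Rescale activations by their root mean square; no bias is added."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [s * v * inv_rms for s, v in zip(scale, x)]


def relative_offsets(query_len, key_len, max_distance):
    """Clipped offset (key position - query position) for every query/key pair.

    Each distinct offset would index one learned embedding, so a position is
    described by its distance to the query rather than by an absolute index.
    """
    return [
        [max(-max_distance, min(max_distance, k - q)) for k in range(key_len)]
        for q in range(query_len)
    ]


normed = simplified_layer_norm([3.0, 4.0], scale=[1.0, 1.0])
print(normed)
print(relative_offsets(3, 3, max_distance=1))
```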

1.3. Training

  • A combination of model and data parallelism is used to train models on “slices” of Cloud TPU Pods. TPU pods are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with supporting CPU host machines.

2. C4: Colossal Clean Crawled Corpus

  • Common Crawl is a publicly-available web archive that provides “web extracted text” by removing markup and other non-text content from the scraped HTML files. This process produces around 20TB of scraped text data each month. However, not all of this text is clean or useful.
  • Rules to clean up the dataset:
  1. Only retained lines that ended in a terminal punctuation mark (i.e. a period, exclamation mark, question mark, or end quotation mark).
  2. Discarded any page with fewer than 5 sentences and only retained lines that contained at least 3 words.
  3. Removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”.
  4. Removed any line with the word Javascript.
  5. Removed any page where the phrase “lorem ipsum” appeared.
  6. Removed any pages that contained a curly bracket.
  7. Deduplicated the data set by discarding all but one of any three-sentence span occurring more than once in the data set.
  8. Additionally, since most of the downstream tasks are focused on English-language text, langdetect is used to filter out any pages that were not classified as English with a probability of at least 0.99.
  • The web extracted text is from April 2019. The filtered dataset (about 750 GB) is not only orders of magnitude larger than most data sets used for pre-training, but also comprises reasonably clean and natural English text. This data set is called the “Colossal Clean Crawled Corpus” (or C4 for short) and is released as part of TensorFlow Datasets.
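
Most of the rules above are simple enough to sketch as a single page filter. The version below is a rough illustration, not the actual C4 pipeline: the function name clean_page, the one-entry BAD_WORDS set and the regex-based sentence counter are all assumptions made for the example, and the deduplication and language-detection rules are left out. The thresholds default to the values listed above.

```python
import re

# Placeholder for the real word list; this single entry is hypothetical.
BAD_WORDS = {"badword"}
TERMINAL_MARKS = (".", "!", "?", '"', "”")


def clean_page(text, min_sentences=5, min_words=3):
    """Return the cleaned page text, or None if the whole page is discarded."""
    lowered = text.lower()
    # Page-level rules: placeholder text, curly brackets, bad words.
    if "lorem ipsum" in lowered or "{" in text or "}" in text:
        return None
    if BAD_WORDS & set(re.findall(r"[a-z']+", lowered)):
        return None

    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_MARKS):
            continue  # no terminal punctuation mark
        if len(line.split()) < min_words:
            continue  # too few words on the line
        if "javascript" in line.lower():
            continue  # e.g. a "please enable Javascript" notice
        kept.append(line)

    cleaned = "\n".join(kept)
    # Crude sentence count: runs of . ! ? followed by whitespace or the end.
    num_sentences = len(re.findall(r'[.!?]+["”]?(?:\s|$)', cleaned))
    return cleaned if num_sentences >= min_sentences else None


page = "\n".join([
    "Home | About | Contact",
    "This is the first sentence.",
    "Here is another one!",
    "Please enable Javascript to view this page.",
    "Is this the third one?",
    "A fourth sentence follows here.",
    "And the fifth one ends the page.",
])
print(clean_page(page))
```

In this example the navigation line and the Javascript notice are dropped, while the remaining five lines survive as the cleaned page.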

3. Ablation Experiments for Model

3.1. Baseline

  • The baseline model is designed so that the encoder and decoder are each similar in size and configuration to “BERT_BASE”. Specifically, both the encoder and decoder consist of 12 blocks, and the whole model has about 220 million parameters.
Schematic for Objective Function
  • The words “for”, “inviting” and “last” (marked with an ×) are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as <X> and <Y>) that is unique over the example.
  • The aim is to mask consecutive spans of tokens and only predict dropped-out tokens during pretraining.
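
The span-corruption objective described above can be sketched as follows. The example sentence is assumed to be the one used in the paper's figure, and the closing sentinel appended to the target (shown as <Z>) follows the paper's description as I understand it. The function name span_corrupt and its default sentinel list are hypothetical.

```python
def span_corrupt(tokens, corrupted, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each run of corrupted tokens with one sentinel.

    Returns (inputs, targets). The targets list each sentinel followed by the
    tokens it replaced, and end with one extra closing sentinel. The default
    sentinel list supports at most two spans.
    """
    inputs, targets = [], []
    next_sentinel = 0
    previous_corrupted = False
    for index, token in enumerate(tokens):
        if index in corrupted:
            if not previous_corrupted:
                # A new span starts: emit its sentinel on both sides.
                inputs.append(sentinels[next_sentinel])
                targets.append(sentinels[next_sentinel])
                next_sentinel += 1
            targets.append(token)
            previous_corrupted = True
        else:
            inputs.append(token)
            previous_corrupted = False
    targets.append(sentinels[next_sentinel])  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, corrupted={2, 3, 8})
print(" ".join(inputs))
print(" ".join(targets))
```

Only the dropped-out tokens appear in the target, which keeps target sequences short compared with reconstructing the full sentence.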
Baseline Performance

With pretraining, the baseline outperforms the one without pretraining by a large margin.

3.2. Ablation Study Procedures

Overall Ablation Study Procedures
  • Different experiments are performed to determine the best strategies for different components in a greedy manner, as shown above.

3.3. Architectures

Matrices representing different attention mask patterns
Schematics of the Transformer architecture variants (x is input, y is output)
  • Three model variants are considered:
  1. Encoder-decoder Transformer: The encoder uses a “fully-visible” attention mask. The self-attention operations in the Transformer’s decoder use a “causal” masking pattern. The decoder in an encoder-decoder Transformer is used to autoregressively produce an output sequence.
  2. LM: A Transformer decoder (without an encoder) can be used as a language model (LM), i.e. a model trained solely for next-step prediction, similar to GPT.
  3. Prefix LM: Instead of a pure LM, fully-visible masking can be used during the prefix portion of the sequence (x), with causal masking over the rest. This masking pattern and a schematic of the resulting “prefix LM” are shown above.
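
The three masking patterns can be written out as small 0/1 matrices. This is a minimal sketch in plain Python; the function name attention_mask and the pattern labels are illustrative, and mask[i][j] == 1 is taken to mean that output position i may attend to input position j.

```python
def attention_mask(seq_len, pattern, prefix_len=0):
    """Build a seq_len x seq_len mask for one of three attention patterns."""
    if pattern not in ("fully_visible", "causal", "prefix"):
        raise ValueError(f"unknown pattern: {pattern}")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if pattern == "fully_visible":
                allowed = True
            elif pattern == "causal":
                allowed = j <= i
            else:  # "prefix": full visibility over the prefix, causal after it
                allowed = j <= i or j < prefix_len
            row.append(1 if allowed else 0)
        mask.append(row)
    return mask


for name in ("fully_visible", "causal", "prefix"):
    print(name)
    for row in attention_mask(4, name, prefix_len=2):
        print(row)
```

With prefix_len=0 the prefix pattern reduces to the causal one, and with prefix_len equal to the sequence length it becomes fully visible.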
Performance of the different architectural variants

For all tasks, the encoder-decoder architecture with the denoising objective performed best.

3.4. Unsupervised Objectives

Examples of inputs and targets produced by some of the unsupervised objectives
  • Several objectives are shown above, but only three of them are compared below.
  • (Please read the examples above to understand their objectives.)
Performance of the three disparate pre-training objectives

The BERT-style objective performs best, though the prefix language modeling objective attains similar performance on the translation tasks.

Comparison of variants of the BERT-style pre-training objective
  • In the first two variants, the model is trained to reconstruct the original uncorrupted text segment. In the latter two, the model only predicts the sequence of corrupted tokens.

All of these variants perform similarly.

  • The only exception was that dropping corrupted tokens completely produced a small improvement in the GLUE score thanks to a significantly higher score on CoLA (60.04, compared to the baseline average of 53.84)

3.5. Corruption Rates

Performance of the i.i.d. corruption objective with different corruption rates
  • Using a larger corruption rate also results in longer targets, which can potentially slow down training.

Based on these results and the historical precedent set by BERT, a corruption rate of 15% is used going forward.

3.6. Corrupted Span Length

Performance of the span-corruption objective for different average span lengths
  • When multiple consecutive tokens have been corrupted, they are treated as a “span” and a single unique mask token is used to replace the entire span.

A limited difference is found between these objectives, though the version with an average span length of 10 slightly underperforms the other values in some cases.
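
The effect of the corruption rate and the span length on sequence lengths can be worked out with a little arithmetic. The sketch below assumes that each span is replaced by one sentinel in the input, and that the target contains one sentinel per span, the dropped tokens and one closing sentinel. The function name and the 500-token example are illustrative.

```python
def span_corruption_lengths(seq_len, corruption_rate, mean_span_length):
    """Estimate input/target lengths after span corruption."""
    num_corrupted = round(seq_len * corruption_rate)
    num_spans = max(1, round(num_corrupted / mean_span_length))
    return {
        "num_corrupted": num_corrupted,
        "num_spans": num_spans,
        # Kept tokens plus one sentinel per span.
        "input_len": seq_len - num_corrupted + num_spans,
        # Dropped tokens, one sentinel per span, one closing sentinel.
        "target_len": num_corrupted + num_spans + 1,
    }


print(span_corruption_lengths(500, 0.15, 3))
print(span_corruption_lengths(500, 0.50, 3))
```

Under these assumptions, a higher corruption rate lengthens the target, while a longer average span shortens both the input and the target because fewer sentinels are needed.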

4. Ablation Experiments for Dataset

4.1. Unlabeled Datasets

Performance resulting from pre-training on different data sets
  • Different pretraining datasets are tried.
  1. C4: The data set described in Section 2 of this article.
  2. C4, unfiltered: C4 but without filtering, to know the effect of the heuristic filtering.
  3. RealNews-like: C4, but only including content from the domains used in the “RealNews” data set.
  4. WebText-like: Similarly, the WebText data set only uses content from webpages that were submitted to Reddit and received a score of at least 3.
  5. Wikipedia: English Wikipedia text data from TensorFlow Datasets.
  6. Wikipedia+TBC: Toronto Books Corpus (TBC) contains text extracted from eBooks and is combined with Wikipedia, following BERT.

Pre-training on in-domain unlabeled data can improve performance on downstream tasks (e.g. unlabeled news data helps a downstream news dataset). But this is unsatisfying if the goal is to pre-train a model that can rapidly adapt to language tasks from arbitrary domains.

4.2. Unlabeled Dataset Sizes

Measuring the effect of repeating data during pre-training
  • The baseline is pre-trained on 2³⁵ ≈ 34B tokens, which is only a fraction of the full C4.
  • Truncated variants of C4 consisting of 2²⁹, 2²⁷, 2²⁵ and 2²³ tokens are also tried. These sizes correspond to repeating the data set 64, 256, 1,024, and 4,096 times respectively over the course of pre-training.
  • As expected, performance degrades as the data set size shrinks.
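
The repeat counts follow directly from dividing the pre-training budget of 2³⁵ tokens by the size of each truncated data set. A quick check in Python (the names BASELINE_TOKENS and num_repeats are just for this example):

```python
BASELINE_TOKENS = 2 ** 35  # about 34 billion tokens seen during pre-training


def num_repeats(dataset_tokens, total_tokens=BASELINE_TOKENS):
    """How many passes over the data set are needed to reach the budget."""
    return total_tokens // dataset_tokens


print(f"{BASELINE_TOKENS:,} tokens in total")
for exponent in (29, 27, 25, 23):
    repeats = num_repeats(2 ** exponent)
    print(f"2^{exponent} tokens -> repeated {repeats:,} times")
```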
Pre-training loss for the original C4 data set as well as 4 artificially truncated versions
  • Using a smaller data set size results in smaller training loss values, which may suggest some memorization of the unlabeled data set.

Authors suggested using large pre-training data sets whenever possible.

5. Ablation Experiments for Training Strategies

5.1. Fine-Tuning Methods

  • Different fine-tuning methods are tried:
  1. Fine-tuning all parameters.
  2. Adapter Layers: keeping most of the original model fixed while fine-tuning. Adapter layers are additional dense-ReLU-dense blocks that are added after each of the preexisting feed-forward networks in each block of the Transformer.
  3. Gradual Unfreezing: More and more of the model’s parameters are fine-tuned over time. At the start of fine-tuning, only the parameters of the final layer are updated, then after training for a certain number of updates the parameters of the second-to-last layer are also included, and so on until the entire network’s parameters are being fine-tuned.

It is found that gradual unfreezing caused a minor degradation in performance across all tasks, though it did provide some speedup during fine-tuning.
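
A gradual unfreezing schedule can be sketched as a function from the training step to the set of trainable layers. The version below assumes equal-length stages and a single stack of layers; the function name and the numbers in the usage example are hypothetical, not the paper's settings.

```python
def trainable_layers(step, num_layers, steps_per_stage):
    """Indices of the layers updated at a given fine-tuning step.

    Stage 0 trains only the last layer; each later stage unfreezes one more
    layer, working backwards until the whole stack is trainable.
    """
    stage = step // steps_per_stage
    num_unfrozen = min(num_layers, stage + 1)
    return list(range(num_layers - num_unfrozen, num_layers))


for step in (0, 100, 250, 1000):
    print(step, trainable_layers(step, num_layers=4, steps_per_stage=100))
```

Early stages update only a few layers, which is one plausible source of the speedup mentioned above.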

5.2. Multi-Task Learning

Comparison of multi-task training using different mixing strategies
  • Multi-task learning trains the model on multiple tasks at a time. Different strategies for mixing the tasks are tried. (Please read the paper if interested.)

However, multi-task learning underperforms pre-training followed by fine-tuning on most tasks.
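
This review does not detail the mixing strategies. As a hedged illustration, the sketch below shows two of the ideas as I recall them from the paper: sampling each task in proportion to its (optionally capped) number of examples, and temperature scaling, which raises each rate to the power 1/T before renormalizing. The function name, the parameter names and the task sizes are made up for the example.

```python
def mixing_rates(task_sizes, limit=None, temperature=1.0):
    """Sampling probability per task from its number of examples.

    limit caps the size of any single task before normalizing;
    temperature > 1 flattens the distribution towards equal mixing.
    """
    capped = {
        task: size if limit is None else min(size, limit)
        for task, size in task_sizes.items()
    }
    scaled = {task: size ** (1.0 / temperature) for task, size in capped.items()}
    total = sum(scaled.values())
    return {task: value / total for task, value in scaled.items()}


sizes = {"small_task": 100, "large_task": 300}  # hypothetical example counts
print(mixing_rates(sizes))
print(mixing_rates(sizes, limit=100))
print(mixing_rates(sizes, temperature=2.0))
```

Both the cap and the temperature keep very large tasks from crowding out small ones during training.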

5.3. Combining Multi-Task Learning with Fine-Tuning

Comparison of unsupervised pre-training, multi-task learning, and various forms of multi-task pre-training
  • The authors tried to improve multi-task learning by combining it with fine-tuning. (Please read the paper if interested.)

5.4. Scaling

Comparison of different methods of scaling up the baseline model
  • There are a variety of possible ways to scale, including using a bigger model, training the model for more steps, and ensembling.
  • There is no large difference between training a 2× bigger model for 2× as long and training a 4× bigger model on any of the tasks.

This suggests that increasing the training time and increasing the model size can be complementary means of improving performance. The results also suggest that ensembling provides an orthogonal and effective means of improving performance through scale.

6. SOTA Comparisons

Performance of T5 variants on every task
  • Small, Base, Large, 3B, and 11B refer to model configurations with 60 million, 220 million, 770 million, 3 billion, and 11 billion parameters, respectively. (The sizes are obtained by changing the model hyperparameters.)

Overall, state-of-the-art performance is achieved on 18 out of the 24 tasks.

  • The T5-3B model variant did beat the previous state of the art in a few tasks, but scaling the model size to 11 billion parameters was the most important ingredient for achieving the best performance.
  • e.g.: For SQuAD, T5 outperformed the previous state-of-the-art ALBERT by over one point on the Exact Match score.
  • For SuperGLUE, T5 improved upon the state-of-the-art RoBERTa by a large margin from an average score of 84.6 to 88.9.
Performance comparison of T5-Base on the validation set
  • A further experiment is performed on the three configurations above:
  1. The standard baseline model, which was pre-trained on 2³⁵ ≈ 34B tokens.
  2. Baseline-1T: The baseline trained instead for about 1 trillion tokens (i.e. the same amount of pre-training used for T5).
  3. T5-Base.

T5-Base performs substantially better than Baseline-1T, suggesting that scale is not the only factor that contributes to T5’s success.

This story is a bit long though I already tried to shorten it.


