Review — SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

SuperGLUE: A More Difficult Benchmark Than GLUE

Sik-Ho Tsang
Apr 16, 2022
https://super.gluebenchmark.com/

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
SuperGLUE, by New York University, Facebook AI Research, University of Washington, and DeepMind
2019 NeurIPS, Over 700 Citations (Sik-Ho Tsang @ Medium)
Language Model, Natural Language Processing, NLP, BERT, GLUE

  • The state-of-the-art (SOTA) performance on the GLUE benchmark, introduced a little over one year earlier, has recently surpassed the level of non-expert humans, suggesting limited headroom for further research.
  • SuperGLUE is introduced: a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

Outline

  1. Previous GLUE
  2. Proposed SuperGLUE
  3. Benchmarking Results

1. Previous GLUE

GLUE benchmark performance
  • The NLP research development progress has eroded headroom on the GLUE benchmark dramatically.
  • While some tasks and some linguistic phenomena measured in GLUE remain difficult, the current state-of-the-art GLUE score as of early July 2019 (88.4 from Yang et al., 2019) surpasses human performance (87.1 from Nangia and Bowman, 2019) by 1.3 points, and in fact exceeds this human performance estimate on four tasks.
  • GPT and BERT achieved scores of 72.8 and 80.2 respectively.

A more difficult SuperGLUE benchmark is developed.

2. Proposed SuperGLUE

  • SuperGLUE has the same high-level motivation as GLUE: to provide a simple, hard-to-game measure of progress toward general-purpose language understanding technologies for English.
  • It is anticipated that significant progress on SuperGLUE should require substantive innovations in a number of core areas of machine learning, including sample-efficient, transfer, multitask, and unsupervised or self-supervised learning.

2.1. Tasks

The tasks included in SuperGLUE
  • WSD stands for word sense disambiguation. NLI is natural language inference. coref. is coreference resolution. QA is question answering.
Development set examples from the tasks in SuperGLUE. Bold text represents part of the example format for each task.
  • BoolQ (Boolean Questions, Clark et al., 2019a) is a QA task where each example consists of a short passage and a yes/no question about the passage.
  • CB (CommitmentBank, de Marneffe et al., 2019) is a corpus of short texts in which at least one sentence contains an embedded clause. Each of these embedded clauses is annotated with the degree to which the writer of the text appears committed to the truth of the clause.
  • COPA (Choice of Plausible Alternatives, Roemmele et al., 2011) is a causal reasoning task in which a system is given a premise sentence and must determine either the cause or effect of the premise from two possible choices.
  • MultiRC (Multi-Sentence Reading Comprehension, Khashabi et al., 2018) is a QA task where each example consists of a context paragraph, a question about that paragraph, and a list of possible answers. The system must predict which answers are true and which are false.
  • ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset, Zhang et al., 2018) is a multiple-choice QA task. Each example consists of a news article and a Cloze-style question about the article in which one entity is masked out. The system must predict the masked out entity from a list of possible entities in the provided passage.
  • RTE (Recognizing Textual Entailment) datasets come from a series of annual competitions on textual entailment. All datasets (RTE1, RTE2, RTE3, and RTE5) are combined and converted to two-class classification: entailment and not_entailment.
  • WiC (Word-in-Context, Pilehvar and Camacho-Collados, 2019) is a word sense disambiguation task cast as binary classification of sentence pairs. Given two text snippets and a polysemous word that appears in both sentences, the task is to determine whether the word is used with the same sense in both sentences.
  • WSC (Winograd Schema Challenge, Levesque et al., 2012) is a coreference resolution task in which examples consist of a sentence with a pronoun and a list of noun phrases from the sentence. The system must determine the correct referent of the pronoun from among the provided choices.
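
The original post does not include any code, but as a quick way to inspect these tasks, here is a minimal sketch assuming the Hugging Face `datasets` library, which hosts the benchmark under the `super_glue` name with one configuration per task (the library and naming are assumptions outside the paper itself):

```python
# Hedged sketch (not from the paper): loading SuperGLUE tasks via the
# Hugging Face `datasets` library, assuming the "super_glue" dataset name.
from datasets import load_dataset

for task in ["boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc"]:
    ds = load_dataset("super_glue", task, split="validation")
    print(task, len(ds), ds.column_names)

# A single BoolQ example: a passage, a yes/no question, and a binary label.
boolq = load_dataset("super_glue", "boolq", split="validation")
print(boolq[0]["passage"][:80], boolq[0]["question"], boolq[0]["label"])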

2.2. Scoring

  • A simple approach is to weight each task equally. For tasks with multiple metrics, those metrics are first averaged to obtain a single task score, as sketched below.
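
As a concrete illustration of this rule (not from the paper's codebase; the metric values below are placeholders), metrics within a task are averaged first, and the per-task scores are then averaged into the overall SuperGLUE score:

```python
# Equal-weight scoring sketch: average metrics within each task, then
# average the per-task scores. The numbers are illustrative placeholders.
task_metrics = {
    "BoolQ":   {"acc": 75.0},
    "CB":      {"f1": 70.0, "acc": 80.0},
    "MultiRC": {"f1a": 65.0, "em": 25.0},
    # ... remaining tasks omitted for brevity
}

task_scores = {t: sum(m.values()) / len(m) for t, m in task_metrics.items()}
overall = sum(task_scores.values()) / len(task_scores)
print({t: round(s, 1) for t, s in task_scores.items()}, round(overall, 1))
```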

3. Benchmarking Results

3.1. BERT

  • BERT is used as the primary baseline, specifically the BERT-Large model variant.
  • For each task, the simplest possible architecture is used on top of BERT. Then, a copy of the pretrained BERT model is fine-tuned separately for each task.
  • For classification tasks with sentence-pair inputs (BoolQ, CB, RTE, WiC), the two sentences are concatenated with a [SEP] token, the fused input is fed to BERT, and a logistic regression classifier reads the representation corresponding to the [CLS] token (see the first sketch after this list).
  • For WiC, the representation of the marked word is additionally concatenated to the [CLS] representation.
  • For COPA, MultiRC, and ReCoRD, the context is similarly concatenated with each answer choice, and the resulting sequence is fed into BERT to produce an answer representation (see the second sketch below).
  • For COPA, each answer representation is projected to a scalar, and the choice with the highest scalar is taken as the answer.
  • For MultiRC, because each question can have more than one correct answer, each answer representation is fed into a logistic regression classifier.
  • For ReCoRD, the probability of each candidate independent of other candidates is evaluated, and the most likely candidate is taken as the model’s prediction.
  • For WSC, which is a span-based task, a model inspired by Tenney et al. (2019) is used. Given the BERT representation for each word in the original sentence, span representations of the pronoun and noun phrase are obtained via a self-attention span-pooling operator (Lee et al., 2017), before feeding them into a logistic regression classifier.
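
The post describes but does not show this setup, so here is a minimal, hedged sketch of the sentence-pair classification case (BoolQ, CB, RTE, WiC), assuming the Hugging Face transformers implementation of BERT rather than the paper's original code; the checkpoint name, the example pair, and the untrained two-class head are illustrative assumptions.

```python
# Hedged sketch (not the authors' code): sentence-pair classification with
# BERT. The tokenizer inserts [CLS] and [SEP]; a linear head reads the
# [CLS] representation, standing in for the logistic regression classifier.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-large-cased")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)        # e.g. entailment / not_entailment

premise = "No weapons of mass destruction were found in Iraq."
hypothesis = "Weapons of mass destruction were found in Iraq."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")    # [CLS] s1 [SEP] s2 [SEP]

with torch.no_grad():
    cls_repr = bert(**inputs).last_hidden_state[:, 0]           # [CLS] token representation
logits = classifier(cls_repr)
print(logits.softmax(dim=-1))
```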

Some input manipulation is needed before feeding each task's examples into BERT.
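
For the multiple-choice tasks, that manipulation amounts to pairing the context with each candidate and scoring the pairs independently. Below is a hedged sketch of the COPA case described above (again assuming the Hugging Face transformers API, with an untrained scalar head used purely for illustration); the premise and choices follow the COPA example format.

```python
# Hedged sketch (not the authors' code): per-choice scoring for COPA.
# The premise is paired with each candidate, each pair's [CLS] representation
# is projected to a scalar, and the higher-scoring choice is predicted.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
bert = BertModel.from_pretrained("bert-large-cased")
scorer = torch.nn.Linear(bert.config.hidden_size, 1)    # scalar score per choice

premise = "My body cast a shadow over the grass."       # COPA-style premise (cause question)
choices = ["The sun was rising.", "The grass was cut."]

scores = []
with torch.no_grad():
    for choice in choices:
        enc = tokenizer(premise, choice, return_tensors="pt")
        cls_repr = bert(**enc).last_hidden_state[:, 0]
        scores.append(scorer(cls_repr).item())

print("predicted choice:", scores.index(max(scores)))
```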

3.2. BERT++

  • This is BERT with additional training on related datasets before fine-tuning on the benchmark tasks.
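
As a rough illustration of this two-stage recipe (a toy sketch, not the paper's actual setup: the encoder and data loaders below are synthetic stand-ins), the same fine-tuning loop is simply run twice, first on the transfer dataset (e.g. MultiNLI for RTE) and then on the target task:

```python
# Toy sketch of staged fine-tuning: train on a transfer dataset first,
# then continue fine-tuning the same weights on the target task.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def finetune(model, loader, epochs=1, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Synthetic stand-ins for the BERT encoder and the two datasets.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
mnli_loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 3, (64,))), batch_size=8)
rte_loader = DataLoader(TensorDataset(torch.randn(32, 16), torch.randint(0, 2, (32,))), batch_size=8)

model = finetune(model, mnli_loader)   # intermediate training on the transfer data (e.g. MultiNLI)
model = finetune(model, rte_loader)    # then fine-tune on the target task (e.g. RTE)
```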

3.3. Results

Baseline performance on the SuperGLUE test sets and diagnostics.
  • The most frequent class and CBOW (Word2Vec) baselines do not perform well overall, achieving near chance performance for several of the tasks.
  • Using BERT increases the average SuperGLUE score by 25 points, attaining significant gains on all of the benchmark tasks, particularly MultiRC, ReCoRD, and RTE.
  • On WSC, BERT actually performs worse than the simple baselines, likely due to the small size of the dataset and the lack of data augmentation.
  • For BERT++, using MultiNLI as an additional source of supervision for BoolQ, CB, and RTE leads to a 2–5 point improvement on all tasks.
  • For BERT++, using SWAG as a transfer task for COPA sees an 8 point improvement.

But the best baselines still lag substantially behind human performance. On average, there is a nearly 20 point gap between BERT++ and human performance.

  • The largest gap is on WSC, with a 35 point difference between the best model and human performance. The smallest margins are on BoolQ, CB, RTE, and WiC, with gaps of around 10 points on each of these.
  • On WSC and COPA, human performance is perfect. On three other tasks, it is in the mid-to-high 90s. On the diagnostics, all models continue to lag significantly behind humans.

Reference

[2019 NeurIPS] [SuperGLUE]
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Natural Language Processing (NLP)

Language/Sequence Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GloVe] [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] 2017 [TagLM] [CoVe] [MoE] 2018 [GLUE] [T-DMCA] [GPT, GPT-1] [ELMo] 2019 [T64] [Transformer-XL] [BERT] [RoBERTa] [GPT-2] [DistilBERT] [MT-DNN] [SuperGLUE] 2020 [ALBERT] [GPT-3]

My Other Previous Paper Readings
