# Review — Big Bird: Transformers for Longer Sequences

## BIGBIRD, Turing Complete, Sparse Attention Approximates Full Attention

Big Bird: Transformers for Longer Sequences,
2020 NeurIPS, Over 900 Citations (Sik-Ho Tsang @ Medium)
NLP, Transformer, Longformer

• BIGBIRD, a sparse attention mechanism that reduces the Transformer's quadratic dependency on sequence length to linear.
• The proposed sparse attention can handle sequences up to 8× longer than was previously possible using similar hardware.

# Outline

1. Generalized Attention Mechanism & Full Attention in Transformer
2. BIGBIRD
3. NLP Results
4. Genomics Results

# 1. Generalized Attention Mechanism & Full Attention in Transformer

## 1.1. Generalized Attention Mechanism

• On an input sequence X=(x1, …, xn) of size n×d, the generalized attention mechanism is described by a directed graph D whose vertex set is [n] = {1, …, n}.
• The set of arcs (directed edges) represent the set of inner products that the attention mechanism will consider.
• Let N(i) denote the out-neighbor set of node i in D; the i-th output vector of the generalized attention mechanism is then defined as:

ATTN_D(X)_i = x_i + Σ_{h=1..H} σ(Q_h(x_i) K_h(X_{N(i)})^T) · V_h(X_{N(i)})

• where Q_h, K_h, and V_h are the query, key, and value functions respectively, σ is a scoring function (e.g. softmax), and H is the number of heads.
• X_{N(i)} corresponds to the matrix formed by stacking only {x_j : j ∈ N(i)}, NOT all the inputs.
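A minimal single-head sketch of this restricted attention, assuming NumPy; the function names are my own, and the residual x_i and multi-head sum are omitted for brevity:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def generalized_attention(X, Wq, Wk, Wv, neighbors):
    """Single-head attention over a directed graph D.

    neighbors[i] is the out-neighbor set N(i): the keys query i may attend to.
    """
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    out = np.zeros_like(V)
    for i in range(n):
        idx = neighbors[i]                              # stack only x_j for j in N(i)
        scores = Q[i] @ K[idx].T / np.sqrt(K.shape[-1])
        out[i] = softmax(scores) @ V[idx]
    return out
```

When neighbors[i] = [n] for every i, this reduces exactly to dense softmax attention, which is the "complete digraph" special case discussed next.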

## 1.2. Full Attention in Transformer

If D is the complete digraph, we recover the full quadratic attention mechanism in the conventional Transformer.

• To elaborate, the adjacency matrix A ∈ {0, 1} of size n×n has A(i, j) = 1 if query i attends to key j, and zero otherwise.

However, full attention induces quadratic complexity.

## 1.3. Graph Sparsification Problem

The problem of reducing the quadratic complexity of self-attention can now be seen as a graph sparsification problem.

• It is well-known that random graphs are expanders and can approximate complete graphs in a number of different contexts including in their spectral properties [80, 38].

# 2. BIGBIRD

## 2.1. (a) Random Attention

• Let us consider the simplest random graph construction, known as the Erdős–Rényi model, where each edge is chosen independently with a fixed probability.

A sparse attention where each query attends over r randomly chosen keys, i.e. A(i, ·) = 1 for r randomly chosen keys.
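Such a mask can be built directly; a sketch assuming NumPy, with a hypothetical helper name:

```python
import numpy as np

def random_attention_mask(n, r, seed=0):
    """Binary mask with A(i, j) = 1 for r keys chosen uniformly at random per query i."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        A[i, rng.choice(n, size=r, replace=False)] = 1
    return A
```

Each row has exactly r ones, so the cost of applying this mask grows linearly in n for fixed r.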

## 2.2. (b) Sliding Window Attention

• The second viewpoint which inspired the creation of BIGBIRD is that most contexts within NLP and computational biology have data which displays a great deal of locality of reference.
• A particular model introduced by Watts and Strogatz [95] is of high relevance to BIGBIRD as it achieves a good balance between average shortest path and the notion of locality.

The generative process of their model is as follows: Construct a regular ring lattice, a graph with n nodes each connected to w neighbors, w/2 on each side, to capture the local structures in the context.

• Then a random subset (k%) of all connections is replaced with random connections; the other (100 − k)% of local connections are retained.
• When random (R) and sliding-window (W) attention are used together, the results are close to those of BERT.
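The ring-lattice (sliding-window) part of this construction, before any rewiring, can be sketched as follows (assuming NumPy; the helper name is hypothetical):

```python
import numpy as np

def window_attention_mask(n, w):
    """Regular ring lattice: each of the n nodes is connected to w neighbors, w/2 per side."""
    A = np.eye(n, dtype=int)                 # each token also attends to itself
    for i in range(n):
        for off in range(1, w // 2 + 1):
            A[i, (i + off) % n] = 1          # w/2 neighbors to the right
            A[i, (i - off) % n] = 1          # w/2 neighbors to the left
    return A
```

Watts–Strogatz rewiring would then replace k% of these local edges with random ones, shortening average path lengths while keeping most of the locality.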

## 2.3. (c) Global Attention

• The final piece is the importance of “global tokens”: tokens that attend to all tokens in the sequence, and to which all tokens attend.

BIGBIRD-ITC: Internal Transformer Construction (ITC) makes some (g) existing tokens “global”.

BIGBIRD-ETC: Extended Transformer Construction (ETC) includes g additional “global” tokens, such as [CLS].

## 2.4. (d) BIGBIRD Attention

The final attention mechanism for BIGBIRD combines all three properties: each query attends to r random keys, to w/2 tokens to the left of its location and w/2 to the right, and to g global tokens (the global tokens can be existing tokens or extra added tokens).
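Putting the three patterns together as one binary mask A, an illustrative ITC-style sketch assuming NumPy, with the g global tokens taken from the front of the sequence:

```python
import numpy as np

def bigbird_mask(n, r=3, w=6, g=2, seed=0):
    """A(i, j) = 1 where query i may attend to key j: random + sliding window + global."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        A[i, rng.choice(n, size=r, replace=False)] = 1   # r random keys
        lo, hi = max(0, i - w // 2), min(n, i + w // 2 + 1)
        A[i, lo:hi] = 1                                  # window of w/2 on each side
    A[:g, :] = 1    # global tokens attend to everything
    A[:, :g] = 1    # every token attends to the global tokens
    return A
```

For fixed r, w, and g, each non-global row has O(r + w + g) ones, so the total number of attended pairs grows linearly in n rather than quadratically.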

• The authors also show that full attention is Turing complete.
• BIGBIRD is a universal approximator of sequence functions and is also Turing complete, thereby preserving these properties of the quadratic, full attention model.

# 3. NLP Results

## 3.1. Question Answering

• BIGBIRD-ETC, with expanded global tokens, consistently outperforms all other models. Thus, this configuration is chosen to train a large-sized model for evaluation on the hidden test set.
• BIGBIRD-ETC model is compared to top-3 entries from the leaderboard excluding BIGBIRD.
• It is worth noting that the BIGBIRD submission is a single model, whereas the other top-3 entries for Natural Questions are ensembles.

One can clearly see the importance of using longer context as both Longformer and BIGBIRD outperform models with smaller contexts.

## 3.2. Summarization

• Following BERT, one layer with a cross-entropy loss is used on top of the first [CLS] token.

Gains of using BIGBIRD are more significant when longer documents and fewer training examples are used.

• For instance, using the base-sized model, BIGBIRD improves the state-of-the-art on the Arxiv dataset by about 5 percentage points.
• On the Patents dataset, there is an improvement over using plain BERT/RoBERTa, but given the large size of the training data, the improvement over the SoTA (which is not BERT-based) is not significant.

# 4. Genomics Results

• There has been a recent upsurge in using deep learning for genomics data [87, 107, 13], which has resulted in improved performance on several biologically-significant tasks such as promoter site prediction [71], methylation analysis [55], predicting functional effects of non-coding variant [110], etc.

## 4.1. Bits Per Character (BPC)

• DNA is first segmented into tokens so as to further increase the context length. In particular, a byte-pair encoding vocabulary of size 32K is built for the DNA sequences, with each token representing 8.78 base pairs on average. The bits per character (BPC) on a held-out set is shown above.
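The reported metric can be derived from the token-level loss: divide the per-token cross-entropy (in bits) by the average number of base pairs one token covers. A small sketch; the function name and nats-based input are my own assumptions:

```python
import math

def bits_per_character(nats_per_token, bp_per_token=8.78):
    """Convert token-level cross-entropy (in nats) to bits per base pair (BPC)."""
    bits_per_token = nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token / bp_per_token           # spread over the bases one token covers
```

This is why BPE tokenization helps the headline number: the same per-token loss is amortized over 8.78 characters instead of one.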

Attention based contextual representation of DNA does improve BPC, which is further improved by using longer context.

## 4.2. Promoter Region Prediction

• A promoter is a DNA region, typically located upstream of a gene, which is the site of transcription initiation.

BIGBIRD achieves nearly perfect accuracy, with a 5% jump over the previous best reported accuracy.

## 4.3. Chromatin-Profile Prediction

• Non-coding regions of DNA do not code for proteins. The majority of disease- and other trait-associated single-nucleotide polymorphisms are correlated with non-coding genomic variations [110, 46].
• The corresponding ML task is to predict, for a given non-coding region of DNA, 919 chromatin profiles: 690 transcription-factor (TF) binding profiles for 160 different TFs, 125 DNase I sensitivity (DHS) profiles, and 104 histone-mark (HM) profiles. 919 binary classifiers are jointly learned to predict these functional effects from sequences of DNA fragments.
• On held-out chromosomes, AUC is measured and compared with the baselines.
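A sketch of such a jointly learned multi-label setup: one sigmoid output per profile on a shared representation, trained with binary cross-entropy. Assuming NumPy; shapes and names are illustrative, not the paper's implementation:

```python
import numpy as np

N_PROFILES = 919  # 690 TF binding + 125 DHS + 104 HM

def chromatin_head(h, W, b):
    """919 independent binary classifiers sharing the pooled sequence representation h."""
    logits = h @ W + b                    # (batch, 919)
    return 1.0 / (1.0 + np.exp(-logits))  # per-profile probability

def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all profiles, jointly optimizing the 919 tasks."""
    p = np.clip(probs, eps, 1 - eps)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```

AUC can then be computed per profile and averaged within the TF, DHS, and HM groups, matching the held-out evaluation described above.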

BIGBIRD significantly improves performance on the harder HM task, which is known to have longer-range correlations [27] than the others.

## Reference

[2020 NeurIPS] [BIGBIRD]
Big Bird: Transformers for Longer Sequences
