Review — Big Bird: Transformers for Longer Sequences

BIGBIRD, Turing Complete, Sparse Attention Approximates Full Attention

Sik-Ho Tsang
6 min readFeb 18


Big Bird: Transformers for Longer Sequences,
BIGBIRD, by Google Research,
2020 NeurIPS, Over 900 Citations (Sik-Ho Tsang @ Medium)
NLP, Transformer, Longformer
==== My Other Paper Readings Also Over Here ====

  • BIGBIRD, a sparse attention mechanism that reduces the Transformer quadratic dependency to linear.
  • The proposed sparse attention can handle sequences of length up to 8× of what was previously possible using similar hardware.


  1. Generalized Attention Mechanism & Full Attention in Transformer
  3. NLP Results
  4. Genomics Results

1. Generalized Attention Mechanism & Full Attention in Transformer

1.1. Generalized Attention Mechanism

  • On an input sequence X=(x1, …, xn) of size n×d, the generalized attention mechanism is described by a directed graph D whose vertex set is [n] = {1, …, n}.
  • The set of arcs (directed edges) represent the set of inner products that the attention mechanism will consider.
  • Let N(i) denote the out-neighbors set of node i in D, then the i-th output vector of the generalized attention mechanism is defined as:
  • where Qh, Kh, and Vh are query, key, and value respectively. σ is scoring function, i.e. softmax. H is the number of heads.
  • XN(i) corresponds to the matrix formed by only stacking {xj: jN(i)} and NOT all the inputs.

1.2. Full Attention in Transformer

If D is the complete digraph, we recover the full quadratic attention mechanism in the conventional Transformer.

  • To elaborate, A∈[0, 1] of size n×n with A(i, j)=1 if query i attends to key j and is zero otherwise.

However, full attention induces quadratic complexity.

1.3. Graph Sparsification Problem

The problem of reducing the quadratic complexity of self-attention can now be seen as a graph sparsification problem.

  • It is well-known that random graphs are expanders and can approximate complete graphs in a number of different contexts including in their spectral properties [80, 38].


Building blocks of the attention mechanism used in BIGBIRD. White color indicates absence of attention. (a) random attention with r = 2, (b) sliding window attention with w = 3 (c) global attention with g = 2. (d) the combined BIGBIRD model.

2.1. (a) Random Attention

  • Let us consider the simplest random graph construction, known as Erdős–Rényi model, where each edge is independently chosen with a fixed probability.

A sparse attention where each query attends over r random number of keys, i.e. A(i, ·)=1 for r randomly chosen keys.

2.2. (b) Sliding Window Attention

  • The second viewpoint which inspired the creation of BIGBIRD is that most contexts within NLP and computational biology have data which displays a great deal of locality of reference.
  • A particular model introduced by Watts and Strogatz [95] is of high relevance to BIGBIRD as it achieves a good balance between average shortest path and the notion of locality.

The generative process of their model is as follows: Construct a regular ring lattice, a graph with n nodes each connected to w neighbors, w/2 on each side, to capture the local structures in the context.

  • Then, we can have a random subset (k%) of all connections is replaced with a random connection. The other (100 — k)%, local connections are retained.
Building block comparison @ 512.
  • When random (R) and sliding window (W) are used together for attention, the results are closed to BERT one.

2.3. (c) Global Attention

  • The final piece is the importance of “global tokens” (tokens that attend to all tokens in the sequence and to whom all tokens attend to.)

BIGBIRD-ITC: Internal Transformer Construction (ITC) make some (g) existing tokens “global”.

BIGBIRD-ETC: Extended Transformer Construction (ETC) includes g additional “global” tokens such as CLS.

2.4. (d) BIGBIRD Attention

The final attention mechanism for BIGBIRD has all three of these properties: queries attend to r random keys, each query attends to w/2 tokens to the left of its location and w/2 to the right of its location and they contain g global tokens (The global tokens can be from existing tokens or extra added tokens).

  • Authors also show the full attention is somehow Turing Complete.
  • BIGBIRD is a universal approximator of sequence functions and is Turing Complete, thereby preserving these properties of the quadratic, full attention model.
  • (Please feel free to read the paper for more details.)

3. NLP Results

3.1. QA Tasks

QA Dev results using Base size models.
  • BIGBIRD-ETC, with expanded global tokens consistently outperforms all other models. Thus, this configuration is chosen to train a large sized model to be used for evaluation on the hidden test set.
Fine-tuning results on Test set for QA tasks.
  • BIGBIRD-ETC model is compared to top-3 entries from the leaderboard excluding BIGBIRD.
  • It is worth noting that BIGBIRD submission is a single model, whereas the other top-3 entries for Natural Questions are ensembles.

One can clearly see the importance of using longer context as both Longformer and BIGBIRD outperform models with smaller contexts.

3.2. Summarization

Summarization ROUGE score for long documents.
  • Following BERT, one layer with cross entropy loss is used on top of the first [CLS] token.

Gains of using BIGBIRD are more significant when longer documents and fewer training examples are used.

  • For instance, using base sized model, BIGBIRD improves state-of-the-art for Arxiv dataset by about 5% points.
  • On Patents dataset, there is improvement over using simple BERT/RoBERTa, but given the large size of training data the improvement over SoTA (which is not BERT based) is not significant.

4. Genomics Results

  • There has been a recent upsurge in using deep learning for genomics data [87, 107, 13], which has resulted in improved performance on several biologically-significant tasks such as promoter site prediction [71], methylation analysis [55], predicting functional effects of non-coding variant [110], etc.

4.1. Bits Per Character (BPC)

MLM Bits Per Character (BPC).
  • DNA is first segmented into tokens so as to further increase the context length. In particular, a byte-pair encoding table is built for the DNA sequence of size 32K, with each token representing 8.78 base pairs on average. The the bits per character (BPC) on a held-out set is shown above.

Attention based contextual representation of DNA does improve BPC, which is further improved by using longer context.

4.2. Promoter Region Prediction

Promoter Region Prediction.
  • Promoter is a DNA region typically located upstream of the gene, which is the site of transcription initiation.

BIGBIRD achieve nearly perfect accuracy with a 5% jump from the previous best reported accuracy.

4.3. Chromatin-Profile Prediction

Chromatin-Profile Prediction.
  • Non-coding regions of DNA do not code for proteins. Majority of diseases and other trait associated single-nucleotide polymorphism are correlated to non-coding genomic variations [110, 46].
  • The corresponding ML task is to predict, for a given non-coding region of DNA, these 919 chromatin-profile including 690 transcription factors (TF) binding profiles for 160 different TFs, 125 DNase I sensitivity (DHS) profiles and 104 histone-mark (HM) profiles. 919 binary classifiers are jointly learned to predict these functional effects from sequence of DNA fragments.
  • On held-out chromosomes, AUC is measured and compared with the baselines.

BIGBIRD significantly improves on performance on the harder task HM, which is known to have longer-range correlations [27] than others.


[2020 NeurIPS] [BIGBIRD]
Big Bird: Transformers for Longer Sequences

2.1. Language Model / Sequence Model

19912020 … [BIGBIRD] 2021 [Performer] [gMLP] [Roformer] [PPBERT] [DeBERTa] [DeLighT] [Transformer-LS] 2022 [GPT-NeoX-20B] [InstructGPT]



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.