Brief Review — data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

data2vec, SSL for Speech, Vision and Language

Sik-Ho Tsang
5 min read · Jun 28


data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
data2vec, by Meta AI and SambaNova
2022 ICML, Over 340 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM] [LDBM]

  • data2vec is proposed, which is a framework that uses the SAME learning method for either speech, NLP or computer vision.
  • The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture.
  • data2vec predicts contextualized latent representations instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature.


  1. data2vec
  2. Results

1. data2vec

data2vec Framework

1.1. Model

A standard Transformer architecture is used, with a modality-specific encoding of the input:

  • For computer vision, the ViT strategy is used: the image is divided into non-overlapping 16×16 patches, each of which is linearly embedded.
  • Speech data is encoded using a multi-layer 1-D convolutional neural network that maps 16 kHz waveform to 50 Hz representations (wav2vec 2.0).
  • Text is first pre-processed into sub-word units (Byte Pair Encoding (BPE), BERT). The RoBERTa setup is followed.
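To make the vision front end concrete, here is a minimal, dependency-free sketch of cutting an image into non-overlapping 16×16 patches. The function name `patchify`, the 224×224 input size, and the single-channel nested-list image are assumptions for illustration only; a real ViT would also apply a learned linear projection to each flattened patch.

```python
def patchify(image, patch=16):
    """Split an H x W image (a list of rows) into non-overlapping
    patch x patch tiles, each flattened to a list of patch * patch values."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


# Hypothetical 224 x 224 single-channel image (the size is an assumption).
image = [[r * 224 + c for c in range(224)] for r in range(224)]
patches = patchify(image)
print(len(patches), len(patches[0]))  # 196 256
```

A 224×224 input yields 14 × 14 = 196 patches of 256 values each, which is the token sequence the Transformer then sees.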

1.2. Masking

A portion of the tokens is masked by replacing them with a learned MASK embedding token.

  • For computer vision, the block-wise masking strategy in BEiT is followed, but it masks 60% of the patches instead of 40%.
  • For speech, spans of latent speech representations are masked (wav2vec 2.0): a proportion p = 0.065 of all time-steps is sampled as starting indices, and the subsequent ten time-steps from each start are masked. Because spans can overlap, approximately 49% of all time-steps end up masked for a typical training sequence.
  • For language, tokens are masked following BERT: 15% of tokens are selected uniformly at random; of these, 80% are replaced by a learned mask token, 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token.
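The speech and language masking rules above can be sketched as follows. This is only an illustration under stated assumptions: the function names `span_mask` and `bert_mask` are made up here, and each time-step or token is sampled independently with the given probability, which approximates (but is not necessarily identical to) the sampling used in wav2vec 2.0 and BERT.

```python
import random


def span_mask(num_steps, p=0.065, span=10, rng=None):
    """Treat each time-step as a span start with probability p, then mask
    `span` consecutive steps from that start (spans may overlap)."""
    rng = rng if rng is not None else random.Random(0)
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < p:
            for t in range(start, min(start + span, num_steps)):
                mask[t] = True
    return mask


def bert_mask(tokens, vocab, mask_token="[MASK]", select_prob=0.15, rng=None):
    """Select about 15% of tokens; of those, 80% become the mask token,
    10% stay unchanged, and 10% become a random vocabulary token."""
    rng = rng if rng is not None else random.Random(0)
    corrupted = list(tokens)
    selected = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # token is not part of the prediction task
        selected.append(i)
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_token
        elif roll >= 0.9:
            corrupted[i] = rng.choice(vocab)
        # otherwise (0.8 <= roll < 0.9) the token is left unchanged
    return corrupted, selected


mask = span_mask(50_000)
print(f"empirical masked fraction: {sum(mask) / len(mask):.2f}")
print(f"closed form 1 - (1 - p)^10: {1 - (1 - 0.065) ** 10:.2f}")
```

Under the independence assumption, a time-step stays unmasked only if none of the ten possible starts covering it fires, so the masked fraction is 1 − (1 − 0.065)^10 ≈ 0.49, consistent with the figure quoted above.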

1.3. Training Targets

The representations to predict are contextualized representations.

  • This is an important difference from BERT, wav2vec 2.0, BEiT, MAE, SimMIM, and MaskFeat, which predict targets lacking contextual information.

1.4. Teacher Parameterization

The unmasked training sample is encoded by a teacher whose weights $\Delta$ are an exponentially moving average (EMA) of the student model parameters $\theta$:

$$\Delta \leftarrow \tau \Delta + (1 - \tau)\,\theta$$

  • The parameters of the feature encoder and the positional encoder are shared between the teacher and student networks.
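A minimal sketch of the EMA teacher update, teacher ← τ · teacher + (1 − τ) · student, is given below. Parameters are represented as plain floats in a dict purely for illustration, and the names `ema_update` and `tau_schedule` are hypothetical. The linear warm-up of τ (from τ0 to τe over the first τn updates, then constant) follows the schedule described in the data2vec paper; the numeric values used here are illustrative, not the paper's settings.

```python
def tau_schedule(step, tau_0, tau_e, tau_n):
    """Linearly increase tau from tau_0 to tau_e over the first tau_n
    updates, then keep it constant at tau_e."""
    if step >= tau_n:
        return tau_e
    return tau_0 + (tau_e - tau_0) * step / tau_n


def ema_update(teacher, student, tau):
    """In-place EMA update of the teacher parameters toward the student."""
    for name, theta in student.items():
        teacher[name] = tau * teacher[name] + (1.0 - tau) * theta


# Toy parameters (illustrative values only).
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
for step in range(3):
    tau = tau_schedule(step, tau_0=0.5, tau_e=0.9, tau_n=2)
    ema_update(teacher, student, tau)
print(teacher)
```

A small τ early in training lets the teacher track the rapidly changing student; a τ close to 1 later on makes the targets more stable.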

1.5. Targets

Training targets are constructed based on the output of the top-K blocks of the teacher network for time-steps which are masked in student-mode.

  • The output of block l at time-step t is denoted as $a_t^l$. A normalization is applied to each block to obtain $\hat{a}_t^l$, and the top K blocks of a network with L blocks in total are then averaged to obtain the training target $y_t$ for time-step t:

$$y_t = \frac{1}{K} \sum_{l=L-K+1}^{L} \hat{a}_t^l$$
  • Normalizing the targets helps prevent the model from collapsing into a constant representation.
  • For speech representations, instance normalization is used. K = 8.
  • For NLP and vision, parameter-less layer normalization is used.
  • For vision, K = 6.
  • For NLP, K = 10.
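The target construction can be sketched in a few lines: normalize each of the top K teacher block outputs, then average them at every masked time-step. The sketch below uses parameter-less layer normalization (the NLP/vision choice; the instance normalization used for speech is not shown), nested Python lists instead of tensors, and the hypothetical names `layer_norm` and `build_targets`.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Parameter-less layer normalization over the feature dimension."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def build_targets(block_outputs, k, masked_steps):
    """block_outputs[l][t] is the feature vector of teacher block l at
    time-step t. Returns {t: y_t}, where y_t is the average of the
    normalized outputs of the top k blocks."""
    if not 1 <= k <= len(block_outputs):
        raise ValueError("k must be between 1 and the number of blocks")
    top_k = block_outputs[-k:]
    targets = {}
    for t in masked_steps:
        normed = [layer_norm(block[t]) for block in top_k]
        dim = len(normed[0])
        targets[t] = [sum(n[d] for n in normed) / k for d in range(dim)]
    return targets


# Toy teacher with L = 3 blocks, 2 time-steps, 4 features (made-up numbers).
blocks = [
    [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
    [[2.0, 0.0, 2.0, 0.0], [4.0, 3.0, 2.0, 1.0]],
    [[1.0, 1.0, 2.0, 2.0], [0.0, 0.0, 1.0, 5.0]],
]
print(build_targets(blocks, k=2, masked_steps=[1]))
```

Only masked time-steps receive a target, since the student is trained to predict the teacher's representation exactly where its own input was hidden.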

1.6. Objectives

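Given contextualized training targets $y_t$, a Smooth L1 loss is used to regress these targets, where $f_t(x)$ is the student's prediction at time-step t and $\beta$ controls the transition from a squared loss to an L1 loss:

$$\mathcal{L}(y_t, f_t(x)) = \begin{cases} \frac{1}{2}\,(y_t - f_t(x))^2 / \beta & \text{if } |y_t - f_t(x)| \le \beta \\ |y_t - f_t(x)| - \frac{1}{2}\beta & \text{otherwise} \end{cases}$$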

  • The advantage of this loss is that it is less sensitive to outliers.
  • For speech, a simple L2 loss works well.
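As a small worked example of the two objectives, the sketch below implements an element-wise Smooth L1 loss with transition point β alongside a plain L2 (mean squared error) loss. The function names and the default β = 1.0 are assumptions for illustration, and β must be positive.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic for small errors (|diff| <= beta),
    linear for large ones, which limits the influence of outliers."""
    total = 0.0
    for p, y in zip(pred, target):
        diff = abs(y - p)
        if diff <= beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


def l2(pred, target):
    """Mean squared error, the simpler loss reported to work for speech."""
    return sum((y - p) ** 2 for p, y in zip(pred, target)) / len(pred)


pred = [0.0, 0.0]
target = [0.5, 3.0]
print(smooth_l1(pred, target))  # 1.3125
print(l2(pred, target))         # 4.625
```

The small error (0.5) contributes 0.125 and the outlier (3.0) only 2.5 to the Smooth L1 sum, whereas the L2 loss charges the outlier 9.0.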

1.7. Downstream

  • For image classification, the output of the last Transformer block is mean-pooled and fed to a softmax-normalized classifier.
  • For speech, the fine-tuning regime of wav2vec 2.0 is used.
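For the image classification head, the following sketch mean-pools the token representations of the last Transformer block and passes the result through a linear layer followed by a softmax. The names `mean_pool`, `softmax` and `classify`, as well as the toy weights, are assumptions for illustration.

```python
import math


def mean_pool(tokens):
    """Average the per-token feature vectors into one vector."""
    n = len(tokens)
    return [sum(tok[d] for tok in tokens) / n for d in range(len(tokens[0]))]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, weights, bias):
    """Mean-pool the tokens, apply a linear layer, return class probabilities."""
    pooled = mean_pool(tokens)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Two tokens with two features each, and a two-class identity classifier.
tokens = [[1.0, 0.0], [3.0, 2.0]]
probs = classify(tokens, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
print(probs)
```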

2. Results

2.1. Computer Vision

Computer Vision
  • ImageNet is used for pretraining.

data2vec outperforms prior work with ViT-B and ViT-L in the single model setting and all prior work for ViT-L.

2.2. Speech

  • data2vec is pre-trained on the 960 hours of speech audio data from Librispeech (LS-960).

The above table shows improvements for most labeled data setups with the largest gains for 10 minutes of labeled data (20% relative WER improvement) for the Base models.

For Large models, there are strong improvements for the smallest labeled data setups, and comparable performance for the resource-rich settings of 100 hours and 960 hours of labeled data where performance is generally saturating for many models.


data2vec can outperform a comparable setup that uses the same pre-training and fine-tuning data.

2.3. NLP

  • The model is pre-trained on the Books Corpus and English Wikipedia data over 1M updates.

data2vec outperforms the RoBERTa baseline. This is the first successful pre-trained NLP model which does not use discrete units (words, subwords, characters or bytes) as the training target. Instead, the model predicts a contextualized latent representation.

2.4. Ablation Studies

Layer-averaged Targets
  • One of the main differences between data2vec and BYOL is the use of targets that are based on averaging multiple layers from the teacher network.

Targets based on multiple layers improve over using only the top layer (K = 1) for all modalities.

Target Contextualization
  • Teacher representations are based on self-attention over the entire input, which results in contextualized targets. This distinguishes data2vec from other self-supervised approaches, which construct a learning task by predicting or reconstructing local parts of the input.

Larger context sizes lead to better downstream performance.

Target Feature Type
  • Transformer blocks contain several layers which can each serve as targets.

The output of the feed-forward network (FFN) block works best while the output of the self-attention block does not yield a usable model.


