Brief Review — FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ

For the Purposes of Automatic Speech Recognition (ASR) and Speech Translation (ST)

Sik-Ho Tsang
3 min read · Sep 23, 2023

FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ, by Meta Fundamental AI Research (FAIR) and Johns Hopkins University
2020 AACL Demo, Over 170 Citations (Sik-Ho Tsang @ Medium)

Speech-to-Text Modeling

  • This paper does not provide any new model or new dataset.
  • Instead, FAIRSEQ S2T provides end-to-end workflows covering data pre-processing, model training, and both offline and online inference.
  • It also provides state-of-the-art RNN-based, Transformer-based, and Conformer-based models, together with detailed open-source training recipes.


1.1. FAIRSEQ S2T Extension

  • FAIRSEQ (for NLP) provides a collection of Machine Translation (MT) models and Language Models (LMs).

FAIRSEQ S2T is an extension for Speech-to-Text (S2T) Processing, adding attention-based RNN models, Transformer models as well as the latest Conformer models for Automatic Speech Recognition (ASR) and Speech Translation (ST).

1.2. Data Pre-Processing

FAIRSEQ S2T provides online speech data transforms, including CMVN, speed perturbation and SpecAugment. It also has an open interface for user-defined transforms.
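To make two of these transforms concrete, here is a minimal sketch in plain Python. It is not the FAIRSEQ S2T implementation: the function names `cmvn` and `mask_time_and_freq` are hypothetical, the CMVN is computed per utterance, and the masking covers only the time and frequency masks of SpecAugment (time warping is left out).

```python
import random
from statistics import fmean, pstdev


def cmvn(features, eps=1e-8):
    """Utterance-level CMVN: zero mean and unit variance per feature dimension.

    `features` is a list of frames; each frame is a list of floats.
    """
    columns = list(zip(*features))  # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in features
    ]


def mask_time_and_freq(features, max_time=2, max_freq=2, rng=random):
    """SpecAugment-style masking: zero one random time span and one random frequency band."""
    n_frames, n_bins = len(features), len(features[0])
    out = [list(frame) for frame in features]  # copy so the input stays untouched
    t_width = rng.randint(0, min(max_time, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    f_width = rng.randint(0, min(max_freq, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for t in range(n_frames):
        for f in range(n_bins):
            in_time_mask = t_start <= t < t_start + t_width
            in_freq_mask = f_start <= f < f_start + f_width
            if in_time_mask or in_freq_mask:
                out[t][f] = 0.0
    return out
```

Because such transforms are applied online, each training epoch can see a differently masked version of the same utterance without storing augmented copies on disk.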

1.3. Computation

FAIRSEQ is implemented in PyTorch and it provides efficient batching, mixed precision training, multi-GPU as well as multi-machine training.

  • It also provides metrics and visualization.
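As an illustration of what "efficient batching" can mean for variable-length speech, the sketch below groups utterances of similar length so that the padded batch stays within a token budget. This is an assumption about the general idea, not FAIRSEQ's actual batching code; `batch_by_length` and its arguments are hypothetical names.

```python
def batch_by_length(lengths, max_tokens):
    """Group sample indices into batches whose padded size stays within max_tokens.

    Padded size = (number of samples) x (longest sample in the batch).
    Samples are sorted by length first so little is wasted on padding.
    A single sample longer than max_tokens still gets a batch of its own.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for idx in order:
        # Sorted ascending, so the incoming sample is the longest in the batch.
        padded_size = (len(current) + 1) * lengths[idx]
        if current and padded_size > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


# Frame counts of five hypothetical utterances.
print(batch_by_length([120, 80, 400, 90, 380], max_tokens=800))
# → [[1, 3, 0], [4, 2]]
```

Short utterances end up together in larger batches and long ones in smaller batches, which keeps the memory use per batch roughly constant.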

2. Results

2.1. Model Architecture for Evaluation

  • The FAIRSEQ S2T models tabulated above are evaluated on the English ASR benchmark LibriSpeech, as well as on the multilingual ST benchmarks MuST-C and CoVoST 2.

2.2. Speech Recognition (ASR)

Speech Recognition (ASR)
  • LibriSpeech is a de-facto standard ASR benchmark that contains 1,000 hours of English speech from audiobooks. Table 4 shows the dev and test WER of the proposed models on LibriSpeech clean and noisy sets.
  • Three architectures are evaluated: an RNN-based model (“B-Big”), Transformer-based models (“T-Sm”, “T-Md” and “T-Lg”), and a Conformer-based wav2vec 2.0 model (“CW-Lg”).

The first two architectures achieve WER competitive with state-of-the-art systems, and the implementation of the third architecture matches the state of the art.
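For readers new to the metric, WER (word error rate) is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. The sketch below is a generic implementation for illustration, not the scoring script used in the paper; `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))
# → 0.333
```

WER is usually reported as a percentage, so the value above would be listed as 33.3.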

2.3. Speech Translation (ST)

Speech Translation (ST) on MuST-C
  • MuST-C contains up to around 500 hours of English speech from TED talks with translations in 8 European languages.

The results represent the best systems in the high (AL > 6), medium (6 ≥ AL > 3) and low (AL ≤ 3) latency regimes, where AL is the average lagging latency metric. They clearly show the trade-off between model performance and prediction latency.
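The three latency regimes can be written as a small bucketing function. The sketch below also picks the best-scoring system per regime. The system names, scores, and AL values are made-up placeholders for illustration only and are not results from the paper; `latency_regime` and `best_per_regime` are hypothetical helper names.

```python
def latency_regime(average_lagging):
    """Map an average lagging (AL) value to the regime used in the review."""
    if average_lagging > 6:
        return "high"
    if average_lagging > 3:
        return "medium"
    return "low"


def best_per_regime(systems):
    """Keep the highest-scoring system in each latency regime.

    `systems` is a list of (name, score, average_lagging) tuples.
    """
    best = {}
    for name, score, al in systems:
        regime = latency_regime(al)
        if regime not in best or score > best[regime][1]:
            best[regime] = (name, score, al)
    return best


# Placeholder values, NOT numbers from the paper.
systems = [
    ("sys-a", 10.0, 2.5),
    ("sys-b", 12.0, 3.0),
    ("sys-c", 14.0, 4.5),
    ("sys-d", 16.0, 7.0),
]
for regime, (name, score, al) in best_per_regime(systems).items():
    print(regime, name, score, al)
```

Note the boundary handling: an AL of exactly 3 falls in the low regime and an AL of exactly 6 in the medium regime, matching the inequalities above.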

Speech Translation (ST) on CoVoST 2
  • CoVoST 2 contains a total of 2,880 hours of read speech in 22 languages from the open-source community, with 21 X-En directions and 15 En-X directions.

The Transformer-based models (“T-Sm” and “T-Md”) outperform the RNN-based ones (“B-Base” and “B-Big”) on all En-X and X-En directions. The performance gap tends to be larger for the higher-resource directions (the En-X directions, Fr-En, De-En and Es-En).

The multilingual models perform reasonably well, with a single universal model covering over 15 X-En or En-X directions.


