Brief Review — FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ
For the Purposes of Automatic Speech Recognition (ASR) and Speech Translation (ST)
FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ
FAIRSEQ S2T, by Meta — Fundamental AI Research (FAIR), Johns Hopkins University
2020 AACL Demo, Over 170 Citations (Sik-Ho Tsang @ Medium)
==== My Other Paper Readings Are Also Over Here ====
- This paper does not provide any new model or new dataset.
- Instead, FAIRSEQ S2T provides end-to-end workflows from data pre-processing and model training to offline (and online) inference.
- It also provides state-of-the-art RNN-based, Transformer-based and Conformer-based models, along with detailed open-source training recipes.
Outline
- FAIRSEQ S2T
- Results
1. FAIRSEQ S2T
1.1. FAIRSEQ S2T Extension
- FAIRSEQ (for NLP) provides a collection of Machine Translation (MT) models and Language Models (LMs).
FAIRSEQ S2T is an extension for Speech-to-Text (S2T) Processing, adding attention-based RNN models, Transformer models as well as the latest Conformer models for Automatic Speech Recognition (ASR) and Speech Translation (ST).
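To make this model family more concrete, below is a minimal, illustrative PyTorch sketch of the kind of architecture behind the Transformer-based S2T models: a small strided-convolution subsampler over log-mel filterbank features feeding a standard Transformer encoder-decoder. All class names and hyper-parameters here are my own placeholders, not FAIRSEQ S2T's actual implementation.

```python
import torch
import torch.nn as nn

class TinyS2TTransformer(nn.Module):
    """Illustrative speech-to-text Transformer: conv subsampling + encoder-decoder.
    Not the FAIRSEQ S2T code; names and sizes are placeholders, and positional
    encodings are omitted for brevity."""

    def __init__(self, n_mels=80, d_model=256, vocab_size=10000):
        super().__init__()
        # Two strided 1-D convolutions downsample the feature sequence ~4x in time.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=6, num_decoder_layers=3,
            dim_feedforward=1024, batch_first=True,
        )
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, feats, prev_tokens):
        # feats: (batch, time, n_mels) log-mel filterbanks; prev_tokens: (batch, tgt_len)
        x = self.subsample(feats.transpose(1, 2)).transpose(1, 2)
        y = self.embed_tokens(prev_tokens)
        tgt_mask = self.transformer.generate_square_subsequent_mask(y.size(1))
        out = self.transformer(x, y, tgt_mask=tgt_mask)
        return self.output_proj(out)  # (batch, tgt_len, vocab_size)

model = TinyS2TTransformer()
logits = model(torch.randn(2, 300, 80), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```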
1.2. Data Pre-Processing
FAIRSEQ S2T provides online speech data transforms, including CMVN (cepstral mean and variance normalization), speed perturbation and SpecAugment. It also has an open interface for user-defined transforms.
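To illustrate what these transforms do (not how FAIRSEQ S2T implements them), here is a small NumPy sketch of utterance-level CMVN followed by a simplified SpecAugment with one frequency mask and one time mask; function names and mask widths are my own illustrative choices.

```python
import numpy as np

def utterance_cmvn(feats):
    """Cepstral mean and variance normalization over a single utterance.
    feats: (time, n_mels) log-mel features."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + 1e-8)

def spec_augment(feats, max_f=27, max_t=100, rng=np.random):
    """Simplified SpecAugment: one frequency mask and one time mask (no time warping).
    Mask widths are illustrative defaults, not the exact policy from the paper."""
    feats = feats.copy()
    n_frames, n_mels = feats.shape
    f = rng.randint(0, max_f + 1)                    # frequency mask width
    f0 = rng.randint(0, max(1, n_mels - f))
    feats[:, f0:f0 + f] = 0.0
    t = rng.randint(0, min(max_t, n_frames) + 1)     # time mask width
    t0 = rng.randint(0, max(1, n_frames - t))
    feats[t0:t0 + t, :] = 0.0
    return feats

logmel = np.random.randn(500, 80).astype(np.float32)  # fake 5-second utterance
augmented = spec_augment(utterance_cmvn(logmel))
print(augmented.shape)  # (500, 80)
```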
1.3. Computation
FAIRSEQ is implemented in PyTorch and provides efficient batching, mixed precision training, and multi-GPU as well as multi-machine training (see the sketch below).
- It also provides metrics and visualization.
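These are standard PyTorch facilities; the generic sketch below shows a mixed-precision training step with torch.cuda.amp plus gradient accumulation, using a dummy model and data rather than FAIRSEQ's actual trainer.

```python
import torch

# Generic mixed-precision training loop with gradient accumulation.
# The model, optimizer and data are dummy placeholders, not FAIRSEQ code.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(80, 10).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
update_freq = 4  # accumulate gradients over 4 batches to simulate a larger batch

for step in range(8):
    feats = torch.randn(16, 80, device=device)
    target = torch.randint(0, 10, (16,), device=device)
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = torch.nn.functional.cross_entropy(model(feats), target)
    # Loss scaling guards FP16 gradients against underflow; dividing by
    # update_freq averages the loss across the accumulated batches.
    scaler.scale(loss / update_freq).backward()
    if (step + 1) % update_freq == 0:
        scaler.step(optimizer)   # unscales gradients, then calls optimizer.step()
        scaler.update()
        optimizer.zero_grad()
print("final loss:", loss.item())
```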
2. Results
2.1. Model Architecture for Evaluation
- FAIRSEQ S2T models as tabulated above are evaluated on the English ASR benchmark LibriSpeech, as well as the multilingual ST benchmarks MuST-C and CoVoST 2.
2.2. Speech Recognition (ASR)
- LibriSpeech is a de-facto standard ASR benchmark that contains 1,000 hours of English speech from audiobooks. Table 4 shows the dev and test WER of the proposed models on the LibriSpeech clean and other (noisy) sets.
- Three architectures are evaluated: an RNN-based model (“B-Big”), Transformer-based models (“T-Sm”, “T-Md” and “T-Lg”), and a Conformer-based wav2vec 2.0 model (“CW-Lg”).
The first two architectures achieve WER competitive with the state of the art, while the implementation of the third architecture matches the state of the art.
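For reference, WER is the word-level edit distance (substitutions, insertions and deletions) between hypothesis and reference, normalized by reference length. A minimal implementation of the metric (my own helper, not fairseq's scoring code) is sketched below.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(1, len(ref))

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # ~0.333
```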
2.3. Speech Translation (ST)
- MuST-C contains up to around 500 hours of English speech from TED talks, with translations into 8 European languages.
For simultaneous ST, the results represent the best systems in the high (AL > 6), medium (3 < AL ≤ 6) and low (AL ≤ 3) latency regimes, on which we can clearly see the trade-offs between model performance and prediction latency.
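AL here stands for Average Lagging (Ma et al., 2019), which measures how many source units the system lags behind an ideal simultaneous translator. A small sketch of the standard definition (my own helper function, with the usual rescaling factor gamma = |y|/|x|) is shown below.

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (AL) for simultaneous translation.
    delays[t] = number of source units read before emitting target unit t (i.e. g(t)).
    Averages g(t) - (t-1)/gamma up to the first target position emitted after
    the full source has been read."""
    gamma = tgt_len / src_len
    total, tau = 0.0, 0
    for t, g in enumerate(delays, start=1):
        total += g - (t - 1) / gamma
        tau = t
        if g >= src_len:  # full source consumed; stop accumulating
            break
    return total / tau

# wait-3 policy on a 6-token source and 6-token target: read 3, then alternate read/write.
print(average_lagging([3, 4, 5, 6, 6, 6], src_len=6, tgt_len=6))  # 3.0
```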
- CoVoST 2 contains a total of 2,880 hours of read speech in 22 languages from the open-source community, with 21 X-En directions and 15 En-X directions.
The Transformer-based models (“T-Sm” and “T-Md”) outperform the RNN-based ones (“B-Base” and “B-Big”) on all En-X and X-En directions. The performance gap tends to be larger in higher-resource settings (the En-X directions, as well as Fr-En, De-En and Es-En).
The multilingual models also perform reasonably well, covering over 15 X-En or En-X directions with a single universal model.
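ST quality above is reported in BLEU, which is typically computed with the sacreBLEU library; a minimal usage example with made-up sentences (not data from the paper) looks like this.

```python
import sacrebleu

# Toy corpus: one hypothesis per reference; real evaluation uses detokenized text.
hypotheses = ["the cat sits on the mat", "he went to school today"]
references = [["the cat sat on the mat", "he went to school today"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```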