Brief Review — FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ
For the Purposes of Automatic Speech Recognition (ASR) and Speech Translation (ST)
FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ
FAIRSEQ S2T, by Meta — Fundamental AI Research (FAIR), Johns Hopkins University
2020 AACL Demo, Over 170 Citations (Sik-Ho Tsang @ Medium)
- This paper does not provide any new model or new dataset.
- Instead, FAIRSEQ S2T provides end-to-end workflows, from data pre-processing and model training to offline and online inference.
- It also provides state-of-the-art RNN-based, Transformer-based as well as Conformer-based models and open-source detailed training recipes.
1. FAIRSEQ S2T
1.1. FAIRSEQ S2T Extension
- FAIRSEQ (for NLP) provides a collection of Machine Translation (MT) models and Language Models (LMs).
FAIRSEQ S2T is an extension for Speech-to-Text (S2T) Processing, adding attention-based RNN models, Transformer models as well as the latest Conformer models for Automatic Speech Recognition (ASR) and Speech Translation (ST).
1.2. Data Pre-Processing
FAIRSEQ S2T provides online speech data transforms, including CMVN, speed perturbation and SpecAugment. It also has an open interface for user-defined transforms.
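To make one of these transforms concrete, here is a minimal sketch of utterance-level CMVN (cepstral mean and variance normalization) in plain Python. It is not the FAIRSEQ S2T implementation; the function name `utterance_cmvn` and its parameters are hypothetical, and the input is assumed to be a list of frames, each a list of filterbank features.

```python
import math
from typing import List


def utterance_cmvn(
    features: List[List[float]], norm_vars: bool = True, eps: float = 1e-8
) -> List[List[float]]:
    """Normalize each feature dimension to zero mean (and optionally unit
    variance) over the time axis of a single utterance.

    Hypothetical helper for illustration, not a FAIRSEQ S2T function.
    """
    if not features:
        return []
    num_frames = len(features)
    dim = len(features[0])

    # Per-dimension mean over all frames of the utterance.
    means = [sum(frame[d] for frame in features) / num_frames for d in range(dim)]
    centered = [[frame[d] - means[d] for d in range(dim)] for frame in features]
    if not norm_vars:
        return centered

    # Per-dimension standard deviation of the centered features.
    stds = [
        math.sqrt(sum(row[d] ** 2 for row in centered) / num_frames)
        for d in range(dim)
    ]
    # eps guards against division by zero for constant dimensions.
    return [[row[d] / max(stds[d], eps) for d in range(dim)] for row in centered]


# Three frames with two feature dimensions each.
feats = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = utterance_cmvn(feats)
```

After normalization, every feature dimension has zero mean and unit variance across the utterance, which removes per-recording offsets such as channel or loudness differences.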
FAIRSEQ is implemented in PyTorch and it provides efficient batching, mixed precision training, multi-GPU as well as multi-machine training.
- It also provides metrics and visualization.
2.1. Model Architecture for Evaluation
- FAIRSEQ S2T models as tabulated above are evaluated on the English ASR benchmark LibriSpeech, as well as on the multilingual ST benchmarks MuST-C and CoVoST 2.
2.2. Speech Recognition (ASR)
- LibriSpeech is a de facto standard ASR benchmark that contains 1,000 hours of English speech from audiobooks. Table 4 shows the dev and test word error rate (WER) of the proposed models on the LibriSpeech clean and noisy sets.
- Three architectures, RNN-based model (“B-Big”), Transformer-based models (“T-Sm”, “T-Md” and “T-Lg”) and Conformer-based wav2vec 2.0 model (“CW-Lg”), are evaluated.
The first two architectures achieve WER competitive with state-of-the-art systems, and the implementation of the third architecture matches the state of the art.
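For readers unfamiliar with the metric, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a plain-Python illustration with hypothetical function names; it does no text normalization, whereas real evaluation scripts usually do.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for one whitespace-tokenized sentence pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Corpus-level WER is normally computed by summing errors and reference lengths over all utterances before dividing, rather than by averaging per-sentence scores.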
2.3. Speech Translation (ST)
- MuST-C contains up to around 500 hours of English speech from TED talks with translations in 8 European languages.
The results represent the best systems in the high (AL > 6), medium (6 ≥ AL > 3) and low (AL ≤ 3) latency regimes, where AL is average lagging; they clearly show the trade-off between model performance and prediction latency.
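This review does not define AL. As a rough illustration, the sketch below follows the commonly used token-level Average Lagging definition from the simultaneous translation literature: the average amount by which the system lags behind an ideal translator that keeps pace with the source. The function name and the `delays` representation are assumptions, and the units used for the thresholds above are not stated here.

```python
from typing import Sequence


def average_lagging(delays: Sequence[float], source_length: float) -> float:
    """Token-level Average Lagging (illustrative sketch).

    delays[t] is the amount of source consumed before emitting target
    token t (0-indexed); source_length is the total source length.
    """
    if not delays or source_length <= 0:
        raise ValueError("need at least one delay and a positive source length")
    target_length = len(delays)
    gamma = target_length / source_length  # target-to-source length ratio

    total = 0.0
    for t, consumed in enumerate(delays):
        # Lag of token t relative to an ideal, perfectly synchronous system.
        total += consumed - t / gamma
        # Stop at the first token emitted after the full source was read.
        if consumed >= source_length:
            return total / (t + 1)
    # Fallback if the full source is never consumed: average over all tokens.
    return total / target_length


# wait-3 policy on a 5-token source and 5-token target.
wait_3 = average_lagging([3, 4, 5, 5, 5], source_length=5)
# Offline model: reads the whole source before emitting anything.
offline = average_lagging([5, 5, 5, 5, 5], source_length=5)
```

With equal source and target lengths, a wait-k policy gives an AL of k, while an offline model gives an AL equal to the source length, which matches the intuition that lower AL means lower latency.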
- CoVoST 2 contains a total of 2,880 hours of read speech in 22 languages from the open-source community, with 21 X-En directions and 15 En-X directions.
The Transformer-based models (“T-Sm” and “T-Md”) outperform the RNN-based ones (“B-Base” and “B-Big”) on all En-X and X-En directions. The performance gap tends to be larger for directions with more training data (the En-X directions, Fr-En, De-En and Es-En).
The multilingual models perform reasonably well with a universal model for over 15 X-En or En-X directions.