https://aclanthology.org/2021.acl-long.80
A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation.
VoxPopuli provides
- 400K hours of unlabelled speech data for 23 languages
- 1.8K hours of transcribed speech data for 16 languages
- 17.3K hours of speech-to-speech interpretation data for 15x15 directions
The raw data is collected from 2009-2020 European Parliament event recordings. We acknowledge the European Parliament for creating and sharing these materials.
Unlabelled and transcribed data
Language | Code | Unlabelled Hours (v1/v2) | Transcribed Hours | Transcribed Speakers | Transcribed Tokens | LM Tokens |
---|---|---|---|---|---|---|
English | En | 4.5K/24.1K | 543 | 1313 | 4.8M | 60.1M |
German | De | 4.5K/23.2K | 282 | 531 | 2.3M | 50.0M |
French | Fr | 4.5K/22.8K | 211 | 534 | 2.1M | 58.6M |
Spanish | Es | 4.4K/21.4K | 166 | 305 | 1.6M | 57.4M |
Polish | Pl | 4.5K/21.2K | 111 | 282 | 802K | 13.6M |
Italian | It | 4.6K/21.9K | 91 | 306 | 757K | 52.1M |
Romanian | Ro | 4.5K/17.9K | 89 | 164 | 739K | 10.3M |
Hungarian | Hu | 4.4K/17.7K | 63 | 143 | 431K | 13.0M |
Czech | Cs | 4.5K/18.7K | 62 | 138 | 461K | 13.5M |
Dutch | Nl | 4.5K/19.0K | 53 | 221 | 488K | 54.6M |
Finnish | Fi | 4.4K/14.2K | 27 | 84 | 160K | 34.5M |
Croatian | Hr | 2.7K/8.1K | 43 | 83 | 337K | 285K |
Slovak | Sk | 4.4K/12.1K | 35 | 96 | 270K | 13.3M |
Slovene | Sl | 4.4K/11.3K | 10 | 45 | 76K | 12.6M |
Estonian | Et | 4.3K/10.6K | 3 | 29 | 18K | 11.3M |
Lithuanian | Lt | 4.3K/14.4K | 2 | 21 | 10K | 11.5M |
Portuguese | Pt | 4.4K/17.5K | - | - | - | - |
Bulgarian | Bg | 4.3K/17.6K | - | - | - | - |
Greek | El | 4.4K/17.7K | - | - | - | - |
Latvian | Lv | 4.4K/13.1K | - | - | - | - |
Maltese | Mt | 4.4K/9.1K | - | - | - | - |
Swedish | Sv | 4.5K/16.3K | - | - | - | - |
Danish | Da | 4.3K/13.6K | - | - | - | - |
Total | | 100K/384K | 1791 | 4295 | 15M | 467M |
Speech-to-speech interpretation data
Source/Target | En | De | Fr | Es | Pl | It | Ro | Hu | Cs | Nl | Fi | Sk | Sl | Lt | Da | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
En | - | 463 | 427 | 441 | 432 | 461 | 457 | 382 | 427 | 400 | 442 | 433 | 434 | 398 | 370 | 6.0K |
De | 187 | - | 196 | 204 | 214 | 217 | 198 | 205 | 214 | 196 | 217 | 208 | 218 | 164 | 179 | 2.8K |
Fr | 169 | 187 | - | 187 | 172 | 197 | 195 | 144 | 170 | 158 | 168 | 168 | 156 | 139 | 134 | 2.3K |
Es | 130 | 138 | 135 | - | 118 | 148 | 128 | 93 | 118 | 115 | 124 | 114 | 108 | 83 | 86 | 1.6K |
Pl | 68 | 66 | 54 | 55 | - | 67 | 55 | 43 | 67 | 42 | 55 | 62 | 57 | 50 | 34 | 775 |
It | 69 | 77 | 76 | 79 | 72 | - | 75 | 61 | 68 | 64 | 71 | 66 | 70 | 53 | 60 | 961 |
Ro | 60 | 59 | 59 | 58 | 49 | 61 | - | 38 | 50 | 43 | 48 | 50 | 46 | 38 | 29 | 688 |
Hu | 30 | 38 | 25 | 27 | 29 | 30 | 27 | - | 27 | 20 | 31 | 29 | 26 | 21 | 18 | 378 |
Cs | 39 | 35 | 29 | 30 | 36 | 32 | 31 | 23 | - | 23 | 29 | 55 | 29 | 25 | 18 | 434 |
Nl | 31 | 43 | 35 | 29 | 27 | 38 | 24 | 25 | 25 | - | 32 | 25 | 23 | 19 | 25 | 401 |
Fi | 15 | 18 | 15 | 13 | 13 | 13 | 13 | 12 | 13 | 11 | - | 14 | 12 | 11 | 9 | 182 |
Hr | 31 | 27 | 27 | 24 | 27 | 28 | 24 | 22 | 24 | 22 | 24 | 26 | 37 | 21 | 20 | 384 |
Sk | 21 | 22 | 14 | 16 | 19 | 16 | 16 | 14 | 32 | 13 | 16 | - | 17 | 13 | 10 | 239 |
Sl | 6 | 6 | 4 | 5 | 5 | 6 | 5 | 4 | 5 | 4 | 5 | 6 | - | 4 | 3 | 68 |
Lt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | - | 0 | 13 |
Total | 857 | 1.2K | 1.1K | 1.2K | 1.2K | 1.3K | 1.2K | 1.1K | 1.2K | 1.1K | 1.3K | 1.3K | 1.2K | 1.0K | 995 | 17.3K |
We provide raw audios as well as scripts to segment and align them with transcription/interpretation. The output format is Ogg Vorbis (16000Hz, 16-bit, mono-channel), which is supported by common libraries such as libsndfile and libsox (with Python frontends such as soundfile and torchaudio).
As the first step, clone this repo for the processing scripts
git clone https://github.com/facebookresearch/voxpopuli.git
and install required PyPI packages:
pip install -r requirements.txt
For unlabelled data, first download raw audios via
python -m voxpopuli.download_audios --root [ROOT] --subset [SUBSET]
which saves audios to ${ROOT}/raw_audios/[language]/[year]/[recording_id].ogg.
SUBSET specifies the data subset to download:
--subset | # Languages | Hours | Years | Size |
---|---|---|---|---|
en, de, fr, es, pl, it, ro, hu, cs, nl, fi, hr, sk, sl, et, lt, pt, bg, el, lv, mt, sv or da | 1 | 2.7K-4.6K | 2009-2020 | 44G-75G |
en_v2, de_v2, fr_v2, es_v2, pl_v2, it_v2, ro_v2, hu_v2, cs_v2, nl_v2, fi_v2, hr_v2, sk_v2, sl_v2, et_v2, lt_v2, pt_v2, bg_v2, el_v2, lv_v2, mt_v2, sv_v2 or da_v2 | 1 | 8.1K-24.1K | 2009-2020 | 130G-385G |
10k | 23 | 10K | 2019-2020 | 170G |
100k | 23 | 100K | 2009-2020 | 1.7T |
400k | 23 | 400K | 2009-2020 | 6.4T |
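Because the larger subsets run to several terabytes, it can help to validate the subset name before starting a download. The following is a minimal sketch, not part of the repository: `build_download_command` and `VALID_SUBSETS` are hypothetical names, the subset names are copied from the table above, and `asr` is included because the transcribed-data instructions use it.

```python
import sys

# Language codes copied from the subset table above.
LANGUAGES = [
    "en", "de", "fr", "es", "pl", "it", "ro", "hu", "cs", "nl", "fi", "hr",
    "sk", "sl", "et", "lt", "pt", "bg", "el", "lv", "mt", "sv", "da",
]

# Per-language v1 and v2 subsets, the multilingual subsets, and "asr".
VALID_SUBSETS = (
    set(LANGUAGES)
    | {f"{lang}_v2" for lang in LANGUAGES}
    | {"10k", "100k", "400k", "asr"}
)


def build_download_command(root, subset):
    """Return the argv list for the download step (hypothetical helper)."""
    if subset not in VALID_SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    return [
        sys.executable, "-m", "voxpopuli.download_audios",
        "--root", str(root),
        "--subset", subset,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the cloned repository; building it first makes a typo in the subset name fail before any data is fetched.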
Then, segment these audios via
python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset [SUBSET]
which outputs to ${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg
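After segmentation, a quick sanity check is to count the segments produced per year. This is a sketch that assumes only the output layout stated above (`${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg`); `count_segments_by_year` is a hypothetical helper, not a function from the repository.

```python
from collections import Counter
from pathlib import Path


def count_segments_by_year(root, language):
    """Count .ogg segments per year directory for one language."""
    lang_dir = Path(root) / "unlabelled_data" / language
    counts = Counter()
    # Each segment lives at [year]/[segment_id].ogg below the language directory.
    for ogg in lang_dir.glob("*/*.ogg"):
        counts[ogg.parent.name] += 1
    return dict(sorted(counts.items()))
```

A year that is missing from the result, or has far fewer segments than its neighbours, may point to an interrupted download or segmentation run.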
For transcribed (ASR) data, first download raw audios via
python -m voxpopuli.download_audios --root [ROOT] --subset asr
which saves audios to ${ROOT}/raw_audios/original/[year]/[recording_id].ogg.
Then, segment these audios and align them with transcripts via
python -m voxpopuli.get_asr_data --root [ROOT] --lang [LANGUAGE]
which outputs
- audios
${ROOT}/transcribed_data/[language]/[year]/[segment_id].ogg
- per-split manifest (ID, transcript, speaker ID)
${ROOT}/transcribed_data/[language]/asr_[split].tsv
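The per-split manifests can be read with Python's standard `csv` module. The sketch below assumes the TSV files start with a header row and does not assume specific column names, since the text above only says they contain ID, transcript and speaker ID. `read_manifest` and `asr_manifest_path` are hypothetical helpers, and the split name passed in is up to the caller.

```python
import csv
from pathlib import Path


def asr_manifest_path(root, language, split):
    """Build ${ROOT}/transcribed_data/[language]/asr_[split].tsv."""
    return Path(root) / "transcribed_data" / language / f"asr_{split}.tsv"


def read_manifest(tsv_path):
    """Read a tab-separated manifest with a header row into a list of dicts."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts untouched.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)
```

Inspecting `rows[0].keys()` on a real manifest shows the actual column names before any downstream code depends on them.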
For speech-to-speech interpretation data, first follow the instructions above to set up ASR data (source audios and transcripts).
Then, download target audios via
python -m voxpopuli.download_audios --root [ROOT] --subset [TARGET_LANGUAGE]
which saves audios to ${ROOT}/raw_audios/[target_language]/[year]/[recording_id].ogg.
Finally, segment these audios and match them with source ones via
python -m voxpopuli.get_s2s_data --root [ROOT] --source-lang [SOURCE_LANGUAGE] --target-lang [TARGET_LANGUAGE]
which outputs
- target audios
${ROOT}/transcribed_data/[language]/[target_language]/[year]/[segment_id].ogg
- manifest (source ID, transcript, speaker ID, target ID)
${ROOT}/transcribed_data/[language]/[target_language]/s2s.tsv
We also human-transcribe part of the target audios (for English, French and Spanish only) to allow more accurate alignments.
To use them instead of machine transcriptions in the alignments, add --use-annotated-target to the command line.
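The interpretation steps depend on each other, so it may be convenient to generate them in order. The sketch below only assembles the four command lines described above; `s2s_commands` is a hypothetical helper, and the check that restricts `--use-annotated-target` to English, French and Spanish targets is an assumption drawn from the sentence above, not documented behaviour of the script.

```python
import sys

# Target languages with human-transcribed audios, per the text above.
ANNOTATED_TARGETS = {"en", "fr", "es"}


def s2s_commands(root, source_lang, target_lang, use_annotated_target=False):
    """Return the four argv lists for the speech-to-speech pipeline, in order."""
    if source_lang == target_lang:
        raise ValueError("source and target languages must differ")
    if use_annotated_target and target_lang not in ANNOTATED_TARGETS:
        raise ValueError(f"no annotated targets for {target_lang!r}")
    root = str(root)
    py = [sys.executable, "-m"]
    s2s = py + [
        "voxpopuli.get_s2s_data", "--root", root,
        "--source-lang", source_lang, "--target-lang", target_lang,
    ]
    if use_annotated_target:
        s2s.append("--use-annotated-target")
    return [
        # 1-2: set up ASR data for the source language.
        py + ["voxpopuli.download_audios", "--root", root, "--subset", "asr"],
        py + ["voxpopuli.get_asr_data", "--root", root, "--lang", source_lang],
        # 3: download target-language audios.
        py + ["voxpopuli.download_audios", "--root", root, "--subset", target_lang],
        # 4: segment target audios and match them with source segments.
        s2s,
    ]
```

Each list can be run with `subprocess.run(cmd, check=True)` so that a failing step stops the sequence.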
We combine VoxPopuli transcripts and text data from Europarl for LM training.
Download VoxPopuli and Europarl text data, process the raw text and generate the vocabulary via
python -m voxpopuli.get_lm_data --root [ROOT] --lang [LANGUAGE]
which outputs
- sentences
${ROOT}/lm_data/[language]/sentences.txt
- vocabulary
${ROOT}/lm_data/[language]/vocabulary.txt
To train an n-gram LM with KenLM, run
${KENLM_PATH}/lmplz -o ${n} --limit_vocab_file [OUT_VOCAB_FILE] < [OUT_TEXT_FILE] > ${n}gram_lm.arpa
${KENLM_PATH}/build_binary ${n}gram_lm.arpa ${n}gram_lm.bin
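Since `--limit_vocab_file` restricts the LM to a fixed vocabulary, it can be useful to measure how many tokens in a text fall outside it. The sketch below assumes one whitespace-tokenized sentence per line in the text file and one word per line in the vocabulary file; those file formats are assumptions, and `oov_rate` is a hypothetical helper rather than part of the repository or KenLM.

```python
def oov_rate(sentences_path, vocabulary_path):
    """Return the fraction of tokens in the text that are not in the vocabulary."""
    with open(vocabulary_path, encoding="utf-8") as f:
        # Assumed format: one vocabulary word per non-empty line.
        vocab = {line.strip() for line in f if line.strip()}
    total = 0
    oov = 0
    with open(sentences_path, encoding="utf-8") as f:
        for line in f:
            # Assumed format: tokens separated by whitespace.
            for token in line.split():
                total += 1
                if token not in vocab:
                    oov += 1
    return oov / total if total else 0.0
```

Running it on `sentences.txt` and `vocabulary.txt` for one language gives a rough idea of how much text the vocabulary limit excludes.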
We provide pre-trained wav2vec 2.0 models (implemented in fairseq and wav2letter/flashlight) for downstream speech tasks:
Language(s) | Pre-training Hours | Base Model (95M) | Large Model (317M) |
---|---|---|---|
Es | 4.4K | fairseq | fairseq |
Fr | 4.5K | fairseq | fairseq |
It | 4.6K | fairseq | fairseq |
Nl | 4.5K | fairseq | fairseq |
Sv | 4.5K | fairseq | fairseq |
All 23 languages | 10K | fairseq | fairseq |
All 23 languages | 100K | fairseq / wav2letter | fairseq |
In our paper (Section 4.3.1), we evaluated these models on the Common Voice corpus in the normal setting and the few-shot phoneme recognition setting.
A wav2letter implementation as well as a checkpoint pretrained on VoxPopuli 100k (base model) is also available in the Wav2letter repository.
The complete fine-tuned ASR baselines for this codebase should come soon. The wav2letter implementation follows this paper.
For the VoxPopuli ASR task, we provide Transformer baselines, fine-tuned wav2vec2 models (Base 10K) as well as n-gram LMs (trained with KenLM) and their lexicons:
Language | ASR (fairseq) | LM (kenLM) | Lexicon |
---|---|---|---|
Cs | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
De | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
En | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Es | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Et | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Fi | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Fr | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Hr | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Hu | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
It | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Lt | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Nl | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Pl | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Ro | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Sk | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
Sl | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
We also provide CoVoST 2 + EuroParl-ST ASR Transformer models that are self-trained on 3000h VoxPopuli unlabelled data.
Language | CoVoST 2 Test (WER) | EuroParl-ST Test (WER) | Model (fairseq) |
---|---|---|---|
De | 17.3 | 21.4 | s2t_transformer_l |
Es | 13.2 | 15.3 | s2t_transformer_l |
Fr | 17.0 | 19.0 | s2t_transformer_l |
Please refer to the S2T examples for the use of Transformer model checkpoints.
We provide CoVoST 2 + EuroParl-ST ST Transformer models that are jointly trained with 400h VoxPopuli weakly labelled data.
Direction | CoVoST 2 Test (BLEU) | EuroParl-ST Test (BLEU) | Model (fairseq) |
---|---|---|---|
De-En | 23.4 | 24.4 | s2t_transformer_l |
Es-En | 29.7 | 28.4 | s2t_transformer_l |
Fr-En | 30.3 | 31.1 | s2t_transformer_l |
Please refer to the S2T examples for the use of these checkpoints.
- 2021-07-26: New unlabelled data (additional 300K hours) released.
- 2021-03-03: VoxPopuli released.
License | |
---|---|
VoxPopuli Data | CC0 (see also European Parliament's legal notice for the raw data) |
LM Data | (Please check out the Europarl website for the Europarl portion) |
Pre-trained Models | CC BY-NC 4.0 |
Code | CC BY-NC 4.0 |
Changhan Wang, Morgane Rivière, Ann Lee
@inproceedings{wang-etal-2021-voxpopuli,
title = "{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation",
author = "Wang, Changhan and
Riviere, Morgane and
Lee, Ann and
Wu, Anne and
Talnikar, Chaitanya and
Haziza, Daniel and
Williamson, Mary and
Pino, Juan and
Dupoux, Emmanuel",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.80",
pages = "993--1003",
}