
VoxPopuli

https://aclanthology.org/2021.acl-long.80

A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation.

Overview

VoxPopuli provides

  • 400K hours of unlabelled speech data for 23 languages
  • 1.8K hours of transcribed speech data for 16 languages
  • 17.3K hours of speech-to-speech interpretation data for 15x15 directions

The raw data is collected from 2009-2020 European Parliament event recordings. We acknowledge the European Parliament for creating and sharing these materials.

Detailed statistics

Unlabelled and transcribed data

Language Code Unlabelled Hours (v1/v2) Transcribed Hours Transcribed Speakers Transcribed Tokens LM Tokens
English En 4.5K/24.1K 543 1313 4.8M 60.1M
German De 4.5K/23.2K 282 531 2.3M 50.0M
French Fr 4.5K/22.8K 211 534 2.1M 58.6M
Spanish Es 4.4K/21.4K 166 305 1.6M 57.4M
Polish Pl 4.5K/21.2K 111 282 802K 13.6M
Italian It 4.6K/21.9K 91 306 757K 52.1M
Romanian Ro 4.5K/17.9K 89 164 739K 10.3M
Hungarian Hu 4.4K/17.7K 63 143 431K 13.0M
Czech Cs 4.5K/18.7K 62 138 461K 13.5M
Dutch Nl 4.5K/19.0K 53 221 488K 54.6M
Finnish Fi 4.4K/14.2K 27 84 160K 34.5M
Croatian Hr 2.7K/8.1K 43 83 337K 285K
Slovak Sk 4.4K/12.1K 35 96 270K 13.3M
Slovene Sl 4.4K/11.3K 10 45 76K 12.6M
Estonian Et 4.3K/10.6K 3 29 18K 11.3M
Lithuanian Lt 4.3K/14.4K 2 21 10K 11.5M
Portuguese Pt 4.4K/17.5K - - - -
Bulgarian Bg 4.3K/17.6K - - - -
Greek El 4.4K/17.7K - - - -
Latvian Lv 4.4K/13.1K - - - -
Maltese Mt 4.4K/9.1K - - - -
Swedish Sv 4.5K/16.3K - - - -
Danish Da 4.3K/13.6K - - - -
Total 100K/384K 1791 4295 15M 467M
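As a quick consistency check, the per-language transcribed hours and speaker counts in the table above can be summed and compared with the Total row. The numbers below are copied from the table; the variable names are illustrative only.

```python
# (transcribed hours, transcribed speakers) per language, copied from the table above.
transcribed = {
    "En": (543, 1313), "De": (282, 531), "Fr": (211, 534), "Es": (166, 305),
    "Pl": (111, 282), "It": (91, 306), "Ro": (89, 164), "Hu": (63, 143),
    "Cs": (62, 138), "Nl": (53, 221), "Fi": (27, 84), "Hr": (43, 83),
    "Sk": (35, 96), "Sl": (10, 45), "Et": (3, 29), "Lt": (2, 21),
}

total_hours = sum(hours for hours, _ in transcribed.values())
total_speakers = sum(speakers for _, speakers in transcribed.values())

print(len(transcribed), total_hours, total_speakers)  # 16 1791 4295
```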

Speech-to-speech interpretation data

Source/Target En De Fr Es Pl It Ro Hu Cs Nl Fi Sk Sl Lt Da Total
En - 463 427 441 432 461 457 382 427 400 442 433 434 398 370 6.0K
De 187 - 196 204 214 217 198 205 214 196 217 208 218 164 179 2.8K
Fr 169 187 - 187 172 197 195 144 170 158 168 168 156 139 134 2.3K
Es 130 138 135 - 118 148 128 93 118 115 124 114 108 83 86 1.6K
Pl 68 66 54 55 - 67 55 43 67 42 55 62 57 50 34 775
It 69 77 76 79 72 - 75 61 68 64 71 66 70 53 60 961
Ro 60 59 59 58 49 61 - 38 50 43 48 50 46 38 29 688
Hu 30 38 25 27 29 30 27 - 27 20 31 29 26 21 18 378
Cs 39 35 29 30 36 32 31 23 - 23 29 55 29 25 18 434
Nl 31 43 35 29 27 38 24 25 25 - 32 25 23 19 25 401
Fi 15 18 15 13 13 13 13 12 13 11 - 14 12 11 9 182
Hr 31 27 27 24 27 28 24 22 24 22 24 26 37 21 20 384
Sk 21 22 14 16 19 16 16 14 32 13 16 - 17 13 10 239
Sl 6 6 4 5 5 6 5 4 5 4 5 6 - 4 3 68
Lt 1 1 1 1 1 1 1 1 1 1 1 1 1 - 0 13
Total 857 1.2K 1.1K 1.2K 1.2K 1.3K 1.2K 1.1K 1.2K 1.1K 1.3K 1.3K 1.2K 1.0K 995 17.3K

Getting Data

We provide raw audios as well as scripts to segment them and align them with transcriptions/interpretations. The output format is Ogg Vorbis (16000Hz, 16-bit, mono-channel), which is supported by common libraries such as libsndfile and libsox (both have Python frontends, e.g. soundfile and torchaudio).

As a first step, clone this repo to get the processing scripts:

git clone https://github.com/facebookresearch/voxpopuli.git

and, from the repository root, install the required PyPI packages:

pip install -r requirements.txt

Unlabelled Data

First, download raw audios via

python -m voxpopuli.download_audios --root [ROOT] --subset [SUBSET]

which saves audios to ${ROOT}/raw_audios/[language]/[year]/[recording_id].ogg.

SUBSET specifies the data subset to download:

--subset # Languages Hours Years Size
en, de, fr, es, pl, it, ro, hu, cs, nl, fi, hr, sk, sl, et, lt, pt, bg, el, lv, mt, sv or da 1 2.7K-4.6K 2009-2020 44G-75G
en_v2, de_v2, fr_v2, es_v2, pl_v2, it_v2, ro_v2, hu_v2, cs_v2, nl_v2, fi_v2, hr_v2, sk_v2, sl_v2, et_v2, lt_v2, pt_v2, bg_v2, el_v2, lv_v2, mt_v2, sv_v2 or da_v2 1 8.1K-24.1K 2009-2020 130G-385G
10k 23 10K 2019-2020 170G
100k 23 100K 2009-2020 1.7T
400k 23 400K 2009-2020 6.4T
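The subset names in the table above can be checked before starting a multi-terabyte download. The following is a minimal sketch, not part of the repo: `unlabelled_download_command` is a hypothetical helper that only validates the name and builds the argument list for the `voxpopuli.download_audios` command shown earlier. It covers the unlabelled subsets only, so `asr` is deliberately not accepted here.

```python
import sys

# The 23 language codes listed in the --subset table.
LANGS = ("en de fr es pl it ro hu cs nl fi hr sk sl "
         "et lt pt bg el lv mt sv da").split()

# Single-language v1 subsets, their v2 counterparts, and the multilingual subsets.
VALID_SUBSETS = set(LANGS) | {f"{lang}_v2" for lang in LANGS} | {"10k", "100k", "400k"}


def unlabelled_download_command(root, subset):
    """Return the argument list for downloading one unlabelled subset (hypothetical helper)."""
    if subset not in VALID_SUBSETS:
        raise ValueError(f"unknown unlabelled subset: {subset!r}")
    return [sys.executable, "-m", "voxpopuli.download_audios",
            "--root", str(root), "--subset", subset]


print(" ".join(unlabelled_download_command("/data/voxpopuli", "en_v2")[1:]))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root once the target disk has enough space for the size listed in the table.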

Then, segment these audios via

python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset [SUBSET]

which outputs to ${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg
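Once segmentation has finished, the segments can be enumerated directly from the output layout described above. This is a small sketch using only the standard library; `iter_unlabelled_segments` is a hypothetical helper, and it assumes nothing beyond the `[language]/[year]/[segment_id].ogg` directory structure.

```python
from pathlib import Path


def iter_unlabelled_segments(root, language, year=None):
    """Yield segment paths under ROOT/unlabelled_data/<language>/<year>/ in sorted order."""
    base = Path(root) / "unlabelled_data" / language
    # Restrict to one year if given, otherwise scan every year directory.
    pattern = f"{year}/*.ogg" if year is not None else "*/*.ogg"
    yield from sorted(base.glob(pattern))
```

For example, `sum(1 for _ in iter_unlabelled_segments(root, "en", 2019))` counts the English segments from 2019.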

Transcribed (ASR) Data

First, download raw audios via

python -m voxpopuli.download_audios --root [ROOT] --subset asr

which saves audios to ${ROOT}/raw_audios/original/[year]/[recording_id].ogg.

Then, segment these audios and align them with transcripts via

python -m voxpopuli.get_asr_data --root [ROOT] --lang [LANGUAGE]

which outputs

  • audios ${ROOT}/transcribed_data/[language]/[year]/[segment_id].ogg
  • per-split manifest (ID, transcript, speaker ID) ${ROOT}/transcribed_data/[language]/asr_[split].tsv
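The per-split manifests can be read with the standard `csv` module. The sketch below assumes the manifest is tab-separated with a header row; the exact column names are not stated here, so the code returns whatever columns the file declares rather than hard-coding them. `read_asr_manifest` is a hypothetical helper, and the split name is passed through unchanged.

```python
import csv
from pathlib import Path


def read_asr_manifest(root, language, split):
    """Load ROOT/transcribed_data/<language>/asr_<split>.tsv as a list of dicts."""
    path = Path(root) / "transcribed_data" / language / f"asr_{split}.tsv"
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)
```

Each returned row maps a header name to its value, so the first row's keys show which columns (ID, transcript, speaker ID) are available in a given manifest.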

Speech-to-Speech Interpretation Data

First, follow the instructions above to set up ASR data (source audios and transcripts).

Then, download target audios via

python -m voxpopuli.download_audios --root [ROOT] --subset [TARGET_LANGUAGE]

which saves audios to ${ROOT}/raw_audios/[target_language]/[year]/[recording_id].ogg.

Finally, segment these audios and match them with source ones via

python -m voxpopuli.get_s2s_data --root [ROOT] --source-lang [SOURCE_LANGUAGE] --target-lang [TARGET_LANGUAGE]

which outputs

  • target audios ${ROOT}/transcribed_data/[language]/[target_language]/[year]/[segment_id].ogg
  • manifest (source ID, transcript, speaker ID, target ID) ${ROOT}/transcribed_data/[language]/[target_language]/s2s.tsv

We also human-transcribe part of the target audios (for English, French and Spanish only) to allow more accurate alignments. To use them instead of machine transcriptions in the alignments, add --use-annotated-target to the command line.
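The constraints on interpretation directions can be captured in a small command builder. This is a sketch under stated assumptions: the source and target language sets are taken from the rows and columns of the interpretation table above, lowercase codes are assumed on the command line (as for `--subset`), and the check on `--use-annotated-target` mirrors the sentence above rather than any validation the script itself may perform. `s2s_command` is a hypothetical helper.

```python
import sys

# Rows (sources) and columns (targets) of the interpretation table, lowercased.
SOURCE_LANGS = set("en de fr es pl it ro hu cs nl fi hr sk sl lt".split())
TARGET_LANGS = set("en de fr es pl it ro hu cs nl fi sk sl lt da".split())
# Targets with human transcriptions, per the paragraph above.
ANNOTATED_TARGETS = {"en", "fr", "es"}


def s2s_command(root, source, target, use_annotated_target=False):
    """Return the argument list for voxpopuli.get_s2s_data (hypothetical helper)."""
    if source not in SOURCE_LANGS:
        raise ValueError(f"unsupported source language: {source!r}")
    if target not in TARGET_LANGS or target == source:
        raise ValueError(f"unsupported target language: {target!r}")
    cmd = [sys.executable, "-m", "voxpopuli.get_s2s_data", "--root", str(root),
           "--source-lang", source, "--target-lang", target]
    if use_annotated_target:
        if target not in ANNOTATED_TARGETS:
            raise ValueError(f"no human transcriptions for target {target!r}")
        cmd.append("--use-annotated-target")
    return cmd
```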

Language Modeling (LM) Data

We combine VoxPopuli transcripts and text data from Europarl for LM training.

Download VoxPopuli and Europarl text data, process the raw text and generate the vocabulary via

python -m voxpopuli.get_lm_data --root [ROOT] --lang [LANGUAGE]

which outputs

  • sentences ${ROOT}/lm_data/[language]/sentences.txt
  • vocabulary ${ROOT}/lm_data/[language]/vocabulary.txt

To train an n-gram LM with KenLM, run

${KENLM_PATH}/lmplz -o ${n} --limit_vocab_file [OUT_VOCAB_FILE] < [OUT_TEXT_FILE] > ${n}gram_lm.arpa
${KENLM_PATH}/build_binary ${n}gram_lm.arpa ${n}gram_lm.bin
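The two KenLM commands above can also be driven from Python. The sketch below assumes that `[OUT_TEXT_FILE]` and `[OUT_VOCAB_FILE]` are the `sentences.txt` and `vocabulary.txt` files written by `voxpopuli.get_lm_data`; the function names are hypothetical, and the shell redirections are replaced by file handles passed to `subprocess.run`.

```python
import subprocess
from pathlib import Path


def kenlm_commands(kenlm_path, root, language, n):
    """Build the lmplz and build_binary argument lists for an n-gram LM."""
    lm_dir = Path(root) / "lm_data" / language
    arpa, binary = f"{n}gram_lm.arpa", f"{n}gram_lm.bin"
    lmplz = [str(Path(kenlm_path) / "lmplz"), "-o", str(n),
             "--limit_vocab_file", str(lm_dir / "vocabulary.txt")]
    build = [str(Path(kenlm_path) / "build_binary"), arpa, binary]
    return lmplz, lm_dir / "sentences.txt", arpa, build


def train_ngram_lm(kenlm_path, root, language, n):
    """Run both commands; requires the KenLM binaries to be built already."""
    lmplz, text, arpa, build = kenlm_commands(kenlm_path, root, language, n)
    # Equivalent to: lmplz ... < sentences.txt > {n}gram_lm.arpa
    with open(text, "rb") as fin, open(arpa, "wb") as fout:
        subprocess.run(lmplz, stdin=fin, stdout=fout, check=True)
    subprocess.run(build, check=True)
```

Calling `train_ngram_lm(kenlm_path, root, "en", 3)` would write `3gram_lm.arpa` and `3gram_lm.bin` to the current working directory, matching the shell commands.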

Pre-trained Models

wav2vec 2.0

We provide pre-trained wav2vec 2.0 models (implemented in fairseq and wav2letter/flashlight) for downstream speech tasks:

Language(s) Pre-training Hours Base Model (95M) Large Model (317M)
Es 4.4K fairseq fairseq
Fr 4.5K fairseq fairseq
It 4.6K fairseq fairseq
Nl 4.5K fairseq fairseq
Sv 4.5K fairseq fairseq
All 23 languages 10K fairseq fairseq
All 23 languages 100K fairseq / wav2letter fairseq

In our paper (Section 4.3.1), we evaluated these models on the Common Voice corpus in the normal setting and the few-shot phoneme recognition setting.

Wav2letter C++ implementation

A wav2letter implementation as well as a checkpoint pretrained on VoxPopuli 100k (base model) is also available in the Wav2letter repository.

The complete fine-tuned ASR baselines for this codebase should come soon. The wav2letter implementation follows this paper.

ASR and LM

For the VoxPopuli ASR task, we provide Transformer baselines, fine-tuned wav2vec2 models (Base 10K) as well as n-gram LMs (trained with KenLM) and their lexicons:

Language ASR (fairseq) LM (kenLM) Lexicon
Cs baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
De baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
En baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Es baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Et baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Fi baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Fr baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Hr baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Hu baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
It baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Lt baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Nl baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Pl baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Ro baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Sk baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon
Sl baseline, fine-tuned wav2vec2 3-gram, 5-gram lexicon

We also provide CoVoST 2 + EuroParl-ST ASR Transformer models that are self-trained on 3000h VoxPopuli unlabelled data.

Language CoVoST 2 Test (WER) EuroParl-ST Test (WER) Model (fairseq)
De 17.3 21.4 s2t_transformer_l
Es 13.2 15.3 s2t_transformer_l
Fr 17.0 19.0 s2t_transformer_l

Please refer to the S2T examples for the use of Transformer model checkpoints.

Speech-to-Text Translation (ST)

We provide CoVoST 2 + EuroParl-ST ST Transformer models that are jointly trained with 400h VoxPopuli weakly labelled data.

Direction CoVoST 2 Test (BLEU) EuroParl-ST Test (BLEU) Model (fairseq)
De-En 23.4 24.4 s2t_transformer_l
Es-En 29.7 28.4 s2t_transformer_l
Fr-En 30.3 31.1 s2t_transformer_l

Please refer to the S2T examples for the use of these checkpoints.

What's New

  • 2021-07-26: New unlabelled data (additional 300K hours) released.
  • 2021-03-03: VoxPopuli released.

License

Item License
VoxPopuli Data CC0 (see also European Parliament's legal notice for the raw data)
LM Data (Please check out the Europarl website for the Europarl portion)
Pre-trained Models CC BY-NC 4.0
Code CC BY-NC 4.0

Contact

Changhan Wang, Morgane Rivière, Ann Lee

Citation

@inproceedings{wang-etal-2021-voxpopuli,
    title = "{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation",
    author = "Wang, Changhan  and
      Riviere, Morgane  and
      Lee, Ann  and
      Wu, Anne  and
      Talnikar, Chaitanya  and
      Haziza, Daniel  and
      Williamson, Mary  and
      Pino, Juan  and
      Dupoux, Emmanuel",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.80",
    pages = "993--1003",
}
