Pre-trained word vectors

We are publishing pre-trained word vectors for 90 languages, trained on Wikipedia using fastText. These 300-dimensional vectors were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.

Format

The word vectors come in both the binary and text default formats of fastText. In the text format, each line contains a word followed by its embedding, with values separated by spaces. Words are ordered by descending frequency.
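As an illustration of the text format, here is a minimal Python sketch that parses such lines into a dictionary. The function name `load_text_vectors` and its `limit` parameter are hypothetical, not part of fastText. The sketch also assumes that a first line made of exactly two integers (vocabulary size and dimension) is a header to skip; that detail is not stated above, so check it against your file.

```python
import io
from typing import Dict, Iterable, List, Optional


def load_text_vectors(
    lines: Iterable[str], limit: Optional[int] = None
) -> Dict[str, List[float]]:
    """Parse 'word v1 v2 ... vn' lines into a dict of word -> vector."""
    vectors: Dict[str, List[float]] = {}
    for index, line in enumerate(lines):
        # Values are separated by single spaces; drop the line ending
        # and any trailing space before splitting.
        fields = line.rstrip("\r\n ").split(" ")
        # Assumption: an optional first line holding two integers
        # (vocabulary size and dimension) is a header, not a word.
        if index == 0 and len(fields) == 2 and all(f.isdigit() for f in fields):
            continue
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if limit is not None and len(vectors) >= limit:
            break
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


# Tiny in-memory sample standing in for a real vector file.
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "of 0.5 0.6 0.7 0.8\n"
    "cat -0.1 0.0 0.1 0.2\n"
)
top_two = load_text_vectors(sample, limit=2)
print(list(top_two))  # ['the', 'of']
```

Because words are ordered by descending frequency, `limit` keeps only the most frequent words, which helps when the full vocabulary does not fit in memory. For a real file, pass an open file handle instead of the sample, for example `open(path, encoding="utf-8")` with `path` pointing at a downloaded text file.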

License

The pre-trained word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.

References

If you use these word embeddings, please cite the following paper:

P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

Models

The models can be downloaded from:

Afrikaans: bin+text, text
Albanian: bin+text, text
Arabic: bin+text, text
Armenian: bin+text, text
Asturian: bin+text, text
Azerbaijani: bin+text, text
Bashkir: bin+text, text
Basque: bin+text, text
Belarusian: bin+text, text
Bengali: bin+text, text
Bosnian: bin+text, text
Breton: bin+text, text
Bulgarian: bin+text, text
Burmese: bin+text, text
Catalan: bin+text, text
Cebuano: bin+text, text
Chechen: bin+text, text
Chinese: bin+text, text
Chuvash: bin+text, text
Croatian: bin+text, text
Czech: bin+text, text
Danish: bin+text, text
Dutch: bin+text, text
English: bin+text, text
Esperanto: bin+text, text
Estonian: bin+text, text
Farsi: bin+text, text
Finnish: bin+text, text
French: bin+text, text
Galician: bin+text, text
Georgian: bin+text, text
German: bin+text, text
Greek: bin+text, text
Gujarati: bin+text, text
Hebrew: bin+text, text
Hindi: bin+text, text
Hungarian: bin+text, text
Icelandic: bin+text, text
Indonesian: bin+text, text
Italian: bin+text, text
Japanese: bin+text, text
Kannada: bin+text, text
Kazakh: bin+text, text
Khmer: bin+text, text
Korean: bin+text, text
Kyrgyz: bin+text, text
Latin: bin+text, text
Latvian: bin+text, text
Lithuanian: bin+text, text
Luxembourgish: bin+text, text
Macedonian: bin+text, text
Malagasy: bin+text, text
Malayalam: bin+text, text
Malay: bin+text, text
Marathi: bin+text, text
Minangkabau: bin+text, text
Mongolian: bin+text, text
Nepali: bin+text, text
Newar: bin+text, text
Norwegian: bin+text, text
Occitan: bin+text, text
Polish: bin+text, text
Portuguese: bin+text, text
Punjabi: bin+text, text
Romanian: bin+text, text
Russian: bin+text, text
Sanskrit: bin+text, text
Scots: bin+text, text
Serbian: bin+text, text
Serbo-Croatian: bin+text, text
Sinhalese: bin+text, text
Slovak: bin+text, text
Slovene: bin+text, text
Spanish: bin+text, text
Swedish: bin+text, text
Tagalog: bin+text, text
Tajik: bin+text, text
Tamil: bin+text, text
Tatar: bin+text, text
Telugu: bin+text, text
Thai: bin+text, text
Turkish: bin+text, text
Ukrainian: bin+text, text
Urdu: bin+text, text
Uzbek: bin+text, text
Vietnamese: bin+text, text
Volapük: bin+text, text
Waray: bin+text, text
Welsh: bin+text, text
Western Frisian: bin+text, text
