Releases: Turkish-Word-Embeddings/Word-Embeddings-Repository-for-Turkish

Turkish Word Vectors, Corpus and Evaluation Dataset

12 Apr 22:44
Turkish static word embeddings trained with the Word2Vec Skip-gram, Word2Vec CBOW, FastText, GloVe, and ELMo architectures can be found here.
We used the following parameters for training:

  • Embedding dimension: 300
  • Window size: 5
  • Number of negative samples for negative sampling: 5
  • Minimum frequency of a word: 10
  • Maximum iterations: 100 (for GloVe)
  • x_max (maximum number of co-occurrences to use in the weighting function) (for GloVe): 10
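As a sketch, the Word2Vec-related parameters above map onto gensim's `Word2Vec` keyword arguments roughly as follows. This is illustrative, not the exact training script used for this release:

```python
# Hypothetical mapping of the release's training parameters onto
# gensim Word2Vec keyword arguments (illustrative only).
skipgram_params = {
    "vector_size": 300,  # embedding dimension
    "window": 5,         # context window size
    "negative": 5,       # number of negative samples for negative sampling
    "min_count": 10,     # minimum frequency of a word
    "sg": 1,             # 1 = Skip-gram; set to 0 for CBOW
}
print(skipgram_params["vector_size"])
```

With gensim installed, training would then look like `Word2Vec(sentences, **skipgram_params)`; the GloVe-specific settings (100 iterations, `x_max = 10`) belong to the separate GloVe toolchain rather than gensim.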

Our evaluation dataset consists of analogy and similarity tasks collected from various open-source datasets [1] [2], in addition to our own examples. The analogy dataset is split into two parts: semantic and syntactic, where each category is further divided into subcategories (noun declension suffixes, verb conjugation suffixes) for better evaluation. The similarity dataset is also split into syntactic and semantic groups.
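Analogy questions of the form "a is to b as c is to ?" are typically scored with the 3CosAdd method: find the word whose vector is closest to b - a + c. A minimal pure-Python sketch, using hypothetical toy vectors rather than the released embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return the word maximizing cos(w, b - a + c), excluding a, b, c (3CosAdd)."""
    target = [vectors[b][i] - vectors[a][i] + vectors[c][i]
              for i in range(len(vectors[a]))]
    candidates = {w: vec for w, vec in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Toy 3-dimensional vectors (hypothetical values for illustration).
toy = {
    "kral":    [0.9, 0.1, 0.0],
    "kraliçe": [0.9, 0.9, 0.0],
    "adam":    [0.1, 0.1, 0.8],
    "kadın":   [0.1, 0.9, 0.8],
    "elma":    [0.5, 0.2, 0.1],
}
print(solve_analogy(toy, "kral", "kraliçe", "adam"))  # → kadın
```

On the real embeddings, the same query would run over the full vocabulary (gensim's `most_similar(positive=[b, c], negative=[a])` does this efficiently).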

We have combined two open-source corpora: the Boun Web Corpus [3] and the Huawei Corpus [1]. Overall, we have 1,384,961,747 tokens and 1,573,013 unique words (excluding words that occur fewer times than the minimum frequency).
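The reported vocabulary size follows from applying the minimum-frequency cutoff to the raw token counts. A small sketch of that filtering step (toy data, with the threshold lowered to 2 for illustration; the release uses a minimum frequency of 10):

```python
from collections import Counter

def vocab_stats(tokens, min_count):
    """Return (total token count, number of unique words occurring
    at least min_count times), mirroring how the corpus statistics
    above are reported."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return len(tokens), len(vocab)

# Tiny illustrative corpus.
tokens = "ev ev araba araba araba kedi".split()
total, unique = vocab_stats(tokens, min_count=2)
print(total, unique)  # → 6 2  ("kedi" falls below the threshold)
```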


You can import Word2Vec and FastText word embeddings with the following code snippet:

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format("path/to/the/wordvectors", binary=True, no_header=False)

To import GloVe word vectors, change the parameters as follows:

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format("path/to/the/wordvectors", binary=False, no_header=True)

You can download the X2Static BERT model here: https://huggingface.co/CahidArda/bert-turkish-x2static/tree/main

Please don't forget to cite this repository if you use these word vectors in your work or research.

  1. Onur Gungor, Eray Yildiz, "Linguistic Features in Turkish Word Representations - Türkçe Sözcük Temsillerinde Dilbilimsel Özellikler", 2017 25th Signal Processing and Communications Applications Conference (SIU), Antalya, 2017.
  2. https://github.com/bunyamink/word-embedding-models/tree/master/datasets
  3. Sak, H., Güngör, T. & Saraçlar, M. Resources for Turkish morphological processing. Lang Resources & Evaluation 45, 249–261 (2011). https://doi.org/10.1007/s10579-010-9128-6