Add GloVe pretrained models from CommonCrawl corpus #40

Open
havingfun opened this issue Mar 2, 2020 · 2 comments

Comments

@havingfun

Hi Team,

I see that two of Stanford's pretrained GloVe models, listed at https://nlp.stanford.edu/projects/glove/, are not available here yet.
The ones that could be added are:

  • Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
  • Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip

Thanks,
Rajesh

@kevinmneal

Resurrecting this. These models have enormous vocabularies that could prove useful for more esoteric problems; I would love to be able to use them easily.

@piskvorky
Owner

Sure, why not. I'm +1 on including those.

Please check https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model; we'll need:

a) Text that motivates adding each model (should be easy), including links to its original research, preprocessing options, license, etc. Basically a quick summary of "What is this?" and "Who is it for?"
b) Code that loads these models (to include in __init__.py; see e.g. fasttext-wiki-news-subwords-300). Again, this should be easy; IIRC we already support the GloVe data format.
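
For item b), here is a rough sketch of the preparation step such a loader might need. It is not the actual gensim-data loader code. It assumes the published GloVe text files have one token per line followed by space-separated floats and no header line, whereas the word2vec text format starts with a "<vocab_size> <dimensions>" header. The function name `glove_to_word2vec` and the file names are made up for illustration.

```python
import os
import tempfile


def glove_to_word2vec(glove_path, output_path):
    """Copy a GloVe text file to output_path, prepending a word2vec-style header.

    Hypothetical helper for illustration only. Assumes every line is
    "<token> <float> <float> ..." and that tokens contain no spaces.
    """
    # First pass: count the vectors and infer the dimensionality from line one.
    num_vectors = 0
    dim = 0
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if num_vectors == 0:
                dim = len(line.rstrip("\n").split(" ")) - 1
            num_vectors += 1

    # Second pass: write the header, then stream the vectors through unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_vectors} {dim}\n")
        for line in src:
            dst.write(line)

    return num_vectors, dim


if __name__ == "__main__":
    # Tiny stand-in file; the real downloads are far too large for a demo.
    workdir = tempfile.mkdtemp()
    glove_file = os.path.join(workdir, "glove.sample.txt")
    w2v_file = os.path.join(workdir, "glove.sample.w2v.txt")
    with open(glove_file, "w", encoding="utf-8") as f:
        f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
    print(glove_to_word2vec(glove_file, w2v_file))
```

The converted file should then be loadable with gensim's `KeyedVectors.load_word2vec_format(path, binary=False)`. Gensim 4.x also accepts `no_header=True` there, which would make the conversion unnecessary. Whether the Common Crawl files need extra handling (for example, tokens that contain spaces) would have to be checked against the actual downloads.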

Cheers!
