Add GloVe pretrained models from CommonCrawl corpus #40

havingfun · 2020-03-02T06:29:29Z

Hi Team,

I see that two of Stanford's pretrained GloVe models, listed at https://nlp.stanford.edu/projects/glove/, are not available here.
The ones that could be added are:

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip

Thanks,
Rajesh

kevinmneal · 2021-07-29T23:02:59Z

Resurrecting this. These models have enormous vocabs that could prove useful for more esoteric problems; I would love to be able to use them easily.

piskvorky · 2021-07-30T07:18:54Z

Sure, why not. I'm +1 on including those.

Please check https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model; we'll need:

a) Text that motivates adding each model (should be easy), including any links to its original research and preprocessing options, its license etc. Basically a quick summary of "What is this?" and "Who is it for?"
b) Code that loads these models (to include in __init__.py; see e.g. fasttext-wiki-news-subwords-300). Again, should be easy; IIRC we already support the GloVe data format.
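
For item (b), a rough sketch of the format handling involved, not the actual gensim-data loader. The Stanford GloVe text files have one token per line followed by its vector values, with no header line, whereas the word2vec text format starts with a "<vocab_size> <dimensions>" line. The helper below (the name glove_to_word2vec_text is hypothetical, and it uses only the standard library) prepends that header. It assumes tokens contain no spaces and that every row has the same number of values.

```python
def glove_to_word2vec_text(glove_path, output_path):
    """Copy a headerless GloVe text file to output_path, prepending the
    "<vocab_size> <dimensions>" header used by the word2vec text format.

    Returns (vocab_size, dimensions). Hypothetical helper for illustration.
    """
    vocab_size = 0
    dimensions = None

    # First pass: count rows and infer the dimensionality from the first row.
    # Assumption: the token itself contains no spaces.
    with open(glove_path, encoding="utf8") as src:
        for line in src:
            if dimensions is None:
                dimensions = len(line.rstrip("\n").split(" ")) - 1
            vocab_size += 1

    if dimensions is None:
        raise ValueError("empty GloVe file: %s" % glove_path)

    # Second pass: write the header, then stream the rows through unchanged.
    with open(glove_path, encoding="utf8") as src, \
            open(output_path, "w", encoding="utf8") as dst:
        dst.write("%d %d\n" % (vocab_size, dimensions))
        for line in src:
            dst.write(line)

    return vocab_size, dimensions
```

The converted file could then be read with gensim's KeyedVectors.load_word2vec_format(output_path). On gensim 4.0 or later the conversion should not be needed, since KeyedVectors.load_word2vec_format(glove_path, binary=False, no_header=True) reads the headerless file directly.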

Cheers!
