
The top words are very similar after 5-6 epochs #37

Open
yg37 opened this issue Jun 16, 2016 · 12 comments

yg37 commented Jun 16, 2016

[Screenshot: topic-term distribution after 1 epoch, showing near-identical top words across topics]

I was rerunning the script for 20_newsgroups, and this is the topic-term distribution after 1 epoch. From the picture, we can see that the top words for each topic are actually very similar. Is this normal, or did I implement something wrong? I encountered the same issue when I ran the script on other corpora: after 10 epochs, the top words across topics were almost identical, dominated by "the", "a", etc.


cprevosteau commented Jun 17, 2016

I have the same problem.
I ran the 20_newsgroups script on both the original corpus and one of my own, and after just one epoch the topics' top words are identical.
I tried changing the hyperparameters, but the results were the same.

```
Top words in topic 0 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 1 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 2 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 3 invoke out_of_vocabulary out_of_vocabulary the . to , a i
...
Top words in topic 19 invoke out_of_vocabulary out_of_vocabulary the . , to a i
```


yg37 commented Jun 17, 2016

I had it running on the server last night and the top words diverged after around 20 epochs. I'm not sure why the initial topic-term distribution behaves that way; maybe it has something to do with the prior?

@agtsai-i

I consistently have out_of_vocabulary as the top word across all topics; any suggestions on what I should look for? This happens even when I set the min and max vocab count thresholds to None.


radekrepo commented Jul 13, 2016

Hi all,

In my experience, you can set a more aggressive down-sampling rule to remove out_of_vocabulary and similarly redundant tokens from at least some of the topics, if not all. I lowered the down-sampling threshold on my dataset and the stop words largely disappeared from the top topic word lists. An alternative, which I haven't tried, is to clean the data before you feed the tokens to the model; that way you remove the out_of_vocabulary token, as well as other meaningless tokens, from the modelling entirely. Data cleaning could well improve the results (it does for pure LDA, at least), although I don't know the maths behind the lda2vec model well enough to make a strong case for that.

I personally gave up on lda2vec in the end, because every time you use it the model requires a lot of time to fine-tune the topic results. Standard word2vec or text2vec with some form of unsupervised semantic clustering is probably a less time-consuming alternative to lda2vec, because they work regardless of the dataset or the type of system you use, and model optimisation itself may run more quickly. Moreover, lda2vec was a real pain to install on my Windows machine a couple of months ago. lda2vec may be useful, but you should have very specific reasons for using it.
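A minimal sketch of the pre-cleaning idea described above, in plain Python (the stop-word set, the `SKIP_TOKENS` set, and the `clean` function are illustrative assumptions, not part of the lda2vec API):

```python
# Illustrative pre-cleaning sketch: drop stop words, punctuation-only tokens,
# and out_of_vocabulary placeholders before the tokens ever reach the model.
import re

STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "i", "it", "is"}
SKIP_TOKENS = {"out_of_vocabulary", "<skip>"}  # placeholder names vary by pipeline

def clean(tokens):
    """Keep only tokens that are likely to carry topical meaning."""
    kept = []
    for tok in tokens:
        t = tok.lower()
        if t in STOP_WORDS or t in SKIP_TOKENS:
            continue
        if not re.search(r"[a-z0-9]", t):  # punctuation-only tokens such as "." or ","
            continue
        kept.append(t)
    return kept

print(clean(["The", ".", "to", "out_of_vocabulary", "telescope", "orbit"]))
# -> ['telescope', 'orbit']
```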

@agtsai-i

Thanks @nanader! I'll play with the down-sampling threshold. I believe I had removed the out_of_vocabulary tokens entirely by setting the vocab count thresholds to None (at least, that's what my reading of the code suggests should happen), so I was surprised to still see them pop up.

So far I've tried doc2vec and word2vec + earth mover's distance, but haven't had stellar results with either. I like the approach used here for documents (in principle) more than the other two, and of course the given examples look amazing. I'd really like lda2vec to work out with the data I have.

I installed lda2vec on an AWS GPU instance, and that wasn't too horrible.
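For reference, a rough sketch of the word2vec + earth mover's distance comparison mentioned above, assuming gensim and a pretrained vector file (the "vectors.bin" path and the example sentences are placeholders):

```python
# Rough sketch: document similarity via Word Mover's Distance over word2vec
# vectors (assumes gensim; "vectors.bin" is a placeholder path).
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

doc1 = "the telescope captured images of the distant galaxy".lower().split()
doc2 = "astronomers photographed a faraway star cluster".lower().split()

# Smaller distance means the documents are semantically closer.
print(kv.wmdistance(doc1, doc2))
```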


yg37 commented Jul 13, 2016

I recently tried topic2vec as an alternative:
http://arxiv.org/abs/1506.08422
https://github.com/scavallari/Topic2Vec/blob/master/Topic2Vec_20newsgroups.ipynb
I tried it on simple wiki data and it performed very well.

@agtsai-i

Oh interesting, thank you!


radekrepo commented Jul 14, 2016

Ah, by the way, agtsai-i: you can also use the vector space to label topics with the nearest cosine-distance token vectors instead of relying on the most common topic-word assignments; the lda2vec model's results allow for it. That way you could skip tuning the topic model results entirely and get as many or as few topic labels as you want. It depends on what you want to achieve, really. I hope that helps.
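A minimal NumPy sketch of that labelling trick, assuming you already have a word-vector matrix, a vocabulary list, and a topic vector from the trained model (the array names and random data here are illustrative):

```python
# Label a topic by the tokens whose vectors are nearest to the topic vector
# (cosine similarity), instead of relying on topic-word assignment counts.
import numpy as np

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(5000, 300))   # (vocab_size, dim) from the trained model
vocab = [f"word_{i}" for i in range(5000)]    # index -> token
topic_vector = rng.normal(size=300)           # one row of the topic matrix

def nearest_words(topic_vec, word_vecs, vocab, k=10):
    """Return the k tokens closest to the topic vector by cosine similarity."""
    wv = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    tv = topic_vec / np.linalg.norm(topic_vec)
    sims = wv @ tv
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]

print(nearest_words(topic_vector, word_vectors, vocab))
```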


agtsai-i commented Jul 14, 2016

True, but if I did that I would be discarding a lot of the novelty of lda2vec and essentially just using word2vec, right?

*Never mind, I see what you're saying. Much appreciated.

@gracegcy

Hi @agtsai-i & @yg37, did you resolve this issue in the end? Could you kindly share the solution, if any? Thanks a lot.


yg37 commented Mar 23, 2017

@gracegcy Keep running the epochs and the top words will diverge.


ghost commented Feb 17, 2019

@radekrepo @agtsai-i @yg37 Have you noticed that the result of the down-sampling step is never used? No wonder I kept getting a lot of stop words (OoV, punctuation, etc.) no matter how much I lowered its threshold. I put my attempt at fixing this here: #92
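For context, a minimal sketch of word2vec-style frequency down-sampling in which the filtered output is actually what gets consumed downstream; the `subsample` function, the toy corpus, and the thresholds are illustrative, not the lda2vec code referenced in #92:

```python
# Illustrative frequency down-sampling (word2vec-style rule). The key point
# is that the returned, filtered token list is what the model should train on.
import random
from collections import Counter

def subsample(tokens, threshold=1e-5, seed=42):
    """Drop very frequent tokens with probability 1 - sqrt(threshold / freq)."""
    random.seed(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:
        freq = counts[tok] / total
        keep_prob = min(1.0, (threshold / freq) ** 0.5)
        if random.random() < keep_prob:
            kept.append(tok)
    return kept  # this result, not the original token list, should be used downstream

corpus = ["the"] * 900 + ["galaxy"] * 5 + ["telescope"] * 5
# A larger threshold for this tiny toy corpus; real corpora often use ~1e-5.
print(Counter(subsample(corpus, threshold=1e-2)))
```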
