The top words are very similar after 5-6 epochs #37
Comments
I have the same problem.

```
Top words in topic 0 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 1 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 2 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 3 invoke out_of_vocabulary out_of_vocabulary the . to , a i
...
Top words in topic 19 invoke out_of_vocabulary out_of_vocabulary the . , to a i
```
I had it running on the server last night, and the top words diverged after around 20 epochs. I'm not sure why the initial topic-term distribution behaves that way; maybe it has something to do with the prior?
I consistently have
Hi all,

From my experience, you can set a more aggressive down-sampling rule to remove those overly frequent tokens.

I personally gave up on lda2vec in the end, because each time you use it the model requires a lot of time to fine-tune the topic results. Standard word2vec, or text2vec with some form of unsupervised semantic clustering, is probably a less time-consuming alternative to lda2vec: those methods work regardless of the dataset or the kind of machine you use, and the model optimisation itself may run more quickly. Moreover, lda2vec was a real pain to install on my Windows machine a couple of months ago. lda2vec may be useful, but you should have very specific reasons for using it.
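A minimal sketch of what an aggressive down-sampling rule could look like: word2vec-style subsampling of frequent tokens in plain Python/NumPy. This is not the lda2vec preprocessing API, and the threshold value is only an assumption you would tune for your corpus.

```python
# Word2vec-style subsampling of frequent tokens (sketch, not the lda2vec API).
import numpy as np
from collections import Counter

def subsample_frequent(tokens, threshold=1e-5, seed=0):
    """Randomly drop very frequent tokens such as 'the', ',' and '.'.

    Each token w is kept with probability
        p(w) = min(1, sqrt(threshold / f(w)))
    where f(w) is the token's relative corpus frequency. A smaller
    threshold removes frequent tokens more aggressively.
    """
    rng = np.random.default_rng(seed)
    counts = Counter(tokens)
    total = float(sum(counts.values()))
    keep_prob = {w: min(1.0, np.sqrt(threshold / (c / total))) for w, c in counts.items()}
    return [w for w in tokens if rng.random() < keep_prob[w]]

# Usage (hypothetical token list): stop words and punctuation are mostly
# dropped, while rarer content words survive.
# cleaned_tokens = subsample_frequent(all_tokens, threshold=1e-6)
```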
Thanks @nanader! I'll play with the down-sampling threshold. I believe I had already removed those tokens.

So far I've tried doc2vec and word2vec + earth mover's distance, but haven't had stellar results with either. I like the approach used here for documents (in principle) more than the other two, and of course the given examples look amazing. I'd really like lda2vec to work out with the data I have.

I installed lda2vec on an AWS GPU instance, and that wasn't too horrible.
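For reference, a hedged sketch of the word2vec + earth mover's distance route mentioned above, using gensim's Word Mover's Distance. The pretrained model path and the example documents are placeholders, and `wmdistance` needs an optimal-transport backend (pyemd or POT, depending on the gensim version) installed.

```python
# Sketch: compare two documents with Word Mover's Distance over word2vec
# vectors. The model file below is a placeholder; any word2vec-format
# model can be substituted.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

doc_a = "the cat sat on the mat".lower().split()
doc_b = "a kitten rested on the rug".lower().split()

# Lower distance means the documents are semantically closer.
print(kv.wmdistance(doc_a, doc_b))
```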
I recently tried topic2vec as an alternative: |
Oh interesting, thank you! |
Ah, by the way, @agtsai-i: you can also use the vector space to label topics with the token vectors nearest in cosine distance, instead of relying on the most common topic-word assignments. The lda2vec model outputs allow for it. That way you could skip tuning the topic model results entirely and get as many or as few topics as you want. It depends on what you want to achieve, really. I hope that helps.
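A minimal sketch of that suggestion, assuming you can pull a `(n_topics, dim)` array of topic vectors, a `(vocab_size, dim)` array of word vectors, and a vocabulary list out of the trained model; those variable names are assumptions, not the library's own.

```python
# Label each topic with the vocabulary words whose vectors are closest
# to the topic vector in cosine similarity (sketch; array names assumed).
import numpy as np

def nearest_words(topic_vectors, word_vectors, vocab, top_k=10):
    """Return the top_k most cosine-similar vocabulary words for each topic."""
    t = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
    w = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    sims = t @ w.T                        # (n_topics, vocab_size) cosine similarities
    top = np.argsort(-sims, axis=1)[:, :top_k]
    return [[vocab[i] for i in row] for row in top]

# for k, words in enumerate(nearest_words(topic_vectors, word_vectors, vocab)):
#     print("Topic %d: %s" % (k, " ".join(words)))
```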
True, but if I did that I would be discarding a lot of the novelty of lda2vec and essentially just using word2vec, right? *Never mind, I see what you're saying. Much appreciated.
@gracegcy Keep running more epochs and the top words will diverge.
@radekrepo @agtsai-i @yg37 Have you noticed that the result of the down-sampling step is never actually used? No wonder I kept getting a lot of stop words (OoV tokens, punctuation, etc.) no matter how much I lowered the threshold. I've put my attempt at fixing this here: #92
I was rerunning the script for 20_newsgroup, and this is the topic-term distribution after 1 epoch. From the picture, we can see that the top words for each topic are actually very similar. Is this normal, or am I implementing something wrong? I encountered the same issue when I ran the script on another corpus: after 10 epochs, the top words were almost identical ("the", "a", etc.).