
The top words are very similar after 5-6 epochs #37

Open
yg37 opened this issue Jun 16, 2016 · 12 comments

yg37 commented Jun 16, 2016

[Screenshot: topic-term distribution after 1 epoch, showing near-identical top words across topics]

I was rerunning the script for 20_newsgroups, and this is the topic-term distribution after 1 epoch. From the picture, we can see that the top words for each topic are actually very similar. Is this normal, or did I implement something wrong? I encountered the same issue when I ran the script on other corpora: after 10 epochs, the top words across topics were almost identical, dominated by "the", "a", etc.


cprevosteau commented Jun 17, 2016

I have the same problem.
I ran the 20_newsgroups script on both the original corpus and one of my own, and after just one epoch the topics' top words are identical.
I tried changing the hyperparameters, but the results were the same.

```
Top words in topic 0 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 1 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 2 invoke out_of_vocabulary out_of_vocabulary the . to , a i
Top words in topic 3 invoke out_of_vocabulary out_of_vocabulary the . to , a i
...
Top words in topic 19 invoke out_of_vocabulary out_of_vocabulary the . , to a i
```


yg37 commented Jun 17, 2016

I had it running on the server last night and the top words diverged after around 20 epochs. I'm not sure why the initial topic-term distribution behaves that way; maybe it has something to do with the prior?

@agtsai-i

I consistently have out_of_vocabulary as the top word across all topics; any suggestions on what I should look for? This happens even when I set the min and max vocab count thresholds to None.


radekrepo commented Jul 13, 2016

Hi all,

In my experience, you can set a more aggressive down-sampling rule to remove out_of_vocabulary and similarly redundant tokens from at least some of the topics, if not all. I lowered the down-sampling threshold on my dataset and the stop words largely disappeared from the top topic word lists. An alternative, which I haven't tried, is to clean the data before you feed the tokens to the model; that way you remove the out_of_vocabulary token, as well as other meaningless tokens, from the modelling entirely. Data cleaning could well improve the results (it does for pure LDA, at least), although I don't know the maths behind the lda2vec model well enough to make a strong case for that.

I personally gave up on lda2vec in the end, because every time you use it the model requires a lot of time to fine-tune the topic results. Standard word2vec or text2vec with some form of unsupervised semantic clustering is probably a less time-consuming alternative to lda2vec, because they work regardless of the dataset or the type of system you use, and model optimisation itself may run more quickly. Moreover, lda2vec was a real pain to install on my Windows machine a couple of months ago. lda2vec may be useful, but you should have very specific reasons for using it.
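A minimal sketch of the pre-cleaning idea described above, in plain Python (the stop-word set, the `SKIP_TOKENS` set, and the `clean` function are illustrative assumptions, not part of the lda2vec API):

```python
# Illustrative pre-cleaning sketch: drop stop words, punctuation-only tokens,
# and out_of_vocabulary placeholders before the tokens ever reach the model.
import re

STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "i", "it", "is"}
SKIP_TOKENS = {"out_of_vocabulary", "<skip>"}  # placeholder names vary by pipeline

def clean(tokens):
    """Keep only tokens that are likely to carry topical meaning."""
    kept = []
    for tok in tokens:
        t = tok.lower()
        if t in STOP_WORDS or t in SKIP_TOKENS:
            continue
        if not re.search(r"[a-z0-9]", t):  # punctuation-only tokens such as "." or ","
            continue
        kept.append(t)
    return kept

print(clean(["The", ".", "to", "out_of_vocabulary", "telescope", "orbit"]))
# -> ['telescope', 'orbit']
```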

@agtsai-i

Thanks @nanader! I'll play with the down-sampling threshold. I believe I had removed the out_of_vocabulary tokens entirely by setting the vocab count thresholds to None (at least, that's what my reading of the code suggests should happen), so I was surprised to still see them pop up.

So far I've tried doc2vec and word2vec + earth mover's distance, but haven't had stellar results with either. I like the approach used here for documents (in principle) more than the other two, and of course the given examples look amazing. I'd really like lda2vec to work out with the data I have.

I installed lda2vec on an AWS GPU instance, and that wasn't too horrible.
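For reference, a rough sketch of the word2vec + earth mover's distance comparison mentioned above, assuming gensim and a pretrained vector file (the "vectors.bin" path and the example sentences are placeholders):

```python
# Rough sketch: document similarity via Word Mover's Distance over word2vec
# vectors (assumes gensim; "vectors.bin" is a placeholder path).
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

doc1 = "the telescope captured images of the distant galaxy".lower().split()
doc2 = "astronomers photographed a faraway star cluster".lower().split()

# Smaller distance means the documents are semantically closer.
print(kv.wmdistance(doc1, doc2))
```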


yg37 commented Jul 13, 2016

I recently tried topic2vec as an alternative:
http://arxiv.org/abs/1506.08422
https://github.com/scavallari/Topic2Vec/blob/master/Topic2Vec_20newsgroups.ipynb
I tried it on simple wiki data and it performed very well.

@agtsai-i

Oh interesting, thank you!


radekrepo commented Jul 14, 2016

Ah, by the way, agtsai-i: you can also use the vector space to label topics with the nearest cosine-distance token vectors instead of relying on the most common topic-word assignments; the lda2vec model's results allow for it. That way you could skip tuning the topic model results entirely and get as many or as few topic labels as you want. It depends on what you want to achieve, really. I hope that helps.
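A minimal NumPy sketch of that labelling trick, assuming you already have a word-vector matrix, a vocabulary list, and a topic vector from the trained model (the array names and random data here are illustrative):

```python
# Label a topic by the tokens whose vectors are nearest to the topic vector
# (cosine similarity), instead of relying on topic-word assignment counts.
import numpy as np

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(5000, 300))   # (vocab_size, dim) from the trained model
vocab = [f"word_{i}" for i in range(5000)]    # index -> token
topic_vector = rng.normal(size=300)           # one row of the topic matrix

def nearest_words(topic_vec, word_vecs, vocab, k=10):
    """Return the k tokens closest to the topic vector by cosine similarity."""
    wv = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    tv = topic_vec / np.linalg.norm(topic_vec)
    sims = wv @ tv
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]

print(nearest_words(topic_vector, word_vectors, vocab))
```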


agtsai-i commented Jul 14, 2016

True, but if I did that I would be discarding a lot of the novelty of lda2vec and essentially just using word2vec, right?

*Never mind, I see what you're saying. Much appreciated.

@gracegcy

Hi @agtsai-i & @yg37, did you resolve this issue in the end? Could you kindly share the solution, if any? Thanks a lot.


yg37 commented Mar 23, 2017

@gracegcy Keep running the epochs and the top words will diverge.


ghost commented Feb 17, 2019

@radekrepo @agtsai-i @yg37 Have you noticed that the result of the down-sampling step is never used? No wonder I kept getting a lot of stop words (OoV, punctuation, etc.) no matter how much I lowered its threshold. I put my attempt at fixing this here: #92
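For context, a minimal sketch of word2vec-style frequency down-sampling in which the filtered output is actually what gets consumed downstream; the `subsample` function, the toy corpus, and the thresholds are illustrative, not the lda2vec code referenced in #92:

```python
# Illustrative frequency down-sampling (word2vec-style rule). The key point
# is that the returned, filtered token list is what the model should train on.
import random
from collections import Counter

def subsample(tokens, threshold=1e-5, seed=42):
    """Drop very frequent tokens with probability 1 - sqrt(threshold / freq)."""
    random.seed(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:
        freq = counts[tok] / total
        keep_prob = min(1.0, (threshold / freq) ** 0.5)
        if random.random() < keep_prob:
            kept.append(tok)
    return kept  # this result, not the original token list, should be used downstream

corpus = ["the"] * 900 + ["galaxy"] * 5 + ["telescope"] * 5
# A larger threshold for this tiny toy corpus; real corpora often use ~1e-5.
print(Counter(subsample(corpus, threshold=1e-2)))
```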
