streaming mode broken #1

Open
gattis opened this issue Jul 20, 2011 · 3 comments

Comments

@gattis

gattis commented Jul 20, 2011

I followed the single-machine setup instructions. When I run the streaming mode example and input the same string multiple times, I get a different topic categorization every time:

java Tokenizer | ../learntopics -teststream -dumpprefix=../ut_out/lda --topics=100 --dictionary=../ut_out/lda.dict.dump
W0720 12:28:20.250313 2803 Controller.cpp:115] ----------------------------------------------------------------------
W0720 12:28:20.250712 2803 Controller.cpp:117] Log files are being stored at /lda/ut_out/learnTopics.*
W0720 12:28:20.250731 2803 Controller.cpp:119] ----------------------------------------------------------------------
W0720 12:28:20.251055 2803 Controller.cpp:140] You have chosen single machine testing mode
W0720 12:28:20.251379 2803 Unigram_Model_Streaming_Builder.cpp:56] Initializing global dictionary from ../ut_out/lda.dict.dump
W0720 12:28:20.308131 2803 Unigram_Model_Streaming_Builder.cpp:59] Dictionary initialized and has 17208
W0720 12:28:20.308279 2803 Unigram_Model_Streaming_Builder.cpp:86] Estimating the words that will fit in 2048 MB
W0720 12:28:20.408761 2803 Unigram_Model_Streaming_Builder.cpp:91] 17208 will fit in 1.06012 MB of memory
W0720 12:28:20.408906 2803 Unigram_Model_Streaming_Builder.cpp:93] Initializing Local Dictionary from ../ut_out/lda.dict.dump with 17208 words.
W0720 12:28:20.491570 2803 Unigram_Model_Streaming_Builder.cpp:122] Local Dictionary Initialized. Size: 34416
W0720 12:28:20.494669 2803 Unigram_Model_Streamer.cpp:64] Initializing Word-Topic counts table from dump ../ut_out/lda.ttc.dump using 17208 words & 100 topics.
W0720 12:28:20.549022 2803 Unigram_Model_Streamer.cpp:88] Initialized Word-Topic counts table
W0720 12:28:20.549149 2803 Unigram_Model_Streamer.cpp:91] Initializing Alpha vector from dumpfile ../ut_out/lda.par.dump
W0720 12:28:20.549247 2803 Unigram_Model_Streamer.cpp:94] Alpha vector initialized
W0720 12:28:20.549309 2803 Unigram_Model_Streamer.cpp:97] Initializing Beta Parameter from specified Beta = 0.01
W0720 12:28:20.549383 2803 Unigram_Model_Streamer.cpp:101] Beta param initialized
W0720 12:28:20.557430 2803 Testing_Execution_Strategy.cpp:64] Starting Parallel testing Pipeline
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,83) (past,86) (months,77) (noticed,15) (guy,93) (surf,35) (magazine,86) (published,92) (finally,49) (run,21) (copyright,62) (surfboards,27) (rights,90) (reserved,59) (june,63) (launches,26) (improved,40) (site,26) (order,72) (custom,36) (surfboards,11) (online,68) (improvements,67) (top,29) (selling,82) (models,30) (middot,62) (rocket,23) (fish,67) (middot,35) (speed,29) (egg,2) (middot,22) (classic,58) (middot,69) (squash,67)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,93) (past,56) (months,11) (noticed,42) (guy,29) (surf,73) (magazine,21) (published,19) (finally,84) (run,37) (copyright,98) (surfboards,24) (rights,15) (reserved,70) (june,13) (launches,26) (improved,91) (site,80) (order,56) (custom,73) (surfboards,62) (online,70) (improvements,96) (top,81) (selling,5) (models,25) (middot,84) (rocket,27) (fish,36) (middot,5) (speed,46) (egg,29) (middot,13) (classic,57) (middot,24) (squash,95)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,82) (past,45) (months,14) (noticed,67) (guy,34) (surf,64) (magazine,43) (published,50) (finally,87) (run,8) (copyright,76) (surfboards,78) (rights,88) (reserved,84) (june,3) (launches,51) (improved,54) (site,99) (order,32) (custom,60) (surfboards,76) (online,68) (improvements,39) (top,12) (selling,26) (models,86) (middot,94) (rocket,39) (fish,95) (middot,70) (speed,34) (egg,78) (middot,67) (classic,1) (middot,97) (squash,2)
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,17) (past,92) (months,52) (noticed,56) (guy,1) (surf,80) (magazine,86) (published,41) (finally,65) (run,89) (copyright,44) (surfboards,19) (rights,40) (reserved,29) (june,31) (launches,17) (improved,97) (site,71) (order,81) (custom,75) (surfboards,9) (online,27) (improvements,67) (top,56) (selling,97) (models,53) (middot,86) (rocket,65) (fish,6) (middot,83) (speed,19) (egg,24) (middot,28) (classic,71) (middot,32) (squash,29)

@gattis
Author

gattis commented Jul 26, 2011

After digging a little further, I discovered this in Unigram_Model_Streamer::read(google::protobuf::Message& doc):

for (int i = 0; i < wdoc.body_size(); i++) {
    top = rand() % _num_topics;
    wdoc.add_topic_assignment(top);
}

Looks like it just assigns random topics to words. Is streaming mode just not implemented yet?
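
The loop above draws from the process-wide rand() state, so the same document gets a different starting assignment each time it is read. A minimal sketch of a repeatable alternative follows, assuming the document text is available as a string. The names fnv1a and initial_topics are hypothetical and are not part of Yahoo_LDA; the sketch only seeds a local generator from the input so that identical input yields identical initial topics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// FNV-1a hash of the document text: the same text always gives the same seed.
std::uint64_t fnv1a(const std::string& text) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : text) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV-1a 64-bit prime
    }
    return h;
}

// Hypothetical helper (not Yahoo_LDA code): draws one initial topic per word
// from a generator seeded by the document text instead of the global rand().
// The modulo mirrors the original loop and keeps the result independent of
// the standard library's distribution implementation.
std::vector<int> initial_topics(const std::string& doc_text,
                                std::size_t num_words, int num_topics) {
    assert(num_topics > 0);
    std::mt19937_64 gen(fnv1a(doc_text));
    std::vector<int> topics;
    topics.reserve(num_words);
    for (std::size_t i = 0; i < num_words; ++i) {
        topics.push_back(
            static_cast<int>(gen() % static_cast<std::uint64_t>(num_topics)));
    }
    return topics;
}
```

This only fixes the starting point. If the inference that follows is itself stochastic, it would also need a fixed seed before two runs on the same text could be expected to match.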

@shravanmn
Collaborator

How many iterations did you run to learn the model? Did you check whether the model looks fine? Streaming will only work if you have trained a good model.

The random assignments are just initial assignments. They go through variational inference, and the final assignments won't be random. Streaming mode works fine for us.
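
To illustrate the claim that a random starting point can be washed out by inference, here is a toy sketch. It is not Yahoo_LDA's inference code: it uses iterated conditional modes (a simple deterministic stand-in) rather than variational inference or sampling, and refine_topics, phi (an assumed fixed per-topic word probability table) and alpha (an assumed symmetric document prior) are all hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy refinement by iterated conditional modes (ICM), NOT Yahoo_LDA's method.
// words[i]  : vocabulary index of the i-th word in the document
// topics    : initial assignments (taken by value, refined, and returned)
// phi[w][k] : assumed fixed probability of word w under topic k
// alpha     : assumed symmetric prior on the document's topic counts
std::vector<int> refine_topics(const std::vector<int>& words,
                               std::vector<int> topics,
                               const std::vector<std::vector<double>>& phi,
                               double alpha, int max_sweeps) {
    assert(words.size() == topics.size());
    assert(!phi.empty());
    const std::size_t num_topics = phi[0].size();

    // Per-document topic counts implied by the current assignments.
    std::vector<int> doc_counts(num_topics, 0);
    for (int t : topics) ++doc_counts[t];

    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        bool changed = false;
        for (std::size_t i = 0; i < words.size(); ++i) {
            --doc_counts[topics[i]];  // leave the current word out
            int best = 0;
            double best_score = -1.0;
            for (std::size_t k = 0; k < num_topics; ++k) {
                const double score = (doc_counts[k] + alpha) * phi[words[i]][k];
                if (score > best_score) {
                    best_score = score;
                    best = static_cast<int>(k);
                }
            }
            changed = changed || (best != topics[i]);
            topics[i] = best;
            ++doc_counts[best];
        }
        if (!changed) break;  // assignments are stable
    }
    return topics;
}
```

With a sharply peaked phi, different starting assignments end at the same result in this toy. With a flat phi (all entries equal), the document counts alone decide each word, so the outcome follows the starting point, which fits the point that streaming depends on a well-trained model.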

@jduprey

jduprey commented May 1, 2012

@shravanmn, the apparently random streaming results occur with the example test set the project provides (Yahoo_LDA/docs/html/single__machine__usage.html), trained with 500 iterations. Does the example not produce a good model with 500 iterations and 100 topics? Even learning topics with 1000 iterations doesn't appear to help on the sample set. Can you confirm? In batch mode it consistently produces the same classifications. I will try to debug and understand how the two execution paths differ, but it would be great if anyone could provide some insight and save me some time.

Thank you!
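
When comparing the streaming and batch paths, it may help to measure how many word positions change topic between two runs instead of comparing lines by eye. The sketch below assumes the "(word,topic)" output format shown in the log in the first comment; parse_assignments and count_disagreements are hypothetical helper names, not part of Yahoo_LDA.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Assignment = std::pair<std::string, int>;

// Extracts (word,topic) pairs from one output line. Anything that is not of
// the form "(word,digits)", such as the leading URL and category, is skipped.
std::vector<Assignment> parse_assignments(const std::string& line) {
    std::vector<Assignment> out;
    std::size_t pos = 0;
    while ((pos = line.find('(', pos)) != std::string::npos) {
        const std::size_t close = line.find(')', pos);
        if (close == std::string::npos) break;
        const std::size_t comma = line.find(',', pos);
        if (comma != std::string::npos && comma < close) {
            const std::string word = line.substr(pos + 1, comma - pos - 1);
            const std::string num = line.substr(comma + 1, close - comma - 1);
            // Accept only short, all-digit topic ids so std::stoi cannot throw.
            bool digits = !num.empty() && num.size() <= 9;
            for (char c : num) {
                digits = digits && std::isdigit(static_cast<unsigned char>(c)) != 0;
            }
            if (digits) out.emplace_back(word, std::stoi(num));
        }
        pos = close + 1;
    }
    return out;
}

// Counts positions whose topic differs between two runs on the same text.
std::size_t count_disagreements(const std::vector<Assignment>& a,
                                const std::vector<Assignment>& b) {
    assert(a.size() == b.size());
    std::size_t diff = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i].second != b[i].second) ++diff;
    }
    return diff;
}
```

Feeding it two output lines for the same input gives a single number per pair of runs, which makes it easier to see whether a change to the streaming path reduces the disagreement.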

jduprey added a commit to jduprey/Yahoo_LDA that referenced this issue May 16, 2012
… Per @shravan's advice, I have disabled the initial random topic assignments and now seem to get consistent results between calls to classify the same text.

3 participants