Yahoo LDA and Hadoop #15

eraserx99 opened this issue Jan 26, 2013 · 0 comments


After a couple of manual tweaks, I got the Yahoo! LDA distributed setup running on top of Hadoop 1.0.4. It is a simple 2-node Hadoop configuration. When I executed runLDA.sh, I passed "2" as the number of machines. Everything ran OK, and I checked the Hadoop logs to make sure everything looked normal. When the run completed, I got two output directories and all the files described in the Yahoo! LDA documentation. So far so good...

Then I ran a single-machine setup against the same corpus with the same number of topics and iterations. When the run completed, I again got all the files described in the Yahoo! LDA documentation. So far so good...

However, when I compared the results from the distributed and single-machine setups, they looked quite different. Although the number of topics is the same, the topics themselves look very different, and the document-to-topic outputs differ as well.

Here is an example.

Below are the topics that contain the word "portion" from the distributed setup.

Topic 0: (portion,0.150299) (hous,0.129272) (top,0.113109) (member,0.0838274) (bottom,0.0588456) (featur,0.0582807) (configur,0.0468569) (materi,0.0431222) (seal,0.0416157) (cover,0.0299408) (perimet,0.0279322) (mechan,0.0273359) (port,0.0266141) (interior,0.0252332) (caviti,0.0248566) (membran,0.0238209) (factor,0.0233815) (electron,0.0233501) (latch,0.0211846) (coupl,0.0211219)
Topic 7: (magnet,0.173736) (portion,0.115939) (surfac,0.106755) (direct,0.0771537) (guid,0.0469494) (field,0.0426735) (coil,0.0386558) (side,0.0377519) (face,0.0375079) (shield,0.0358291) (medium,0.0347386) (end,0.0340499) (layer,0.0312805) (form,0.030477) (part,0.029817) (main,0.0278081) (pole,0.0272916) (section,0.0268181) (shape,0.0247949) (head,0.0199737)
Topic 12: (member,0.151508) (roller,0.0885818) (form,0.0709548) (imag,0.0678395) (sheet,0.0652739) (develop,0.0566723) (fix,0.0477156) (rotat,0.0468222) (toner,0.0461121) (portion,0.0448522) (direct,0.0416338) (belt,0.0406029) (side,0.0360559) (transfer,0.0329291) (surfac,0.0320128) (posit,0.0302947) (drum,0.0264349) (unit,0.0246825) (press,0.0245451) (apparatu,0.0244763)
Topic 18: (line,0.231612) (displai,0.110021) (electrod,0.0977109) (pixel,0.0831908) (crystal,0.0775907) (liquid,0.0751507) (panel,0.0452805) (portion,0.0372104) (plural,0.0269503) (direct,0.0253603) (substrat,0.0251203) (align,0.0219703) (form,0.0201103) (connect,0.0197803) (common,0.0188203) (view,0.0173602) (arrang,0.0172402) (gate,0.0171002) (polar,0.0165002) (respect,0.0159202)
Topic 47: (region,0.433173) (portion,0.0659125) (semiconductor,0.0466138) (structur,0.0431049) (gate,0.0370144) (implant,0.0332915) (sourc,0.0312804) (diffus,0.0297256) (dope,0.0259315) (present,0.0252183) (illustr,0.0249045) (charg,0.0243625) (zone,0.0241771) (form,0.0241057) (trench,0.0236636) (channel,0.0228648) (impur,0.0218521) (ion,0.0212958) (concentr,0.0208964) (drain,0.0206111)
Topic 54: (transfer,0.293593) (properti,0.0975047) (pre,0.0648832) (medium,0.0623078) (identifi,0.0547478) (instruct,0.0422863) (class,0.0393509) (diagram,0.0378002) (process,0.0360002) (sensit,0.0304894) (portion,0.0298802) (set,0.0274987) (match,0.026391) (inform,0.0247294) (oper,0.0234002) (classifi,0.0228187) (number,0.0221264) (determin,0.0216556) (label,0.021351) (step,0.0211848)
Topic 57: (chip,0.0976907) (wire,0.0929568) (electr,0.0724923) (connect,0.0669996) (pad,0.0622883) (conduct,0.0585963) (substrat,0.0582906) (circuit,0.0577923) (surfac,0.0551535) (packag,0.0491625) (semiconductor,0.047577) (board,0.0425033) (plural,0.0378487) (bond,0.0329676) (form,0.0302948) (contact,0.0297399) (portion,0.0287886) (mount,0.0269879) (compon,0.0266368) (side,0.0252325)
Topic 92: (portion,0.1064) (bodi,0.0994396) (side,0.068734) (end,0.0669684) (support,0.0604946) (posit,0.0601663) (assembl,0.0473205) (mount,0.0467772) (wall,0.0426122) (member,0.0421821) (surfac,0.0403939) (engag,0.0394771) (view,0.0380171) (plate,0.0377681) (connect,0.0358893) (front,0.035878) (open,0.0349726) (cover,0.032709) (attach,0.0325619) (rotat,0.0312377)

Below are the topics that contain the word "portion" from the single-machine setup.

Topic 6: (electr,0.158046) (structur,0.127875) (conduct,0.106806) (circuit,0.0866087) (connect,0.0616967) (interconnect,0.0502754) (connector,0.0435378) (integr,0.0398189) (mechan,0.0348396) (conductor,0.0320855) (illustr,0.0320388) (coupl,0.0315253) (ground,0.0308562) (contact,0.0285066) (portion,0.0271684) (fuse,0.0240252) (carrier,0.0237296) (isol,0.0217379) (substrat,0.0213022) (shown,0.017521)
Topic 7: (portion,0.270355) (side,0.108089) (end,0.0707633) (bodi,0.059409) (surfac,0.0554205) (direct,0.0473438) (form,0.0400563) (section,0.037248) (guid,0.0355345) (view,0.0340486) (upper,0.0292069) (shown,0.0286381) (shape,0.0268464) (main,0.0264341) (front,0.026107) (open,0.0252183) (extend,0.0208885) (lower,0.0202913) (hole,0.019694) (case,0.0184072)
Topic 24: (region,0.621829) (characterist,0.0302271) (portion,0.0278799) (laser,0.0274319) (direct,0.0259627) (overlap,0.0255685) (vertic,0.025246) (differ,0.0227913) (edg,0.020677) (posit,0.0195661) (width,0.018509) (adjac,0.0180611) (plural,0.0169681) (standard,0.0163947) (layout,0.0156422) (beam,0.0144955) (extend,0.0142984) (background,0.0139042) (specif,0.0122916) (abov,0.0122558)
Topic 82: (film,0.161729) (semiconductor,0.126738) (form,0.105283) (gate,0.094141) (insul,0.0559734) (silicon,0.0484004) (oxid,0.0421204) (electrod,0.0409827) (sourc,0.0344884) (transistor,0.0342963) (substrat,0.0337643) (drain,0.0326857) (conduct,0.0289546) (surfac,0.0261471) (region,0.0252235) (portion,0.0246842) (thin,0.0241079) (impur,0.0209088) (etch,0.0197341) (diffus,0.019638)
Topic 89: (electrod,0.309474) (wire,0.171766) (electr,0.0730991) (resist,0.0727857) (connect,0.0579722) (insul,0.0379882) (form,0.037917) (protect,0.0311655) (present,0.0265505) (discharg,0.0264081) (view,0.0192577) (section,0.0184458) (appli,0.0175627) (illustr,0.0160386) (contact,0.0155543) (addit,0.015127) (portion,0.0137311) (dispos,0.013546) (shown,0.0130902) (side,0.0125204)

Should the distributed and single-machine setups produce similar results? Or is there a good way to compare the results of the two setups systematically?
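
To make the question concrete, here is a rough sketch of the kind of comparison I have in mind. Topic numbers in LDA are arbitrary labels, so Topic 7 in one run need not correspond to Topic 7 in the other; the idea is to pair topics by the similarity of their word weights instead. This is only an illustration under my own assumptions: the helper names (`parse_topic_line`, `cosine`, `match_topics`) are hypothetical and not part of Yahoo! LDA, the parser assumes the `Topic N: (word,weight) ...` line format shown above, and the greedy pairing is a simple stand-in for a proper assignment method. Because only the top words per topic are printed, the similarity values are approximate.

```python
import math
import re

# Matches one "(word,weight)" pair as printed in the topic listings above.
PAIR = re.compile(r"\(([^,()]+),([0-9.eE+-]+)\)")


def parse_topic_line(line):
    """Parse 'Topic 7: (word,0.1) (other,0.05)' into (7, {word: weight})."""
    head, _, body = line.partition(":")
    topic_id = int(head.split()[1])
    words = {word: float(weight) for word, weight in PAIR.findall(body)}
    return topic_id, words


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_topics(run_a, run_b):
    """Greedily pair topics from two runs, most similar pairs first.

    run_a and run_b map topic id -> {word: weight}. Returns a list of
    (id_in_a, id_in_b, similarity); each topic is used at most once.
    """
    candidates = sorted(
        ((cosine(wa, wb), ia, ib)
         for ia, wa in run_a.items()
         for ib, wb in run_b.items()),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), []
    for sim, ia, ib in candidates:
        if ia not in used_a and ib not in used_b:
            used_a.add(ia)
            used_b.add(ib)
            matches.append((ia, ib, sim))
    return matches


if __name__ == "__main__":
    # First three words of a few topics quoted above (truncated for brevity).
    distributed = dict(parse_topic_line(line) for line in [
        "Topic 47: (region,0.433173) (portion,0.0659125) (semiconductor,0.0466138)",
        "Topic 57: (chip,0.0976907) (wire,0.0929568) (electr,0.0724923)",
    ])
    single = dict(parse_topic_line(line) for line in [
        "Topic 24: (region,0.621829) (characterist,0.0302271) (portion,0.0278799)",
        "Topic 89: (electrod,0.309474) (wire,0.171766) (electr,0.0730991)",
    ])
    for id_a, id_b, sim in match_topics(distributed, single):
        print(f"distributed {id_a} <-> single {id_b}: cosine {sim:.3f}")
```

If most topics in one run find a high-similarity partner in the other, the two setups are probably learning comparable models despite the different numbering; many low-similarity pairs would suggest a real discrepancy.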

Thanks!
