After a couple of manual tweaks, I got the Yahoo! LDA distributed setup running on top of Hadoop 1.0.4. It is a simple two-node Hadoop configuration. When I executed runLDA.sh, I used "2" for the number of machines. Everything ran OK, and I also checked the Hadoop logs to make sure everything looked normal. After the run completed, I got two output directories and all the files described in the Yahoo! LDA documentation. So far so good...
Then I ran a single-machine setup against the same corpus with the same number of topics and iterations. After the run completed, I again got all the files described in the Yahoo! LDA documentation. So far so good...
However, when I compared the results from the distributed and single-machine setups, they looked quite different to me. Although the number of topics is the same, the topics themselves look very different. The document-to-topic outputs look quite different too.
Here is an example.
Below are the topics that contain the word "portion" from the distributed setup.
Topic 0: (portion,0.150299) (hous,0.129272) (top,0.113109) (member,0.0838274) (bottom,0.0588456) (featur,0.0582807) (configur,0.0468569) (materi,0.0431222) (seal,0.0416157) (cover,0.0299408) (perimet,0.0279322) (mechan,0.0273359) (port,0.0266141) (interior,0.0252332) (caviti,0.0248566) (membran,0.0238209) (factor,0.0233815) (electron,0.0233501) (latch,0.0211846) (coupl,0.0211219)
Topic 7: (magnet,0.173736) (portion,0.115939) (surfac,0.106755) (direct,0.0771537) (guid,0.0469494) (field,0.0426735) (coil,0.0386558) (side,0.0377519) (face,0.0375079) (shield,0.0358291) (medium,0.0347386) (end,0.0340499) (layer,0.0312805) (form,0.030477) (part,0.029817) (main,0.0278081) (pole,0.0272916) (section,0.0268181) (shape,0.0247949) (head,0.0199737)
Topic 12: (member,0.151508) (roller,0.0885818) (form,0.0709548) (imag,0.0678395) (sheet,0.0652739) (develop,0.0566723) (fix,0.0477156) (rotat,0.0468222) (toner,0.0461121) (portion,0.0448522) (direct,0.0416338) (belt,0.0406029) (side,0.0360559) (transfer,0.0329291) (surfac,0.0320128) (posit,0.0302947) (drum,0.0264349) (unit,0.0246825) (press,0.0245451) (apparatu,0.0244763)
Topic 18: (line,0.231612) (displai,0.110021) (electrod,0.0977109) (pixel,0.0831908) (crystal,0.0775907) (liquid,0.0751507) (panel,0.0452805) (portion,0.0372104) (plural,0.0269503) (direct,0.0253603) (substrat,0.0251203) (align,0.0219703) (form,0.0201103) (connect,0.0197803) (common,0.0188203) (view,0.0173602) (arrang,0.0172402) (gate,0.0171002) (polar,0.0165002) (respect,0.0159202)
Topic 47: (region,0.433173) (portion,0.0659125) (semiconductor,0.0466138) (structur,0.0431049) (gate,0.0370144) (implant,0.0332915) (sourc,0.0312804) (diffus,0.0297256) (dope,0.0259315) (present,0.0252183) (illustr,0.0249045) (charg,0.0243625) (zone,0.0241771) (form,0.0241057) (trench,0.0236636) (channel,0.0228648) (impur,0.0218521) (ion,0.0212958) (concentr,0.0208964) (drain,0.0206111)
Topic 54: (transfer,0.293593) (properti,0.0975047) (pre,0.0648832) (medium,0.0623078) (identifi,0.0547478) (instruct,0.0422863) (class,0.0393509) (diagram,0.0378002) (process,0.0360002) (sensit,0.0304894) (portion,0.0298802) (set,0.0274987) (match,0.026391) (inform,0.0247294) (oper,0.0234002) (classifi,0.0228187) (number,0.0221264) (determin,0.0216556) (label,0.021351) (step,0.0211848)
Topic 57: (chip,0.0976907) (wire,0.0929568) (electr,0.0724923) (connect,0.0669996) (pad,0.0622883) (conduct,0.0585963) (substrat,0.0582906) (circuit,0.0577923) (surfac,0.0551535) (packag,0.0491625) (semiconductor,0.047577) (board,0.0425033) (plural,0.0378487) (bond,0.0329676) (form,0.0302948) (contact,0.0297399) (portion,0.0287886) (mount,0.0269879) (compon,0.0266368) (side,0.0252325)
Topic 92: (portion,0.1064) (bodi,0.0994396) (side,0.068734) (end,0.0669684) (support,0.0604946) (posit,0.0601663) (assembl,0.0473205) (mount,0.0467772) (wall,0.0426122) (member,0.0421821) (surfac,0.0403939) (engag,0.0394771) (view,0.0380171) (plate,0.0377681) (connect,0.0358893) (front,0.035878) (open,0.0349726) (cover,0.032709) (attach,0.0325619) (rotat,0.0312377)
Below are the topics that contain the word "portion" from the single-machine setup.
Topic 6: (electr,0.158046) (structur,0.127875) (conduct,0.106806) (circuit,0.0866087) (connect,0.0616967) (interconnect,0.0502754) (connector,0.0435378) (integr,0.0398189) (mechan,0.0348396) (conductor,0.0320855) (illustr,0.0320388) (coupl,0.0315253) (ground,0.0308562) (contact,0.0285066) (portion,0.0271684) (fuse,0.0240252) (carrier,0.0237296) (isol,0.0217379) (substrat,0.0213022) (shown,0.017521)
Topic 7: (portion,0.270355) (side,0.108089) (end,0.0707633) (bodi,0.059409) (surfac,0.0554205) (direct,0.0473438) (form,0.0400563) (section,0.037248) (guid,0.0355345) (view,0.0340486) (upper,0.0292069) (shown,0.0286381) (shape,0.0268464) (main,0.0264341) (front,0.026107) (open,0.0252183) (extend,0.0208885) (lower,0.0202913) (hole,0.019694) (case,0.0184072)
Topic 24: (region,0.621829) (characterist,0.0302271) (portion,0.0278799) (laser,0.0274319) (direct,0.0259627) (overlap,0.0255685) (vertic,0.025246) (differ,0.0227913) (edg,0.020677) (posit,0.0195661) (width,0.018509) (adjac,0.0180611) (plural,0.0169681) (standard,0.0163947) (layout,0.0156422) (beam,0.0144955) (extend,0.0142984) (background,0.0139042) (specif,0.0122916) (abov,0.0122558)
Topic 82: (film,0.161729) (semiconductor,0.126738) (form,0.105283) (gate,0.094141) (insul,0.0559734) (silicon,0.0484004) (oxid,0.0421204) (electrod,0.0409827) (sourc,0.0344884) (transistor,0.0342963) (substrat,0.0337643) (drain,0.0326857) (conduct,0.0289546) (surfac,0.0261471) (region,0.0252235) (portion,0.0246842) (thin,0.0241079) (impur,0.0209088) (etch,0.0197341) (diffus,0.019638)
Topic 89: (electrod,0.309474) (wire,0.171766) (electr,0.0730991) (resist,0.0727857) (connect,0.0579722) (insul,0.0379882) (form,0.037917) (protect,0.0311655) (present,0.0265505) (discharg,0.0264081) (view,0.0192577) (section,0.0184458) (appli,0.0175627) (illustr,0.0160386) (contact,0.0155543) (addit,0.015127) (portion,0.0137311) (dispos,0.013546) (shown,0.0130902) (side,0.0125204)
Should the distributed and single-machine setups produce similar results? Or is there a good way to compare the results from the two setups systematically?
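One possible way to make this comparison systematic, as a rough sketch only: since topic numbers are assigned independently in each run, topics would presumably have to be matched by their word distributions rather than by ID. The Python below is an illustration and not part of Yahoo! LDA. The function names `parse_topics`, `cosine`, and `best_matches` are hypothetical, the input format is assumed to be the "Topic N: (word,weight) ..." lines shown above, and the sample data is just the first five entries of a few of those topics.

```python
import math
import re

# Assumed line format (taken from the listings above):
#   Topic 47: (region,0.433173) (portion,0.0659125) ...
TOPIC_RE = re.compile(r"^Topic\s+(\d+):\s*(.*)$")
PAIR_RE = re.compile(r"\(([^,()]+),([0-9.eE+-]+)\)")


def parse_topics(lines):
    """Parse topic lines into {topic_id: {word: weight}}; other lines are skipped."""
    topics = {}
    for line in lines:
        m = TOPIC_RE.match(line.strip())
        if not m:
            continue
        pairs = PAIR_RE.findall(m.group(2))
        topics[int(m.group(1))] = {word: float(weight) for word, weight in pairs}
    return topics


def cosine(a, b):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_matches(run_a, run_b):
    """For each topic in run_a, find the most similar topic in run_b.

    Returns {topic_id_a: (topic_id_b, similarity)}. This is a greedy match,
    so several topics in run_a may map to the same topic in run_b.
    """
    matches = {}
    for id_a, words_a in run_a.items():
        scored = ((id_b, cosine(words_a, words_b)) for id_b, words_b in run_b.items())
        matches[id_a] = max(scored, key=lambda pair: pair[1], default=(None, 0.0))
    return matches


# Abbreviated sample: first five entries of a few topics from the listings above.
distributed = parse_topics([
    "Topic 47: (region,0.433173) (portion,0.0659125) (semiconductor,0.0466138) (structur,0.0431049) (gate,0.0370144)",
    "Topic 92: (portion,0.1064) (bodi,0.0994396) (side,0.068734) (end,0.0669684) (support,0.0604946)",
])
single = parse_topics([
    "Topic 7: (portion,0.270355) (side,0.108089) (end,0.0707633) (bodi,0.059409) (surfac,0.0554205)",
    "Topic 24: (region,0.621829) (characterist,0.0302271) (portion,0.0278799) (laser,0.0274319) (direct,0.0259627)",
])

matches = best_matches(distributed, single)
for id_a, (id_b, sim) in sorted(matches.items()):
    print(f"distributed topic {id_a} -> single-machine topic {id_b} (cosine {sim:.3f})")
```

On the abbreviated sample, distributed Topic 92 pairs with single-machine Topic 7, and distributed Topic 47 pairs with single-machine Topic 24, even though the IDs differ. Only the top words are compared here, so the similarity is an approximation. A one-to-one assignment over the full topic-word distributions would be a stricter alternative to this greedy match.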
Thanks!