A synchronization problem and ICE networking error #4

Open
metacret opened this issue Oct 20, 2011 · 1 comment

@metacret

Hi

While running Yahoo LDA on my Hadoop cluster, I ran into the following problems:

  1. Permission denied for the executables contained in the jar package. To resolve this, I added chmod 755 $LDALibs/* to Formatter.sh and LDA.sh.
  2. A synchronization problem with global/lda.dict.dump. I found that if other processes run the following command before process 0 has finished writing global/lda.dict.dump:

${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global

they cannot download the file and the whole run crashes. So I added synchronization code such as wait_for 60 ${mapred_output_dir}/global/lda.dict.dump.
  3. A critical problem with multi-machine Yahoo LDA (the ICE networking error below).

Finally, I got the following error. It is not related to the run script, so how can I recover from this situation?

I1020 03:57:06.626588 20423 Merge_Topic_Counts.cpp:103] Initializing global dictionary from lda.dict.dump.global
W1020 03:57:11.659412 20423 Merge_Topic_Counts.cpp:105] global dictionary Initialized
terminate called after throwing an instance of 'Ice::ConnectionLostException'
what(): TcpTransceiver.cpp:248: Ice::ConnectionLostException:
connection lost: Connection reset by peer

Should I modify the LDA.sh script to check the exit code of each module execution and retry until it succeeds?
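
For illustration, the kind of retry wrapper I have in mind would look roughly like this (a hypothetical sketch; the retry helper and the example command are placeholders, not part of the existing scripts):

# Hypothetical retry helper: re-run a command until it exits with status 0,
# giving up after a fixed number of attempts.
retry() {
  local max_attempts=$1; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt failed attempts: $*" >&2
      return 1
    fi
    echo "Attempt $attempt failed, retrying in 10s..." >&2
    attempt=$((attempt + 1))
    sleep 10
  done
}

# Example (placeholder command, not the real module invocation in LDA.sh):
# retry 5 ${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global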

Thank you!

@shravanmn
Collaborator

On Friday 21 October 2011 03:23 AM, metacret wrote:

Hi

While running Yahoo LDA on my Hadoop cluster, I ran into the following problems:

  1. Permission denied for the executables contained in the jar package. To resolve this, I added chmod 755 $LDALibs/* to Formatter.sh and LDA.sh.

cool!
2. A synchronization problem with global/lda.dict.dump. I found that if other processes run the following command before process 0 has finished writing global/lda.dict.dump:

${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global

I don't know why you say that other processes try to get the global dictionary before it is written. There is already a

wait_for_all 60 ${synch_dir}"/global_dict";

that takes care of it.
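
The rough idea behind that barrier is something like the following (an illustrative sketch only, not the actual wait_for_all implementation; the helper name here is made up):

# Illustrative polling barrier (NOT the real wait_for_all): block until a
# path exists in HDFS, checking every `interval` seconds, with a bounded
# number of checks before giving up.
wait_for_path() {
  local interval=$1
  local path=$2
  local max_checks=${3:-60}
  local checks=0
  while ! ${HADOOP_CMD} dfs -test -e "$path"; do
    checks=$((checks + 1))
    if [ "$checks" -ge "$max_checks" ]; then
      echo "Timed out waiting for $path" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# e.g. wait_for_path 60 ${synch_dir}/global_dict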

they cannot download the file and the whole run crashes. So I added synchronization code such as wait_for 60 ${mapred_output_dir}/global/lda.dict.dump.

  3. A critical problem with multi-machine Yahoo LDA (the ICE networking error below).

Finally, I got the following error. It is not related to the run script, so how can I recover from this situation?

I1020 03:57:06.626588 20423 Merge_Topic_Counts.cpp:103] Initializing global dictionary from lda.dict.dump.global
W1020 03:57:11.659412 20423 Merge_Topic_Counts.cpp:105] global dictionary Initialized
terminate called after throwing an instance of 'Ice::ConnectionLostException'
what(): TcpTransceiver.cpp:248: Ice::ConnectionLostException:
connection lost: Connection reset by peer

This is more of a Hadoop problem. Connections can be lost for many reasons beyond the control of LDA, so it is only the checkpointing & restart mechanism that can take care of these. You don't need to worry about them: LDA is configured to automatically restart from the last checkpointed iteration.
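
Purely as an illustration of the shape of that restart logic (the checkpoint directory, marker file, and learner command below are hypothetical placeholders, not LDA's actual files or flags):

# Hypothetical sketch only: resume from the last recorded iteration if a
# checkpoint marker exists, otherwise start from iteration 0.
checkpoint_dir=${mapred_output_dir}/checkpoints   # hypothetical location
start_iter=0
if ${HADOOP_CMD} dfs -test -e "${checkpoint_dir}/last_iteration"; then
  # Resume from the last completed iteration recorded in the marker file.
  start_iter=$(${HADOOP_CMD} dfs -cat "${checkpoint_dir}/last_iteration")
fi
echo "Starting/resuming from iteration ${start_iter}"
# run_learner --start-iteration "${start_iter}"   # placeholder command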

Thanks,
--Shravan
PS: Sorry for the late response. Was really busy with some other stuff

Should I modify the LDA.sh script to check the exit code of each module execution and retry until it succeeds?

Thank you!
