Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeout for Gloo adjustable by env vars #43

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

HRezaei
Copy link

@HRezaei HRezaei commented Sep 1, 2024

This PR adds more flexibility by allowing the developer to configure timeout for Gloo using a new env variable "GLOO_TIMEOUT".

The default value is 1800 seconds and thus we have to wait 30 minutes to get error if other required processes are not running. In experimental deployments we'd rather to set this to a smaller value to avoid 30 minutes waiting and see the timeout error sooner.

On the other hand, we need to set this to a larger value for debugging, otherwise the execution is halted and the timeout error occurs if the line by line debugging takes more than 30 minutes.

@ClementRomac
Copy link
Collaborator

Hi, thanks for your PR!
You are right, this 30 minutes timeout can be painful.
Would you mind adding the timeout variable as a parameter of lamorel's config instead of an environment variable? It would allow users to set this as any other parameters in lamorel (e.g. number of LLMs).

@HRezaei
Copy link
Author

HRezaei commented Sep 20, 2024

You're welcome!

Would you mind adding the timeout variable as a parameter of lamorel's config instead of an environment variable?

Yes, sure, is it OK to be directly under the lamorel_args? I mean something like this:

lamorel_args:
  log_level: info
  gloo_timeout: 1800
...

The other options are to nest it under either lamorel_args.accelerate_args or lamorel_args.distributed_setup_args even though I'm not sure if it fits there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants