Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when launching the same cluster in parallel #4166

Open
cg505 opened this issue Oct 24, 2024 · 0 comments
Open

Crash when launching the same cluster in parallel #4166

cg505 opened this issue Oct 24, 2024 · 0 comments

Comments

@cg505
Copy link
Collaborator

cg505 commented Oct 24, 2024

Repro steps:

  • run sky jobs launch 100x in parallel
  • some invocations crash

This seems to be a race condition related to the ray cluster yaml created when provisioning the controller cluster (and I assume this can apply to any cluster that you are launching many times in parallel).
The crash is here:

restored_yaml_content = _replace_yaml_dicts(

Since tmp_yaml_path is unique per cluster but not per invocation, it seems that multiple processes are writing to/reading from this file at the same time.
Note that this shouldn't affect the actual yaml_path since that is handled by a simple rename
os.rename(tmp_yaml_path, yaml_path)

One possible solution would be to add some random value to the tmp_yaml_path.

Version & Commit info:

  • sky -v: dev
  • sky -c: forgot to check, I think f2991b144d4b15eac55dd7f759f361b6146033b3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant