This seems to be a race condition related to the ray cluster yaml created when provisioning the controller cluster (and I assume this can apply to any cluster that you are launching many times in parallel).

Repro steps: run `sky jobs launch` 100x in parallel.
The crash is here: `skypilot/sky/backends/backend_utils.py`, line 995 in 6dc386b
Since `tmp_yaml_path` is unique per cluster but not per invocation, it seems that multiple processes are writing to and reading from this file at the same time.

Note that this shouldn't affect the actual `yaml_path`, since that is handled by a simple rename: `skypilot/sky/backends/backend_utils.py`, line 1019 in 6dc386b
One possible solution would be to add some random value to the `tmp_yaml_path`.

Version & Commit info:
`sky -v`: dev
`sky -c`: forgot to check, I think f2991b144d4b15eac55dd7f759f361b6146033b3
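
To illustrate the suggested fix of adding a random value to `tmp_yaml_path`, here is a minimal sketch. It is not the actual code in `backend_utils.py`: the helper names `unique_tmp_path` and `write_yaml_atomically` and the suffix scheme (PID plus a short random hex string) are assumptions made for illustration only. The idea is that each invocation writes to its own temp file and then renames it onto the final path, so concurrent launches never share a partially written file.

```python
import os
import uuid


def unique_tmp_path(yaml_path: str) -> str:
    """Return a temp path next to yaml_path that is unique per invocation.

    Hypothetical naming scheme: the PID plus a random hex suffix, so two
    processes (or two calls in one process) never pick the same file.
    """
    return f'{yaml_path}.{os.getpid()}.{uuid.uuid4().hex[:8]}.tmp'


def write_yaml_atomically(yaml_path: str, content: str) -> None:
    """Write content to a private temp file, then rename it into place."""
    tmp_yaml_path = unique_tmp_path(yaml_path)
    try:
        with open(tmp_yaml_path, 'w', encoding='utf-8') as f:
            f.write(content)
        # os.replace is an atomic rename when source and destination are on
        # the same filesystem, so readers see either the old or the new file.
        os.replace(tmp_yaml_path, yaml_path)
    finally:
        # Clean up the temp file if the write or rename failed.
        if os.path.exists(tmp_yaml_path):
            os.remove(tmp_yaml_path)
```

With this approach the last writer wins on `yaml_path`, which matches the existing rename behavior, but no process ever reads or overwrites another process's temp file.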