[core] ray.get on a mixed list of dead and alive (but hanging) tasks does not immediately raise RayActorError #47204
import ray
from typing import Callable

@ray.remote(num_cpus=1)
class TestClass:
    def execute(self, fn: Callable[..., None]) -> None:
        return fn()

    def exit(self):
        ray.actor.exit_actor()

def dummy_func():
    import time
    print(100)
    time.sleep(100)

# Start 10 actors and wait until all of them are ready.
actors = [TestClass.remote() for _ in range(10)]
ray.get([actor.__ray_ready__.remote() for actor in actors])

# Make the first two actors exit.
refs = []
for actor in actors[:2]:
    refs.append(actor.exit.remote())
    # ray.kill(actor)
try:
    ray.get(refs)
except Exception:
    pass

# Submit tasks to both dead and alive actors; the alive ones sleep for 100 s.
tasks = [actor.execute.remote(dummy_func) for actor in actors]
ray.get(tasks)

The main issue is that if an actor has crashed and you call ray.get on tasks from crashed and uncrashed actors together, ray.get doesn't raise an exception until the tasks on the uncrashed actors have finished. We can easily get around this in the Train layer (by using ray.wait), and
unassign myself now as it is mitigated
Title changed from "ray.get on a task scheduled on a dead actor hangs instead of raising RayActorError" to "ray.get on a mixed list of dead and alive (but hanging) tasks does not immediately raise RayActorError"
See here for a more consistent reproduction: https://github.com/anyscale/runtime/pull/929/files#diff-1913713e052df41064554b30df8e0f47abef67dee769c9de607e7728ef2e4d40R397
It is not a blocker, but let's fix this soon. The semantics are very bad for fault-tolerant cases.
Expected behavior: any pending tasks of the crashed actor raise ActorDiedError on
Closes: ray-project#47204 Signed-off-by: Chi-Sheng Liu <[email protected]>
…t immediately raise ActorDiedError Closes: ray-project#47204 Signed-off-by: Chi-Sheng Liu <[email protected]>
What happened + What you expected to happen
We recently hit a problem when running functions on a dead actor (dead because the underlying node was killed). We expected an exception to be raised immediately. However, the code just hangs. The behavior is a bit flaky, but it hangs in most cases. This might be a regression, as the same code raised as expected in early July.
Mini repro appended below.
Versions / Dependencies
2.34
Reproduction script
Issue Severity
High: It blocks me from completing my task.