[core] ray.get on a mixed list of dead and alive (but hanging) tasks does not immediately raise RayActorError #47204

Open
hongpeng-guo opened this issue Aug 19, 2024 · 5 comments · May be fixed by #48264
Labels: bug, core, P0

Comments

@hongpeng-guo
Contributor

hongpeng-guo commented Aug 19, 2024

What happened + What you expected to happen

We recently hit a problem when calling methods on a dead actor (dead because its underlying node was killed). We expected an exception to be raised immediately, but the code appears to hang instead. The behavior is somewhat flaky, but it hangs in most cases. This may be a regression, since the same code raised as expected in early July.

A minimal repro is included below.

Versions / Dependencies

2.34

Reproduction script

import ray
from typing import Callable

@ray.remote(num_cpus=1)
class TestClass:
    def execute(self, fn: Callable[..., None]) -> None:
        return fn()

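    def exit(self):
        # Terminate this actor from inside one of its own methods.
        ray.actor.exit_actor()

def dummy_func():
    print(100)

actors = [TestClass.remote() for _ in range(10)]
# Ask every actor to exit, without waiting for the exits to complete.
for actor in actors:
    actor.exit.remote()
# Submit work to the (now exiting or dead) actors.
tasks = [actor.execute.remote(dummy_func) for actor in actors]
# Expected: an exception is raised promptly. Observed: this usually hangs.
ray.get(tasks)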

Issue Severity

High: It blocks me from completing my task.

@hongpeng-guo added the bug and triage labels Aug 19, 2024
@hongpeng-guo added the core and P0 labels and removed the triage label Aug 19, 2024
@justinvyu changed the title from "[core] Died work that exits itself doesn't raise exception when called to run a function." to "[core] ray.get on a task scheduled on a dead actor hangs instead of raising RayActorError" Aug 19, 2024
@aslonnie added the release-blocker and weekly-release-blocker labels Aug 19, 2024
@rkooo567
Contributor

import ray
from typing import Callable

@ray.remote(num_cpus=1)
class TestClass:
    def execute(self, fn: Callable[..., None]) -> None:
        return fn()

    def exit(self):
        ray.actor.exit_actor()

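def dummy_func():
    import time
    print(100)
    time.sleep(100)  # simulate a long-running task on a live actor

actors = [TestClass.remote() for _ in range(10)]
# Wait until every actor has started.
ray.get([actor.__ray_ready__.remote() for actor in actors])
refs = []
# Crash only the first two actors; the other eight stay alive.
for actor in actors[:2]:
    refs.append(actor.exit.remote())
    # ray.kill(actor)
try:
    # Wait for the exits to take effect; an error from the exiting actors is expected here.
    ray.get(refs)
except Exception:
    pass

# Mix tasks on dead actors with long-running tasks on live actors.
tasks = [actor.execute.remote(dummy_func) for actor in actors]
# Does not raise until the tasks on the live actors finish.
ray.get(tasks)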

The main issue is that if an actor has crashed and you call ray.get on tasks from crashed and healthy actors together, ray.get doesn't raise an exception until the tasks on the healthy actors have finished. We can easily work around this in the Train layer (by using ray.wait), and
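
The wait-based workaround mentioned in this comment can be sketched without a Ray cluster. The block below is a standard-library analogy, not Ray code and not the actual Train-layer change: `concurrent.futures.wait` stands in for `ray.wait`, `Future.result` stands in for `ray.get`, and the names `get_fail_fast`, `crash`, `hang`, and `demo` are hypothetical. The idea is to check tasks as they complete, so a failure surfaces without waiting for the slow tasks.

```python
import time
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Any, Iterable, List


def get_fail_fast(futures: Iterable[Future]) -> List[Any]:
    """Return results in input order, raising as soon as any future fails."""
    futures = list(futures)
    pending = set(futures)
    while pending:
        # Block only until at least one future completes (the role ray.wait
        # would play in the Ray version), then inspect the finished ones.
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for fut in done:
            fut.result()  # re-raises the task's exception, if any
    return [fut.result() for fut in futures]


def crash() -> None:
    # Hypothetical stand-in for a task submitted to a dead actor.
    raise RuntimeError("actor died")


def hang(seconds: float) -> str:
    # Hypothetical stand-in for a long-running task on a live actor.
    time.sleep(seconds)
    return "done"


def demo() -> float:
    """Return how long it took for the failure to surface, in seconds."""
    pool = ThreadPoolExecutor(max_workers=4)
    start = time.monotonic()
    futures = [pool.submit(hang, 2.0), pool.submit(crash)]
    try:
        get_fail_fast(futures)
    except RuntimeError:
        elapsed = time.monotonic() - start
    else:
        raise AssertionError("expected the crash to propagate")
    finally:
        # Do not block on the still-sleeping task.
        pool.shutdown(wait=False)
    return elapsed
```

Calling `demo()` should return well before the two-second task finishes, because the failed task is checked as soon as it completes. A Ray version would presumably follow the same loop shape with `ray.wait` and `ray.get`, but that is an assumption and has not been verified against this issue.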

@rkooo567 removed the release-blocker label Aug 20, 2024
@rkooo567 rkooo567 removed their assignment Aug 20, 2024
@rkooo567
Contributor

Unassigning myself for now, as this is mitigated.

@justinvyu changed the title from "[core] ray.get on a task scheduled on a dead actor hangs instead of raising RayActorError" to "[core] ray.get on a mixed list of dead and alive (but hanging) tasks does not immediately raise RayActorError" Aug 20, 2024
@justinvyu
Contributor

@rkooo567
Contributor

It is not a blocker, but let's fix this soon. These semantics are very bad for fault-tolerant use cases.

@can-anyscale removed the weekly-release-blocker label Aug 22, 2024
@rynewang
Contributor

rynewang commented Oct 9, 2024

Expected behavior: any pending tasks of the crashed actor raise ActorDiedError on ray.get(obj) or ray.get([obj, other_objs]). Neither call should hang.

MortalHappiness added a commit to MortalHappiness/ray that referenced this issue Oct 24, 2024
Closes: ray-project#47204
Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness added a commit to MortalHappiness/ray that referenced this issue Oct 24, 2024
…t immediately raise ActorDiedError

Closes: ray-project#47204
Signed-off-by: Chi-Sheng Liu <[email protected]>
7 participants