Implement parallel execution for DAG tasks #4128
base: advanced-dag
Conversation
…#4067)
- provide an example, edited from pipeline.yml
- more focus on dependencies for user dag lib
- more powerful user interface
- load and dump new yaml format
- fix
- fix: reversed logic in add_edge
- rename
- refactor due to reviewer's comments
- generate task.name if not given
- add comments for add_edge
- add `print_exception_no_traceback` when raise
- make `Dag.tasks` a property
- print dependencies for `__repr__`
- move `get_unique_task_name` to common_utils
- rename methods to use downstream/edge terminology
@cblmemo I'm currently working on implementing a cancellation mechanism for tasks that have already started or are queued for execution (similar to your setup with replicas preparing to launch). I'm currently using … That said, I noticed you used …
This is mainly due to logging. Threading will share the same … I could not find a way to do this kind of logging redirection back then. If you figured out a way, pls let me know ;)

skypilot/sky/utils/ux_utils.py Lines 80 to 121 in 7971aa2
Co-authored-by: Tian Xia <[email protected]>
…amed" Otherwise, users can not refer to the task by name in the DAG. This reverts commit 8486352.
Left some design considerations for review @cblmemo:
Thanks for adding this feature @andylizf ! It is awesome. Left some comments to discuss ;)
sky/jobs/controller.py
Outdated
self._completed_tasks: Set[int] = set()
self._failed_tasks: Set[int] = set()
self._block_tasks: Set[int] = set()
Can we use a map from task id to its status? Maintaining three separate sets feels a little redundant to me.
Also, is it possible to use its name as the identifier? And, related to the previous PR: instead of the less meaningful timestamps, maybe we could use its index in a topological order as its default name (if the user does not specify one), e.g. task_1.
The name block_tasks feels a little strange to me, as it might suggest these tasks will be executed at some point, which is not true (they are permanently cancelled). Should we rename it to CANCELLED?
@cblmemo Could you clarify the second point? Not sure how using names relates to tasks here. Thanks!
Currently we generate a task name if the user does not specify one. The name is something like f'task_{current_timestamp}', IIRC. Maybe we could change the naming convention to f'task_{id_in_topo_order}'.
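The naming scheme suggested above can be sketched with Kahn's algorithm; this is an illustration, not the PR's code, and the `deps` format and function name are assumptions.

```python
from collections import deque
from typing import Dict, List


def assign_default_names(deps: Dict[str, List[str]]) -> Dict[str, str]:
    """Name each task f'task_{i}' where i is its topological index.

    `deps` maps a task id to the ids of the tasks it depends on.
    """
    indegree = {t: len(d) for t, d in deps.items()}
    downstream: Dict[str, List[str]] = {t: [] for t in deps}
    for t, ds in deps.items():
        for d in ds:
            downstream[d].append(t)
    # Start from tasks with no dependencies (sorted for determinism).
    queue = deque(sorted(t for t, deg in indegree.items() if deg == 0))
    names: Dict[str, str] = {}
    i = 0
    while queue:
        t = queue.popleft()
        names[t] = f'task_{i}'
        i += 1
        for nxt in downstream[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return names
```

Unlike timestamps, these names are stable across runs and tell the user where a task sits in the DAG.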
TODOs:
…ueue each node once
…uture cancellation policy discussion.
Finished implementing the initial TODOs:
However, there are two remaining issues to be addressed:
The current behavior is demonstrated in the attached logs, showing:

Output
Controller Logs
But for now things like …

Cluster's Run Log
From here we can see the necessity of the second TODO: after task 0 completed, its log was no longer preserved. @cblmemo PTAL at these remaining issues.
For logs like:

andyl@DESKTOP-7FP6SMO ~/skypilot (dag-execute)> sky jobs logs 4 --controller (skypilot)
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(pipeline, pid=2474641) I 10-28 04:14:45 controller.py:63] DAG(pipeline: data-processing(infer_B,infer_A) infer_A(eval_A_B) infer_B(eval_A_B) eval_A_B(-))
(pipeline, pid=2474641) I 10-28 04:14:45 controller.py:442] Task 0 is submitted to run. To see logs: sky jobs logs 4 --task-id 0 <<<<<<<<< NOTICE HERE
(pipeline, pid=2474641) I 10-28 04:17:33 controller.py:457] Task 0 completed.
(pipeline, pid=2474641) I 10-28 04:17:33 controller.py:400] Task 0 completed with result: True
We can download logs before terminating the cluster. Reference: skypilot/sky/serve/replica_managers.py Lines 759 to 768 in d3be8ed
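The essential design point is ordering: fetch the logs first, terminate second. A tiny sketch of that contract (the function and its callables are hypothetical, not the replica_managers.py implementation referenced above):

```python
from typing import Callable


def terminate_with_logs(download_logs: Callable[[], None],
                        terminate_cluster: Callable[[], None]) -> None:
    # Hypothetical wrapper: always sync logs down before teardown,
    # so task logs survive cluster termination.
    download_logs()
    terminate_cluster()
```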
Closes #4055
This PR implements parallel execution for DAG tasks in the jobs controller, addressing issue #4055. The changes allow for efficient execution of complex DAGs with independent tasks running concurrently, significantly improving performance for workflows with parallel components.
Changes
- JobsController: identify and execute parallel task groups

Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh
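The core scheduling idea in this PR (submit a task as soon as all of its dependencies finish, running independent tasks concurrently) can be sketched roughly as follows. This uses a thread pool for brevity, whereas the controller runs tasks in separate processes for log isolation; all names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_dag(deps: Dict[str, List[str]],
            run: Callable[[str], None]) -> None:
    """Execute a DAG, running independent tasks concurrently.

    `deps` maps each task to the tasks it depends on.
    """
    remaining = {t: set(d) for t, d in deps.items()}
    downstream: Dict[str, List[str]] = {t: [] for t in deps}
    for t, ds in deps.items():
        for d in ds:
            downstream[d].append(t)
    with ThreadPoolExecutor() as pool:
        # Seed the pool with tasks that have no dependencies.
        futures = {pool.submit(run, t): t
                   for t, d in remaining.items() if not d}
        while futures:
            # Wait for any one task to finish, then submit every
            # downstream task that has just become unblocked.
            fut = next(as_completed(list(futures)))
            done = futures.pop(fut)
            fut.result()  # re-raise failures from the task
            for nxt in downstream[done]:
                remaining[nxt].discard(done)
                if not remaining[nxt]:
                    futures[pool.submit(run, nxt)] = nxt
```

For the diamond DAG in the logs above (data-processing, then infer_A and infer_B in parallel, then eval_A_B), the two infer tasks would be submitted together as soon as data-processing completes.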