
Implement parallel execution for DAG tasks #4128

Open
andylizf wants to merge 56 commits into base branch advanced-dag

Conversation

andylizf (Contributor):

Closes #4055

This PR implements parallel execution for DAG tasks in the jobs controller, addressing issue #4055. The changes allow for efficient execution of complex DAGs with independent tasks running concurrently, significantly improving performance for workflows with parallel components.

Changes

  • Modified JobsController to identify and execute parallel task groups (a sketch of the scheduling idea follows this list)
  • Implemented thread-safe task execution and monitoring
  • Added concurrent resource management and cleanup
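
A minimal, simplified sketch of the scheduling idea above (not the PR's actual code; `deps` and `launch_task` are placeholder names): repeatedly collect the tasks whose upstream tasks have all completed and submit them to a thread pool, so independent tasks run concurrently.

```python
import concurrent.futures
from typing import Callable, Dict, List, Set


def run_dag_in_parallel(deps: Dict[int, List[int]],
                        launch_task: Callable[[int], None]) -> None:
    """deps maps each task id to the ids of its upstream tasks."""
    completed: Set[int] = set()
    remaining: Set[int] = set(deps)
    with concurrent.futures.ThreadPoolExecutor() as pool:
        while remaining:
            # Tasks whose upstreams are all done can start now, in parallel.
            ready = [t for t in remaining if set(deps[t]) <= completed]
            if not ready:  # should not happen for a valid DAG
                raise RuntimeError('Cycle or unsatisfiable dependency detected')
            futures = {pool.submit(launch_task, t): t for t in ready}
            for fut in concurrent.futures.as_completed(futures):
                fut.result()  # re-raise any exception from the task thread
                completed.add(futures[fut])
            remaining -= set(ready)


# Diamond example as in examples/dag/diamond.yml: 0 -> {1, 2} -> 3.
run_dag_in_parallel({0: [], 1: [0], 2: [0], 3: [1, 2]},
                    lambda t: print(f'launching task {t}'))
```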

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@andylizf (Contributor Author):

@cblmemo I'm currently working on implementing a cancellation mechanism for tasks that have already started or are queued for execution (similar to your setup with replicas preparing to launch). I'm currently using future.cancel(), but it doesn't seem to fully address cancellation of tasks that are already in progress. I haven't switched to using threading.Event yet, which might improve this.

That said, I noticed you used Process for managing the launch and termination of replicas. I don't see any clear advantages to using Process over Thread, especially since Thread should handle task cancellation just as well without the overhead of creating separate processes. Could you clarify your reasoning for choosing Process here? Is there a specific limitation you're addressing with this approach?
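
For reference, a minimal sketch (not this PR's code; `run_task` is a placeholder) of the distinction mentioned above: `Future.cancel()` only prevents work that has not started yet, while an already-running task needs a cooperative signal such as a `threading.Event` that it checks periodically.

```python
import concurrent.futures
import threading
import time

cancel_event = threading.Event()


def run_task(task_id: int) -> bool:
    """Placeholder task that checks the event so it can stop cooperatively."""
    for _ in range(100):
        if cancel_event.is_set():
            print(f'task {task_id} cancelled while running')
            return False
        time.sleep(0.1)  # stands in for real work
    return True


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_task, i) for i in range(4)]
    time.sleep(0.3)
    for fut in futures:
        fut.cancel()        # only cancels tasks still waiting in the queue
    cancel_event.set()      # asks already-running tasks to stop cooperatively
```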

@cblmemo (Collaborator) commented Oct 19, 2024:

> @cblmemo I'm currently working on implementing a cancellation mechanism for tasks that have already started or are queued for execution (similar to your setup with replicas preparing to launch). I'm currently using future.cancel(), but it doesn't seem to fully address cancellation of tasks that are already in progress. I haven't switched to using threading.Event yet, which might improve this.
>
> That said, I noticed you used Process for managing the launch and termination of replicas. I don't see any clear advantages to using Process over Thread, especially since Thread should handle task cancellation just as well without the overhead of creating separate processes. Could you clarify your reasoning for choosing Process here? Is there a specific limitation you're addressing with this approach?

This is mainly due to logging. Threads share the same sys.stdout, which makes the redirection in the following code infeasible: reassigning sys.stdout in one thread would affect every thread in the process.

I couldn't find a way to do this kind of logging redirection back then. If you figure out a way, please let me know ;)

```python
import sys
import traceback
from typing import Callable

from sky import sky_logging
from sky.utils import common_utils
from sky.utils import ux_utils


class RedirectOutputForProcess:
    """Redirects stdout and stderr to a file.

    This class enables output redirection for multiprocessing.Process.
    Example usage:

        p = multiprocessing.Process(
            target=RedirectOutputForProcess(func, file_name).run, args=...)

    This is equal to:

        p = multiprocessing.Process(target=func, args=...)

    plus redirecting all stdout/stderr to file_name.
    """

    def __init__(self, func: Callable, file: str, mode: str = 'w') -> None:
        self.func = func
        self.file = file
        self.mode = mode

    def run(self, *args, **kwargs):
        with open(self.file, self.mode, encoding='utf-8') as f:
            sys.stdout = f
            sys.stderr = f
            # Reconfigure the logger since the logger was initialized before
            # with the previous stdout/stderr.
            sky_logging.reload_logger()
            logger = sky_logging.init_logger(__name__)
            # The subprocess_util.run('sky status') inside
            # sky.execution::_execute cannot be redirected, since we cannot
            # directly operate on the stdout/stderr of the subprocess. This
            # is because some code in SkyPilot specifies the stdout/stderr
            # of the subprocess.
            try:
                self.func(*args, **kwargs)
            except Exception as e:  # pylint: disable=broad-except
                logger.error(f'Failed to run {self.func.__name__}. '
                             f'Details: {common_utils.format_exception(e)}')
                with ux_utils.enable_traceback():
                    logger.error(f'  Traceback:\n{traceback.format_exc()}')
                raise
```
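
For context, a usage sketch along the lines of the docstring above: running a function in a separate process with its output redirected to a per-task log file. The function and log path are placeholders, and the import assumes the class lives in sky.utils.ux_utils (as in the serve reference snippet later in this thread).

```python
import multiprocessing

from sky.utils import ux_utils  # assumed location of RedirectOutputForProcess


def launch_task(task_name: str) -> None:
    # Everything printed (or logged) here ends up in the redirected file.
    print(f'launching {task_name}')


if __name__ == '__main__':
    log_file = 'task_0_launch.log'  # placeholder path
    p = multiprocessing.Process(
        target=ux_utils.RedirectOutputForProcess(launch_task, log_file).run,
        args=('data-processing',))
    p.start()
    p.join()
```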

@andylizf andylizf changed the base branch from master to advanced-dag October 23, 2024 21:27
@andylizf (Contributor Author):

Left some design considerations for review @cblmemo:

  1. User feedback: Should we show "submitting to launch pool" or similar messages in console/controller logs to improve visibility?

  2. Log accessibility: After separating thread logs, how should we expose them to users? Consider adding "View logs at: ..." messages for each subtask.

  3. Task cancellation propagation: Need to define the behavior of downstream tasks when their upstream tasks are cancelled (which happens when the upstream's own upstream tasks fail). The current pending state might keep the job running indefinitely; see the sketch after this list for one possible propagation rule.
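
One possible propagation rule for point 3, as a minimal sketch (hypothetical data structures, not this PR's code): when a task fails or is cancelled, every transitive downstream task is marked cancelled as well, so the job does not sit in a pending state forever.

```python
from typing import Dict, List, Set


def propagate_cancellation(failed_task: int,
                           downstream: Dict[int, List[int]],
                           cancelled: Set[int]) -> None:
    """Mark all tasks reachable from `failed_task` as cancelled."""
    stack = list(downstream.get(failed_task, []))
    while stack:
        task = stack.pop()
        if task in cancelled:
            continue
        cancelled.add(task)
        stack.extend(downstream.get(task, []))


# Diamond DAG 0 -> {1, 2} -> 3: if task 1 fails, task 3 must be cancelled.
downstream = {0: [1, 2], 1: [3], 2: [3], 3: []}
cancelled: Set[int] = set()
propagate_cancellation(1, downstream, cancelled)
assert cancelled == {3}
```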

@cblmemo (Collaborator) commented Oct 24, 2024:

> Left some design considerations for review @cblmemo:
>
> 1. User feedback: Should we show "submitting to launch pool" or similar messages in console/controller logs to improve visibility?
> 2. Log accessibility: After separating thread logs, how should we expose them to users? Consider adding "View logs at: ..." messages for each subtask.
> 3. Task cancellation propagation: Need to define the behavior of downstream tasks when their upstream tasks are cancelled (which happens when the upstream's own upstream tasks fail). The current pending state might keep the job running indefinitely.

  1. I think running workloads in parallel is expected, and the technical terms would confuse users. Maybe let's leave them out for now.
  2. Yes, that would be great. For the previous pipeline, I believe we just tail the logs one by one, as they are sequential. I think we can start with only printing the file name (and the command to tail the logs) in the main controller log. We should also think about supporting commands to tail a specific task's log, e.g. by adding a --task-name argument to sky jobs logs (and printing it in the controller log); a sketch of such a flag follows this list.
  3. Could you elaborate on what case we would encounter this problem? I think we only start a job once all of its upstream tasks have finished (and thus it cannot be cancelled)? IIUC, if we want to cancel a job, we only need to stop all currently running jobs.
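
A minimal sketch of the --task-name idea from point 2 above, using click. The option names, entry point, and behavior here are assumptions for illustration, not the actual SkyPilot CLI code.

```python
import click


@click.command('logs')
@click.argument('job_id', type=int)
@click.option('--controller', is_flag=True, default=False,
              help='Stream the controller logs instead of the job logs.')
@click.option('--task-name', default=None, type=str,
              help='Only stream logs of the task with this name.')
def jobs_logs(job_id: int, controller: bool, task_name: str) -> None:
    """Stream logs of a managed job, optionally filtered to one task."""
    # Hypothetical core call; the real entry point and signature may differ.
    print(f'Would stream logs for job {job_id}, '
          f'controller={controller}, task_name={task_name}')


if __name__ == '__main__':
    jobs_logs()
```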

@cblmemo (Collaborator) left a comment:

Thanks for adding this feature @andylizf ! It is awesome. Left some comments to discuss ;)

Comment on lines 62 to 64
self._completed_tasks: Set[int] = set()
self._failed_tasks: Set[int] = set()
self._block_tasks: Set[int] = set()
cblmemo (Collaborator):

Can we use a map from task id to its status? Maintaining three separate sets feels a little redundant to me.
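
A minimal sketch of the suggestion (names are illustrative): a single mapping from task id to an explicit status, instead of three separate sets.

```python
import enum
from typing import Dict


class TaskStatus(enum.Enum):
    PENDING = 'PENDING'
    COMPLETED = 'COMPLETED'
    FAILED = 'FAILED'
    CANCELLED = 'CANCELLED'  # see the naming discussion below


# One dict replaces _completed_tasks / _failed_tasks / _block_tasks.
task_status: Dict[int, TaskStatus] = {}
task_status[0] = TaskStatus.COMPLETED
task_status[1] = TaskStatus.FAILED

failed = {tid for tid, s in task_status.items() if s is TaskStatus.FAILED}
```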

cblmemo (Collaborator):

Also, is it possible to use its name as the identifier? And, related to the previous PR, maybe instead of the rather meaningless timestamps we could use its index in a topological order as its default name (if the user does not specify one), e.g. task_1.

cblmemo (Collaborator):

The name block_tasks feels a little strange to me, as it might suggest the task will be executed at some point, which is not true (it is permanently cancelled). Should we rename it to CANCELLED?

andylizf (Contributor Author):

> Also, is it possible to use its name as the identifier? And, related to the previous PR, maybe instead of the rather meaningless timestamps we could use its index in a topological order as its default name (if the user does not specify one), e.g. task_1.

@cblmemo Could you clarify the second point? Not sure how using names relates to tasks here. Thanks!

cblmemo (Collaborator):

Currently we generate a task name if the user does not specify one. The name is something like f'task_{current_timestamp}', IIRC. Maybe we could change the naming convention to f'task_{id_in_topo_order}'.
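
A minimal sketch of the proposed convention (illustrative only): fall back to the task's index in a topological order when the user does not provide a name.

```python
from typing import Optional


def default_task_name(topo_index: int, user_name: Optional[str] = None) -> str:
    """Use the user-provided name, else fall back to the topo-order index."""
    return user_name if user_name else f'task_{topo_index}'


# e.g. the tasks of the diamond DAG become task_0 ... task_3 by default.
assert default_task_name(1) == 'task_1'
assert default_task_name(0, 'data-processing') == 'data-processing'
```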

Collapsed review threads on:
  • sky/jobs/controller.py
  • sky/jobs/recovery_strategy.py
  • sky/jobs/state.py
  • sky/serve/serve_utils.py
  • sky/utils/ux_utils.py
@andylizf (Contributor Author) commented Oct 27, 2024:

TODOs:

  • Update sky.launch to direct thread output to separate files, preventing interleaved logs in the controller's run.log.
  • Adjust the logic in stream_logs_by_id or sky jobs logs 1 --controller to avoid relying on get_latest_task_id_status, ensuring that each task’s run.log outputs fully and independently.

@andylizf (Contributor Author) commented Oct 28, 2024:

Finished implementing the initial TODOs:

> TODOs:
>
>   • Update sky.launch to direct thread output to separate files, preventing interleaved logs in the controller's run.log.
>   • Adjust the logic in stream_logs_by_id or sky jobs logs 1 --controller to avoid relying on get_latest_task_id_status, ensuring that each task’s run.log outputs fully and independently.

However, there are two remaining issues to be addressed:

  • Controller-side cluster's launch logs accessibility: Currently, logs like ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_3_launch.log are only available on the controller machine and can only be accessed remotely.
  • Cluster-side task run log persistence: As demonstrated in the example where task 0's logs are no longer available after completion, we need to implement proper log retention for completed tasks.

The current behavior is demonstrated in the attached logs, showing:

Output
andyl@DESKTOP-7FP6SMO ~/skypilot (dag-execute)> sky jobs launch ./examples/dag/diamond.yml  -y --cloud gcp    (skypilot) 
Task from YAML spec: ./examples/dag/diamond.yml
WARNING: override params {'cloud': GCP} are ignored, since the yaml file contains multiple tasks.
Managed job 'pipeline' will be launched on (estimated):
Task 'data-processing' requires AWS which is not enabled. To enable access, change the task cloud requirement or run: sky check aws
Task 'infer_A' requires AWS which is not enabled. To enable access, change the task cloud requirement or run: sky check aws
Best plan: 
----------------------------------------------------------------------------------------------------------------------------
 TASK              #NODES   CLOUD        INSTANCE              vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                 
----------------------------------------------------------------------------------------------------------------------------
 data-processing   1        GCP          n2-standard-2[Spot]   2       8         -              northamerica-northeast2-c   
 infer_A           1        GCP          n2-standard-2         2       8         -              us-central1-a               
 infer_B           1        Kubernetes   2CPU--2GB             2       2         -              kind-skypilot               
 eval_A_B          1        Kubernetes   2CPU--2GB             2       2         -              kind-skypilot               
----------------------------------------------------------------------------------------------------------------------------
Considered resources for task 'data-processing' (1 node):
----------------------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE              vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                 COST ($)   CHOSEN   
----------------------------------------------------------------------------------------------------------------
 GCP     n2-standard-2[Spot]   2       8         -              northamerica-northeast2-c   0.01          ✔     
----------------------------------------------------------------------------------------------------------------
Considered resources for task 'infer_A' (1 node):
----------------------------------------------------------------------------------------------
 CLOUD   INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
----------------------------------------------------------------------------------------------
 GCP     n2-standard-2   2       8         -              us-central1-a   0.10          ✔     
----------------------------------------------------------------------------------------------
Considered resources for task 'infer_B' (1 node):
---------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--2GB       2       2         -              kind-skypilot   0.00          ✔     
 GCP          n2-standard-2   2       8         -              us-central1-a   0.10                
---------------------------------------------------------------------------------------------------
Considered resources for task 'eval_A_B' (1 node):
---------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--2GB       2       2         -              kind-skypilot   0.00          ✔     
 GCP          n2-standard-2   2       8         -              us-central1-a   0.10                
---------------------------------------------------------------------------------------------------
Launching managed job 'pipeline' from jobs controller...
⚙︎ Launching managed jobs controller on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-jobs-controller-6eabc0cb.  View logs at: ~/sky_logs/sky-2024-10-27-21-14-08-796591/provision.log
⚙︎ Mounting files.
  Syncing (to 1 node): /tmp/managed-dag-pipeline-81wrv4dy -> ~/.sky/managed_jobs/pipeline-ba6f.yaml
  Syncing (to 1 node): /tmp/tmpnk_zyq_5 -> ~/.sky/managed_jobs/pipeline-ba6f.config_yaml
✓ Files synced.  View logs at: ~/sky_logs/sky-2024-10-27-21-14-08-796591/file_mounts.log
⚙︎ Running setup on managed jobs controller.
  Check & install cloud dependencies on controller: done.                   
✓ Setup completed.  View logs at: ~/sky_logs/sky-2024-10-27-21-14-08-796591/setup-*.log
Auto-stop is not supported for Kubernetes and RunPod clusters. Skipping.
⚙︎ Job submitted, ID: 4
├── To stream job logs:                 sky jobs logs 4 --controller or sky jobs logs 4 --task-id task_id

📋 Useful Commands
Managed Job ID: 4
├── To cancel the job: sky jobs cancel 4
├── To stream job logs: sky jobs logs 4
├── To stream controller logs: sky jobs logs --controller 4
├── To view all managed jobs: sky jobs queue
└── To view managed job dashboard: sky jobs dashboard

Controller Logs
andyl@DESKTOP-7FP6SMO ~/skypilot (dag-execute)> sky jobs logs 4 --controller    (skypilot) 
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(pipeline, pid=2474641) I 10-28 04:14:45 controller.py:63] DAG(pipeline: data-processing(infer_B,infer_A) infer_A(eval_A_B) infer_B(eval_A_B) eval_A_B(-))
(pipeline, pid=2474641) I 10-28 04:14:45 controller.py:442] Task 0 is submitted to run. To prevent from interleaving, the launch logs are redirected to ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_0_launch.log
(pipeline, pid=2474641) I 10-28 04:17:33 controller.py:457] Task 0 completed.
(pipeline, pid=2474641) I 10-28 04:17:33 controller.py:400] Task 0 completed with result: True
(pipeline, pid=2474641) I 10-28 04:17:33 controller.py:442] Task 2 is submitted to run. To prevent from interleaving, the launch logs are redirected to ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_2_launch.log
(pipeline, pid=2474641) I 10-28 04:17:33 controller.py:442] Task 1 is submitted to run. To prevent from interleaving, the launch logs are redirected to ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_1_launch.log
(pipeline, pid=2474641) I 10-28 04:19:35 controller.py:457] Task 2 completed.
(pipeline, pid=2474641) I 10-28 04:19:35 controller.py:400] Task 2 completed with result: True
(pipeline, pid=2474641) I 10-28 04:20:06 controller.py:457] Task 1 completed.
(pipeline, pid=2474641) I 10-28 04:20:06 controller.py:400] Task 1 completed with result: True
(pipeline, pid=2474641) I 10-28 04:20:06 controller.py:442] Task 3 is submitted to run. To prevent from interleaving, the launch logs are redirected to ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_3_launch.log
(pipeline, pid=2474641) I 10-28 04:22:06 controller.py:457] Task 3 completed.
(pipeline, pid=2474641) I 10-28 04:22:06 controller.py:400] Task 3 completed with result: True
(pipeline, pid=2474641) I 10-28 04:22:07 controller.py:570] Killing controller process 2511381.
(pipeline, pid=2474641) I 10-28 04:22:07 controller.py:578] Controller process 2511381 killed.
(pipeline, pid=2474641) I 10-28 04:22:07 controller.py:580] Cleaning up any cluster for job 4.
(pipeline, pid=2474641) I 10-28 04:22:07 controller.py:589] Cluster of managed job 4 has been cleaned up.
✓ Job finished (status: SUCCEEDED).

But for now, files like ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_3_launch.log are all unreachable; they exist only on the controller machine.

Cluster's Run Log
andyl@DESKTOP-7FP6SMO ~/skypilot (dag-execute)> sky jobs logs 4 --task-id 1                                                               (skypilot) 
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(setup pid=2321) setup for infer_A
(infer_A, pid=2321) infer_A starts
(infer_A, pid=2321) infer_A ends
✓ Job finished (status: SUCCEEDED).
Shared connection to 34.27.213.54 closed.
✓ Managed job finished: 4 (status: PENDING).

andyl@DESKTOP-7FP6SMO ~/skypilot (dag-execute)> sky jobs logs 4 --task-id 0 (skypilot)
✓ Managed job finished: 4 (status: STARTING).

From here we can see the necessity of the second TODO: after task 0 completed, its log was no longer retained.

@cblmemo PTAL at these remaining issues.

@cblmemo (Collaborator) commented Oct 28, 2024:

For logs like ~/sky_logs/sky-2024-10-28-04-14-45-846061/task_3_launch.log, we can add a command hint to the controller log, e.g.:

andyl@DESKTOP-7FP6SMO ~/skypilot (dag-execute)> sky jobs logs 4 --controller    (skypilot) 
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(pipeline, pid=2474641) I 10-28 04:14:45 controller.py:63] DAG(pipeline: data-processing(infer_B,infer_A) infer_A(eval_A_B) infer_B(eval_A_B) eval_A_B(-))
(pipeline, pid=2474641) I 10-28 04:14:45 controller.py:442] Task 0 is submitted to run. To see logs: sky jobs logs 4 --task-id 0 <<<<<<<<< NOTICE HERE
(pipeline, pid=2474641) I 10-28 04:17:33 controller.py:457] Task 0 completed.
(pipeline, pid=2474641) I 10-28 04:17:33 controller.py:400] Task 0 completed with result: True
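
Concretely, the submission log line could carry the command hint along these lines (a sketch with assumed names, not the actual controller code):

```python
import logging

logger = logging.getLogger(__name__)


def log_task_submission(task_id: int, managed_job_id: int) -> None:
    # Point users at the per-task log command instead of a controller-only path.
    logger.info(f'Task {task_id} is submitted to run. '
                f'To see logs: sky jobs logs {managed_job_id} --task-id {task_id}')
```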

@cblmemo (Collaborator) commented Oct 28, 2024:

> From here we can see the necessity of the second TODO: after task 0 completed, its log was no longer retained.

We can download logs before terminating the cluster. Reference:

```python
if sync_down_logs:
    _download_and_stream_logs(info)
logger.info(f'preempted: {info.status_property.preempted}, '
            f'replica_id: {replica_id}')
p = multiprocessing.Process(
    target=ux_utils.RedirectOutputForProcess(terminate_cluster,
                                             log_file_name, 'a').run,
    args=(info.cluster_name, replica_drain_delay_seconds),
)
```
