vine: efficient resource allocation #4006

Conversation

@JinZhou5042 (Member) commented Dec 11, 2024

Proposed Changes

Fix #3995

Problem

Main issue: resource allocation for tasks mostly fails, which hurts concurrency and extends the workflow execution time.

Tentative resource allocation policy:

  • The number of cores is the first-class citizen: if there are no available cores, never consider a task or even select a worker for it.
  • With proportional resource allocation, if one core is usable, there will always be enough memory.
  • If cores are available, overusing the worker cache is the only reason a task's resource allocation can fail.
  • Given that overusing the worker cache is unlikely (we usually have a large enough disk), the number of cores has a decisive impact on whether the resource allocation will succeed.
  • Therefore, we want to keep track of the globally usable cores (or function slots) and attempt task resource allocation only when some are available. This ensures that most task resource allocations succeed and that we don't waste time scanning deep into the task list when no cores are available at all (see the sketch after this list).
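A minimal sketch of the gating idea, assuming a hypothetical manager-side counter (the struct, field, and function names below are illustrative, not the actual cctools API):

```c
/* Illustrative sketch only: names are hypothetical, not the cctools API. */

struct scheduler_view {
	int total_cores;     /* sum of cores over all connected workers */
	int committed_cores; /* cores currently committed to running tasks */
};

/* Number of cores (or function slots) still usable across the cluster. */
static int usable_cores(const struct scheduler_view *s)
{
	return s->total_cores - s->committed_cores;
}

/* Gate: only consider tasks (and search for a worker) if at least one
 * core is usable; otherwise skip the expensive scheduling pass entirely. */
static int should_consider_tasks(const struct scheduler_view *s)
{
	return usable_cores(s) > 0;
}
```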

Here is an analysis of why the allocated disk tends to be bigger than the available disk (task input sizes are excluded, as they don't seem to matter):

[image not preserved]

In vine_manager_choose_resources_for_task, using proportional resource allocation, this is how we choose the disk resource for a task:

[image: disk allocation formula, not preserved]

In check_worker_have_enough_resources, this is how we calculate the available disk on a worker:

[image: available-disk calculation, not preserved]

Initially, there are no tasks running and the cache c is empty, so the first several tasks get a larger disk allocation. As more tasks are assigned to that worker, their outputs are added to the cache, so c grows and the free disk shrinks.

Say the cache size grows by delta_c when task t_i completes, and task t_(i+1) then gets scheduled. Compared to t_i, both the disk allocation and the available disk decrease:

[image: decrease in disk allocation and available disk, not preserved]

As the cache fills up, the available disk tends to shrink faster than the allocated disk. disk_allocate > disk_available happens when:

[image: inequality, not preserved]

Which is:

[image: simplified inequality, not preserved]

When more tasks are running, we use more cache space, so c grows and the right-hand side shrinks; we also use more task sandboxes, so s grows and the left-hand side gets bigger. Therefore, the inequality becomes more likely to hold, which is why more disk allocations fail as more tasks run.
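A reconstruction of the omitted formulas, based only on the surrounding text (here $D$ is the worker's total disk, $c$ the cache in use, $s$ the total sandbox size, and $p$ the task's proportion; the notation is my reading of the original images, not a verbatim copy):

$$\mathrm{disk\_allocate} = p\,(D - c), \qquad \mathrm{disk\_available} = D - c - s$$

$$\mathrm{disk\_allocate} > \mathrm{disk\_available} \iff p\,(D - c) > D - c - s \iff s > (1 - p)\,(D - c)$$

This matches the description above: a growing cache $c$ shrinks the right-hand side, and growing sandbox usage $s$ grows the left-hand side.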

Here is an example that aligns with the analysis. I requested 16 cores, but at any given time there are at most 15 tasks running concurrently:

[image: concurrency plot showing at most 15 tasks running, not preserved]

The csv file that records the resource allocation history

Solutions

  1. Cap the proportional resource allocation at the available resources; enforce a hard limit (see the sketch after this list).
  2. Keep track of the globally usable cores / function slots; a task is considered only if there are available cores or slots.
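A minimal sketch of the hard limit in solution 1 (names are illustrative, not the actual vine_manager.c code):

```c
#include <stdint.h>

/* Illustrative only: cap a proportionally computed disk request at the
 * disk the worker actually has available, so the allocation can never
 * exceed what check_worker_have_enough_resources reports as free. */
static int64_t cap_disk_request(int64_t proportional_disk, int64_t disk_available)
{
	return proportional_disk < disk_available ? proportional_disk : disk_available;
}
```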

Results

A dramatic improvement in task concurrency!

  • Original: [images not preserved]
  • With the proposed solutions: [images not preserved]

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a GitHub label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a GitHub label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

@JinZhou5042 marked this pull request as draft December 11, 2024 02:22
@JinZhou5042 self-assigned this Dec 11, 2024
@JinZhou5042 marked this pull request as ready for review December 12, 2024 02:27
@JinZhou5042 marked this pull request as draft December 12, 2024 13:17
@JinZhou5042 (Member, Author) commented Dec 12, 2024

Before the comments from @colinthomas-z80 and @btovar, I didn't fully understand the problem from the perspective of algorithm design; the math was straightforward but didn't identify the underlying problem.

What's going wrong in the code is that we use disk_total - cache_inuse as the available disk to allocate. However, that's not the actual amount of disk available for use. The total size of the task sandboxes should be accounted for as well, which means the available disk should be: disk_total - cache_inuse - sandboxes.

Given that, we should use task_disk_estimate = (disk_total - disk_inuse) * proportion to estimate the disk allocation. @dthain further suggested that, for the disk estimate, leaving half of it free would provide more disk space for incoming tasks and allow for potential cache expansion, so we have task_disk_estimate /= 2.
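A rough sketch of the estimate described above (function and variable names are illustrative; the real logic lives in vine_manager_choose_resources_for_task):

```c
#include <stdint.h>

/* Illustrative sketch of the disk estimate described above; not the actual code. */
static int64_t estimate_task_disk(int64_t disk_total, int64_t disk_inuse, double proportion)
{
	int64_t estimate = (int64_t)((disk_total - disk_inuse) * proportion);
	/* Leave roughly half of the estimate free for incoming tasks and potential cache growth. */
	estimate /= 2;
	return estimate;
}
```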

With these changes, the results are very encouraging! Both the concurrency and the success rate of resource estimates improved significantly, and the policy is self-consistent!

@btovar (Member) commented Dec 13, 2024

Jin, I don't think that disk_total - cache_inuse - sandboxes is correct. One way to see this is that if the tasks do not have any input files, and I want to schedule two tasks to the worker, the second task will get a smaller proportion than the first one.

I believe that cache_inuse and sandboxes are not important here and are just confusing the main issue, which is that by design, the proportional computation gives conservative allocations. This is true for all the resources, and it is my hunch that this is not an accounting problem.

I'd be more comfortable with a solution that includes all the resources. For example, check at the end whether an allocation from a computed proportion would not fit in the worker; if so, modify it in an easy-to-explain way (e.g., divide it by 2) and let the scheduler decide whether it fits, possibly rejecting the allocation. I do not think we want to allocate whatever is left, as we want automated allocations for similar tasks to be similar (about the same order of magnitude). For example, we do not want one task to get 1GB and another 10MB just because that's what was left. Such a correction should be made before we ensure that the allocation stays below limits that were explicitly specified.
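One possible reading of that suggestion in code (purely illustrative; the names and the halving rule come from this comment, not from the actual implementation):

```c
#include <stdint.h>

/* Purely illustrative: if the proportional allocation would not fit on the
 * worker, shrink it in an easy-to-explain way (halve it) rather than handing
 * out whatever space happens to be left, then let the scheduler accept or
 * reject the adjusted allocation. */
static int64_t adjust_proportional_allocation(int64_t requested, int64_t worker_available)
{
	if (requested > worker_available) {
		requested /= 2;
	}
	return requested;
}
```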

@JinZhou5042 requested a review from dthain January 15, 2025 18:32
@dthain (Member) commented Jan 16, 2025

Come on over today and let's talk through a few things that I would like to understand better.

@dthain (Member) commented Jan 17, 2025

Per our discussion today, the disk allocation should be:

disk = ((total - cache)*frac)/proportion

Where frac is a tunable value with a default of 0.75
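For illustration only, with made-up numbers: total = 100 GB, cache = 20 GB, frac = 0.75, and proportion = 4 gives disk = ((100 - 20) * 0.75) / 4 = 15 GB.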

@JinZhou5042 requested review from btovar and dthain January 17, 2025 18:07
@dthain (Member) left a comment

One last thing, almost there!

@JinZhou5042 requested a review from dthain January 17, 2025 19:35
@dthain (Member) left a comment

Hooray! Good work on a long and complex PR!

@btovar (Member) left a comment

There seem to be a lot of unrelated changes in this PR. I suggest closing this PR and resubmitting with only the following changes:

  • The multiplying factor applied to disk_available.
  • The corresponding tune parameter.
  • Updates to the tune parameter documentation in the manual.

```c
/* Compute the proportion of the worker the task shall have across resource types. */
double max_proportion = -1;
double min_proportion = -1;
```
Review comment (Member):

min_proportion is needed when using automatic resource allocation via categories; please do not remove it.

@JinZhou5042 (Member, Author) commented

Sounds good! I would like to hold it a bit until the DV5 application is compatible with the latest DaskVine/Dask and a supported Coffea version. I currently can't do experimental tests on DV5 with our latest changes.

@dthain (Member) commented Jan 21, 2025

Hmm, are you able to run a test by going back to an earlier version of Coffea and/or DV5?

@JinZhou5042 (Member, Author) commented

Just in case something unexpected happens, since a few PRs have been merged since then. I can test with an earlier version of Dask + DaskVine on my end, but if we merge this PR that way, resource allocation for the latest cctools will remain untested.

@JinZhou5042 (Member, Author) commented

All features in this PR have been separated into a sequence of PRs.

Successfully merging this pull request may close these issues:
vine: scheduling inefficiency b/c task resource pre-allocation usually fails