Diversity calculations on windowed datasets generate lots of unmanaged memory #1278

Open · percyfal opened this issue Dec 5, 2024 · 8 comments

percyfal commented Dec 5, 2024

When working on a reasonably large dataset (a 7 TiB Zarr store), I noticed that diversity calculations, in particular windowed ones, generate a lot of unmanaged memory. The call_genotype array is 280 GiB stored, 7.2 TiB uncompressed, covering some 3 billion sites and 1,000 samples. At the end of a run there is almost 300 GiB of unmanaged memory, which rules out the 256 GB and possibly even the 512 GB memory nodes (I've been testing a dask LocalCluster). Maybe this is more an issue with dask, but I thought I'd post it in case there is something that can be done to free up memory in the underlying implementation.
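
For context, a minimal sketch of the kind of pipeline involved - the store path, window size and the single all-samples cohort below are placeholders, not the actual run:

```python
import numpy as np
import sgkit as sg
import xarray as xr

# Sketch only; assumes a dask.distributed Client / LocalCluster is already running.
ds = sg.load_dataset("/path/to/store.zarr")  # placeholder path

# Popgen statistics in sgkit are computed per cohort; here all samples are
# placed in a single cohort for illustration.
ds["sample_cohort"] = xr.DataArray(
    np.zeros(ds.sizes["samples"], dtype=int), dims="samples"
)

# Fixed-size windows along the genome, then windowed nucleotide diversity.
ds = sg.window_by_position(ds, size=100_000)
ds = sg.diversity(ds)

# This triggers the dask computation; it is at this point that unmanaged
# worker memory accumulates.
stat = ds["stat_diversity"].compute()
```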

jeromekelleher (Collaborator) commented:

Very interesting, thanks @percyfal! Can you give more details about how you got these numbers please?

percyfal (Author) commented Dec 5, 2024

Basically by monitoring the dask dashboard. I will be running lots of these calculations in the coming days, so I'll make sure to include a screenshot. On the low-memory nodes (256 GB RAM), the worker memory bars very quickly fill up with light blue (unmanaged memory) and orange (spilled to disk).
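
For reference, this is the kind of setup being monitored - worker counts and memory limits here are placeholders, not the actual node configuration:

```python
from dask.distributed import Client, LocalCluster

# Placeholder worker counts and limits; the real nodes have 256 GB or 512 GB of RAM.
cluster = LocalCluster(n_workers=8, threads_per_worker=4, memory_limit="32GB")
client = Client(cluster)

# The dashboard shows per-worker memory bars: light blue is unmanaged memory,
# orange is memory spilled to disk.
print(client.dashboard_link)
```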

percyfal (Author) commented Dec 5, 2024

Unfortunately the task graph is too large to display so I couldn't easily tell what the scheduler is holding on to (@benjeffery suggested I look at this).

jeromekelleher (Collaborator) commented:

I see, this is coming from Dask. Unfortunately our experience has been that it's quite hard to keep a handle on Dask's memory usage - this is one of the key motivations for cubed, which we plan to support in sgkit (#908).

@tomwhite - are the popgen calculations something that will run on Cubed currently?

tomwhite (Collaborator) commented Dec 6, 2024

> @tomwhite - are the popgen calculations something that will run on Cubed currently?

Not yet unfortunately, since the windowing introduces variable-sized chunks, which Cubed can't currently handle.

Having said that, I'd like to look at getting popgen methods working on Cubed in the new year. I'm wondering if the xarray groupby work (using flox) would work here, since that is something that Cubed does support (see cubed-dev/cubed#476).
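
As a rough illustration of the shape of that computation (toy data, and not how sgkit implements windowing today): assign each variant a window id from its position and reduce with xarray's groupby, which dispatches to flox when it is installed and which maps onto operations Cubed supports.

```python
import numpy as np
import xarray as xr

# Toy data standing in for per-variant statistics and positions.
n_variants = 1_000
ds = xr.Dataset(
    {
        "stat": ("variants", np.random.rand(n_variants)),
        "variant_position": (
            "variants",
            np.sort(np.random.randint(0, 1_000_000, n_variants)),
        ),
    }
)

# Fixed-size 100 kb windows expressed as a per-variant group label.
window_size = 100_000
ds = ds.assign_coords(window=("variants", ds["variant_position"].values // window_size))

# Grouped reduction; with flox installed, xarray dispatches to it automatically.
windowed = ds["stat"].groupby("window").sum()
```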

@percyfal what windowing method are you using? It would be useful to have an example to target if it's possible to share the data (or even just the windows) easily.

percyfal (Author) commented Dec 6, 2024

I'm currently using window_by_position, as fixed-size windows are commonly used to show genome-wide distributions of various statistics. I'm targeting 50 kb and 100 kb windows. I'm running the analyses as we speak, and ramping up the node memory did work around the issue, but I could certainly share the windows - what data arrays in particular are you thinking of here?
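
Roughly along these lines - a sketch rather than the verbatim invocation, and the exact options are what I'd need to dig out:

```python
import sgkit as sg

# Sketch only; paths and keyword arguments used in the real runs may differ.
ds = sg.load_dataset("/path/to/store.zarr")
ds_50k = sg.window_by_position(ds, size=50_000)
ds_100k = sg.window_by_position(ds, size=100_000)
```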

tomwhite (Collaborator) commented Dec 6, 2024

It would be useful to have the precise call to window_by_position (there are lots of options), and the variant_contig and variant_position data, if possible.
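
Exporting just those two arrays to a small standalone store would be enough - something like this, with placeholder paths:

```python
import sgkit as sg

# Placeholder paths; this writes only the arrays needed to reconstruct the windows.
ds = sg.load_dataset("/path/to/store.zarr")
ds[["variant_contig", "variant_position"]].to_zarr("variant_positions.zarr")
```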

percyfal (Author) commented Dec 6, 2024

Ok, I'll compile the data and let you know where you can get hold of it.
