Diversity calculations on windowed datasets generate lots of unmanaged memory #1278

Open · percyfal opened this issue Dec 5, 2024 · 8 comments

percyfal commented Dec 5, 2024

When working on a reasonably large dataset (a 7 TiB Zarr store), I noticed that diversity calculations, in particular windowed ones, generate a lot of unmanaged memory. The call_genotype array is 280 GiB stored, 7.2 TiB uncompressed, covering some 3 billion sites and 1,000 samples. At the end of a run there is almost 300 GiB of unmanaged memory, which rules out the 256 GB and possibly even the 512 GB memory nodes (I've been testing a dask LocalCluster). Maybe this is more an issue with dask, but I thought I'd post it in case there is something that can be done to free up memory in the underlying implementation.
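
For context, a minimal sketch of the kind of pipeline involved - the store path, window size and the single all-samples cohort below are placeholders, not the actual run:

```python
import numpy as np
import sgkit as sg
import xarray as xr

# Sketch only; assumes a dask.distributed Client / LocalCluster is already running.
ds = sg.load_dataset("/path/to/store.zarr")  # placeholder path

# Popgen statistics in sgkit are computed per cohort; here all samples are
# placed in a single cohort for illustration.
ds["sample_cohort"] = xr.DataArray(
    np.zeros(ds.sizes["samples"], dtype=int), dims="samples"
)

# Fixed-size windows along the genome, then windowed nucleotide diversity.
ds = sg.window_by_position(ds, size=100_000)
ds = sg.diversity(ds)

# This triggers the dask computation; it is at this point that unmanaged
# worker memory accumulates.
stat = ds["stat_diversity"].compute()
```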

jeromekelleher (Collaborator) commented:

Very interesting, thanks @percyfal! Can you give more details about how you got these numbers please?

percyfal (Author) commented Dec 5, 2024

Basically by monitoring the dask dashboard. I will be running lots of these calculations in the coming days, so I'll make sure to include a screenshot. On the low-memory nodes (256 GB RAM), the worker memory bars very quickly fill up with light blue (unmanaged memory) and orange (spilled to disk).
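
For reference, this is the kind of setup being monitored - worker counts and memory limits here are placeholders, not the actual node configuration:

```python
from dask.distributed import Client, LocalCluster

# Placeholder worker counts and limits; the real nodes have 256 GB or 512 GB of RAM.
cluster = LocalCluster(n_workers=8, threads_per_worker=4, memory_limit="32GB")
client = Client(cluster)

# The dashboard shows per-worker memory bars: light blue is unmanaged memory,
# orange is memory spilled to disk.
print(client.dashboard_link)
```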

percyfal (Author) commented Dec 5, 2024

Unfortunately the task graph is too large to display so I couldn't easily tell what the scheduler is holding on to (@benjeffery suggested I look at this).

jeromekelleher (Collaborator) commented:

I see, this is coming from Dask. Unfortunately our experience has been that it's quite hard to keep a handle on Dask's memory usage - this is one of the key motivations for cubed, which we plan to support in sgkit (#908).

@tomwhite - are the popgen calculations something that will run on Cubed currently?

tomwhite (Collaborator) commented Dec 6, 2024

> @tomwhite - are the popgen calculations something that will run on Cubed currently?

Not yet unfortunately, since the windowing introduces variable-sized chunks, which Cubed can't currently handle.

Having said that, I'd like to look at getting popgen methods working on Cubed in the new year. I'm wondering if the xarray groupby work (using flox) would work here, since that is something that Cubed does support (see cubed-dev/cubed#476).
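
As a rough illustration of the shape of that computation (toy data, and not how sgkit implements windowing today): assign each variant a window id from its position and reduce with xarray's groupby, which dispatches to flox when it is installed and which maps onto operations Cubed supports.

```python
import numpy as np
import xarray as xr

# Toy data standing in for per-variant statistics and positions.
n_variants = 1_000
ds = xr.Dataset(
    {
        "stat": ("variants", np.random.rand(n_variants)),
        "variant_position": (
            "variants",
            np.sort(np.random.randint(0, 1_000_000, n_variants)),
        ),
    }
)

# Fixed-size 100 kb windows expressed as a per-variant group label.
window_size = 100_000
ds = ds.assign_coords(window=("variants", ds["variant_position"].values // window_size))

# Grouped reduction; with flox installed, xarray dispatches to it automatically.
windowed = ds["stat"].groupby("window").sum()
```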

@percyfal what windowing method are you using? It would be useful to have an example to target if it's possible to share the data (or even just the windows) easily.

percyfal (Author) commented Dec 6, 2024

I'm currently using window_by_position, as fixed-size windows are commonly used to show genome-wide distributions of various statistics. I'm targeting 50 kb and 100 kb windows. I'm running the analyses as we speak, and ramping up the node memory did work around the issue, but I could certainly share the windows - what data arrays in particular are you thinking of here?
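
Roughly along these lines - a sketch rather than the verbatim invocation, and the exact options are what I'd need to dig out:

```python
import sgkit as sg

# Sketch only; paths and keyword arguments used in the real runs may differ.
ds = sg.load_dataset("/path/to/store.zarr")
ds_50k = sg.window_by_position(ds, size=50_000)
ds_100k = sg.window_by_position(ds, size=100_000)
```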

tomwhite (Collaborator) commented Dec 6, 2024

It would be useful to have the precise call to window_by_position (there are lots of options), and the variant_contig and variant_position data, if possible.
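
Exporting just those two arrays to a small standalone store would be enough - something like this, with placeholder paths:

```python
import sgkit as sg

# Placeholder paths; this writes only the arrays needed to reconstruct the windows.
ds = sg.load_dataset("/path/to/store.zarr")
ds[["variant_contig", "variant_position"]].to_zarr("variant_positions.zarr")
```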

percyfal (Author) commented Dec 6, 2024

Ok, I'll compile the data and let you know where you can get hold of it.
