Memory leak when getting/setting dataset distributed over last axis #165
I'd like to revive this issue, because Haochen and I have recently run into something that seems related, and it would be nice to resolve it. Here is a config that can be used to reproduce the issue on cedar:
You can pick your favourite LSD, but I've mostly been working with 3245. With
The crash occurs at this line (Line 1143 in 0beefd6):
I've attempted some dumb debugging by adding log statements as shown in this git diff:
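(The diff itself isn't reproduced here. As a rough, hedged illustration only, with hypothetical function and variable names, the instrumentation was along these lines: a debug line on each rank before and after its collective write.)

```python
import logging

from mpi4py import MPI

logger = logging.getLogger(__name__)
comm = MPI.COMM_WORLD


def write_selection(dset, sel, data):
    """Hypothetical helper: log around a single rank's collective write."""
    logger.debug(
        "rank=%d/%d: writing %.3f GB to %s with selection %s",
        comm.rank, comm.size, data.nbytes / 2**30, dset.name, sel,
    )
    dset[sel] = data
    logger.debug("rank=%d/%d: write to %s completed", comm.rank, comm.size, dset.name)
```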
These statements reveal that there is no MPI task that completes its write successfully. Here's what I get from task 0, right before the crash:
In this case, if we use 2 nodes (24 tasks), the analogous output instead looks like:
Here, the output is partitioned into 2 slices along the "1" axis (i.e. the pol axis, which has 4 elements). The slice being written is 1.375GB here as well, but now the write operation completes successfully. I've found that restricting the slice size to 1GB, using
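For reference, this is the kind of splitting being described: the write is divided along one axis so that each collective write stays below a size threshold. The function below is only a sketch of that logic; the names and the exact limits are assumptions, not caput's actual implementation.

```python
import numpy as np


def num_write_splits(local_shape, dtype, split_axis, limit_bytes=2**30):
    """Sketch: how many slices along `split_axis` keep each write under `limit_bytes`."""
    itemsize = np.dtype(dtype).itemsize
    other = int(np.prod([n for i, n in enumerate(local_shape) if i != split_axis]))
    total_bytes = other * local_shape[split_axis] * itemsize
    return int(np.ceil(total_bytes / limit_bytes))


# e.g. a (1, 4, 1024, 4096, 12) float64 block is ~1.5 GiB in total, so a
# 1 GiB limit would split the write into 2 slices along the chosen axis.
print(num_write_splits((1, 4, 1024, 4096, 12), np.float64, split_axis=1))  # -> 2
```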
I think that the slice into
One thing I noticed is that the fixes mentioned by both @jrs65 and @sjforeman involve reducing the number of processes, which would increase the size of the local array on each process. In this case, by reducing the number of processes,
This seems relevant, although I can't access the JIRA page linked to see if the issue was ever resolved in hdf5.
Thanks @ljgray! I'm experimenting with possible fixes related to the slicing issues you've described.
I should note that our situation isn't exactly the same as the one in the h5py issue, as they're trying to take non-consecutive elements, but it's always possible that we're seeing the effects of that issue in another way.
I think that issue is a red herring as it only applies to chunked data (which this is not). That said, I can't really see any obvious issues with what's going on. Can you try forcing collective IO to be off? That is an obvious contender as I think the h5py implementation has some issues (e.g. the eliding of zero length writes, which we already need to work around).
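For anyone following along, here is a minimal illustration (not caput's code) of the difference between collective and independent writes with parallel h5py; it assumes an h5py build with MPI support.

```python
import h5py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_local = 4096
data = np.ones(n_local, dtype=np.float64)

with h5py.File("collective_demo.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("testdset", (comm.size * n_local,), dtype=np.float64)
    sel = slice(comm.rank * n_local, (comm.rank + 1) * n_local)

    # Collective write: every rank must reach this call.
    with dset.collective:
        dset[sel] = data

    # Independent write (collective IO "off") would simply be:
    # dset[sel] = data
```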
I tried forcing collective IO off and the job just timed out after 90 minutes. Interestingly, a few ranks did complete their write quite quickly but most never did.
Can you go one step further and try forcing
Github really needs to add the 🤔 emoji. That's pretty much the reaction I want to add to every issue comment.
In the case where we're just slicing a length-1 axis with
Actually that doesn't make any sense. If it can't use MPI-IO it needs to use
Sorry, bad suggestion!
Good test. I think broadly it would be good to try and reduce this to a simpler test case. Can you get it down to something that only needs 4 MPI processes and where the dataset is smaller and with fewer dimensions?
Yeah, it would be beneficial to be able to iterate more quickly, but it might be tricky to simplify when we don't really know exactly what the issue is. @sjforeman, just note that using a slice(None) will fail in
Also, one other test I ran last night applied the slice to a different axis than zero, with the same result, so it doesn't seem to be anything unique to that axis.
You bet, I'll see if I can come up with a smaller test case that fails in the same way
Yeah, I also tried something this morning that failed in
It may take a while to complete, but I'm running a test to brute force the slice into being a slice(None) and then just trying to slice into the dataset without setting anything. This has just been hanging for ~15 minutes but not crashing. This is the same behaviour that I saw when running without collective IO. Let me know if you come up with anything different.
I have not been able to make it fail at a smaller size (as of yet). I'm giving up for now, but in case anyone else finds it useful, this is what I was doing:

```python
import logging

import numpy as np
from mpi4py import MPI

from caput import mpiarray

logging.basicConfig(level=logging.DEBUG)

comm = MPI.COMM_WORLD

# mul=1 should correspond to the crashing case, mul=2 the one that works
mul = 1

shape = (1, 4, 1024, 4096, 11 * mul * comm.size + 1)
print(f"rank={comm.rank}/{comm.size}: {shape=}")

d = mpiarray.MPIArray(shape, dtype=np.float64, axis=4)
print(f"rank={comm.rank}/{comm.size}: {d.local_shape=}")

print(f"rank={comm.rank}/{comm.size}: filling")
d[:] = 1.0

print(f"rank={comm.rank}/{comm.size}: writing")
d.to_hdf5("testfile.h5", "testdset", create=True)

comm.Barrier()
print(f"rank={comm.rank}/{comm.size}: done")
```

This didn't crash for me with either 2 or 6 processes (all on the same node). I think going to 24 should correspond pretty much exactly to what Simon was doing.
A quick update - forcing slice(None) on the split axis did eventually let me index into the dset, but it took ~25 minutes to do so on all ranks. It then crashed when trying to actually set the new dset data. We have access to more memory when running on 4 nodes vs 2, so it makes me think that there's some sort of memory leak going on.
@sjforeman is it crashing on the first dataset in the container that it's trying to write out? I've had issues before where the issue happens at an earlier dataset, and then only surfaces later on, e.g. the writes get out of sync somehow. You might consider chucking in a Barrier at the start and end of
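i.e. something along these lines (purely illustrative; the function name is a placeholder for whatever does the per-dataset write):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD


def write_dataset_synced(write_one_dataset):
    """Sketch: keep ranks in lockstep around each dataset's collective write."""
    comm.Barrier()       # all ranks have finished the previous dataset
    write_one_dataset()  # the collective HDF5 write for this dataset
    comm.Barrier()       # no rank moves on until every write has returned
```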
Interesting!
I didn't actually try putting a barrier at the start, but putting one at the end of
I also found that manually using
@jrs65 It does appear to be crashing on the first dataset it attempts to write out.
It might be helpful to ssh into the node while it's running and use
Reminding myself how it all works: MPI-IO will indeed aggregate the data to write onto a smaller set of nodes, but the buffers are all 32 MB max size, so I don't quite see how it would eat that much memory.
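For context, the collective-buffering size is a ROMIO hint, and generic mpi4py can be used to inspect or override it independently of anything caput or h5py does. A sketch, assuming a ROMIO-based MPI-IO implementation:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Ask for a larger collective buffer than the (typically tens-of-MB) default.
info = MPI.Info.Create()
info.Set("cb_buffer_size", str(64 * 1024 * 1024))

fh = MPI.File.Open(comm, "hints_test.dat", MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
if comm.rank == 0:
    used = fh.Get_info()  # the hints the implementation actually applied
    for i in range(used.Get_nkeys()):
        key = used.Get_nthkey(i)
        print(key, used.Get(key))
fh.Close()
```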
I'm watching
I've been experimenting with chunking for the dataset where the crash occurs by changing
Indeed. Chunking is disabled for HDF5 output as parallel chunked IO is fundamentally broken in HDF5.
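For reference (again, not caput's code), chunking in h5py is a per-dataset creation option, so "disabling" it for parallel output just means creating contiguous datasets:

```python
import h5py
import numpy as np

with h5py.File("chunk_demo.h5", "w") as f:
    # Contiguous layout (what the parallel writer should produce):
    f.create_dataset("contiguous", shape=(4, 1024, 4096), dtype=np.float64)

    # Chunked layout, requested explicitly; per the discussion above, this is
    # what should be avoided when writing with collective MPI-IO:
    f.create_dataset("chunked", shape=(4, 1024, 4096), dtype=np.float64,
                     chunks=(1, 256, 1024))
```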
That should be true, though I'm struggling to see where we actually turn that on and off at the moment! I'm pretty sure it is there though.
We might just be relying on the fact that
Ok, I managed to make it work. Sort of. I restricted the
I don't really understand why so much memory is being used in the first place (just holding the data), because the ringmaps are quite small. I also don't understand why it would work fine with fewer nodes/processes. Reducing the tolerance as @sjforeman did earlier would result in the same split, which would explain why it worked.
@sjforeman How long did it take to run this when you used 2 nodes?
It took 51 minutes of walltime for the 2-node job to write the full ringmap to disk.
Interesting. The job timed out after 90 minutes on 4 nodes.
I had a look at the HDF5 issue that I linked before; it's still open, was reassigned to someone as of Jan. 2023, and affects versions starting at 1.10.4 (at least that's what's listed). I'm not entirely sure if that's the culprit here, as this could be something to do with mpi_io as well.
Hmm. Maybe it's just a temporary I/O slowdown? (I've noticed that scratch is a bit laggy for me over the past few hours.) In the meantime, here are some other data points to consider: if I only read the first 64 elements of the el axis of the ringmap (i.e. using
If you still have the job id available for those, can you check them using
Thanks for the tip! 48 tasks:
24 tasks:
12 tasks (timed out without crashing):
48 tasks, forcing
My guess is that the
One more test - in the
My tests are also successful if I manually redistribute along another axis before writing to disk. One could imagine the last axis causing issues because it is the fastest-varying axis, and therefore the least efficient axis to be using for random access...
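In code, the workaround looks roughly like this; a sketch assuming caput's MPIArray.redistribute API and the same toy shape as the reproduction script above:

```python
import numpy as np
from caput import mpiarray

# Distributed over the last (fastest-varying) axis, which is the problem case.
d = mpiarray.MPIArray((1, 4, 1024, 4096, 265), dtype=np.float64, axis=4)
d[:] = 1.0

# Move the distribution onto another axis before the parallel write.
d_redist = d.redistribute(axis=1)
d_redist.to_hdf5("testfile.h5", "testdset", create=True)
```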
I've tested distributing across all axes and it's only the last axis that causes an issue (I did not try distributing across the 0th axis because we don't use collective IO in that case anyway).
It would be the least efficient, but whatever is going on here probably goes beyond that, I would think.
This one is a little nebulous. I had originally thought that the issue was when the distributed axis is shorter than the number of processes, so that some of them have nothing to do. That may still be true, but some of the log messages seemed to indicate that it was actually doing the IO split along that axis?? Regardless, reducing the number of MPI processes being used seemed to fix the issue.
Anyway this clearly needs a little more debugging to figure out what exactly is happening, but we shouldn't crash when this happens.
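A hedged sketch of the kind of guard implied here (not caput's actual code): warn clearly, or raise, when the distributed axis has fewer elements than there are MPI processes, rather than letting the write crash later.

```python
import logging

from mpi4py import MPI

logger = logging.getLogger(__name__)


def check_distributed_axis(global_length, comm=MPI.COMM_WORLD):
    """Warn if some ranks will hold no data along the distributed axis."""
    if global_length < comm.size:
        logger.warning(
            "Distributed axis has %d elements but there are %d MPI processes; "
            "%d rank(s) will have nothing to write.",
            global_length, comm.size, comm.size - global_length,
        )
```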