LAMMPS on multiple GPUs #195
Multi-GPU LAMMPS is an untested beta feature; hopefully we will make progress on that soon. |
I was told that I probably didn't formulate my issue precisely enough. I want to run LAMMPS replica exchange simulations, which means LAMMPS creates multiple MD instances and runs them in parallel. So I want to stick to that approach. I would be very happy if you took another look at this. |
I've never run replica exchange in LAMMPS, so I'll need to dig into the internals to understand why it's breaking. Would you please send a complete (but simple) example, including a model? I'll use that to figure it out. Thanks |
The problem seems to be linked to how the model is sent to the different GPUs. If nothing is done, libtorch probably loads the model on the first GPU of the node. I think there is probably a line to add to distribute the models to the right GPUs. |
Yeah, I'm sure it's not terribly complicated. There is already this logic which assigns the model to the GPU associated with the local MPI rank. That's how the domain decomposition works.
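Roughly, it splits the ranks on each node into a shared-memory communicator and uses the node-local rank as the GPU index - a paraphrased sketch, not the exact source:

MPI_Comm local;
MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
int localrank;
MPI_Comm_rank(local, &localrank);
// one model instance per node-local rank, i.e. one GPU per MPI task on the node
device = c10::Device(torch::kCUDA, localrank);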
I'd still like a self-contained example. |
Thanks a lot |
Thanks for the example - I've looked at this some now. What happens if you try without Kokkos? So, adapting your input command:
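Presumably just your original submission command with the Kokkos flags dropped, i.e. something like (untested guess):

srun -n 4 lmp -partition 1 1 1 1 -l lammps.log -sc screen -i lammps.in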
|
It returns no error. However, it only uses a single GPU for all replicas (instances) instead of distributing them over the GPUs. |
Okay, thanks. Let's keep Kokkos off until we figure this out - it complicates things.

I confirmed yesterday that the distributed multi-GPU is working as intended, so the problem is specific to replica exchange somehow.

I couldn't manage to run your example myself; it kept failing with MPI errors right at the beginning. I think it was a problem with my install - can you confirm which packages are required for the replica and partitioning?

Also, do you know what happens if you try a CPU-only install and distribute over CPUs? Does that work as expected?
On Wed, 22 Nov 2023 at 04:01, Felix Riccius wrote:

> It returns no error. However it only uses a single GPU for all replicas (instances) instead of distributing it over the GPUs.
> --> without kokkos it gives me the same result compared to using kokkos with one gpu (-k on g 2 -sf kk)
|
I guess your install is missing the REPLICA package. Here is my current install:

module purge
module load gcc/10 mkl/2022.2 gsl/2.4 impi/2021.4 fftw-mpi/3.3.9
module load anaconda/3/2023.03
module load cuda/11.6 cudnn/8.8.1
module load cmake/3.18
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKLROOT/lib/intel64
source ${MKLROOT}/env/vars.sh intel64
cmake \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_INSTALL_PREFIX=$(pwd) \
-D BUILD_MPI=yes \
-D BUILD_OMP=yes \
-D BUILD_SHARED_LIBS=yes \
-D LAMMPS_EXCEPTIONS=yes \
-D PKG_KOKKOS=yes \
-D Kokkos_ARCH_AMPERE80=yes \
-D Kokkos_ARCH_AMDAVX=yes \
-D Kokkos_ENABLE_CUDA=yes \
-D Kokkos_ENABLE_OPENMP=yes \
-D Kokkos_ENABLE_DEBUG=no \
-D Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=no \
-D Kokkos_ENABLE_CUDA_UVM=no \
-D CMAKE_CXX_COMPILER=$(pwd)/../lib/kokkos/bin/nvcc_wrapper \
-D PKG_ML-MACE=yes \
-D PKG_MOLECULE=yes \
-D PKG_REPLICA=yes \
-D PKG_KSPACE=yes \
-D PKG_RIGID=yes \
-D CMAKE_PREFIX_PATH=$(pwd)/../../libtorch-gpu \
../cmake
make -j 12

I never installed the CPU-only version; I will update you as soon as I have tried that. |
I've now tried the CPU installation and everything seems to work fine: I tried it on one node and I managed to run it either way. |
Have you made any progress and managed to run the simulation on your installation? And do you have anything else I should try out? |
Thanks - I'm at a conference this week, so I haven't had much time, unfortunately. But there are some people here who may know the answer immediately - I'll track them down. |
Ok, sounds good. It would be important for me to get this fixed by the 15th of December. |
I will try, but can't promise. |
Do you know if you are using a CUDA-aware MPI? That's not required in general, but I wonder if it could make a difference here.
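If you're not sure, here is one way to probe it (a sketch that only covers Open MPI's extension API; Intel MPI, which your module list loads, exposes this differently, so treat it as an illustration rather than a definitive test):

// cuda_aware_check.cpp - hypothetical standalone probe for CUDA-aware MPI support
#include <cstdio>
#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>  // Open MPI extension header that defines MPIX_CUDA_AWARE_SUPPORT
#endif

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  std::printf("compile-time CUDA-aware support: yes\n");
  std::printf("run-time CUDA-aware support: %s\n", MPIX_Query_cuda_support() ? "yes" : "no");
#else
  std::printf("no MPIX CUDA-awareness macros found; check the MPI documentation instead\n");
#endif
  MPI_Finalize();
  return 0;
}

Compile it with mpicxx and run it under srun to see what your MPI reports. |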
@wcwitt how does this work? It looks to me like it'll split the MPI tasks on the current node into a "local" descriptor, and then use the rank within that local descriptor to pick GPUs. Does it implicitly assume that you're starting exactly one MPI rank per GPU? If so, this code is effectively assigning unique integers, 0..N-1, to each task on each node, and using those to assign GPU IDs.

I have a hypothesis - I believe that the multiple images are called multiple "world"s in LAMMPS. Seeing this code, I wonder if that's because each replica gets its own world communicator, so the node-local split only ever sees the ranks of one replica and every replica ends up picking the same GPU.

I'll poke around at the replica exchange code and see if I can find evidence for this. @wcwitt do you want me to try to play with the example, or is this enough of a clue that you want to investigate it yourself? |
OK. I found this in the replica exchange code:

// comm_replica = communicator between all proc 0s across replicas
int color = me;
MPI_Comm_split(universe->uworld,color,0,&comm_replica);

Looks like universe->uworld is the communicator spanning every rank of every replica, so that (rather than each replica's own world) is probably what the GPU-assignment split should be based on. |
I don't think I do. I'm gonna try this out; maybe it speeds things up even if it doesn't solve the issue. |
Thanks @bernstei, this is a huge help (especially if it works).
Yes, that's how I've done the domain decomposition so far. What you suggest sounds plausible; I was hoping it would be something simple like that. I'm in deadline mode, so won't have time until next week. If anyone else wants to try in the meantime, definitely feel free. |
I have some time, but on the other hand, I don't have LAMMPS compiled appropriately right now. If someone else wants to try it (@Felixrccs ?) I'm happy to discuss the necessary patch. Otherwise, I do in principle want to get LAMMPS+GPU MACE working, so I will set it up eventually, but possibly only on the same days-to-weeks timescale on which @wcwitt will get to it as well. |
Ok, I tried the suggested code changes out and it works perfectly. I tried it both with and without Kokkos (though with Kokkos it's about twice as fast). So far I've only tested simple systems (4 replicas distributed over 4 GPUs) and I get an equal distribution/workload over all the GPUs. Here are the changes I made:

if (!torch::cuda::is_available()) {
  std::cout << "CUDA unavailable, setting device type to torch::kCPU." << std::endl;
  device = c10::Device(torch::kCPU);
} else {
  std::cout << "CUDA found, setting device type to torch::kCUDA." << std::endl;
  // old per-replica rank, kept for reference:
  //int worldrank;
  //MPI_Comm_rank(world, &worldrank);
  // split the whole universe (all replicas) by shared-memory node instead
  MPI_Comm local;
  MPI_Comm_split_type(universe->uworld, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
  int localrank;
  MPI_Comm_rank(local, &localrank);
  device = c10::Device(torch::kCUDA, localrank);
}

I also had to add this line:

#include "universe.h"

I will put my changes into a pull request in the following days (after I've run some longer simulations to check that everything continues working as expected). @wcwitt @bernstei again, a thousand thanks for the help and patience in debugging this. |
Great, I'm glad it worked. I have to say I'm surprised I figured it out, because normally I find LAMMPS's internals to be pretty opaque. @wcwitt Would it be useful to test for a mismatch between the number of node-local ranks and the number of GPUs? |
Ok, sadly I ran into new problems trying to run 8 replicas on 4 GPUs. The error I get is the following:

Exception: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
.....

The full error is here. I tried to google it and it's apparently related to calling GPUs that do not exist. Any ideas? I have the feeling this is more difficult to solve than the issue we had before.
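Maybe a quick sanity check of what each rank actually sees would help - something like this (hypothetical command, assuming nvidia-smi is available on the compute nodes and Slurm sets SLURM_PROCID):

srun -n 8 bash -c 'echo "$(hostname) rank $SLURM_PROCID sees $(nvidia-smi -L | wc -l) GPUs"'

|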
I think that the way it's currently coded implicitly assumes that the number of MPI ranks per node is equal to the number of GPUs. Are you running it that way, or are you running with 8 MPI ranks, one per replica? I have an idea that might fix this, although we might need @wcwitt to confirm it'll work, but first I'd like to know exactly how your run is configured. |
Yes, I have to use one MPI rank per replica, so the command looks like this:

srun -n 8 lmp -partition 1 1 1 1 1 1 1 1 -l lammps.log -i lammps.temper > lammps.out |
OK. In that case, I have a suggestion, if it's possible for two MPI tasks to share a GPU. If that's not possible, you're just stuck - you'll have to run on more nodes so that there's one GPU per replica. If it is possible, this is proposed (untested) syntax for assigning the same GPU ID to more than one task. Whether that's sufficient, or something else needs to be done to allow two tasks to share a GPU, I just don't know. @wcwitt? Anyway, the proposed syntax is:

#include <cuda_runtime.h>
.
.
.
MPI_Comm_rank(local, &localrank);
int nDevices;
cudaGetDeviceCount(&nDevices);
device = c10::Device(torch::kCUDA, localrank % nDevices);

This is the sloppy version, without any error checking. Also, my understanding is that you need to link to the CUDA runtime library. [edited - you might need an explicit linker flag for that, I'm not sure.]
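If the link step does complain about cudaGetDeviceCount, a quick way to check whether the binary already picks up the CUDA runtime (hypothetical check, assuming a shared-library build so ldd shows the full dependency tree):

ldd ./lmp | grep -i cudart

With nvcc_wrapper as the compiler it may well get linked automatically. |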
It should be possible to assign a single GPU to multiple MPI tasks (this was the problem we had before: 8 replicas/MPI processes all computed on a single GPU). I will try to add your suggestion to the code and play around a bit. However, there is a high chance that this exceeds my C++ skills. |
In that case, I don't think anything complex should be needed - I'm now pretty certain my code will work as is. When you compile it, if you do |
It works :) The compilation worked out of the box, and everything seems to work as expected.
At the beginning of next week I'll try a 2 ns run on a bigger system as a final test. But I'm confident that this issue is solved now. |
Excellent. Thanks, both of you. @Felixrccs, you mentioned a PR - do feel free to go ahead with that. As part of it, or in parallel, I'll think through questions like this and other error-checking that will make things more robust.
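For instance, one possible shape for that check - just an untested sketch, which assumes the localrank variable and cuda_runtime.h include from the snippets above, with made-up error messages:

int nDevices = 0;
cudaError_t err = cudaGetDeviceCount(&nDevices);
// abort if the node exposes no GPUs at all
if (err != cudaSuccess || nDevices == 0)
  error->one(FLERR, "pair mace: no visible CUDA devices on this node");
// warn when ranks outnumber GPUs, then fall back to sharing devices
if (localrank >= nDevices)
  error->warning(FLERR, "pair mace: more node-local MPI ranks than GPUs; ranks will share devices");
device = c10::Device(torch::kCUDA, localrank % nDevices);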
Let's leave this issue open until the changes are merged. |
I can't see how this would fail (famous last words), since it should just leave the extra GPUs idle, but I definitely agree about deferring that. |
Wasn't thinking it would fail - more concerned about what else is missing. |
Here is the pull request: ACEsuit/lammps#1 - I hope I've done everything right. |
Everything works on one GPU; however, I would like to run my LAMMPS simulation over multiple GPUs.
My LAMMPS submission command for two GPUs:
srun -n 4 lmp -partition 1 1 1 1 -l lammps.log -sc screen -k on g 2 -sf kk -i lammps.in
If I run it on multiple GPUs, LAMMPS returns the following error: