
WIP: async GPU force/torque back-transfer, launch kernels earlier #2637

Open
wants to merge 1 commit into base: python

Conversation

RudolfWeeber
Contributor

This uses the thrust pinned memory allocator for the vectors receiving GPU forces/torques on the host. The back-transfer can then be asynchronous.
The host force/torque vectors have to remain global, because allocating pinned memory takes at least as long as the data transfer itself.
A Python-level exit handler makes sure the vectors are de-allocated before CUDA is unloaded (because the custom allocator calls cudaFree()).
The performance benefits apply mostly to dense systems and probably also to systems with long-range interactions.
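
For context, a minimal sketch of the technique described above (not the PR's actual code; the function and variable names are made up for illustration):

    #include <cstddef>
    #include <cuda_runtime.h>
    #include <thrust/host_vector.h>
    #include <thrust/system/cuda/experimental/pinned_allocator.h>

    // Host vector backed by page-locked (pinned) memory; only with a pinned
    // destination can cudaMemcpyAsync actually overlap with host-side work.
    template <class T>
    using PinnedHostVector = thrust::host_vector<
        T, thrust::system::cuda::experimental::pinned_allocator<T>>;

    void back_transfer_forces(const float *forces_device,
                              PinnedHostVector<float> &forces_host,
                              std::size_t n, cudaStream_t stream) {
      // Enqueue the device-to-host copy; the call returns immediately.
      cudaMemcpyAsync(forces_host.data(), forces_device, n * sizeof(float),
                      cudaMemcpyDeviceToHost, stream);
      // ... launch further kernels or do host-side work here ...
      cudaStreamSynchronize(stream); // only wait right before the forces are used
    }

The same applies to the torque buffer; the point is that the copy is enqueued early and the synchronization is deferred to where the values are actually needed.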

@fweik fweik self-assigned this Apr 2, 2019
@@ -336,26 +326,45 @@ void copy_part_data_to_gpu(ParticleRange particles) {
}
}

std::unique_ptr<PinnedVectorHost<float>> particle_forces_host{
Contributor

@fweik fweik Apr 2, 2019

I really don't like this construction. In a way this is worse than manual memory management, because it pretends to be RAII, but really isn't. As far as I can see there is no reason for these buffers to exist statically; couldn't you just create them, e.g., when the integration starts, and release them after? Depending on the cost of the allocation, it may even be feasible to create them in the force calculation.
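
A minimal sketch of what that suggestion could look like, reusing the PinnedHostVector alias from the sketch above (the names here are hypothetical, not from the PR):

    #include <cstddef>

    // Hypothetical RAII holder: the pinned buffers live only for the
    // duration of one integration call instead of being global statics.
    struct GpuForceBuffers {
      PinnedHostVector<float> forces;
      PinnedHostVector<float> torques;
      explicit GpuForceBuffers(std::size_t n_part)
          : forces(3 * n_part), torques(3 * n_part) {}
    }; // pinned memory is released here, while the CUDA context is still alive

    void integrate(std::size_t n_part, int n_steps) {
      GpuForceBuffers buffers(n_part); // one pinned allocation per integration call
      for (int i = 0; i < n_steps; ++i) {
        // ... force calculation writes into buffers.forces / buffers.torques ...
      }
    }

Whether this is acceptable depends on the allocation cost discussed below.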

#include "thrust/host_vector.h"
#include "thrust/system/cuda/experimental/pinned_allocator.h"
template <class T>
using PinnedVectorHost = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;
Contributor

This should be called PinnedHostVector to keep it closer to thrust's name.

@RudolfWeeber
Contributor Author

It is unfortunate that one cannot convince Python to actually call the destructor of the system class. Then this could be done differently. As it is, I didn't find a working alternative to the exit handler.

For 1k particles, the allocation takes approximately as long as an MD time step with LJ at 0.1 volume fraction (200 µs). It scales much less than linearly with the number of particles.

It would probably surprise users if 10x integrator.run(1) took 20x the time of integrator.run(10).

The innermost place one could move the vectors to is python_integrate(); then the vectors would stay allocated for systems with auto-update accumulators, at least.
AFAIK, there is no observable recorder (yet) which could store unprocessed observable results, though.

@fweik
Contributor

fweik commented Apr 2, 2019

Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.

@fweik
Contributor

fweik commented Apr 2, 2019

Also, you seem to have timed the allocation; could you please share your results and how you measured that? It's not clear to me why a single allocation (or two) would be so expensive; my understanding is that the pinned allocation is just a normal allocation followed by an mlock call. This should not be expensive (if you are not out of RAM).
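
One way such a measurement could be done (a sketch under the assumption that cudaMallocHost is a reasonable stand-in for what the thrust pinned allocator does; sizes are arbitrary):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
      cudaFree(nullptr); // create the CUDA context up front, so it is not timed below
      constexpr std::size_t n_bytes = 1000 * 3 * sizeof(float); // ~1k particles

      auto t0 = std::chrono::steady_clock::now();
      void *pinned = nullptr;
      cudaMallocHost(&pinned, n_bytes); // page-locked allocation
      auto t1 = std::chrono::steady_clock::now();
      void *pageable = std::malloc(n_bytes); // ordinary allocation for comparison
      auto t2 = std::chrono::steady_clock::now();

      auto us = [](auto a, auto b) {
        return static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(b - a).count());
      };
      std::printf("pinned:   %lld us\npageable: %lld us\n", us(t0, t1), us(t1, t2));

      std::free(pageable);
      cudaFreeHost(pinned);
    }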

@RudolfWeeber
Contributor Author

RudolfWeeber commented Apr 2, 2019 via email

@RudolfWeeber
Contributor Author

RudolfWeeber commented Apr 2, 2019 via email

@fweik
Contributor

fweik commented Apr 2, 2019

Again: Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.

@RudolfWeeber
Contributor Author

RudolfWeeber commented Apr 2, 2019 via email

@fweik
Contributor

fweik commented Apr 2, 2019

Wouldn't you have the same problem with free() from the shared object libc?

@RudolfWeeber
Contributor Author

It is not clear to me.
There is, as far as I can tell, no explicit call unloading the CUDA driver in Espresso.
Also, I could not find a global whose destruction would entail the unloading.
So I cannot really think of any explanation other than the unloading order of shared objects.
One odd thing is that libEspressoCore.so and libEspressoCuda.so are actually not linked against libcudart, but just against libcufft:

linux-vdso.so.1 (0x00007ffebe3fa000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000145d6b140000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000145d6af3c000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000145d6ad34000)
libcufft.so.9.1 => /usr/lib/x86_64-linux-gnu/libcufft.so.9.1 (0x0000145d63847000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000145d634be000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000145d63120000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000145d62f08000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000145d62b17000)
/lib64/ld-linux-x86-64.so.2 (0x0000145d6b89b000)
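
To make the suspected lifetime issue concrete (a hedged sketch, not code from the PR): if the pinned buffer is a global, its allocator calls cudaFreeHost() during static destruction, which can run after the CUDA runtime has already been torn down by the shared-object unload order.

    #include <thrust/host_vector.h>
    #include <thrust/system/cuda/experimental/pinned_allocator.h>

    template <class T>
    using PinnedHostVector = thrust::host_vector<
        T, thrust::system::cuda::experimental::pinned_allocator<T>>;

    // Global buffer: its destructor calls cudaFreeHost() via the allocator.
    // If the CUDA runtime is unloaded before this static is destroyed, that
    // call hits a dead runtime -- hence the exit handler that clears the
    // buffers while CUDA is still alive.
    static PinnedHostVector<float> particle_forces_host;

    int main() {
      particle_forces_host.resize(3000); // pinned allocation happens here
      // ... buffer is used for async device-to-host copies during the run ...
    } // hazard: static destruction order vs. shared-object unload order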

@fweik
Contributor

fweik commented Apr 3, 2019

It seems to me that I forgot to link against cudart:

(From /CMakeLists.txt)

        function(add_gpu_library)
          cuda_add_library(${ARGV})
          set_property(TARGET ${ARGV0} PROPERTY CUDA_SEPARABLE_COMPILATION ON)
          target_link_libraries(${ARGV0} PRIVATE ${CUDA_CUFFT_LIBRARIES})
        endfunction()

Could you please add it (like for the clang case) and check if this improves matters? (I probably thought that cuda_add_library does that automatically...)

@RudolfWeeber
Contributor Author

So libcudart was linked statically already. Linking it dynamically did not help.

It is possible to avoid the exit handler by storing the vectors containing the pinned memory in a wrapper class which explicitly creates a CUDA context and releases it after destructing the vectors.

This solution still needs work, e.g., for switching GPUs. It also needs to be moved to/integrated with the CUDA init and device-switching code.
I'll do that if we go with this solution.
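
A rough sketch of such a wrapper, reusing the PinnedHostVector alias from above (hedged; the names and the cudaFree(nullptr)/cudaDeviceReset calls are illustrative choices, not necessarily the PR's actual design):

    #include <cuda_runtime.h>
    #include <memory>

    // Ties the lifetime of the pinned buffers to an explicitly created CUDA
    // context, so the buffers are always freed while the context still exists.
    struct PinnedBufferHolder {
      PinnedBufferHolder() {
        cudaFree(nullptr); // force lazy creation of the primary context
        forces = std::make_unique<PinnedHostVector<float>>();
        torques = std::make_unique<PinnedHostVector<float>>();
      }
      ~PinnedBufferHolder() {
        forces.reset();    // cudaFreeHost() runs here, context still valid
        torques.reset();
        cudaDeviceReset(); // release the primary context only afterwards
      }
      std::unique_ptr<PinnedHostVector<float>> forces, torques;
    };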

@fweik
Contributor

fweik commented Apr 12, 2019

Ok, I think we can go forward with this. In general, I think we should seek to remove all the non-trivial globals sooner rather than later. We've had multiple issues with them.

@fweik
Contributor

fweik commented Apr 27, 2019

@RudolfWeeber it seems like there is a file missing here (where the feature detection in Python went), could you please add that? I am currently working on the async forward communication and want to integrate this, but it is not working.

@fweik
Contributor

fweik commented Jul 3, 2019

@RudolfWeeber are you still looking into this?

@RudolfWeeber
Contributor Author

RudolfWeeber commented Jul 3, 2019 via email

@fweik
Contributor

fweik commented Jul 3, 2019 via email

@RudolfWeeber
Contributor Author

RudolfWeeber commented Jul 3, 2019 via email

@fweik
Contributor

fweik commented Jul 3, 2019

Ah, maybe there is just a guard missing. I'll have a look.

@KaiSzuttor KaiSzuttor added this to the Espresso 4.1 milestone Jul 15, 2019
@RudolfWeeber RudolfWeeber modified the milestones: Espresso 4.1, Espresso 5 Sep 3, 2019
@KaiSzuttor
Member

@fweik is this obsolete by now?

@fweik
Contributor

fweik commented Mar 3, 2020

No. Let's keep this open for now.

@fweik fweik changed the title async GPU force/torque back-transfer, launch kernels earlier WIP: async GPU force/torque back-transfer, launch kernels earlier Mar 3, 2020
@fweik fweik removed their assignment Dec 9, 2020
@jngrad jngrad removed this from the Espresso 5 milestone May 26, 2023