CUDA updates for non-contiguous datatypes #5

bosilca · 2015-12-01T20:53:46Z

No description provided.

issues, when 2 peers were doing a send/recv or when multiple senders were targeting the same receiver. Rolf provided a patch to solve this issue by moving the IPC communication index from a global location onto each endpoint.

1. The free code did not work correctly because we were computing the amount freed after merging the list.
2. We need to store the original malloc'ed GPU buffer in a separate place, because the one in the convertor changes over time.
Conflicts: opal/datatype/cuda/opal_datatype_cuda.cu opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu
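The first fix above can be illustrated with a minimal sketch. This is not the code from the PR: `block_t`, `pool_t`, and `pool_release` are hypothetical names, and the free list is assumed to be a singly linked list sorted by address. The point it shows is that the size of the released block has to be read before coalescing, since merging rewrites the block sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list node describing a free region [addr, addr + size). */
typedef struct block {
    size_t addr;
    size_t size;
    struct block *next;
} block_t;

/* Hypothetical pool: free list kept sorted by address, plus a byte counter. */
typedef struct {
    block_t *free_list;
    size_t bytes_free;
} pool_t;

/* Return blk to the pool. Nodes absorbed by a merge are only unlinked;
 * the caller still owns their storage. */
static void pool_release(pool_t *pool, block_t *blk)
{
    assert(pool != NULL && blk != NULL);

    /* Capture the freed amount first: merging changes blk->size. */
    size_t freed = blk->size;

    /* Insert in address order, remembering the predecessor. */
    block_t **link = &pool->free_list;
    block_t *prev = NULL;
    while (*link != NULL && (*link)->addr < blk->addr) {
        prev = *link;
        link = &(*link)->next;
    }
    blk->next = *link;
    *link = blk;

    /* Merge with the following block when the two are adjacent. */
    if (blk->next != NULL && blk->addr + blk->size == blk->next->addr) {
        blk->size += blk->next->size;
        blk->next = blk->next->next;
    }
    /* Merge into the preceding block when the two are adjacent. */
    if (prev != NULL && prev->addr + prev->size == blk->addr) {
        prev->size += blk->size;
        prev->next = blk->next;
    }

    pool->bytes_free += freed;
}
```

Had `freed` been read after the merge steps, releasing a block that sits between two free neighbours would have counted the neighbours' bytes a second time.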

functions Conflicts: opal/datatype/cuda/opal_datatype_cuda.cu opal/datatype/cuda/opal_datatype_cuda_internal.cuh opal/datatype/cuda/opal_datatype_pack_cuda_kernel.cu opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu opal/datatype/cuda/opal_datatype_unpack_cuda_kernel.cu opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/datatype/opal_datatype_gpu.c

checkpoint, use raw_cached, but cuda iov caching is not enabled
checkpoint, split iov into two versions, non-cached and cached
checkpoint, iov cache
another checkpoint
checkpoint, cuda iov is cached, but not used for pack/unpack
checkpoint, ready to use cached cuda iov
checkpoint, cached cuda iov is working with multiple send, but not for count > 1
checkpoint, fix a bug for partial unpack
checkpoint, fix unpack size

rolfv and others added 30 commits October 27, 2015 17:17

Add GPU packing and unpacking

d909529

add cuda stream for submitting multiple kernels. add support for predefined datatypes. Conflicts: opal/datatype/opal_datatype_unpack.c test/datatype/ddt_test.c

indexed datatype new, bonus stask support.

e3463fa

Add support for iovec and for pipeline iovec. a new way to compute nb_block and thread_per_block Conflicts: test/datatype/Makefile.am

RDMA send is now working.

cf44223

Conflicts: test/datatype/Makefile.am

Add support for vector datatype. Add pipeline.

c6a00d7

Improve the GPU memory management. Conflicts: opal/mca/mpool/gpusm/mpool_gpusm.h opal/mca/mpool/gpusm/mpool_gpusm_module.c

fix gpu memory and vector datatype

c10d3f4

unrestricted GPU. Instead of forcing everything to go on device 0, we now use the devices already opened.

0fda4df

Generate the Makefile. It will now be placed in the bindir and will be populated with all the known information. Beware: one still has to manually set the CUDA lib and path as they are not available after configure (unlike the include which is). Conflicts: opal/datatype/cuda/Makefile

6fda036

This file was certainly not supposed to be here. There is NO valid reason to have a copy of a locally generated file in the source.

742992a

Add the capability to install the generated library and other minor cleanups.

9c63b09

Open the datatype CUDA library from a default install location.

a681551

Various other minor cleanups.

clean up code in pack and unpack

bdfe31b

Conflicts: ompi/mca/pml/ob1/pml_ob1_cuda.c opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

big changes, now pack is driven by receiver by active message

a670db4

intel test working

42ad920

Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/mca/btl/smcuda/btl_smcuda.c

fix a bug when buffer is not big enough for whole ddt

bab3559

Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

if the data is on a different GPU, instead of copying directly from one to the other, we do a D2D copy. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu test/datatype/Makefile.am

29c90a0

now we can use cudaMemcpy2D

44a1550

Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

enable zero copy + fix GPU buffer bug

a67c842

Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

put pipeline size into mca

7bd8151

Upon datatype commit, create a list of iovec representing a single iteration of the datatype based on a NULL pointer. This list will then contain the displacement and the length of each fragment of the datatype memory layout and can be used for any packing/unpacking purpose.

9d10357
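A minimal sketch of the idea in this commit message, under stated assumptions: the layout is a simple vector-like pattern, and `describe_vector` and `pack_with_iov` are hypothetical helpers, not functions from the PR. Because the list is built against a NULL base, each `iov_base` holds a displacement rather than a usable pointer, so the same list can be reused with any buffer address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>   /* struct iovec (POSIX) */

/* Hypothetical helper: describe one iteration of a vector-like layout
 * (count blocks of blocklen bytes, placed stride bytes apart) relative
 * to a NULL base. iov_base stores a displacement, not a real pointer. */
static size_t describe_vector(struct iovec *iov, size_t count,
                              size_t blocklen, size_t stride)
{
    assert(iov != NULL && blocklen <= stride);
    for (size_t i = 0; i < count; i++) {
        iov[i].iov_base = (void *)(uintptr_t)(i * stride);
        iov[i].iov_len = blocklen;
    }
    return count;
}

/* Hypothetical pack routine: add the real base address to each cached
 * displacement and copy the fragments into a contiguous buffer.
 * Returns the number of bytes packed. */
static size_t pack_with_iov(const struct iovec *iov, size_t n,
                            const char *base, char *packed)
{
    size_t done = 0;
    for (size_t i = 0; i < n; i++) {
        memcpy(packed + done, base + (uintptr_t)iov[i].iov_base,
               iov[i].iov_len);
        done += iov[i].iov_len;
    }
    return done;
}
```

Unpacking would walk the same list in the opposite direction, copying from the contiguous buffer to `base` plus each displacement.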

contiguous vs non-contiguous is working

756b2af

Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/datatype/opal_datatype_unpack.c

Fix pipeline bug

3a6bdd9

now we are able to pack directly to remote buffer if receiver is contiguous

f86c81e

add ddt_benchmark

6ae39b2

modify for matrix transpose

25ead9b

enable vector

5e14fdd

receiver now will send msg back to sender for buffer reuse

d03c601

Conflicts: opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu

fix zerocopy

c377c36

eddy16112 added 25 commits February 26, 2016 14:37

if cuda_iov is not big enough, use realloc. However, cudaMallocHost does not work with realloc, so use malloc instead

84f7abb
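Memory returned by cudaMallocHost is page-locked and has to be released with cudaFreeHost, so it cannot be handed to realloc. A rough sketch of the malloc-based alternative the commit describes, using plain host memory; `entry_t`, `entry_array_t`, and `entry_array_reserve` are hypothetical names, and the doubling growth policy is an assumption rather than something the PR states.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical entry type standing in for one cuda_iov element. */
typedef struct {
    size_t disp;
    size_t len;
} entry_t;

/* Hypothetical growable array backed by malloc/realloc memory. */
typedef struct {
    entry_t *items;
    size_t capacity;
} entry_array_t;

/* Ensure room for at least `needed` entries.
 * Returns 0 on success, -1 on allocation failure (array left unchanged). */
static int entry_array_reserve(entry_array_t *arr, size_t needed)
{
    assert(arr != NULL);
    if (needed <= arr->capacity) {
        return 0;
    }
    size_t new_cap = arr->capacity ? arr->capacity : 16;
    while (new_cap < needed) {
        new_cap *= 2;
    }
    /* realloc(NULL, n) behaves like malloc(n), so the first call allocates. */
    entry_t *grown = realloc(arr->items, new_cap * sizeof(*grown));
    if (grown == NULL) {
        return -1;
    }
    arr->items = grown;
    arr->capacity = new_cap;
    return 0;
}
```

The trade-off is that the array is no longer pinned, so asynchronous copies of it to the device may behave differently than with page-locked memory.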

make sure the pointer is not NULL before freeing it

65424d0

checkpoint, rewrite non-cached version

5d316d9

fix for non cached iov

02c8b7f

fix the non cached iov, set position should be put at first

bb807fc

move ddt iov to cuda iov into a function

842cc3f

merge iov cached and non-cached

6df01a5

for non cached iov, if there is not enough cuda iov space, break

da23f82

cache the entire cuda iov

7b26aaa

checkpoint, during unpack, cache the entire iov before unpack
another checkpoint
checkpoint, remove unnecessary cuda stream sync
use bit to replace %
rollback to use %, not bit, since it is faster, not sure why
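The commit does not say what the modulo was applied to. Assuming it wrapped an index into a range whose size is a power of two, the two variants being compared would look roughly like the sketch below; `wrap_mod` and `wrap_mask` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap an index with the modulo operator; works for any non-zero n. */
static uint32_t wrap_mod(uint32_t i, uint32_t n)
{
    assert(n != 0);
    return i % n;
}

/* Wrap an index with a bit mask; only valid when n is a power of two. */
static uint32_t wrap_mask(uint32_t i, uint32_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    return i & (n - 1);
}
```

For unsigned operands and a divisor known at compile time to be a power of two, compilers commonly emit the mask form for `%` anyway, which may be one reason the hand-written mask showed no gain; that is a guess, not something the commit confirms.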

now cuda iov is {nc_disp, c_disp}

6af6658

clean up kernel, put variables used multiple times into registers

63e148e

cached cuda iov is working for count > 1

c75393f

another checkpoint now convertor->count > 1 is working

move the cuda iov caching into a separate function

11d4a5b

these two variables are useless now

1e29fc0

fix a bug for ib, current count of convertor should be set in set_cuda_iov_position

1bac78c

cleanup, move cudamalloc into cache cuda iov

686c90e

rearrange variables

85dad6c

if cuda_iov is not big enough, use realloc. However, cudaMallocHost does not work with realloc, so use malloc instead

4c6c0e4

make sure the pointer is not NULL before freeing it

2120edd

apply loop unroll on packing kernels

eb143dc

apply unroll to unpack

b45b646

fix a cuda event bug. cudaStreamWaitEvent is not a blocking call.

4037554

fix cuda stream

new vector kernel

e6c765e

eddy16112 force-pushed the cuda branch from b6d56eb to e6c765e Compare February 26, 2016 23:03

eddy16112 added 3 commits February 26, 2016 15:06

Merge branch 'cuda' of https://github.com/eddy16112/ompi into cuda

e981580

fix an if CUDA_41 error

2b0048f

clean up an if

d22e54a

eddy16112 force-pushed the master branch from ad0d5f1 to 5ced037 Compare August 8, 2016 18:48
