CUDA pipeline for computing APR #185

Open · wants to merge 65 commits into base: master

Commits (65)
e6aa9c9
Bspline filters fixed for CUDA pipeline
krzysg Aug 1, 2022
b563da4
Debug messages turned off
krzysg Aug 1, 2022
3db510f
Fixed Inv Bspline in X direction (CUDA pipeline)
krzysg Aug 1, 2022
18fce44
Inverse Bspline pipeline for CUDA fixed
krzysg Aug 2, 2022
ad5f194
Downsample and downsample gradient corrected to match GPU
krzysg Aug 3, 2022
557eff3
GPU pipeline fixes - Full Gradient test is working now
krzysg Aug 9, 2022
57765a7
Merge branch 'develop' into cuda
krzysg Aug 9, 2022
3da13ba
Merge branch 'develop' into cuda
krzysg Aug 9, 2022
d958161
GPU and CPU give same results in Release mode - turned off unsafe opt…
krzysg Aug 10, 2022
4ace238
Quick fix of processOnGpu() - now it gets correct bspline data for ea…
krzysg Aug 10, 2022
b050e07
Added new test file for LIS CUDA, GPU now handles boundary (without p…
krzysg Nov 14, 2022
570ab20
Local Intensity Scale (LIS) now works in X-dir as expected. GPU and C…
krzysg Jan 31, 2023
17e5d8e
Local Intensity Scale (LIS) now works in Z-dir as expected. GPU and C…
krzysg Feb 1, 2023
5ad9865
Updated compareMeshes to show maximum error found
krzysg Feb 17, 2023
af1c3ac
LIS in X-dir redesigned so code is clearer and faster. Also new test …
krzysg Feb 17, 2023
521d826
LIS in Z-dir redesigned so code is clearer and faster. Also new test …
krzysg Feb 24, 2023
b297adf
Local Intensity Scale (LIS) now works in Y-dir as expected. GPU and C…
krzysg Mar 13, 2023
2cdf3fe
Whole LIS pipeline is matching exactly CPU implementation + tests upd…
krzysg Mar 16, 2023
e093c01
Quick fix of linking error
krzysg Mar 16, 2023
053380d
maximum error diff. GPU vs CPU for compute gradient set to 0
krzysg Mar 16, 2023
97cf75e
rescaleAndThreshold is now only rescaling (to reflect changes in CPU …
krzysg Mar 17, 2023
83c2a31
rescaleAndThreshold is now only rescaling (to reflect changes in CPU …
krzysg Mar 17, 2023
5b5a719
constant_intensity_scale handling in LIS added for GPU
krzysg Mar 17, 2023
5d0375a
Removed unused threshold functions
krzysg Mar 20, 2023
53ef94b
FullPipeline test moved to new file
krzysg Mar 20, 2023
ac2c22e
PixelDataDim updated with maximum dimension length and number of dimen…
krzysg Mar 20, 2023
122a96a
GradLisLevels test working now
krzysg Mar 20, 2023
6a5db35
full pipeline tests fixed
krzysg Mar 24, 2023
4088e9d
Changes from old branches added + modified to GenInfo instead of APRA…
krzysg Jul 20, 2023
b8f2504
Added debug printout to GenInfo
krzysg Jul 21, 2023
6400a9a
Moved old CUDA tests to new file
krzysg Aug 11, 2023
4b35b8e
Moved old CUDA tests to new file
krzysg Aug 11, 2023
1ed5d4f
Added CUDA_ARCHITECTURES set to OFF (keep current behaviour) to suppr…
krzysg Oct 30, 2023
93ac120
Temporary test updated to print particles using LinearAccess iterator
krzysg Nov 8, 2023
09bf86a
Merge branch 'master' into cuda
krzysg Nov 8, 2023
b7ae1cb
Merge branch 'master' into cuda
krzysg Nov 10, 2023
6181da6
Merge branch 'master' into cuda
krzysg Nov 10, 2023
ed09686
Merge branch 'master' into cuda
krzysg Nov 13, 2023
70543d2
TODO about some problems with edge case
krzysg Nov 30, 2023
dd3d448
Fixed test where out of range idx was given
krzysg Dec 6, 2023
1a112ec
Pulling Scheme tests (and OVPC on CPU) finished.
krzysg Dec 13, 2023
64ca641
Fixes for tests
krzysg Dec 14, 2023
9f31bfd
Fixed OVPC - clamping values of input levels is necessary
krzysg Jan 9, 2024
2707207
Updated OVPC (PS) for CUDA - now it gives correct and same results as…
krzysg Feb 5, 2024
3cb4529
PullingSchemeCudaTest finished, added init file for LinearAccess tests
krzysg Feb 16, 2024
027e52a
Finished LinearAccess tests (for linear structure only), added draft …
krzysg Feb 21, 2024
e83b952
Check also total_number_particles in LinearAccess test
krzysg Feb 23, 2024
2cc5bca
LinearAccessCuda implemented (it is not used yet in CUDA pipeline)
krzysg Aug 2, 2024
e1b63d7
Compiler warnings fixed
krzysg Aug 2, 2024
4c88fae
Removed debug outputs from LinearAccessCuda test.
krzysg Aug 6, 2024
169cd9d
Added two more tests for full pipeline (including PS and LinearAccess)
krzysg Aug 6, 2024
dadf92f
-ffast-math must be removed - some optimizations still make GPU and C…
krzysg Aug 8, 2024
27a8dc3
(nasty) fix for computeLevels in CUDA - added TODO to make it more re…
krzysg Aug 8, 2024
bb3b3f4
Fix for bsplineYdir for very small input images + test for full pipel…
krzysg Aug 9, 2024
a8c4d77
Fixed Local Intensity Scale (LIS) for super small inputs
krzysg Aug 14, 2024
e6e4327
ParticleCellTreeCuda is now the main structure for CUDA
krzysg Aug 19, 2024
00aac97
computeOvpcCuda now using 'stream' instead of hardcoded values
krzysg Aug 20, 2024
1fba1bc
ParticleCellTreeCuda moved and now handles cpu2gpu transfer
krzysg Aug 20, 2024
3474250
LinearAccessCuda is now using ParticleCellTreeCuda
krzysg Aug 20, 2024
1d4e549
OVPC added to GpuTask
krzysg Aug 21, 2024
9ff0580
Full GPU pipeline works!
krzysg Aug 21, 2024
c10225d
Some debug prints removed
krzysg Aug 21, 2024
6b7a87d
Test for full pipeline cleaned up
krzysg Aug 21, 2024
3c601be
doAll() removed from Gpu pipeline
krzysg Aug 21, 2024
d2fd1d0
GPU pipeline now works for APRConverter!
krzysg Aug 22, 2024
12 changes: 7 additions & 5 deletions CMakeLists.txt
@@ -174,14 +174,14 @@ else()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14 ")

if(CMAKE_COMPILER_IS_GNUCC)
set(CMAKE_CXX_FLAGS_RELEASE "-O4 -ffast-math")
set(CMAKE_CXX_FLAGS_RELEASE "-O4")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -g -Wall -pedantic")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Bdynamic")
if(NOT WIN32)
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -ldl -lz")
endif()
elseif (CMAKE_CXX_COMPILER_ID MATCHES "Clang")
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -ffast-math")
set(CMAKE_CXX_FLAGS_RELEASE "-O3")
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -g -Wall -pedantic")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -lz")
endif()
@@ -211,8 +211,8 @@ if(APR_USE_CUDA)
message(STATUS "APR: Building CUDA for APR")
set(CMAKE_CUDA_STANDARD 14)
set(CMAKE_CUDA_RUNTIME_LIBRARY "Static")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --default-stream per-thread -Xptxas -v -DAPR_USE_CUDA")
set(CMAKE_CUDA_FLAGS_RELEASE "-O3 --use_fast_math") # -lineinfo for profiling
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --fmad=false --default-stream per-thread -Xptxas -v -DAPR_USE_CUDA")
set(CMAKE_CUDA_FLAGS_RELEASE "-O3") # -lineinfo for profiling
set(CMAKE_CUDA_FLAGS_DEBUG "-O0 -g -G")
if(APR_BENCHMARK)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DAPR_BENCHMARK")
@@ -226,6 +226,7 @@ if(APR_USE_CUDA)
src/algorithm/LocalIntensityScale.cu
src/algorithm/OVPC.cu
src/data_structures/APR/access/GPUAccess.cu
src/data_structures/APR/access/LinearAccessCuda.cu
src/numerics/miscCuda.cu
src/numerics/APRDownsampleGPU.cu
src/numerics/PixelNumericsGPU.cu
@@ -241,6 +242,7 @@ if(APR_BUILD_STATIC_LIB)
# generate static library used as a intermediate step in generating fat lib
set(STATIC_TARGET_NAME staticLib)
add_library(${STATIC_TARGET_NAME} STATIC $<TARGET_OBJECTS:aprObjLib> ${APR_CUDA_SOURCE_FILES})
set_property(TARGET ${STATIC_TARGET_NAME} PROPERTY CUDA_ARCHITECTURES OFF)
target_compile_features(${STATIC_TARGET_NAME} PUBLIC cxx_std_14)
set_target_properties(${STATIC_TARGET_NAME} PROPERTIES OUTPUT_NAME ${LIBRARY_NAME})
set_target_properties(${STATIC_TARGET_NAME} PROPERTIES CUDA_SEPARABLE_COMPILATION OFF)
@@ -262,7 +264,7 @@ if(APR_BUILD_SHARED_LIB)
# generate fat shared library
set(SHARED_TARGET_NAME sharedLib)
add_library(${SHARED_TARGET_NAME} SHARED $<TARGET_OBJECTS:aprObjLib> ${APR_CUDA_SOURCE_FILES})

set_property(TARGET ${SHARED_TARGET_NAME} PROPERTY CUDA_ARCHITECTURES OFF)
target_include_directories(${SHARED_TARGET_NAME} PUBLIC $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/src> $<BUILD_INTERFACE:${PROJECT_BINARY_DIR}>)
set_target_properties(${SHARED_TARGET_NAME} PROPERTIES OUTPUT_NAME ${LIBRARY_NAME})
set_target_properties(${SHARED_TARGET_NAME} PROPERTIES LIBRARY_OUTPUT_NAME ${LIBRARY_NAME})
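Note on the flag changes above: -ffast-math on the host permits reassociation and other value-changing optimisations, and on the device --use_fast_math plus nvcc's default FMA contraction (hence the added --fmad=false) fuse a*b + c into a single fused multiply-add. Either is enough for CPU and GPU results to differ in the last bit, which breaks the exact GPU-vs-CPU comparisons used by the tests (see the "-ffast-math must be removed" commit). The snippet below is a minimal illustration of the fused-versus-separate rounding effect and is not part of the PR; build it with -ffp-contract=off so the host compiler does not fuse the plain expression itself.

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Compares a separately rounded multiply-add against std::fma (single rounding).
    int main() {
        std::mt19937 gen(42);
        std::uniform_real_distribution<float> dist(0.5f, 2.0f);
        int mismatches = 0;
        for (int i = 0; i < 1000000; ++i) {
            const float a = dist(gen), b = dist(gen), c = -dist(gen);
            const float separate = a * b + c;          // two roundings (no contraction)
            const float fused    = std::fma(a, b, c);  // one rounding (what FMA-contracted GPU code emits)
            if (separate != fused) ++mismatches;
        }
        // A non-zero count is exactly the situation that prevents exact GPU-vs-CPU equality
        // when one side contracts to FMA and the other does not.
        std::printf("multiply-adds differing in the last ulp: %d / 1000000\n", mismatches);
        return 0;
    }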
2 changes: 1 addition & 1 deletion examples/Example_get_apr.h
@@ -30,7 +30,7 @@ struct cmdLineOptions{
bool auto_parameters = false;

float Ip_th = 0;
float lambda = -1;
float lambda = 3.0;
float sigma_th = 0;
float rel_error = 0.1;
float grad_th = 1;
166 changes: 40 additions & 126 deletions src/algorithm/APRConverter.hpp
@@ -117,7 +117,7 @@ class APRConverter {
PixelData<float> local_scale_temp; // Used as down-sampled images for some averaging steps where it is useful to not lose precision, or get over-flow errors
PixelData<float> local_scale_temp2;

void applyParameters(APR& aAPR,APRParameters& aprParameters);
void applyParameters(APRParameters& aprParameters);

template<typename T>
void computeL(APR& aAPR,PixelData<T>& input_image);
@@ -184,7 +184,7 @@ void APRConverter<ImageType>::get_apr_custom_grad_scale(APR& aAPR,PixelData<Imag
}

aAPR.parameters = par;
applyParameters(aAPR,par);
applyParameters(par);
solveForAPR(aAPR);
generateDatastructures(aAPR);

@@ -215,6 +215,10 @@ void APRConverter<ImageType>::computeL(APR& aAPR,PixelData<T>& input_image){

fine_grained_timer.start_timer("offset image");

// offset image by factor (this is required if there are zero areas in the background with
// uint16_t and uint8_t images, as the Bspline co-efficients otherwise may be negative!)
// Warning both of these could result in over-flow!

if (std::is_floating_point<ImageType>::value) {
image_temp.copyFromMesh(input_image);
} else {
@@ -247,7 +251,7 @@ void APRConverter<ImageType>::computeL(APR& aAPR,PixelData<T>& input_image){
}

template<typename ImageType>
void APRConverter<ImageType>::applyParameters(APR& aAPR,APRParameters& aprParameters) {
void APRConverter<ImageType>::applyParameters(APRParameters& aprParameters) {
//
// Apply the main parameters
//
@@ -261,39 +265,7 @@ void APRConverter<ImageType>::applyParameters(APR& aAPR,APRParameters& aprParame
}
fine_grained_timer.stop_timer();

fine_grained_timer.start_timer("threshold");
iComputeGradient.threshold_gradient(grad_temp,local_scale_temp2,aprParameters.Ip_th + bspline_offset);
fine_grained_timer.stop_timer();

float max_th = 60000;

#ifdef HAVE_OPENMP
#pragma omp parallel for default(shared)
#endif
for (size_t i = 0; i < grad_temp.mesh.size(); ++i) {

float rescaled = local_scale_temp.mesh[i];
if (rescaled < aprParameters.sigma_th) {
rescaled = (rescaled < aprParameters.sigma_th_max) ? max_th : par.sigma_th;
local_scale_temp.mesh[i] = rescaled;
}
}

#ifdef HAVE_LIBTIFF
if(par.output_steps) {
TiffUtils::saveMeshAsTiff(par.output_dir + "local_intensity_scale_rescaled.tif", local_scale_temp);
}
#endif

#ifdef HAVE_OPENMP
#pragma omp parallel for default(shared)
#endif
for (size_t i = 0; i < grad_temp.mesh.size(); ++i) {

if(grad_temp.mesh[i] < aprParameters.grad_th){
grad_temp.mesh[i] = 0;
}
}
iComputeGradient.applyParameters(grad_temp, local_scale_temp, local_scale_temp2, aprParameters, bspline_offset);
}
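The single iComputeGradient.applyParameters(...) call above presumably bundles the steps this diff removes from APRConverter: an intensity threshold on the gradient at Ip_th + bspline_offset, clamping of the local intensity scale against sigma_th / sigma_th_max, and zeroing of gradients below grad_th. The following is a self-contained sketch of that removed logic only, written against plain vectors instead of PixelData, so the names and the exact behaviour of threshold_gradient are approximations rather than the real ComputeGradient member:

    #include <cstddef>
    #include <vector>

    // Field names mirror the APRParameters members used in the removed block.
    struct ParamsSketch {
        float Ip_th = 0.0f, sigma_th = 0.0f, sigma_th_max = 0.0f, grad_th = 1.0f;
    };

    // Approximation of the logic deleted above (one fused loop instead of three passes).
    void applyParametersSketch(std::vector<float>& grad,
                               std::vector<float>& local_scale,
                               const std::vector<float>& offset_image,  // stands in for local_scale_temp2
                               const ParamsSketch& par, float bspline_offset) {
        const float max_th = 60000.0f;  // constant taken from the removed code
        for (std::size_t i = 0; i < grad.size(); ++i) {
            // threshold_gradient: presumably suppresses the gradient wherever the offset
            // image falls below the intensity threshold
            if (offset_image[i] <= par.Ip_th + bspline_offset) grad[i] = 0.0f;

            // rescale the local intensity scale: very small scales are either pushed to
            // max_th (so those regions resolve at the coarsest level) or raised to sigma_th
            const float rescaled = local_scale[i];
            if (rescaled < par.sigma_th) {
                local_scale[i] = (rescaled < par.sigma_th_max) ? max_th : par.sigma_th;
            }

            // finally, drop gradients below the user-supplied gradient threshold
            if (grad[i] < par.grad_th) grad[i] = 0.0f;
        }
    }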


@@ -401,7 +373,7 @@ inline bool APRConverter<ImageType>::get_lrf(APR &aAPR, PixelData<T>& input_imag
template<typename ImageType>
inline bool APRConverter<ImageType>::get_ds(APR &aAPR) {

applyParameters(aAPR,par);
applyParameters(par);
aAPR.parameters = par;

solveForAPR(aAPR);
@@ -422,103 +394,45 @@ inline bool APRConverter<ImageType>::get_ds(APR &aAPR) {
*/
template<typename ImageType> template<typename T>
inline bool APRConverter<ImageType>::get_apr_cuda(APR &aAPR, PixelData<T>& input_image) {
if (!initPipelineAPR(aAPR, input_image)) return false;

if (!initPipelineAPR(aAPR, input_image)) return false;

initPipelineMemory(input_image.y_num, input_image.x_num, input_image.z_num);

method_timer.start_timer("compute_gradient_magnitude_using_bsplines and local instensity scale CUDA");
APRTimer t(true);
APRTimer d(true);
t.start_timer(" =========== ALL");
{

computation_timer.start_timer("init_mem");
PixelData<ImageType> image_temp(input_image, false /* don't copy */, true /* pinned memory */); // global image variable useful for passing between methods, or re-using memory (should be the only full sized copy of the image)

/////////////////////////////////
/// Pipeline
////////////////////////
//offset image by factor (this is required if there are zero areas in the background with uint16_t and uint8_t images, as the Bspline co-efficients otherwise may be negative!)
// Warning both of these could result in over-flow (if your image is non zero, with a 'buffer' and has intensities up to uint16_t maximum value then set image_type = "", i.e. uncomment the following line)

if (std::is_same<uint16_t, ImageType>::value) {
bspline_offset = 100;
image_temp.copyFromMeshWithUnaryOp(input_image, [=](const auto &a) { return (a + bspline_offset); });
} else if (std::is_same<uint8_t, ImageType>::value) {
bspline_offset = 5;
image_temp.copyFromMeshWithUnaryOp(input_image, [=](const auto &a) { return (a + bspline_offset); });
} else {
image_temp.copyFromMesh(input_image);
}

computation_timer.stop_timer();

std::vector<GpuProcessingTask<ImageType>> gpts;

int numOfStreams = 1;
int repetitionsPerStream = 1;

computation_timer.start_timer("compute_L");
// Create streams and send initial task to do
for (int i = 0; i < numOfStreams; ++i) {
gpts.emplace_back(GpuProcessingTask<ImageType>(image_temp, local_scale_temp, par, bspline_offset, aAPR.level_max()));
gpts.back().sendDataToGpu();
gpts.back().processOnGpu();
}
computation_timer.stop_timer();


for (int i = 0; i < numOfStreams * repetitionsPerStream; ++i) {
int c = i % numOfStreams;

computation_timer.start_timer("apply_parameters");
// get data from previous task
gpts[c].getDataFromGpu();

computation_timer.stop_timer();

// in theory we get new data and send them to task
if (i < numOfStreams * (repetitionsPerStream - 1)) {
gpts[c].sendDataToGpu();
gpts[c].processOnGpu();
}

// Postprocess on CPU
std::cout << "--------- start CPU processing ---------- " << i << std::endl;

computation_timer.start_timer("solve_for_apr");
iPullingScheme.initialize_particle_cell_tree(aAPR.aprInfo);

PixelData<float> lst(local_scale_temp, true);

#ifdef HAVE_LIBTIFF
if (par.output_steps){
TiffUtils::saveMeshAsTiff(par.output_dir + "local_intensity_scale_step.tif", lst);
}
#endif
computation_timer.start_timer("init_mem");
PixelData<ImageType> image_temp(input_image, false /* don't copy */, true /* pinned memory */); // global image variable useful for passing between methods, or re-using memory (should be the only full sized copy of the image)

#ifdef HAVE_LIBTIFF
if (par.output_steps){
TiffUtils::saveMeshAsTiff(par.output_dir + "gradient_step.tif", grad_temp);
}
#endif
/////////////////////////////////
/// Pipeline
////////////////////////
// offset image by factor (this is required if there are zero areas in the background with
// uint16_t and uint8_t images, as the Bspline co-efficients otherwise may be negative!)
// Warning both of these could result in over-flow!

iLocalParticleSet.get_local_particle_cell_set(iPullingScheme,lst, local_scale_temp2,par);
if (std::is_same<uint16_t, ImageType>::value) {
bspline_offset = 100;
image_temp.copyFromMeshWithUnaryOp(input_image, [=](const auto &a) { return (a + bspline_offset); });
} else if (std::is_same<uint8_t, ImageType>::value) {
bspline_offset = 5;
image_temp.copyFromMeshWithUnaryOp(input_image, [=](const auto &a) { return (a + bspline_offset); });
} else {
image_temp.copyFromMesh(input_image);
}

iPullingScheme.pulling_scheme_main();
GpuProcessingTask<ImageType> gpt(image_temp, local_scale_temp, par, bspline_offset, aAPR.level_max());
gpt.sendDataToGpu();
gpt.processOnGpu();
auto linearAccessGpu = gpt.getDataFromGpu();

computation_timer.stop_timer();
aAPR.aprInfo.total_number_particles = linearAccessGpu.y_vec.size();

computation_timer.start_timer("generate_data_structures");
generateDatastructures(aAPR);
computation_timer.stop_timer();
}
std::cout << "Total n ENDED" << std::endl;
// generateDatastructures(aAPR) for linearAcceess for CUDA
aAPR.linearAccess.y_vec.copy(linearAccessGpu.y_vec);
aAPR.linearAccess.xz_end_vec.copy(linearAccessGpu.xz_end_vec);
aAPR.linearAccess.level_xz_vec.copy(linearAccessGpu.level_xz_vec);
aAPR.apr_initialized = true;

}
t.stop_timer();
method_timer.stop_timer();
std::cout << "CUDA pipeline finished!\n";

return true;
}
@@ -560,7 +474,7 @@ inline bool APRConverter<ImageType>::get_apr_cpu(APR &aAPR, PixelData<T> &input_
method_timer.stop_timer();
}

applyParameters(aAPR,par);
applyParameters(par);

computation_timer.stop_timer();

@@ -592,7 +506,7 @@ template<typename ImageType> template<typename T>
inline bool APRConverter<ImageType>::get_apr(APR &aAPR, PixelData<T> &input_image) {
// TODO: CUDA pipeline is temporarily turned off and CPU version is always chosen.
// After revising a CUDA pipeline remove "#if true // " part.
#if true // #ifndef APR_USE_CUDA
#ifndef APR_USE_CUDA
return get_apr_cpu(aAPR, input_image);
#else
return get_apr_cuda(aAPR, input_image);
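With the #ifndef APR_USE_CUDA guard restored in get_apr() above, callers do not change: the same APRConverter::get_apr() call runs the CUDA pipeline in a build configured with -DAPR_USE_CUDA=ON and the CPU pipeline otherwise. A hedged usage sketch follows; the include paths, the PixelData constructor argument order, and public access to the converter's par member (with fields mirrored from the example's cmdLineOptions) are assumptions based on this diff and the example header, not verified against the full LibAPR tree:

    #include <cstdint>
    #include "algorithm/APRConverter.hpp"          // assumed include path
    #include "data_structures/APR/APR.hpp"         // assumed include path
    #include "data_structures/Mesh/PixelData.hpp"  // assumed include path

    int main() {
        PixelData<uint16_t> input(128, 128, 64);  // y, x, z order assumed; fill with image data here

        APRConverter<uint16_t> converter;
        converter.par.rel_error = 0.1;  // defaults as in examples/Example_get_apr.h
        converter.par.lambda = 3.0;
        converter.par.Ip_th = 0;
        converter.par.grad_th = 1;

        APR apr;
        // Built with -DAPR_USE_CUDA=ON this dispatches to get_apr_cuda(); otherwise get_apr_cpu().
        const bool ok = converter.get_apr(apr, input);
        return ok ? 0 : 1;
    }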