Bfloat16 #1709

Draft: wants to merge 48 commits into base: develop

Commits (48)
c0f2d9a
only can compile cuda/omp
yhmtsai Nov 29, 2022
ac1dbb2
next_precision to itself when complex only float, double add empty co…
yhmtsai Jan 5, 2023
7bffc78
can compile with cuda/omp/ref (without test)
yhmtsai Jan 8, 2023
c517644
compile for cuda/sycl/test/mpi (hip needs trick)
yhmtsai Jan 11, 2023
6b5a4ec
hip finally
yhmtsai Jan 12, 2023
add37ab
fix the narrow issue and atomic support
yhmtsai Jan 12, 2023
2f53fce
fixed more error
yhmtsai Jan 12, 2023
cdc4d6b
fix the op order and gdb
yhmtsai Jan 12, 2023
c1c1551
add the rand template not_implemented
yhmtsai Jan 12, 2023
3f11657
this version can compile/run complex<half> on cuda114
yhmtsai Jan 12, 2023
6b9546b
does not work for the other executor
yhmtsai Jan 14, 2023
384371b
fix complex issue and sqrt issue
yhmtsai Feb 6, 2023
c04d7d0
try fix the compilation issue from MSVC and MacOS
yhmtsai Feb 6, 2023
efb9aea
move the half to public and use sycl::half for dpcpp
yhmtsai Feb 7, 2023
9480b50
limit the next precision in test and benchmark
yhmtsai Feb 7, 2023
9625e11
allow disable half operation
yhmtsai Feb 7, 2023
620aef5
fix macro
yhmtsai Feb 8, 2023
06a6f46
clean and refine the code
yhmtsai Feb 8, 2023
814dca4
move half.hpp out of type.hpp
yhmtsai Feb 8, 2023
54755b4
enable half for testing
yhmtsai Feb 8, 2023
cc3a7d5
__habs is added in cuda10.2
yhmtsai Feb 8, 2023
7af58c9
fix nullptr and missing instantiation.
yhmtsai Feb 9, 2023
35a47fa
fix missing device_type and ptr_param
yhmtsai Mar 23, 2023
faea48d
update rounding
yhmtsai Mar 25, 2023
f1c1181
do not use distribution with half
yhmtsai Mar 27, 2023
d450dc5
WIP fix half of failed test
yhmtsai Mar 27, 2023
e83d53d
fix/skip half test and fix numeric_limit on device
yhmtsai Jun 13, 2023
2d7f4cb
mkl csr does not support half
yhmtsai Jun 21, 2023
b04c993
add half to batch_vector
yhmtsai Sep 7, 2023
10bb4ae
fix hip thrust complex op, avoid const in nvhpc, reduce job in windows
yhmtsai Sep 12, 2023
bf25352
fix nvc++ atomic, dpcpp half
yhmtsai Sep 13, 2023
cf9c218
make half test optional
yhmtsai Sep 14, 2023
1e4b68b
nvhpc optimization/computation error workaround
yhmtsai Sep 15, 2023
c6bbf20
disable mpi half test
yhmtsai Sep 15, 2023
18d323e
some math func is not defined if nvhpc is for host
yhmtsai Sep 29, 2023
8b27e3c
add half spmv benchmark (with cusparse for cuda)
yhmtsai Sep 30, 2023
4eb3b53
add bfloat16
yhmtsai Sep 7, 2023
f3913a2
this can be compiled after cuda arch 80
yhmtsai Sep 11, 2023
d87dc17
some trick for cuda arch < 8.0
yhmtsai Sep 11, 2023
5f48681
fix the missing type conversion
yhmtsai Sep 12, 2023
316775f
enable bfloat16 test and fix/skip
yhmtsai Oct 1, 2023
4f0d12a
fix missing type, use bfloat16 op from cudaarch80
yhmtsai Oct 2, 2023
8364204
fix hip
yhmtsai Oct 2, 2023
95ac5ea
fix dpcpp
yhmtsai Oct 2, 2023
653cd36
fix nvhpc
yhmtsai Oct 2, 2023
50cf5b2
add the casting
yhmtsai Oct 3, 2023
51ab0b0
use float as the bridge between bfloat16 and half
yhmtsai Oct 4, 2023
b531ba3
fix ell accessor type
yhmtsai Oct 4, 2023
2 changes: 1 addition & 1 deletion .github/workflows/windows-mingw.yml
@@ -50,7 +50,7 @@ jobs:
mkdir build
cd build
cmake -G "MinGW Makefiles" -DBUILD_SHARED_LIBS=${{ matrix.config.shared }} -DCMAKE_BUILD_TYPE=${{ matrix.config.build_type }} -DGINKGO_COMPILER_FLAGS=${{ matrix.config.cflags }} ..
cmake --build . -j4
cmake --build . -j2
shell: cmd

- name: install
7 changes: 4 additions & 3 deletions .github/workflows/windows-msvc-ref.yml
@@ -27,8 +27,9 @@ jobs:
fail-fast: false
matrix:
config:
- {shared: "ON", build_type: "Debug", name: "reference/debug/shared"}
- {shared: "OFF", build_type: "Release", name: "reference/release/static"}
# Debug with half precision has the issue "library limit of 65535 objects exceeded"
- {shared: "ON", build_type: "Debug", name: "reference/debug/shared", half: "OFF"}
- {shared: "OFF", build_type: "Release", name: "reference/release/static", half: "ON"}
# Debug static needs too much storage
# - {shared: "OFF", build_type: "Debug", name: "reference/debug/static"}
name: msvc/${{ matrix.config.name }}
@@ -47,7 +48,7 @@ jobs:
run: |
mkdir build
cd build
cmake -DBUILD_SHARED_LIBS=${{ matrix.config.shared }} -DCMAKE_CXX_FLAGS_DEBUG='/MDd /Zi /Ob1 /O1 /Od /RTC1' -DGINKGO_BUILD_CUDA=OFF -DGINKGO_BUILD_OMP=OFF ..
cmake -DBUILD_SHARED_LIBS=${{ matrix.config.shared }} -DCMAKE_CXX_FLAGS_DEBUG='/MDd /Zi /Ob1 /O1 /Od /RTC1' -DGINKGO_BUILD_CUDA=OFF -DGINKGO_BUILD_OMP=OFF -DGINKGO_ENABLE_HALF=${{ matrix.config.half }}..
cmake --build . -j4 --config ${{ matrix.config.build_type }}
ctest . -C ${{ matrix.config.build_type }} --output-on-failure

1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -29,6 +29,7 @@ option(GINKGO_BUILD_DOC "Generate documentation" OFF)
option(GINKGO_FAST_TESTS "Reduces the input size for a few tests known to be time-intensive" OFF)
option(GINKGO_TEST_NONDEFAULT_STREAM "Uses non-default streams in CUDA and HIP tests" OFF)
option(GINKGO_MIXED_PRECISION "Instantiate true mixed-precision kernels (otherwise they will be conversion-based using implicit temporary storage)" OFF)
option(GINKGO_ENABLE_HALF "Enable the half operation" ON)
option(GINKGO_SKIP_DEPENDENCY_UPDATE
"Do not update dependencies each time the project is rebuilt" ON)
option(GINKGO_EXPORT_BUILD_DIR
15 changes: 14 additions & 1 deletion accessor/cuda_helper.hpp
@@ -47,6 +47,9 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#include "utils.hpp"


struct __half;


namespace gko {
namespace acc {
namespace detail {
@@ -57,6 +60,16 @@ struct cuda_type {
using type = T;
};

template <>
struct cuda_type<gko::half> {
using type = __half;
};

template <>
struct cuda_type<gko::bfloat16> {
using type = __nv_bfloat16;
};

// Unpack cv and reference / pointer qualifiers
template <typename T>
struct cuda_type<const T> {
@@ -87,7 +100,7 @@ struct cuda_type<T&&> {
// Transform std::complex to thrust::complex
template <typename T>
struct cuda_type<std::complex<T>> {
using type = thrust::complex<T>;
using type = thrust::complex<typename cuda_type<T>::type>;
};


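The point of recursing in the complex specialization above (and in the analogous HIP helper below) is that the element type of std::complex must also be translated to its device counterpart, so std::complex<gko::half> maps to thrust::complex<__half> rather than thrust::complex<gko::half>. A minimal, self-contained sketch of the same trait pattern, using placeholder types instead of the real CUDA/Ginkgo ones:

```cpp
#include <complex>
#include <type_traits>

struct host_half {};                             // stands in for gko::half
struct device_half {};                           // stands in for __half
template <typename T> struct device_complex {};  // stands in for thrust::complex

template <typename T>
struct cuda_type {
    using type = T;
};

// map the 16-bit host type to the device type
template <>
struct cuda_type<host_half> {
    using type = device_half;
};

// recurse into the element type, mirroring the change in this PR
template <typename T>
struct cuda_type<std::complex<T>> {
    using type = device_complex<typename cuda_type<T>::type>;
};

static_assert(std::is_same<cuda_type<std::complex<host_half>>::type,
                           device_complex<device_half>>::value,
              "complex element types are mapped to device types as well");

int main() {}
```
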
14 changes: 13 additions & 1 deletion accessor/hip_helper.hpp
@@ -47,6 +47,9 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#include "utils.hpp"


struct __half;


namespace gko {
namespace acc {
namespace detail {
@@ -83,11 +86,20 @@ struct hip_type<T&&> {
using type = typename hip_type<T>::type&&;
};

template <>
struct hip_type<gko::half> {
using type = __half;
};

template <>
struct hip_type<gko::bfloat16> {
using type = hip_bfloat16;
};

// Transform std::complex to thrust::complex
template <typename T>
struct hip_type<std::complex<T>> {
using type = thrust::complex<T>;
using type = thrust::complex<typename hip_type<T>::type>;
};


5 changes: 3 additions & 2 deletions accessor/reduced_row_major_reference.hpp
@@ -103,7 +103,7 @@ class reduced_storage
operator=(arithmetic_type val) &&
{
storage_type* const GKO_ACC_RESTRICT r_ptr = ptr_;
*r_ptr = val;
*r_ptr = detail::implicit_explicit_conversion<storage_type>(val);
return val;
}

@@ -115,7 +115,8 @@
}

constexpr GKO_ACC_ATTRIBUTES arithmetic_type
operator=(reduced_storage&& ref) && noexcept
operator=(reduced_storage&& ref) &&
noexcept
{
std::move(*this) = ref.implicit_conversion();
return *this;
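The detail::implicit_explicit_conversion<storage_type>(val) call above is needed because 16-bit storage types typically provide only explicit constructors from the wider arithmetic type, so the plain *r_ptr = val stops compiling once half/bfloat16 enter the accessor. A hedged sketch of what such a helper can look like (C++17, hypothetical types; the actual Ginkgo helper may be implemented differently):

```cpp
#include <type_traits>
#include <utility>

// Prefer the implicit conversion when it exists, otherwise fall back to an
// explicit cast. This is the behavior the accessor needs when assigning an
// arithmetic value (e.g. float) into 16-bit storage.
template <typename Target, typename Source>
constexpr Target implicit_explicit_conversion(Source&& src)
{
    if constexpr (std::is_convertible<Source&&, Target>::value) {
        return std::forward<Source>(src);                       // implicit path
    } else {
        return static_cast<Target>(std::forward<Source>(src));  // explicit path
    }
}

struct storage16 {                      // stands in for a half/bfloat16 type
    explicit storage16(float v) : bits(v) {}
    float bits;
};

int main()
{
    // `storage16 s = 1.0f;` would not compile (explicit constructor);
    // the helper routes the assignment through static_cast instead.
    auto s = implicit_explicit_conversion<storage16>(1.0f);
    return s.bits == 1.0f ? 0 : 1;
}
```
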
6 changes: 4 additions & 2 deletions accessor/reference_helper.hpp
@@ -43,8 +43,10 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

// CUDA TOOLKIT < 11 does not support constexpr in combination with
// thrust::complex, which is why constexpr is only present in later versions
#if defined(__CUDA_ARCH__) && defined(__CUDACC_VER_MAJOR__) && \
(__CUDACC_VER_MAJOR__ < 11)
// TODO: NVC++ constexpr
#if (defined(__CUDA_ARCH__) && defined(__CUDACC_VER_MAJOR__) && \
(__CUDACC_VER_MAJOR__ < 11)) || \
(defined(__NVCOMPILER) && GINKGO_ENABLE_HALF)

#define GKO_ACC_ENABLE_REFERENCE_CONSTEXPR

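For context, the guard above controls whether the accessor reference operators are declared constexpr at all: old nvcc cannot handle constexpr together with thrust::complex, and with half enabled NVC++ runs into a similar problem, so those toolchains get the non-constexpr variant. A rough sketch of the pattern with hypothetical macro names (the real Ginkgo macros may expand differently):

```cpp
// On a problematic toolchain the keyword is compiled out; everywhere else
// the reference operators stay constexpr.
#if defined(PRETEND_PROBLEMATIC_TOOLCHAIN)
#define REF_CONSTEXPR /* expands to nothing */
#else
#define REF_CONSTEXPR constexpr
#endif

template <typename T>
struct reference_sketch {
    REF_CONSTEXPR T operator+(T other) const { return value + other; }
    T value;
};

int main()
{
    reference_sketch<int> r{1};
    return r + 2 == 3 ? 0 : 1;
}
```
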
46 changes: 27 additions & 19 deletions benchmark/CMakeLists.txt
@@ -24,15 +24,12 @@ function(ginkgo_benchmark_cusparse_linops type def)
endfunction()

function(ginkgo_benchmark_hipsparse_linops type def)
add_library(hipsparse_linops_${type} utils/hip_linops.hip.cpp)
set_source_files_properties(utils/hip_linops.hip.cpp PROPERTIES HIP_SOURCE_PROPERTY_FORMAT TRUE)
hip_add_library(hipsparse_linops_${type} utils/hip_linops.hip.cpp
HIPCC_OPTIONS ${GINKGO_HIPCC_OPTIONS} -D${def}
CLANG_OPTIONS ${GINKGO_HIP_CLANG_OPTIONS}
NVCC_OPTIONS ${GINKGO_HIP_NVCC_OPTIONS})
target_compile_definitions(hipsparse_linops_${type} PUBLIC ${def})
EXECUTE_PROCESS(COMMAND ${HIP_PATH}/bin/hipconfig --cpp_config OUTPUT_VARIABLE HIP_CXX_FLAGS)
set_target_properties(hipsparse_linops_${type} PROPERTIES COMPILE_FLAGS ${HIP_CXX_FLAGS})
# use Thrust C++ device just for compilation, we don't use thrust::complex in the benchmarks
target_compile_definitions(hipsparse_linops_${type} PUBLIC -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP)
target_include_directories(hipsparse_linops_${type} SYSTEM PRIVATE
${HSA_HEADER} ${HIP_INCLUDE_DIRS}
${HIPBLAS_INCLUDE_DIRS} ${HIPSPARSE_INCLUDE_DIRS})
target_link_libraries(hipsparse_linops_${type} Ginkgo::ginkgo ${HIPSPARSE_LIBRARIES})
endfunction()

@@ -79,17 +76,25 @@ function(ginkgo_add_single_benchmark_executable name use_lib_linops macro_def ty
target_compile_options("${name}" PRIVATE ${GINKGO_COMPILER_FLAGS})
ginkgo_benchmark_add_tuning_maybe("${name}")
if("${use_lib_linops}")
if (GINKGO_BUILD_CUDA)
target_compile_definitions("${name}" PRIVATE HAS_CUDA=1)
target_link_libraries("${name}" cusparse_linops_${type})
endif()
if (GINKGO_BUILD_HIP)
target_compile_definitions("${name}" PRIVATE HAS_HIP=1)
target_link_libraries("${name}" hipsparse_linops_${type})
endif()
if (GINKGO_BUILD_DPCPP)
target_compile_definitions("${name}" PRIVATE HAS_DPCPP=1)
target_link_libraries("${name}" onemkl_linops_${type})
if ("${type}" STREQUAL "h")
# only cuda supports half currently
if (GINKGO_BUILD_CUDA)
target_compile_definitions("${name}" PRIVATE HAS_CUDA=1)
target_link_libraries("${name}" cusparse_linops_${type})
endif()
else()
if (GINKGO_BUILD_CUDA)
target_compile_definitions("${name}" PRIVATE HAS_CUDA=1)
target_link_libraries("${name}" cusparse_linops_${type})
endif()
if (GINKGO_BUILD_HIP)
target_compile_definitions("${name}" PRIVATE HAS_HIP=1)
target_link_libraries("${name}" hipsparse_linops_${type})
endif()
if (GINKGO_BUILD_DPCPP)
target_compile_definitions("${name}" PRIVATE HAS_DPCPP=1)
target_link_libraries("${name}" onemkl_linops_${type})
endif()
endif()
endif()
endfunction(ginkgo_add_single_benchmark_executable)
@@ -119,6 +124,9 @@ if (GINKGO_BUILD_CUDA)
ginkgo_benchmark_cusparse_linops(s GKO_BENCHMARK_USE_SINGLE_PRECISION)
ginkgo_benchmark_cusparse_linops(z GKO_BENCHMARK_USE_DOUBLE_COMPLEX_PRECISION)
ginkgo_benchmark_cusparse_linops(c GKO_BENCHMARK_USE_SINGLE_COMPLEX_PRECISION)
if (GINKGO_ENABLE_HALF)
ginkgo_benchmark_cusparse_linops(h GKO_BENCHMARK_USE_HALF_PRECISION)
endif()
add_library(cuda_timer utils/cuda_timer.cpp)
target_link_libraries(cuda_timer ginkgo CUDA::cudart)
endif()
15 changes: 12 additions & 3 deletions benchmark/run_all_benchmarks.sh
@@ -110,6 +110,8 @@ elif [ "${BENCHMARK_PRECISION}" == "dcomplex" ]; then
BENCH_SUFFIX="_dcomplex"
elif [ "${BENCHMARK_PRECISION}" == "scomplex" ]; then
BENCH_SUFFIX="_scomplex"
elif [ "${BENCHMARK_PRECISION}" == "half" ]; then
BENCH_SUFFIX="_half"
else
echo "BENCHMARK_PRECISION is set to the not supported \"${BENCHMARK_PRECISION}\"." 1>&2
echo "Currently supported values: \"double\", \"single\", \"dcomplex\" and \"scomplex\"" 1>&2
@@ -216,9 +218,16 @@ keep_latest() {
compute_matrix_statistics() {
[ "${DRY_RUN}" == "true" ] && return
cp "$1" "$1.imd" # make sure we're not loosing the original input
./matrix_statistics/matrix_statistics${BENCH_SUFFIX} \
--backup="$1.bkp" --double_buffer="$1.bkp2" \
<"$1.imd" 2>&1 >"$1"
if [ "${BENCH_SUFFIX}" == "_half" ]; then
# half precision benchmark still uses single for statistics
./matrix_statistics/matrix_statistics_single \
--backup="$1.bkp" --double_buffer="$1.bkp2" \
<"$1.imd" 2>&1 >"$1"
else
./matrix_statistics/matrix_statistics${BENCH_SUFFIX} \
--backup="$1.bkp" --double_buffer="$1.bkp2" \
<"$1.imd" 2>&1 >"$1"
fi
keep_latest "$1" "$1.bkp" "$1.bkp2" "$1.imd"
}

5 changes: 5 additions & 0 deletions benchmark/spmv/CMakeLists.txt
@@ -1,4 +1,9 @@
ginkgo_add_typed_benchmark_executables(spmv "YES" spmv.cpp)
# TODO: move to all benchmark
if (GINKGO_ENABLE_HALF)
ginkgo_add_single_benchmark_executable(
"spmv_half" "YES" "GKO_BENCHMARK_USE_HALF_PRECISION" "h" spmv.cpp)
endif()
if(GINKGO_BUILD_MPI)
add_subdirectory(distributed)
endif()
4 changes: 3 additions & 1 deletion benchmark/spmv/spmv_common.hpp
@@ -144,7 +144,9 @@ struct SpmvBenchmark : Benchmark<spmv_benchmark_state<Generator>> {
exec->synchronize();
auto max_relative_norm2 =
compute_max_relative_norm2(x_clone.get(), state.answer.get());
format_case["max_relative_norm2"] = max_relative_norm2;
format_case["max_relative_norm2"] =
static_cast<typename gko::detail::arth_type<rc_etype>::type>(
max_relative_norm2);
}

IterationControl ic{timer};
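The extra static_cast above exists because the benchmark's JSON output (and much of the host-side arithmetic) cannot handle half/bfloat16 directly; gko::detail::arth_type maps such types to a wider type, typically float, before the value is stored. A small sketch of that promotion idea with a placeholder 16-bit type:

```cpp
#include <iostream>

struct half16 {                        // stands in for gko::half
    explicit half16(float v) : bits(v) {}
    explicit operator float() const { return bits; }
    float bits;
};

// promotion trait in the spirit of gko::detail::arth_type: identity for
// regular types, a wider float type for 16-bit storage types
template <typename T> struct arth_type { using type = T; };
template <> struct arth_type<half16> { using type = float; };

int main()
{
    half16 max_relative_norm2{0.125f};
    // cast before handing the value to code (JSON, streams, ...) that only
    // understands standard floating-point types
    std::cout << static_cast<arth_type<half16>::type>(max_relative_norm2)
              << "\n";
}
```
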
40 changes: 25 additions & 15 deletions benchmark/utils/cuda_linops.cpp
@@ -558,14 +558,19 @@ class CusparseHybrid
((CUDA_VERSION >= 10020) && !(defined(_WIN32) || defined(__CYGWIN__)))


// cuSPARSE does not support 16 bit compute for full 16 bit floating point
// input. Also, the scalar must be the compute type, i.e. float.
template <typename ValueType>
void cusparse_generic_spmv(std::shared_ptr<const gko::CudaExecutor> gpu_exec,
const cusparseSpMatDescr_t mat,
const gko::array<ValueType>& scalars,
const gko::LinOp* b, gko::LinOp* x,
cusparseOperation_t trans, cusparseSpMVAlg_t alg)
void cusparse_generic_spmv(
std::shared_ptr<const gko::CudaExecutor> gpu_exec,
const cusparseSpMatDescr_t mat,
const gko::array<typename gko::detail::arth_type<ValueType>::type>& scalars,
const gko::LinOp* b, gko::LinOp* x, cusparseOperation_t trans,
cusparseSpMVAlg_t alg)
{
cudaDataType_t cu_value = gko::kernels::cuda::cuda_data_type<ValueType>();
cudaDataType_t compute_value = gko::kernels::cuda::cuda_data_type<
typename gko::detail::arth_type<ValueType>::type>();
using gko::kernels::cuda::as_culibs_type;
auto dense_b = gko::as<gko::matrix::Dense<ValueType>>(b);
auto dense_x = gko::as<gko::matrix::Dense<ValueType>>(x);
@@ -584,13 +589,14 @@ void cusparse_generic_spmv(std::shared_ptr<const gko::CudaExecutor> gpu_exec,
gko::size_type buffer_size = 0;
GKO_ASSERT_NO_CUSPARSE_ERRORS(cusparseSpMV_bufferSize(
gpu_exec->get_cusparse_handle(), trans, &scalars.get_const_data()[0],
mat, vecb, &scalars.get_const_data()[1], vecx, cu_value, alg,
mat, vecb, &scalars.get_const_data()[1], vecx, compute_value, alg,
&buffer_size));
gko::array<char> buffer_array(gpu_exec, buffer_size);
auto dbuffer = buffer_array.get_data();
GKO_ASSERT_NO_CUSPARSE_ERRORS(cusparseSpMV(
gpu_exec->get_cusparse_handle(), trans, &scalars.get_const_data()[0],
mat, vecb, &scalars.get_const_data()[1], vecx, cu_value, alg, dbuffer));
mat, vecb, &scalars.get_const_data()[1], vecx, compute_value, alg,
dbuffer));
GKO_ASSERT_NO_CUSPARSE_ERRORS(cusparseDestroyDnVec(vecx));
GKO_ASSERT_NO_CUSPARSE_ERRORS(cusparseDestroyDnVec(vecb));
}
@@ -669,8 +675,8 @@ class CusparseGenericCsr
protected:
void apply_impl(const gko::LinOp* b, gko::LinOp* x) const override
{
cusparse_generic_spmv(this->get_gpu_exec(), mat_, scalars, b, x, trans_,
Alg);
cusparse_generic_spmv<ValueType>(this->get_gpu_exec(), mat_, scalars, b,
x, trans_, Alg);
}

void apply_impl(const gko::LinOp* alpha, const gko::LinOp* b,
@@ -686,9 +692,11 @@ class CusparseGenericCsr
{}

private:
using compute_type = typename gko::detail::arth_type<ValueType>::type;
// Contains {alpha, beta}
gko::array<ValueType> scalars{
this->get_executor(), {gko::one<ValueType>(), gko::zero<ValueType>()}};
gko::array<compute_type> scalars{
this->get_executor(),
{gko::one<compute_type>(), gko::zero<compute_type>()}};
std::shared_ptr<csr> csr_;
cusparseOperation_t trans_;
cusparseSpMatDescr_t mat_;
@@ -761,8 +769,8 @@ class CusparseGenericCoo
protected:
void apply_impl(const gko::LinOp* b, gko::LinOp* x) const override
{
cusparse_generic_spmv(this->get_gpu_exec(), mat_, scalars, b, x, trans_,
default_csr_alg);
cusparse_generic_spmv<ValueType>(this->get_gpu_exec(), mat_, scalars, b,
x, trans_, default_csr_alg);
}

void apply_impl(const gko::LinOp* alpha, const gko::LinOp* b,
@@ -777,9 +785,11 @@ class CusparseGenericCoo
{}

private:
using compute_type = typename gko::detail::arth_type<ValueType>::type;
// Contains {alpha, beta}
gko::array<ValueType> scalars{
this->get_executor(), {gko::one<ValueType>(), gko::zero<ValueType>()}};
gko::array<compute_type> scalars{
this->get_executor(),
{gko::one<compute_type>(), gko::zero<compute_type>()}};
std::shared_ptr<coo> coo_;
cusparseOperation_t trans_;
cusparseSpMatDescr_t mat_;
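The cuda_linops change above follows from a cuSPARSE restriction: generic SpMV accepts half-precision matrix and vector buffers, but the compute type and the alpha/beta scalars have to be the 32-bit float type, which is why the scalars array now stores arth_type<ValueType>::type instead of ValueType. A compile-time sketch of that data-type/compute-type split (the strings stand in for cudaDataType_t enumerators; this does not call the real cuSPARSE API):

```cpp
#include <cstdio>

struct half16 {};  // stands in for gko::half / __half

// per value type: which enumerator describes the buffers and which one is
// passed to SpMV as the compute type
template <typename ValueType>
struct spmv_types {
    static constexpr const char* data = "CUDA_R_32F";
    static constexpr const char* compute = "CUDA_R_32F";
};

template <>
struct spmv_types<half16> {
    static constexpr const char* data = "CUDA_R_16F";
    static constexpr const char* compute = "CUDA_R_32F";  // not CUDA_R_16F
};

int main()
{
    std::printf("float: data=%s compute=%s\n", spmv_types<float>::data,
                spmv_types<float>::compute);
    std::printf("half:  data=%s compute=%s\n", spmv_types<half16>::data,
                spmv_types<half16>::compute);
}
```
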
5 changes: 1 addition & 4 deletions benchmark/utils/generator.hpp
@@ -158,10 +158,7 @@ struct DefaultSystemGenerator {
{
auto res = Vec::create(exec);
res->read(gko::matrix_data<ValueType, itype>(
size,
std::uniform_real_distribution<gko::remove_complex<ValueType>>(-1.0,
1.0),
get_engine()));
size, std::uniform_real_distribution<>(-1.0, 1.0), get_engine()));
return res;
}

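The generator change above is there because std::uniform_real_distribution only supports float, double and long double as its RealType; instantiating it with half or bfloat16 is not allowed. Dropping the template argument means the values are drawn as double and only converted to the storage value type afterwards. A small sketch with a placeholder half type:

```cpp
#include <random>

struct half16 {                         // stands in for gko::half
    explicit half16(double v) : bits(static_cast<float>(v)) {}
    float bits;
};

int main()
{
    std::default_random_engine engine{42};
    // draw in double (the distribution's default RealType) ...
    std::uniform_real_distribution<> dist(-1.0, 1.0);
    // ... and convert when the entry is stored, as happens when matrix_data
    // is filled with a 16-bit value type
    half16 entry{dist(engine)};
    return (entry.bits >= -1.0f && entry.bits <= 1.0f) ? 0 : 1;
}
```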