Nightly test failures with rocm/5.2.0+tpls, timeouts in sparse_hip and gmres_test_prec #1839

Closed
ndellingwood opened this issue May 23, 2023 · 3 comments
@ndellingwood
Contributor

Nightly tests with rocm/5.2.0+tpls are failing due to timeouts in sparse_hip and gmres_test_prec.
This showed up after some compilation errors were resolved by #1819 (comment).

The sparse_hip failures occur during spmv testing.

Reproducer (Caraway MI100 queue):

salloc -N 1 -p MI100

module purge
module load cmake/3.19.3 rocm/5.2.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash \
  --with-devices=Hip,Serial \
  --arch=VEGA908 \
  --compiler=hipcc \
  --cxxflags="-O3" \
  --cxxstandard="17" \
  --with-hip \
  --kokkos-path=$KOKKOS_PATH \
  --kokkoskernels-path=$KOKKOSKERNELS_PATH \
  --with-scalars='double,complex_double' \
  --with-ordinals=int \
  --with-offsets=int,size_t \
  --with-layouts=LayoutLeft \
  --with-tpls=rocblas,rocsparse \
  --no-examples
@cwpearson cwpearson self-assigned this May 23, 2023
@cwpearson
Contributor

cwpearson commented May 24, 2023

salloc -N 1 -p MI100
export KOKKOS_SRC=...
export KERNELS_SRC=...
module purge
module load cmake/3.19.3 rocm/5.2.0

export KOKKOS_BUILD="${KOKKOS_SRC}/build-1839"
export KOKKOS_INSTALL="${KOKKOS_SRC}/install-1839"
export KERNELS_BUILD="${KERNELS_SRC}/build-1839"

cmake \
-S $KOKKOS_SRC \
-B $KOKKOS_BUILD \
-DCMAKE_INSTALL_PREFIX=$KOKKOS_INSTALL \
-DCMAKE_CXX_COMPILER=hipcc \
-DKokkos_ENABLE_SERIAL=ON \
-DKokkos_ENABLE_HIP=ON \
-DKokkos_ARCH_VEGA908=ON \
-DKokkos_ENABLE_TESTS=OFF \
-DKokkos_ENABLE_EXAMPLES=OFF \
-DCMAKE_VERBOSE_MAKEFILE=ON \
-DCMAKE_CXX_EXTENSIONS=OFF \
-DCMAKE_CXX_STANDARD=17 \
-DBUILD_SHARED_LIBS=OFF \
-DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF \
-DKokkos_ENABLE_DEPRECATED_CODE_3=OFF

make -C $KOKKOS_BUILD install

cmake \
-S $KERNELS_SRC \
-B $KERNELS_BUILD \
-DCMAKE_CXX_COMPILER=hipcc \
-DKokkos_DIR=$KOKKOS_INSTALL/lib64/cmake/Kokkos \
-DCMAKE_CXX_FLAGS="-O3" \
-DKokkosKernels_ENABLE_TESTS_AND_PERFSUITE=OFF \
-DKokkosKernels_ENABLE_TESTS=ON \
-DKokkosKernels_ENABLE_PERFTESTS=ON \
-DKokkosKernels_ENABLE_EXAMPLES:BOOL=ON \
-DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=OFF \
-DKokkosKernels_INST_COMPLEX_DOUBLE=ON \
-DKokkosKernels_INST_DOUBLE=ON \
-DKokkosKernels_INST_ORDINAL_INT=ON \
-DKokkosKernels_INST_OFFSET_SIZE_T=ON \
-DKokkosKernels_INST_OFFSET_INT=ON \
-DKokkosKernels_INST_LAYOUTLEFT=ON \
-DKokkosKernels_ENABLE_TPL_CUSPARSE=OFF \
-DKokkosKernels_ENABLE_TPL_CUBLAS=OFF \
-DKokkosKernels_ENABLE_TPL_ROCSPARSE=ON \
-DKokkosKernels_ENABLE_TPL_ROCBLAS=ON \
-DCMAKE_EXE_LINKER_FLAGS="" \
-DBUILD_SHARED_LIBS=OFF \
-DKokkosKernels_ENABLE_DOCS=OFF

make -C $KERNELS_BUILD KokkosKernels_sparse_hip

This one seems to hang:

$KERNELS_BUILD/sparse/unit_test/KokkosKernels_sparse_hip --gtest_filter="*sparse_spmv_double_int_int*"
  • EXECUTE_TEST_FN
    • test_spmv_algorithms
      • test_spmv
        • check_spmv
          • KokkosSparse::spmv
            • KokkosBlas::scal
              • ends up calling this code
              • suspicious use of rocblas_pointer_mode_device here since the scalar is on the host
              • commenting out the stream handling and pointer-mode code causes the test to pass!
            • Kokkos::fence <- hang here

This runs all spmv and spmmv tests:

$KERNELS_BUILD/sparse/unit_test/KokkosKernels_sparse_hip --gtest_filter="*spmv*:*spmmv*"
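
Because the failure mode here is a hang rather than a crash, it can help to bound the reproducer with a wall-clock limit so an interactive session or script does not block indefinitely. The following is a minimal sketch, not part of the original report: `run_with_limit` is a hypothetical helper name, the 300s limit is an arbitrary choice, and it assumes GNU coreutils `timeout` is available on the node.

```shell
# Hypothetical helper (not from the original issue): run a command under a
# wall-clock limit and report whether it passed, failed, or timed out.
# Assumes GNU coreutils `timeout` is on PATH.
run_with_limit() {
  local limit="$1"
  shift
  timeout "$limit" "$@"
  local status=$?
  if [ "$status" -eq 124 ]; then
    # GNU timeout exits with 124 when the limit was reached.
    echo "TIMEOUT after ${limit}: $*"
  elif [ "$status" -ne 0 ]; then
    echo "FAILED (exit ${status}): $*"
  else
    echo "OK: $*"
  fi
  return "$status"
}

# Example against the hanging test from this issue (the limit is arbitrary):
# run_with_limit 300s "$KERNELS_BUILD/sparse/unit_test/KokkosKernels_sparse_hip" \
#   --gtest_filter="*sparse_spmv_double_int_int*"
```

`timeout` exits with status 124 when the limit is reached, which separates a hang from an ordinary test failure. A process stuck in a device fence might not react to the default SIGTERM; `timeout -k` can follow up with SIGKILL, in which case the exit status is 137 rather than 124.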

@cwpearson
Contributor

After #1861, the SpMV unit tests pass on Caraway MI100:

[cwpears@caraway05 build-1839]$ $KERNELS_BUILD/sparse/unit_test/KokkosKernels_sparse_hip --gtest_filter="*spmv*:*spmmv*"
Note: Google Test filter = *spmv*:*spmmv*
[==========] Running 24 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 24 tests from hip
[ RUN      ] hip.sparse_spmv_double_int_int_TestExecSpace
[       OK ] hip.sparse_spmv_double_int_int_TestExecSpace (353 ms)
[ RUN      ] hip.sparse_spmv_struct_double_int_int_TestExecSpace
[       OK ] hip.sparse_spmv_struct_double_int_int_TestExecSpace (73 ms)
[ RUN      ] hip.sparse_spmv_double_int_size_t_TestExecSpace
[       OK ] hip.sparse_spmv_double_int_size_t_TestExecSpace (291 ms)
[ RUN      ] hip.sparse_spmv_struct_double_int_size_t_TestExecSpace
[       OK ] hip.sparse_spmv_struct_double_int_size_t_TestExecSpace (69 ms)
[ RUN      ] hip.sparse_spmv_kokkos_complex_double_int_int_TestExecSpace
[       OK ] hip.sparse_spmv_kokkos_complex_double_int_int_TestExecSpace (367 ms)
[ RUN      ] hip.sparse_spmv_struct_kokkos_complex_double_int_int_TestExecSpace
[       OK ] hip.sparse_spmv_struct_kokkos_complex_double_int_int_TestExecSpace (84 ms)
[ RUN      ] hip.sparse_spmv_kokkos_complex_double_int_size_t_TestExecSpace
[       OK ] hip.sparse_spmv_kokkos_complex_double_int_size_t_TestExecSpace (358 ms)
[ RUN      ] hip.sparse_spmv_struct_kokkos_complex_double_int_size_t_TestExecSpace
[       OK ] hip.sparse_spmv_struct_kokkos_complex_double_int_size_t_TestExecSpace (82 ms)
[ RUN      ] hip.sparse_spmv_mv_double_int_int_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_spmv_mv_double_int_int_LayoutLeft_TestExecSpace (10236 ms)
[ RUN      ] hip.sparse_spmv_mv_struct_double_int_int_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_spmv_mv_struct_double_int_int_LayoutLeft_TestExecSpace (11 ms)
[ RUN      ] hip.sparse_spmv_mv_double_int_size_t_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_spmv_mv_double_int_size_t_LayoutLeft_TestExecSpace (10678 ms)
[ RUN      ] hip.sparse_spmv_mv_struct_double_int_size_t_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_spmv_mv_struct_double_int_size_t_LayoutLeft_TestExecSpace (10 ms)
[ RUN      ] hip.sparse_spmv_mv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_spmv_mv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (12089 ms)
[ RUN      ] hip.sparse_spmv_mv_struct_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_spmv_mv_struct_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (11 ms)
[ RUN      ] hip.sparse_spmv_mv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_spmv_mv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace (12149 ms)
[ RUN      ] hip.sparse_spmv_mv_struct_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_spmv_mv_struct_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace (11 ms)
[ RUN      ] hip.sparse_bsr_spmv_double_int_int_TestExecSpace
[       OK ] hip.sparse_bsr_spmv_double_int_int_TestExecSpace (3548 ms)
[ RUN      ] hip.sparse_bsr_spmv_double_int_size_t_TestExecSpace
[       OK ] hip.sparse_bsr_spmv_double_int_size_t_TestExecSpace (3548 ms)
[ RUN      ] hip.sparse_bsr_spmv_kokkos_complex_double_int_int_TestExecSpace
[       OK ] hip.sparse_bsr_spmv_kokkos_complex_double_int_int_TestExecSpace (4165 ms)
[ RUN      ] hip.sparse_bsr_spmv_kokkos_complex_double_int_size_t_TestExecSpace
[       OK ] hip.sparse_bsr_spmv_kokkos_complex_double_int_size_t_TestExecSpace (4397 ms)
[ RUN      ] hip.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestExecSpace (11584 ms)
[ RUN      ] hip.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_bsr_spmmv_double_int_size_t_LayoutLeft_TestExecSpace (11353 ms)
[ RUN      ] hip.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_bsr_spmmv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (15494 ms)
[ RUN      ] hip.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace
[       OK ] hip.sparse_bsr_spmmv_kokkos_complex_double_int_size_t_LayoutLeft_TestExecSpace (16457 ms)
[----------] 24 tests from hip (117418 ms total)

[----------] Global test environment tear-down
[==========] 24 tests from 1 test case ran. (117418 ms total)
[  PASSED  ] 24 tests.

@ndellingwood
Contributor Author

Passing nightly build again :) , thanks @cwpearson
