cuSPARSELt matmul example not working on M=N=K8192 #203
Comments
@OrenLeung a couple of questions to better understand your issue.
|
hi @fbusato , thanks for the quick reply. I didn't change anything else in the code, just the m,n,k vars. I was able to compile & run the matmul example with default m,n,k vars.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install libcusparselt0 libcusparselt-dev
I have double checked my cusparse libraries:
ls /usr/local/cuda/lib64/libcusparse
libcusparse.so libcusparseLt.so libcusparseLt_static.a
libcusparse.so.12 libcusparseLt.so.0
libcusparse.so.12.5.1.3 libcusparseLt.so.0.5.2.1 |
it seems that you are using cuSPARSELt 0.5.2.1 which doesn't support Hopper https://docs.nvidia.com/cuda/cusparselt/release_notes.html |
Hi @fbusato Thanks for the suggestion, I have now correctly symlinked to cuSPARSELt v0.6.2 as you suggested. I have verified that the provided m,n,k in the example works properly and does not deadlock. But unfortunately for m=n=k=8192 it hangs; it seems to be stuck on a half-to-float conversion. I have also double checked that m,n,k is the only thing I changed. |
Hi @OrenLeung, the 'deadlock' you observe is due to the long host-side computation time of the correctness check for large matrices. If you want to speed up the process, my suggestion is to use cuBLAS to compute the reference matrix multiplication on the GPU. |
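To put a rough number on why the host-side check looks like a hang, here is a minimal sketch that counts the operations in a reference GEMM and converts them to an estimated run time. The function names and the `host_gflops` parameter are hypothetical and not part of the sample; the throughput you pass in is an assumption about your CPU loop, not a measurement.

```cpp
#include <cstdint>

// Floating-point operations in a dense m x n x k GEMM:
// one multiply and one add per (row, column, inner-index) triple.
inline double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Estimated seconds for a host reference loop.
// host_gflops is an ASSUMED sustained throughput (hypothetical value).
inline double host_check_seconds(std::int64_t m, std::int64_t n,
                                 std::int64_t k, double host_gflops) {
    return gemm_flops(m, n, k) / (host_gflops * 1e9);
}
```

For m=n=k=8192 the count is 2^40, about 1.1e12 operations. If a naive single-threaded loop sustained around 1 GFLOP/s (an assumed figure), `host_check_seconds(8192, 8192, 8192, 1.0)` would come out to roughly 1100 seconds, which is easy to mistake for a deadlock.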
hi @fbusato Thanks for your suggestion! I have now got it working, but unfortunately the realized TFLOP/s is nowhere close to the peak theoretical sparse TFLOP/s. Do you have any tips on how to improve cuSPARSELt performance? Realized sparse cuSPARSELt fp16: 1005 TFLOP/s out of the peak theoretical 1,979. This means there is only around a 15% realized improvement over dense. Although no one was expecting the claimed 2x improvement, one would expect closer to a 40-50% realized improvement. On A100, Nvidia claims that the speedup for big GEMMs is 1.6-1.8x https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/ Attached is my script for benchmarking 8192x8192x8192 cuSPARSELt 2:4 semi-structured fp16 sparsity vs cuBLAS fp16 dense GEMMs on H100. I have ensured that I am measuring GPU time through CUDA events and that I am on the latest cuSPARSELt version. |
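For anyone reproducing these numbers, this is a minimal sketch of how a TFLOP/s figure and a percent improvement can be derived from measured times. It assumes the sparse run is credited with the same 2*m*n*k operations as the dense one (dense-equivalent throughput); the attached script may compute it differently, and the function names are hypothetical.

```cpp
#include <cstdint>

// Dense-equivalent throughput in TFLOP/s for an m x n x k GEMM
// that took elapsed_ms milliseconds.
// ASSUMPTION: sparse and dense runs are both credited with 2*m*n*k operations.
inline double achieved_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                              double elapsed_ms) {
    const double flops = 2.0 * static_cast<double>(m) *
                         static_cast<double>(n) * static_cast<double>(k);
    return flops / (elapsed_ms * 1e-3) / 1e12;
}

// Percent improvement of the sparse run over the dense run
// for the same problem size, given both elapsed times.
inline double improvement_percent(double dense_ms, double sparse_ms) {
    return (dense_ms / sparse_ms - 1.0) * 100.0;
}
```

Under this convention, a speedup of 1.2x corresponds to `improvement_percent` returning about 20, and a 2x speedup to 100.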
There are several things to consider when benchmarking cuSPARSELt. First, you should use Nsight Systems (or CUPTI) to get more reliable time measurements. Second, you need to run the autotuning functionality; see the other example. Other points to consider: run some warm-up iterations, lock the GPU SM/memory clocks, disable autoboost, ensure there is no power/thermal throttling, disable CPU turbo boost, set the CPU governor to performance, etc. |
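The warm-up advice can be sketched as a small host-side harness: run the operation a few times untimed, then time several iterations and report the median. This is only an illustration of the structure, not a substitute for Nsight Systems or CUPTI; `median_ms` is a hypothetical helper, and it assumes the callable blocks until the work is finished (for a GPU launch that means synchronizing inside it).

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Runs fn `warmup` times without timing, then `iters` timed runs,
// and returns the median sample in milliseconds (upper median when
// `iters` is even). Requires iters >= 1.
// ASSUMPTION: fn() returns only after the work has completed; otherwise
// only the launch overhead is measured.
template <typename Fn>
double median_ms(Fn&& fn, int warmup, int iters) {
    for (int i = 0; i < warmup; ++i) {
        fn();  // untimed warm-up runs
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iters));
    for (int i = 0; i < iters; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

The median is used here rather than the mean so that a single slow iteration (for example, one hit by clock ramp-up) does not skew the result.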
hi fbusato, thanks for your suggestion.
|
It seems that when the inputs are changed to a normal distribution centered around 0, the sparse performance gets a bit better, with a 20% improvement over dense. OrenLeung@9cabba4
|
@OrenLeung we evaluated the same sparse GEMM operation on our systems with default clocks. We observed a 1.38x speedup (sparse vs. dense) on an H100 350W and 1.22x on an H100 800W. |
@fbusato thanks for running it. By "800W H100", you mean 700W, right? We also see around a 1.20-1.22x improvement. Would you have any suggestions on shapes where sparsity would show the biggest gain compared to dense? |
I don't have any specific suggestions other than to try different shapes and data types. The results are affected by the GPU model, clock settings, and CUDA version, so it is hard to give exact sizes. The main engineer is out of the office and will be back in two weeks. He can help you better. |
on https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul
The example runs fine with the existing small m,n,k, but when I change m,n,k to 8192, I get a runtime error. Any pointers or patches on how to fix it?
CUSPARSE API failed at line 191 with error: operation not supported (10)
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSELt/matmul/matmul_example.cpp#L116-L118