```
ERROR: CUDA RT call "cudaFuncSetAttribute(&monarch_conv_cuda_32_32_32_kernel<32, 8, 32768, 2, 16, false, 2, 8, 8>, cudaFuncAttributeMaxDynamicSharedMemorySize, 135168)" in line 969 of file /root/flash-fft-conv/csrc/flashfftconv/monarch_cuda/monarch_cuda_interface_fwd_bf16.cu failed with invalid argument (1). CUDA Runtime Error at: /root/flash-fft-conv/csrc/flashfftconv/monarch_cuda/monarch_cuda_interface_fwd_bf16.cu:1041 invalid argument
```
I tried the example code with `my_flashfftconv(x, k)` and `tests/test_flashfftconv.py` using the NVIDIA PyTorch Docker container (23.05). Previously, I used conda with different CUDA versions (12.1, 12.2, and 12.3).
I'm using two NVIDIA RTX 3090s with driver version 535.129.03 and CUDA version 12.2.
Is there any fix for this problem? (Changing tensor types didn't fix it.)
Thanks for this bug report! This happens because the RTX series has less SRAM than the A100/H100 (99 KB vs. 163/227 KB), which I didn't check for during development. For now, you should be good for sequence lengths up to 16K and between 64K and 524K.
We'll try to fill in the rest of the sequence lengths for the 3090 and 4090 in the next week or so, up to 2M (it requires some code changes and special-casing for different GPUs).
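The arithmetic behind the error can be checked directly: the failing `cudaFuncSetAttribute` call requests 135168 bytes (132 KiB) of dynamic shared memory, which exceeds the per-block opt-in limit on the RTX 3090 but fits on A100/H100. A minimal sketch (plain Python; the per-architecture limits are taken from the reply above and CUDA's documented opt-in maximums, not from this repo's code):

```python
# Dynamic shared memory requested by the failing cudaFuncSetAttribute call
# (taken verbatim from the error message above): 135168 bytes = 132 KiB.
REQUESTED_SMEM = 135168

# Maximum opt-in dynamic shared memory per thread block, in bytes.
# These correspond to the 99 / 163 / 227 KB figures cited in the reply.
SMEM_LIMITS = {
    "RTX 3090 (sm_86)": 99 * 1024,   # 101376 bytes
    "A100 (sm_80)":     163 * 1024,  # 166912 bytes
    "H100 (sm_90)":     227 * 1024,  # 232448 bytes
}

for gpu, limit in SMEM_LIMITS.items():
    verdict = "OK" if REQUESTED_SMEM <= limit else "invalid argument"
    print(f"{gpu}: request {REQUESTED_SMEM} B vs. limit {limit} B -> {verdict}")
```

On sm_86 the request exceeds the limit, so the runtime rejects the attribute with `invalid argument`, exactly as reported; the same request fits comfortably within the A100 and H100 budgets.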