Due to the limitations of our tile-based kernel optimization for quantized kernels with small LLM shapes, as discussed in issue #64, and given that BitBLAS is a library designed to provide different backends for different scenarios, PR #80 introduces a CUDA implementation for efficient small-batch quantized matrix multiplication. Looking ahead, we are also considering implementing quantized flash attention with our TL backend. BitBLAS therefore needs to decide when and how to dispatch operation configurations to the different backends, which requires thoughtful design of this new component.
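To make the discussion concrete, here is a minimal sketch of one possible dispatch mechanism. It is not BitBLAS's actual API: the names `MatmulConfig`, `register_backend`, `dispatch`, the backend labels `cuda_gemv` / `tl`, and the `M <= 16` small-batch threshold are all hypothetical placeholders chosen for illustration. The idea is simply a rule registry in which each backend declares which operation configurations it can handle, and the dispatcher picks the highest-priority backend whose rule matches.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MatmulConfig:
    """Hypothetical operation config; M is the batch/sequence dimension."""
    M: int
    N: int
    K: int
    quantized: bool = True


# Each backend registers a predicate deciding whether it should handle a config.
BackendRule = Callable[[MatmulConfig], bool]
_BACKEND_RULES: List[Tuple[int, str, BackendRule]] = []


def register_backend(name: str, rule: BackendRule, priority: int = 0) -> None:
    """Register a backend with a selection rule; higher priority is tried first."""
    _BACKEND_RULES.append((priority, name, rule))
    _BACKEND_RULES.sort(key=lambda entry: -entry[0])


def dispatch(config: MatmulConfig) -> str:
    """Return the first registered backend whose rule accepts the config."""
    for _, name, rule in _BACKEND_RULES:
        if rule(config):
            return name
    raise RuntimeError(f"No backend available for config {config}")


# Small-batch quantized GEMM (GEMV-like shapes) goes to a hand-written CUDA
# path; everything else falls back to the tile-based TL backend.
register_backend("cuda_gemv", lambda c: c.quantized and c.M <= 16, priority=10)
register_backend("tl", lambda c: True, priority=0)


if __name__ == "__main__":
    print(dispatch(MatmulConfig(M=1, N=4096, K=4096)))     # -> cuda_gemv
    print(dispatch(MatmulConfig(M=1024, N=4096, K=4096)))  # -> tl
```

A registry of this form keeps backend selection policy separate from the operator implementations, so adding a future backend (e.g. quantized flash attention on TL) would only require registering a new rule rather than touching the dispatch logic.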