Adding direct-F16 quantization #2136
Comments
Some extra context (the numbers are from an RTX 2070): a prompt of 512 tokens is processed at ~600 t/s using the MMQ kernels. If I force it to dequantize first, convert to f16, do the matmuls in f16, and then convert the result to f32, I can get candle to use the same kernels llama.cpp uses for prompt processing (I think? The names are almost the same). This runs at ~700 t/s. With this latter approach, 25% of the GPU time is spent on f32 -> f16 conversion. Ideally we'd dequantize directly to f16 to remove some of that workload. The PR EricLBuehler/mistral.rs#238 implements what I described above, and it contains comparisons. These lines of llama.cpp do the same f32 -> f16 conversion and matmul: https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L1232-L1270 which is called from https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L1959
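To make the difference between the two paths concrete, here is a minimal CPU-side sketch in Rust. It is not candle's or llama.cpp's code: `Block`, `Half`, `Out`, and `dequantize` are hypothetical names, the block layout is a simplified 8-bit scheme with an f32 scale, and the f16 conversion truncates and flushes subnormals to zero. It only illustrates that making the output type a parameter of the dequantization lets the f16 result be produced in one pass, with no intermediate f32 buffer.

```rust
/// Hypothetical stand-in for a 16-bit float: holds IEEE 754 binary16 bits.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Half(u16);

impl Half {
    /// Simplified f32 -> f16: truncates the mantissa, flushes subnormals
    /// to zero, and maps overflow (and NaN) to infinity.
    fn from_f32(x: f32) -> Self {
        let bits = x.to_bits();
        let sign = ((bits >> 16) & 0x8000) as u16;
        let exp = ((bits >> 23) & 0xff) as i32 - 127 + 15;
        let mant = bits & 0x007f_ffff;
        if exp <= 0 {
            return Half(sign);
        }
        if exp >= 31 {
            return Half(sign | 0x7c00);
        }
        Half(sign | ((exp as u16) << 10) | (mant >> 13) as u16)
    }

    /// f16 -> f32 for zero, normal values, and inf/NaN.
    fn to_f32(self) -> f32 {
        let sign = ((self.0 & 0x8000) as u32) << 16;
        let exp = ((self.0 >> 10) & 0x1f) as u32;
        let mant = (self.0 & 0x03ff) as u32;
        if exp == 0 {
            return f32::from_bits(sign);
        }
        if exp == 31 {
            return f32::from_bits(sign | 0x7f80_0000 | (mant << 13));
        }
        f32::from_bits(sign | ((exp + 112) << 23) | (mant << 13))
    }
}

/// Output element type of the dequantization (hypothetical trait).
trait Out: Copy {
    fn cast(x: f32) -> Self;
}
impl Out for f32 {
    fn cast(x: f32) -> Self {
        x
    }
}
impl Out for Half {
    fn cast(x: f32) -> Self {
        Half::from_f32(x)
    }
}

/// Simplified quantized block: one scale and 32 signed 8-bit values.
struct Block {
    scale: f32,
    quants: [i8; 32],
}

/// Dequantize straight into the requested element type.
fn dequantize<T: Out>(blocks: &[Block]) -> Vec<T> {
    let mut out = Vec::with_capacity(blocks.len() * 32);
    for b in blocks {
        for &q in b.quants.iter() {
            out.push(T::cast(b.scale * q as f32));
        }
    }
    out
}
```

With this sketch, `dequantize::<f32>(&blocks)` followed by a separate `Half::from_f32` pass corresponds to the two-step path described above, while `dequantize::<Half>(&blocks)` yields the same values in a single pass.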
That sounds like a pretty neat speedup to get. Is it just useful for CUDA or also for CPU/Metal?
I think this would be an optimization for CUDA.
Ok thanks, let me have a quick look. I don't think the kernels do any float-specific magic, so the conversion shouldn't be tricky.
See #2137. I'm just going to add a bit of testing, but this should hopefully be all fine.
After direct f16 dequantization we're at 1000 t/s: EricLBuehler/mistral.rs#238 (comment). Thank you!
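As a rough sanity check on the numbers in this thread, the following Rust sketch does the arithmetic. It assumes the 25% figure is a share of total prompt-processing time at ~700 t/s and that removing the conversion changes nothing else; both are assumptions, and the function names are hypothetical.

```rust
/// Throughput if a step taking `fraction` of the total time were removed
/// entirely and nothing else changed (an Amdahl-style estimate).
fn tps_without_step(tps: f64, fraction: f64) -> f64 {
    tps / (1.0 - fraction)
}

/// Relative throughput gain going from `before` to `after`.
fn relative_gain(before: f64, after: f64) -> f64 {
    after / before - 1.0
}
```

Under those assumptions, `tps_without_step(700.0, 0.25)` gives roughly 933 t/s, and `relative_gain(600.0, 1000.0)` is roughly 0.67. The reported 1000 t/s is somewhat above the 933 t/s estimate, so this simple model does not account for the whole gain.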
@LaurentMazare, thank you for adding this! We observe about a 60% performance increase for prompt processing. It seems the Candle matmul kernels here are still slower than the llama.cpp ones overall, though, by about 60%, which correlates with our prompt processing deficit to llama.cpp of also about 60%.
I created a new issue about the different kernels: #2139
Hello all,
During our work on mistral.rs we have noticed that Candle only dequantizes to F32 whereas llama.cpp can dequantize to F16. This affects performance because on certain hardware, the `turing` matmul kernels will be used over the slower `volta` ones when in F16. Are there any plans to add support for dequantizing to arbitrary floating point datatypes in the future?
For reference, here is our tracking issue: EricLBuehler/mistral.rs#153
Thank you!