Support FP8 KV Cache #652

ajtejankar · 2024-10-17T07:59:29Z

Tested on GLUE tasks with and without an adapter; accuracy is as expected. The previous implementation, which used vLLM kernels, didn't work in this setting. This version uses static scales obtained with UltraChat 2k, and the KV cache is stored in E4M3 format. Test weights are at ajinkya-tejankar/Mistral-7B-Instruct-v0.2-FP8-UltraChat-2000-KV.
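For reviewers less familiar with FP8 KV caches, here is a rough pure-Python sketch of what a static scale and E4M3 storage mean numerically. It is not the code in this PR. The helper names (`round_to_e4m3`, `calibrate_scale`, `quantize`, `dequantize`) are hypothetical, and the convention assumed here (one per-tensor scale equal to the calibration abs-max divided by 448, with values divided by the scale before rounding) is an assumption that may differ from how the checkpoint's scales are defined.

```python
import math

E4M3_MAX = 448.0      # largest finite magnitude in the E4M3 "fn" variant
MIN_NORMAL_EXP = -6   # exponent of the smallest normal E4M3 value (2**-6)
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round a finite float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, MIN_NORMAL_EXP)       # clamp so subnormals share one step size
    step = 2.0 ** (exp - MANTISSA_BITS)    # spacing of representable values here
    return sign * min(round(mag / step) * step, E4M3_MAX)


def calibrate_scale(samples):
    """Static per-tensor scale (assumed convention): abs-max mapped to 448."""
    amax = max(abs(v) for v in samples)
    return amax / E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the fixed scale, then round to E4M3 (what the cache would hold)."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(stored, scale):
    """Multiply back by the same scale when the cache is read."""
    return [v * scale for v in stored]


# The scale is computed once from calibration data and reused at inference time.
scale = calibrate_scale([0.5, -2.0, 7.0])
restored = dequantize(quantize([0.3, -1.0, 100.0], scale), scale)
```

Because the scale is fixed ahead of time, any activation larger than the calibration abs-max saturates at the E4M3 limit instead of being rescaled, which is why the choice of calibration data affects accuracy.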

server/lorax_server/utils/paged_attention.py

server/lorax_server/models/custom_modeling/flash_mistral_modeling.py

server/lorax_server/models/flash_causal_lm.py

server/lorax_server/utils/torch_utils.py

ajtejankar added 3 commits October 8, 2024 00:22

(feat) : support fp8 kv cache

6f887aa

fix a few things

93fc0d1

add support for fp8 kv cache using flash infer

939b479

ajtejankar marked this pull request as draft October 17, 2024 08:00

ajtejankar added 6 commits October 17, 2024 20:21

add logging

56589ed

merge back previous code to provide fp8_kv as a quantization option

9590a71

remove unnecessary comments

1517b16

remove is_fp8_kv_supported function

a5f1c25

fix attention api and kv_dtype

52021c0

keep ruff happy

7dd10f3

ajtejankar marked this pull request as ready for review October 18, 2024 21:54

ajtejankar requested a review from tgaddair October 18, 2024 21:54

ajtejankar commented Oct 19, 2024

server/lorax_server/utils/paged_attention.py (outdated)

ajtejankar added 6 commits October 24, 2024 00:51

use fp16 prefill with fp8 kv

787b58e

use fp16 prefill for fp8 kv cache (without prefix caching)

8dae2d8

Merge branch 'main' into fp8-kv-flash-infer

8540461

add window_left option

d880464

fix merge conflicts

bb75730

move paged_attention import location

3031ba4