Dear SGLang Team,
We are a security research group. We are impressed by SGLang's design, especially the shared-prefix KV cache. As we studied it further, however, some security concerns arose. When a new prompt arrives, if the TokenKVPool already holds its prefix tokens, the prefill phase is accelerated, which is reflected in the time to first token (TTFT). We found that the TTFT differences introduced by additional shared tokens are significant enough to be recognized.
Description
Assume the victim has sent a valuable prompt to SGLang, or a valuable system prompt has been set up in SGLang beforehand. Under certain conditions (e.g., the attacker shares the same serving backend with the victim), the attacker can attempt to guess the content of the victim's prompt and check each guess's validity from the TTFT.
Unlike vLLM, which shares tokens in chunks, SGLang uses a token-by-token sharing mechanism (RadixAttention) and combines it with a trie structure to store KV-cache entries. On the other hand, the timing decrease from one additional shared token is often negligible, which makes it harder for an attacker to guess prompts token by token, so here we simply demonstrate the leakage with multiple additional shared tokens.
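To illustrate the token-by-token sharing described above, here is a minimal sketch of prefix matching in a trie of cached token sequences. This is our own simplified model for exposition, not SGLang's actual RadixCache implementation:

```python
# Minimal trie over token IDs: each cached prompt inserts its tokens
# node by node, and a new prompt's shared-prefix length is the depth
# to which it matches an existing path. That depth determines how much
# prefill work is skipped, which is what leaks through the TTFT.
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def shared_prefix_len(self, tokens):
        # Number of leading tokens whose KV entries would be reused.
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

cache = PrefixCache()
cache.insert([5, 9, 2, 7])                    # victim prompt's token ids
print(cache.shared_prefix_len([5, 9, 2, 3]))  # -> 3 (three tokens reused)
print(cache.shared_prefix_len([8, 1]))        # -> 0 (no reuse)
```

Because sharing is per token rather than per chunk, every one-token extension of a correct guess moves the match one node deeper, which is precisely what makes a token-by-token oracle conceivable.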
Environment
GPU: NVIDIA A100 (40G)
CUDA: 11.8
pytorch: 2.3.1
OS: ubuntu 18.04
Sglang: v0.2.6
We launch the SGLang server with the default settings and set max_tokens=1 on each request to measure the TTFT.
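A minimal sketch of how such a TTFT measurement can be taken. The network call is stubbed out here so the snippet is self-contained; in a real measurement, `send_request` would be an HTTP call to the serving endpoint with `max_tokens=1`, so the response time is dominated by prefill:

```python
import time

def measure_ttft(send_request, prompt):
    """Time a single max_tokens=1 request; with generation capped at
    one token, the elapsed time is a proxy for the TTFT."""
    start = time.perf_counter()
    send_request(prompt, max_tokens=1)
    return time.perf_counter() - start

# Stand-in for a real client call to the server; it sleeps proportionally
# to the prompt length so the sketch runs without a GPU or a server.
def fake_send(prompt, max_tokens):
    time.sleep(0.001 * len(prompt.split()))

ttft = measure_ttft(fake_send, "guess of the victim prompt")
print(f"TTFT: {ttft * 1000:.2f} ms")
```

In practice each guess would be timed many times and the samples compared statistically, since a single TTFT reading is noisy.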
Leakage
We tested LLaMA2-13B and LLaMA2-70B-GPTQ (on a single device) and plotted ROC curves to fingerprint the timing difference when prompts share a prefix of 1, 2, 4, and 8 tokens respectively.
The results suggest that larger models have wider leakage windows. Even with only 2 additional shared tokens, the ROC is still good enough for us to check the validity of a guess.
Attack
We have tried several methods to amplify the phenomenon and found that the AUC for one additional shared token can be increased from 0.529 to 0.58. By using the flush_cache function provided by SGLang, we can increase our TPR over more trials without interfering with ourselves (since if the same guess were repeated without flushing, the later request would be accelerated by our own earlier one).
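For reference, the AUC of the shared-vs-non-shared timing classifier can be computed directly from raw TTFT samples as the Mann-Whitney probability that a random non-shared timing exceeds a random shared one. A sketch with synthetic timings (illustrative numbers, not our measured data):

```python
def auc(slow_samples, fast_samples):
    """Probability that a random 'no shared prefix' TTFT exceeds a
    random 'shared prefix' TTFT, counting ties as 1/2 (this pairwise
    win rate is exactly the ROC AUC of a threshold classifier)."""
    wins = sum((s > f) + 0.5 * (s == f)
               for s in slow_samples for f in fast_samples)
    return wins / (len(slow_samples) * len(fast_samples))

# Synthetic TTFTs (seconds): non-shared prompts are slightly slower.
no_share = [0.051, 0.052, 0.050, 0.053]
shared   = [0.049, 0.050, 0.048, 0.051]
print(auc(no_share, shared))  # -> 0.875
```

An AUC of 0.5 means the timings are indistinguishable; values approaching 1.0 mean a single comparison already reveals whether the prefix was cached.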
We have also designed a theoretical token-by-token algorithm to recover victim prompts. Detailed information will be provided soon in our paper.
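While the full algorithm is deferred to the paper, the core idea can be simulated with an idealized oracle that stands in for the noisy timing test. This is our own illustrative sketch, not the algorithm from the paper; the real attack must infer the oracle's bit from TTFT statistics over many trials:

```python
# Idealized token-by-token prefix recovery. The oracle reports whether
# extending the guess by one token still hits the cached (victim)
# prefix; in the real attack this bit comes from TTFT measurements.
VICTIM = [17, 4, 22, 9]   # hypothetical victim prompt token ids
VOCAB = range(32)         # toy vocabulary for the simulation

def oracle(guess):
    # True iff guess is a prefix of the cached victim prompt.
    return guess == VICTIM[:len(guess)]

def recover(max_len):
    guess = []
    for _ in range(max_len):
        for tok in VOCAB:
            if oracle(guess + [tok]):
                guess.append(tok)
                break
        else:
            break  # no token extends the prefix: recovery is complete
    return guess

print(recover(10))  # -> [17, 4, 22, 9]
```

Each recovered token costs at most one pass over the vocabulary, so recovery is linear in prompt length when the per-token oracle is reliable; the whole difficulty of the real attack lies in making that oracle reliable despite the small one-token timing difference.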
Possible mitigations
Below are some possible mitigations against our attacks.
The SGLang runtime (SRT) could detect whether a user is repeatedly submitting the same prompt, i.e., guessing over many trials. This could also be inferred from other behavior; for example, an attacker will tend to always set max_tokens=1 to measure the TTFT.
Increase the granularity of the minimum shared-token unit. Although the timing differences (shown in the ROC graphs above) would be amplified, the attacker's search space scales exponentially with the granularity; once the granularity reaches 8 tokens or more, exhaustive guessing could take the attacker practically forever.
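A back-of-the-envelope calculation of this scaling, assuming a LLaMA-2-sized vocabulary of 32,000 and a blind (prior-free) attacker; in practice language-model priors would shrink the effective space, so this is a worst-case count:

```python
# With sharing granularity g, a hit/miss signal is only observable per
# g-token block, so a blind guess of one block must cover vocab**g
# candidates. Even g = 2 is already around a billion candidates.
VOCAB_SIZE = 32000
for g in (1, 2, 4, 8):
    print(f"granularity {g}: {VOCAB_SIZE ** g:.3e} candidates per block")
```

This is the trade-off the mitigation exploits: coarser sharing leaks a stronger per-block signal but makes each leaked bit exponentially more expensive to extract.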
We hope to receive your early reply and look forward to discussing with you!
Unik-lif changed the title from "[Discussions] Possible timing side-channels of KV-Cache?" to "Possible timing side-channels of KV-Cache?" on Sep 24, 2024, and then to "Possible timing side-channels caused by shared prefix" on Sep 29, 2024.
@Unik-lif This is very interesting. Is your paper publicly available now?
We would like to invite you to join our bi-weekly online development meeting to discuss this vulnerability. Are you available on Oct. 19? If so, could you sign up for a 20-min slot in this doc?
Thank you for your warm reply @merrymercy!
We are honored by the invitation; however, we are currently busy with other commitments and may not be available on Oct. 19. I am sorry for that. 😢
Is your paper publicly available now?
Yes! We recently posted our manuscript on arXiv. However, the content presented in this manuscript is not yet complete, and we hope to refine it further in the future.