Add support for H2O cache eviction with LLaMA #35381

Open
wants to merge 1 commit into
base: main
justincharney

What does this PR do?

This PR implements the Heavy-Hitter Oracle (H2O) cache eviction strategy in Hugging Face Transformers. H2O keeps the KV cache at a fixed size by retaining two groups of KV pairs, the most recent tokens and the tokens that contribute most to the cumulative attention scores (the “heavy hitters”), and evicting the rest. The implementation identifies and preserves these heavy-hitter tokens during inference, maintaining generation quality while substantially reducing memory requirements.

Key features:

  • Dynamic tracking of token importance through attention scores
  • Configurable ratio between recent and heavy-hitter sections
  • LLaMA support via post-processing of the KV cache in LlamaAttention
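
The eviction policy behind these features can be sketched in a few lines. This is a minimal, framework-free illustration of the H2O idea, not the code in this PR: the function name `h2o_keep_indices` and the parameters `max_cache_size` and `recent_ratio` are hypothetical, and a real implementation would work on per-head attention tensors rather than a flat list of scores.

```python
def h2o_keep_indices(cum_scores, max_cache_size, recent_ratio=0.5):
    """Return sorted indices of cached tokens to keep (hypothetical helper).

    cum_scores[i] is the attention mass token i has received so far,
    summed over all decoding steps.
    """
    n = len(cum_scores)
    if n <= max_cache_size:
        return list(range(n))  # cache not full yet, nothing to evict

    # Split the budget between the recent window and the heavy hitters.
    recent_size = int(max_cache_size * recent_ratio)
    heavy_size = max_cache_size - recent_size

    # The most recent tokens are always kept.
    recent_start = n - recent_size
    recent = list(range(recent_start, n))

    # Among older tokens, keep those with the largest cumulative scores.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: cum_scores[i], reverse=True)[:heavy_size]

    return sorted(heavy) + recent


# Illustrative scores only; tokens 0 and 3 are the heavy hitters here.
scores = [0.9, 0.1, 0.05, 0.7, 0.02, 0.3, 0.2, 0.1]
kept = h2o_keep_indices(scores, max_cache_size=4, recent_ratio=0.5)
print(kept)  # [0, 3, 6, 7]
```

With a budget of four entries split evenly, the two newest tokens (6 and 7) survive unconditionally, and the remaining two slots go to the older tokens with the highest cumulative attention.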

Fixes #30758

Before submitting

Who can review?

@gante, as this relates to generation functionality and touches core caching infrastructure.

Outline of code

Files modified:

  • src/transformers/cache_utils.py: Added a new class H2OCache
  • src/transformers/models/llama/modeling_llama.py: Added a post-processing function that tracks attention weights to identify heavy hitters
  • benchmark/h20: Added benchmarking scripts to compare H2OCache performance with DynamicCache
  • tests/h2O: Added tests that run an LLM with H2OCache

Executing code

To test an LLM with the H2O cache mechanism, run the following:

python -m pytest -n auto --dist=loadfile -s -v ./tests/h2O/test_h2O.py

Results

We demonstrate that H2O achieves over 80% reduction in KV cache size while incurring less than a 5% reduction in throughput. This represents a significant improvement over QuantizedCache, which introduces substantially higher overhead for similar memory savings.
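
For a sense of scale, the size of an uncompressed KV cache can be estimated from the model shape. The sketch below is a back-of-envelope calculation, not a benchmark from this PR: the defaults assume a LLaMA-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16 values), and the 20% budget is simply what an 80% reduction would correspond to.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes (assumed LLaMA-7B-like defaults)."""
    # The factor of 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


full_len = 4096
budget_len = int(full_len * 0.2)  # cache capped at 20% of the context

full = kv_cache_bytes(full_len)
capped = kv_cache_bytes(budget_len)

print(f"full cache:   {full / 2**30:.2f} GiB")
print(f"capped cache: {capped / 2**30:.2f} GiB")
print(f"reduction:    {1 - capped / full:.1%}")
```

Because the estimate is linear in sequence length, capping the number of retained tokens translates directly into the same fractional memory saving.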

@Rocketknight1
Member

Sorry for the delay from the Christmas break! cc @gante and @zucchini-nlp because it's a generation cache PR

Development

Successfully merging this pull request may close these issues.

Implement kv cache sparsity like H2O with attention score