Commit
Yang Zhou committed Sep 5, 2024
1 parent 5781577 commit 5231a62
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions index.html
@@ -181,10 +181,10 @@ <h2 class="title is-3" style="text-align: center;"><img src="static/images/twomo
<div class="content has-text-justified">
<p>
With the blossoming of large language models (LLMs), inference efficiency has become increasingly important. Various approximation methods have been proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio, seemingly without quality degradation.
- However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades model performance on reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observe that sparse models often share the full model's general problem-solving logic and require only a few token corrections to recover the original model's performance.
+ <span style="font-weight: bold; color: dodgerblue">However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades model performance on reasoning, deduction, and knowledge-based tasks.</span> Despite the gap in end-to-end accuracy, we observe that sparse models often <span style="font-weight: bold; color: dodgerblue">share the full model's general problem-solving logic</span> and require only <span style="font-weight: bold; color: dodgerblue">a minor portion of token corrections</span> to recover the original model's performance.
</p>
<p>
- This paper introduces Sirius[1], an efficient correction mechanism that significantly recovers the quality of CS models on reasoning tasks while maintaining their efficiency gains. Sirius is evaluated on 6 models across 8 difficult generation tasks in reasoning, math, and coding, and shows consistent effectiveness and efficiency. We also carefully develop a system implementation for Sirius and show that it achieves roughly a 20% latency reduction for the 8B model on-chip and a 35% reduction for the 70B model with offloading.
+ This paper introduces Sirius[1], an efficient correction mechanism that significantly recovers the quality of CS models on reasoning tasks while maintaining their efficiency gains. Sirius is evaluated on 6 models across 8 difficult generation tasks in reasoning, math, and coding, and <span style="font-weight: bold; color: dodgerblue">shows consistent effectiveness and efficiency</span>. We also carefully develop a system implementation for Sirius and show that it achieves roughly a 20% latency reduction for the 8B model on-chip and a 35% reduction for the 70B model with offloading.
</p>
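To make the corrected-generation idea concrete, below is a minimal sketch of one plausible Sirius-style loop, not the paper's exact algorithm: the CS model drafts a short chunk of tokens, the full model verifies the chunk in a single parallel pass, and the first disagreeing token is replaced by the full model's choice. The handles full_model, cs_model, and the chunk length period are hypothetical Hugging-Face-style placeholders, and batch size 1 with greedy decoding is assumed.

import torch

@torch.no_grad()
def corrected_generate(full_model, cs_model, input_ids, max_new_tokens=256, period=16):
    tokens = input_ids
    while tokens.shape[1] - input_ids.shape[1] < max_new_tokens:
        # 1) The cheap CS model drafts a chunk of up to `period` tokens.
        draft = cs_model.generate(tokens, max_new_tokens=period, do_sample=False)
        chunk = draft[:, tokens.shape[1]:]
        if chunk.shape[1] == 0:  # EOS reached; nothing left to draft
            break
        # 2) The full model scores the whole draft in one parallel pass;
        #    logits at position i predict the token at position i + 1.
        logits = full_model(draft).logits[:, tokens.shape[1] - 1 : -1, :]
        verified = logits.argmax(dim=-1)
        # 3) Accept the draft up to the first disagreement, then splice in
        #    the full model's token there (the correction step).
        mismatch = (verified != chunk).nonzero()
        if mismatch.numel() == 0:
            tokens = draft  # the full model agrees with the entire chunk
        else:
            k = mismatch[0, 1].item()  # first disagreeing position
            tokens = torch.cat([tokens, chunk[:, :k], verified[:, k : k + 1]], dim=1)
    return tokens

In a loop of this shape, the full model's cost amortizes over each accepted chunk while output quality stays anchored to its corrections.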
<p class="footnote">
<span class="asteriss">[1] We draw inspiration from the astronomical concept, in which Sirius refers to a two-body star system, where one is the brightest star ever detected, while the other is a dim star.</span>
@@ -336,7 +336,7 @@ <h2 class="title is-3" style="text-align: center;"><img src="static/images/rocke
</p>
<p>
Can the generation be corrected by fixing just these minor mistakes in the middle? We run both the full model and the CS model and contrast their outputs token by token for Llama-3-8B-Instruct and Llama-2-7B-Chat; the results are shown in Figures (a) and (b).
- We find that <span style="font-weight: bold; color: dodgerblue">the percentage of tokens that need to be corrected is minor: modifying 10% of the tokens is enough to recover the full model's performance</span>. This motivates us to develop an efficient correction mechanism that boosts CS models on complex generation tasks involving reasoning.
+ We find that <span style="font-weight: bold; color: dodgerblue">the percentage of tokens that need to be corrected is minor: modifying 11% of the tokens is enough to recover the full model's performance</span>. This motivates us to develop an efficient correction mechanism that boosts CS models on complex generation tasks involving reasoning.
</p>
</ul>
</div>
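A minimal sketch of this measurement follows, under the assumption that the comparison teacher-forces the full model's greedy output through the CS model and counts the positions where the CS model's greedy choice disagrees (one plausible reading of the token-by-token contrast above); full_model and cs_model are hypothetical Hugging-Face-style handles.

import torch

@torch.no_grad()
def correction_fraction(full_model, cs_model, input_ids, max_new_tokens=256):
    # Greedy generation from the full model serves as the reference continuation.
    ref = full_model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Teacher-force the reference through the CS model; logits at position i
    # predict the token at position i + 1.
    logits = cs_model(ref).logits[:, input_ids.shape[1] - 1 : -1, :]
    cs_choice = logits.argmax(dim=-1)
    ref_new = ref[:, input_ids.shape[1]:]
    # Fraction of reference tokens the CS model gets wrong, i.e., the tokens
    # that would need correction.
    return (cs_choice != ref_new).float().mean().item()

A value around 0.1 for this fraction would correspond to the roughly 11% of tokens reported above.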
@@ -661,7 +661,7 @@ <h2 class="title is-3"><img src="static/images/cosmonautllama.png" style="height
<div class="content has-text-justified">
<p>
We observe that contextual sparsity methods significantly degrade on reasoning and deduction tasks. However, we find that the degradation of contextual sparsity models can theoretically be recovered
- with 10% of the tokens corrected by the original model. Following this observation, we develop Sirius. Sirius provides an effective solution to the performance degradation of contextual sparsity methods on complex reasoning tasks. By introducing an efficient correction mechanism, Sirius significantly boosts the performance of CS models while maintaining their efficiency gains. This work opens up new possibilities for deploying efficient LLMs in resource-constrained environments without compromising task performance.
+ with 11% of the tokens corrected by the original model. Following this observation, we develop Sirius. Sirius provides an effective solution to the performance degradation of contextual sparsity methods on complex reasoning tasks. By introducing an efficient correction mechanism, Sirius significantly boosts the performance of CS models while maintaining their efficiency gains. This work opens up new possibilities for deploying efficient LLMs in resource-constrained environments without compromising task performance.
</p>
</div>
</div>
