- CS models work well on prompt-understanding tasks, e.g. text summarization (CNN/DailyMail) and conversational question answering (CoQA).
- CS models perform significantly worse on generation tasks that require complex reasoning (GSM8K) or knowledge (MMLU-FLAN-COT).
- For complex-reasoning generation tasks, the degradation becomes even larger for better-trained models of similar parameter count.
| GSM8K | | | | | | |
|---|---|---|---|---|---|---|
| Model | Full Perf. | CSparse Perf. | CSparse Density | Sirius Perf. | AAL | Effective Density |
| Llama-3-8B-Instruct | 0.7536 | 0.3844 | 0.65 | 0.7051 (8) | 15.22/16 | 0.706 |
| Model | Full Perf. | FSparse Perf. | FSparse Density | Sirius Perf. | AAL | Effective Density |
| Llama-3-8B-Instruct | 0.7536 | 0.5868 | 0.76 | 0.7278 (4) | 15.37/16 | 0.807 |
| CSQA | | | | | | |
|---|---|---|---|---|---|---|
| Model | Full Perf. | CSparse Perf. | CSparse Density | Sirius Perf. | AAL | Effective Density |
| Llama-3-8B-Instruct | 0.7073 | 0.6470 | 0.58 | 0.7076 (8) | 14.76/16 | 0.657 |
| Model | Full Perf. | FSparse Perf. | FSparse Density | Sirius Perf. | AAL | Effective Density |
| Llama-3-8B-Instruct | 0.7073 | 0.6158 | 0.72 | 0.7043 (8) | 15.66/16 | 0.753 |
| HumanEval | | | | | | |
|---|---|---|---|---|---|---|
| Model | Full Perf. | CSparse Perf. | CSparse Density | Sirius Perf. | AAL | Effective Density |
| Llama-3-8B-Instruct | 0.561 | 0.207 | 0.65 | 0.524 (8) | 14.67/16 | 0.733 |
| Model | Full Perf. | FSparse Perf. | FSparse Density | Sirius Perf. | AAL | Effective Density |
| Llama-3-8B-Instruct | 0.561 | 0.457 | 0.76 | 0.616 (6) | 15.42/16 | 0.804 |
| Settings | Performance | A40 | Speedup Ratio | L40 | Speedup Ratio | Performance | A100 | Speedup Ratio |
|---|---|---|---|---|---|---|---|---|
| Coarse-grained Sparsity | 0.3601 | 20.7 | 0.85 | 15.6 | 0.67 | 0.3601 | 9.6 | 0.72 |
| Sirius | 0.7127 | 24.1 | 0.77 | 18.2 | 0.78 | 0.7089 | 11.1 | 0.83 |
| Full | 0.7612 | 30.9 | 1.0 | 23.2 | 1.0 | 0.7612 | 13.3 | 1.0 |
| Settings | Sparse | Sirius | Full |
|---|---|---|---|
| Performance | 0.7407 | 0.8719 | 0.9014 |
| Latency (s) | 3.57 | 3.68 | 5.72 |
| Ratio to Full | 0.6241 | 0.6434 | 1.0 |
```bash
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
One package deserves special attention: since Sirius uses torch.compile to optimize inference latency, we strictly require PyTorch version 2.3.0.
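To quickly confirm the environment matches these requirements, a minimal sanity check (assuming both packages are installed in the active environment) is:

```bash
# Sanity check: Sirius expects PyTorch 2.3.0 and a working flash-attn build.
python -c "import torch; assert torch.__version__.startswith('2.3.0'), torch.__version__"
python -c "import flash_attn; print('flash-attn import OK')"
```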
This section presents code for evaluating only accuracy and the efficiency metric AAL; the implementation is not optimized for the best speedup.

- For GSM-8K, GSM-8K-COT, CNN/DailyMail, and MMLU-FLAN-COT, we base our implementation on LM Evaluation Harness, which supports these tasks. The essential code is packed in the "Miscellaneous" folder. To run Sirius on various Hugging Face models, use the following commands.
```bash
cd Miscellaneous

# Full model
accelerate launch --main_process_port <main_port> --num_processes <num_procs> --num_machines <num_node> main.py --model xhf --model_args pretrained=<huggingface-token-model>,griffin=False,check=False --tasks <task_name> --batch_size 1

# Coarse-grained Sparsity
accelerate launch --main_process_port <main_port> --num_processes <num_procs> --num_machines <num_node> main.py --model xhf --model_args pretrained=<huggingface-token-model>,griffin=True,check=False --tasks <task_name> --batch_size 1

# Fine-grained Sparsity
accelerate launch --main_process_port <main_port> --num_processes <num_procs> --num_machines <num_node> main.py --model xhf --model_args pretrained=<huggingface-token-model>,cats=True,check=False --tasks <task_name> --batch_size 1

# Sirius with Sparse
accelerate launch --main_process_port <main_port> --num_processes <num_procs> --num_machines <num_node> main.py --model xhf --model_args pretrained=<huggingface-token-model>,griffin=True,check=True,kernel_size=<kernel_size>,widthtree=<width_tree>,patternstrict=True,thr=0.1 --tasks <task_name> --batch_size 1
```
To turn Sirius on, set `check=True`. Set `cats=True` for fine-grained sparsity and `griffin=True` for coarse-grained sparsity. Importantly, fine-grained sparsity here is based on top-k rather than on a threshold as in https://arxiv.org/abs/2404.08763. Unfortunately, their implementation isn't open-sourced, and a threshold-based approach makes it hard to maintain the same neuron sparsity level across multiple different generation datasets. For `cats=True` with Sirius `widthtree` > 1, `patternstrict` must be set to True.
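For instance, a fully concretized invocation might look as follows (the model ID, task name, and kernel/tree values are illustrative placeholders we picked for the example, not recommended settings; check the harness's task registry for exact task names):

```bash
# Example: Sirius on top of coarse-grained sparsity (values are illustrative).
# Assumes GSM8K is registered as "gsm8k" in the bundled harness.
accelerate launch --main_process_port 29500 --num_processes 1 --num_machines 1 main.py \
    --model xhf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,griffin=True,check=True,kernel_size=16,widthtree=1,patternstrict=True,thr=0.1 \
    --tasks gsm8k \
    --batch_size 1
```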
- For Commonsense Reasoning tasks, we follow the Chain-of-Thought work (https://arxiv.org/abs/2201.11903) to convert the previously multiple-choice datasets CSQA, StrategyQA, Date, and Sports into generation tasks. The essential code is packed in the "CommonSenseReasoning" folder.
```bash
cd CommonSenseReasoning

# Sirius with Sparse
accelerate launch --main_process_port <main_port> --num_processes <num_proc> main.py --tasks <task_name> --model <huggingface_token> --shotfive --cats --check --kernel_size <kernel_size> --spr <sparsity> --thr <threshold> --widthtree <widthtree> --patternstrict
```
Add `--cats` for fine-grained sparsity or `--griffin` for coarse-grained sparsity; add neither for the full model. Add `--check` to use the full model for correction; otherwise, correction is not used. Again, with `--cats` and `<widthtree>` > 1, `--patternstrict` must be added. `--shotfive` enables 5 few-shot examples, which is the setting under which our measurements are performed.
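For example (model ID, task name, and hyperparameter values below are illustrative placeholders; we assume CSQA is registered as `csqa` in the scripts of the CommonSenseReasoning folder):

```bash
# Example: Sirius over fine-grained sparsity on CSQA with 5-shot prompting (illustrative values).
accelerate launch --main_process_port 29500 --num_processes 1 main.py \
    --tasks csqa \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --shotfive --cats --check \
    --kernel_size 16 --spr 0.5 --thr 0.1 --widthtree 2 --patternstrict
```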
- For coding, we base our implementation on Big Code Evaluation Harness. The essential code is packed in the "CodeGeneration" folder.
```bash
cd CodeGeneration

accelerate launch --num_processes <num_proc> main.py \
    --model <huggingface-token-model> \
    --tasks <task_name> \
    --do_sample False \
    --n_samples 1 \
    --batch_size 1 \
    --max_length_generation 512 \
    --enable_epatches \
    --cats \
    --allow_code_execution \
    --spr <sparsity> \
    --widthtree <widthtree> \
    --check \
    --kernelsize <kernel_size> \
    --thr <threshold> \
    --patternstrict
```
In our code, we use only greedy decoding, i.e. `--do_sample` is False. As before, add `--cats` for fine-grained sparsity or `--griffin` for coarse-grained sparsity, and neither for the full model. Add `--check` to use the full model for correction; otherwise, correction is not used. Again, with `--cats` and `<widthtree>` > 1, `--patternstrict` must be added. For `<task_name>`, we only support `humaneval` and `mbppplus`.
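A fully filled-in HumanEval run might look like this (the model ID and hyperparameter values are illustrative placeholders, not recommended settings):

```bash
# Example: Sirius over fine-grained sparsity on HumanEval, greedy decoding (illustrative values).
accelerate launch --num_processes 1 main.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks humaneval \
    --do_sample False \
    --n_samples 1 \
    --batch_size 1 \
    --max_length_generation 512 \
    --enable_epatches --cats --allow_code_execution \
    --spr 0.5 --widthtree 2 --check --kernelsize 16 --thr 0.1 --patternstrict
```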
To measure wall-clock latency, run the following from the "Miscellaneous" folder (the `mode` argument selects the wall-clock measurement path):

```bash
cd Miscellaneous

python main.py --model xhf --model_args pretrained=<huggingface-token-model>,griffin=True,check=True,kernel_size=<kernel_size>,widthtree=<width_tree>,patternstrict=True,thr=0.05,mode=wallclock_notree --tasks <task_name> --batch_size 1

python main.py --model xhf --model_args pretrained=<huggingface-token-model>,griffin=True,check=True,kernel_size=<kernel_size>,widthtree=<width_tree>,patternstrict=True,thr=0.05,mode=wallclock_tree --tasks <task_name> --batch_size 1

python main.py --model xhf --model_args pretrained=<huggingface-token-model>,griffin=True,check=True,kernel_size=<kernel_size>,widthtree=<width_tree>,patternstrict=True,thr=0.05,mode=wallclock_70b --tasks <task_name> --batch_size 1
```
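For instance, to run the `wallclock_notree` and `wallclock_tree` modes back-to-back on the same settings (model ID, task name, and hyperparameters are illustrative placeholders):

```bash
# Run both wall-clock modes on identical settings for a side-by-side comparison (illustrative values).
for MODE in wallclock_notree wallclock_tree; do
    python main.py --model xhf \
        --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,griffin=True,check=True,kernel_size=16,widthtree=2,patternstrict=True,thr=0.05,mode=${MODE} \
        --tasks gsm8k --batch_size 1
done
```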