
Cannot get the llama-2-7b-4bit quant model to run normally #62

Open
Zijie-Tian opened this issue Oct 15, 2024 · 1 comment

@Zijie-Tian

By following your BitNet documentation, I was able to get correct results. However, when I tried using the llama-2-7b model, I encountered an issue.

Based on a previous issue (#31), I chose the model ChenMnZ/Llama-2-7b-EfficientQAT-w4g128-GPTQ.

First, I compiled the T-MAC kernels using the following command and then, with cmake, installed the generated kernel.cc file to the correct path.

python compile.py -o tuned -da -nt 12 -tb -gc -ags 64 -t -m llama-2-7b-4bit -md /data/hf_models/Llama-2-7b-EfficientQAT-w4g128-GPTQ
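
For the cmake step, it was roughly the following (a sketch only; the install prefix matches the ${TMAC_PROJECT_DIR}/install path used for --kcfg below, but the exact directories in my setup may differ):

mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=${TMAC_PROJECT_DIR}/install
cmake --build . --target install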

Then, I converted the model using the following command:

python convert_hf_to_gguf.py /data/hf_models/Llama-2-7b-EfficientQAT-w4g128-GPTQ --enable-t-mac --outtype int_n --outfile /data/gguf/T-MAC/llama-2-7B.i4.gguf --kcfg ${TMAC_PROJECT_DIR}/install/lib/kcfg.ini

However, when I ran the model with the following command, the output was abnormal and the results were very strange.

./bin/llama-cli -m /data/gguf/T-MAC/llama-2-7B.i4.gguf -p "Write a resign email." -n 1024 -t 12 -ngl 0

Result: (screenshot of the garbled model output)

I would like to know where I might have gone wrong.


Additionally, I am quite curious about one more thing. If I want to adapt a new model (such as MiniCPM or llama-3.2), what specific steps should I take?

@kaleid-liner
Collaborator

Could you please try run_pipeline.py? The script will print the commands it executes (see the example invocation after the list below):

  1. Compile kernels
  2. Compile and install T-MAC
  3. Convert hf to gguf
  4. Compile llama.cpp
  5. Inference
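
Something along these lines should work (I'm assuming tools/run_pipeline.py as the script path and -o for the local model directory; check python tools/run_pipeline.py --help for the exact flags):

python tools/run_pipeline.py -o /data/hf_models/Llama-2-7b-EfficientQAT-w4g128-GPTQ -m llama-2-7b-4bit

The script prints each command before running it, so you can compare them with the manual steps you ran above.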

If I want to adapt a new model (such as MiniCPM or llama-3.2), what specific steps should I take?

You need to check if the new model is supported by llama.cpp. If the answer is yes, you just need to find a GPTQ version of the model and specify -m gptq-auto for run_pipeline.py.
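
For example, for a GPTQ checkpoint of a new model it would look roughly like this (the local path is just a placeholder; same assumed -o and -m flags as above):

python tools/run_pipeline.py -o /path/to/new-model-GPTQ -m gptq-auto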
