Model Merge reduces performance #1519

Open
mosama1994 opened this issue Jan 7, 2025 · 4 comments

Comments

@mosama1994
Contributor

I continually pretrained a Qwen2.5 0.5B model. When I load the adapter, merge it, and then run inference, the output looks good. But when I merge the adapter and save the model in 16-bit or 4-bit using the Unsloth saving methods, and then load the saved model for inference, the output is noticeably worse; performance degrades. What is the reason for that?
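Roughly the flow I'm comparing (just a sketch with placeholder names, not my exact script, and assuming the usual Unsloth API):

```python
from unsloth import FastLanguageModel

prompt = "Some evaluation prompt"

# Path A: load the base model + LoRA adapter and generate directly.
# The output here looks good.
model, tokenizer = FastLanguageModel.from_pretrained(
    "qwen2.5-0.5b-cpt-lora",     # placeholder: checkpoint dir containing the adapter
    max_seq_length = 2048,
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model)
inputs = tokenizer(prompt, return_tensors = "pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens = 64)[0]))

# Path B: merge + save with Unsloth, then reload the saved model.
# The output from the reloaded model is noticeably worse.
model.save_pretrained_merged("qwen2.5-0.5b-merged", tokenizer, save_method = "merged_16bit")
model2, tokenizer2 = FastLanguageModel.from_pretrained("qwen2.5-0.5b-merged", load_in_4bit = False)
FastLanguageModel.for_inference(model2)
inputs2 = tokenizer2(prompt, return_tensors = "pt").to(model2.device)
print(tokenizer2.decode(model2.generate(**inputs2, max_new_tokens = 64)[0]))
```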

@TechieUser2517

TechieUser2517 commented Jan 8, 2025

I've observed this too while running continued pretraining tests on various models in the 7-8B range. The merged models obtained with:

model.save_pretrained_merged(modelname, tokenizer, save_method = "merged_16bit",)

perform very poorly and also hit tokenizer issues in llama.cpp when converting them to GGUF format. Replacing the tokenizer files with those of the base model makes the conversion work (sketch below), but I suspect that's where the problems originate.

On the other hand, adapter checkpoints merged separately using other methods appear to work as expected.
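For what it's worth, the tokenizer workaround above is just overwriting the tokenizer files in the merged output folder with those of the base model before running the GGUF conversion; a minimal sketch (paths are placeholders):

```python
from transformers import AutoTokenizer

base_model = "path/or/hub-id-of-base-model"   # the base the adapter was trained on
merged_dir = "./merged-model"                 # output of save_pretrained_merged

# Overwrite the merged folder's tokenizer files with the base model's tokenizer,
# then run llama.cpp's convert_hf_to_gguf.py on merged_dir as usual.
AutoTokenizer.from_pretrained(base_model).save_pretrained(merged_dir)
```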

@danielhanchen
Contributor

I'm currently working on making merging a much better experience. Unfortunately most issues come from chat template problems, but there do seem to be some issues with merging into 4-bit models as well - I'll update you guys!

@TechieUser2517

TechieUser2517 commented Jan 10, 2025

As far as I'm aware, adapter merging should preferably be done into the original (16-bit) weights rather than the 4-bit quantized model, so if the 4-bit weights are being used in that step, that might be the reason for the performance loss.
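In other words, something along these lines with the standard peft merge path (model names and paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in 16-bit, not the 4-bit quantized variant.
base = AutoModelForCausalLM.from_pretrained(
    "path/or/hub-id-of-base-model",
    torch_dtype = torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("path/or/hub-id-of-base-model")

# Apply the LoRA adapter and fold it into the 16-bit base weights.
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
model = model.merge_and_unload()

# Save the merged 16-bit model together with the tokenizer.
model.save_pretrained("./merged-bf16")
tokenizer.save_pretrained("./merged-bf16")
```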

@mosama1994
Contributor Author

Merging at bf16 precision also causes the issue: when you merge into the bf16 model and save, the problem is still there. I used merged_16bit and also tried the peft library to merge and save. If you merge and then run inference directly, that is fine. But if you save the merged model and run inference from the saved model, the output is not good.
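A check that could help narrow this down (a rough sketch; names and paths are placeholders) is to compare the in-memory merged weights against what comes back after saving and reloading:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_id   = "Qwen/Qwen2.5-0.5B"     # base checkpoint (placeholder)
adapter   = "path/to/lora-adapter"  # placeholder adapter path
saved_dir = "./merged-bf16"

# Merge in memory (this is the variant whose outputs look fine).
merged = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(base_id, torch_dtype = torch.bfloat16),
    adapter,
).merge_and_unload()
merged.save_pretrained(saved_dir)

# Reload what was written to disk and compare tensor by tensor.
reloaded = AutoModelForCausalLM.from_pretrained(saved_dir, torch_dtype = torch.bfloat16)
ref = dict(merged.named_parameters())
for name, param in reloaded.named_parameters():
    diff = (param.detach() - ref[name].detach()).abs().max().item()
    if diff > 0:
        print(f"{name}: max abs diff {diff}")
```

If the weights match exactly but the generations still differ, that would point at the tokenizer or config being written out rather than at the merged weights themselves.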

Also, a separate question @danielhanchen: what tokenizer patching does Unsloth do? Would it be possible for you to list in a few points what happens there?
