Refactor: wtype per tensor from file instead of global #455
Conversation
Trying master again after having had these changes applied for a while, it feels like this reduces the loading times, but maybe there's something else affecting it. |
Well, it removes conversions. I ran into this today: I loaded Flux with a q8_0 t5 and an f16 clip, and was wondering why t5 was using f16 (including the RAM usage). Turns out sd.cpp can only have one conditioner wtype right now... |
You probably saw the embedding models. |
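(For context on the RAM point above, a rough standalone illustration of the per-weight cost of the two types, computed from ggml's type metadata. This is just a sketch, not the sd.cpp loading path:)

```cpp
// Rough illustration (assumes ggml is available): bytes per weight for the
// types mentioned above. Loading a q8_0 tensor as f16 roughly doubles RAM.
#include <cstdio>
#include "ggml.h"

static double bytes_per_element(ggml_type t) {
    // ggml stores quantized data in blocks; divide block size by block length
    return (double) ggml_type_size(t) / (double) ggml_blck_size(t);
}

int main() {
    printf("f16 : %.4f bytes/weight\n", bytes_per_element(GGML_TYPE_F16));   // 2.0
    printf("q8_0: %.4f bytes/weight\n", bytes_per_element(GGML_TYPE_Q8_0));  // ~1.06
    // e.g. a ~4.7B-parameter t5-xxl encoder: roughly 9.5 GB in f16 vs ~5 GB in q8_0
    return 0;
}
```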
Force-pushed from 6167c1a to cb46146.
I can confirm now: this PR makes loading weights much faster for larger models. (Results include warm runs only, so the models are always in the disk cache.)
I think this makes the PR worth merging. |
But how do the other performance metrics change? And does it all work? |
I think so.
Diffusion/sampling and VAE performance are within the margin of error. Prompt encoding is significantly faster when mixing quantizations. Edit: PhotoMaker (V1 and V2) works. LoRAs work too (on CPU, and without quantization on Vulkan). |
Ah, ControlNets are not working; I'll see if I can fix that. |
I think I've got pretty much everything working at least as well as it did before this refactoring now. If anyone notices something I might have missed, let me know. |
It introduces a tensor types map to guide the ggml type of each tensor, but I saw the conversion already filters out these fixed tensors; is this an overkill implementation? https://github.com/leejet/stable-diffusion.cpp/blob/master/model.cpp#L1873-L1902 The root cause is using one global wtype; can we just refactor that part instead? |
That's a fair point. It would indeed probably improve the loading times in the same way without refactoring the whole thing. But the point of this PR was initially to refactor the model loading logic; the improvement in loading time for conditioning models is just a nice side effect. My original motivation for doing this refactor was to better support models with mixed quantization types (like those made with https://github.com/city96/ComfyUI-GGUF/blob/main/tools/convert.py or with #447). Now it also makes it possible to implement #490 using the keys of the same tensor types map. |
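(A minimal sketch of the per-tensor type map idea being discussed, with illustrative names only; the PR's actual implementation may differ:)

```cpp
// Illustrative sketch only: keep the ggml type found in the file for every
// tensor instead of forcing one global wtype per model component.
#include <map>
#include <string>
#include "ggml.h"

// tensor name -> type as stored in the checkpoint
// (e.g. read from the gguf/safetensors headers while scanning the file)
using tensor_types_map = std::map<std::string, ggml_type>;

// Pick the type to allocate for a tensor: prefer what the file says,
// fall back to a default only when the tensor is not in the map.
static ggml_type resolve_type(const tensor_types_map& types,
                              const std::string& name,
                              ggml_type fallback) {
    auto it = types.find(name);
    return it != types.end() ? it->second : fallback;
}
```

With such a map, a q8_0 t5 tensor stays q8_0 and an f16 clip tensor stays f16, instead of everything being coerced to a single conditioner wtype.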
I think the loading time optimization is just because you use the quantized t5 model, but get_conditioner_wtype doesn't recognize the quantized type correctly. |
I'm not sure if it makes a significant difference yet.
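(A hypothetical sketch of the kind of check that would let a wtype probe notice quantized conditioner tensors; the helper name and the tensor-name prefix are assumptions, not sd.cpp's actual get_conditioner_wtype:)

```cpp
// Hypothetical helper, not sd.cpp's get_conditioner_wtype: scan the
// conditioner's tensor types and report the first quantized type found,
// so a q8_0 t5 is not silently treated as f16.
#include <map>
#include <string>
#include "ggml.h"

static ggml_type guess_conditioner_wtype(
        const std::map<std::string, ggml_type>& tensor_types) {
    for (const auto& [name, type] : tensor_types) {
        // only look at text-encoder weights; this prefix is an assumption
        if (name.rfind("text_encoders.", 0) == 0 && ggml_is_quantized(type)) {
            return type;
        }
    }
    return GGML_TYPE_F16;  // fallback assumption when nothing quantized is found
}
```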