Flux - torchao inference not working #10470
Comments
Thanks for reporting! Since layout_tensor was made an internal private attribute in TorchAO in version 0.7.0, it seems like we need to update how we handle it in accelerate (which is what's used for sequential offloading). I'll open a fix soon |
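For context, a minimal sketch (assembled from the snippets later in this thread, not the exact reproduction) of the failing path: a torchao-quantized Flux transformer used with sequential CPU offloading, which diffusers delegates to accelerate. Model id and prompt are illustrative.

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=TorchAoConfig("int8wo"),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Sequential offloading is handled by accelerate under the hood; per the comment
# above, this is the path that touched the now-private layout_tensor attribute.
pipe.enable_sequential_cpu_offload()

image = pipe("a cat holding a sign", num_inference_steps=28, guidance_scale=3.5).images[0]
```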
Thank you for looking into this. I also tried saving the quantized model locally, and there is an error.
Posting here as both issues are related to the same model and quantization. |
Same error here.
I try to load the Flux pipeline on CPU and quantize the transformer and text_encoder_2, but then the error happens in pipe.to('cuda'). Because my GPU only has 24 GB of VRAM, I can't load the pipeline onto the GPU and then quantize the model. So I saved the quantized pipeline and get the same error as above. I use torchao 0.7.0 and diffusers from the main branch. |
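The exact script isn't shown above, but a minimal sketch of the workflow being described might look like this (assuming torchao's `quantize_` API is applied directly to the pipeline components; the model id is illustrative):

```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, int8_weight_only

# Load the full pipeline on CPU first (a 24 GB GPU cannot hold the unquantized model).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Quantize the two largest components in place while they are still on CPU.
quantize_(pipe.transformer, int8_weight_only())
quantize_(pipe.text_encoder_2, int8_weight_only())

# Moving the quantized pipeline to the GPU is where the reported error occurs.
pipe.to("cuda")
```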
@nitinmukesh For this comment, you need to pass …
Regarding the sequential offloading issue, I'm opening a PR to accelerate shortly. |
@nitinmukesh Could you try installing accelerate from this branch and see if it fixes the inference? It's working for me now. |
@zhangvia Cannot seem to replicate. Could you share the output of your environment?

Env:
- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
- Jax version: 0.4.31
- JaxLib version: 0.4.31
- Huggingface_hub version: 0.26.2
- Transformers version: 4.48.0.dev0
- Accelerate version: 1.1.0.dev0
- PEFT version: 0.13.3.dev0
- Bitsandbytes version: 0.43.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

model_id = "black-forest-labs/Flux.1-Dev"
dtype = torch.bfloat16
quantization_config = TorchAoConfig("int8wo")

transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
    cache_dir="/raid/.cache/huggingface",
)
transformer.save_pretrained(
    "/raid/aryan/flux-transformer-int8wo", max_shard_size="100GB", safe_serialization=False
)
```

This is the code I'm using for testing. LMK if this should be different |
Thank you @a-r-r-o-w. I will verify both issues and let you know. |
@a-r-r-o-w

```
2025-01-10 01:45:20.889200: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-10 01:45:20.902103: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736473520.920618 5981 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736473520.925624 5981 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-10 01:45:20.942390: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
```

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.2.14-050214-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.0
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.24.3
- Transformers version: 4.46.3
- Accelerate version: 1.2.1
- PEFT version: 0.13.2
- Bitsandbytes version: 0.44.1
- Safetensors version: 0.4.5
- xFormers version: 0.0.28.post3
- Accelerator: NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
I use the exact script you pasted, but I changed the model to the Flux-Fill model.
The same error occurs here:

```
2025-01-10 01:43:12.589355: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-10 01:43:12.835107: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736473392.952239 5850 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736473392.978978 5850 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-10 01:43:13.157551: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/torchao/utils.py:434: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return func(*args, **kwargs)
Traceback (most recent call last):
File "/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/huggingface_hub/serialization/_torch.py", line 406, in storage_ptr
return tensor.untyped_storage().data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/media/74nvme/research/test.py", line 269, in <module>
transformer.save_pretrained("/media/74nvme/checkpoints/flux/flux-znwtryon-fp8/transformer", max_shard_size="100GB", safe_serialization=False)
File "/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 406, in save_pretrained
state_dict_split = split_torch_state_dict_into_shards(
File "/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/huggingface_hub/serialization/_torch.py", line 330, in split_torch_state_dict_into_shards
return split_state_dict_into_shards_factory(
File "/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/huggingface_hub/serialization/_base.py", line 108, in split_state_dict_into_shards_factory
storage_id = get_storage_id(tensor)
File "/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/huggingface_hub/serialization/_torch.py", line 359, in get_torch_storage_id
unique_id = storage_ptr(tensor)
File "/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/huggingface_hub/serialization/_torch.py", line 410, in storage_ptr
return tensor.storage().data_ptr()
File "/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/torch/storage.py", line 1224, in data_ptr
return self._data_ptr()
File "/media/74nvme/software/miniconda3/envs/comfyui/lib/python3.10/site-packages/torch/storage.py", line 1228, in _data_ptr
return self._untyped_storage.data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.
```
|
I have verified inference; it is working now. I will now test saving the quantized model or pipe.enable_model_cpu_offload().
Unfortunately, even with different combinations of num_inference_steps and guidance_scale, the quality is very bad. Not sure if it has to do with quantization or something else. I verified using Forge (without quantization), and the output with the same settings is good. |
Test 2: Completed testing of saving the quantized model using torchao.
Next test: load the quantized model and use it directly. Is any sample code available for how to load a quantized model? |
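For reference, a rough, untested sketch of loading the transformer saved by the snippet earlier in the thread. The local path is the one used there and purely illustrative; `use_safetensors=False` matches the `safe_serialization=False` used at save time.

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

# Load the previously saved, already-quantized transformer from disk.
transformer = FluxTransformer2DModel.from_pretrained(
    "/raid/aryan/flux-transformer-int8wo",
    torch_dtype=torch.bfloat16,
    use_safetensors=False,  # the checkpoint was saved with safe_serialization=False
)
# Plug it into the full pipeline in place of the default transformer.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe("a cat holding a sign", num_inference_steps=28, guidance_scale=3.5).images[0]
image.save("output.png")
```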
Describe the bug
Reproduction
Example taken from (merged) #10009
Logs
(with cpu offload)
System Info
Windows 11
Who can help?
No response