Optimization Roundup - Discuss <=12GB VRAM Optimization/settings here! #77
-
So, I have a PR here: AUTOMATIC1111/stable-diffusion-webui#4527, which would let us use "accelerate launch" to run the app. I think that might be the secret sauce to getting this working on 8GB of VRAM. You also want to disable training the text encoder. I still need to test this on my 8GB GPU to see if it's possible, but if Shivam's repo can run on 8GB, then this should work too. Option B is "Use CPU", which will work, but very slowly.
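For anyone who wants to see what "disable training the text encoder" amounts to, here is a minimal sketch of a Dreambooth-style setup under Accelerate, assuming a diffusers-format checkpoint; the model id and hyperparameters are illustrative, not the extension's actual code:

```python
# Minimal sketch (illustrative, not the extension's code): fp16 training via
# Accelerate with the text encoder frozen so only the UNet gets gradients.
import torch
from accelerate import Accelerator
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

accelerator = Accelerator(mixed_precision="fp16")

model_id = "runwayml/stable-diffusion-v1-5"  # assumed diffusers-format base model
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

text_encoder.requires_grad_(False)  # "disable training the text encoder"
optimizer = torch.optim.AdamW(unet.parameters(), lr=2e-6)

# Only the UNet and its optimizer are handed to Accelerate for training;
# the frozen text encoder is just moved to the device for inference.
unet, optimizer = accelerator.prepare(unet, optimizer)
text_encoder.to(accelerator.device)
```

Run the training script with `accelerate launch` instead of plain `python`, and Accelerate picks up device placement and mixed precision from its config.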
-
Related to:
-
I have a 3060 12GB card (and only 12GB of system RAM, with 25GB set aside for my paging file), and I successfully ran Dreambooth training this morning for the first time. Here were my settings under Advanced: I am running xformers. I was even able to use a classification dataset and txt files for labels on my dataset. I did not save any checkpoints or generate any previews during training (I'm not sure whether that would cause an OOM, so more testing is needed to see if I could enable them). Let me know if you have any questions about my setup, and I'll try to help.
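For reference, the xformers setting mentioned above corresponds roughly to memory-efficient attention in diffusers; a hedged sketch, assuming a diffusers version that exposes the method and the xformers package installed (the model id is only an example):

```python
# Hedged sketch: enabling xformers memory-efficient attention on the UNet.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # illustrative model id
)
unet.enable_xformers_memory_efficient_attention()  # needs xformers installed
```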
-
I have a GPU with 4GB of VRAM, and I would like to know if it's possible to train the model on CPU only, because despite checking "use CPU only", I get a CUDA out of memory error when it's time to make the ckpt file. Is this intended, or is it an issue? I just want to know so that I don't keep trying if there is no way :)
-
I was able to train on a 10GB 3080. Documentation of the steps and settings I used can be found here: #84 (comment). It seems that generating a checkpoint every N steps and a preview image every N steps does not clean up memory as thoroughly as the "final" preview image after training ends. So training 100 steps, then another 100 steps, then another 100 works, but getting a preview every 100 steps does not (OOM crash after the first checkpoint).
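If the difference really is how thoroughly the mid-training previews clean up after themselves, the cleanup the "final" preview presumably gets would look something like this between chunks; a hedged sketch of the general pattern, not the extension's code (`preview_pipe` is a stand-in for whatever temporary sampling pipeline made the image):

```python
import gc
import torch

# ...after generating a mid-training preview with a temporary pipeline...
preview_pipe = None       # drop the only reference to the sampling pipeline
gc.collect()              # let Python actually free the model tensors
torch.cuda.empty_cache()  # hand cached-but-unused blocks back to the driver
```

Note that `torch.cuda.empty_cache()` only releases memory nothing references anymore, so a preview pipeline kept alive between steps would still hold its VRAM.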
-
No, I disabled the bit that unloaded optimizations before training.
On Sat, Nov 12, 2022, 11:53 AM Jonseed wrote:
> @drnagel glad you got it working! I wasn't sure if xformers would make a difference. I thought such optimizations were unloaded during training, but maybe not.
-
Something seems off. Before the "Generate ckpt" button was added, I managed to run Dreambooth on a 3080 10GB on Linux Mint (Nvidia drivers 520.56.06, CUDA version 11.8) with the recommended VRAM optimizations listed above. Training completed with 1000 steps but crashed with an OOM at that point, and no ckpt file was generated. Since I saw the button added, I can no longer even begin training; it goes straight out of VRAM. I've tried all the tips I've seen here, to no avail. Edit to add settings:
Version info from starting Stable Diffusion: Launching Web UI with arguments: --autolaunch --xformers --ckpt-dir /evo/sd-resources/checkpoints
-
By design, the extension should clear all prior VRAM usage before training, and then restore SD back to "normal" when training is complete. Similarly, someone somewhere was talking about killing their web browser to save VRAM, but I think the VRAM used for things like browser and desktop windows comes from "shared" VRAM space, while things like Torch computations live in "reserved" VRAM. In a nutshell, unless you're running something like video editing software, CAD, or a 3D video game, I don't think other running software will have an effect on OOM. Of course, I'm also not a hardware engineer, so I could be completely wrong.
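One way to check that intuition is simply to ask Torch what it is actually holding; a quick sketch:

```python
import torch

# How much this process has allocated vs. what the caching allocator has
# reserved from the driver, plus the card's total capacity (all in GiB).
allocated = torch.cuda.memory_allocated() / 2**30
reserved = torch.cuda.memory_reserved() / 2**30
total = torch.cuda.get_device_properties(0).total_memory / 2**30
print(f"allocated {allocated:.2f} / reserved {reserved:.2f} / total {total:.2f} GiB")
```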
-
Does anyone have experience with Colossal AI? https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion
-
Related to #124. On that note, I've added a "wizard" button that attempts to set the "optimal" settings based on the total amount of available VRAM. It's not perfect, but if anybody wants to contribute their "VRAM Total" and the settings they used to train without OOM, it would help me help others set params to avoid the most common issue with this bit of software. :D
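The idea behind such a wizard is roughly the following; the thresholds and setting names here are invented for illustration and are not the extension's actual defaults:

```python
import torch

def suggest_settings():
    """Pick conservative training settings from total VRAM (illustrative thresholds only)."""
    total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
    if total_gib >= 24:
        return {"use_8bit_adam": False, "mixed_precision": "no", "cache_latents": True}
    if total_gib >= 12:
        return {"use_8bit_adam": True, "mixed_precision": "fp16", "cache_latents": True}
    # 8-10GB cards: every memory optimization on, nothing cached in VRAM.
    return {"use_8bit_adam": True, "mixed_precision": "fp16", "cache_latents": False}
```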
-
I'd be interested in the EMA training option if it makes the model better. Is there any chance it will fit on <=12GB VRAM cards? I'm going over the limit by about 512 MB, so I'm waiting to see if it gets optimized.
-
Hot off the presses!
-
Something that is not in the hundreds. When I had the simpler version of the training, I could start and continue any number of training sessions without xformers. This is what I saw at the end of the training:
Training complete?? Cleanup Complete.
Steps: 100%|███████| 1000/1000 [23:12<00:00, 1.39s/it, loss=0.245, lr=1.22e-7]
-
My five cents on how I've managed to run it on a 3060 Ti with 8GB VRAM.
To sum up, here is a step-by-step guide that worked in my case:
-
First, big thanks to all contributors! You have made something impressive!
I hope you can optimize it a tiny bit more. I have a 2080 Super with 8GB VRAM and almost made it on Win11:
"Tried to allocate 58.00 MiB (GPU 0; 8.00 GiB total capacity;" :(
--xformers
[v] Don't cache latents
[v] Use 8bit Adam
Mixed Precision: tested both fp16 and "none"
Any chance?
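For context, the "8bit Adam" and fp16 options above map roughly onto bitsandbytes and Torch autocast; a hedged sketch where `unet`, `compute_loss`, and `batch` stand in for whatever the training loop provides:

```python
import bitsandbytes as bnb
import torch

# Store the Adam optimizer states in 8 bits instead of 32, shrinking their VRAM footprint.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=2e-6)

scaler = torch.cuda.amp.GradScaler()
with torch.autocast("cuda", dtype=torch.float16):  # fp16 mixed precision
    loss = compute_loss(batch)                      # placeholder loss computation
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```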