-
I keep encountering the following error while training...
-
The .yaml message is not an error, no. It looks like the model download failed during model creation? Not sure why it did not complain, but maybe there is no sanity check in place right now. After that it cannot read the faulty model during training. While you could theoretically download the model.safetensors manually and put it into the right directory, there is not much point to it. Just create a new model, that should be less hassle. You can delete the failed one, it sits in … It might be better though if you first download the regular, full stable diffusion model and place it in …
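For what it's worth, here is a minimal sketch (not part of sd_dreambooth_extension; the path is taken from the log below and the size threshold is just an illustrative assumption) that lists the weight files under a Dreambooth model's folder and flags anything suspiciously small, which is usually what a download that died at 8% looks like:

```python
# Hypothetical helper, not part of the extension: scan a Dreambooth model
# directory for weight files that look like partial downloads.
import os

MODEL_DIR = r"E:\stable-diffusion-webui\models\dreambooth\newmodel"  # example path from the log below
MIN_EXPECTED_BYTES = 100 * 1024 * 1024  # assumption: full weight files are normally hundreds of MB

for root, _dirs, files in os.walk(MODEL_DIR):
    for name in files:
        if name.endswith((".safetensors", ".bin", ".ckpt")):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            note = "  <-- suspiciously small, possibly a failed download" if size < MIN_EXPECTED_BYTES else ""
            print(f"{path}: {size / (1024 * 1024):.1f} MB{note}")
```

If the listed files look incomplete, deleting the whole model folder and creating the model again from the UI, as suggested above, is the simplest fix.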
-
1. During the first startup, the download of the "model.safetensors" file failed at 8%. After restarting, there was no prompt to download the file again. How can I re-download this file, and where should I place "model.safetensors"?
2. Is the message about "v1-training-default.yaml" an error?
```
Model loaded in 3.3s (load weights from disk: 1.0s, create model: 0.2s, apply weights to model: 0.3s, apply half(): 0.5s, load VAE: 0.2s, move model to device: 0.4s, load textual inversion embeddings: 0.7s).
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Model dir set to: E:\stable-diffusion-webui\models\dreambooth\newmodel
Loading model from checkpoint.
Loading ckpt...
Pred and size are epsilon and 512, using config: E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth..\configs\v1-training-default.yaml
v1 model loaded.
Trying to load: E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth..\configs\v1-training-default.yaml
Converting unet...
Converting vae...
Converting text encoder...
Saving text_encoder
Saving tokenizer
Checkpoint successfully extracted to E:\stable-diffusion-webui\models\dreambooth\newmodel\working
Restored system models.
```
3. During training, the following error occurred. How can I resolve it?
```
Initializing bucket counter!
***** Running training *****
  Num batches each epoch = 30
  Num Epochs = 150
  Batch Size Per Device = 1
  Gradient Accumulation steps = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Text Encoder Epochs: 150
  Total optimization steps = 2250
  Total training steps = 4500
  Resuming from checkpoint: False
  First resume epoch: 0
  First resume step: 0
  Lora: None, Optimizer: 8Bit Adam, Prec: fp16
  Gradient Checkpointing: True
  EMA: True
  UNET: True
  Freeze CLIP Normalization Layers: False
  LR: 1e-06
  V2: False
Generating Samples: 50%|▌| 2/4 [00:05<00:05, 2.91s/it, inst_loss=0, loss=0.115, lr=1e-6, prior_loss=0.153, vram=18.8]
Model dir set to: E:\stable-diffusion-webui\models\dreambooth\newmodel
Generating Samples: 50%|▌| 2/4 [00:04<00:04, 2.39s/it, inst_loss=0, loss=0.0028, lr=1e-6, prior_loss=0.00373, vram=15
Model dir set to: E:\stable-diffusion-webui\models\dreambooth\newmodel
Traceback (most recent call last):
  File "E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 856, in save_weights
    log_images, log_names = parse_logs(model_name=args.model_name)
  File "E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\utils\utils.py", line 404, in parse_logs
    converted_loss, converted_lr, converted_ram, merged = convert_tfevent(file_full_path)
  File "E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\utils\utils.py", line 343, in convert_tfevent
    for serialized_example in serialized_examples:
  File "E:\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 787, in __next__
    return self._next_internal()
  File "E:\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 770, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "E:\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 3043, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "E:\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 7215, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.DataLossError: {{function_node __wrapped__IteratorGetNext_output_types_1_device_/job:localhost/replica:0/task:0/device:CPU:0}} corrupted record at 46555 [Op:IteratorGetNext]
Exception parsing logz: {{function_node __wrapped__IteratorGetNext_output_types_1_device_/job:localhost/replica:0/task:0/device:CPU:0}} corrupted record at 46555 [Op:IteratorGetNext]
Steps: 7%|▎ | 328/4500 [00:08<00:04, 1028.26it/s, inst_loss=0, loss=0.0478, lr=1e-6, prior_loss=0.0638, vram=15.1]
```
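The DataLossError comes from a corrupted or truncated record in one of the TensorBoard event files the extension parses while saving weights; one common workaround is to delete or move the corrupted events.out.tfevents file (it should sit somewhere under the model's directory) and let a fresh one be written. As a rough illustration only (this is not the extension's actual parse_logs/convert_tfevent code, and the event-file path is hypothetical), a reader can tolerate a corrupted tail record like this:

```python
# Illustrative sketch, assuming TensorFlow 2.x in eager mode; not the
# extension's real code. Reads serialized records from a TensorBoard event
# file and stops at the first corrupted record instead of raising.
import tensorflow as tf

def read_tfevents_safely(event_file):
    """Yield raw event records, swallowing a corrupted/truncated tail."""
    iterator = iter(tf.data.TFRecordDataset(event_file))
    while True:
        try:
            yield next(iterator).numpy()
        except StopIteration:
            break
        except tf.errors.DataLossError:
            # The file was cut off mid-record (e.g. written while training was
            # interrupted); skip the remainder rather than crash.
            break

# Example usage with a hypothetical event file path:
# for record in read_tfevents_safely(r"E:\path\to\events.out.tfevents.12345"):
#     event = tf.compat.v1.Event.FromString(record)
#     print(event.step)
```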