-
I keep encountering the following error while training...
-
The .yaml message is not an error, no. It looks like the model download failed during model creation? Not sure why it did not complain, but maybe there is no sanity check in place right now. After that it cannot read the faulty model during training. While you could theoretically download the model.safetensors manually and put it into the right directory, there is not much point to it. Just create a new model, that should be less hassle. You can delete the failed one, it sits in … It might be better though if you first download the regular, full stable diffusion model and place it in …
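For what it's worth, here is a minimal sketch (not part of sd_dreambooth_extension; the path is taken from the log below and the size threshold is just an illustrative assumption) that lists the weight files under a Dreambooth model's folder and flags anything suspiciously small, which is usually what a download that died at 8% looks like:

```python
# Hypothetical helper, not part of the extension: scan a Dreambooth model
# directory for weight files that look like partial downloads.
import os

MODEL_DIR = r"E:\stable-diffusion-webui\models\dreambooth\newmodel"  # example path from the log below
MIN_EXPECTED_BYTES = 100 * 1024 * 1024  # assumption: full weight files are normally hundreds of MB

for root, _dirs, files in os.walk(MODEL_DIR):
    for name in files:
        if name.endswith((".safetensors", ".bin", ".ckpt")):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            note = "  <-- suspiciously small, possibly a failed download" if size < MIN_EXPECTED_BYTES else ""
            print(f"{path}: {size / (1024 * 1024):.1f} MB{note}")
```

If the listed files look incomplete, deleting the whole model folder and creating the model again from the UI, as suggested above, is the simplest fix.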
-
1. During the first startup, the download of the "model.safetensors" file failed at 8%. After restarting, there was no prompt to download the file again. How can I re-download this file, and where should I place "model.safetensors"?
2. Is the message about "v1-training-default.yaml" an error?
```
Model loaded in 3.3s (load weights from disk: 1.0s, create model: 0.2s, apply weights to model: 0.3s, apply half(): 0.5s, load VAE: 0.2s, move model to device: 0.4s, load textual inversion embeddings: 0.7s).
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Model dir set to: E:\stable-diffusion-webui\models\dreambooth\newmodel
Loading model from checkpoint.
Loading ckpt...
Pred and size are epsilon and 512, using config: E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth..\configs\v1-training-default.yaml
v1 model loaded.
Trying to load: E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth..\configs\v1-training-default.yaml
Converting unet...
Converting vae...
Converting text encoder...
Saving text_encoder
Saving tokenizer
Checkpoint successfully extracted to E:\stable-diffusion-webui\models\dreambooth\newmodel\working
Restored system models.
```
3. During training, the following error occurred. How can I resolve it?
```
Initializing bucket counter!
***** Running training *****
  Num batches each epoch = 30
  Num Epochs = 150
  Batch Size Per Device = 1
  Gradient Accumulation steps = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Text Encoder Epochs: 150
  Total optimization steps = 2250
  Total training steps = 4500
  Resuming from checkpoint: False
  First resume epoch: 0
  First resume step: 0
  Lora: None, Optimizer: 8Bit Adam, Prec: fp16
  Gradient Checkpointing: True
  EMA: True
  UNET: True
  Freeze CLIP Normalization Layers: False
  LR: 1e-06
  V2: False
Generating Samples: 50%|▌| 2/4 [00:05<00:05, 2.91s/it, inst_loss=0, loss=0.115, lr=1e-6, prior_loss=0.153, vram=18.8]
Model dir set to: E:\stable-diffusion-webui\models\dreambooth\newmodel
Generating Samples: 50%|▌| 2/4 [00:04<00:04, 2.39s/it, inst_loss=0, loss=0.0028, lr=1e-6, prior_loss=0.00373, vram=15
Model dir set to: E:\stable-diffusion-webui\models\dreambooth\newmodel
Traceback (most recent call last):
  File "E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 856, in save_weights
    log_images, log_names = parse_logs(model_name=args.model_name)
  File "E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\utils\utils.py", line 404, in parse_logs
    converted_loss, converted_lr, converted_ram, merged = convert_tfevent(file_full_path)
  File "E:\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\utils\utils.py", line 343, in convert_tfevent
    for serialized_example in serialized_examples:
  File "E:\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 787, in __next__
    return self._next_internal()
  File "E:\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 770, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "E:\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 3043, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "E:\stable-diffusion-webui\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 7215, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.DataLossError: {{function_node __wrapped__IteratorGetNext_output_types_1_device_/job:localhost/replica:0/task:0/device:CPU:0}} corrupted record at 46555 [Op:IteratorGetNext]
Exception parsing logz: {{function_node __wrapped__IteratorGetNext_output_types_1_device_/job:localhost/replica:0/task:0/device:CPU:0}} corrupted record at 46555 [Op:IteratorGetNext]
Steps: 7%|▎ | 328/4500 [00:08<00:04, 1028.26it/s, inst_loss=0, loss=0.0478, lr=1e-6, prior_loss=0.0638, vram=15.1]
```
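The DataLossError comes from a corrupted or truncated record in one of the TensorBoard event files the extension parses while saving weights; one common workaround is to delete or move the corrupted events.out.tfevents file (it should sit somewhere under the model's directory) and let a fresh one be written. As a rough illustration only (this is not the extension's actual parse_logs/convert_tfevent code, and the event-file path is hypothetical), a reader can tolerate a corrupted tail record like this:

```python
# Illustrative sketch, assuming TensorFlow 2.x in eager mode; not the
# extension's real code. Reads serialized records from a TensorBoard event
# file and stops at the first corrupted record instead of raising.
import tensorflow as tf

def read_tfevents_safely(event_file):
    """Yield raw event records, swallowing a corrupted/truncated tail."""
    iterator = iter(tf.data.TFRecordDataset(event_file))
    while True:
        try:
            yield next(iterator).numpy()
        except StopIteration:
            break
        except tf.errors.DataLossError:
            # The file was cut off mid-record (e.g. written while training was
            # interrupted); skip the remainder rather than crash.
            break

# Example usage with a hypothetical event file path:
# for record in read_tfevents_safely(r"E:\path\to\events.out.tfevents.12345"):
#     event = tf.compat.v1.Event.FromString(record)
#     print(event.step)
```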