11:29:09-280078 INFO Found 1 legal dataset
11:29:25-631481 INFO Wrote promopts to file
D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250113-112909-promopt.txt
11:29:25-639480 INFO Training started with config file / 训练开始,使用配置文件:
D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250113-112909.toml
11:29:25-648481 INFO Using GPU(s) / 使用 GPU: ['0', '1', '2']
11:29:25-652481 INFO Task 5c92af4b-71c1-48d1-a356-ecf9270c1918 created
W0113 11:29:28.772000 18304 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
W0113 11:29:30.883000 18304 torch\distributed\run.py:771] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
2025-01-13 11:29:41 INFO Loading settings from train_util.py:3745
D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
113-112909.toml...
2025-01-13 11:29:41 INFO Loading settings from train_util.py:3745
D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
113-112909.toml...
2025-01-13 11:29:41 INFO Loading settings from train_util.py:3745
D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
113-112909.toml...
INFO D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
113-112909
INFO D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
113-112909
INFO D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
113-112909
2025-01-13 11:29:41 INFO prepare tokenizer train_util.py:4228
2025-01-13 11:29:41 INFO prepare tokenizer train_util.py:4228
2025-01-13 11:29:41 INFO prepare tokenizer train_util.py:4228
2025-01-13 11:29:42 INFO update token length: 255 train_util.py:4245
INFO Using DreamBooth method. train_network.py:172
2025-01-13 11:29:42 INFO update token length: 255 train_util.py:4245
INFO Using DreamBooth method. train_network.py:172
2025-01-13 11:29:42 INFO update token length: 255 train_util.py:4245
INFO Using DreamBooth method. train_network.py:172
INFO prepare images. train_util.py:1573
INFO found directory train_util.py:1520
D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
e contains 10 image files
INFO prepare images. train_util.py:1573
INFO 100 train images with repeating. train_util.py:1614
INFO 0 reg images. train_util.py:1617
INFO found directory train_util.py:1520
D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
e contains 10 image files
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1622
INFO 100 train images with repeating. train_util.py:1614
INFO 0 reg images. train_util.py:1617
INFO prepare images. train_util.py:1573
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1622
INFO [Dataset 0] config_util.py:565
batch_size: 1
resolution: (512, 768)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: True
[Subset 0 of Dataset 0]
image_dir:
"D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
le"
image_count: 10
num_repeats: 10
shuffle_caption: True
keep_tokens: 0
keep_tokens_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: people
caption_extension: .txt
INFO [Dataset 0] config_util.py:571
INFO found directory train_util.py:1520
D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
e contains 10 image files
INFO loading image sizes. train_util.py:854
INFO [Dataset 0] config_util.py:565
batch_size: 1
resolution: (512, 768)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: True
[Subset 0 of Dataset 0]
image_dir:
"D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
le"
image_count: 10
num_repeats: 10
shuffle_caption: True
keep_tokens: 0
keep_tokens_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: people
caption_extension: .txt
INFO 100 train images with repeating. train_util.py:1614
INFO [Dataset 0] config_util.py:571
INFO 0 reg images. train_util.py:1617
0%| | 0/10 [00:00<?, ?it/s] INFO loading image sizes. train_util.py:854
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9984.06it/s]
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1622
INFO make buckets train_util.py:860
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9957.99it/s]
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is train_util.py:877
set, because bucket reso is defined by image size automatically /
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO make buckets train_util.py:860
INFO [Dataset 0] config_util.py:565
batch_size: 1
resolution: (512, 768)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: True
[Subset 0 of Dataset 0]
image_dir:
"D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
le"
image_count: 10
num_repeats: 10
shuffle_caption: True
keep_tokens: 0
keep_tokens_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: people
caption_extension: .txt
INFO number of images (including repeats) / train_util.py:906
各bucketの画像枚数(繰り返し回数を含む)
INFO [Dataset 0] config_util.py:571
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is train_util.py:877
set, because bucket reso is defined by image size automatically /
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO bucket 0: resolution (384, 1024), count: 10 train_util.py:911
INFO loading image sizes. train_util.py:854
INFO bucket 1: resolution (448, 768), count: 10 train_util.py:911
INFO number of images (including repeats) / train_util.py:906
各bucketの画像枚数(繰り返し回数を含む)
INFO bucket 2: resolution (448, 832), count: 40 train_util.py:911
INFO bucket 0: resolution (384, 1024), count: 10 train_util.py:911
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9974.56it/s]
INFO bucket 3: resolution (512, 704), count: 30 train_util.py:911
INFO bucket 1: resolution (448, 768), count: 10 train_util.py:911
INFO make buckets train_util.py:860
INFO bucket 4: resolution (576, 576), count: 10 train_util.py:911
INFO bucket 2: resolution (448, 832), count: 40 train_util.py:911
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is train_util.py:877
set, because bucket reso is defined by image size automatically /
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO mean ar error (without repeats): 0.01810373870920746 train_util.py:916
INFO bucket 3: resolution (512, 704), count: 30 train_util.py:911
INFO bucket 4: resolution (576, 576), count: 10 train_util.py:911
INFO number of images (including repeats) / train_util.py:906
各bucketの画像枚数(繰り返し回数を含む)
INFO preparing accelerator train_network.py:225
INFO mean ar error (without repeats): 0.01810373870920746 train_util.py:916
INFO bucket 0: resolution (384, 1024), count: 10 train_util.py:911
INFO bucket 1: resolution (448, 768), count: 10 train_util.py:911
INFO bucket 2: resolution (448, 832), count: 40 train_util.py:911
INFO preparing accelerator train_network.py:225
INFO bucket 3: resolution (512, 704), count: 30 train_util.py:911
INFO bucket 4: resolution (576, 576), count: 10 train_util.py:911
INFO mean ar error (without repeats): 0.01810373870920746 train_util.py:916
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - ??????,?????????).
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - ??????,?????????).
INFO preparing accelerator train_network.py:225
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - ??????,?????????).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - ???????????????????????????,???????).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - ???????????????????????????,???????).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - ???????????????????????????,???????).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
Traceback (most recent call last):
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 1115, in <module>
trainer.train(args)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 226, in train
accelerator = train_util.prepare_accelerator(args)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\library\train_util.py", line 4307, in prepare_accelerator
accelerator = Accelerator(
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\accelerator.py", line 383, in __init__
self.state = AcceleratorState(
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 846, in __init__
PartialState(cpu, **kwargs)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 211, in __init__
torch.distributed.init_process_group(backend=self.backend, **kwargs)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
func_return = func(*args, **kwargs)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
return TCPStore(
torch.distributed.DistNetworkError: The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018). The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - ???????????????????????????,???????).
E0113 11:30:28.357000 18304 torch\distributed\elastic\multiprocessing\api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 15808) of binary: D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\python.exe
Traceback (most recent call last):
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1116, in <module>
main()
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1112, in main
launch_command(args)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\run.py", line 892, in run
elastic_launch(
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./scripts/stable/train_network.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-13_11:30:28
host : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 18976)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2025-01-13_11:30:28
host : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 8328)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-13_11:30:28
host : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15808)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
11:30:28-815865 ERROR Training failed / 训练失败
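The log shows every rank failing to open a TCP connection to the machine's own host name on port 62018, first with Winsock error 10049 (address not valid in its context) and then with 10060 (connection timed out). Below is a minimal diagnostic sketch, assuming only Python's standard `socket` module, for checking whether a host name resolves and whether a given host and port accept TCP connections. The function names `resolve_host` and `can_connect` are hypothetical helpers written for this illustration; they are not part of lora-scripts, accelerate, or PyTorch, and the sketch does not by itself fix the failure.

```python
import socket


def resolve_host(host: str) -> list[str]:
    """Return the IP addresses the host name resolves to (empty list if none)."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name resolution failed.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution errors.
        return False
```

Calling `resolve_host(socket.gethostname())` on the training machine shows which addresses the host name from the log maps to. `can_connect` is only meaningful while something is listening on the port, and the port number in the log (62018) belongs to that particular launch, so it would need to be checked during a run.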
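The above is my error output. Training runs normally when I use a single GPU, but it fails when I use multiple GPUs. My training parameters are as follows: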