Multi-GPU training errors out, single-GPU training works fine #612

Open
lidisi8520 opened this issue Jan 13, 2025 · 0 comments

11:29:09-280078 INFO     Found 1 legal dataset
11:29:25-631481 INFO     Wrote promopts to file
                         D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250113-112909-promopt.txt
11:29:25-639480 INFO     Training started with config file / 训练开始,使用配置文件:
                         D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250113-112909.toml
11:29:25-648481 INFO     Using GPU(s) / 使用 GPU: ['0', '1', '2']
11:29:25-652481 INFO     Task 5c92af4b-71c1-48d1-a356-ecf9270c1918 created
W0113 11:29:28.772000 18304 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
W0113 11:29:30.883000 18304 torch\distributed\run.py:771] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
2025-01-13 11:29:41 INFO     Loading settings from                                                    train_util.py:3745
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
                             113-112909.toml...
2025-01-13 11:29:41 INFO     Loading settings from                                                    train_util.py:3745
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
                             113-112909.toml...
2025-01-13 11:29:41 INFO     Loading settings from                                                    train_util.py:3745
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
                             113-112909.toml...
                    INFO     D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
                             113-112909
                    INFO     D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
                             113-112909
                    INFO     D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
                             113-112909
2025-01-13 11:29:41 INFO     prepare tokenizer                                                        train_util.py:4228
2025-01-13 11:29:41 INFO     prepare tokenizer                                                        train_util.py:4228
2025-01-13 11:29:41 INFO     prepare tokenizer                                                        train_util.py:4228
2025-01-13 11:29:42 INFO     update token length: 255                                                 train_util.py:4245
                    INFO     Using DreamBooth method.                                               train_network.py:172
2025-01-13 11:29:42 INFO     update token length: 255                                                 train_util.py:4245
                    INFO     Using DreamBooth method.                                               train_network.py:172
2025-01-13 11:29:42 INFO     update token length: 255                                                 train_util.py:4245
                    INFO     Using DreamBooth method.                                               train_network.py:172
                    INFO     prepare images.                                                          train_util.py:1573
                    INFO     found directory                                                          train_util.py:1520
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
                             e contains 10 image files
                    INFO     prepare images.                                                          train_util.py:1573
                    INFO     100 train images with repeating.                                         train_util.py:1614
                    INFO     0 reg images.                                                            train_util.py:1617
                    INFO     found directory                                                          train_util.py:1520
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
                             e contains 10 image files
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1622
                    INFO     100 train images with repeating.                                         train_util.py:1614
                    INFO     0 reg images.                                                            train_util.py:1617
                    INFO     prepare images.                                                          train_util.py:1573
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1622
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (512, 768)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 1024
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir:
                             "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
                             le"
                                 image_count: 10
                                 num_repeats: 10
                                 shuffle_caption: True
                                 keep_tokens: 0
                                 keep_tokens_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: people
                                 caption_extension: .txt


                    INFO     [Dataset 0]                                                              config_util.py:571
                    INFO     found directory                                                          train_util.py:1520
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
                             e contains 10 image files
                    INFO     loading image sizes.                                                      train_util.py:854
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (512, 768)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 1024
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir:
                             "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
                             le"
                                 image_count: 10
                                 num_repeats: 10
                                 shuffle_caption: True
                                 keep_tokens: 0
                                 keep_tokens_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: people
                                 caption_extension: .txt


                    INFO     100 train images with repeating.                                         train_util.py:1614
                    INFO     [Dataset 0]                                                              config_util.py:571
                    INFO     0 reg images.                                                            train_util.py:1617
  0%|                                                                                           | 0/10 [00:00<?, ?it/s]
                    INFO     loading image sizes.                                                      train_util.py:854
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9984.06it/s]
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1622
                    INFO     make buckets                                                              train_util.py:860
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9957.99it/s]
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:877
                             set, because bucket reso is defined by image size automatically /
                             bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                             算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                    INFO     make buckets                                                              train_util.py:860
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (512, 768)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 1024
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir:
                             "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
                             le"
                                 image_count: 10
                                 num_repeats: 10
                                 shuffle_caption: True
                                 keep_tokens: 0
                                 keep_tokens_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: people
                                 caption_extension: .txt


                    INFO     number of images (including repeats) /                                    train_util.py:906
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     [Dataset 0]                                                              config_util.py:571
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:877
                             set, because bucket reso is defined by image size automatically /
                             bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                             算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                    INFO     bucket 0: resolution (384, 1024), count: 10                               train_util.py:911
                    INFO     loading image sizes.                                                      train_util.py:854
                    INFO     bucket 1: resolution (448, 768), count: 10                                train_util.py:911
                    INFO     number of images (including repeats) /                                    train_util.py:906
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     bucket 2: resolution (448, 832), count: 40                                train_util.py:911
                    INFO     bucket 0: resolution (384, 1024), count: 10                               train_util.py:911
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9974.56it/s]
                    INFO     bucket 3: resolution (512, 704), count: 30                                train_util.py:911
                    INFO     bucket 1: resolution (448, 768), count: 10                                train_util.py:911
                    INFO     make buckets                                                              train_util.py:860
                    INFO     bucket 4: resolution (576, 576), count: 10                                train_util.py:911
                    INFO     bucket 2: resolution (448, 832), count: 40                                train_util.py:911
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:877
                             set, because bucket reso is defined by image size automatically /
                             bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                             算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                    INFO     mean ar error (without repeats): 0.01810373870920746                      train_util.py:916
                    INFO     bucket 3: resolution (512, 704), count: 30                                train_util.py:911
                    INFO     bucket 4: resolution (576, 576), count: 10                                train_util.py:911
                    INFO     number of images (including repeats) /                                    train_util.py:906
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     preparing accelerator                                                  train_network.py:225
                    INFO     mean ar error (without repeats): 0.01810373870920746                      train_util.py:916
                    INFO     bucket 0: resolution (384, 1024), count: 10                               train_util.py:911
                    INFO     bucket 1: resolution (448, 768), count: 10                                train_util.py:911
                    INFO     bucket 2: resolution (448, 832), count: 40                                train_util.py:911
                    INFO     preparing accelerator                                                  train_network.py:225
                    INFO     bucket 3: resolution (512, 704), count: 30                                train_util.py:911
                    INFO     bucket 4: resolution (576, 576), count: 10                                train_util.py:911
                    INFO     mean ar error (without repeats): 0.01810373870920746                      train_util.py:916
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - The requested address is not valid in its context.).
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - The requested address is not valid in its context.).
                    INFO     preparing accelerator                                                  train_network.py:225
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - The requested address is not valid in its context.).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\library\train_util.py", line 4307, in prepare_accelerator
    accelerator = Accelerator(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\accelerator.py", line 383, in __init__
    self.state = AcceleratorState(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 846, in __init__
    PartialState(cpu, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 211, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018). The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.).
E0113 11:30:28.357000 18304 torch\distributed\elastic\multiprocessing\api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 15808) of binary: D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\python.exe
Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1116, in <module>
    main()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1112, in main
    launch_command(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\run.py", line 892, in run
    elastic_launch(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./scripts/stable/train_network.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-13_11:30:28
  host      : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 18976)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-01-13_11:30:28
  host      : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 8328)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-13_11:30:28
  host      : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 15808)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
11:30:28-815865 ERROR    Training failed / 训练失败

That is my error output. When I train on a single GPU, training runs normally, but as soon as I use multiple GPUs it fails. My training parameters are below:

model_train_type = "sd-lora"
pretrained_model_name_or_path = "D:/webui/sd-webui-aki-v4.6.1/models/Stable-diffusion/v1-5-pruned.safetensors"
resume = ""
v2 = false
train_data_dir = "D:/webui/lora-scripts-v1.10.0/lora-scripts-v1.10.0/train/people"
prior_loss_weight = 1
resolution = "512,768"
enable_bucket = true
min_bucket_reso = 256
max_bucket_reso = 1024
bucket_reso_steps = 64
bucket_no_upscale = true
output_name = "aki_1"
output_dir = "./output"
save_model_as = "safetensors"
save_precision = "fp16"
save_every_n_epochs = 2
save_state = false
max_train_epochs = 10
train_batch_size = 1
gradient_checkpointing = false
gradient_accumulation_steps = 1
network_train_unet_only = false
network_train_text_encoder_only = false
learning_rate = 0.0001
unet_lr = 0.0001
text_encoder_lr = 0.00001
lr_scheduler = "constant"
lr_warmup_steps = 0
optimizer_type = "AdamW8bit"
network_module = "networks.lora"
network_dim = 64
network_alpha = 32
log_with = "tensorboard"
log_prefix = ""
log_tracker_name = ""
logging_dir = "./logs"
caption_extension = ".txt"
shuffle_caption = false
weighted_captions = false
keep_tokens = 0
keep_tokens_separator = ","
max_token_length = 255
random_crop = false
seed = 1337
clip_skip = 2
mixed_precision = "fp16"
xformers = true
lowram = false
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = false
cache_text_encoder_outputs_to_disk = false
persistent_data_loader_workers = true
ddp_gradient_as_bucket_view = false
gpu_ids = [ "0", "1", "2", "3" ]
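
From the traceback, the run dies before training starts: torch.distributed's rendezvous tries to reach a TCPStore at stable-diffusio...chinacloudapp.cn:62018, the address derived from the machine's hostname, and the connection fails with socket errors 10049/10060. As a rough way to check that hypothesis, here is a minimal, hypothetical Python sketch (not part of lora-scripts; port 29500 is an arbitrary example, not the 62018 port chosen by the launcher in the log) that resolves the local hostname and then opens a TCPStore on the loopback interface, which is all a single-node multi-GPU run actually needs:

# Hypothetical connectivity check, not part of lora-scripts: it roughly
# mirrors the rendezvous step (_create_c10d_store -> TCPStore) that fails
# in the traceback above.
import socket
from datetime import timedelta

from torch.distributed import TCPStore

hostname = socket.gethostname()
print("hostname:", hostname)
try:
    # Addresses the c10d rendezvous would try when MASTER_ADDR is the hostname.
    addrs = sorted({info[4][0] for info in socket.getaddrinfo(hostname, None)})
    print("resolves to:", addrs)
except socket.gaierror as err:
    print("hostname does not resolve:", err)

# A single-node multi-GPU run only needs a store the local ranks can reach,
# so binding the master store to the loopback interface sidesteps DNS.
store = TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                 timeout=timedelta(seconds=30))
store.set("ping", "pong")
print("loopback TCPStore works:", store.get("ping"))

If the loopback store works, pointing the launcher at 127.0.0.1 instead of the Azure-internal hostname (for example via accelerate launch's --main_process_ip / --main_process_port options, or the MASTER_ADDR / MASTER_PORT environment variables) may be enough; this is only a suggestion based on the error above, not a confirmed fix.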