Training issues #3
Comments
Sorry, it should be "guided-diffusion_64_256_upsampler.pt". README and scripts have been updated.
Got it! Thanks so much for such a quick reply!
Also, I would like to know which file "landscape_linear1000_16x64x64_shiftT_window148_lr1e-4_ema_100000.pt" in multimodal_train.sh refers to.
Training the multimodal-generation model requires no initialization; the script has been updated now.
Thanks for your reply!
Hello, sorry to disturb you again. I saw in the supplementary material of the paper that training uses 32 V100s with a batch size of 128, but the current open-source training script sets a batch size of 4 and uses a single GPU. Could you please provide the training script you used in your experiments? I look forward to your reply, thank you very much.
The batch size is per GPU. For example, with "--GPU 0,1,2,3,4,5,6,7 mpiexec -n 8 python ...", the total batch size is 4*8=32. Our training uses 4 nodes, i.e., 32 GPUs in total; you need to adapt the scripts to run across multiple nodes according to the requirements of your own cluster.
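To make the scaling concrete, here is a minimal single-node launch sketch: the per-GPU batch size is multiplied by the number of MPI ranks, so 8 GPUs at batch size 4 give 32 samples per step, and 4 such nodes give the paper's effective batch size of 128. The script path and flag names below are placeholders (they follow the guided-diffusion convention this code appears to build on) and may differ from the released multimodal_train.sh:

```bash
# Single-node sketch: 8 GPUs x per-GPU batch size 4 -> effective batch size 32.
# Four such nodes reproduce the paper's effective batch size of 128.
# Script path, data path, and flag names are assumptions; check multimodal_train.sh for the real ones.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
mpiexec -n 8 python py_scripts/multimodal_train.py \
    --data_dir /path/to/landscape \
    --batch_size 4 \
    --lr 1e-4
```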
Got it, thank you!
Hello, I am training on the AIST++ dataset with 8 A100 cards; each card has a batch size of 12, so the overall batch size is 96. After 10,000 training steps, the sampled video and audio are still pure noise. I am not sure what the reason is at the moment.
Looking forward to your reply, thank you!
In my experiments, results become meaningful after about 50,000 steps. You can follow this advice and continue training from your current checkpoints.
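A minimal sketch of continuing from the latest saved checkpoint is below, assuming the training entry point exposes a --resume_checkpoint flag as in OpenAI's guided-diffusion, which the released checkpoint names suggest this code builds on. The script path, checkpoint file name, and other flags are placeholders:

```bash
# Resume-from-checkpoint sketch (assumption: --resume_checkpoint exists as in guided-diffusion).
# Adjust the script path, checkpoint file, and other flags to match multimodal_train.sh.
mpiexec -n 8 python py_scripts/multimodal_train.py \
    --data_dir /path/to/aist++ \
    --batch_size 12 \
    --resume_checkpoint ./checkpoints/model010000.pt   # continue past 10k steps toward ~50k
```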
Thank you very much for your reply; I am sure it will help with my experiments!
Hello @ludanruan, thanks for sharing this information. I was wondering what the average time in hours is to reach meaningful results (or the average step time) on the Landscape or AIST++ datasets?
In my experiments, 50,000 iterations bring meaningful results.
@ludanruan Perhaps my question was not clear. I meant the average training time in hours/days needed to reach those 50,000 iterations with the V100 GPUs described in the paper. Thanks, best
In my experimental setting (32 x 32 GB V100s), 30,000 iterations take one day, so meaningful results (about 50,000 iterations) are reached within roughly two days.
Thank you for the information! I needed it since I am planning to do research on this :)
Hi,
I found a file "landscape_linear1000_16x64x64_shiftT_window148_lr1e-4_ema_100000.pt" in the training script. Is this file the landscape.pt in the open-source model? I am looking forward to your answer, thank you very much.