[BUG]Zero++ training failed #6926

Open
HelloWorld506 opened this issue Jan 6, 2025 · 8 comments

@HelloWorld506

Describe the bug
I have 4 nodes, each with 8 A100 GPUs. To reduce inter-node communication I switched to ZeRO++ training, which did speed up training. However, throughout training the loss stayed at 11.9321 and grad_norm stayed at 0, so the training failed.

My DeepSpeed configuration file is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": false,
    "zero_quantized_gradients": false,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

What is causing this, and how can I fix it?
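One observation that may help narrow this down: a cross-entropy loss stuck at a constant value is often the loss of a uniform prediction, which equals ln(V) for V output classes. The sketch below inverts that relation for the reported 11.9321. It assumes the reported loss is a token-level cross-entropy; the thread does not name the model or the loss function, and `uniform_cross_entropy` is a helper introduced only for illustration.

```python
import math

reported_loss = 11.9321  # value reported in this issue

# Cross-entropy of a uniform distribution over V classes is ln(V),
# so exp(loss) gives the class count that would produce this loss.
implied_classes = math.exp(reported_loss)
print(f"implied number of classes: {implied_classes:,.0f}")


def uniform_cross_entropy(num_classes: int) -> float:
    """Loss of a model that assigns equal probability to every class."""
    return math.log(num_classes)
```

If the implied count is close to the model's vocabulary size, the forward pass is probably producing uniform (for example all-zero) logits. Combined with grad_norm staying at 0, that would point toward the weights seen during the forward pass rather than toward the optimizer settings, though this is only a hypothesis.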

@HelloWorld506 added the bug and training labels on Jan 6, 2025
@loadams
Contributor

loadams commented Jan 6, 2025

@HelloWorld506 - can you please share a repro script and the DeepSpeed version you are using?

@HelloWorld506
Author

@loadams Hello,
My DeepSpeed version is 0.14.5.
The script I use trains successfully with ZeRO stages 1, 2, and 3 but fails with ZeRO++. I did not change the code; I only added the following entries to the configuration file:
"zero_hpz_partition_size": 8,
"zero_quantized_weights": false,
"zero_quantized_gradients": false
Do I need to modify my code to use ZeRO++?
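For reference, the three keys above can also be applied to an existing ZeRO-3 config programmatically. The following is a minimal sketch that adds them and sanity-checks the hpZ partition size against the cluster layout described in this issue (4 nodes with 8 GPUs each). The helper name `add_zeropp_keys` and the divisibility check are illustrative assumptions, not part of DeepSpeed, and the helper does not call DeepSpeed at all.

```python
import copy


def add_zeropp_keys(ds_config: dict, gpus_per_node: int, world_size: int,
                    quantize: bool = False) -> dict:
    """Return a copy of a ZeRO-3 config with the ZeRO++ keys from this thread added.

    Hypothetical helper: it only edits the dict and never touches DeepSpeed.
    """
    zero = ds_config.get("zero_optimization", {})
    if zero.get("stage") != 3:
        raise ValueError("the ZeRO++ keys in this thread are used with ZeRO stage 3")
    # Assumption: the secondary (hpZ) partition should tile the world evenly.
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")

    new_config = copy.deepcopy(ds_config)  # leave the caller's config untouched
    new_config["zero_optimization"].update({
        "zero_hpz_partition_size": gpus_per_node,
        "zero_quantized_weights": quantize,
        "zero_quantized_gradients": quantize,
    })
    return new_config


base = {"zero_optimization": {"stage": 3, "overlap_comm": True}}
zeropp = add_zeropp_keys(base, gpus_per_node=8, world_size=32)
print(zeropp["zero_optimization"])
```

Working on a deep copy keeps the known-good ZeRO-3 config available for comparison runs.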

@HelloWorld506
Author

@loadams In addition, when I set zero_quantized_weights and zero_quantized_gradients to true, I get this error:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
I also tried setting both bf16 and fp16 to false so that training runs in fp32, but that does not help; the error still occurs.
My DeepSpeed configuration file is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": true,
    "zero_quantized_gradients": true,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

By the way, I am using LoRA to fine-tune my model.

@GuanhuaWang
Member

Hi @HelloWorld506

Could you try disabling parameter and optimizer offload? That is, delete these lines:

"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},

@HelloWorld506
Author

@GuanhuaWang I tried it, but nothing changed: the loss still stays at 11.9321 and grad_norm stays at 0.

@GuanhuaWang
Member

GuanhuaWang commented Jan 22, 2025

Hi @HelloWorld506 ,

I believe the quantization path only supports fp16, not bf16 or fp32.

Could you enable fp16? For example:

"fp16": {
"enabled": true,
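The suggestion above can be turned into a small pre-flight check on the config dict. This is only a sketch that encodes the stated belief that the quantized paths expect fp16; the function name `quantization_precision_warnings` is hypothetical, and DeepSpeed itself may not enforce this rule.

```python
def quantization_precision_warnings(ds_config: dict) -> list:
    """List config combinations that this thread suggests are unsupported."""
    zero = ds_config.get("zero_optimization", {})
    quantized = [
        key
        for key in ("zero_quantized_weights", "zero_quantized_gradients")
        if zero.get(key) is True
    ]
    # "auto" is resolved later by the training framework,
    # so only an explicit True counts as fp16 being enabled here.
    fp16_on = ds_config.get("fp16", {}).get("enabled") is True

    warnings = []
    if quantized and not fp16_on:
        warnings.append(
            f"{', '.join(quantized)} enabled without fp16; "
            "the quantized path is believed to support fp16 only"
        )
    return warnings


# The second config posted in this issue: quantization on, fp16 and bf16 off.
config = {
    "fp16": {"enabled": False},
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,
        "zero_quantized_gradients": True,
    },
}
for message in quantization_precision_warnings(config):
    print(message)
```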

@HelloWorld506
Author

@GuanhuaWang I tried it, but it did not work.

@GuanhuaWang
Member

GuanhuaWang commented Jan 23, 2025

OK @HelloWorld506,

First, could you try disabling hpZ, for example:
"zero_hpz_partition_size": 1
and see whether that works?

Second, could you provide a simple, reproducible Python script so I can look into it?
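The debugging steps suggested in this thread so far (drop offload, force fp16, disable hpZ) can be generated as separate configs so that each change is tested in isolation. The sketch below does only that; the function `ablation_configs` and the variant names are hypothetical, and launching the actual training runs is left out.

```python
import copy
import json


def ablation_configs(ds_config: dict) -> dict:
    """Build one config per suggestion from this thread, changing one thing each."""
    variants = {}

    # Suggestion 1: remove optimizer and parameter offload.
    no_offload = copy.deepcopy(ds_config)
    no_offload["zero_optimization"].pop("offload_optimizer", None)
    no_offload["zero_optimization"].pop("offload_param", None)
    variants["no_offload"] = no_offload

    # Suggestion 2: train in fp16 explicitly.
    fp16_only = copy.deepcopy(ds_config)
    fp16_only.setdefault("fp16", {})["enabled"] = True
    fp16_only.setdefault("bf16", {})["enabled"] = False
    variants["fp16_only"] = fp16_only

    # Suggestion 3: disable hpZ by setting the partition size to 1.
    no_hpz = copy.deepcopy(ds_config)
    no_hpz["zero_optimization"]["zero_hpz_partition_size"] = 1
    variants["no_hpz"] = no_hpz

    return variants


base = {
    "fp16": {"enabled": "auto"},
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "zero_hpz_partition_size": 8,
    },
}
variants = ablation_configs(base)
for name, cfg in variants.items():
    print(name, json.dumps(cfg["zero_optimization"]))
```

Changing one setting per run makes it easier to tell which of the three is tied to the constant loss.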
