Issues: intelligent-machine-learning/dlrover
Issues list
How does dlrover make sure all the nodes in one job are in one switch
#1298
opened Oct 17, 2024 by
gangxie112
While using megatron distributed flash-checkpoint to recover, an error occurs in load_checkpoint
#1233
opened Aug 13, 2024 by
deepcoldfish
Scaled-down allreduce pytorch job won't complete and reports an error
#1215
opened Jul 29, 2024 by
cocodee
megatron-lm flash-ckpt cannot save ckpt to disk when using pipeline parallel
#1146
opened May 29, 2024 by
Lzhang-hub
Keep the batch size even in strategy generator.
good first issue
Good for newcomers
stale
#706
opened Sep 16, 2023 by
Antlera
Run DLRover in jupyter.
enhancement
New feature or request
stale
#697
opened Sep 14, 2023 by
Antlera
[Feature]: Summarize the elapsed time of PyTorch ops in a training job.
stale
#664
opened Sep 6, 2023 by
workingloong
[Feature]: Dynamically adjust prefetch_factor of dataloader.
stale
#662
opened Sep 6, 2023 by
workingloong
[Feature]: Dynamically adjust the num_workers of dataloader.
stale
#661
opened Sep 6, 2023 by
workingloong