What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault-tolerance and monitoring capabilities? How can DLRover recover when a GPU goes offline and TP and PP need to be reorganized? #1243

Open
dotsonliu opened this issue Aug 19, 2024 · 1 comment

Comments

@dotsonliu

No description provided.

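@majieyue majieyue changed the title from "Hello, what is the relationship between dlrover and megatron? Megatron has no fault-tolerance or monitoring features; can it borrow these capabilities from dlrover? How are they integrated? If a GPU suddenly fails and tp, pp, etc. all change, how is that handled dynamically?" (translated from the Chinese original) to "What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault-tolerance and monitoring capabilities? How can DLRover recover when a GPU goes offline and TP and PP need to be reorganized?" on Sep 19, 2024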
@majieyue
Collaborator

majieyue commented Sep 19, 2024

Thank you for using DLRover. I've translated your title into English; please submit issues in English in the future.

Have a good day
