[Feature] GPU health checking in Node Problem Detector #4735

Open
sdesai345 opened this issue Jan 7, 2025 · 0 comments
@sdesai345 (Contributor)

By default, AKS will periodically run health checks on your GPU nodes (starting with select VM SKUs) using Node Problem Detector (NPD). NPD will detect and surface node conditions that reflect a range of common failures affecting Linux GPU node pools, improving observability and enabling proactive workload placement.
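
As a rough illustration of how surfaced conditions could be consumed, here is a minimal Python sketch that scans the JSON printed by `kubectl get nodes -o json` and lists any node condition, beyond the ones the kubelet always reports, whose status is `"True"`. The helper name `extra_true_conditions`, the node name, and the condition type `ExampleGpuProblem` are hypothetical placeholders; this issue does not name the actual GPU conditions. The sketch also assumes the usual NPD convention that a status of `"True"` means the problem is present.

```python
import json

# Condition types the kubelet reports on every node; anything else
# is assumed to come from an add-on such as Node Problem Detector.
KUBELET_CONDITIONS = {
    "Ready", "MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable",
}

def extra_true_conditions(node_list_json: str) -> dict[str, list[str]]:
    """Map node name -> non-kubelet condition types whose status is "True"."""
    result = {}
    for node in json.loads(node_list_json).get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        flagged = [
            c["type"] for c in conditions
            if c["type"] not in KUBELET_CONDITIONS and c.get("status") == "True"
        ]
        if flagged:
            result[name] = flagged
    return result

# Hypothetical sample shaped like `kubectl get nodes -o json` output.
# "ExampleGpuProblem" is a made-up condition type, not a real AKS name.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "gpu-node-0"},
            "status": {"conditions": [
                {"type": "Ready", "status": "True"},
                {"type": "ExampleGpuProblem", "status": "True"},
            ]},
        },
        {
            "metadata": {"name": "gpu-node-1"},
            "status": {"conditions": [
                {"type": "Ready", "status": "True"},
                {"type": "ExampleGpuProblem", "status": "False"},
            ]},
        },
    ]
})

if __name__ == "__main__":
    print(extra_true_conditions(SAMPLE))
```

In practice the input would come from the cluster rather than a hard-coded sample, and the result could feed alerting or a decision to cordon the affected node.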

sdesai345 added the GPU label Jan 7, 2025
sdesai345 self-assigned this Jan 7, 2025
sdesai345 moved this to Planned (Committed) in Azure Kubernetes Service Roadmap (Public) Jan 7, 2025
sdesai345 changed the title from "GPU health checking in Node Problem Detector" to "[Feature] GPU health checking in Node Problem Detector" Jan 7, 2025