Nodes (Standard_NC24ads_A100_v4) not configuring anymore for v1.0.40 #1919

Open
jlphillipsphd opened this issue Aug 26, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments

@jlphillipsphd

Version

v1.0.40
slurm: 22.05.3
cyclecloud: 2.7.2

In what area(s)?

/area ansible
/area autoscaling
/area configuration
/area cyclecloud

Expected Behavior

Nodes (Standard_NC24ads_A100_v4) should autoscale and be configured properly for workloads.

Actual Behavior

The nodes spawn and report that they are being configured for workloads, but they ultimately terminate before the job starts. I can log into a node via SSH and see that it is active, but something isn't connecting correctly. I will post the specific error message (it was a Python error from jetpack) when I get a chance to try again. These nodes were working up to a week or so ago.

Steps to Reproduce the Problem

Deploy a GPU cluster (Standard_NC24ads_A100_v4) using the versions above.

@jlphillipsphd added the "kind/bug" (Something isn't working) label Aug 26, 2024
@jlphillipsphd changed the title from "Nodes () not configuring anymore for v1.0.40" to "Nodes (Standard_NC24ads_A100_v4) not configuring anymore for v1.0.40" Aug 26, 2024
@jlphillipsphd
Author

Wow, phishing in the first few seconds, not suspect at all...

@xpillons
Collaborator

Can you please check the Slurm logs on the node when this occurs?
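
One way to grab the relevant log tails before the node terminates is sketched below. This is only a rough helper, not part of the project: the function name `tail_logs` and the paths in `CANDIDATE_LOGS` are assumptions. The slurmd log location depends on `SlurmdLogFile` in `slurm.conf`, and the jetpack log path may differ on your image, so adjust them as needed.

```python
from collections import deque
from pathlib import Path

# Assumed candidate locations (hypothetical defaults); adjust them to match
# SlurmdLogFile in slurm.conf and the jetpack install on your image.
CANDIDATE_LOGS = [
    "/var/log/slurmd/slurmd.log",
    "/var/log/slurm/slurmd.log",
    "/opt/cycle/jetpack/logs/jetpack.log",
]


def tail_logs(paths, max_lines=50):
    """Return {path: last max_lines lines} for each path that exists and is readable."""
    tails = {}
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            continue  # skip locations that do not exist on this node
        try:
            with path.open("r", errors="replace") as handle:
                # deque with maxlen keeps only the final max_lines lines
                tails[str(path)] = list(deque(handle, maxlen=max_lines))
        except OSError:
            continue  # unreadable (e.g. permissions); skip it
    return tails


if __name__ == "__main__":
    for name, lines in tail_logs(CANDIDATE_LOGS).items():
        print(f"==> {name} <==")
        print("".join(lines), end="")
```

Running it over SSH on the affected node (with enough privileges to read the files) prints the last 50 lines of whichever candidate logs exist, which can then be pasted into the issue.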
