Users can't start server #6
Comments
This indicates that the node is new, so it is pulling down all the required images. That takes some time, about 10 minutes the last time I timed it. I try to preload each node with the needed images as it comes up, before it joins the cluster. However, it shouldn't be pulling indefinitely. How long are you waiting?
It was definitely longer than 10 minutes. When I checked, a bunch of pods had been stuck initializing for an hour. But they all seem to be back to normal now.
To answer your question directly, the SOP for a bad node is to just kill it. Example:
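The maintainer's own example is not preserved on this page. As a rough sketch of what "killing" a bad node often involves on a Kubernetes cluster, assuming `kubectl` access, the helper below builds a drain command followed by a node deletion. The function names `kill_node_commands` and `run_commands` are hypothetical, and terminating the underlying cloud instance so that a replacement comes up is provider-specific and left out.

```python
import subprocess

# Node named in this issue; substitute the node that is misbehaving.
BAD_NODE = "ip-10-42-3-36.us-west-2.compute.internal"


def kill_node_commands(node: str) -> list[list[str]]:
    """Build (but do not run) kubectl commands that evict pods and remove a node.

    Assumption: a recent kubectl (the --delete-emptydir-data flag replaced
    --delete-local-data in kubectl 1.20).
    """
    return [
        # Evict pods from the node; DaemonSet pods cannot be evicted, so skip them.
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        # Remove the node object from the cluster.
        ["kubectl", "delete", "node", node],
    ]


def run_commands(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default so nothing is changed until the commands are reviewed.
    run_commands(kill_node_commands(BAD_NODE))
```

Keeping the dry run as the default makes it possible to check the commands before anything is evicted.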
Some users are stuck on "pulling image" indefinitely when trying to start a server.
I think I traced all of the problems to a single node: ip-10-42-3-36.us-west-2.compute.internal
I cordoned off this node so that everyone gets redirected to one of the nodes that is functioning normally. Right now the node is just unschedulable. I would like to know what the SOP is for repairing a node when I have an issue like this.
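For reference, the cordon step described above can be sketched as follows, assuming `kubectl` access to the cluster. The helper names `cordon_commands` and `show_or_run` are hypothetical; the second command lists the pods still assigned to the node, so the ones stuck initializing can be spotted.

```python
import subprocess

# Node named in this issue.
NODE = "ip-10-42-3-36.us-west-2.compute.internal"


def cordon_commands(node: str) -> list[list[str]]:
    """Build kubectl commands that cordon a node and list the pods on it."""
    return [
        # Mark the node unschedulable; pods already running there are not evicted.
        ["kubectl", "cordon", node],
        # List every pod currently assigned to the node, across all namespaces.
        [
            "kubectl", "get", "pods", "--all-namespaces",
            "--field-selector", f"spec.nodeName={node}",
            "-o", "wide",
        ],
    ]


def show_or_run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    show_or_run(cordon_commands(NODE))
```

Cordoning only stops new pods from landing on the node; `kubectl uncordon` reverses it if the node recovers.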
Here is the event log, in case it's helpful. I saved some other logs as well.