
Users can't start server #6

Open
dlsun opened this issue Oct 29, 2018 · 3 comments
Comments


dlsun commented Oct 29, 2018

Some users are stuck on "pulling image" indefinitely when trying to start up a server.

I think I traced all of the problems to a single node: ip-10-42-3-36.us-west-2.compute.internal

I cordoned off this node so that everyone gets redirected to one of the nodes that's functioning normally. Right now the node is just unschedulable. I would like to know what the SOP is for repairing a node when I run into an issue like this.
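
For reference, the cordon itself was just the following, and I assume uncordon is how I'd put the node back once it's repaired (node name is the one from the event log below):

kubectl cordon ip-10-42-3-36.us-west-2.compute.internal
# once the node is healthy again:
kubectl uncordon ip-10-42-3-36.us-west-2.compute.internal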

Here is the event log, in case it's helpful. I saved some other logs as well.

  Type    Reason                 Age   From                                               Message
  ----    ------                 ----  ----                                               -------
  Normal  Scheduled              2m    default-scheduler                                  Successfully assigned jupyter-sammishi to ip-10-42-3-36.us-west-2.compute.internal
  Normal  SuccessfulMountVolume  2m    kubelet, ip-10-42-3-36.us-west-2.compute.internal  MountVolume.SetUp succeeded for volume "nfs-stat312"
  Normal  SuccessfulMountVolume  2m    kubelet, ip-10-42-3-36.us-west-2.compute.internal  MountVolume.SetUp succeeded for volume "pv-class-stat312"
  Normal  Pulling                2m    kubelet, ip-10-42-3-36.us-west-2.compute.internal  pulling image "jupyterhub/k8s-network-tools:0.7.0"

awstown commented Oct 30, 2018

pulling image "jupyterhub/k8s-network-tools:0.7.0"

This indicates that this is a new node, so it's pulling down all the required images. That takes some time, roughly 10 minutes the last time I timed it. I try to preload the node with the needed images as it comes up, before it joins the cluster.
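
Roughly, that preload is just a set of docker pulls run from the instance's user data before the kubelet starts; the exact image list depends on the hub config, so the tags below are illustrative:

docker pull jupyterhub/k8s-network-tools:0.7.0
# plus whatever singleuser image the hub is configured to spawn, e.g. (illustrative):
docker pull jupyterhub/k8s-singleuser-sample:0.7.0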

However, it shouldn't be pulling indefinitely. How long are you waiting?


dlsun commented Oct 30, 2018

It was definitely longer than 10 minutes. When I checked the pods, a bunch of them were stuck initializing for an hour. But it seems like they are all back to normal now.


awstown commented Oct 30, 2018

To answer your question directly, the SOP for a bad node is to just kill it.

Example:

# find the instance ID for the bad node by its private IP
IP_ADDRESS=10.42.2.226
INSTANCE_IDS=$(aws ec2 describe-instances --filters=Name=private-ip-address,Values=$IP_ADDRESS --query 'Reservations[*].Instances[*].InstanceId' --output text)
# terminate it
aws ec2 terminate-instances --instance-ids $INSTANCE_IDS
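
If you want Kubernetes to move the pods off gracefully first, you can also drain and remove the node before terminating the instance; a sketch, using the node name from the event log above:

kubectl drain ip-10-42-3-36.us-west-2.compute.internal --ignore-daemonsets --delete-local-data
kubectl delete node ip-10-42-3-36.us-west-2.compute.internal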
