
Users can't start server #6

Open
dlsun opened this issue Oct 29, 2018 · 3 comments
Comments


dlsun commented Oct 29, 2018

Some users are stuck on "pulling image" indefinitely when trying to start up a server.

I think I traced all of the problems to a single node: ip-10-42-3-36.us-west-2.compute.internal

I cordoned off this node so that everyone gets redirected to one of the nodes that's functioning normally. Right now the node is just unschedulable. I would like to know what the SOP is for repairing a node when I run into an issue like this.
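
For reference, the cordon itself was just the following, and I assume uncordon is how I'd put the node back once it's repaired (node name is the one from the event log below):

kubectl cordon ip-10-42-3-36.us-west-2.compute.internal
# once the node is healthy again:
kubectl uncordon ip-10-42-3-36.us-west-2.compute.internal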

Here is the event log, in case it's helpful. I saved some other logs as well.

  Type    Reason                 Age   From                                               Message
  ----    ------                 ----  ----                                               -------
  Normal  Scheduled              2m    default-scheduler                                  Successfully assigned jupyter-sammishi to ip-10-42-3-36.us-west-2.compute.internal
  Normal  SuccessfulMountVolume  2m    kubelet, ip-10-42-3-36.us-west-2.compute.internal  MountVolume.SetUp succeeded for volume "nfs-stat312"
  Normal  SuccessfulMountVolume  2m    kubelet, ip-10-42-3-36.us-west-2.compute.internal  MountVolume.SetUp succeeded for volume "pv-class-stat312"
  Normal  Pulling                2m    kubelet, ip-10-42-3-36.us-west-2.compute.internal  pulling image "jupyterhub/k8s-network-tools:0.7.0"

awstown commented Oct 30, 2018

pulling image "jupyterhub/k8s-network-tools:0.7.0"

This indicates that this is a new node, so it's pulling down all the required images. That takes some time, roughly 10 minutes the last time I timed it. I try to preload the node with the needed images as it comes up, before it joins the cluster.
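
Roughly, that preload is just a set of docker pulls run from the instance's user data before the kubelet starts; the exact image list depends on the hub config, so the tags below are illustrative:

docker pull jupyterhub/k8s-network-tools:0.7.0
# plus whatever singleuser image the hub is configured to spawn, e.g. (illustrative):
docker pull jupyterhub/k8s-singleuser-sample:0.7.0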

However, it shouldn't be pulling indefinitely. How long are you waiting?


dlsun commented Oct 30, 2018

It was definitely longer than 10 minutes. When I checked the pods, a bunch of them were stuck initializing for an hour. But it seems like they are all back to normal now.


awstown commented Oct 30, 2018

To answer your question directly, the SOP for a bad node is to just kill it.

Example:

# find the instance ID for the bad node by its private IP
IP_ADDRESS=10.42.2.226
INSTANCE_IDS=$(aws ec2 describe-instances --filters=Name=private-ip-address,Values=$IP_ADDRESS --query 'Reservations[*].Instances[*].InstanceId' --output text)
# terminate it
aws ec2 terminate-instances --instance-ids $INSTANCE_IDS
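
If you want Kubernetes to move the pods off gracefully first, you can also drain and remove the node before terminating the instance; a sketch, using the node name from the event log above:

kubectl drain ip-10-42-3-36.us-west-2.compute.internal --ignore-daemonsets --delete-local-data
kubectl delete node ip-10-42-3-36.us-west-2.compute.internal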
