[k8s] Inconsistent Display in Setting up a Local Cluster #4191

Open
root-hbx opened this issue Oct 26, 2024 · 8 comments

Comments

@root-hbx

Issue Reproduction

I attempted to set up a local Kubernetes cluster on my laptop (Apple M2 Pro) following the instructions here.

After starting it up, I followed the QuickStart guide to launch a cluster named mycluster and run the task hello_sky.yaml.

However, even after more than 6 hours, the launch was still stalled in the CLI.

When I checked with sky status (see the --k8s output below), the cluster was shown as UP.

But when I tried sky exec mycluster hello_sky.yaml, assuming mycluster already existed, the command was rejected :(

It's a little bit confusing.

# init + launch (remained in this state for more than 6h)
❯ sky launch -c mycluster hello_sky.yaml
Task from YAML spec: hello_sky.yaml
Considered resources (1 node):
-----------------------------------------------------------------------------------------------
 CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
-----------------------------------------------------------------------------------------------
 Kubernetes   2CPU--2GB   2       2         -              kind-skypilot   0.00          ✔
-----------------------------------------------------------------------------------------------
Launching a new cluster 'mycluster'. Proceed? [Y/n]: y
⚙︎ Launching on Kubernetes.
⠧ Launching  View logs at: ~/sky_logs/sky-2024-10-26-00-32-04-918185/provision.log
# status output suggesting that initialization of the cluster (`mycluster`) has finished
❯ sky status --k8s
Kubernetes cluster state (context: kind-skypilot)
SkyPilot clusters
USER     NAME                           LAUNCHED     RESOURCES                     STATUS
huluobo  mycluster                      10 hrs ago   1x Kubernetes(cpus=2, mem=2)  UP
huluobo  sky-serve-controller-9d6cffc0  4 weeks ago  1x Kubernetes(cpus=4, mem=4)  UP
❯ sky exec mycluster hello_sky.yaml
Task from YAML spec: hello_sky.yaml
Executing task on cluster mycluster...
sky.exceptions.ClusterNotUpError: Executing tasks: skipped for cluster 'mycluster' (status: INIT). It is only allowed for UP clusters. Wait for a launch to finish, or use this command to try to transition the cluster to UP: sky start mycluster

BTW, just a month ago, I used the same setup steps to initialize and launch a GCP cluster, which didn't show this problem.

Issue Visualization

image

Raise Question

I have three questions:

  1. When I use a local k8s cluster on my laptop, the launch stays stalled in the CLI and never actually finishes. But when I run sky status at the same time, it shows the cluster as UP. This is really confusing; maybe we should fix this display issue?
    • I'm not sure if this issue occurs with other types of clusters, but based on the above, it's clear that there’s a problem with the setup of a local Kubernetes cluster.
  2. Can we display the consumed time to initialize the cluster at this step?
    • Considering that initializing large clusters can take a significant amount of time, providing an estimated duration would help in predicting the setup time for clusters of a similar scale. This way, operators can better gauge how long they might need to wait.
  3. Is there a way to optimize the initialization time for a local Kubernetes cluster?
    • Following the documentation, I tried deploying a local Kubernetes cluster, but after waiting 6 hours, it still wasn’t completed. However, initializing on GCP took only 23 minutes.

Version & Commit info:

  • sky -v: version 1.0.0.dev20241024
  • sky -c: commit cbf5c00
@root-hbx root-hbx changed the title [k8s] Setting up a local cluster is time-consuming [k8s] Inconsistent Display in Setting up a Local Cluster Oct 26, 2024
@root-hbx
Author

The tutorial says that --k8s is still experimental.

It appears that issue 1 lies with the experimental sky status --k8s command, as demonstrated below:

  • sky status: ✅
  • sky status -a: ✅
  • sky status --k8s: ❌
  ~/paper/hello-sky                                                 🐍 skypilot  15:04:39
❯ sky status -a
Clusters
NAME       LAUNCHED   RESOURCES                 REGION         ZONE  STATUS  AUTOSTOP  HEAD_IP  COMMAND
mycluster  4 hrs ago  1x Kubernetes(2CPU--2GB)  kind-skypilot  -     INIT    -         -        sky launch -c mycluster hello_sky.yaml

Managed jobs
No in-progress managed jobs. (See: sky jobs -h)

Services
No live services. (See: sky serve -h)
  ~/paper/hello-sky                                                 🐍 skypilot  15:06:30
❯ sky status
Clusters
NAME       LAUNCHED   RESOURCES                 STATUS  AUTOSTOP  COMMAND
mycluster  4 hrs ago  1x Kubernetes(2CPU--2GB)  INIT    -         sky launch -c mycluster h...

Managed jobs
No in-progress managed jobs. (See: sky jobs -h)

Services
No live services. (See: sky serve -h)
  ~/paper/hello-sky                                                 🐍 skypilot  15:06:34
❯ sky status --k8s
Kubernetes cluster state (context: kind-skypilot)
SkyPilot clusters
USER     NAME                           LAUNCHED     RESOURCES                     STATUS
huluobo  mycluster                      4 hrs ago    1x Kubernetes(cpus=2, mem=2)  UP
huluobo  sky-serve-controller-9d6cffc0  4 weeks ago  1x Kubernetes(cpus=4, mem=4)  UP

Hint: SkyServe replica pods are shown in the "SkyPilot clusters" section.

The discrepancy in the sky status --k8s command output may indicate some limitations or unexpected behavior in this experimental feature.
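For anyone who wants to check for this mismatch mechanically, here is a minimal sketch that parses the two tables pasted above and reports clusters whose STATUS differs. It assumes columns are separated by two or more spaces, as in the output shown; `parse_status_table` and `status_mismatches` are hypothetical helper names, not part of SkyPilot.

```python
import re


def parse_status_table(table: str) -> dict[str, str]:
    """Map cluster NAME -> STATUS from a whitespace-aligned status table."""
    lines = [ln.strip() for ln in table.strip().splitlines() if ln.strip()]
    # Columns are separated by runs of two or more spaces; cell values
    # such as "4 hrs ago" only contain single spaces.
    header = re.split(r"\s{2,}", lines[0])
    name_col, status_col = header.index("NAME"), header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line)
        statuses[cells[name_col]] = cells[status_col]
    return statuses


def status_mismatches(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return clusters present in both tables whose statuses disagree."""
    return {name: (a[name], b[name]) for name in a.keys() & b.keys() if a[name] != b[name]}


# Rows copied from the `sky status` and `sky status --k8s` output above.
SKY_STATUS = """
NAME       LAUNCHED   RESOURCES                 STATUS  AUTOSTOP  COMMAND
mycluster  4 hrs ago  1x Kubernetes(2CPU--2GB)  INIT    -         sky launch -c mycluster h...
"""

SKY_STATUS_K8S = """
USER     NAME                           LAUNCHED     RESOURCES                     STATUS
huluobo  mycluster                      4 hrs ago    1x Kubernetes(cpus=2, mem=2)  UP
huluobo  sky-serve-controller-9d6cffc0  4 weeks ago  1x Kubernetes(cpus=4, mem=4)  UP
"""

mismatches = status_mismatches(
    parse_status_table(SKY_STATUS), parse_status_table(SKY_STATUS_K8S)
)
print(mismatches)  # {'mycluster': ('INIT', 'UP')}
```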

@romilbhardwaj
Collaborator

Hi @root-hbx, thanks for the report. Before I answer your questions, can you share your provision.log? ~/sky_logs/sky-2024-10-26-00-32-04-918185/provision.log

@root-hbx
Author

Here it is, thanks :)

provision.log

@romilbhardwaj
Collaborator

Looks like it is stuck at pulling the image. Are you behind a firewall?
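A quick way to confirm this kind of stall without reading the whole log is to look at the container states in the pod's JSON. The sketch below assumes input shaped like `kubectl get pod <name> -o json`; the helper name and the sample pod (including the container name `ray-node`) are made up for illustration. Note that `ContainerCreating` covers image pulls but also other setup steps, so treat a match as a hint rather than proof.

```python
import json

# Waiting reasons Kubernetes reports while a container is still being
# created (which includes pulling its image) or after a pull has failed.
IMAGE_WAIT_REASONS = {"ContainerCreating", "ErrImagePull", "ImagePullBackOff"}


def containers_waiting_on_image(pod: dict) -> list[tuple[str, str]]:
    """Return (container name, waiting reason) pairs for containers that have not started."""
    stuck = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in IMAGE_WAIT_REASONS:
            stuck.append((cs["name"], waiting["reason"]))
    return stuck


# Hypothetical pod object; only the fields the helper reads are included.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "ray-node", "state": {"waiting": {"reason": "ContainerCreating"}}}
    ]
  }
}
""")

print(containers_waiting_on_image(sample_pod))
```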

@root-hbx
Author

root-hbx commented Oct 27, 2024

Thanks @romilbhardwaj. My computer's firewall has always been turned off.

  1. Output of kubectl describe pods:

des_pod.txt

  2. Running step A1 of our troubleshooting guide:
image
❯ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
pod/skytest created
service/skytest-svc unchanged
  ~/Github_Content/AcademicHomepage   main                                                                                                      🐍 skypilot ⎈ kind-skypilot  12:17:39
❯ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
pod/skytest unchanged
service/skytest-svc unchanged
  ~/Github_Content/AcademicHomepage   main                                                                                                      🐍 skypilot ⎈ kind-skypilot  12:17:54
❯ kubectl get pod skytest
NAME      READY   STATUS    RESTARTS   AGE
skytest   1/1     Running   0          24s
  ~/Github_Content/AcademicHomepage   main                                                                                                      🐍 skypilot ⎈ kind-skypilot  12:18:03
❯ kubectl port-forward svc/skytest-svc 8080:8080
Forwarding from [::1]:8080 -> 8080
Handling connection for 8080
image

@root-hbx
Author

Thanks a lot @romilbhardwaj ! Following your guidance, I identified the issue — it was related to Docker Desktop on my computer. I deleted the pod and re-deployed it. Running step A1 of our troubleshooting guide showed that everything was working correctly. After trying to deploy the local Kubernetes cluster again, it worked as expected.

image

However, I'm still a bit confused about why, in the previous stuck state, the outputs of sky status --k8s and sky status were inconsistent regarding the cluster status?

@romilbhardwaj
Collaborator

Great to hear it works now!

sky status --k8s and sky status are inconsistent when the pod is being initialized because status --k8s treats a running pod as UP, even though it may not be initialized by SkyPilot:

if pod.status.phase == 'Pending':
    # If pod is pending, do not show it in the status
    continue
cluster_info = KubernetesSkyPilotClusterInfo(
    cluster_name_on_cloud=cluster_name_on_cloud,
    cluster_name=cluster_name,
    user=pod.metadata.labels.get('skypilot-user'),
    status=status_lib.ClusterStatus.UP,
    pods=[],
    launched_at=start_time,
    resources=resources,
    resources_str='')

We could improve this by actually polling the pod to get the right SkyPilot cluster status, but that might be an expensive operation requiring kubectl exec into each pod. It might be worth prototyping this and measuring the overhead before deciding whether to adopt it.

We can also change the cluster status in --k8s to use something other than UP.
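As a rough sketch of what a more accurate display could look like: the function below keeps the existing rule of hiding Pending pods, but reports INIT until some signal says SkyPilot setup has finished. The `skypilot_initialized` flag and the local `ClusterStatus` enum are stand-ins for illustration, not SkyPilot's actual API; how to obtain that flag cheaply is exactly the open question here.

```python
from enum import Enum
from typing import Optional


class ClusterStatus(Enum):
    """Local stand-in for the statuses shown in the tables in this thread."""
    INIT = "INIT"
    UP = "UP"


def display_status(pod_phase: str, skypilot_initialized: bool) -> Optional[ClusterStatus]:
    """Pick the status to show for a pod, or None to hide it.

    `skypilot_initialized` is a hypothetical signal (for example, the
    result of polling the pod) saying that SkyPilot setup has completed.
    """
    if pod_phase == "Pending":
        # Same as the current behavior: pending pods are not shown.
        return None
    # Any non-pending pod is shown; it is only UP once setup has finished.
    return ClusterStatus.UP if skypilot_initialized else ClusterStatus.INIT


print(display_status("Running", skypilot_initialized=False))  # ClusterStatus.INIT
```

With this rule, the stuck `mycluster` pod from this issue would have been listed as INIT in both views.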

@root-hbx
Author

Got it 🫡 I’m wondering if there might be an alternative approach. For example, setting up a probe within each pod that continuously outputs its status externally. This way, we could simply track the status changes from its streaming output instead of inspecting each pod individually each time?
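To make the idea concrete, here is a minimal sketch of the consumer side only. It assumes each pod's probe emits `(pod_name, status)` events onto some shared stream; the event format and the `latest_statuses` helper are hypothetical, and how the events would actually be transported out of the pod is left open.

```python
from typing import Dict, Iterable, Tuple


def latest_statuses(events: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Fold a stream of (pod_name, status) events into the latest status per pod."""
    latest: Dict[str, str] = {}
    for pod_name, status in events:
        # Later events overwrite earlier ones, so no per-pod inspection is needed.
        latest[pod_name] = status
    return latest


# Hypothetical event stream: mycluster never reports UP, the controller does.
events = [
    ("mycluster", "INIT"),
    ("sky-serve-controller-9d6cffc0", "INIT"),
    ("sky-serve-controller-9d6cffc0", "UP"),
]

print(latest_statuses(events))
```

The status command would then read from this cached map instead of running kubectl exec against every pod on each call.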

