Failed to Down the Serve Controller on K8S Anyway #4590

Open
andylizf opened this issue Jan 18, 2025 · 0 comments

andylizf commented Jan 18, 2025

After losing the connection to the K8S serve controller, I have no way to terminate it, even with the --purge flag.

andyl@andylizf-dev-server ~/skypilot (fix-aws-name)> sky down sky-serve-controller-e2dc6f0f          (sky) 
⠹ Checking for live services
Canceled autodown on the cluster 'sky-serve-controller-e2dc6f0f', since it is found to be in an abnormal state. To fix, try running: sky start -f -i 10 --down sky-serve-controller-e2dc6f0f
sky.exceptions.ClusterNotUpError: Failed to connect to serve controller, please try again later.

During handling of the above exception, another exception occurred:

sky.exceptions.NotSupportedError: Tearing down the sky serve controller while it is in INIT state is not supported (this means a sky serve up is in progress or the previous launch failed), as we cannot guarantee that all the services are terminated. Please wait until the sky serve controller is UP or fix it with sky start sky-serve-controller-e2dc6f0f.
andyl@andylizf-dev-server ~/skypilot (fix-aws-name) [1]> sky down sky-serve-controller-e2dc6f0f --purge  (sky) 
sky.exceptions.ClusterNotUpError: Failed to connect to serve controller, please try again later.

During handling of the above exception, another exception occurred:

sky.exceptions.NotSupportedError: Tearing down the sky serve controller while it is in INIT state is not supported (this means a sky serve up is in progress or the previous launch failed), as we cannot guarantee that all the services are terminated. Please wait until the sky serve controller is UP or fix it with sky start sky-serve-controller-e2dc6f0f.
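The behavior expected from --purge can be sketched as follows. This is a minimal illustration only: `TeardownDecision` and `decide_controller_teardown` are hypothetical names, not SkyPilot's actual code, and the sketch assumes that --purge is meant to force removal when the live-service check cannot run.

```python
from dataclasses import dataclass


@dataclass
class TeardownDecision:
    proceed: bool
    reason: str


def decide_controller_teardown(status: str, reachable: bool,
                               purge: bool) -> TeardownDecision:
    """Hypothetical decision logic for tearing down a serve controller."""
    if status == "UP" and reachable:
        # Normal path: live services can be checked before teardown.
        return TeardownDecision(True, "controller is UP and reachable")
    if purge:
        # Assumed intent of --purge: skip the live-service check and
        # remove the cluster record even though the controller is gone.
        return TeardownDecision(True, "--purge given; skipping service check")
    # Without --purge, refusing is reasonable because services may still run.
    return TeardownDecision(False, "controller unreachable; retry with --purge")


# The situation in this report: INIT state, unreachable, --purge passed.
print(decide_controller_teardown("INIT", reachable=False, purge=True))
```

In the logs above, both invocations end in the same NotSupportedError, so the flag does not seem to change the outcome in this state.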

Also, I don't think a K8S pod can be restarted. But the error message prompts us to run sky start, which fails with:

andyl@andylizf-dev-server ~/skypilot (fix-aws-name) [1]> sky start sky-serve-controller-e2dc6f0f     (sky) 
Restarting 1 cluster: sky-serve-controller-e2dc6f0f. Proceed? [Y/n]: 
Traceback (most recent call last):
  File "/home/andyl/miniconda3/envs/sky/bin/sky", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/andyl/miniconda3/envs/sky/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/miniconda3/envs/sky/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/utils/common_utils.py", line 366, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/cli.py", line 838, in invoke
    return super().invoke(ctx)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/miniconda3/envs/sky/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/miniconda3/envs/sky/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/miniconda3/envs/sky/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/cli.py", line 2581, in start
    core.start(name,
  File "/home/andyl/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/core.py", line 379, in start
    return _start(cluster_name,
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/core.py", line 305, in _start
    handle = backend.provision(dummy_task,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/utils/common_utils.py", line 366, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/backends/backend.py", line 84, in provision
    return self._provision(task, to_provision, dryrun, stream_logs,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/backends/cloud_vm_ray_backend.py", line 2869, in _provision
    config_dict = retry_provisioner.provision_with_retries(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/andyl/skypilot/sky/backends/cloud_vm_ray_backend.py", line 2039, in provision_with_retries
    assert (clouds.CloudImplementationFeatures.STOP
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: set()
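The traceback ends in a bare assertion that appears to involve the STOP feature, which tells the user very little. A minimal sketch of a friendlier check is shown below. `Feature`, `RestartNotSupportedError`, and `check_restartable` are hypothetical names used for illustration, not SkyPilot's API, and the sketch assumes the failing condition is that the cloud does not support stopping clusters.

```python
from enum import Enum, auto


class Feature(Enum):
    """Hypothetical stand-in for a cloud capability flag."""
    STOP = auto()


class RestartNotSupportedError(Exception):
    """Hypothetical error raised instead of a bare AssertionError."""


def check_restartable(cloud_name: str, supported: set) -> None:
    """Raise a readable error if the cloud cannot stop and restart clusters."""
    if Feature.STOP not in supported:
        raise RestartNotSupportedError(
            f"{cloud_name} clusters cannot be stopped, so they cannot be "
            "restarted with 'sky start'.")


try:
    # Assumption: Kubernetes reports no STOP support in this scenario.
    check_restartable("Kubernetes", set())
except RestartNotSupportedError as e:
    print(e)
```

With a check like this, the hint to run sky start could also be dropped from the earlier error message for clouds where restarting is not possible.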

All of this happens when sky status shows:

andyl@andylizf-dev-server ~/skypilot (fix-aws-name) [1]> sky status                                  (sky) 
Clusters
NAME                           LAUNCHED     RESOURCES                                                               STATUS  AUTOSTOP  COMMAND                         
sky-serve-controller-e2dc6f0f  10 mins ago  1x Kubernetes(2CPU--2GB, cpus=2+, disk_size=20, ports=['30001-30020'])  INIT    -         sky serve up ./_a/task.yaml...  

Managed jobs
No in-progress managed jobs. (See: sky jobs -h)

Services
Failed to connect to serve controller, please try again later.