
[Serve] Make SkyServe Controller More Persistent #4563

Open
andylizf opened this issue Jan 15, 2025 · 0 comments · May be fixed by #4564
Design Docs: Persistent Service in SkyPilot

Overview

The Persistent Service feature aims to enhance SkyServe's reliability by enabling service recovery after controller failures or system shutdowns. This design addresses how to restore services with their previous states preserved, making services more resilient to interruptions and allowing for planned downtimes.

Service Recovery Flow

  1. Service Recovery Trigger:

    • Primary Option (our current implementation): a Kubernetes deployment with automatic recovery via pod initialization
    • Cloud Provider Option: cloud-native serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) for monitoring and auto-recovery
    • Fallback Option (TBD): reuse the sky serve up command for manual recovery in unsupported environments when the service is in the FAILED_CONTROLLER state
  2. When the controller is launched:
    Check the database for an existing record of the service; if one exists, the controller enters recovery mode.

  3. Recovery Process:

    • Skip service registration
    • Load the Controller and LoadBalancer ports from serve_state
      • The LoadBalancer port must remain unchanged so that the service endpoint stays stable
    • Restart the Controller (it appears to be stateless already)
    • Load the AutoScaler state from the database
      • request_timestamps
    • Reconstruct the LoadBalancer with persisted replica URLs synced from the Controller
      • client_pool (TBD; alternatively, wait for it to be rebuilt by syncing from the Controller, which takes longer to restore the previous state)
      • _request_aggregator (TBD, or simply ignore it, since it is flushed on every sync with the Controller)
    • Restore the ReplicaManager process pools from the database
      • launch_process_pool and down_process_pool
      • We could check whether these processes have become zombies
      • We only store which clusters are being launched and start a new launch on recovery, since sky launch and sky down are already robust enough
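The recovery-mode check in step 2 can be sketched as follows. This is a minimal, self-contained illustration: the table schema, column names, and helpers (`init_db`, `load_service`, `start_controller`) are hypothetical and do not reflect SkyPilot's actual serve_state implementation, which persists far more state (autoscaler timestamps, replica records) than shown here.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified stand-in for SkyServe's serve_state database.
# The table and column names below are illustrative only.

@dataclass
class ServiceRecord:
    name: str
    controller_port: int
    load_balancer_port: int

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS services ("
        "name TEXT PRIMARY KEY, controller_port INT, load_balancer_port INT)"
    )

def load_service(conn: sqlite3.Connection, name: str) -> Optional[ServiceRecord]:
    row = conn.execute(
        "SELECT name, controller_port, load_balancer_port "
        "FROM services WHERE name = ?", (name,)
    ).fetchone()
    return ServiceRecord(*row) if row else None

def start_controller(conn: sqlite3.Connection, name: str,
                     default_ports: tuple) -> ServiceRecord:
    """On launch, decide between a fresh start and recovery mode."""
    record = load_service(conn, name)
    if record is not None:
        # Recovery mode: skip registration and reuse the persisted ports,
        # so the LoadBalancer endpoint stays stable across restarts.
        return record
    # Fresh start: register the service with the given ports.
    record = ServiceRecord(name, *default_ports)
    conn.execute("INSERT INTO services VALUES (?, ?, ?)",
                 (record.name, record.controller_port, record.load_balancer_port))
    return record
```

A second call to `start_controller` with different default ports returns the originally registered ports, which is the behavior the design requires for the LoadBalancer endpoint.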

Future Work

The system should also handle recovery from interrupted operations, such as an update_service that was terminated midway. The impact is likely negligible, but handling it would make the implementation more robust with respect to thread safety.

We should also persist unfinished requests so that, on recovery, we can print something like Unfinished requests: …. This would be a nice add-on once the Controller and LoadBalancer are separated: in-flight HTTP requests would no longer be interrupted by a Controller failure, which can otherwise confuse users.
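The unfinished-request reporting described above could work roughly as follows. This is a sketch with hypothetical names (`dump_unfinished`, `report_on_recovery`); assume the serialized blob is stored in the serve_state database alongside the other persisted service state.

```python
import json

# Hypothetical sketch: serialize in-flight request IDs so a recovered
# controller can report them. Names are illustrative, not SkyPilot API.

def dump_unfinished(request_ids: set) -> str:
    """Serialize the set of in-flight request IDs (e.g. into serve_state)."""
    return json.dumps(sorted(request_ids))

def report_on_recovery(blob: str) -> list:
    """On recovery, load the persisted IDs and print a notice for the user."""
    pending = json.loads(blob) if blob else []
    if pending:
        print(f"Unfinished requests: {', '.join(pending)}")
    return pending
```

Sorting the IDs before serialization keeps the recovery message deterministic, which makes the behavior easier to test.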


This design provides a foundation for making SkyServe services more resilient while enabling users to manage service life cycles more effectively, particularly for cost optimization during off-hours.
