
[Serve] Make SkyServe Controller More Persistent #4563

Open
andylizf opened this issue Jan 15, 2025 · 0 comments · May be fixed by #4564
Design Docs: Persistent Service in SkyPilot

Overview

The Persistent Service feature aims to enhance SkyServe's reliability by enabling service recovery after controller failures or system shutdowns. This design addresses how to restore services with their previous states preserved, making services more resilient to interruptions and allowing for planned downtimes.

Service Recovery Flow

  1. Service Recovery Trigger:

    • Primary Option (our current implementation): a Kubernetes deployment with automatic recovery via pod initialization
    • Cloud Provider Option: cloud-native serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) for monitoring and auto-recovery
    • Fallback Option (TBD): reuse the sky serve up command for manual recovery in unsupported environments when the service is in the FAILED_CONTROLLER state
  2. When the controller is launched:
    Check the database for an existing record of the service; if one exists, the controller enters recovery mode.

  3. Recovery Process:

    • Skip service registration
    • Load the Controller and LoadBalancer ports from serve_state
      • The LoadBalancer port must remain unchanged so that the service endpoint stays stable
    • Restart the Controller (it appears to be stateless already)
    • Load the AutoScaler state from the database
      • request_timestamps
    • Reconstruct the LoadBalancer with persisted replica URLs synced from the Controller
      • client_pool (TBD; alternatively, wait for it to be rebuilt by syncing from the Controller, which takes longer to restore the previous state)
      • _request_aggregator (TBD, or simply ignore it, since it is flushed on every sync with the Controller)
    • Restore the ReplicaManager process pools from the database
      • launch_process_pool and down_process_pool
      • We could check whether these processes have become zombies
      • We only store which clusters are being launched and start a new launch on recovery, since sky launch and sky down are already robust enough
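The recovery-mode check in step 2 can be sketched as follows. This is a minimal, self-contained illustration: the table schema, column names, and helpers (`init_db`, `load_service`, `start_controller`) are hypothetical and do not reflect SkyPilot's actual serve_state implementation, which persists far more state (autoscaler timestamps, replica records) than shown here.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified stand-in for SkyServe's serve_state database.
# The table and column names below are illustrative only.

@dataclass
class ServiceRecord:
    name: str
    controller_port: int
    load_balancer_port: int

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS services ("
        "name TEXT PRIMARY KEY, controller_port INT, load_balancer_port INT)"
    )

def load_service(conn: sqlite3.Connection, name: str) -> Optional[ServiceRecord]:
    row = conn.execute(
        "SELECT name, controller_port, load_balancer_port "
        "FROM services WHERE name = ?", (name,)
    ).fetchone()
    return ServiceRecord(*row) if row else None

def start_controller(conn: sqlite3.Connection, name: str,
                     default_ports: tuple) -> ServiceRecord:
    """On launch, decide between a fresh start and recovery mode."""
    record = load_service(conn, name)
    if record is not None:
        # Recovery mode: skip registration and reuse the persisted ports,
        # so the LoadBalancer endpoint stays stable across restarts.
        return record
    # Fresh start: register the service with the given ports.
    record = ServiceRecord(name, *default_ports)
    conn.execute("INSERT INTO services VALUES (?, ?, ?)",
                 (record.name, record.controller_port, record.load_balancer_port))
    return record
```

A second call to `start_controller` with different default ports returns the originally registered ports, which is the behavior the design requires for the LoadBalancer endpoint.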

Future Work

The system should also handle recovery from interrupted operations, such as an update_service that was terminated midway. The impact is likely negligible, but handling it would make the implementation more robust with respect to thread safety.

We should also persist unfinished requests so that, on recovery, we can print something like Unfinished requests: …. This would be a nice add-on once the Controller and LoadBalancer are separated: in-flight HTTP requests would no longer be interrupted by a Controller failure, which can otherwise confuse users.
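The unfinished-request reporting described above could work roughly as follows. This is a sketch with hypothetical names (`dump_unfinished`, `report_on_recovery`); assume the serialized blob is stored in the serve_state database alongside the other persisted service state.

```python
import json

# Hypothetical sketch: serialize in-flight request IDs so a recovered
# controller can report them. Names are illustrative, not SkyPilot API.

def dump_unfinished(request_ids: set) -> str:
    """Serialize the set of in-flight request IDs (e.g. into serve_state)."""
    return json.dumps(sorted(request_ids))

def report_on_recovery(blob: str) -> list:
    """On recovery, load the persisted IDs and print a notice for the user."""
    pending = json.loads(blob) if blob else []
    if pending:
        print(f"Unfinished requests: {', '.join(pending)}")
    return pending
```

Sorting the IDs before serialization keeps the recovery message deterministic, which makes the behavior easier to test.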


This design provides a foundation for making SkyServe services more resilient while enabling users to manage service life cycles more effectively, particularly for cost optimization during off-hours.
