The Persistent Service feature aims to enhance SkyServe's reliability by enabling service recovery after controller failures or system shutdowns. This design addresses how to restore services with their previous states preserved, making services more resilient to interruptions and allowing for planned downtimes.
Service Recovery Flow
Service Recovery Trigger:
Primary Option (and our current implementation): Kubernetes deployment with automatic recovery via pod initialization
Cloud Provider Option: Cloud-native serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) for monitoring and auto-recovery
Fallback Option (TBD): Reuse the sky serve up command for manual recovery in unsupported environments, when the service is in the FAILED_CONTROLLER state
When the controller is launched:
Check the database for an existing record of the service; if one exists, the controller is starting in recovery mode.
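The check above can be sketched as follows. This is a minimal illustration, not SkyServe's actual code: the `services` table, its columns, and the functions `init_db` and `start_controller` are hypothetical names chosen for the example, and SQLite stands in for whatever store serve_state uses.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: one row per registered service.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS services ('
        'name TEXT PRIMARY KEY, '
        'controller_port INTEGER, '
        'load_balancer_port INTEGER)')
    conn.commit()


def start_controller(conn, service_name, controller_port, lb_port):
    """Decide between a fresh start and recovery mode.

    Returns (mode, controller_port, load_balancer_port).
    """
    row = conn.execute(
        'SELECT controller_port, load_balancer_port '
        'FROM services WHERE name = ?', (service_name,)).fetchone()
    if row is not None:
        # The service already exists: skip registration and reuse the
        # persisted ports so the endpoint does not change.
        return ('recover', row[0], row[1])
    # First launch: register the service with the requested ports.
    conn.execute('INSERT INTO services VALUES (?, ?, ?)',
                 (service_name, controller_port, lb_port))
    conn.commit()
    return ('fresh', controller_port, lb_port)
```

In this sketch, a second launch of the same service ignores any newly proposed ports and returns the persisted ones, which matches the requirement that the LoadBalancer ports stay unchanged.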
Recovery Process:
Skip service registration
Load Controller and LoadBalancer ports from serve_state
LoadBalancer ports must remain unchanged
Restart the Controller (it appears to be stateless already)
Load AutoScaler state from database
request_timestamps
Reconstruct LoadBalancer with persisted replica URLs synced from Controller
client_pool (TBD; alternatively, wait for the next sync from the Controller, which takes longer to return to the previous state)
_request_aggregator (TBD; it can probably be ignored, since it is flushed on every sync with the Controller)
Restore ReplicaManager process pools from database
launch_process_pool and down_process_pool
We could check whether the processes in these pools have become zombies.
We simply store which clusters were launching and start a new launch on recovery, since sky launch and sky down are already robust enough
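A rough sketch of two of the steps above: persisting the AutoScaler's request_timestamps, and recording which replicas were mid-launch so they can be relaunched on recovery. Everything here is an assumption for illustration: the table names, the function names, the JSON encoding, and the choice to drop timestamps older than a window on load are not taken from SkyServe.

```python
import json
import sqlite3
import time


def init_db(conn):
    # Hypothetical tables, not SkyServe's actual schema.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS autoscaler_state ('
        'service TEXT PRIMARY KEY, request_timestamps TEXT)')
    conn.execute(
        'CREATE TABLE IF NOT EXISTS launching_replicas ('
        'service TEXT, replica_id INTEGER, '
        'PRIMARY KEY (service, replica_id))')
    conn.commit()


def save_request_timestamps(conn, service, timestamps):
    # Store the list as JSON; overwrite any previous snapshot.
    conn.execute('INSERT OR REPLACE INTO autoscaler_state VALUES (?, ?)',
                 (service, json.dumps(timestamps)))
    conn.commit()


def load_request_timestamps(conn, service, window_seconds, now=None):
    """Load persisted timestamps, dropping those older than the window.

    Filtering by window is an assumption: stale timestamps from before
    the downtime would otherwise inflate the measured request rate.
    """
    now = time.time() if now is None else now
    row = conn.execute(
        'SELECT request_timestamps FROM autoscaler_state WHERE service = ?',
        (service,)).fetchone()
    if row is None:
        return []
    return [t for t in json.loads(row[0]) if now - t <= window_seconds]


def mark_launching(conn, service, replica_id):
    # Record that a launch started, before handing it to the process pool.
    conn.execute('INSERT OR IGNORE INTO launching_replicas VALUES (?, ?)',
                 (service, replica_id))
    conn.commit()


def mark_launch_done(conn, service, replica_id):
    # Remove the record once the launch has finished.
    conn.execute(
        'DELETE FROM launching_replicas WHERE service = ? AND replica_id = ?',
        (service, replica_id))
    conn.commit()


def replicas_to_relaunch(conn, service):
    """Replica IDs whose launch was still in flight when the controller died."""
    rows = conn.execute(
        'SELECT replica_id FROM launching_replicas '
        'WHERE service = ? ORDER BY replica_id', (service,)).fetchall()
    return [r[0] for r in rows]
```

On recovery, the controller would call `replicas_to_relaunch` and start a new launch for each returned ID instead of trying to reattach to the old processes, relying on launch and teardown being safe to repeat.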
Future Work
The system should handle recovery from interrupted operations, such as an ongoing update_service that is terminated abruptly. The impact is minor, but handling this case would make the system more robust with respect to thread safety.
We should persist unfinished requests so that recovery can print something like Unfinished requests: …. This would be a useful add-on once the Controller and LoadBalancer are separated: in-flight HTTP requests would not be interrupted when the Controller fails, which could otherwise confuse users.
This design provides a foundation for making SkyServe services more resilient while enabling users to manage service life cycles more effectively, particularly for cost optimization during off-hours.
Design Docs: Persistent Service in SkyPilot