Motivation
We are observing some sessions getting stuck in the `PREPARING` or `TERMINATING` status. This is a side effect of various agent-side problems, such as an invalid NAS mount status or a malfunctioning container engine. Currently the only way to resolve these sessions is to call the force termination API manually, but we could introduce a new timer architecture that collects them after a certain period of time.
Doing so would also resolve an issue where certain model services stay in the `DESTROYING` status indefinitely, which happens because the `scale_service()` scheduler refuses to force-terminate sessions whose state is in doubt.
Objective
Let problematic sessions be collected automatically after a given period of time.
Expected Sub Issue
Add a new configuration variable to `ScalingGroup` that represents the grace period for each problematic session
Add a new scheduler architecture that iterates through all problematic sessions and tries to force-terminate them
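
The two sub-issues above could fit together roughly as in the following sketch. It is only an illustration under stated assumptions, not the actual implementation: `ScalingGroupOpts`, `stuck_session_grace_period`, `SessionRow`, `status_changed_at`, and the `force_terminate` callback are all hypothetical names standing in for the real `ScalingGroup` configuration, the session records, and the force termination API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, Mapping

# Statuses this issue describes as getting stuck.
STUCK_STATUSES = frozenset({"PREPARING", "TERMINATING"})


@dataclass
class ScalingGroupOpts:
    # Hypothetical option name and default; the issue only asks for
    # "a grace period" on ScalingGroup.
    stuck_session_grace_period: timedelta = timedelta(minutes=30)


@dataclass
class SessionRow:
    # Hypothetical, simplified stand-in for a session record.
    id: str
    status: str
    status_changed_at: datetime
    scaling_group: str


def find_stuck_sessions(
    sessions: Iterable[SessionRow],
    opts_by_group: Mapping[str, ScalingGroupOpts],
    now: datetime,
) -> list[SessionRow]:
    """Return sessions that stayed in a stuck-prone status past the grace period."""
    stuck = []
    for session in sessions:
        if session.status not in STUCK_STATUSES:
            continue
        opts = opts_by_group.get(session.scaling_group)
        if opts is None:
            # No grace period configured for this group: leave the session alone.
            continue
        if now - session.status_changed_at >= opts.stuck_session_grace_period:
            stuck.append(session)
    return stuck


def collect_stuck_sessions(
    sessions: Iterable[SessionRow],
    opts_by_group: Mapping[str, ScalingGroupOpts],
    force_terminate: Callable[[str], None],
    now: datetime,
) -> list[str]:
    """One timer tick: try to force-terminate every stuck session.

    Returns the IDs that were terminated successfully. Failures are skipped
    so that the next tick can retry them.
    """
    terminated = []
    for session in find_stuck_sessions(sessions, opts_by_group, now):
        try:
            force_terminate(session.id)
        except Exception:
            continue
        terminated.append(session.id)
    return terminated
```

A periodic timer would call `collect_stuck_sessions()` with the current time on each tick. Keeping the grace period per scaling group lets operators choose a longer window for groups where slow NAS mounts or large image pulls make a long `PREPARING` phase normal.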