implement session-level force-termination grace period #3426

kyujin-cho · 2025-01-10T07:59:31Z

Motivation

We are observing some sessions getting stuck in the PREPARING or TERMINATING status. These are side effects of various problems on the agent side: an invalid NAS mount status, a problematic container engine, and so on. Currently the right way to resolve them is to call the force termination API manually, but we could introduce a new timer architecture that collects those sessions after a certain period of time.

Doing so will also resolve an issue where certain model services stay in the DESTROYING status indefinitely, which is caused by the scale_service() scheduler refusing to force terminate sessions in doubt.

Objective

Let problematic sessions be collected automatically after a given period of time.

Expected Sub Issue

Add a new configuration variable to ScalingGroup that represents the grace period for each problematic session
Add a new scheduler architecture that iterates through all problematic sessions and tries to force terminate them
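
The two sub-issues above could fit together roughly as in the sketch below. It is only an illustration under stated assumptions, not the actual Backend.AI implementation: `SessionInfo`, `find_expired_sessions`, and the `grace_periods` mapping are hypothetical names, the set of "stuck" statuses is taken from the Motivation section, and a missing grace period is assumed to mean that automatic collection is disabled for that scaling group.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Mapping, Optional

# Statuses named in the Motivation section; the real set may differ.
STUCK_STATUSES = frozenset({"PREPARING", "TERMINATING"})


@dataclass(frozen=True)
class SessionInfo:
    # Hypothetical, simplified view of a session record.
    session_id: str
    scaling_group: str
    status: str
    status_changed_at: datetime


def find_expired_sessions(
    sessions: Iterable[SessionInfo],
    grace_periods: Mapping[str, Optional[timedelta]],
    now: datetime,
) -> List[SessionInfo]:
    """Return sessions that have stayed in a stuck status longer than
    the grace period configured for their scaling group."""
    expired = []
    for session in sessions:
        if session.status not in STUCK_STATUSES:
            continue
        # Assumption: no configured grace period means "never auto-collect".
        grace = grace_periods.get(session.scaling_group)
        if grace is None:
            continue
        if now - session.status_changed_at > grace:
            expired.append(session)
    return expired
```

A periodic timer would call `find_expired_sessions()` and pass each returned session to the existing force termination path. Keeping the selection step a pure function makes the grace-period rule easy to unit test without a database or a live agent.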
