implement session-level force-termination grace period #3426

Open
kyujin-cho opened this issue Jan 10, 2025 — with Lablup-Issue-Syncer · 0 comments

kyujin-cho commented Jan 10, 2025

Motivation  

We are observing sessions getting stuck in the PREPARING or TERMINATING status. This is a side effect of various agent-side problems, such as an invalid NAS mount status or a problematic container engine. Currently the way to resolve these cases is to call the force termination API manually, but we could introduce a new timer architecture that collects such sessions after a certain period of time.

Doing so will also resolve an issue where certain model services stay in the DESTROYING status indefinitely, which is caused by the scale_service() scheduler refusing to force terminate sessions in doubt.

Objective

Let problematic sessions be collected automatically after a given period of time.

Expected Sub Issue

  • Add a new configuration variable to ScalingGroup that represents the grace period for each problematic session
  • Add a new scheduler architecture that iterates through all problematic sessions and tries to force terminate them
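
A rough sketch of how the two sub issues could fit together, for discussion only. Everything here is an assumption and not existing Backend.AI code: `SessionInfo`, `status_changed_at`, `find_expired_sessions()`, and the per-scaling-group `grace_periods` mapping are hypothetical names. The sketch only selects candidates; the actual force termination call is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Mapping

# Statuses this issue names as places where sessions get stuck.
STUCK_STATUSES = frozenset({"PREPARING", "TERMINATING"})


@dataclass(frozen=True)
class SessionInfo:
    """Hypothetical stand-in for a session row (not the real model)."""
    id: str
    status: str
    status_changed_at: datetime
    scaling_group: str


def find_expired_sessions(
    sessions: Iterable[SessionInfo],
    grace_periods: Mapping[str, timedelta],
    now: datetime,
) -> list[SessionInfo]:
    """Return stuck sessions whose grace period has elapsed.

    `grace_periods` maps a scaling group name to its (assumed) grace
    period setting. Groups without an entry are skipped, so the
    behavior stays opt-in per scaling group.
    """
    expired = []
    for session in sessions:
        if session.status not in STUCK_STATUSES:
            continue
        grace = grace_periods.get(session.scaling_group)
        if grace is None:
            continue  # no grace period configured for this group
        if now - session.status_changed_at >= grace:
            expired.append(session)
    return expired
```

A periodic timer would call this with the current time and pass each returned session to the existing force termination path. Keeping the selection logic as a pure function makes it easy to unit test without a database or agent.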