Exclude SYSTEM sessions from resource allocation while keeping per-container resource limits #1672

Open
achimnol opened this issue Oct 31, 2023 · 0 comments
Labels
effort:hard Need to understand many components / a large extent of contextual or historical information. urgency:3 Must be finished within a certain time frame.
Milestone

Comments

Member

achimnol commented Oct 31, 2023

Currently, SYSTEM session containers also count toward the resource allocation, which sometimes confuses our users and admins when they try to figure out how many resources remain available for computing.

We should make it intuitive to see the remaining "usable" resources. SYSTEM session containers make this difficult because they are hidden but still affect the scheduling process. It is hard to design a UI that displays the remaining "usable" resources when the actual agent occupancy differs from what the user sees in the UI's resource indicators. The problem becomes more complicated if we consider the hide_agents=false configuration.

Let's resolve this situation by giving SYSTEM session containers only resource limits, without resource allocations:

  • Assume that they DO NOT use accelerators and require only a relatively small amount of CPU cores and memory, which allows oversubscription.
  • Just keep track of the maximum number of SYSTEM session containers per agent.
  • Just use the minimum CPU/MEM resource requirements to set the resource limits.

Note

  • Resource allocation: Assign an exclusive, dedicated portion of resources which the scheduler will treat as "consumed". (manager/agent-level scheduling)
  • Resource limit: Ensure the container cannot use more resources than the configured limit. (container's cgroup configuration, jail, hook, and some env-vars)
  • Normal Backend.AI session containers have both.
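To illustrate the distinction, here is a rough sketch of how the remaining "usable" resources of an agent could be computed once SYSTEM containers carry limits only. The `SessionType` enum, the `Container` class, and `remaining_usable` are simplified stand-ins written for this example, not the actual manager or agent data structures.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class SessionType(Enum):
    # Hypothetical stand-in for the real session type enum.
    INTERACTIVE = "interactive"
    SYSTEM = "system"


@dataclass(frozen=True)
class Container:
    session_type: SessionType
    cpu: Decimal  # CPU cores enforced as the container's limit
    mem: Decimal  # memory (GiB here) enforced as the container's limit

    @property
    def allocated(self) -> dict[str, Decimal]:
        # SYSTEM containers keep their limits but count as zero allocation.
        if self.session_type is SessionType.SYSTEM:
            return {"cpu": Decimal(0), "mem": Decimal(0)}
        return {"cpu": self.cpu, "mem": self.mem}


def remaining_usable(
    capacity: dict[str, Decimal],
    containers: list[Container],
) -> dict[str, Decimal]:
    """Subtract only allocations (not limits) from the agent capacity."""
    remaining = dict(capacity)
    for container in containers:
        for slot, amount in container.allocated.items():
            remaining[slot] -= amount
    return remaining
```

With this model, the value shown in the UI's resource indicators and the value the scheduler works with would be the same number, because hidden SYSTEM containers no longer subtract from it.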

This change to the conceptual model makes it possible to merge #710 into the SFTP-only resource group and the hidden agents running alongside the storage proxy. SFTP sessions and filebrowser sessions could then be treated in the same way.

Expected scope:

  • We may need to add many branches in the scheduler to apply the different semantics to pending SYSTEM sessions. For instance, we should skip or override the normal scheduler and agent selection strategy for them.
    • By default, we could apply simple load balancing to SYSTEM sessions by selecting the agent with the fewest running SYSTEM sessions within the target resource group.
  • After implementation, we could exclude SYSTEM sessions from the normal session listing APIs, except when a new option explicitly requests that they be included.
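The default load balancing and the listing filter described above could look roughly like the following sketch. The function names, the `max_per_agent` parameter, the `include_system` option, and the dict-based session records are all assumptions made for illustration; the real scheduler and listing APIs have their own data models.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence


def select_agent_for_system_session(
    candidates: Sequence[str],
    running_system_agents: Iterable[str],
    max_per_agent: int,
) -> Optional[str]:
    """Pick the agent with the fewest running SYSTEM sessions.

    `candidates` are agent IDs in the target resource group, and
    `running_system_agents` yields one agent ID per running SYSTEM session.
    """
    counts = Counter(running_system_agents)
    # Agents that already reached the per-agent maximum are not eligible.
    eligible = [a for a in candidates if counts[a] < max_per_agent]
    if not eligible:
        return None
    # Break ties by agent ID so that the choice is deterministic.
    return min(eligible, key=lambda a: (counts[a], a))


def list_sessions(
    sessions: Iterable[dict],
    *,
    include_system: bool = False,
) -> list[dict]:
    """Hide SYSTEM sessions unless the caller explicitly asks for them."""
    return [
        s for s in sessions
        if include_system or s["type"] != "system"
    ]
```

Note that the selection looks only at session counts and never at free CPU or memory, which is consistent with SYSTEM sessions holding limits but no allocation.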
achimnol added the type:feature, effort:hard, and urgency:3 labels on Oct 31, 2023
achimnol added this to the 24.03 milestone on Oct 31, 2023
achimnol changed the title from "Exclude SYSTEM sessions from resource allocation while keeping the resource limits" to "Exclude SYSTEM and DIRECT_ACCESS sessions from resource allocation while keeping the resource limits" on Aug 21, 2024
achimnol changed the title from "Exclude SYSTEM and DIRECT_ACCESS sessions from resource allocation while keeping the resource limits" to "Exclude SYSTEM and DIRECT_ACCESS sessions from resource allocation while keeping per-container resource limits" on Aug 21, 2024
achimnol changed the title from "Exclude SYSTEM and DIRECT_ACCESS sessions from resource allocation while keeping per-container resource limits" to "Exclude SYSTEM sessions from resource allocation while keeping per-container resource limits" on Sep 2, 2024
achimnol removed the type:feature label on Oct 18, 2024