refactor: Split PendingSession Scheduler into PendingSession Scheduler and AgentSelector #1655

Merged
merged 46 commits into from
Sep 16, 2024

Conversation

jopemachine
Member

@jopemachine jopemachine commented Oct 25, 2023

Follow up PR of #1405.

  • Extract the agent selector subsystem from the session scheduler.
  • Refactor the roundrobin flag to AgentSelectionStrategy.

Why

The existing session scheduler code was confusing because it mixed the session scheduling logic with the agent selection logic. This PR removes the confusion by separating the responsibilities into two classes: Scheduler and AgentSelector.

Since most of the logic related to the RoundRobin flag pertains to the AgentSelector, the flag has been changed to a policy within the AgentSelectionStrategy.

Refactoring Details

Introduce AgentSelector with Generic ABC

  • AbstractAgentSelector(Generic[T_ResourceGroupState], ABC)
    where T_ResourceGroupState = TypeVar(..., bound=ResourceGroupState)
    • BaseAgentSelector(AbstractAgentSelector[T_ResourceGroupState])
      • LegacyAgentSelector(BaseAgentSelector[NullAgentSelectorState])
      • ConcentratedAgentSelector(BaseAgentSelector[NullAgentSelectorState])
      • DispersedAgentSelector(BaseAgentSelector[NullAgentSelectorState])
      • RoundRobinAgentSelector(BaseAgentSelector[RRAgentSelectorState])

These classes materialize the logic branches that were hard-coded in prior versions. Each has a resource group state store object to load/store persistent states across subsequent scheduler invocations. The type of the resource group state is determined by the type argument, so that the agent selection logic can read/write it without extra type casting or conversion.
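A minimal sketch of how the type argument ties a selector to its state type. It rests on several assumptions: dataclasses stand in for the pydantic models, the selector holds its state directly instead of going through a state store, and `select_agent()` taking a list of agent IDs is a hypothetical signature rather than the PR's actual API. The state is also collapsed to a single index here.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Sequence, TypeVar


@dataclass
class ResourceGroupState(ABC):
    """Stand-in for the pydantic-based ResourceGroupState (assumption)."""


@dataclass
class RRAgentSelectorState(ResourceGroupState):
    # Simplified: a single index instead of per-architecture entries.
    next_index: int = 0


T_ResourceGroupState = TypeVar("T_ResourceGroupState", bound=ResourceGroupState)


class AbstractAgentSelector(Generic[T_ResourceGroupState], ABC):
    def __init__(self, state: T_ResourceGroupState) -> None:
        # The attribute type follows the type argument of the subclass,
        # so subclasses can use it without casting.
        self.state = state

    @abstractmethod
    def select_agent(self, agent_ids: Sequence[str]) -> str | None:
        """Hypothetical method: pick one agent ID or None."""


class RoundRobinAgentSelector(AbstractAgentSelector[RRAgentSelectorState]):
    def select_agent(self, agent_ids: Sequence[str]) -> str | None:
        if not agent_ids:
            return None
        ordered = sorted(agent_ids)  # stable order across invocations
        idx = self.state.next_index % len(ordered)
        # Persist the position for the next scheduler invocation.
        self.state.next_index = (idx + 1) % len(ordered)
        return ordered[idx]
```

Because `RoundRobinAgentSelector` binds the type variable to `RRAgentSelectorState`, a type checker knows that `self.state.next_index` exists, which is the benefit the paragraph above describes.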

Introduce ResourceGroupState with ABC and type variable

  • ResourceGroupState(pydantic.BaseModel, ABC)
    • NullAgentSelectorState(ResourceGroupState)
    • RRAgentSelectorState(ResourceGroupState)
  • AbstractResourceGroupStateStore(Generic[T_ResourceGroupState], ABC)
    declares load(), store(), and reset() methods for per-resource-group states.
    • DefaultResourceGroupStateStore(AbstractResourceGroupStateStore[T_ResourceGroupState])
    • InMemoryResourceGroupStateStore(AbstractResourceGroupStateStore[T_ResourceGroupState])

It is a thin abstraction that injects an interface to load/store arbitrary states, represented as pydantic models, from/to an internal state storage. Currently the default implementation uses etcd (shared_config) as the storage backend with the key prefix resource-group-states. The in-memory version is for writing tests.

Note

The prior implementation stored the RR state in the roundrobin_states etcd key. This key is neither removed nor migrated in this PR. Admins may delete it manually.

The ABC ResourceGroupState mandates implementation of the classmethod create_empty_state() so that the system can always fill the initial value when the administrator changes to a new agent selector.

Since there may be multiple agent selectors and possibly other per-resource-group entities that want to use resource group states, the load(), store(), and reset() methods in the state store take an explicit state_name argument to distinguish them within each resource group. The state name is up to the caller of the state store API, which in this PR is the agent selector implementation.
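The interplay of create_empty_state() and the state_name argument can be sketched as follows. This is an illustration under assumptions, not the PR's code: `SketchStateStore` is a hypothetical in-memory store, its methods are synchronous, the argument order is a guess, and a dataclass stands in for the pydantic model.

```python
from __future__ import annotations

import copy
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar


class ResourceGroupState(ABC):
    @classmethod
    @abstractmethod
    def create_empty_state(cls) -> ResourceGroupState:
        """Return the initial value used when nothing is stored yet."""


@dataclass
class RRAgentSelectorState(ResourceGroupState):
    next_index: int = 0

    @classmethod
    def create_empty_state(cls) -> RRAgentSelectorState:
        return cls(next_index=0)


T = TypeVar("T", bound=ResourceGroupState)


class SketchStateStore(Generic[T]):
    """Hypothetical in-memory store keyed by (resource group, state name)."""

    def __init__(self, state_cls: type[T]) -> None:
        self._state_cls = state_cls
        self._states: dict[tuple[str, str], T] = {}

    def load(self, resource_group: str, state_name: str) -> T:
        key = (resource_group, state_name)
        if key not in self._states:
            # Nothing stored yet: fall back to the empty state.
            return self._state_cls.create_empty_state()
        return copy.copy(self._states[key])

    def store(self, resource_group: str, state_name: str, state: T) -> None:
        self._states[(resource_group, state_name)] = copy.copy(state)

    def reset(self, resource_group: str, state_name: str) -> None:
        self._states.pop((resource_group, state_name), None)
```

Two callers that use different state names in the same resource group never see each other's data, and a caller that has never stored anything still gets a usable initial state.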

Other

  • Move agent_selection_resource_priority into AbstractAgentSelector to avoid passing it to all related functions each time.
  • Fix wrong type declaration.
    • Replace KernelInfo with KernelRow in assign_agent_for_kernel.
  • Remove useless MOF scheduler.

Database migration

The migration script converts the roundrobin flag, if it is set to true, into a value of the agent_selection_strategy field in the scaling_groups.scheduler_opts column. It also adds a new empty object as the agent_selector_config field.
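The transformation described above can be sketched as a pure function over one scheduler_opts value. This is not the actual alembic migration: the function name is hypothetical, and dropping the legacy roundrobin key after conversion is an assumption the description does not confirm.

```python
from __future__ import annotations

from typing import Any


def upgrade_scheduler_opts(opts: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper mirroring the described migration step."""
    new_opts = dict(opts)  # leave the input untouched
    # Assumption: the legacy flag is removed once it has been converted.
    if new_opts.pop("roundrobin", False):
        new_opts["agent_selection_strategy"] = "roundrobin"
    # Every row gets an (initially empty) agent selector config.
    new_opts.setdefault("agent_selector_config", {})
    return new_opts
```

For example, `{"roundrobin": True}` would become `{"agent_selection_strategy": "roundrobin", "agent_selector_config": {}}` under these assumptions, while a row without the flag only gains the empty agent_selector_config.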

How to test the RoundRobin agent selection strategy

To manually test the round-robin mode, please refer to this section.

  1. To use the RoundRobin policy, set agent_selection_strategy in scaling_groups.scheduler_opts to roundrobin using the command below.
./backend.ai admin scaling-group update --scheduler-opts '{"agent_selection_strategy": "roundrobin"}' default
  2. Update the redis address in etcd by ./backend.ai mgr etcd put config/redis/addr <manager_addr>:8111
  3. Update the rpc-listen-addr and id in the agent section of each agent's agent.toml.
  4. Check that all agents are properly connected to the manager in the Backend.AI web UI.
  5. Create sessions to verify that they are evenly distributed among the agents and that the next_index of the round-robin state is correctly updated in etcd.
❯ ./backend.ai mgr etcd get --prefix resource-group-states/default
2024-09-17 00:26:39.261 INFO ai.backend.common.etcd [1871211] using etcd cluster from 127.0.0.1:8121 with namespace "local"
{
    "default": {
        "agselector.roundrobin": "{\"roundrobin_states\":{\"aarch64\":{\"next_index\":0}}}"
    }
}
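Note that the state value in the output above is itself a JSON-encoded string, so it needs a second decoding step before next_index can be inspected. A small sketch, assuming the dump has already been parsed into a Python dict; `read_next_indexes` is a hypothetical helper, not part of the Backend.AI CLI.

```python
import json

# The structure printed by the etcd command above, as a Python dict.
dump = {
    "default": {
        "agselector.roundrobin": '{"roundrobin_states":{"aarch64":{"next_index":0}}}'
    }
}


def read_next_indexes(
    dump: dict,
    resource_group: str = "default",
    state_name: str = "agselector.roundrobin",
) -> dict:
    """Return a mapping of architecture to next_index."""
    # The stored value is a JSON string, so decode it once more.
    state = json.loads(dump[resource_group][state_name])
    return {
        arch: entry["next_index"]
        for arch, entry in state["roundrobin_states"].items()
    }


print(read_next_indexes(dump))
```

Re-running this after each session creation should show next_index advancing for the architecture of the agents in use.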

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue

@github-actions github-actions bot added comp:manager Related to Manager component size:XL 500~ LoC labels Oct 25, 2023
@jopemachine jopemachine marked this pull request as ready for review October 26, 2023 06:37
@jopemachine jopemachine changed the title fix - Replace RoundRobin flag with AgentSelectionStrategy.RoundRobin (WIP) fix - Replace RoundRobin flag with AgentSelectionStrategy.RoundRobin Oct 26, 2023
@jopemachine jopemachine changed the title fix - Replace RoundRobin flag with AgentSelectionStrategy.RoundRobin fix - Replace roundrobin flag with AgentSelectionStrategy.RoundRobin strategy Oct 26, 2023
@jopemachine jopemachine changed the title fix - Replace roundrobin flag with AgentSelectionStrategy.RoundRobin strategy fix: Replace roundrobin flag with AgentSelectionStrategy.RoundRobin strategy Oct 26, 2023
@kyujin-cho kyujin-cho added this to the 24.03 milestone Oct 30, 2023
@achimnol achimnol modified the milestones: 24.03, 24.09 Jun 21, 2024
@jopemachine jopemachine added the type:refactor Refactor codes or add tests. label Aug 26, 2024
@jopemachine jopemachine marked this pull request as draft August 27, 2024 05:23
@jopemachine jopemachine changed the title fix: Replace roundrobin flag with AgentSelectionStrategy.RoundRobin strategy refactor: PendingSession Scheduler to PendingSession Scheduler and AgentSelector Aug 28, 2024
@jopemachine jopemachine changed the title refactor: PendingSession Scheduler to PendingSession Scheduler and AgentSelector refactor: Split PendingSession Scheduler into PendingSession Scheduler and AgentSelector Aug 28, 2024
@jopemachine jopemachine force-pushed the fix/improve-rr branch 2 times, most recently from 005bf74 to b49812c Compare August 30, 2024 04:38
@jopemachine jopemachine marked this pull request as ready for review August 30, 2024 06:44
@jopemachine jopemachine requested a review from achimnol August 30, 2024 07:21
@achimnol achimnol added the require:db-migration Automatically set when alembic migrations are added or updated label Sep 16, 2024
@achimnol achimnol added this pull request to the merge queue Sep 16, 2024
Merged via the queue into main with commit 6f9c9cb Sep 16, 2024
23 checks passed
@achimnol achimnol deleted the fix/improve-rr branch September 16, 2024 17:25
@jopemachine jopemachine restored the fix/improve-rr branch September 19, 2024 05:34
Labels
comp:manager Related to Manager component require:db-migration Automatically set when alembic migrations are added or updated size:XL 500~ LoC type:refactor Refactor codes or add tests.
3 participants