refactor: Split `PendingSession` Scheduler into `PendingSession` Scheduler and `AgentSelector` #1655

jopemachine · 2023-10-25T07:00:26Z

Follow-up PR to #1405.

Extract the agent selector subsystem from the session scheduler.
Refactor the roundrobin flag to AgentSelectionStrategy.

Why

The existing session scheduler code was confusing because it mixed session scheduling logic with agent selection logic. This PR removes the confusion by separating the responsibilities into two classes: Scheduler and AgentSelector.

Since most of the logic related to the RoundRobin flag pertains to the AgentSelector, the flag is now a policy within AgentSelectionStrategy.

Refactoring Details

Introduce `AgentSelector` with Generic ABC

AbstractAgentSelector(Generic[T_ResourceGroupState], ABC)
where T_ResourceGroupState = TypeVar(..., bound=ResourceGroupState)
- BaseAgentSelector(AbstractAgentSelector[T_ResourceGroupState])
  - LegacyAgentSelector(BaseAgentSelector[NullAgentSelectorState])
  - ConcentratedAgentSelector(BaseAgentSelector[NullAgentSelectorState])
  - DispersedAgentSelector(BaseAgentSelector[NullAgentSelectorState])
  - RoundRobinAgentSelector(BaseAgentSelector[RRAgentSelectorState])

These classes materialize the logic branches that were hard-coded in prior versions. Each has a resource group state store object to load and store persistent state across subsequent scheduler invocations. The type of the resource group state is determined by the type argument, so the agent selection logic can read and write it without extra type casting or conversion.
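The effect of the type argument can be illustrated with a minimal sketch. All names below (`DemoState`, `DemoSelector`, and so on) are hypothetical stand-ins rather than the classes in this PR, the states are plain dataclasses instead of pydantic models, and the selection logic is deliberately simplified.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Sequence, TypeVar


@dataclass
class DemoState:
    """Hypothetical base for per-resource-group states."""


@dataclass
class DemoNullState(DemoState):
    """State for selectors that keep nothing between invocations."""


@dataclass
class DemoRRState(DemoState):
    """State for a round-robin selector: the index to use next."""

    next_index: int = 0


T_State = TypeVar("T_State", bound=DemoState)


class DemoSelector(Generic[T_State], ABC):
    """Generic ABC: the type argument fixes the type of ``self.state``."""

    def __init__(self, state: T_State) -> None:
        self.state = state

    @abstractmethod
    def select(self, agents: Sequence[str]) -> str:
        """Pick one agent ID from the candidates."""


class DemoFirstSelector(DemoSelector[DemoNullState]):
    def select(self, agents: Sequence[str]) -> str:
        if not agents:
            raise ValueError("no candidate agents")
        # Stateless: always take the first candidate.
        return agents[0]


class DemoRoundRobinSelector(DemoSelector[DemoRRState]):
    def select(self, agents: Sequence[str]) -> str:
        if not agents:
            raise ValueError("no candidate agents")
        # self.state is known to be DemoRRState, so no cast is needed.
        idx = self.state.next_index % len(agents)
        self.state.next_index = (idx + 1) % len(agents)
        return agents[idx]
```

A type checker would reject `self.state.next_index` inside `DemoFirstSelector`, because its state type is `DemoNullState`; that is the benefit the paragraph above describes.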

Introduce `ResourceGroupState` with ABC and type variable

ResourceGroupState(pydantic.BaseModel, ABC)
- NullAgentSelectorState(ResourceGroupState)
- RRAgentSelectorState(ResourceGroupState)
AbstractResourceGroupStateStore(Generic[T_ResourceGroupState], ABC)
declares load(), store(), and reset() methods for per-resource-group states.
- DefaultResourceGroupStateStore(AbstractResourceGroupStateStore[T_ResourceGroupState])
- InMemoryResourceGroupStateStore(AbstractResourceGroupStateStore[T_ResourceGroupState])

It is a thin abstraction that injects an interface to load and store arbitrary states, represented as pydantic models, from and to an internal state storage. Currently, the default implementation uses etcd (shared_config) as the storage backend with the key prefix resource-group-states. The in-memory version is for writing tests.

Note

The prior implementation stored the RR state in the roundrobin_states etcd key. This key is neither removed nor migrated in this PR. Admins may delete it manually.

The ABC ResourceGroupState mandates implementation of the classmethod create_empty_state() so that the system can always fill the initial value when the administrator changes to a new agent selector.

Since there may be multiple agent selectors and possibly other per-resource-group entities that want to use resource group states, the load(), store(), and reset() methods of the state store take an explicit state_name argument to distinguish them within each resource group. The state name is chosen by the caller of the state store API, which in this PR is the agent selector implementation.
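As a rough illustration of how `state_name` and `create_empty_state()` interact, here is a minimal in-memory sketch. It is not the store added in this PR: the class names, the synchronous methods, the argument order, and the use of dataclasses with JSON in place of pydantic models are all assumptions made to keep the example standard-library only.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Dict, Generic, Tuple, Type, TypeVar


class DemoState(ABC):
    """Hypothetical state base that mandates an empty-state factory."""

    @classmethod
    @abstractmethod
    def create_empty_state(cls) -> "DemoState":
        """Return the initial value used when nothing is stored yet."""


@dataclass
class DemoCounterState(DemoState):
    next_index: int = 0

    @classmethod
    def create_empty_state(cls) -> "DemoCounterState":
        return cls()


T = TypeVar("T", bound=DemoState)


class DemoInMemoryStore(Generic[T]):
    """Keeps serialized states keyed by (resource group, state name)."""

    def __init__(self, state_cls: Type[T]) -> None:
        self._state_cls = state_cls
        self._data: Dict[Tuple[str, str], str] = {}

    def load(self, resource_group: str, state_name: str) -> T:
        raw = self._data.get((resource_group, state_name))
        if raw is None:
            # Nothing stored yet: fall back to the empty state.
            return self._state_cls.create_empty_state()
        return self._state_cls(**json.loads(raw))

    def store(self, resource_group: str, state_name: str, state: T) -> None:
        self._data[(resource_group, state_name)] = json.dumps(asdict(state))

    def reset(self, resource_group: str, state_name: str) -> None:
        # Removing the entry makes the next load() return the empty state.
        self._data.pop((resource_group, state_name), None)
```

Because the key includes `state_name`, two callers sharing the same resource group do not overwrite each other's state, and a caller that has never stored anything still gets a usable initial value.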

Other

Move agent_selection_resource_priority into AbstractAgentSelector to avoid passing it to all related functions each time.
Fix a wrong type declaration.
- Replace KernelInfo with KernelRow in assign_agent_for_kernel.
Remove the useless MOF scheduler.

Database migration

The migration script converts the roundrobin flag, when it is set to true, into a value of the agent_selection_strategy field in the scaling_groups.scheduler_opts column. It also adds a new empty object as the agent_selector_config field.
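The transformation on a single scheduler_opts value can be sketched as a pure function. This is an illustration only, not the migration script itself: the function name is hypothetical, it operates on a plain dict rather than on database rows, and it assumes the legacy key is always dropped (as the commit "Ensure removal of 'roundrobin' key in the scheduler_opts" suggests).

```python
from typing import Any, Dict


def upgrade_scheduler_opts(opts: Dict[str, Any]) -> Dict[str, Any]:
    """Return a converted copy of one scheduler_opts value (hypothetical helper)."""
    new_opts = dict(opts)  # leave the caller's dict untouched
    # Drop the legacy flag regardless of its value.
    roundrobin = new_opts.pop("roundrobin", False)
    if roundrobin:
        # A true flag becomes the round-robin selection strategy.
        new_opts["agent_selection_strategy"] = "roundrobin"
    # Add the new, initially empty selector configuration.
    new_opts.setdefault("agent_selector_config", {})
    return new_opts
```

For example, `{"roundrobin": True}` would become `{"agent_selection_strategy": "roundrobin", "agent_selector_config": {}}`, while a value without the flag only gains the empty `agent_selector_config`.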

How to test the RoundRobin agent selection strategy

To manually test the round-robin mode, please refer to this section.

To use the RoundRobin policy, set agent_selection_strategy in scaling_groups.scheduler_opts to roundrobin using the command below.

./backend.ai admin scaling-group update --scheduler-opts '{"agent_selection_strategy": "roundrobin"}' default

Update the redis address in etcd with ./backend.ai mgr etcd put config/redis/addr <manager_addr>:8111
Update rpc-listen-addr and id in the agent section of each agent's agent.toml.
Check that all agents are properly connected to the manager in the Backend.AI web UI.
Create sessions to verify that they are evenly distributed among the agents and that next_index in the round-robin state is correctly updated in etcd.

❯ ./backend.ai mgr etcd get --prefix resource-group-states/default
2024-09-17 00:26:39.261 INFO ai.backend.common.etcd [1871211] using etcd cluster from 127.0.0.1:8121 with namespace "local"
{
    "default": {
        "agselector.roundrobin": "{\"roundrobin_states\":{\"aarch64\":{\"next_index\":0}}}"
    }
}
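Note that the stored value is itself a JSON document serialized as a string, so it has to be decoded twice. A small sketch for checking next_index from a dump like the one above follows; the helper name is hypothetical, and it assumes only the JSON part of the command output (without the leading log line) is passed in.

```python
import json

# The JSON part of the sample output above, copied verbatim.
SAMPLE_DUMP = r'''{
    "default": {
        "agselector.roundrobin": "{\"roundrobin_states\":{\"aarch64\":{\"next_index\":0}}}"
    }
}'''


def read_next_index(dump: str, resource_group: str, arch: str) -> int:
    """Extract next_index for one architecture (hypothetical helper)."""
    outer = json.loads(dump)
    # The state is stored as a JSON-encoded string, so decode it again.
    inner = json.loads(outer[resource_group]["agselector.roundrobin"])
    return inner["roundrobin_states"][arch]["next_index"]


print(read_next_index(SAMPLE_DUMP, "default", "aarch64"))
```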

Checklist: (if applicable)

Milestone metadata specifying the target backport version
Mention to the original issue

github-actions bot assigned jopemachine Oct 25, 2023

github-actions bot added comp:manager Related to Manager component size:XL 500~ LoC labels Oct 25, 2023

jopemachine force-pushed the fix/improve-rr branch from b51a9fe to 0eeebf7 Compare October 26, 2023 06:16

jopemachine marked this pull request as ready for review October 26, 2023 06:37

jopemachine changed the title ~~fix - Replace RoundRobin flag with AgentSelectionStrategy.RoundRobin (WIP)~~ fix - Replace RoundRobin flag with AgentSelectionStrategy.RoundRobin Oct 26, 2023

jopemachine changed the title ~~fix - Replace RoundRobin flag with AgentSelectionStrategy.RoundRobin~~ fix - Replace roundrobin flag with AgentSelectionStrategy.RoundRobin strategy Oct 26, 2023

jopemachine changed the title ~~fix - Replace roundrobin flag with AgentSelectionStrategy.RoundRobin strategy~~ fix: Replace roundrobin flag with AgentSelectionStrategy.RoundRobin strategy Oct 26, 2023

kyujin-cho added this to the 24.03 milestone Oct 30, 2023

jopemachine force-pushed the fix/improve-rr branch from f970a82 to a2dec4f Compare March 27, 2024 03:33

achimnol mentioned this pull request Apr 8, 2024

Expand NUMA-aware resource allocation feature to finer scope #2007

Open

achimnol modified the milestones: 24.03, 24.09 Jun 21, 2024

achimnol force-pushed the fix/improve-rr branch from a2dec4f to cddfa71 Compare August 24, 2024 14:52

jopemachine added the type:refactor Refactor codes or add tests. label Aug 26, 2024

jopemachine marked this pull request as draft August 27, 2024 05:23

jopemachine changed the title ~~fix: Replace roundrobin flag with AgentSelectionStrategy.RoundRobin strategy~~ refactor: PendingSession Scheduler to PendingSession Scheduler and AgentSelector Aug 28, 2024

jopemachine changed the title ~~refactor: PendingSession Scheduler to PendingSession Scheduler and AgentSelector~~ refactor: Split PendingSession Scheduler into PendingSession Scheduler and AgentSelector Aug 28, 2024

jopemachine force-pushed the fix/improve-rr branch 2 times, most recently from 005bf74 to b49812c Compare August 30, 2024 04:38

jopemachine marked this pull request as ready for review August 30, 2024 06:44

jopemachine requested a review from achimnol August 30, 2024 07:21

jopemachine and others added 8 commits September 17, 2024 00:50

Replace RoundRobin flag with AgentSelectionStrategy.RoundRobin

4251d4a

Remove MOFScheduler

c6b4c8e

Add use_num_extras flag temporarily

328826f

Trim imports

2bb06a9

Fix wrong type

26fbed2

Distinguish compatible_agents and possible_agents

2ec9bf2

Format with ruff

ae9a2cc

fix: Remove unused import

4addb8b

jopemachine and others added 18 commits September 17, 2024 00:50

chore: change to snake case

0596185

chore: Update comment

ab03430

fix: store-type -> store_type

6a1a985

feat: Separate the Agentselector configuration into local config and global config

68a376b

refactor: Remove code duplication

a87b1a4

feat: Change to etcd key path to kebab case, Remove storage_type from AgentSelectorStore

0dfcf10

feat: Enable RoundRobinAgentSelector for multi node session

682fa80

test: Clarify the intention

1ead959

fix: Require state_store as the mandatory kwarg and add missing default setting in the scheduler-side instantiation

5c91bec

refactor: Minimize the scope of interests for module-specific type defs

e4e6d1c

fix: Add comments and debug logging.

7cde57e

refactor: Clean up ResourceGroupState type hierarchy

a24e2e4

refactor,fix: Add missing state_name key for resource group states

b6f140e

- Collect round-robin related stuffs around the location of RoundRobinAgentSelector class

fix: remove debug print

ae9c537

refactor: Clarify that this is a generic ResourceGroupStateStore

e727671

- State keys and values are now perfectly specific to the caller.

refactor: Reduce verbosity

ec9944f

refactor: Use pydantic instead of trafaret-based mixins

369315f

fix: Add database migration for agent selector configs

cc15f8d

achimnol force-pushed the fix/improve-rr branch from 1e7f33b to cc15f8d Compare September 16, 2024 15:50

achimnol added the require:db-migration Automatically set when alembic migrations are added or updated label Sep 16, 2024

achimnol added 3 commits September 17, 2024 00:59

fix: typo in the class name

15fbc7b

fix: Ensure removal of 'roundrobin' key in the scheduler_opts

b40cfe0

test: Update the test case to be more realistic

69083d9

achimnol approved these changes Sep 16, 2024

achimnol added this pull request to the merge queue Sep 16, 2024

Merged via the queue into main with commit 6f9c9cb Sep 16, 2024
23 checks passed

achimnol deleted the fix/improve-rr branch September 16, 2024 17:25

jopemachine mentioned this pull request Sep 19, 2024

chore: Correct typos and change agent selector plugin config key to kebab-case #2846

Merged

1 task

jopemachine restored the fix/improve-rr branch September 19, 2024 05:34
