Main idea
Follow-up of #491.
It would be great to extend our NUMA-aware resource allocation feature, for example by taking each accelerator's parent PCIe switch into account.
(Agent) Expansion of AffinityMap
Currently, the device distance calculation is based only on the PCI-reported NUMA node index.
We could extend this distance calculation to consider additional information from the PCIe bus address. For example, we could use the bus number to further distinguish PCIe switches.
First, we need to check whether the existence and layout of PCIe switches can be determined from the bus addresses.
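As a rough illustration of the idea above, here is a minimal sketch of a tiered distance that looks at the bus number in addition to the NUMA node. The names `PCIAddress`, `parse_pci_address`, and `device_distance` are hypothetical and not part of the existing AffinityMap code. The sketch also assumes that two devices with the same domain and bus number sit behind the same PCIe switch, which is exactly the point that still needs to be verified.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Matches "DDDD:BB:DD.F" or the short form "BB:DD.F" (hex fields, function 0-7).
_BDF_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


@dataclass(frozen=True)
class PCIAddress:
    """Hypothetical container for a parsed PCI bus address."""
    domain: int
    bus: int
    device: int
    function: int


def parse_pci_address(text: str) -> PCIAddress:
    """Parse a PCI bus address such as '0000:3b:00.0'."""
    m = _BDF_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a PCI bus address: {text!r}")
    domain, bus, device, function = m.groups()
    # A missing domain is treated as domain 0.
    return PCIAddress(
        int(domain or "0", 16), int(bus, 16), int(device, 16), int(function, 16)
    )


def device_distance(numa_a: int, addr_a: str, numa_b: int, addr_b: str) -> int:
    """Return a tiered distance; smaller means closer.

    0: same device
    1: same NUMA node and same (domain, bus)  [ASSUMED to mean same switch]
    2: same NUMA node, different bus
    3: different NUMA node
    """
    a, b = parse_pci_address(addr_a), parse_pci_address(addr_b)
    if numa_a == numa_b and a == b:
        return 0
    if numa_a != numa_b:
        return 3
    if (a.domain, a.bus) == (b.domain, b.bus):
        return 1
    return 2
```

If the bus number turns out not to identify the switch reliably, only the tier-1 check would need to be replaced with whatever topology lookup proves accurate; the NUMA-based tiers stay the same.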
(Manager) Hierarchical agent selection strategy for multi-node sessions
This will be a follow-up of #1394 and #1655.
It will involve adding location metadata to agent heartbeats (e.g., rack number, rack groups) and taking that metadata into account when distributing the containers of a single cluster session.
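One possible shape of such a hierarchical strategy is sketched below. `AgentLocation` and `select_agents` are hypothetical names, and the `rack_group` and `rack` fields stand in for whatever location metadata the heartbeat ends up carrying. To keep the example short it assumes one container per agent and ignores resource slots entirely.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentLocation:
    """Hypothetical location metadata reported by an agent."""
    agent_id: str
    rack_group: str
    rack: str


def select_agents(agents: list[AgentLocation], count: int) -> list[AgentLocation]:
    """Pick `count` agents, trying the narrowest location scope first.

    Scopes are tried in order: same rack, same rack group, anywhere.
    Assumes one container per agent (a simplification).
    """
    if count <= 0:
        return []
    scopes = (
        lambda a: (a.rack_group, a.rack),  # same rack
        lambda a: (a.rack_group,),         # same rack group
    )
    for scope_key in scopes:
        groups: dict[tuple, list[AgentLocation]] = defaultdict(list)
        for agent in agents:
            groups[scope_key(agent)].append(agent)
        # Take the first group large enough to host the whole session.
        for members in groups.values():
            if len(members) >= count:
                return members[:count]
    if len(agents) < count:
        raise ValueError("not enough agents for the requested cluster size")
    # Fall back to spreading across rack groups.
    return agents[:count]
```

A real scheduler would also have to weigh available resource slots against locality; this sketch only shows the fallback order from the narrowest scope to the widest.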