Add TF and ONNX Workers #777

Open
wants to merge 71 commits into
base: v1.0

Conversation

al-rigazzi
Collaborator

No description provided.

AlyssaCote and others added 30 commits June 11, 2024 16:23
This PR adds the Capnproto schemas and initial MessageHandler class and
tests.
This PR contains an ML worker manager MVP. The worker manager executes a
single-threaded version of the planned ML pipeline for a single worker
instance.

[ committed by @ankona ]
[ approved by @mellis13 ]
This PR removes `device` from the schemas, MessageHandler, and tests.
Add `Model` schema with model metadata.

[ committed by @AlyssaCote ]
[ approved by @ankona ]
…#621)

EnvironmentConfigLoader added for ML Worker Manager.
This PR adds a simple `TorchWorker` which performs inference. The output transform is not yet
implemented, but it is not needed for the time being.

[ committed by @al-rigazzi ]
[ reviewed by @AlyssaCote @ankona ]
Adds the ability to specify hardware affinities for cpu/gpu devices.
Creates a dragon policy that uses provided policy to modify the
resulting dragon `ProcessGroup`.

[ committed by @ankona ]
[ approved by @mellis13 @al-rigazzi ]
This PR aims to allow the `WorkerManager` to continue if a `worker`
throws an error. The `WorkerManager` needs to return a `Response`
without blowing up in the process.

[ committed by @AlyssaCote ]
[ approved by @mellis13 @ankona ]
Schemas were enhanced for performance.

[ committed by @AlyssaCote ]
[ approved by @al-rigazzi @mellis13 ]
Bring mli-feature up to date with develop.

---------

[ committed by @al-rigazzi ]
[ reviewed by @AlyssaCote @ankona ]
Fix two dragon installation issues:

1. Fix issue where search for `*.whl` files may include previously
extracted versions of the dragon package
2. Fix issue where LD_LIBRARY_PATH is incorrectly directed to
`dragon-0.9` folder by using the generated `.env` file created from
`smart build --dragon`

[ committed by @ankona ]
[ approved by @AlyssaCote ]
- Enables using multiple feature stores by enhancing the existing
tensor/model-key classes to include the feature store descriptor.
- Update the `EnvironmentConfigLoader` to retrieve _multiple_ feature
stores from environment using the prior key as a prefix to query with
- Minor (lift & shift) refactor of top-level functions in worker manager
module to reduce number of touch-points for converting to
`FeatureStoreKey` from capnproto type
    - now, only `worker.py` deals with this conversion.

[ committed by @ankona ]
[ approved by @mellis13 @AlyssaCote @al-rigazzi ]
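The prefix-based lookup described above could look roughly like the sketch below. It is only an illustration: the function name `load_feature_store_descriptors`, the `EXAMPLE_FEATURE_STORE` prefix, and the descriptor values are hypothetical and are not taken from the actual `EnvironmentConfigLoader`.

```python
import os
from typing import Dict, Mapping, Optional


def load_feature_store_descriptors(
    prefix: str, environ: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return {variable name: descriptor} for every variable starting with prefix.

    Hypothetical helper, not the real EnvironmentConfigLoader interface.
    """
    env = os.environ if environ is None else environ
    return {name: value for name, value in env.items() if name.startswith(prefix)}


# Use an explicit mapping here instead of the real process environment.
example_env = {
    "EXAMPLE_FEATURE_STORE_A": "descriptor-a",
    "EXAMPLE_FEATURE_STORE_B": "descriptor-b",
    "UNRELATED": "ignored",
}
stores = load_feature_store_descriptors("EXAMPLE_FEATURE_STORE", example_env)
```

Passing the mapping explicitly keeps the helper easy to test without mutating `os.environ`.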
Reduce copies by using `torch.from_numpy`.
`SS_INFRA_BACKBONE` has been updated to `_SMARTSIM_INFRA_BACKBONE` and
`SS_REQUEST_QUEUE` is now `_SMARTSIM_REQUEST_QUEUE`.

[ committed by @AlyssaCote ]
[ reviewed by @mellis13 ]
Converted `FeatureStoreKey` into a frozen dataclass and used `__post_init__` to validate that the key
and descriptor are not empty strings.

[ committed by @AlyssaCote ]
[ approved by @ankona ]
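A minimal sketch of the pattern this commit describes, assuming the two fields are named `key` and `descriptor` (the real class may differ in field names, types, and error messages):

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FeatureStoreKey:
    """Illustrative stand-in; field names are assumed from the description."""

    key: str
    descriptor: str

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so invalid keys never exist.
        if not self.key:
            raise ValueError("key must not be an empty string")
        if not self.descriptor:
            raise ValueError("descriptor must not be an empty string")


fs_key = FeatureStoreKey(key="tensor-1", descriptor="store-a")
try:
    fs_key.key = "other"  # type: ignore[misc]
except FrozenInstanceError:
    pass  # frozen dataclasses reject attribute assignment
```

Because the dataclass is frozen, instances are hashable by value and can be used as dictionary keys.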
## Description

This PR adds two features:

1. Ability to specify hostnames that tasks should run on
2. Ability to colocate tasks

### Specifying Hostnames

The existing `DragonRunRequest` supported the ability to specify a
hostname when creating a policy used to run a task. However, the
hostnames were not exposed to clients.

This ticket allows clients to pass a list of hosts that will be used in
place of the default "first available host" behavior.

### Task Colocation

The prior system for finding nodes to execute a task worked only
with unassigned nodes. Any node assigned a task could not be assigned
another task.

This ticket adds a more capable prioritizer class that enables clients
using hostnames to colocate tasks. It also retains the capability to
return open nodes when no hostname is specified.
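The selection behavior described above might be sketched as follows. This is not the prioritizer class from the PR: the function `select_nodes`, the `task_counts` mapping, and the least-loaded-first ordering are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def select_nodes(
    task_counts: Dict[str, int],
    num_nodes: int,
    hosts: Optional[List[str]] = None,
) -> List[str]:
    """Pick nodes for a task (hypothetical sketch).

    With `hosts`, choose among those hosts even if they already run tasks
    (colocation), least-loaded first. Without `hosts`, consider only nodes
    that have no task assigned.
    """
    if hosts:
        candidates = [h for h in hosts if h in task_counts]
        candidates.sort(key=lambda h: task_counts[h])
    else:
        candidates = [h for h, count in task_counts.items() if count == 0]
    if len(candidates) < num_nodes:
        return []  # not enough nodes to satisfy the request
    return candidates[:num_nodes]


counts = {"node1": 2, "node2": 0, "node3": 1}
open_pick = select_nodes(counts, 1)
colocated_pick = select_nodes(counts, 2, hosts=["node1", "node3"])
```

Here `open_pick` can only use the idle `node2`, while `colocated_pick` may reuse busy nodes because the client named them explicitly.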
Fix 3 bugs:
1. fix use of an un-set collection caused by reordering the init sequence
in the dragon backend
2. fix tests that should have been updated to compare set contents
instead of individual items
3. remove newly added validation on empty host lists that broke existing
tests
This PR adds the `RequestDispatcher` to the MLI. The `RequestDispatcher`
batches inference requests together and dispatches batches to `WorkerManagers`.

[ committed by @al-rigazzi ]
[ reviewed by @mellis13 @ankona @AlyssaCote ]
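To make the batching idea concrete, here is a toy sketch. It is not the `RequestDispatcher` from this PR: the class `BatchingDispatcher`, grouping by a model key, and the fixed batch size are all assumptions, and the real dispatcher hands batches to `WorkerManagers` rather than returning them.

```python
from collections import defaultdict
from typing import DefaultDict, List


class BatchingDispatcher:
    """Toy dispatcher: groups requests by model key and emits full batches."""

    def __init__(self, batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._batch_size = batch_size
        self._pending: DefaultDict[str, List[bytes]] = defaultdict(list)

    def submit(self, model_key: str, payload: bytes) -> List[bytes]:
        """Queue a request; return a batch once it is full, else an empty list."""
        queue = self._pending[model_key]
        queue.append(payload)
        if len(queue) >= self._batch_size:
            batch = queue[:]
            queue.clear()
            return batch
        return []

    def flush(self, model_key: str) -> List[bytes]:
        """Return whatever is pending for a model, even if the batch is not full."""
        return self._pending.pop(model_key, [])


dispatcher = BatchingDispatcher(batch_size=2)
first = dispatcher.submit("model-a", b"req-1")
second = dispatcher.submit("model-a", b"req-2")
```

A production dispatcher would typically also flush on a timeout so that a partially filled batch does not wait indefinitely.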
…ization of failure responses. (#687)

In this PR I fix the `exception_handler` so that it only builds and
serializes a failure response if the reply channel is not `None`. I also
needed to tweak the tests a bit by mocking out the reply channels.

[ committed by @AlyssaCote ]
[ approved by @mellis13 @al-rigazzi ]
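The guard described above can be sketched as follows. The signature is hypothetical: here a plain list stands in for the reply channel, and `build_failure_response` is an assumed helper, not the serializer used in the PR.

```python
from typing import Callable, List, Optional


def build_failure_response(exc: Exception) -> bytes:
    """Assumed stand-in for building and serializing a failure response."""
    return f"failure: {exc}".encode("utf-8")


def exception_handler(
    exc: Exception,
    reply_channel: Optional[List[bytes]],
    build_failure: Callable[[Exception], bytes] = build_failure_response,
) -> bool:
    """Send a failure response only when a reply channel exists.

    Returns True if a response was sent. The list stands in for a channel.
    """
    if reply_channel is None:
        return False  # nothing to reply to, so skip building and serializing
    reply_channel.append(build_failure(exc))
    return True


channel: List[bytes] = []
sent = exception_handler(ValueError("bad input"), channel)
skipped = exception_handler(ValueError("bad input"), None)
```

Checking the channel first avoids paying the serialization cost for a response that could never be delivered.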
)

Updates SmartSim environment variable names with the new naming convention. 

[ committed by @AlyssaCote ]
[ approved by @ashao ]
Update MLI filenames to be snake case.
Event broadcasting will enable the system to notify other MLI resources
of changes. This PR contains the base capabilities required for
publishing & consuming channel messages as events.

[ committed by @ankona ]
[ reviewed by @mellis13 @al-rigazzi @AlyssaCote ]
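As a rough picture of publishing and consuming channel messages as events, consider the in-process sketch below. The class `EventBroadcaster` and its methods are hypothetical; the PR's implementation works over MLI channels, which this example does not model.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Consumer = Callable[[bytes], None]


class EventBroadcaster:
    """Toy in-process broadcaster keyed by channel name."""

    def __init__(self) -> None:
        self._consumers: DefaultDict[str, List[Consumer]] = defaultdict(list)

    def subscribe(self, channel: str, consumer: Consumer) -> None:
        """Register a consumer to be called for each message on a channel."""
        self._consumers[channel].append(consumer)

    def publish(self, channel: str, message: bytes) -> int:
        """Deliver a message to every consumer; return how many received it."""
        consumers = self._consumers.get(channel, [])
        for consumer in consumers:
            consumer(message)
        return len(consumers)


received: List[bytes] = []
broadcaster = EventBroadcaster()
broadcaster.subscribe("feature-store-updates", received.append)
delivered = broadcaster.publish("feature-store-updates", b"store created")
```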
3 participants