Add TF and ONNX Workers #777
Open
al-rigazzi wants to merge 71 commits into v1.0 from tf_worker_1.0
This PR adds the Capnproto schemas and initial MessageHandler class and tests.
This PR removes `device` from the schemas, MessageHandler, and tests.
Add `Model` schema with model metadata. [ committed by @AlyssaCote ] [ approved by @ankona ]
…#621) EnvironmentConfigLoader added for ML Worker Manager.
This PR adds a simple `TorchWorker` which performs inference. The output transform is not yet implemented, but it is not needed for the time being. [ committed by @al-rigazzi ] [ reviewed by @AlyssaCote @ankona ]
Adds the ability to specify hardware affinities for cpu/gpu devices. Creates a dragon policy that uses provided policy to modify the resulting dragon `ProcessGroup`. [ committed by @ankona ] [ approved by @mellis13 @al-rigazzi ]
[ committed by @ankona ] [ approved by @AlyssaCote ]
This PR aims to allow the `WorkerManager` to continue if a `worker` throws an error. The `WorkerManager` needs to return a `Response` without blowing up in the process. [ committed by @AlyssaCote ] [ approved by @mellis13 @ankona ]
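The behavior described above can be sketched as follows. This is a minimal illustration and not the SmartSim implementation: the `Response` fields, the `serve` function, and the callable `worker` are all hypothetical names chosen for the example. It only shows the general idea of converting a worker exception into a failure response so the loop keeps going.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Response:
    """Hypothetical response record (not the real MLI `Response`)."""

    status: str
    message: str
    result: Any = None


def serve(requests: Iterable[Any], worker: Callable[[Any], Any]) -> List[Response]:
    """Run `worker` on every request, turning exceptions into failure responses."""
    responses: List[Response] = []
    for request in requests:
        try:
            result = worker(request)
        except Exception as exc:  # deliberately broad: the loop must survive any worker error
            responses.append(Response("fail", f"{type(exc).__name__}: {exc}"))
            continue
        responses.append(Response("complete", "ok", result))
    return responses
```

With a worker such as `lambda x: 10 // x`, a request of `0` yields a `fail` response while the requests before and after it are still processed.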
Schemas were enhanced for performance. [committed by @AlyssaCote ] [approved by @al-rigazzi @mellis13 ]
Bring mli-feature up to date with develop. [ committed by @al-rigazzi ] [ reviewed by @AlyssaCote @ankona ]
Fix two dragon installation issues:
1. Fix issue where search for `*.whl` files may include previously extracted versions of the dragon package
2. Fix issue where `LD_LIBRARY_PATH` is incorrectly directed to `dragon-0.9` folder by using the generated `.env` file created from `smart build --dragon`

[ committed by @ankona ] [ approved by @AlyssaCote ]
- Enables using multiple feature stores by enhancing the existing tensor/model-key classes to include the feature store descriptor.
- Update the `EnvironmentConfigLoader` to retrieve _multiple_ feature stores from environment using the prior key as a prefix to query with
- Minor (lift & shift) refactor of top-level functions in worker manager module to reduce number of touch-points for converting to `FeatureStoreKey` from capnproto type - now, only `worker.py` deals with this conversion.

[ committed by @ankona ] [ approved by @mellis13 @AlyssaCote @al-rigazzi ]
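As a rough illustration of the prefix-based lookup mentioned in the second bullet, the sketch below collects every environment variable whose name starts with a given prefix. The function name `load_feature_store_descriptors` and the `EXAMPLE_FEATURE_STORE` prefix are assumptions made for this example; the actual `EnvironmentConfigLoader` API and variable names are not shown in this PR text.

```python
import os
from typing import Dict, Mapping, Optional


def load_feature_store_descriptors(
    prefix: str, env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return {variable name: descriptor} for all non-empty variables matching `prefix`."""
    source = os.environ if env is None else env
    return {
        name: value
        for name, value in source.items()
        if name.startswith(prefix) and value  # skip unrelated and empty variables
    }


# Hypothetical environment: two feature stores registered under one prefix.
example_env = {
    "EXAMPLE_FEATURE_STORE_A": "descriptor-a",
    "EXAMPLE_FEATURE_STORE_B": "descriptor-b",
    "UNRELATED": "ignored",
}
stores = load_feature_store_descriptors("EXAMPLE_FEATURE_STORE", example_env)
```

Passing `env` explicitly keeps the function easy to test; leaving it as `None` reads from the process environment.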
Reduce copies by using `torch.from_numpy`.
`SS_INFRA_BACKBONE` has been updated to `_SMARTSIM_INFRA_BACKBONE` and `SS_REQUEST_QUEUE` is now `_SMARTSIM_REQUEST_QUEUE`. [ committed by @AlyssaCote ] [ reviewed by @mellis13 ]
Converted `FeatureStoreKey` into a frozen dataclass and used `__post_init__` to validate that the key and descriptor are not empty strings. [ committed by @AlyssaCote ] [ approved by @ankona ]
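A minimal sketch of the pattern described above, assuming the class has just the two string fields named in the commit message (`key` and `descriptor`). The real class may carry more fields or different error messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureStoreKey:
    """Illustrative stand-in for the MLI class; field names follow the commit message."""

    key: str
    descriptor: str

    def __post_init__(self) -> None:
        # The dataclass machinery calls this after __init__; reject empty strings here.
        if not self.key:
            raise ValueError("key must not be empty")
        if not self.descriptor:
            raise ValueError("descriptor must not be empty")
```

Because the dataclass is frozen, assigning to `key` or `descriptor` after construction raises `dataclasses.FrozenInstanceError`, so a key that passes validation stays valid.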
## Description

This PR adds two features:
1. Ability to specify hostnames that tasks should run on
2. Enable task colocation

### Specifying Hostnames

The existing `DragonRunRequest` supported the ability to specify a hostname when creating a policy used to run a task. However, the hostnames were not exposed to clients. This ticket allows clients to pass a list of hosts that will be used in place of the default "first available host" behavior.

### Task Colocation

The prior system for finding nodes to execute a task worked only with unassigned nodes. Any node assigned a task could not be assigned another task. This ticket adds a more capable prioritizer class that enables clients using hostnames to colocate tasks. It also retains the capability to return open nodes when no hostname is specified.
Fix 3 bugs:
1. reordering the init sequence in the dragon backend resulted in an un-set collection being used
2. fix tests that should have been updated to compare set contents instead of individual items
3. remove newly added validation on empty host lists that broke existing tests
This PR adds the `RequestDispatcher` to the MLI. The `RequestDispatcher` batches inference requests together and dispatches batches to `WorkerManagers`. [ committed by @al-rigazzi ] [ reviewed by @mellis13 @ankona @AlyssaCote ]
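To make the batching idea concrete, here is a small sketch. It assumes that requests are grouped by the model they target and that a batch is emitted once it reaches a maximum size; neither assumption is confirmed by the commit message, and `InferenceRequest` and `batch_requests` are hypothetical names, not the `RequestDispatcher` API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Iterable, List


@dataclass
class InferenceRequest:
    """Hypothetical request record used only for this sketch."""

    model_key: str
    payload: bytes


def batch_requests(
    requests: Iterable[InferenceRequest], max_batch_size: int
) -> List[List[InferenceRequest]]:
    """Group requests per model, emitting a batch whenever it fills up."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    pending: DefaultDict[str, List[InferenceRequest]] = defaultdict(list)
    batches: List[List[InferenceRequest]] = []
    for request in requests:
        bucket = pending[request.model_key]
        bucket.append(request)
        if len(bucket) == max_batch_size:
            batches.append(list(bucket))  # copy, then reuse the bucket
            bucket.clear()
    # Flush partially filled batches so no request is dropped.
    batches.extend(list(bucket) for bucket in pending.values() if bucket)
    return batches
```

A real dispatcher would likely also flush on a timeout so that a lone request is not held indefinitely; that is omitted here for brevity.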
…ization of failure responses. (#687) In this PR I fix the `exception_handler` so that it only builds and serializes a failure response if a reply channel is not None. I also needed to tweak the tests a bit by mocking out the reply channels. [ committed by @AlyssaCote ] [ approved by @mellis13 @al-rigazzi ]
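The guard described above might look roughly like the sketch below. The signature, the JSON serialization, and the use of a plain callable as the reply channel are all assumptions for illustration; only the name `exception_handler` and the "skip when the reply channel is None" rule come from the commit message.

```python
import json
from typing import Callable, Optional


def exception_handler(
    exc: Exception,
    reply_channel: Optional[Callable[[bytes], None]],
    failure_message: str,
) -> bool:
    """Send a serialized failure response only when there is somewhere to send it."""
    if reply_channel is None:
        # Nothing can receive the response, so skip building and serializing it.
        return False
    payload = json.dumps(
        {"status": "fail", "message": f"{failure_message}: {exc}"}
    ).encode("utf-8")
    reply_channel(payload)
    return True
```

In tests, the reply channel can be replaced with a simple stand-in such as `sent.append` on a list, which mirrors the mocking mentioned in the commit message.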
) Updates SmartSim environment variable names with the new naming convention. [ committed by @AlyssaCote ] [ approved by @ashao ]
Update MLI filenames to be snake case.
Event broadcasting will enable the system to notify other MLI resources of changes. This PR contains the base capabilities required for publishing & consuming channel messages as events. [ committed by @ankona ] [ reviewed by @mellis13 @al-rigazzi @AlyssaCote ]
This PR integrates event publishers and consumers in `ProtoClient` and `DragonBackend` [ committed by @ankona] [ reviewed by @al-rigazzi @mellis13 @amandarichardsonn ]
No description provided.