Add TF and ONNX Workers #777

al-rigazzi · 2024-10-31T18:26:28Z

No description provided.

@ankona

Adds the ability to specify hardware affinities for CPU/GPU devices. Creates a dragon policy that uses the provided policy to modify the resulting dragon `ProcessGroup`. [ committed by @ankona ] [ approved by @mellis13 @al-rigazzi ]

@ankona

Fix two dragon installation issues:

1. Fix an issue where the search for `*.whl` files may include previously extracted versions of the dragon package.
2. Fix an issue where `LD_LIBRARY_PATH` incorrectly points to the `dragon-0.9` folder. The fix uses the generated `.env` file created by `smart build --dragon`.

[ committed by @ankona ] [ approved by @AlyssaCote ]

@ankona

- Enables using multiple feature stores by enhancing the existing tensor/model-key classes to include the feature store descriptor.
- Updates the `EnvironmentConfigLoader` to retrieve _multiple_ feature stores from the environment, using the prior key as a prefix to query with.
- Minor (lift & shift) refactor of top-level functions in the worker manager module to reduce the number of touch-points for converting to `FeatureStoreKey` from the capnproto type. Now, only `worker.py` deals with this conversion.

[ committed by @ankona ] [ approved by @mellis13 @AlyssaCote @al-rigazzi ]
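The prefix-based lookup described above could look roughly like the sketch below. It is not the `EnvironmentConfigLoader` code: the function name `load_descriptors` and the variable prefix `EXAMPLE_FEATURE_STORE` are hypothetical, chosen only to illustrate collecting several descriptors that share a key prefix.

```python
import os
from typing import Dict, Mapping


def load_descriptors(prefix: str, env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect every non-empty variable whose name starts with `prefix`.

    Hypothetical helper: the real loader may name, filter, or decode
    descriptors differently.
    """
    return {
        name: value
        for name, value in env.items()
        if name.startswith(prefix) and value
    }


# Usage with an explicit mapping instead of the real process environment.
example_env = {
    "EXAMPLE_FEATURE_STORE": "descriptor-a",
    "EXAMPLE_FEATURE_STORE_1": "descriptor-b",
    "UNRELATED": "ignored",
}
stores = load_descriptors("EXAMPLE_FEATURE_STORE", example_env)
```

Passing the environment as a parameter keeps the lookup easy to test without mutating `os.environ`.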

## Description

This PR adds two features:

1. Ability to specify hostnames that tasks should run on
2. Enable task colocation

### Specifying Hostnames

The existing `DragonRunRequest` supported the ability to specify a hostname when creating a policy used to run a task. However, the hostnames were not exposed to clients. This ticket allows clients to pass a list of hosts that will be used in place of the default "first available host" behavior.

### Task Colocation

The prior system for finding nodes to execute a task worked only with unassigned nodes. Any node assigned a task could not be assigned another task. This ticket adds a more capable prioritizer class that enables clients using hostnames to colocate tasks. It also retains the capability to return open nodes when no hostname is specified.
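To make the two selection modes concrete, here is a minimal sketch of a prioritizer that colocates tasks when hostnames are given and otherwise hands out only unassigned nodes. The class name `NodePrioritizer`, its method `next_host`, and the least-loaded tie-breaking rule are all assumptions for illustration; the class added in the PR may behave differently.

```python
from typing import Dict, Iterable, List, Optional


class NodePrioritizer:
    """Hypothetical sketch of host selection; not the SmartSim class."""

    def __init__(self, hostnames: Iterable[str]) -> None:
        # Number of tasks currently assigned to each known host.
        self._load: Dict[str, int] = {host: 0 for host in hostnames}

    def next_host(self, hosts: Optional[List[str]] = None) -> Optional[str]:
        if hosts:
            # Explicit hostnames: busy nodes are allowed, so tasks can colocate.
            candidates = [h for h in hosts if h in self._load]
        else:
            # No hostnames: only nodes without any assigned task qualify.
            candidates = [h for h, count in self._load.items() if count == 0]
        if not candidates:
            return None
        # Assumed policy: prefer the least-loaded candidate.
        chosen = min(candidates, key=lambda h: self._load[h])
        self._load[chosen] += 1
        return chosen


prioritizer = NodePrioritizer(["node1", "node2"])
first = prioritizer.next_host()                # an open node
colocated = prioritizer.next_host(["node1"])   # explicitly requested host
```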

Fix 3 bugs:

1. Reordering the init sequence in the dragon backend resulted in an un-set collection being used
2. Fix tests that should have been updated to compare set contents instead of individual items
3. Remove newly added validation on empty host lists that broke existing tests

@AlyssaCote

…ization of failure responses. (#687) In this PR I fix the `exception_handler` so that it only builds and serializes a failure response if a reply channel is not None. I also needed to tweak the tests a bit by mocking out the reply channels. [ committed by @AlyssaCote ] [ approved by @mellis13 @al-rigazzi ]
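The guard described above can be sketched as follows. This is an illustration only: `ListChannel` is a made-up in-memory stand-in for a reply channel, the handler signature is assumed, and JSON is used in place of the Capnproto serialization the MLI relies on.

```python
import json
from typing import List, Optional


class ListChannel:
    """In-memory stand-in for a reply channel (hypothetical)."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    def send(self, data: bytes) -> None:
        self.sent.append(data)


def exception_handler(
    exc: Exception, reply_channel: Optional[ListChannel], failure_message: str
) -> bool:
    """Send a failure response only when there is somewhere to send it."""
    if reply_channel is None:
        # No channel: skip building and serializing the response entirely.
        return False
    payload = json.dumps(
        {"status": "fail", "message": f"{failure_message}: {exc}"}
    ).encode("utf-8")
    reply_channel.send(payload)
    return True


channel = ListChannel()
handled = exception_handler(ValueError("bad input"), channel, "Failed to run model")
skipped = exception_handler(ValueError("bad input"), None, "Failed to run model")
```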

@ankona

Event broadcasting will enable the system to notify other MLI resources of changes. This PR contains the base capabilities required for publishing & consuming channel messages as events. [ committed by @ankona ] [ reviewed by @mellis13 @al-rigazzi @AlyssaCote ]
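As a rough mental model of publishing and consuming events, the sketch below fans each published event out to every subscriber's queue. It runs in a single process with the standard library `queue` module, whereas the PR works with channel messages; the names `Event` and `Broadcaster` are hypothetical.

```python
import queue
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    category: str  # e.g. what kind of change occurred (assumed field)
    payload: str   # details of the change (assumed field)


class Broadcaster:
    """Hypothetical in-process publisher; every subscriber gets each event."""

    def __init__(self) -> None:
        self._subscribers: List["queue.Queue[Event]"] = []

    def subscribe(self) -> "queue.Queue[Event]":
        inbox: "queue.Queue[Event]" = queue.Queue()
        self._subscribers.append(inbox)
        return inbox

    def publish(self, event: Event) -> int:
        for inbox in self._subscribers:
            inbox.put(event)
        # Report how many consumers were notified.
        return len(self._subscribers)


broadcaster = Broadcaster()
consumer_a = broadcaster.subscribe()
consumer_b = broadcaster.subscribe()
notified = broadcaster.publish(Event("resource-changed", "example payload"))
```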

AlyssaCote and others added 30 commits June 11, 2024 16:23

Initial MLI schemas and MessageHandler class (#607)

d2fd6a7

This PR adds the Capnproto schemas and initial MessageHandler class and tests.

Merge branch 'develop' into mli-feature

3c9915c

ML Worker Manager MVP (#608)

38081da

This PR contains an ML worker manager MVP. The worker manager executes a single-threaded version of the planned ML pipeline for a single worker instance. [ committed by @ankona ] [ approved by @mellis13 ]

Remove device attribute from schemas (#619)

ab900b8

This PR removes `device` from the schemas, MessageHandler, and tests.

Merge branch 'develop' into mli-feature

a9ffb14

Merge branch 'develop' into mli-feature

ee2c110

Add model metadata to request schema (#624)

8a2f173

Add `Model` schema with model metadata. [ committed by @AlyssaCote ] [ approved by @ankona ]

Enable environment variable based configuration for ML Worker Manager (#621)

52abd32

EnvironmentConfigLoader added for ML Worker Manager.

FLI-based Worker Manager (#622)

eace71e

This PR adds a simple `TorchWorker` which performs inference. The output transform is not implemented yet, but it is not needed for the time being. [ committed by @al-rigazzi ] [ reviewed by @AlyssaCote @ankona ]

Revert "Add ability to specify hardware policies on dragon run requests" (#637)

0030a4a

Reverts #631

Merge latest develop into mli-feature (#640)

b6c2f2b

[ committed by @ankona ] [ approved by @AlyssaCote ]

Improve error handling in worker manager (#629)

272a1d7

This PR aims to allow the `WorkerManager` to continue if a `worker` throws an error. The `WorkerManager` needs to return a `Response` without blowing up in the process. [ committed by @AlyssaCote ] [ approved by @mellis13 @ankona ]
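One way to picture this behavior is a wrapper that turns any exception raised by a worker stage into a failure response rather than letting it propagate. The `Response` fields and the `run_stage` helper below are assumptions for illustration, not the MLI types.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Response:
    """Hypothetical response shape; the real schema is defined in Capnproto."""

    status: str
    result: Any = None
    message: str = ""


def run_stage(stage: Callable[[Any], Any], payload: Any) -> Response:
    """Run one worker stage and always return a Response."""
    try:
        return Response(status="ok", result=stage(payload))
    except Exception as exc:  # broad on purpose: the manager must keep running
        return Response(status="fail", message=f"{type(exc).__name__}: {exc}")


ok = run_stage(lambda x: x * 2, 3)
failed = run_stage(lambda x: 1 / x, 0)
```

Because the failure is reported as data, the calling loop can send the response and move on to the next request.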

Schema performance improvements (#632)

7169f1c

Schemas were enhanced for performance. [ committed by @AlyssaCote ] [ approved by @al-rigazzi @mellis13 ]

New develop merger (#645)

84101b3

Bring mli-feature up to date with develop. [ committed by @al-rigazzi ] [ reviewed by @AlyssaCote @ankona ]

merging develop

e225c07

Merge branch 'develop' into mli-feature

9f482b1

Merge branch 'develop' into mli-feature

99ed41c

Use torch.from_numpy instead of torch.tensor to reduce a copy (#661)

74d6e78

Reduce copies by using `torch.from_numpy`.

MLI environment variables updated using new naming convention (#665)

391784c

`SS_INFRA_BACKBONE` has been updated to `_SMARTSIM_INFRA_BACKBONE` and `SS_REQUEST_QUEUE` is now `_SMARTSIM_REQUEST_QUEUE`. [ committed by @AlyssaCote ] [ reviewed by @mellis13 ]

Remove pydantic dependency from MLI code (#667)

f7ef49b

Converted `FeatureStoreKey` into a frozen dataclass and used `__post_init__` to validate that the key and descriptor are not empty strings. [ committed by @AlyssaCote ] [ approved by @ankona ]
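A minimal sketch of that pattern, assuming the two fields are plain strings named `key` and `descriptor` (the actual class may carry more fields or different error messages):

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FeatureStoreKey:
    """Illustrative stand-in for the MLI key type."""

    key: str         # name of the tensor or model
    descriptor: str  # identifies the feature store holding the key

    def __post_init__(self) -> None:
        # __post_init__ runs after the generated __init__, even when frozen,
        # so validation needs no third-party dependency.
        if not self.key:
            raise ValueError("key must not be an empty string")
        if not self.descriptor:
            raise ValueError("descriptor must not be an empty string")


valid = FeatureStoreKey(key="input-tensor", descriptor="store-a")
```

Constructing `FeatureStoreKey("", "store-a")` raises `ValueError`, and assigning to a field of `valid` raises `FrozenInstanceError`.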

Queue-based Worker Manager (#647)

5d85995

This PR adds the `RequestDispatcher` to the MLI. The `RequestDispatcher` batches inference requests together and dispatches batches to `WorkerManagers`. [ committed by @al-rigazzi ] [ reviewed by @mellis13 @ankona @AlyssaCote ]
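The batching idea can be sketched as below. Grouping by a model key and flushing at a fixed batch size are assumptions made for the example; the actual `RequestDispatcher` may also use timeouts and other criteria, and `batch_requests` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed request shape: (model key, input payload).
Request = Tuple[str, object]


def batch_requests(
    requests: Iterable[Request], batch_size: int
) -> Iterator[Tuple[str, List[object]]]:
    """Group requests per model and emit a batch once it is full."""
    pending: Dict[str, List[object]] = defaultdict(list)
    for model_key, payload in requests:
        pending[model_key].append(payload)
        if len(pending[model_key]) == batch_size:
            yield model_key, pending.pop(model_key)
    # Emit whatever is left as partial batches.
    for model_key, payloads in pending.items():
        if payloads:
            yield model_key, payloads


incoming = [("model-a", 1), ("model-b", 2), ("model-a", 3), ("model-a", 4)]
batches = list(batch_requests(incoming, batch_size=2))
```

Each yielded pair would then be handed to a worker manager, so one inference call serves several requests.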

SmartSim environment variables updated using new naming convention (#666)

8aa990c

Updates SmartSim environment variable names with the new naming convention. [ committed by @AlyssaCote ] [ approved by @ashao ]

MLI file names conform to snake case (#689)

f6d55d8

Update MLI filenames to be snake case.

al-rigazzi and others added 30 commits September 25, 2024 10:35

Merge branch 'mli-feature' of https://github.com/CrayLabs/SmartSim into benchmark

5217a0a

Working after merge

c1c4604

Improve TF Worker throughput

3a5a8ce

Revert changes to throughput scripts

48caac2

Style, lint, changelog

f79f53c

Add tests for TF worker

f569b9c

Improve standalone_worker_manager.py

50ce6ad

Remove commented sections

d49634f

Style

a8d501b

Switch to Channel.make_process_local

79b4954

Add timeout and exc handling to WM response

152d434

More commented lines to remove

14b20f1

Remove debug information

dbbdcd9

Fix issue when keras layer has "resource" field

b9bdf99

Address first comments.

3f571f4

Add ONNX worker

dc7307a

Add onnx mock app

dccc07d

Style

a590a36

Add optional compile step for Torch model

bf324d2

Add integration of dragon-based event broadcasting (#710)

ca01cb1

This PR integrates event publishers and consumers in `ProtoClient` and `DragonBackend`. [ committed by @ankona ] [ reviewed by @al-rigazzi @mellis13 @amandarichardsonn ]

Refine try-catch in onnx worker

dc31b75

Merge branch 'mli-feature' of https://github.com/CrayLabs/SmartSim into tf_worker

5afcfcf

Use new ProtoClient in apps

4e6ddff

Style

9947013

Mypy

23de37f

Style

10d59c8

Fix tests

ce5a306

Fix tests

f23c267

Merge branch 'v1.0' of https://github.com/CrayLabs/SmartSim into tf_worker_1.0

6a08b82

Complete post-merge operations

da92472
