Opni GPU Controller Service
Amartya Chakraborty edited this page Jan 25, 2023 · 2 revisions
The GPU controller service manages the GPU within the cluster. When a training job is submitted, the GPU is used to train a new Deep Learning model; otherwise, it is used for inferencing.
- Written in Python
- Train a new Deep Learning model using the GPU.
- Run inference on workload logs with a Deep Learning model using the GPU.
- Send inferred log messages to the workload DRAIN service, which places them in its cache.
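The core decision the responsibilities above describe, train when a job arrives, otherwise infer, can be sketched as plain routing logic. This is an illustrative sketch, not the actual Opni code; the function and payload field names are assumptions.

```python
# Hypothetical sketch of the controller's GPU decision: a training-job
# payload occupies the GPU with training, anything else is inferred on.
# `model` stands in for whatever Deep Learning model wrapper is loaded.

def handle_payload(payload: dict, model) -> str:
    """Route a payload to GPU training or GPU inference."""
    if payload.get("type") == "training_job":
        model.train(payload["data"])          # use the GPU to train a new model
        return "trained"
    inferred = model.infer(payload["logs"])   # otherwise infer on workload logs
    return f"inferred:{len(inferred)}"
```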
Input

| Component | Type | Description |
|---|---|---|
| model_update | Nats subject | The GPU controller is subscribed to this Nats subject; when a new Deep Learning model has been trained and is ready for inferencing, the controller loads the latest model. |
| gpu_service_inference_internal | Nats subject | Within the GPU controller, the training controller sends logs through this Nats subject to be inferred on using the GPU. |
| gpu_service_running | Nats request/reply subject | The GPU controller receives a request through the gpu_service_running Nats subject and replies that it is running. |
| gpu_service_training_internal | Nats subject | The GPU controller receives a payload from this Nats subject and then runs a training job. |
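Wiring one handler per input subject might look like the sketch below. The `client` is a stand-in for a NATS connection (e.g. nats-py's `subscribe`), and the handler names are assumptions for illustration.

```python
# Illustrative registration of the controller's input subjects.
# Subjects are taken from the table above; descriptions are for reference.

INPUT_SUBJECTS = {
    "model_update": "load the latest trained model",
    "gpu_service_inference_internal": "infer on forwarded logs with the GPU",
    "gpu_service_running": "reply that the service is running",
    "gpu_service_training_internal": "run a training job on the payload",
}

def register_handlers(client, handlers: dict) -> None:
    """Subscribe one handler per input subject on the given client."""
    for subject in INPUT_SUBJECTS:
        client.subscribe(subject, handlers[subject])
```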
Output

| Component | Type | Description |
|---|---|---|
| model_update | Nats subject | When a new Deep Learning model has been trained and is ready for inferencing, the GPU controller publishes to this Nats subject. |
| gpu_trainingjob_status | Nats subject | When a new Deep Learning model has been trained, the GPU controller service publishes "JobEnd" to this Nats subject. |
| gpu_service_running | Nats request/reply subject | The GPU controller receives a request through the gpu_service_running Nats subject and replies that it is running. |
| model_inferenced_workload_logs | Nats subject | After inferencing on the logs it receives, the GPU inferencing service publishes to the model_inferenced_workload_logs Nats subject, which is read by the workload DRAIN service. |
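The publishes in the table above can be sketched as two small helpers: one announcing a finished training job, one forwarding inferred logs. The `client.publish` interface mirrors a NATS client but is a stand-in here; the function names and payload shapes are assumptions.

```python
import json

# Hypothetical sketch of the controller's output publishes, using the
# subject names from the table above.

def announce_training_complete(client, model_version: str) -> None:
    """Announce the new model and report the training job's end."""
    client.publish("model_update", model_version.encode())
    client.publish("gpu_trainingjob_status", b"JobEnd")

def publish_inferred_logs(client, inferred_logs: list) -> None:
    """Forward inferred logs; consumed downstream by the workload DRAIN service."""
    client.publish("model_inferenced_workload_logs",
                   json.dumps(inferred_logs).encode())
```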
- Requires an NVIDIA GPU configured on the cluster.
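One minimal way to verify this prerequisite at startup is to check whether the NVIDIA driver tooling is present. This is an assumed check for illustration, not the controller's actual startup code.

```python
import shutil

def nvidia_gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH, a rough proxy for an
    NVIDIA GPU being configured on the node."""
    return shutil.which("nvidia-smi") is not None
```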
- Unit tests
- Integration tests
- e2e tests
- Manual testing