
Opni GPU Controller Service

Amartya Chakraborty edited this page Jan 25, 2023 · 2 revisions


Description

Manages the GPU within the cluster. When a training job is submitted, the GPU is used to train a new Deep Learning model; otherwise, the GPU is used for inference.
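The arbitration described above can be sketched as a small decision function. This is purely illustrative; the function and its name are not part of Opni's actual API.

```python
# Hypothetical sketch of the controller's GPU arbitration: a submitted
# training job takes priority, otherwise the GPU serves inference.

def choose_gpu_task(training_job_submitted: bool) -> str:
    """Decide which workload the GPU should run next."""
    # A submitted training job preempts inference work.
    return "train" if training_job_submitted else "infer"
```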

Programming Languages

  • Python

Diagram

Opni GPU Controller

Responsibilities

  • Train a new Deep Learning model using the GPU.
  • Run inference on workload logs with a Deep Learning model using the GPU.
  • Send inferred log messages to the workload DRAIN service to be placed into its cache.

Input and output interfaces

Input

| Component | Type | Description |
| --- | --- | --- |
| model_update | NATS subject | When a new Deep Learning model has been trained and is ready for inference, the GPU controller (which subscribes to this subject) loads the latest model. |
| gpu_service_inference_internal | NATS subject | Within the GPU controller, the training controller sends logs over this subject to be inferred on using the GPU. |
| gpu_service_running | NATS request/reply subject | The GPU controller receives a request on this subject and replies that it is running. |
| gpu_service_training_internal | NATS subject | The GPU controller receives a payload from this subject and then runs a training job. |
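A minimal sketch of wiring up these input subjects with the nats-py client. The handler bodies, the hard-coded server URL, and the payload shape are assumptions for illustration; only the subject names come from the table above.

```python
import asyncio
import json

async def on_running_request(msg):
    # gpu_service_running is a request/reply subject: confirm liveness.
    await msg.respond(b"running")

async def on_training_payload(msg):
    # gpu_service_training_internal carries the payload for a training job.
    payload = json.loads(msg.data)
    # ... hand `payload` to the training logic here (omitted) ...

async def main():
    import nats  # nats-py client (pip install nats-py)

    # Assumed local NATS URL; a real deployment supplies this via config.
    nc = await nats.connect("nats://localhost:4222")
    await nc.subscribe("gpu_service_running", cb=on_running_request)
    await nc.subscribe("gpu_service_training_internal", cb=on_training_payload)
```

The handlers are plain coroutines taking a NATS `Msg`, so they can be unit-tested with a stub message object without a running NATS server.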

Output

| Component | Type | Description |
| --- | --- | --- |
| model_update | NATS subject | When a new Deep Learning model has been trained and is ready for inference, the GPU controller publishes to this subject. |
| gpu_trainingjob_status | NATS subject | When a new Deep Learning model has been trained, the GPU controller service publishes "JobEnd" to this subject. |
| gpu_service_running | NATS request/reply subject | The GPU controller receives a request on this subject and replies that it is running. |
| model_inferenced_workload_logs | NATS subject | After running inference on the logs it receives, the GPU inference service publishes to this subject, which is read by the workload DRAIN service. |
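The output side can be sketched as below. The payload field names and the empty `model_update` body are assumptions for illustration; only the subject names and the "JobEnd" status string come from the table above.

```python
import json

def inference_result(log_id: str, score: float) -> bytes:
    # Hypothetical payload shape for model_inferenced_workload_logs;
    # the actual field names used by Opni may differ.
    return json.dumps({"log_id": log_id, "anomaly_score": score}).encode()

async def announce_training_done(nc):
    # Report training completion, then notify subscribers of model_update
    # that a fresh model is ready to be loaded for inference.
    await nc.publish("gpu_trainingjob_status", b"JobEnd")
    await nc.publish("model_update", b"")
```

Because `nc` is only required to expose `publish(subject, data)`, the announcement logic can be exercised with a stub connection in tests.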

Restrictions/limitations

  • Requires an NVIDIA GPU configured on the cluster.

Performance issues

Test plan

  • Unit tests
  • Integration tests
  • e2e tests
  • Manual testing