Opni GPU Controller Service
Amartya Chakraborty edited this page Jan 25, 2023 · 2 revisions
The GPU controller service manages the GPU within the cluster. When a training job is submitted, the GPU is used to train a new Deep Learning model; otherwise, it is used for inferencing.
- Written in Python
- Train a new Deep Learning model using the GPU.
- Run inference on workload logs with a Deep Learning model using the GPU.
- Send inferred log messages to the workload DRAIN service, which places them in its cache.
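The core decision the responsibilities above describe, train when a job arrives, otherwise infer, can be sketched as plain routing logic. This is an illustrative sketch, not the actual Opni code; the function and payload field names are assumptions.

```python
# Hypothetical sketch of the controller's GPU decision: a training-job
# payload occupies the GPU with training, anything else is inferred on.
# `model` stands in for whatever Deep Learning model wrapper is loaded.

def handle_payload(payload: dict, model) -> str:
    """Route a payload to GPU training or GPU inference."""
    if payload.get("type") == "training_job":
        model.train(payload["data"])          # use the GPU to train a new model
        return "trained"
    inferred = model.infer(payload["logs"])   # otherwise infer on workload logs
    return f"inferred:{len(inferred)}"
```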
Input

| Component | Type | Description |
|---|---|---|
| model_update | Nats subject | The GPU controller is subscribed to this Nats subject; when a new Deep Learning model has been trained and is ready for inferencing, the controller loads the latest model. |
| gpu_service_inference_internal | Nats subject | Within the GPU controller, the training controller sends logs through this Nats subject to be inferred on using the GPU. |
| gpu_service_running | Nats request/reply subject | The GPU controller receives a request through the gpu_service_running Nats subject and replies that it is running. |
| gpu_service_training_internal | Nats subject | The GPU controller receives a payload from this Nats subject and then runs a training job. |
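Wiring one handler per input subject might look like the sketch below. The `client` is a stand-in for a NATS connection (e.g. nats-py's `subscribe`), and the handler names are assumptions for illustration.

```python
# Illustrative registration of the controller's input subjects.
# Subjects are taken from the table above; descriptions are for reference.

INPUT_SUBJECTS = {
    "model_update": "load the latest trained model",
    "gpu_service_inference_internal": "infer on forwarded logs with the GPU",
    "gpu_service_running": "reply that the service is running",
    "gpu_service_training_internal": "run a training job on the payload",
}

def register_handlers(client, handlers: dict) -> None:
    """Subscribe one handler per input subject on the given client."""
    for subject in INPUT_SUBJECTS:
        client.subscribe(subject, handlers[subject])
```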
Output

| Component | Type | Description |
|---|---|---|
| model_update | Nats subject | When a new Deep Learning model has been trained and is ready for inferencing, the GPU controller publishes to this Nats subject. |
| gpu_trainingjob_status | Nats subject | When a new Deep Learning model has been trained, the GPU controller service publishes "JobEnd" to this Nats subject. |
| gpu_service_running | Nats request/reply subject | The GPU controller receives a request through the gpu_service_running Nats subject and replies that it is running. |
| model_inferenced_workload_logs | Nats subject | After inferencing on the logs it receives, the GPU inferencing service publishes to the model_inferenced_workload_logs Nats subject, which is read by the workload DRAIN service. |
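The publishes in the table above can be sketched as two small helpers: one announcing a finished training job, one forwarding inferred logs. The `client.publish` interface mirrors a NATS client but is a stand-in here; the function names and payload shapes are assumptions.

```python
import json

# Hypothetical sketch of the controller's output publishes, using the
# subject names from the table above.

def announce_training_complete(client, model_version: str) -> None:
    """Announce the new model and report the training job's end."""
    client.publish("model_update", model_version.encode())
    client.publish("gpu_trainingjob_status", b"JobEnd")

def publish_inferred_logs(client, inferred_logs: list) -> None:
    """Forward inferred logs; consumed downstream by the workload DRAIN service."""
    client.publish("model_inferenced_workload_logs",
                   json.dumps(inferred_logs).encode())
```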
- Requires an NVIDIA GPU configured on the cluster.
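One minimal way to verify this prerequisite at startup is to check whether the NVIDIA driver tooling is present. This is an assumed check for illustration, not the controller's actual startup code.

```python
import shutil

def nvidia_gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH, a rough proxy for an
    NVIDIA GPU being configured on the node."""
    return shutil.which("nvidia-smi") is not None
```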
- Unit tests
- Integration tests
- e2e tests
- Manual testing