This kops hook container may be used to enable nodes with GPUs to work with Kubernetes.
It works as follows:

- The hook container runs on a Kubernetes node upon every boot.
- It installs onto the host system a systemd oneshot service unit, nvidia-device-plugin.service, along with setup scripts (see the sketch after this list).
- The systemd unit nvidia-device-plugin.service runs and executes the setup scripts in the host directory /nvidia-device-plugin.
- The scripts install the NVIDIA device drivers and nvidia-docker from web sources.
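For reference, a oneshot unit of this kind is essentially a pointer at the setup scripts. The following is a minimal sketch, not the actual unit written by the hook; in particular, the entry-point script name setup.sh is a hypothetical placeholder:

# Minimal illustrative sketch of the oneshot unit (not the shipped unit).
# /nvidia-device-plugin/setup.sh is a hypothetical entry-point name.
[Unit]
Description=Install NVIDIA device drivers and nvidia-docker
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
# Keep the unit reported as active after the scripts exit
RemainAfterExit=true
ExecStart=/nvidia-device-plugin/setup.sh

[Install]
WantedBy=multi-user.target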
kops version
Version 1.16.0 (git-4b0e62b82)
export KOPS_STATE_STORE=s3://some-s3-bucket-name
kops create cluster \
--cloud aws \
--zones eu-west-1a,eu-west-1b,eu-west-1c \
--master-zones eu-west-1a \
--networking calico \
--master-size m5.large \
--node-size g4dn.xlarge \
--node-count 1 \
gpu.k8s.local
Next, add the hook to the nodes instance group. This should be safe to do for all machines, because the hook auto-detects whether the machine has an NVIDIA GPU installed and is a no-op otherwise.
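The detection logic itself is not shown in this document, but such a guard typically amounts to probing the PCI bus. A minimal sketch, assuming lspci is available on the host:

# Illustrative auto-detection guard (a sketch, not the hook's actual code):
# do nothing and exit successfully when no NVIDIA device is on the PCI bus.
if ! lspci | grep -qi nvidia; then
  echo "no NVIDIA GPU detected; skipping setup"
  exit 0
fi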
kops edit ig --name=gpu.k8s.local nodes
spec:
  hooks:
  - execContainer:
      image: pure/nvidia-device-plugin:tesla
kops update cluster --name gpu.k8s.local --yes
while ! kops validate cluster --name gpu.k8s.local; do sleep 10; done
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
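Before inspecting node capacity, you may want to confirm that the device plugin pods are running. The namespace and label below follow the upstream manifest at the referenced version; verify them against the manifest you actually applied:

kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds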
kubectl get no -l beta.kubernetes.io/instance-type=g4dn.xlarge -ojson | jq '.items[].status.capacity'
{
  "attachable-volumes-aws-ebs": "39",
  "cpu": "4",
  "ephemeral-storage": "125753328Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "16133788Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  terminationGracePeriodSeconds: 3
  containers:
  - name: gpu
    image: tensorflow/tensorflow:latest-gpu
    imagePullPolicy: IfNotPresent
    args: ["sleep", "1d"]
    env:
    - name: TF_CPP_MIN_LOG_LEVEL
      value: "3"
    resources:
      limits:
        memory: 1024Mi
        nvidia.com/gpu: 1 # requesting 1 GPU
EOF
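The tensorflow/tensorflow:latest-gpu image is large, so the pod can take a while to pull and become Ready. As an optional convenience (not part of the original steps), wait for it before exec'ing in:

kubectl wait --for=condition=Ready pod/tf-gpu --timeout=10m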
kubectl exec -it tf-gpu -- nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 37C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
kubectl exec -it tf-gpu -- python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3121764385910360567
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 18396904841851242797
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 4503292422335862858
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14941647668
locality {
  bus_id: 1
  links {
  }
}
incarnation: 6949372932394118638
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5"
]
kubectl delete pod tf-gpu
Running a Kubernetes cluster in AWS costs money, so you may want to delete the cluster when you are finished running experiments.
kops delete cluster --name gpu.k8s.local --yes