Why is the operator is designed to run on master/control-plane, and the devicePlugin with toleration? #1194

amir-bialek · 2025-01-05T13:41:00Z

Looking at the operator helm-chart default values:

operator:
  repository: nvcr.io/nvidia
  image: gpu-operator
  # If version is not specified, then default is to use chart.AppVersion
  #version: ""
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  priorityClassName: system-node-critical
  defaultRuntime: docker
  runtimeClass: nvidia
  use_ocp_driver_toolkit: false
  # cleanup CRD on chart un-install
  cleanupCRD: false
  # upgrade CRD on chart upgrade, requires --disable-openapi-validation flag
  # to be passed during helm upgrade.
  upgradeCRD: false
  initContainer:
    image: cuda
    repository: nvcr.io/nvidia
    version: 12.3.2-base-ubi8
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Equal"
    value: ""
    effect: "NoSchedule"
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Equal"
    value: ""
    effect: "NoSchedule"
  annotations:
    openshift.io/scc: restricted-readonly
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/master"
                operator: In
                values: [""]

And:

node-feature-discovery:
  enableNodeFeatureApi: true
  gc:
    enable: true
    replicaCount: 1
    serviceAccount:
      name: node-feature-discovery
      create: false
  worker:
    serviceAccount:
      name: node-feature-discovery
      # disable creation to avoid duplicate serviceaccount creation by master spec below
      create: false
    tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"

So I have the following running on my masternode:

gpu-operator-b97d998bc-d4gc6                                      1/1     Running   0          4d13h
gpu-operator-node-feature-discovery-master-6c6cd4598f-vbzrc   1/1     Running   0          4d13h
gpu-operator-node-feature-discovery-worker-hplbl              1/1     Running   0          4d13h

What is the logic behind it?
Shouldn't we avoid Running such resources on the masternode?

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why is the operator is designed to run on master/control-plane, and the devicePlugin with toleration? #1194

Why is the operator is designed to run on master/control-plane, and the devicePlugin with toleration? #1194

amir-bialek commented Jan 5, 2025

Why is the operator is designed to run on master/control-plane, and the devicePlugin with toleration? #1194

Why is the operator is designed to run on master/control-plane, and the devicePlugin with toleration? #1194

Comments

amir-bialek commented Jan 5, 2025