
Requesting a Python/CUDA example #18

Open
adelevie opened this issue Aug 18, 2022 · 3 comments

adelevie commented Aug 18, 2022

Hi,

The existing examples are very good. But given that the GPU/AI/ML features were highlighted in the introductory blog post ("Use accelerator-optimized resources."), it would be nice to see a full example here.

If it helps, I've tried this on my own, but got some errors:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "computeResource": {
          "cpuMilli": "20000",
          "memoryMib": "15000"
        },
        "runnables": [
          {
            "container": {
              "imageUri": "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python -c \"import torch;print(torch.cuda.is_available())\""]
            }
          }
        ],
        "maxRetryCount": 2,
        "maxRunDuration": "3600s"
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "instanceTemplate": "alan-test-instance-template-3"
      }
    ]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
```

The log output is:

```
2022-08-18 09:31:08.760 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Reading package lists...
2022-08-18 09:31:08.772 EDT
Task action/STARTUP/0/0/group0/0, STDOUT:
2022-08-18 09:31:08.777 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Building dependency tree...
2022-08-18 09:31:08.904 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Reading state information...
2022-08-18 09:31:08.905 EDT
Task action/STARTUP/0/0/group0/0, STDOUT:
2022-08-18 09:31:08.954 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation: The following packages have unmet dependencies:
2022-08-18 09:31:09.008 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: docker.io : Depends: runc (>= 1.0.0~rc6~)
2022-08-18 09:31:09.019 EDT
Task action/STARTUP/0/0/group0/0, STDERR: E: Unable to correct problems, you have held broken packages.
```

And for reference, here's the info for my instance template:

```json
{
  "creationTimestamp": "2022-08-17T14:05:29.128-07:00",
  "description": "",
  "id": "[redacted]",
  "kind": "compute#instanceTemplate",
  "name": "alan-test-instance-template-3",
  "properties": {
    "confidentialInstanceConfig": {
      "enableConfidentialCompute": false
    },
    "description": "",
    "scheduling": {
      "onHostMaintenance": "TERMINATE",
      "provisioningModel": "STANDARD",
      "automaticRestart": true,
      "preemptible": false
    },
    "tags": {},
    "disks": [
      {
        "type": "PERSISTENT",
        "deviceName": "alan-test-instance-template-3",
        "autoDelete": true,
        "index": 0,
        "boot": true,
        "kind": "compute#attachedDisk",
        "mode": "READ_WRITE",
        "initializeParams": {
          "sourceImage": "projects/ml-images/global/images/c0-deeplearning-common-cu110-v20220806-debian-10",
          "diskType": "pd-balanced",
          "diskSizeGb": "100"
        }
      },
      {
        "type": "PERSISTENT",
        "deviceName": "persistent-disk-1",
        "autoDelete": false,
        "index": 1,
        "kind": "compute#attachedDisk",
        "mode": "READ_WRITE",
        "initializeParams": {
          "description": "",
          "diskType": "pd-balanced",
          "diskSizeGb": "100"
        }
      }
    ],
    "networkInterfaces": [
      {
        "name": "nic0",
        "network": "projects/[redacted]/global/networks/default",
        "accessConfigs": [
          {
            "name": "External NAT",
            "type": "ONE_TO_ONE_NAT",
            "kind": "compute#accessConfig",
            "networkTier": "PREMIUM"
          }
        ],
        "kind": "compute#networkInterface"
      }
    ],
    "reservationAffinity": {
      "consumeReservationType": "ANY_RESERVATION"
    },
    "canIpForward": false,
    "keyRevocationActionType": "NONE",
    "machineType": "n1-standard-4",
    "metadata": {
      "fingerprint": "[redacted]",
      "kind": "compute#metadata"
    },
    "shieldedVmConfig": {
      "enableSecureBoot": false,
      "enableVtpm": true,
      "enableIntegrityMonitoring": true
    },
    "shieldedInstanceConfig": {
      "enableSecureBoot": false,
      "enableVtpm": true,
      "enableIntegrityMonitoring": true
    },
    "serviceAccounts": [
      {
        "email": "[redacted]@developer.gserviceaccount.com",
        "scopes": [
          "https://www.googleapis.com/auth/devstorage.read_only",
          "https://www.googleapis.com/auth/logging.write",
          "https://www.googleapis.com/auth/monitoring.write",
          "https://www.googleapis.com/auth/servicecontrol",
          "https://www.googleapis.com/auth/service.management.readonly",
          "https://www.googleapis.com/auth/trace.append"
        ]
      }
    ],
    "guestAccelerators": [
      {
        "acceleratorCount": 1,
        "acceleratorType": "nvidia-tesla-t4"
      }
    ],
    "displayDevice": {
      "enableDisplay": false
    }
  },
  "selfLink": "projects/[redacted]/global/instanceTemplates/alan-test-instance-template-3"
}
```

EDIT: Digging from the Job spec down to the ComputeResource spec, I see the following:

> `gpuCount`: string ([int64](https://developers.google.com/discovery/v1/type-format) format)
>
> The GPU count.
>
> Not yet implemented.

Does this imply GPU jobs are not yet supported?


lripoche commented Feb 6, 2023

> Does this imply GPU jobs are not yet supported?

According to the docs, GPU jobs are supported. Unfortunately, I can't get the container job example to work: the base container image is downloaded and the NVIDIA drivers are installed, but the command is never executed and the job exits with an error.

GPU count can be set with this syntax.
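For reference, a minimal `allocationPolicy` fragment along those lines might look like this (a sketch only; field names follow the Batch v1 REST API, and the machine and accelerator types are example values):

```json
{
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-4",
          "accelerators": [
            {
              "type": "nvidia-tesla-t4",
              "count": 1
            }
          ]
        }
      }
    ]
  }
}
```

Note that the GPU count lives under `allocationPolicy.instances[].policy.accelerators`, not under the task's `computeResource` (whose `gpuCount` field is the one the docs mark "Not yet implemented").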

@aaronegolden (Collaborator) commented

There is now a dogs vs. cats CNN training example here, which uses PyTorch and acceleration via CUDA.

GPU jobs (containerized or not) are supported in general. When the installGpuDrivers flag is set in the job spec, Batch will automatically install the drivers, and for container runnables it will automatically set the Docker options needed to give the container access to the GPU(s).
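Putting that together, a minimal GPU container job might be sketched as follows (field names per the Batch v1 REST API; the image, machine type, and accelerator type are example values, not the only supported ones):

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python -c \"import torch; print(torch.cuda.is_available())\""]
            }
          }
        ],
        "maxRetryCount": 2,
        "maxRunDuration": "3600s"
      },
      "taskCount": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-4",
          "accelerators": [{ "type": "nvidia-tesla-t4", "count": 1 }]
        }
      }
    ]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
```

If the task runs on a machine with a working driver install, the container's Python one-liner should print `True`.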

I've tested the new sample (this one) recently, so it should be a reliable template for other PyTorch/CUDA jobs. Please let me know if you run into any issues.

@kesitrifork commented

This seems a bit under-documented. I can't get it to work with anything other than the config below; I've tried many variations of it, and everything else fails:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml && sudo nvidia-ctk runtime configure --runtime=docker --cdi.enabled && sudo systemctl restart docker"
            }
          },
          {
            "container": {
              "imageUri": "ultralytics/ultralytics:8.2.7",
              "commands": ["/var/lib/nvidia/bin/nvidia-smi"],
              "volumes": ["/var/lib/nvidia/bin:/var/lib/nvidia/bin:ro"],
              "options": "--runtime=nvidia --network=host"
            }
          }
        ],
        "computeResource": {
          "cpuMilli": 1000,
          "memoryMib": 1000
        },
        "maxRetryCount": 2,
        "maxRunDuration": "600s"
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-2",
          "accelerators": [
            {
              "type": "nvidia-tesla-t4",
              "count": 1
            }
          ],
          "bootDisk": {
            "type": "pd-balanced",
            "sizeGb": "30",
            "image": "projects/batch-custom-image/global/images/family/batch-cos-stable-official"
          }
        }
      }
    ]
  },
  "labels": {
    "department": "creative",
    "environment": "dev"
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
```
