
Requesting a Python/CUDA example #18

Open
adelevie opened this issue Aug 18, 2022 · 3 comments

adelevie commented Aug 18, 2022

Hi,

The existing examples are very good. But given that the GPU/AI/ML features were highlighted in the introductory blog post ("Use accelerator-optimized resources."), it would be nice to see a full example here.

If it helps, I've tried this on my own, but got some errors:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "computeResource": {
          "cpuMilli": "20000",
          "memoryMib": "15000"
        },
        "runnables": [
          {
            "container": {
              "imageUri": "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python -c \"import torch;print(torch.cuda.is_available())\""]
            }
          }
        ],
        "maxRetryCount": 2,
        "maxRunDuration": "3600s"
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "instanceTemplate": "alan-test-instance-template-3"
      }
    ]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
```

The log output is:

```
2022-08-18 09:31:08.760 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Reading package lists...
2022-08-18 09:31:08.772 EDT
Task action/STARTUP/0/0/group0/0, STDOUT:
2022-08-18 09:31:08.777 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Building dependency tree...
2022-08-18 09:31:08.904 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Reading state information...
2022-08-18 09:31:08.905 EDT
Task action/STARTUP/0/0/group0/0, STDOUT:
2022-08-18 09:31:08.954 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation: The following packages have unmet dependencies:
2022-08-18 09:31:09.008 EDT
Task action/STARTUP/0/0/group0/0, STDOUT: docker.io : Depends: runc (>= 1.0.0~rc6~)
2022-08-18 09:31:09.019 EDT
Task action/STARTUP/0/0/group0/0, STDERR: E: Unable to correct problems, you have held broken packages.
```

And for reference, here's the info for my instance template:

```json
{
  "creationTimestamp": "2022-08-17T14:05:29.128-07:00",
  "description": "",
  "id": "[redacted]",
  "kind": "compute#instanceTemplate",
  "name": "alan-test-instance-template-3",
  "properties": {
    "confidentialInstanceConfig": {
      "enableConfidentialCompute": false
    },
    "description": "",
    "scheduling": {
      "onHostMaintenance": "TERMINATE",
      "provisioningModel": "STANDARD",
      "automaticRestart": true,
      "preemptible": false
    },
    "tags": {},
    "disks": [
      {
        "type": "PERSISTENT",
        "deviceName": "alan-test-instance-template-3",
        "autoDelete": true,
        "index": 0,
        "boot": true,
        "kind": "compute#attachedDisk",
        "mode": "READ_WRITE",
        "initializeParams": {
          "sourceImage": "projects/ml-images/global/images/c0-deeplearning-common-cu110-v20220806-debian-10",
          "diskType": "pd-balanced",
          "diskSizeGb": "100"
        }
      },
      {
        "type": "PERSISTENT",
        "deviceName": "persistent-disk-1",
        "autoDelete": false,
        "index": 1,
        "kind": "compute#attachedDisk",
        "mode": "READ_WRITE",
        "initializeParams": {
          "description": "",
          "diskType": "pd-balanced",
          "diskSizeGb": "100"
        }
      }
    ],
    "networkInterfaces": [
      {
        "name": "nic0",
        "network": "projects/[redacted]/global/networks/default",
        "accessConfigs": [
          {
            "name": "External NAT",
            "type": "ONE_TO_ONE_NAT",
            "kind": "compute#accessConfig",
            "networkTier": "PREMIUM"
          }
        ],
        "kind": "compute#networkInterface"
      }
    ],
    "reservationAffinity": {
      "consumeReservationType": "ANY_RESERVATION"
    },
    "canIpForward": false,
    "keyRevocationActionType": "NONE",
    "machineType": "n1-standard-4",
    "metadata": {
      "fingerprint": "[redacted]",
      "kind": "compute#metadata"
    },
    "shieldedVmConfig": {
      "enableSecureBoot": false,
      "enableVtpm": true,
      "enableIntegrityMonitoring": true
    },
    "shieldedInstanceConfig": {
      "enableSecureBoot": false,
      "enableVtpm": true,
      "enableIntegrityMonitoring": true
    },
    "serviceAccounts": [
      {
        "email": "[redacted]@developer.gserviceaccount.com",
        "scopes": [
          "https://www.googleapis.com/auth/devstorage.read_only",
          "https://www.googleapis.com/auth/logging.write",
          "https://www.googleapis.com/auth/monitoring.write",
          "https://www.googleapis.com/auth/servicecontrol",
          "https://www.googleapis.com/auth/service.management.readonly",
          "https://www.googleapis.com/auth/trace.append"
        ]
      }
    ],
    "guestAccelerators": [
      {
        "acceleratorCount": 1,
        "acceleratorType": "nvidia-tesla-t4"
      }
    ],
    "displayDevice": {
      "enableDisplay": false
    }
  },
  "selfLink": "projects/[redacted]/global/instanceTemplates/alan-test-instance-template-3"
}
```

EDIT: Digging from the Job spec down to the ComputeResource spec, I see the following:

> `gpuCount`: string ([int64](https://developers.google.com/discovery/v1/type-format) format)
>
> The GPU count.
>
> Not yet implemented.

Does this imply GPU jobs are not yet supported?


lripoche commented Feb 6, 2023

> Does this imply GPU jobs are not yet supported?

According to the docs, GPU jobs are supported. Unfortunately, I can't get the container job example to work: the base container image is downloaded and the NVIDIA drivers are installed, but the command is never executed and the job exits with an error.

GPU count can be set with this syntax.
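For reference, a minimal `allocationPolicy` fragment along those lines might look like this (a sketch only; field names follow the Batch v1 REST API, and the machine and accelerator types are example values):

```json
{
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-4",
          "accelerators": [
            {
              "type": "nvidia-tesla-t4",
              "count": 1
            }
          ]
        }
      }
    ]
  }
}
```

Note that the GPU count lives under `allocationPolicy.instances[].policy.accelerators`, not under the task's `computeResource` (whose `gpuCount` field is the one the docs mark "Not yet implemented").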

@aaronegolden (Collaborator) commented

There is now a dogs vs. cats CNN training example here, which uses PyTorch and acceleration via CUDA.

GPU jobs (containerized or not) are supported in general. When the installGpuDrivers flag is set in the job spec, Batch will automatically install the drivers, and for container runnables it will automatically set the Docker options needed to give the container access to the GPU(s).
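Putting that together, a minimal GPU container job might be sketched as follows (field names per the Batch v1 REST API; the image, machine type, and accelerator type are example values, not the only supported ones):

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python -c \"import torch; print(torch.cuda.is_available())\""]
            }
          }
        ],
        "maxRetryCount": 2,
        "maxRunDuration": "3600s"
      },
      "taskCount": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-4",
          "accelerators": [{ "type": "nvidia-tesla-t4", "count": 1 }]
        }
      }
    ]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
```

If the task runs on a machine with a working driver install, the container's Python one-liner should print `True`.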

I've tested the new sample (this one) recently, so it should be a reliable template for other PyTorch/CUDA jobs. Please let me know if you run into any issues.

@kesitrifork commented

This seems a bit under-documented. I can't get it to work with anything other than the config below; I've tried many variations of it, and everything else fails:

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "runnables": [
          {
            "script": {
              "text": "sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml && sudo nvidia-ctk runtime configure --runtime=docker --cdi.enabled && sudo systemctl restart docker"
            }
          },
          {
            "container": {
              "imageUri": "ultralytics/ultralytics:8.2.7",
              "commands": ["/var/lib/nvidia/bin/nvidia-smi"],
              "volumes": ["/var/lib/nvidia/bin:/var/lib/nvidia/bin:ro"],
              "options": "--runtime=nvidia --network=host"
            }
          }
        ],
        "computeResource": {
          "cpuMilli": 1000,
          "memoryMib": 1000
        },
        "maxRetryCount": 2,
        "maxRunDuration": "600s"
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "installGpuDrivers": true,
        "policy": {
          "machineType": "n1-standard-2",
          "accelerators": [
            {
              "type": "nvidia-tesla-t4",
              "count": 1
            }
          ],
          "bootDisk": {
            "type": "pd-balanced",
            "sizeGb": "30",
            "image": "projects/batch-custom-image/global/images/family/batch-cos-stable-official"
          }
        }
      }
    ]
  },
  "labels": {
    "department": "creative",
    "environment": "dev"
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
```
