NVIDIA A100 GPU + CUDA not being recognized for immich-machine-learning on docker #14808

Open
1 of 3 tasks
morikplay opened this issue Dec 19, 2024 · 11 comments
The bug

An NVIDIA A100 80GB runs in MIG mode, and a 20GB vGPU is passed through by the Proxmox PVE 8.3.2 hypervisor to an Ubuntu 22.04 VM. Because this is an NVAIE 3.x licensed product, only NVIDIA's closed-source drivers are supported for this GPU.

immich runs on this VM as a docker container, with nvidia-container-toolkit enabled. The host and VM drivers are compatible (not an exact sub-version match, but that is the intended bundled driver configuration). When immich is started with CUDA support, the ML container is unable to find the GPU. Other applications, e.g. Ollama or codeproject.ai, are able to recognize and work with the GPU both inside the VM and inside docker containers.

immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | 2024-12-19 04:55:45.045748200 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);

The output of get_device() suggests that the GPU is visible to the container:

docker exec -it immich_machine_learning  python -c "import onnxruntime as ort;print(ort.get_device())"
  GPU
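
A note on this check: as far as I know, onnxruntime.get_device() reports which device the installed onnxruntime build targets, so it can print GPU even when the CUDA driver cannot enumerate a device at runtime. A more direct probe is to call the CUDA driver API from inside the container. The sketch below is a minimal, unofficial example; it assumes libcuda.so.1 is on the loader path, and the function name probe_cuda_driver is made up for illustration.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Return (status, device_count) as seen by the CUDA driver API.

    status is "no-libcuda" when the library cannot be loaded,
    "cuInit-failed:<code>" or "count-failed:<code>" on driver errors,
    and "ok" otherwise.
    """
    try:
        cuda = ctypes.CDLL(lib_name)
    except OSError:
        return ("no-libcuda", 0)

    # cuInit(0) has to succeed before any other driver call is made.
    rc = cuda.cuInit(0)
    if rc != 0:
        return (f"cuInit-failed:{rc}", 0)

    count = ctypes.c_int(0)
    rc = cuda.cuDeviceGetCount(ctypes.byref(count))
    if rc != 0:
        return (f"count-failed:{rc}", 0)
    return ("ok", count.value)


if __name__ == "__main__":
    print(probe_cuda_driver())
```

Saved as probe.py (a hypothetical file name), this could be run with docker exec -i immich_machine_learning python - < probe.py. An error code of 100 here would correspond to the "no CUDA-capable device is detected" failure in the log above.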

The OS that Immich Server is running on

Ubuntu 22.04 on Proxmox PVE 8.3.2

Version of Immich Server

v1.123.0

Version of Immich Mobile App

v1.123.0 build.186

Platform with the issue

  • Server
  • Web
  • Mobile

Your docker-compose.yml content

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    extends:
      file: hwaccel.transcoding.yml
      service: nvenc
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - /mnt/synrs-photos-ext-lib:/usr/src/app/external/synrs-photos-ext-lib
    env_file:
      - .env
    depends_on:
      - redis
    restart: always
    healthcheck:
      disable: false
    network_mode: host
    group_add:
      - video

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-cuda
    extends:
      file: hwaccel.ml.yml
      service: cuda 
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always
    healthcheck:
      disable: false
    network_mode: host
    group_add:
      - video

  redis:
    container_name: immich_redis
    image: docker.io/redis:6.2-alpine@sha256:e3b17ba9479deec4b7d1eeec1548a253acc5374d68d3b27937fcfe4df8d18c7e
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always
      #ports:
      #- '6379:6379'
    network_mode: host

volumes:
  model-cache:

Your .env content

IMMICH_LOG_LEVEL=debug
UPLOAD_LOCATION=/mnt/immich_storage/library2
DB_URL='postgresql://<user>:<password>@pgs.esco.ghaar:5432/immichdb'
TZ=America/Los_Angeles

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release
IMMICH_HOST=0.0.0.0
REDIS_HOSTNAME=localhost
NVIDIA_VISIBLE_DEVICES=all
MACHINE_LEARNING_DEVICE_IDS=0
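
One thing that may be worth trying (an untested assumption, not a confirmed fix): the NVIDIA container toolkit also accepts MIG device UUIDs in NVIDIA_VISIBLE_DEVICES, so the MIG device could be named explicitly instead of using all. The sketch below extracts MIG UUIDs from the text printed by nvidia-smi -L. The sample listing and its UUIDs are placeholders, and the helper name mig_uuids is hypothetical.

```python
import re

# Matches lines such as "  MIG 1g.20gb Device 0: (UUID: MIG-...)".
MIG_LINE = re.compile(
    r"MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s+(MIG-[^)\s]+)\)"
)


def mig_uuids(listing):
    """Return (profile, index, uuid) for each MIG device in `nvidia-smi -L` text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in MIG_LINE.finditer(listing)
    ]


# Placeholder listing; the UUIDs are made up for illustration.
SAMPLE = """\
GPU 0: GRID A100D-1-20C (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 1g.20gb Device 0: (UUID: MIG-11111111-1111-1111-1111-111111111111)
"""

if __name__ == "__main__":
    for profile, index, uuid in mig_uuids(SAMPLE):
        print(f"NVIDIA_VISIBLE_DEVICES={uuid}  # {profile}, MIG device {index}")
```

In practice the listing would come from running nvidia-smi -L on the VM rather than from the SAMPLE string.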

Reproduction steps

  1. On PVE 8.3.2, the NVIDIA AIE 3.x drivers for the A100 80GB GPU were compiled via dkms. A MIG-backed vGPU was created and passed through to the Ubuntu VM. More information can be found in the Additional information section below.
  2. nvidia-smi configuration on the VM was verified.
  3. immich was started with docker compose pull && docker compose up -d && docker compose logs -f
  4. The ML container is unable to find a CUDA device even though the GPU can be seen from inside the container.

Relevant log output

immich_machine_learning  | [12/19/24 04:55:10] INFO     Starting gunicorn 23.0.0
immich_machine_learning  | [12/19/24 04:55:10] INFO     Listening at: http://0.0.0.0:3003 (9)
immich_machine_learning  | [12/19/24 04:55:10] INFO     Using worker: app.config.CustomUvicornWorker
immich_machine_learning  | [12/19/24 04:55:10] INFO     Booting worker with pid: 10
immich_machine_learning  | [12/19/24 04:55:14] INFO     Started server process [10]
immich_machine_learning  | [12/19/24 04:55:14] INFO     Waiting for application startup.
immich_machine_learning  | [12/19/24 04:55:14] INFO     Created in-memory cache with unloading after 300s
immich_machine_learning  |                              of inactivity.
immich_machine_learning  | [12/19/24 04:55:14] INFO     Initialized request thread pool with 8 threads.
immich_machine_learning  | [12/19/24 04:55:14] INFO     Application startup complete.
immich_machine_learning  | [12/19/24 04:55:14] INFO     Loading visual model 'ViT-L-16-SigLIP-256__webli'
immich_machine_learning  |                              to memory
immich_machine_learning  | [12/19/24 04:55:14] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:15.116539945 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:17] INFO     Loading detection model 'antelopev2' to memory
immich_machine_learning  | [12/19/24 04:55:17] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:17.183827391 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:17] INFO     Loading recognition model 'antelopev2' to memory
immich_machine_learning  | [12/19/24 04:55:17] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:17.872335540 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:37] INFO     Starting gunicorn 23.0.0
immich_machine_learning  | [12/19/24 04:55:37] INFO     Listening at: http://0.0.0.0:3003 (9)
immich_machine_learning  | [12/19/24 04:55:37] INFO     Using worker: app.config.CustomUvicornWorker
immich_machine_learning  | [12/19/24 04:55:37] INFO     Booting worker with pid: 10
immich_machine_learning  | [12/19/24 04:55:40] INFO     Started server process [10]
immich_machine_learning  | [12/19/24 04:55:40] INFO     Waiting for application startup.
immich_machine_learning  | [12/19/24 04:55:40] INFO     Created in-memory cache with unloading after 300s
immich_machine_learning  |                              of inactivity.
immich_machine_learning  | [12/19/24 04:55:40] INFO     Initialized request thread pool with 8 threads.
immich_machine_learning  | [12/19/24 04:55:40] INFO     Application startup complete.
immich_machine_learning  | [12/19/24 04:55:41] INFO     Loading visual model 'ViT-L-16-SigLIP-256__webli'
immich_machine_learning  |                              to memory
immich_machine_learning  | [12/19/24 04:55:41] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:42.156008534 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:44] INFO     Loading detection model 'antelopev2' to memory
immich_machine_learning  | [12/19/24 04:55:44] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:44.286638144 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:44] INFO     Loading recognition model 'antelopev2' to memory
immich_machine_learning  | [12/19/24 04:55:44] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | 2024-12-19 04:55:45.045748200 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
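
Because the same failure repeats for every model load, the log can be condensed before reading it. The following is a small sketch that counts CUDA failure codes, CPU fallbacks, and model loads in a docker compose log. The regular expressions are written against the lines shown above; SAMPLE is an abbreviated excerpt of that log, and the function name summarize is an illustrative choice.

```python
import re
from collections import Counter

# Only the cuda_call.cc lines carry the "CudaCall]" tag, so each failure
# is counted once rather than again in the "EP Error" summary line.
CUDA_FAILURE = re.compile(r"CudaCall\] CUDA failure (\d+):")
MODEL_LOAD = re.compile(r"Loading (\w+) model '([^']+)'")
FALLBACK = "Falling back to ['CPUExecutionProvider']"


def summarize(log_text):
    """Count CUDA failure codes, CPU fallbacks and model loads in a log."""
    failures = Counter()
    models = []
    fallbacks = 0
    for line in log_text.splitlines():
        failure = CUDA_FAILURE.search(line)
        if failure:
            failures[int(failure.group(1))] += 1
        load = MODEL_LOAD.search(line)
        if load:
            models.append((load.group(1), load.group(2)))
        if FALLBACK in line:
            fallbacks += 1
    return {"failures": dict(failures), "fallbacks": fallbacks, "models": models}


# Abbreviated excerpt of the log above.
SAMPLE = "\n".join([
    "immich_machine_learning  | [12/19/24 04:55:14] INFO     Loading visual model 'ViT-L-16-SigLIP-256__webli'",
    "immich_machine_learning  | 2024-12-19 04:55:15.116539945 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ;",
    "immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.",
    "immich_machine_learning  | [12/19/24 04:55:17] INFO     Loading detection model 'antelopev2' to memory",
])

if __name__ == "__main__":
    print(summarize(SAMPLE))
```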

Additional information

Host
nvidia-smi:

nvidia-smi
Thu Dec 19 09:16:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:8A:00.0 Off |                   On |
| N/A   30C    P0             95W /  300W |   19718MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    6   0   0  |           19718MiB / 19968MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0    6    0       4892    C+G   vgpu                                        19712MiB |
+-----------------------------------------------------------------------------------------+

VM's QEMU passthrough config:

agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=sata0;scsi0;scsi1
cicustom: vendor=local:snippets/esco.yaml,user=local:snippets/users-cpai.yaml,network=local:snippets/metadata.yaml
cores: 1
cpu: x86-64-v4
efidisk0: zfs-cluster-repl:vm-8013-disk-1,size=1M
hostpci0: 0000:8a:00.4,mdev=nvidia-1054
machine: q35,viommu=virtio
memory: 65536
meta: creation-qemu=9.0.2,ctime=1729726853
name: cpai
nameserver: 192.168.100.36 192.168.100.35 192.168.0.1
net0: virtio=00:50:56:88:e9:8d,bridge=vmbr4
onboot: 1
ostype: l26
scsi0: zfs-cluster-repl:vm-8013-disk-0,size=150G
scsi1: vm-data:vm-8013-cloudinit,media=cdrom,size=4M
scsihw: virtio-scsi-pci
searchdomain: esco.ghaar
smbios1: uuid=4208fd12-12f9-9b4d-beb5-2b8a00ebf73a
sockets: 8
startup: order=10
tags: 22.04;docker;ubuntu
vga: std
vmgenid: bf13e566-8051-4035-ad0a-592cc15916de

VM
nvidia-smi:

nvidia-smi
Thu Dec 19 09:15:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID A100D-1-20C               On  |   00000000:06:10.0 Off |                   On |
| N/A   N/A    P0             N/A /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    0   0   0  |               1MiB / 18412MiB    | 14      0 |  1   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
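
The host and VM driver versions differ only in the last component (550.90.05 vs 550.90.07). A quick way to compare them from the two nvidia-smi headers is sketched below. Treating "same major.minor" as compatible is an assumption made for illustration, not NVIDIA's official compatibility rule, and the helper names are hypothetical.

```python
import re

DRIVER = re.compile(r"Driver Version:\s*(\d+(?:\.\d+)+)")


def driver_version(smi_text):
    """Parse the 'Driver Version:' field of nvidia-smi output into a tuple of ints."""
    match = DRIVER.search(smi_text)
    if match is None:
        raise ValueError("no 'Driver Version:' field found")
    return tuple(int(part) for part in match.group(1).split("."))


def same_branch(host, guest):
    """True when major and minor agree (e.g. 550.90.x on both sides)."""
    return host[:2] == guest[:2]


# Header lines copied from the host and VM nvidia-smi output above.
HOST = "| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |"
VM = "| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |"

if __name__ == "__main__":
    host, guest = driver_version(HOST), driver_version(VM)
    print(host, guest, "same branch:", same_branch(host, guest))
```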

nvcc:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

deviceQuery:

~/cuda-samples/Samples/1_Utilities/deviceQuery/deviceQuery
/home/maumau/cuda-samples/Samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID A100D-1-20C MIG 1g.20gb"
  CUDA Driver Version / Runtime Version          12.4 / 12.4
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 20476 MBytes (21470511104 bytes)
  (014) Multiprocessors, (064) CUDA Cores/MP:    896 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1512 Mhz
  Memory Bus Width:                              1280-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                No
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 16
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 12.4, NumDevs = 1
Result = PASS

nvidia-container-toolkit:

NVIDIA Container Runtime Hook version 1.17.3
commit: cb82e29c75d387992bf59eb6eadf5d96cb6d4747

docker daemon:

    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
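
The fragment above shows only the runtimes entry. To check the whole daemon configuration programmatically, something like the following sketch could be used. It assumes the configuration lives in a JSON file such as /etc/docker/daemon.json; SAMPLE wraps the fragment above in braces so it parses on its own, and the function name nvidia_runtime_status is made up for this example.

```python
import json


def nvidia_runtime_status(daemon_json_text):
    """Report how the nvidia runtime appears in a Docker daemon config."""
    cfg = json.loads(daemon_json_text)
    runtimes = cfg.get("runtimes", {})
    return {
        # Is a runtime named "nvidia" registered at all?
        "registered": "nvidia" in runtimes,
        # Which binary does it point to?
        "path": runtimes.get("nvidia", {}).get("path"),
        # Is it also the default runtime for every container?
        "default": cfg.get("default-runtime") == "nvidia",
    }


# The fragment shown above, wrapped so that it is valid JSON.
SAMPLE = '{"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}'

if __name__ == "__main__":
    print(nvidia_runtime_status(SAMPLE))
```

With the fragment as posted, the runtime is registered but not set as the default; whether that matters depends on how the container requests the GPU.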

CDI spec /etc/cdi/nvidia.yaml:

---
cdiVersion: 0.5.0
containerEdits:
  deviceNodes:
  - path: /dev/nvidia-modeset
  - path: /dev/nvidia-uvm
  - path: /dev/nvidia-uvm-tools
  - path: /dev/nvidiactl
  env:
  - NVIDIA_VISIBLE_DEVICES=void
  hooks:
  - args:
    - nvidia-cdi-hook
    - create-symlinks
    - --link
    - ../libnvidia-allocator.so.1::/usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so
    - --link
    - libglxserver_nvidia.so.550.90.07::/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
    hookName: createContainer
    path: /usr/bin/nvidia-cdi-hook
  - args:
    - nvidia-cdi-hook
    - create-symlinks
    - --link
    - libcuda.so.1::/usr/lib/x86_64-linux-gnu/libcuda.so
    - --link
    - libnvidia-opticalflow.so.1::/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so
    - --link
    - libGLX_nvidia.so.550.90.07::/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0
    hookName: createContainer
    path: /usr/bin/nvidia-cdi-hook
  - args:
    - nvidia-cdi-hook
    - update-ldcache
    - --folder
    - /usr/lib/x86_64-linux-gnu
    - --folder
    - /usr/lib/x86_64-linux-gnu/vdpau
    hookName: createContainer
    path: /usr/bin/nvidia-cdi-hook
  mounts:
  - containerPath: /run/nvidia-persistenced/socket
    hostPath: /run/nvidia-persistenced/socket
    options:
    - ro
    - nosuid
    - nodev
    - bind
    - noexec
  - containerPath: /usr/bin/nvidia-cuda-mps-control
    hostPath: /usr/bin/nvidia-cuda-mps-control
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-cuda-mps-server
    hostPath: /usr/bin/nvidia-cuda-mps-server
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-debugdump
    hostPath: /usr/bin/nvidia-debugdump
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-persistenced
    hostPath: /usr/bin/nvidia-persistenced
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-smi
    hostPath: /usr/bin/nvidia-smi
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /etc/vulkan/icd.d/nvidia_icd.json
    hostPath: /etc/vulkan/icd.d/nvidia_icd.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /etc/vulkan/implicit_layer.d/nvidia_layers.json
    hostPath: /etc/vulkan/implicit_layer.d/nvidia_layers.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libcudadebugger.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libcudadebugger.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvcuvid.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvcuvid.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-egl-wayland.so.1.1.13
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-egl-wayland.so.1.1.13
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-gtk2.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-gtk2.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-gtk3.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-gtk3.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-wayland-client.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-wayland-client.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/libnvoptix.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/libnvoptix.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/nvidia/nvoptix.bin
    hostPath: /usr/share/nvidia/nvoptix.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/firmware/nvidia/550.90.07/gsp_ga10x.bin
    hostPath: /lib/firmware/nvidia/550.90.07/gsp_ga10x.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/firmware/nvidia/550.90.07/gsp_tu10x.bin
    hostPath: /lib/firmware/nvidia/550.90.07/gsp_tu10x.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.550.90.07
    hostPath: /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/X11/xorg.conf.d/nvidia-drm-outputclass.conf
    hostPath: /usr/share/X11/xorg.conf.d/nvidia-drm-outputclass.conf
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
    hostPath: /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
    hostPath: /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
    hostPath: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/xorg/modules/drivers/nvidia_drv.so
    hostPath: /usr/lib/xorg/modules/drivers/nvidia_drv.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.550.90.07
    hostPath: /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.550.90.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/dri/card1
    - path: /dev/dri/renderD128
    hooks:
    - args:
      - nvidia-cdi-hook
      - create-symlinks
      - --link
      - ../card1::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD128::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
    - args:
      - nvidia-cdi-hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
  name: "0"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/nvidia-caps/nvidia-cap3
    - path: /dev/nvidia-caps/nvidia-cap4
  name: "0:0"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/dri/card1
    - path: /dev/dri/renderD128
    hooks:
    - args:
      - nvidia-cdi-hook
      - create-symlinks
      - --link
      - ../card1::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD128::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
    - args:
      - nvidia-cdi-hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
  name: GPU-78f59cb0-ac5d-11ef-a7f2-b17bcfe646e9
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/nvidia-caps/nvidia-cap3
    - path: /dev/nvidia-caps/nvidia-cap4
  name: MIG-4c49bdf2-bf64-5519-8acb-d25564f119cd
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/dri/card1
    - path: /dev/dri/renderD128
    - path: /dev/nvidia-caps/nvidia-cap3
    - path: /dev/nvidia-caps/nvidia-cap4
    hooks:
    - args:
      - nvidia-cdi-hook
      - create-symlinks
      - --link
      - ../card1::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD128::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
    - args:
      - nvidia-cdi-hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-cdi-hook
  name: all
kind: nvidia.com/gpu
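
The device entries at the end of this CDI spec are easier to compare once reduced to a name-to-device-node mapping. Below is a rough Python sketch of that reduction. It assumes the spec has exactly the layout shown above (a top-level `devices:` list where each entry ends with a two-space-indented `name:` key); `cdi_device_nodes` is a hypothetical helper name, and a real YAML parser would be more robust than this line-based scan.

```python
def cdi_device_nodes(spec_text):
    """Map each CDI device name to the /dev paths listed under its deviceNodes.

    Line-based scan that assumes the layout shown in the spec above;
    it is not a general YAML parser.
    """
    devices = {}
    nodes = []
    in_devices = False
    for line in spec_text.splitlines():
        if line.startswith("devices:"):
            in_devices = True
            continue
        if not in_devices:
            continue
        if line and line[0] not in " -":
            break  # reached the next top-level key, e.g. "kind:"
        stripped = line.strip()
        if stripped.startswith("- containerEdits:"):
            nodes = []  # a new device entry starts here
        elif stripped.startswith("- path: /dev/"):
            # Only deviceNodes entries are list items of this form;
            # hook "path:" keys carry no leading "- ".
            nodes.append(stripped[len("- path: "):])
        elif line.startswith("  name: "):
            name = line[len("  name: "):].strip().strip('"')
            devices[name] = nodes
    return devices
```

Applied to the spec above, this makes it visible that the `0:0` and `MIG-…` entries carry the `/dev/nvidia-caps/*` nodes, while the `0` and `GPU-…` entries carry the `/dev/dri/*` nodes instead.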

GPU utilization by other processes/applications on the same VM:

sstatus ollama
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/ollama.service.d
             └─override.conf
     Active: active (running) since Wed 2024-12-18 19:11:14 PST; 14h ago
   Main PID: 54092 (ollama)
      Tasks: 15 (limit: 77022)
     Memory: 100.0M
        CPU: 17.779s
     CGroup: /system.slice/ollama.service
             └─54092 /usr/local/bin/ollama serve

Dec 18 20:56:33 cpai.esco.ghaar ollama[54092]: llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
Dec 18 20:56:33 cpai.esco.ghaar ollama[54092]: llama_new_context_with_model:  CUDA_Host  output buffer size =     1.41 MiB
Dec 18 20:56:33 cpai.esco.ghaar ollama[54092]: llama_new_context_with_model:      CUDA0 compute buffer size =  1352.00 MiB
Dec 18 20:56:33 cpai.esco.ghaar ollama[54092]: llama_new_context_with_model:  CUDA_Host compute buffer size =    48.01 MiB
Dec 18 20:56:33 cpai.esco.ghaar ollama[54092]: llama_new_context_with_model: graph nodes  = 1030
Dec 18 20:56:33 cpai.esco.ghaar ollama[54092]: llama_new_context_with_model: graph splits = 2
Dec 18 20:56:33 cpai.esco.ghaar ollama[54092]: time=2024-12-18T20:56:33.874-08:00 level=INFO source=server.go:615 msg="llama runner started in 1.26 seconds"
Dec 18 20:56:48 cpai.esco.ghaar ollama[54092]: [GIN] 2024/12/18 - 20:56:48 | 200 | 16.043727686s |   192.168.100.6 | POST     "/api/chat"
Dec 18 20:56:48 cpai.esco.ghaar ollama[54092]: [GIN] 2024/12/18 - 20:56:48 | 200 |  611.186849ms |   192.168.100.6 | POST     "/v1/chat/completions"
Dec 18 21:11:16 cpai.esco.ghaar ollama[54092]: [GIN] 2024/12/18 - 21:11:16 | 200 |     405.182µs |   192.168.100.6 | GET      "/api/tags"

I've even tried stopping every other application on this VM from using the GPU, i.e. the GPU is dedicated to immich's use only.

immich(ml):
immich-machine-learning logs:

docker compose logs immich-machine-learning
immich_machine_learning  | [12/19/24 04:55:10] INFO     Starting gunicorn 23.0.0
immich_machine_learning  | [12/19/24 04:55:10] INFO     Listening at: http://0.0.0.0:3003 (9)
immich_machine_learning  | [12/19/24 04:55:10] INFO     Using worker: app.config.CustomUvicornWorker
immich_machine_learning  | [12/19/24 04:55:10] INFO     Booting worker with pid: 10
immich_machine_learning  | [12/19/24 04:55:14] INFO     Started server process [10]
immich_machine_learning  | [12/19/24 04:55:14] INFO     Waiting for application startup.
immich_machine_learning  | [12/19/24 04:55:14] INFO     Created in-memory cache with unloading after 300s
immich_machine_learning  |                              of inactivity.
immich_machine_learning  | [12/19/24 04:55:14] INFO     Initialized request thread pool with 8 threads.
immich_machine_learning  | [12/19/24 04:55:14] INFO     Application startup complete.
immich_machine_learning  | [12/19/24 04:55:14] INFO     Loading visual model 'ViT-L-16-SigLIP-256__webli'
immich_machine_learning  |                              to memory
immich_machine_learning  | [12/19/24 04:55:14] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:15.116539945 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:17] INFO     Loading detection model 'antelopev2' to memory
immich_machine_learning  | [12/19/24 04:55:17] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:17.183827391 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:17] INFO     Loading recognition model 'antelopev2' to memory
immich_machine_learning  | [12/19/24 04:55:17] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:17.872335540 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:37] INFO     Starting gunicorn 23.0.0
immich_machine_learning  | [12/19/24 04:55:37] INFO     Listening at: http://0.0.0.0:3003 (9)
immich_machine_learning  | [12/19/24 04:55:37] INFO     Using worker: app.config.CustomUvicornWorker
immich_machine_learning  | [12/19/24 04:55:37] INFO     Booting worker with pid: 10
immich_machine_learning  | [12/19/24 04:55:40] INFO     Started server process [10]
immich_machine_learning  | [12/19/24 04:55:40] INFO     Waiting for application startup.
immich_machine_learning  | [12/19/24 04:55:40] INFO     Created in-memory cache with unloading after 300s
immich_machine_learning  |                              of inactivity.
immich_machine_learning  | [12/19/24 04:55:40] INFO     Initialized request thread pool with 8 threads.
immich_machine_learning  | [12/19/24 04:55:40] INFO     Application startup complete.
immich_machine_learning  | [12/19/24 04:55:41] INFO     Loading visual model 'ViT-L-16-SigLIP-256__webli'
immich_machine_learning  |                              to memory
immich_machine_learning  | [12/19/24 04:55:41] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:42.156008534 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:44] INFO     Loading detection model 'antelopev2' to memory
immich_machine_learning  | [12/19/24 04:55:44] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 04:55:44.286638144 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | [12/19/24 04:55:44] INFO     Loading recognition model 'antelopev2' to memory
immich_machine_learning  | [12/19/24 04:55:44] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:59 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************
immich_machine_learning  | 2024-12-19 04:55:45.045748200 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=66 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | [12/19/24 05:18:11] INFO     Shutting down due to inactivity.
immich_machine_learning  | [12/19/24 05:18:11] INFO     Shutting down
immich_machine_learning  | [12/19/24 05:18:11] INFO     Waiting for application shutdown.
immich_machine_learning  | [12/19/24 05:18:11] INFO     Application shutdown complete.
immich_machine_learning  | [12/19/24 05:18:11] INFO     Finished server process [10]
immich_machine_learning  | [12/19/24 05:18:11] ERROR    Worker (pid:10) was sent SIGINT!
immich_machine_learning  | [12/19/24 05:18:11] INFO     Booting worker with pid: 422
immich_machine_learning  | [12/19/24 05:18:15] INFO     Started server process [422]
immich_machine_learning  | [12/19/24 05:18:15] INFO     Waiting for application startup.
immich_machine_learning  | [12/19/24 05:18:15] INFO     Created in-memory cache with unloading after 300s
immich_machine_learning  |                              of inactivity.
immich_machine_learning  | [12/19/24 05:18:15] INFO     Initialized request thread pool with 8 threads.
immich_machine_learning  | [12/19/24 05:18:15] INFO     Application startup complete.
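
Every failure in the log above carries the same fields (`CUDA failure <code>`, `GPU=`, `hostname=`, `file=`, `line=`, `expr=`), so the log can be condensed mechanically. The following is a minimal Python sketch that pulls those fields out of the single-line form of the message. The function names are illustrative, and the pattern is inferred only from the lines shown here, not from any ONNX Runtime specification.

```python
import re
from collections import Counter

# Pattern inferred from the log lines above; other ONNX Runtime
# versions may format the message differently.
CUDA_FAILURE = re.compile(
    r"CUDA failure (?P<code>\d+): (?P<message>.+?) ; "
    r"GPU=(?P<gpu>-?\d+) ; hostname=(?P<hostname>\S+) ; "
    r"file=(?P<file>\S+) ; line=(?P<line>\d+) ; expr=(?P<expr>.+?);\s*$"
)


def parse_cuda_failure(log_line):
    """Return the failure fields as a dict, or None if the line has none."""
    match = CUDA_FAILURE.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("code", "gpu", "line"):
        fields[key] = int(fields[key])
    return fields


def summarize_failures(log_text):
    """Count occurrences of each (error code, failing expression) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        parsed = parse_cuda_failure(line)
        if parsed is not None:
            counts[(parsed["code"], parsed["expr"])] += 1
    return counts
```

Run over the log above, `summarize_failures` would report a single key, `(100, "cudaGetDeviceCount(&num_devices)")`, which shows that every model load fails at the same device-count call. Both the `CudaCall` lines and the `EP Error` lines match the pattern, so each failed load is counted twice.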
@mertalev
Contributor

Would you be able to test running without Proxmox? I think this is most likely a quirk of GPU passthrough, but I'd like to confirm that without assuming.

@morikplay
Author

morikplay commented Dec 19, 2024

Thank you for the prompt response, @mertalev. Do I understand correctly that you are asking me to run Immich (in Docker) on an OS that has direct control of the GPU? If so, unfortunately, that isn't possible in my environment: these are beefy servers running many other VMs, containers, etc., all of it through the hypervisor :(

Because other applications (containerized or otherwise) can use the passed-through GPU, I presume the GPU configuration is relatively okay.

I do apologize for not including container logs with debug enabled. I can do so by end of today (Pacific time).

@mertalev
Contributor

Would you be able to share the output of nvidia-smi within the container?

@morikplay
Author

Certainly, @mertalev.

From within immich-machine-learning container:

Thu Dec 19 19:31:32 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID A100D-1-20C               On  |   00000000:06:10.0 Off |                   On |
| N/A   N/A    P0             N/A /  N/A  |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

@mertalev
Contributor

Thanks! This looks normal to my eyes. I wonder if this is a bug related to ONNX Runtime. Can you try using the 1.119.1 release-cuda image for the machine learning service? That should have a different version of onnxruntime-gpu that might behave differently.
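
One way to separate an ONNX Runtime problem from a driver or passthrough problem is to ask the CUDA driver for its device count directly, with no ONNX Runtime involved. The sketch below does that through `ctypes` using the driver API calls `cuInit` and `cuDeviceGetCount`. It assumes the driver library is visible inside the container as `libcuda.so.1`; `cuda_device_count` is a hypothetical helper name, not part of Immich.

```python
import ctypes


def cuda_device_count(lib_name="libcuda.so.1"):
    """Query the CUDA driver API for the number of visible devices.

    Returns (status, count). status is None if the library cannot be
    loaded, otherwise the integer CUresult of the last call made
    (0 means success; 100 is CUDA_ERROR_NO_DEVICE).
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None, 0

    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    lib.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    lib.cuDeviceGetCount.restype = ctypes.c_int

    status = lib.cuInit(0)  # must succeed before any other driver call
    if status != 0:
        return status, 0

    count = ctypes.c_int(0)
    status = lib.cuDeviceGetCount(ctypes.byref(count))
    return status, count.value
```

Running `print(cuda_device_count())` with the container's Python interpreter, and again in a container where the GPU is known to work, gives a direct comparison. If the driver itself reports status 100 or a count of 0 in the Immich container, a different `onnxruntime-gpu` version is unlikely to change the outcome.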

@morikplay
Author

morikplay commented Dec 19, 2024

Certainly. With image: ghcr.io/immich-app/immich-machine-learning:v1.119.1-cuda, the same base error appears, along with what I presume are derivative errors:

immich_machine_learning  | [12/19/24 21:42:03] INFO     Attempt #6 to load visual model
immich_machine_learning  |                              'ViT-L-16-SigLIP-256__webli' to memory
immich_machine_learning  | [12/19/24 21:42:04] ERROR    Exception in ASGI application
immich_machine_learning  |
immich_machine_learning  |                              ╭─────── Traceback (most recent call last) ───────╮
immich_machine_learning  |                              │ /opt/venv/lib/python3.11/site-packages/onnxrunt │
immich_machine_learning  |                              │ ime/capi/onnxruntime_inference_collection.py:41 │
immich_machine_learning  |                              │ 9 in __init__                                   │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │    416 │   │   disabled_optimizers = kwargs.get │
immich_machine_learning  |                              │    417 │   │                                    │
immich_machine_learning  |                              │    418 │   │   try:                             │
immich_machine_learning  |                              │ ❱  419 │   │   │   self._create_inference_sessi │
immich_machine_learning  |                              │        disabled_optimizers)                     │
immich_machine_learning  |                              │    420 │   │   except (ValueError, RuntimeError │
immich_machine_learning  |                              │    421 │   │   │   if self._enable_fallback:    │
immich_machine_learning  |                              │    422 │   │   │   │   try:                     │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /opt/venv/lib/python3.11/site-packages/onnxrunt │
immich_machine_learning  |                              │ ime/capi/onnxruntime_inference_collection.py:48 │
immich_machine_learning  |                              │ 3 in _create_inference_session                  │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │    480 │   │   │   disabled_optimizers = set(di │
immich_machine_learning  |                              │    481 │   │                                    │
immich_machine_learning  |                              │    482 │   │   # initialize the C++ InferenceSe │
immich_machine_learning  |                              │ ❱  483 │   │   sess.initialize_session(provider │
immich_machine_learning  |                              │    484 │   │                                    │
immich_machine_learning  |                              │    485 │   │   self._sess = sess                │
immich_machine_learning  |                              │    486 │   │   self._sess_options = self._sess. │
immich_machine_learning  |                              ╰─────────────────────────────────────────────────╯
immich_machine_learning  |                              RuntimeError:
immich_machine_learning  |                              /onnxruntime_src/onnxruntime/core/providers/cuda/cu
immich_machine_learning  |                              da_execution_provider_info.cc:58 static
immich_machine_learning  |                              onnxruntime::CUDAExecutionProviderInfo
immich_machine_learning  |                              onnxruntime::CUDAExecutionProviderInfo::FromProvide
immich_machine_learning  |                              rOptions(const onnxruntime::ProviderOptions&)
immich_machine_learning  |                              [ONNXRuntimeError] : 1 : FAIL :
immich_machine_learning  |                              provider_options_utils.h:151 Parse Failed to parse
immich_machine_learning  |                              provider option "device_id": CUDA failure 100: no
immich_machine_learning  |                              CUDA-capable device is detected ; GPU=0 ;
immich_machine_learning  |                              hostname=cpai.esco.ghaar ;
immich_machine_learning  |                              file=/onnxruntime_src/onnxruntime/core/providers/cu
immich_machine_learning  |                              da/cuda_execution_provider_info.cc ; line=65 ;
immich_machine_learning  |                              expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |
immich_machine_learning  |
immich_machine_learning  |                              The above exception was the direct cause of the
immich_machine_learning  |                              following exception:
immich_machine_learning  |
immich_machine_learning  |                              ╭─────── Traceback (most recent call last) ───────╮
immich_machine_learning  |                              │ /usr/src/app/main.py:150 in predict             │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │   147 │   │   inputs = text                     │
immich_machine_learning  |                              │   148 │   else:                                 │
immich_machine_learning  |                              │   149 │   │   raise HTTPException(400, "Either  │
immich_machine_learning  |                              │ ❱ 150 │   response = await run_inference(inputs │
immich_machine_learning  |                              │   151 │   return ORJSONResponse(response)       │
immich_machine_learning  |                              │   152                                           │
immich_machine_learning  |                              │   153                                           │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/main.py:173 in run_inference       │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │   170 │   │   response[entry["task"]] = output  │
immich_machine_learning  |                              │   171 │                                         │
immich_machine_learning  |                              │   172 │   without_deps, with_deps = entries     │
immich_machine_learning  |                              │ ❱ 173 │   await asyncio.gather(*[_run_inference │
immich_machine_learning  |                              │   174 │   if with_deps:                         │
immich_machine_learning  |                              │   175 │   │   await asyncio.gather(*[_run_infer │
immich_machine_learning  |                              │   176 │   if isinstance(payload, Image):        │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/main.py:167 in _run_inference      │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │   164 │   │   │   except KeyError:              │
immich_machine_learning  |                              │   165 │   │   │   │   message = f"Task {entry[' │
immich_machine_learning  |                              │       output of {dep}"                          │
immich_machine_learning  |                              │   166 │   │   │   │   raise HTTPException(400,  │
immich_machine_learning  |                              │ ❱ 167 │   │   model = await load(model)         │
immich_machine_learning  |                              │   168 │   │   output = await run(model.predict, │
immich_machine_learning  |                              │   169 │   │   outputs[model.identity] = output  │
immich_machine_learning  |                              │   170 │   │   response[entry["task"]] = output  │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/main.py:211 in load                │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │   208 │   │   return model                      │
immich_machine_learning  |                              │   209 │                                         │
immich_machine_learning  |                              │   210 │   try:                                  │
immich_machine_learning  |                              │ ❱ 211 │   │   return await run(_load, model)    │
immich_machine_learning  |                              │   212 │   except (OSError, InvalidProtobuf, Bad │
immich_machine_learning  |                              │   213 │   │   log.warning(f"Failed to load {mod │
immich_machine_learning  |                              │       '{model.model_name}'. Clearing cache.")   │
immich_machine_learning  |                              │   214 │   │   model.clear_cache()               │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/main.py:186 in run                 │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │   183 │   if thread_pool is None:               │
immich_machine_learning  |                              │   184 │   │   return func(*args, **kwargs)      │
immich_machine_learning  |                              │   185 │   partial_func = partial(func, *args, * │
immich_machine_learning  |                              │ ❱ 186 │   return await asyncio.get_running_loop │
immich_machine_learning  |                              │   187                                           │
immich_machine_learning  |                              │   188                                           │
immich_machine_learning  |                              │   189 async def load(model: InferenceModel) ->  │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/local/lib/python3.11/concurrent/futures/th │
immich_machine_learning  |                              │ read.py:58 in run                               │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/main.py:198 in _load               │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │   195 │   │   │   raise HTTPException(500, f"Fa │
immich_machine_learning  |                              │   196 │   │   with lock:                        │
immich_machine_learning  |                              │   197 │   │   │   try:                          │
immich_machine_learning  |                              │ ❱ 198 │   │   │   │   model.load()              │
immich_machine_learning  |                              │   199 │   │   │   except FileNotFoundError as e │
immich_machine_learning  |                              │   200 │   │   │   │   if model.model_format ==  │
immich_machine_learning  |                              │   201 │   │   │   │   │   raise e               │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/models/base.py:53 in load          │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │    50 │   │   self.download()                   │
immich_machine_learning  |                              │    51 │   │   attempt = f"Attempt #{self.load_a │
immich_machine_learning  |                              │       else "Loading"                            │
immich_machine_learning  |                              │    52 │   │   log.info(f"{attempt} {self.model_ │
immich_machine_learning  |                              │       '{self.model_name}' to memory")           │
immich_machine_learning  |                              │ ❱  53 │   │   self.session = self._load()       │
immich_machine_learning  |                              │    54 │   │   self.loaded = True                │
immich_machine_learning  |                              │    55 │                                         │
immich_machine_learning  |                              │    56 │   def predict(self, *inputs: Any, **mod │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/models/facial_recognition/detectio │
immich_machine_learning  |                              │ n.py:21 in _load                                │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │   18 │   │   super().__init__(model_name, **mod │
immich_machine_learning  |                              │   19 │                                          │
immich_machine_learning  |                              │   20 │   def _load(self) -> ModelSession:       │
immich_machine_learning  |                              │ ❱ 21 │   │   session = self._make_session(self. │
immich_machine_learning  |                              │   22 │   │   self.model = RetinaFace(session=se │
immich_machine_learning  |                              │   23 │   │   self.model.prepare(ctx_id=0, det_t │
immich_machine_learning  |                              │   24                                            │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/models/base.py:110 in              │
immich_machine_learning  |                              │ _make_session                                   │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │   107 │   │   │   case ".armnn":                │
immich_machine_learning  |                              │   108 │   │   │   │   session: ModelSession = A │
immich_machine_learning  |                              │   109 │   │   │   case ".onnx":                 │
immich_machine_learning  |                              │ ❱ 110 │   │   │   │   session = OrtSession(mode │
immich_machine_learning  |                              │   111 │   │   │   case _:                       │
immich_machine_learning  |                              │   112 │   │   │   │   raise ValueError(f"Unsupp │
immich_machine_learning  |                              │   113 │   │   return session                    │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /usr/src/app/sessions/ort.py:28 in __init__     │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │    25 │   │   self.providers = providers if pro │
immich_machine_learning  |                              │    26 │   │   self.provider_options = provider_ │
immich_machine_learning  |                              │       self._provider_options_default            │
immich_machine_learning  |                              │    27 │   │   self.sess_options = sess_options  │
immich_machine_learning  |                              │       self._sess_options_default                │
immich_machine_learning  |                              │ ❱  28 │   │   self.session = ort.InferenceSessi │
immich_machine_learning  |                              │    29 │   │   │   self.model_path.as_posix(),   │
immich_machine_learning  |                              │    30 │   │   │   providers=self.providers,     │
immich_machine_learning  |                              │    31 │   │   │   provider_options=self.provide │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /opt/venv/lib/python3.11/site-packages/onnxrunt │
immich_machine_learning  |                              │ ime/capi/onnxruntime_inference_collection.py:43 │
immich_machine_learning  |                              │ 2 in __init__                                   │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │    429 │   │   │   │   │   self.disable_fallbac │
immich_machine_learning  |                              │    430 │   │   │   │   │   return               │
immich_machine_learning  |                              │    431 │   │   │   │   except Exception as fall │
immich_machine_learning  |                              │ ❱  432 │   │   │   │   │   raise fallback_error │
immich_machine_learning  |                              │    433 │   │   │   # Fallback is disabled. Rais │
immich_machine_learning  |                              │    434 │   │   │   raise e                      │
immich_machine_learning  |                              │    435                                          │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /opt/venv/lib/python3.11/site-packages/onnxrunt │
immich_machine_learning  |                              │ ime/capi/onnxruntime_inference_collection.py:42 │
immich_machine_learning  |                              │ 7 in __init__                                   │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │    424 │   │   │   │   │   print(f"EP Error {e} │
immich_machine_learning  |                              │    425 │   │   │   │   │   print(f"Falling back │
immich_machine_learning  |                              │    426 │   │   │   │   │   print("************* │
immich_machine_learning  |                              │ ❱  427 │   │   │   │   │   self._create_inferen │
immich_machine_learning  |                              │    428 │   │   │   │   │   # Fallback only once │
immich_machine_learning  |                              │    429 │   │   │   │   │   self.disable_fallbac │
immich_machine_learning  |                              │    430 │   │   │   │   │   return               │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │ /opt/venv/lib/python3.11/site-packages/onnxrunt │
immich_machine_learning  |                              │ ime/capi/onnxruntime_inference_collection.py:48 │
immich_machine_learning  |                              │ 3 in _create_inference_session                  │
immich_machine_learning  |                              │                                                 │
immich_machine_learning  |                              │    480 │   │   │   disabled_optimizers = set(di │
immich_machine_learning  |                              │    481 │   │                                    │
immich_machine_learning  |                              │    482 │   │   # initialize the C++ InferenceSe │
immich_machine_learning  |                              │ ❱  483 │   │   sess.initialize_session(provider │
immich_machine_learning  |                              │    484 │   │                                    │
immich_machine_learning  |                              │    485 │   │   self._sess = sess                │
immich_machine_learning  |                              │    486 │   │   self._sess_options = self._sess. │
immich_machine_learning  |                              ╰─────────────────────────────────────────────────╯
immich_machine_learning  |                              RuntimeError:
immich_machine_learning  |                              /onnxruntime_src/onnxruntime/core/providers/cuda/cu
immich_machine_learning  |                              da_call.cc:123 std::conditional_t<THRW, void,
immich_machine_learning  |                              onnxruntime::common::Status>
immich_machine_learning  |                              onnxruntime::CudaCall(ERRTYPE, const char*, const
immich_machine_learning  |                              char*, ERRTYPE, const char*, const char*, int)
immich_machine_learning  |                              [with ERRTYPE = cudaError; bool THRW = true;
immich_machine_learning  |                              std::conditional_t<THRW, void, common::Status> =
immich_machine_learning  |                              void]
immich_machine_learning  |                              /onnxruntime_src/onnxruntime/core/providers/cuda/cu
immich_machine_learning  |                              da_call.cc:116 std::conditional_t<THRW, void,
immich_machine_learning  |                              onnxruntime::common::Status>
immich_machine_learning  |                              onnxruntime::CudaCall(ERRTYPE, const char*, const
immich_machine_learning  |                              char*, ERRTYPE, const char*, const char*, int)
immich_machine_learning  |                              [with ERRTYPE = cudaError; bool THRW = true;
immich_machine_learning  |                              std::conditional_t<THRW, void, common::Status> =
immich_machine_learning  |                              void] CUDA failure 100: no CUDA-capable device is
immich_machine_learning  |                              detected ; GPU=32743 ; hostname=cpai.esco.ghaar ;
immich_machine_learning  |                              file=/onnxruntime_src/onnxruntime/core/providers/cu
immich_machine_learning  |                              da/cuda_execution_provider.cc ; line=280 ;
immich_machine_learning  |                              expr=cudaSetDevice(info_.device_id);
immich_machine_learning  |
immich_machine_learning  |
immich_machine_learning  | [12/19/24 21:42:04] INFO     Setting execution providers to
immich_machine_learning  |                              ['CUDAExecutionProvider', 'CPUExecutionProvider'],
immich_machine_learning  |                              in descending order of preference
immich_machine_learning  | 2024-12-19 21:42:05.556390033 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 100: no CUDA-capable device is detected ; GPU=0 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=65 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  | *************** EP Error ***************
immich_machine_learning  | EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc:58 static onnxruntime::CUDAExecutionProviderInfo onnxruntime::CUDAExecutionProviderInfo::FromProviderOptions(const onnxruntime::ProviderOptions&) [ONNXRuntimeError] : 1 : FAIL : provider_options_utils.h:151 Parse Failed to parse provider option "device_id": CUDA failure 100: no CUDA-capable device is detected ; GPU=0 ; hostname=cpai.esco.ghaar ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc ; line=65 ; expr=cudaGetDeviceCount(&num_devices);
immich_machine_learning  |  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
immich_machine_learning  | Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
immich_machine_learning  | ****************************************

@mertalev
Contributor

Hmm, gotcha. Two other things you can try:

  1. Run with privileged mode
  2. Maybe the device ID isn't supposed to be 0? I see 6 for the GI ID. nvidia-smi -L should give a more detailed output that might help.

@morikplay
Author

morikplay commented Dec 20, 2024

Thank you, @mertalev for continuing the troubleshooting.

Run with privileged mode

Running with privileged: true on both v1.119.1 and release-cuda (v1.123) results in the same erroneous outcome.

Maybe the device ID isn't supposed to be 0? I see 6 for the GI ID. nvidia-smi -L should give a more detailed output that might help.

The instance ID on the host should be non-zero. With passthrough, it can be 0 inside the VM.
VM device ID:

GPU 0: GRID A100D-1-20C (UUID: GPU-33064dd6-bda3-11ef-a783-1394f45ad91f)
  MIG 1g.20gb     Device  0: (UUID: MIG-cbdf113e-8198-55f2-98e4-9bae5dcfddf6)

GPU deviceID inside immich container:

GPU 0: GRID A100D-1-20C (UUID: GPU-33064dd6-bda3-11ef-a783-1394f45ad91f)
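
For anyone comparing the two `nvidia-smi -L` listings above, here is a minimal sketch that parses both and reports MIG device UUIDs visible in the VM but absent in the container. The function names and regular expressions are my own and only cover the two line shapes pasted in this thread. Whether a missing MIG line actually explains the CUDA failure is an assumption, not something confirmed here.

```python
import re

# Patterns for the two line shapes seen in the `nvidia-smi -L` outputs above.
# They are illustrative and may not cover every driver version's format.
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-f-]+)\)$")
MIG_LINE = re.compile(r"^\s+MIG (\S+)\s+Device\s+(\d+): \(UUID: (MIG-[0-9a-f-]+)\)$")


def parse_nvidia_smi_list(text):
    """Return a list of GPUs, each with the MIG devices listed beneath it."""
    gpus = []
    for raw in text.splitlines():
        line = raw.rstrip()
        gpu = GPU_LINE.match(line)
        if gpu:
            gpus.append({"index": int(gpu.group(1)), "name": gpu.group(2),
                         "uuid": gpu.group(3), "mig": []})
            continue
        mig = MIG_LINE.match(line)
        if mig and gpus:
            # A MIG line belongs to the most recently seen GPU line.
            gpus[-1]["mig"].append({"profile": mig.group(1),
                                    "index": int(mig.group(2)),
                                    "uuid": mig.group(3)})
    return gpus


def missing_mig_devices(vm_text, container_text):
    """Return sorted MIG UUIDs that appear in the VM but not in the container."""
    def mig_uuids(text):
        return {m["uuid"] for g in parse_nvidia_smi_list(text) for m in g["mig"]}
    return sorted(mig_uuids(vm_text) - mig_uuids(container_text))


# The two listings pasted in this comment.
VM_OUTPUT = """\
GPU 0: GRID A100D-1-20C (UUID: GPU-33064dd6-bda3-11ef-a783-1394f45ad91f)
  MIG 1g.20gb     Device  0: (UUID: MIG-cbdf113e-8198-55f2-98e4-9bae5dcfddf6)
"""
CONTAINER_OUTPUT = """\
GPU 0: GRID A100D-1-20C (UUID: GPU-33064dd6-bda3-11ef-a783-1394f45ad91f)
"""

print(missing_mig_devices(VM_OUTPUT, CONTAINER_OUTPUT))
# → ['MIG-cbdf113e-8198-55f2-98e4-9bae5dcfddf6']
```

If the container listing above is complete, the MIG instance shows up in the VM but not inside the container, which may be worth mentioning in any upstream report.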

/dev from inside immich container:

root@cpai:/dev# l
autofs           dma_heap/  hwrng         loop4    null              ppp     sda2      stdin@   tty14  tty22  tty30  tty39  tty47  tty55  tty63   ttyS13  ttyS21  ttyS3   ttyprintk  vcs3   vcsa5  vfio/
bsg/             dri/       i2c-0         loop5    nvidia-caps/      psaux   sda3      stdout@  tty15  tty23  tty31  tty4   tty48  tty56  tty7    ttyS14  ttyS22  ttyS30  udmabuf    vcs4   vcsa6  vga_arbiter
btrfs-control    ecryptfs   input/        loop6    nvidia-modeset    ptmx@   sg0       tty      tty16  tty24  tty32  tty40  tty49  tty57  tty8    ttyS15  ttyS23  ttyS31  uhid       vcs5   vcsu   vhci
bus/             fb0        kmsg          loop7    nvidia-uvm        pts/    sg1       tty0     tty17  tty25  tty33  tty41  tty5   tty58  tty9    ttyS16  ttyS24  ttyS4   uinput     vcs6   vcsu1  vhost-net
core@            fd@        loop-control  mapper/  nvidia-uvm-tools  random  shm/      tty1     tty18  tty26  tty34  tty42  tty50  tty59  ttyS0   ttyS17  ttyS25  ttyS5   urandom    vcsa   vcsu2  vhost-vsock
cpu/             full       loop0         mcelog   nvidia0           rfkill  snapshot  tty10    tty19  tty27  tty35  tty43  tty51  tty6   ttyS1   ttyS18  ttyS26  ttyS6   userio     vcsa1  vcsu3  vport3p1
cpu_dma_latency  fuse       loop1         mem      nvidiactl         rtc0    snd/      tty11    tty2   tty28  tty36  tty44  tty52  tty60  ttyS10  ttyS19  ttyS27  ttyS7   vcs        vcsa2  vcsu4  zero
cuse             hidraw0    loop2         mqueue/  nvram             sda     sr0       tty12    tty20  tty29  tty37  tty45  tty53  tty61  ttyS11  ttyS2   ttyS28  ttyS8   vcs1       vcsa3  vcsu5  zfs
dm-0             hpet       loop3         net/     port              sda1    stderr@   tty13    tty21  tty3   tty38  tty46  tty54  tty62  ttyS12  ttyS20  ttyS29  ttyS9   vcs2       vcsa4  vcsu6
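
The `/dev` listing above is hard to scan by eye, so here is a small sketch that reports which NVIDIA device nodes exist and what sits under `nvidia-caps/`. The list of expected node names is my assumption about what a CUDA workload typically needs, not something stated in this thread, and `nvidia_dev_report` is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: node names a CUDA workload usually expects to find. This list is
# illustrative and not taken from the thread.
EXPECTED_NODES = ("nvidiactl", "nvidia0", "nvidia-uvm", "nvidia-uvm-tools")


def nvidia_dev_report(dev_root="/dev"):
    """Report which expected NVIDIA nodes exist and list nvidia-caps entries."""
    root = Path(dev_root)
    present = {name: (root / name).exists() for name in EXPECTED_NODES}
    caps_dir = root / "nvidia-caps"
    # With MIG, per-instance capability nodes are normally exposed under
    # nvidia-caps; an empty list here may be worth a closer look.
    caps = sorted(p.name for p in caps_dir.iterdir()) if caps_dir.is_dir() else []
    return {"nodes": present, "caps": caps}


if __name__ == "__main__":
    print(nvidia_dev_report())
```

Running it both in the VM and inside the container and comparing the two results would show whether the container is missing any capability node that the VM has.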

@mertalev
Contributor

Sorry, I’m not sure what else to suggest. I’m pretty confident the issue is not that the GPU is an A100 or that you’re using Docker, and somewhat confident that it relates to GPU passthrough. It could potentially be a driver issue or a quirk with MIG mode. All this is to say I don’t know if this is really an Immich issue.

@morikplay
Author

morikplay commented Dec 22, 2024

Thank you @mertalev. Your conclusion was my starting hypothesis :) However, because other Docker containers can use the GPU, I had to revise that assumption: the GPU driver sharing chain appears to be working as expected :) It may be a confluence of the way nvidia's closed drivers work together with onnx code. I do appreciate the sound troubleshooting advice you were kind enough to lend me 🙏🏽

@mertalev
Contributor

It may be a confluence of the way nvidia's closed drivers work together with onnx code.

Yes, this would be my guess. I'd suggest checking the ONNX Runtime issues and maybe making one if there's nothing related. But you'll need to be prepared to narrow things down more, specifically whether disabling MIG mode changes anything or whether direct GPU usage without a VM works. They probably can't help if you're not sure what the bug is.
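
One way to narrow this down without ONNX Runtime in the picture is to call the CUDA driver API directly from inside the container. The sketch below uses `ctypes` to load `libcuda.so.1` and call `cuInit` and `cuDeviceGetCount`. Note this is the driver API, not the runtime call `cudaGetDeviceCount` that appears in the log, so it is a related probe rather than an exact reproduction. The library name and the helper `probe_cuda_driver` are assumptions for illustration.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit and cuDeviceGetCount directly through the CUDA driver API."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The driver library is not visible in this environment at all.
        return {"loaded": False, "error": str(exc)}

    # cuInit(0) must succeed before any other driver API call.
    init_rc = lib.cuInit(ctypes.c_uint(0))
    if init_rc != 0:
        # A return code of 100 is CUDA_ERROR_NO_DEVICE.
        return {"loaded": True, "cuInit": init_rc,
                "cuDeviceGetCount": None, "count": None}

    count = ctypes.c_int(0)
    count_rc = lib.cuDeviceGetCount(ctypes.byref(count))
    return {"loaded": True, "cuInit": init_rc,
            "cuDeviceGetCount": count_rc, "count": count.value}


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this reports a non-zero `cuInit` code inside the Immich container but succeeds in a container where the GPU works, the difference is probably in the container setup rather than in ONNX Runtime. If it succeeds in both, that points more toward ONNX Runtime or the CUDA runtime libraries bundled in the image.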
