NVIDIA A100 GPU + CUDA not being recognized for immich-machine-learning on docker #14808
Comments
Would you be able to test running without Proxmox? I think this is most likely a quirk of GPU passthrough, but I'd like to confirm that without assuming. |
Thank you for the prompt response, @mertalev. Do I understand your suggestion correctly in that the request is to run Immich (in docker) on an OS which has direct control of the GPU? If so, unfortunately, that wouldn't be possible in my environment as these are beefy servers running many other VMs, containers etc - all of which is via hypervisor :( Because other applications (containerized or otherwise) can use the pass through GPU I presume that GPU configuration is relatively okay. I do apologize for not including container logs with debug enabled. I can do so by end of today (Pacific time). |
Would you be able to share the output of |
Certainly, @mertalev. From within immich-machine-learning container:
|
Thanks! This looks normal to my eyes. I wonder if this is a bug related to ONNX Runtime. Can you try using the 1.119.1 release-cuda image for the machine learning service? That should have a different version of onnxruntime-gpu that might behave differently. |
Certainly. When trying
|
Hmm, gotcha. Two other things you can try:
|
Thank you, @mertalev for continuing the troubleshooting.
Running with
instanceID on host should have a non-zero value. For passthrough, inside the VM it can have value 0.
GPU deviceID inside immich container:
/dev from inside immich container:
|
Sorry, I’m not sure what else to suggest. I’m pretty confident the issue is not that the GPU is an A100 or that you’re using Docker, and somewhat confident that it relates to GPU passthrough. It could potentially be a driver issue or a quirk with MIG mode. All this is to say I don’t know if this is really an Immich issue. |
Thank you @mertalev. Your conclusion was my starting hypothesis :) However, because other Docker containers can use the GPU, I had to revise it and now assume that the GPU driver sharing chain is working as expected :) It may be a confluence of the way NVIDIA's closed drivers work together with the ONNX Runtime code. I do appreciate the sound troubleshooting advice you were kind enough to lend me 🙏🏽 |
Yes, this would be my guess. I'd suggest checking the ONNX Runtime issues and maybe making one if there's nothing related. But you'll need to be prepared to narrow things down more, specifically whether disabling MIG mode changes anything or whether direct GPU usage without a VM works. They probably can't help if you're not sure what the bug is. |
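One way to start narrowing this down, before filing an ONNX Runtime issue, is to ask ONNX Runtime directly what it sees from inside the machine-learning container. The sketch below is not part of Immich; `ort_cuda_report` is a hypothetical helper name, and it assumes only the public `onnxruntime.get_device()` and `onnxruntime.get_available_providers()` calls. It reports whether the package imports, which device the build targets, and whether `CUDAExecutionProvider` is listed. Note that a listed provider only means the build supports CUDA; it does not prove a session can actually be created on the GPU.

```python
import importlib.util


def ort_cuda_report():
    """Summarize what ONNX Runtime reports about GPU support.

    Hypothetical helper for troubleshooting; not part of Immich.
    """
    report = {
        "installed": False,
        "device": None,
        "providers": [],
        "cuda_provider": False,
    }
    # Skip the import entirely if the package is not present.
    if importlib.util.find_spec("onnxruntime") is None:
        return report
    try:
        import onnxruntime as ort
    except ImportError:
        # Package is present but failed to load (e.g. missing shared libraries).
        return report

    providers = list(ort.get_available_providers())
    report["installed"] = True
    report["device"] = ort.get_device()  # "GPU" for onnxruntime-gpu builds
    report["providers"] = providers
    report["cuda_provider"] = "CUDAExecutionProvider" in providers
    return report


if __name__ == "__main__":
    for key, value in ort_cuda_report().items():
        print(f"{key}: {value}")
```

Running this once inside the VM's container and, if possible, once on a machine with direct GPU access or with MIG mode disabled would give two reports to compare when describing the problem upstream.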
The bug
NVIDIA A100 80GB running in MIG mode, presented as a 20GB vGPU by the Proxmox PVE 8.3.2 hypervisor and passed through to an Ubuntu 22.04 VM. Because it is an NVAIE 3.x licensed product, only NVIDIA's closed-source drivers are supported for this GPU.
Immich runs on this VM as a Docker container, with nvidia-container-toolkit enabled. Host and VM drivers are compatible (not an exact sub-version match, but that's the intended bundled drivers configuration). When starting Immich with CUDA support, the ML container is unable to find the GPU. Other applications, e.g. Ollama or codeproject.ai, are able to recognize and work with the GPU inside the VM and a Docker container.
Output of get_device() indicates the GPU is visible to the container:
The OS that Immich Server is running on
Ubuntu 22.04 on Proxmox PVE 8.3.2
Version of Immich Server
v1.123.0
Version of Immich Mobile App
v1.123.0 build.186
Platform with the issue
Your docker-compose.yml content
Your .env content
Reproduction steps
nvidia-smi configuration on the VM was verified.
docker compose pull && docker compose up -d && docker compose logs -f
Relevant log output
Additional information
Host
nvidia-smi:
VM's QEMU passthrough config:
VM
nvidia-smi:
nvcc:
deviceQuery:
nvidia-container-toolkit:
docker daemon:
OCI /etc/cdi/nvidia.yaml:
GPU utilization by other processes/applications on the same VM:
I've even tried stopping any other application on this VM from using GPU i.e. GPU is dedicated to immich's use only.
immich(ml):
immich-machine-learning logs: