Gigahorse-1.1.8-9d66b86: GPU address limit 1TB/40bit problem: instance of 'std::runtime_error, signal 6, swiotlb buffer is full, NVRM: Failed to create a DMA mapping! #28
Comments
Interesting problem, I don't think I can fix this via code though...
With the BIOS option "1TB Memory Cap" the problems are resolved, K34 works fine.
The 40-bit limit is also mentioned here:
"Linear memory is allocated in a single unified address space, which means that separately allocated entities can reference one another via pointers, for example, in a binary tree or linked list. The size of the address space depends on the host system (CPU) and the compute capability of the used GPU:
Note: On devices of compute capability 5.3 (Maxwell) and earlier, the CUDA driver creates an uncommitted 40-bit virtual address reservation to ensure that memory allocations (pointers) fall into the supported range. This reservation appears as reserved virtual memory, but does not occupy any physical memory until the program actually allocates memory."
The workaround, the BIOS option "1TB Memory Cap", reduces my RAM from 1.5TB to 1TB, so I lose 512GB.
While it works with the IOMMU enabled, performance is massively degraded for this use case. For a 3060 Ti I get these numbers (1st plot omitted)
Interesting to know.
Hi,
on a machine with 1.5TB of RAM, K34 plots do not work (GPU: Quadro M6000).
I believe this is due to the GPU's 40-bit address limit.
Here it is mentioned:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/iommu-dma-remapping
" This page describes the IOMMU DMA remapping feature that was introduced in Windows 11 22H2 (WDDM 3.0).
...
Upcoming servers and high end workstations can be configured with over 1TB of memory which crosses the common 40-bit address space limitation of many GPUs."
So it seems that while Windows 11 22H2 can handle it, on Linux it can be a problem (kernel 4.18.0-425.10.1.el8_7).
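To make the arithmetic behind the 40-bit limit concrete, here is a minimal sketch. It assumes binary units (1TB treated as 2^40 bytes) and a contiguous physical address map, which is a simplification: real systems have holes in the map, so the highest RAM address can be higher than the installed size. The function names are made up for illustration.

```python
GIB = 2**30
TIB = 2**40


def addressable_bytes(address_bits: int) -> int:
    """Number of bytes reachable with the given number of address bits."""
    return 2**address_bits


def bytes_beyond_limit(ram_bytes: int, address_bits: int = 40) -> int:
    """RAM that lies above the address limit, assuming a contiguous map (simplification)."""
    return max(0, ram_bytes - addressable_bytes(address_bits))


ram = 1536 * GIB  # the 1.5TB machine from this report
print(addressable_bytes(40) // GIB)    # GiB reachable with 40 bits
print(bytes_beyond_limit(ram) // GIB)  # GiB above the 40-bit limit
```

Under these assumptions the 40-bit range covers exactly 1TB, so about a third of the RAM on this machine is out of direct reach for the GPU.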
Also note that increasing the swiotlb size only delays the termination until the 2nd plot, and even the 1st plot
might be corrupt, as there are ten thousand (!) such (and other) messages:
Furthermore these messages also appear for K32 after some successful plots when running "-n -1",
so while that did not terminate it might also produce corrupt plots.
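To check whether a run that did not terminate still hit these errors, the kernel log can be scanned for them. The following is a rough sketch only: the two patterns are taken from the messages in the issue title, the function and key names are made up, and the exact log wording may differ between kernel and driver versions.

```python
import re
from collections import Counter

# Patterns based on the messages quoted in the issue title (exact wording may vary).
PATTERNS = {
    "swiotlb_full": re.compile(r"swiotlb buffer is full"),
    "nvrm_dma_map_failed": re.compile(r"NVRM: Failed to create a DMA mapping"),
}


def count_dma_errors(log_text: str) -> Counter:
    """Count log lines that match each DMA-related error pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Illustrative input only, not a real log excerpt.
sample = "swiotlb buffer is full\nNVRM: Failed to create a DMA mapping!\n"
print(count_dma_errors(sample))
```

In practice the input would be the saved output of "dmesg -T" or the contents of /var/log/messages. A non-zero count after a plot finished would be a reason to verify that plot before trusting it.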
This workstation has a BIOS option "1TB Memory Cap":
I will try that next.
Logs
This can be seen with "dmesg -T" or /var/log/messages:
With some experimentation I also got this failure mode:
Related documentation:
https://lenovopress.lenovo.com/lp1467.pdf
An Introduction to IOMMU Infrastructure in the Linux Kernel