NVIDIA Jetson AGX, singularity exec --nv Could not find any nv files on this host! #2805

Closed
vlk-jan opened this issue Apr 5, 2024 · 4 comments
Labels: bug (Something isn't working)

vlk-jan commented Apr 5, 2024

Version of Singularity

$ singularity --version
singularity-ce version 3.8.0

Describe the bug
When running a Singularity image on an NVIDIA Jetson AGX, Singularity reports that it cannot find any nv files on the host.

To Reproduce
Steps to reproduce the behavior:
We use the Singularity image from https://github.com/vras-robotour/deploy on an NVIDIA Jetson.
Run the following command in the deploy directory:

$ ./scripts/start_singularity.sh --nv

=========== STARTING SINGULARITY CONTAINER ============

INFO: Singularity is already installed.
INFO: Updating repository to the latest version.
Already up to date.
INFO: Mounting /snap directory.
INFO: Starting Singularity container from image robotour_arm64.simg.
INFO: Could not find any nv files on this host!
INFO: The catkin workspace is already initialized.

================== UPDATING PACKAGES ==================

INFO: Updating the package naex to the latest version.
Already up to date.
INFO: Updating the package robotour to the latest version.
Already up to date.
INFO: Updating the package map_data to the latest version.
Already up to date.
INFO: Updating the package test_package to the latest version.
Already up to date.

=======================================================

INFO: Starting interactive bash while sourcing the workspace.

Expected behavior
The nv files are found and we are able to use PyTorch with CUDA inside the container.

OS / Linux Distribution

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Installation Method
Installed using the steps detailed here: https://docs.sylabs.io/guides/3.8/admin-guide/installation.html.

Additional context
We have nvidia-container-cli installed:

$ nvidia-container-cli --version
version: 0.9.0+beta1
build date: 2019-06-24T22:00+00:00
build revision: 77c1cbc2f6595c59beda3699ebb9d49a0a8af426
build compiler: aarch64-linux-gnu-gcc-7 7.4.0
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -g3 -D JETSON=TRUE -DNDEBUG -std=gnu11 -O0 -g3 -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ nvidia-container-cli list --binaries --libraries
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.440.18
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-fatbinaryloader.so.440.18
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-eglcore.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-glcore.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-tls.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-glsi.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libGLX_nvidia.so.0
/usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0
/usr/lib/aarch64-linux-gnu/tegra-egl/libGLESv2_nvidia.so.2
/usr/lib/aarch64-linux-gnu/tegra-egl/libGLESv1_CM_nvidia.so.1
$ nvidia-container-cli list --ipcs

The strace output of ./scripts/start_singularity.sh is available here.

vlk-jan added the bug label Apr 5, 2024

tri-adam (Member) commented Apr 5, 2024

Hi @vlk-jan, thanks for the report. On the surface of it, this looks similar to #1850. As noted there, the NVIDIA Container CLI is no longer used on Tegra-based systems. There is some hope that the new --oci mode introduced in Singularity 4.x might help with this, but it has not been confirmed. If you're able to give that a go and report back, it would be appreciated. Thanks!
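
For reference, a rough sketch of how such a test might look (untested on our side; the image name is taken from your report, and the image may need to be rebuilt or converted before --oci mode will accept it):

$ singularity --version
# should report a 4.x version for --oci support
$ singularity exec --oci --nv robotour_arm64.simg \
    python3 -c "import torch; print(torch.cuda.is_available())"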

vlk-jan (Author) commented Apr 5, 2024

Hi, thanks for your swift reply.

I do have some updates.

Similarity to previous issue
I agree that it seems similar to #1850; however, I believe the problem there was that no libraries were exported at all. In our case some are exported, since we are using quite an old version of nvidia-container-cli. That is why I opened a new issue instead of commenting on the old one.
When trying to reproduce the problem on a Jetson Orin (as opposed to the Jetson Xavier where it was first encountered), we also saw that no libraries were provided (with a fresh install of the nvidia-container package).

Odd behavior in binding nv libraries
After some more digging, I found that although the script prints "Could not find any nv files on this host!", all of the libraries listed by nvidia-container-cli list --binaries --libraries are bound into the /.singularity.d/libs/ directory, which seems odd.
The log from an execution with the -v and -d flags is here. Line 17 shows the "could not find" message, and lines 136-149 show that the libraries are added and later mounted.
PyTorch inside the container still does not have CUDA support, but that is probably a problem on our side, as we were using the wrong wheel and have not yet been able to fix that.
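
A minimal way to repeat this check (a sketch; the image name is the one from this issue, and a False result would reflect the wrong wheel mentioned above):

$ singularity exec --nv robotour_arm64.simg ls /.singularity.d/libs/
# shows the tegra libraries despite the "could not find" INFO message
$ singularity exec --nv robotour_arm64.simg \
    python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"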

Singularity 4.1
I tried installing Singularity 4.1 on the Jetson but was unsuccessful. The problem seems to be with libfuse-dev: on Ubuntu 18.04, only libfuse2 is available, and a manual installation of libfuse3 also failed for a reason I have not tracked down.
I may try again later, but because of this I do not yet have any feedback on --oci mode for you.
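
An untested sketch of a manual libfuse 3.x build on Ubuntu 18.04, in case it helps someone (libfuse 3 builds with Meson, and the meson packaged in bionic may be too old, hence the pip route assumed here):

$ sudo apt-get install ninja-build python3-pip
$ pip3 install --user meson
# download and unpack a libfuse 3.x release tarball from
# https://github.com/libfuse/libfuse/releases, then from its source directory:
$ meson setup build
$ ninja -C build
$ sudo ninja -C build install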

tri-adam (Member) commented Apr 9, 2024

Similarity to previous issue
I agree that it seems similar to #1850; however, I believe the problem there was that no libraries were exported at all. In our case some are exported, since we are using quite an old version of nvidia-container-cli. That is why I opened a new issue instead of commenting on the old one. When trying to reproduce the problem on a Jetson Orin (as opposed to the Jetson Xavier where it was first encountered), we also saw that no libraries were provided (with a fresh install of the nvidia-container package).

Ah, that makes sense. It looks like this was deprecated in v1.10.0 of the NVIDIA Container Toolkit (NVIDIA/nvidia-container-toolkit#90 (comment)), so as you say, that wouldn't be what you're hitting.

Odd behavior in binding nv libraries
After some more digging, I found that although the script prints "Could not find any nv files on this host!", all of the libraries listed by nvidia-container-cli list --binaries --libraries are bound into the /.singularity.d/libs/ directory, which seems odd. The log from an execution with the -v and -d flags is here. Line 17 shows the "could not find" message, and lines 136-149 show that the libraries are added and later mounted. PyTorch inside the container still does not have CUDA support, but that is probably a problem on our side, as we were using the wrong wheel and have not yet been able to fix that.

Taking a quick scan through the code of that version of Singularity, it looks like that message is only emitted when no bins or ipcs are found:

files := make([]string, len(bins)+len(ipcs))
if len(files) == 0 {
    sylog.Infof("Could not find any %s files on this host!", gpuPlatform)
} else {

The libraries are handled separately:

if len(libs) == 0 {
    sylog.Warningf("Could not find any %s libraries on this host!", gpuPlatform)
    sylog.Warningf("You may need to manually edit %s", gpuConfFile)
} else {
    engineConfig.SetLibrariesPath(libs)
}

So that looks like it's functioning as expected based on the output you shared from nvidia-container-cli list --binaries --libraries.

dtrudg (Member) commented Jun 14, 2024

Newer versions of SingularityCE don't use nvidia-container-cli to find the library list when only the --nv flag is specified. We only call nvidia-container-cli if --nvccli is also specified, in which case it performs container setup, and SingularityCE is not performing the bindings itself.

If you use a current version of SingularityCE, run with --nv only, and are able to provide a complete list of required libraries in the etc/nvliblist.conf file, then it's likely that the binding will work as expected.
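
As a sketch only (untested on Jetson; the /usr/local prefix is an assumption and depends on how SingularityCE was installed), the list you already get from nvidia-container-cli could be used to extend nvliblist.conf:

$ nvidia-container-cli list --libraries | xargs -n1 basename | \
    sed 's/\.so\..*/.so/' | sort -u | \
    sudo tee -a /usr/local/etc/singularity/nvliblist.conf

Note that --nv resolves the nvliblist.conf entries through the ldconfig cache, so the tegra library directories need to be known to ldconfig on the host.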

Given the deprecation of nvidia-container-cli for Tegra-based systems, we aren't going to be able to handle library binding via nvidia-container-cli. The future of GPU support on Jetson revolves around CDI, which we support in our --oci mode.
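
For anyone who wants to try the CDI route, the general shape is something like the following (a sketch; check the flag spellings against the NVIDIA Container Toolkit and SingularityCE 4.x documentation, and the image must be usable under --oci):

$ sudo nvidia-ctk cdi generate --mode=csv --output=/etc/cdi/nvidia.yaml
# csv mode is the one used on Tegra/Jetson systems
$ singularity exec --oci --device nvidia.com/gpu=all robotour_arm64.simg \
    python3 -c "import torch; print(torch.cuda.is_available())"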

Jetson support for native mode (without --oci) would depend on #1395 - so it'd be appropriate to add a comment there if it's important to you.

See also:

dtrudg closed this as not planned Jun 14, 2024