ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects #249

Closed
TJ-Ouyang opened this issue Apr 1, 2024 · 7 comments
@TJ-Ouyang

Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
git version 2.34.1
torch.version = 2.2.1+cu121

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
from /usr/local/cuda/bin

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
return _build_backend().build_wheel(wheel_directory, config_settings,
File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 416, in build_wheel
return self._build_with_temp_dir(['bdist_wheel'], '.whl',
File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 401, in _build_with_temp_dir
self.run_setup()
File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 178, in <module>
File "<string>", line 40, in check_cuda_torch_binary_vs_bare_metal
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
In some cases, a minor-version mismatch will not cause later errors: NVIDIA/apex#323 (comment). You can try commenting out this check (at your own risk).
error: subprocess-exited-with-error

× Building wheel for apex (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmpvipwq2mw
cwd: /tmp/pip-req-build-isqlmxnv
Building wheel for apex (pyproject.toml) ... error
ERROR: Failed building wheel for apex
Failed to build apex
ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects


When running the inference code, I also get:

ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

I still could not resolve the problem by following the method in NVIDIA/apex#1653. I tried on both a server and Colab.
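
Editor's note: the RuntimeError in the log comes from comparing the CUDA release reported by nvcc (12.2 here) with the CUDA version the PyTorch binaries were built against (12.1, from torch 2.2.1+cu121). The sketch below reproduces that comparison on the values from the log. The helper names (`parse_nvcc_release`, `parse_torch_cuda`, `describe_mismatch`) are hypothetical and are not part of apex; on a real machine the second argument would presumably come from `torch.version.cuda`.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from the 'release X.Y' text in `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(torch_cuda):
    """Return (major, minor) from a version string such as '12.1'."""
    major, minor = torch_cuda.split(".")[:2]
    return int(major), int(minor)


def describe_mismatch(nvcc_output, torch_cuda):
    """Classify the relation between the toolkit and PyTorch CUDA versions."""
    toolkit = parse_nvcc_release(nvcc_output)
    binary = parse_torch_cuda(torch_cuda)
    if toolkit == binary:
        return "match"
    if toolkit[0] == binary[0]:
        return "minor mismatch"
    return "major mismatch"


# Values copied from the log above.
NVCC_OUTPUT = "Cuda compilation tools, release 12.2, V12.2.140"
print(describe_mismatch(NVCC_OUTPUT, "12.1"))  # minor mismatch
```

For this thread the result is a minor-version mismatch, which is the case the error message says "will not cause later errors" in some situations.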

@Edenzzzz

Edenzzzz commented Apr 2, 2024

This is a common error caused by a version mismatch between your system's global Nvidia driver (12.2) and PyTorch's CUDA (12.1). You should comment out this check:
[image]

@TJ-Ouyang
Author

This is a common error caused by a version mismatch between your system's global Nvidia driver (12.2) and PyTorch's CUDA (12.1). You should comment out this check: [image]

Then it comes with another error:

[1/1] c++ -MMD -MF /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/flatten_unflatten.o.d -pthread -B /data1/ouyangtianjian/.conda/envs/opensora/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /data1/ouyangtianjian/.conda/envs/opensora/include -fPIC -O2 -isystem /data1/ouyangtianjian/.conda/envs/opensora/include -fPIC -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/data1/ouyangtianjian/.conda/envs/opensora/include/python3.10 -c -c /data1/ouyangtianjian/apex-22.04-dev/csrc/flatten_unflatten.cpp -o /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/flatten_unflatten.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=apex_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
g++ -pthread -B /data1/ouyangtianjian/.conda/envs/opensora/compiler_compat -shared -Wl,-rpath,/data1/ouyangtianjian/.conda/envs/opensora/lib -Wl,-rpath-link,/data1/ouyangtianjian/.conda/envs/opensora/lib -L/data1/ouyangtianjian/.conda/envs/opensora/lib -Wl,-rpath,/data1/ouyangtianjian/.conda/envs/opensora/lib -Wl,-rpath-link,/data1/ouyangtianjian/.conda/envs/opensora/lib -L/data1/ouyangtianjian/.conda/envs/opensora/lib /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/flatten_unflatten.o -L/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/apex_C.cpython-310-x86_64-linux-gnu.so
building 'amp_C' extension
Emitting ninja build file /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/14] /data1/ouyangtianjian/.conda/envs/opensora/bin/nvcc -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/data1/ouyangtianjian/.conda/envs/opensora/include -I/data1/ouyangtianjian/.conda/envs/opensora/include/python3.10 -c -c /data1/ouyangtianjian/apex-22.04-dev/csrc/multi_tensor_novograd.cu -o /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_novograd.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
FAILED: /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_novograd.o
/data1/ouyangtianjian/.conda/envs/opensora/bin/nvcc -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/data1/ouyangtianjian/.conda/envs/opensora/include -I/data1/ouyangtianjian/.conda/envs/opensora/include/python3.10 -c -c /data1/ouyangtianjian/apex-22.04-dev/csrc/multi_tensor_novograd.cu -o /data1/ouyangtianjian/apex-22.04-dev/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_novograd.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
In file included from /data1/ouyangtianjian/apex-22.04-dev/csrc/multi_tensor_novograd.cu:3:
/data1/ouyangtianjian/.conda/envs/opensora/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:6:10: fatal error: cusparse.h: No such file or directory
6 | #include <cusparse.h>
| ^~~~~~~~~~~~
compilation terminated.
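
Editor's note: the fatal error means the compiler could not find cusparse.h in any directory it searched. As a rough diagnostic, the sketch below checks whether the header exists in the directories passed with -I in the failing nvcc command. The helper `find_header` is hypothetical, the paths are copied from the log, and nvcc may also search default directories of its own that this sketch does not cover.

```python
from pathlib import Path


def find_header(header_name, include_dirs):
    """Return the include directories that directly contain `header_name`."""
    return [d for d in include_dirs if (Path(d) / header_name).is_file()]


# Include directories copied from the -I flags of the failing nvcc command.
ENV = "/data1/ouyangtianjian/.conda/envs/opensora"
INCLUDE_DIRS = [
    ENV + "/lib/python3.10/site-packages/torch/include",
    ENV + "/lib/python3.10/site-packages/torch/include/torch/csrc/api/include",
    ENV + "/lib/python3.10/site-packages/torch/include/TH",
    ENV + "/lib/python3.10/site-packages/torch/include/THC",
    ENV + "/include",
    ENV + "/include/python3.10",
]

hits = find_header("cusparse.h", INCLUDE_DIRS)
print(hits if hits else "cusparse.h not found in any -I directory")
```

Note that the nvcc in this command lives inside the conda environment (.conda/envs/opensora/bin/nvcc), not under /usr/local/cuda/bin as in the first log, so the two builds may not be using the same CUDA toolkit.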

@Edenzzzz

Edenzzzz commented Apr 2, 2024

Try reinstalling your system nv driver to the same version?

@TJ-Ouyang
Author

Try reinstalling your system nv driver to the same version?

Unfortunately, the server administrator refuses to change the nv driver version (currently 12.2) because many people are using the GPUs. It also seems that PyTorch for CUDA 12.2 hasn't been released. Anyway, thank you for your help.

@TJ-Ouyang
Author

This is a common error caused by a version mismatch between your system's global Nvidia driver (12.2) and PyTorch's CUDA (12.1). You should comment out this check: [image]

Finally found the solution: NVIDIA/apex#323 (comment)

I should not have commented out the whole function. Only the check that begins with "if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):" (the condition together with the RuntimeError it raises) needs to be deleted.
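
Editor's note: a slightly more conservative variant of this fix is to relax the check instead of deleting it, so that a major-version mismatch still stops the build while a minor-version mismatch only produces a warning. The sketch below illustrates the idea. It is not apex's actual setup.py code: the function `check_cuda_versions_relaxed` and its string arguments are assumptions, and only the four variable names are taken from the condition quoted above.

```python
import warnings


def check_cuda_versions_relaxed(bare_metal_version, torch_binary_version):
    """Raise on a major-version mismatch, warn on a minor one.

    Both arguments are 'major.minor' strings such as '12.2' and '12.1'.
    Returns True when the versions match exactly, False after a warning.
    """
    bare_metal_major, bare_metal_minor = bare_metal_version.split(".")[:2]
    torch_binary_major, torch_binary_minor = torch_binary_version.split(".")[:2]

    if bare_metal_major != torch_binary_major:
        raise RuntimeError(
            "CUDA major versions differ: toolkit {} vs PyTorch {}".format(
                bare_metal_version, torch_binary_version
            )
        )
    if bare_metal_minor != torch_binary_minor:
        warnings.warn(
            "CUDA minor versions differ: toolkit {} vs PyTorch {}; "
            "continuing at your own risk".format(
                bare_metal_version, torch_binary_version
            )
        )
        return False
    return True


# Versions from this thread: system toolkit 12.2, PyTorch built with 12.1.
print(check_cuda_versions_relaxed("12.2", "12.1"))  # warns, then prints False
```

Whichever variant is used, the rest of the function should stay in place, which matches the observation above that commenting out the whole function was the wrong change.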


This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Apr 10, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.
