Not able to run Llama 7B float16 on my system or Google Colab #25

Open
Anindyadeep opened this issue Aug 27, 2023 · 8 comments

@Anindyadeep
Collaborator

Anindyadeep commented Aug 27, 2023

I have been testing the repo on my laptop and on Google Colab. Here is the system information for both environments.

My local system:

Memory: 16GB
CPU: AMD Ryzen 9 5900HX with Radeon Graphics
GPU: NVIDIA GeForce RTX 3060 Mobile / Max-Q 

Google colab

CPU: Intel Xeon (2) @ 2.199GHz 
GPU: NVIDIA Tesla T4 

Command to reproduce

!python MinimumExample/Example_ONNX_LlamaV2.py \
--onnx_file 7B_float16/ONNX/LlamaV2_7B_float16.onnx \
--embedding_file 7B_float16/embeddings.pth \
--tokenizer_path tokenizer.model \
--prompt "What is the lightest element?"

Output in my local system

python3 MinimumExample/Example_ONNX_LlamaV2.py --onnx_file 7B_float16/ONNX/LlamaV2_7B_float16.onnx --embedding_file 7B_float16/embeddings.pth --tokenizer_path tokenizer.model --prompt "hello"
/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'DmlExecutionProvider' is not in available provider names.Available providers: 'TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider'
  warnings.warn(
2023-08-27 12:25:33.996863660 [E:onnxruntime:, inference_session.cc:1644 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 33554432

Traceback (most recent call last):
  File "/home/anindyadeep/workspace/llama2-onnx/Llama-2-Onnx/MinimumExample/Example_ONNX_LlamaV2.py", line 166, in <module>
    response = run_onnx_llamav2(
  File "/home/anindyadeep/workspace/llama2-onnx/Llama-2-Onnx/MinimumExample/Example_ONNX_LlamaV2.py", line 47, in run_onnx_llamav2
    llm_session = onnxruntime.InferenceSession(
  File "/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 383, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 435, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 33554432

Output in Google colab

/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'DmlExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
  warnings.warn(
^C

This probably means the process is automatically getting killed.

So now I have two questions here:

  1. What might be the root cause here? Even though CUDA and everything else is installed, it is falling back to DmlExecutionProvider and giving an error.
  2. The execution time is large here. Even though I end up with an error or the process getting killed, reaching that state takes around 52-60 seconds in Google Colab (after which the process is killed with ^C) and 10-15 seconds on my local machine (after which it gives the error).

Update:

I made some changes in the example code just to provide only the CPU execution provider.

options = onnxruntime.SessionOptions()
llm_session = onnxruntime.InferenceSession(
    onnx_file,
    sess_options=options,
    providers=[
        "CPUExecutionProvider",
    ],
)

I then ran the same command; it took more than 2.5 minutes and the process finally got killed. It seems like I might not have a compatible CUDA / onnxruntime combination, which could be what is generating the error.

cuda version: 12.2
onnx version: 1.15.1
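
For reference, here is a minimal sketch (an assumption on my side, not something from the repo) of requesting the CUDA provider with an explicit CPU fallback instead of the DmlExecutionProvider the script asks for by default. It assumes the onnxruntime-gpu package is installed and reuses the model path from the command above; note that the allocation failure in the log looks like plain memory exhaustion (a 7B float16 model weighs roughly 13 GB), so this may still fail on a 6 GB-class mobile GPU or a 16 GB-RAM machine.

import onnxruntime

# Sketch: request the CUDA execution provider explicitly, with a CPU fallback,
# instead of the DirectML provider the example script requests by default.
options = onnxruntime.SessionOptions()
llm_session = onnxruntime.InferenceSession(
    "7B_float16/ONNX/LlamaV2_7B_float16.onnx",  # model path from the command above
    sess_options=options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)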
@sania96
Collaborator

sania96 commented Nov 15, 2023


Hi, did you resolve the issue?
I am having the same issue here.

@Anindyadeep
Collaborator Author

Nope, I didn't get any response, so I left the thread. But it is worth checking out again.

@raffaeleterribile
Collaborator

raffaeleterribile commented Nov 23, 2023

I was able to run the minimum example with Python 3.10.13 and NO CUDA: I'm using the CPU for inference because my GPU has limited memory. So instead of installing ONNX Runtime with "pip install torch onnxruntime-gpu", I installed it with "pip install torch onnxruntime".
I get the same warning about DirectML and the loading takes a long time, but I finally saw the response. I don't know exactly how much time it took because it was late and I went to sleep, so I saw the results in the morning.
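
If it helps as a sanity check, this small snippet (just a sketch) shows which execution providers the installed build exposes: the CPU-only "onnxruntime" package should list only CPUExecutionProvider, while "onnxruntime-gpu" with a working CUDA setup should also list CUDAExecutionProvider.

import onnxruntime

# Print the installed onnxruntime version and the execution providers it can use.
# The CPU-only package should report just ['CPUExecutionProvider'].
print(onnxruntime.__version__)
print(onnxruntime.get_available_providers())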

@Anindyadeep
Collaborator Author


That's awesome, but that time gap is too large to assess how well the runtime works. I will also check this out on my side.

@raffaeleterribile
Collaborator

raffaeleterribile commented Nov 24, 2023

Yes, it's slow. And I had to delete and recreate the Python virtual environment several times. Initially I installed "onnxruntime-gpu", uninstalled it, and installed "onnxruntime" (the CPU version), but I got other errors, so I deleted and recreated the virtual environment.
To use CUDA (if you have a GPU with enough memory), you have to use version 11.8: version 12 is not compatible with onnxruntime.

@Anindyadeep
Collaborator Author


Wow, that's a lot of ifs and buts, but yeah, got it. Thanks for the workaround.

@merveermann
Collaborator

Hello all, I actually came across the same problem but with the 7B_FT_float32 model. I have two GPUs that have 24 GB of GPU memory, but as far as I understand, to run the 7B_FT_float32 model, a minimum of 25 GB of GPU memory is needed. So, is there a way to run this on my device? Is it possible to run ONNXRuntime on multiple GPUs?
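
As far as I understand (a sketch, not verified against this repo), the CUDA execution provider binds each session to a single GPU chosen through its device_id option, so a 25 GB model will not be split across two 24 GB cards automatically; multi-GPU inference would need the model itself to be partitioned. Selecting the device would look roughly like this, with the model path assumed from the name above:

import onnxruntime

# Sketch: pin this session to GPU 0 via the CUDA provider's device_id option.
# Each InferenceSession runs on one device; it does not shard a model across GPUs.
llm_session = onnxruntime.InferenceSession(
    "7B_FT_float32/ONNX/LlamaV2_7B_FT_float32.onnx",  # assumed path, not verified
    providers=[
        ("CUDAExecutionProvider", {"device_id": 0}),
        "CPUExecutionProvider",
    ],
)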

@avsanjay
Collaborator

avsanjay commented Dec 4, 2023

Yes, running on multiple GPUs would be very useful.

