Not able to run Llama 7B float16 on my system or Google Colab #25

Open
Anindyadeep opened this issue Aug 27, 2023 · 8 comments

@Anindyadeep
Collaborator

Anindyadeep commented Aug 27, 2023

I have been testing the repo on my laptop and on Google Colab. Here is the system information for both environments.

My local system:

Memory: 16GB
CPU: AMD Ryzen 9 5900HX with Radeon Graphics
GPU: NVIDIA GeForce RTX 3060 Mobile / Max-Q 

Google colab

CPU: Intel Xeon (2) @ 2.199GHz 
GPU: NVIDIA Tesla T4 

Command to reproduce

!python MinimumExample/Example_ONNX_LlamaV2.py \
--onnx_file 7B_float16/ONNX/LlamaV2_7B_float16.onnx \
--embedding_file 7B_float16/embeddings.pth \
--tokenizer_path tokenizer.model \
--prompt "What is the lightest element?"

Output in my local system

python3 MinimumExample/Example_ONNX_LlamaV2.py --onnx_file 7B_float16/ONNX/LlamaV2_7B_float16.onnx --embedding_file 7B_float16/embeddings.pth --tokenizer_path tokenizer.model --prompt "hello"
/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'DmlExecutionProvider' is not in available provider names.Available providers: 'TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider'
  warnings.warn(
2023-08-27 12:25:33.996863660 [E:onnxruntime:, inference_session.cc:1644 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 33554432

Traceback (most recent call last):
  File "/home/anindyadeep/workspace/llama2-onnx/Llama-2-Onnx/MinimumExample/Example_ONNX_LlamaV2.py", line 166, in <module>
    response = run_onnx_llamav2(
  File "/home/anindyadeep/workspace/llama2-onnx/Llama-2-Onnx/MinimumExample/Example_ONNX_LlamaV2.py", line 47, in run_onnx_llamav2
    llm_session = onnxruntime.InferenceSession(
  File "/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 383, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/anindyadeep/anaconda3/envs/llm/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 435, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 33554432

Output in Google colab

/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:65: UserWarning: Specified provider 'DmlExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
  warnings.warn(
^C

This probably means the process is automatically getting killed.

So now I have two questions here:

  1. What might be the root cause here? Even though CUDA and everything else is installed, it is falling back to DmlExecutionProvider and giving an error.
  2. The execution time is large here. Even though I end up with an error or the process getting killed, reaching that state takes around 52-60 seconds in Google Colab (after which the process is killed with ^C) and 10-15 seconds on my local machine (after which it gives the error).

Update:

I made some changes in the example code just to provide only the CPU execution provider.

options = onnxruntime.SessionOptions()
llm_session = onnxruntime.InferenceSession(
    onnx_file,
    sess_options=options,
    providers=[
        "CPUExecutionProvider",
    ],
)

I then ran the same command; it took more than 2.5 minutes and the process finally got killed. It seems like I might not have a compatible CUDA / onnxruntime combination, which could be what is generating the error.

cuda version: 12.2
onnx version: 1.15.1
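
For reference, here is a minimal sketch (an assumption on my side, not something from the repo) of requesting the CUDA provider with an explicit CPU fallback instead of the DmlExecutionProvider the script asks for by default. It assumes the onnxruntime-gpu package is installed and reuses the model path from the command above; note that the allocation failure in the log looks like plain memory exhaustion (a 7B float16 model weighs roughly 13 GB), so this may still fail on a 6 GB-class mobile GPU or a 16 GB-RAM machine.

import onnxruntime

# Sketch: request the CUDA execution provider explicitly, with a CPU fallback,
# instead of the DirectML provider the example script requests by default.
options = onnxruntime.SessionOptions()
llm_session = onnxruntime.InferenceSession(
    "7B_float16/ONNX/LlamaV2_7B_float16.onnx",  # model path from the command above
    sess_options=options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)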
@sania96
Collaborator

sania96 commented Nov 15, 2023


Hi, did you resolve the issue?
I am having the same issue here.

@Anindyadeep
Collaborator Author

Nope, I didn't get any response, so I left the thread. But it is worth checking out again.

@raffaeleterribile
Collaborator

raffaeleterribile commented Nov 23, 2023

I was able to run the minimum example with Python 3.10.13 and NO CUDA: I'm using the CPU for inference because my GPU has limited memory. So instead of installing ONNX Runtime with "pip install torch onnxruntime-gpu", I installed it with "pip install torch onnxruntime".
I get the same warning about DirectML and the loading takes a long time, but I finally saw the response. I don't know exactly how much time it took because it was late and I went to sleep, so I saw the results in the morning.
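
If it helps as a sanity check, this small snippet (just a sketch) shows which execution providers the installed build exposes: the CPU-only "onnxruntime" package should list only CPUExecutionProvider, while "onnxruntime-gpu" with a working CUDA setup should also list CUDAExecutionProvider.

import onnxruntime

# Print the installed onnxruntime version and the execution providers it can use.
# The CPU-only package should report just ['CPUExecutionProvider'].
print(onnxruntime.__version__)
print(onnxruntime.get_available_providers())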

@Anindyadeep
Collaborator Author


That's awesome, but that time gap is too large to assess how well the runtime works. I will also check this out on my side.

@raffaeleterribile
Collaborator

raffaeleterribile commented Nov 24, 2023

Yes, it's slow. And I had to delete and recreate the Python virtual environment several times. Initially I installed "onnxruntime-gpu", uninstalled it, and installed "onnxruntime" (the CPU version), but I got other errors, so I deleted and recreated the virtual environment.
To use CUDA (if you have a GPU with enough memory), you have to use version 11.8: version 12 is not compatible with onnxruntime.

@Anindyadeep
Collaborator Author


Wow, that's a lot of ifs and buts, but yeah, got it. Thanks for the workaround.

@merveermann
Collaborator

Hello all, I actually came across the same problem but with the 7B_FT_float32 model. I have two GPUs that have 24 GB of GPU memory, but as far as I understand, to run the 7B_FT_float32 model, a minimum of 25 GB of GPU memory is needed. So, is there a way to run this on my device? Is it possible to run ONNXRuntime on multiple GPUs?
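
As far as I understand (a sketch, not verified against this repo), the CUDA execution provider binds each session to a single GPU chosen through its device_id option, so a 25 GB model will not be split across two 24 GB cards automatically; multi-GPU inference would need the model itself to be partitioned. Selecting the device would look roughly like this, with the model path assumed from the name above:

import onnxruntime

# Sketch: pin this session to GPU 0 via the CUDA provider's device_id option.
# Each InferenceSession runs on one device; it does not shard a model across GPUs.
llm_session = onnxruntime.InferenceSession(
    "7B_FT_float32/ONNX/LlamaV2_7B_FT_float32.onnx",  # assumed path, not verified
    providers=[
        ("CUDAExecutionProvider", {"device_id": 0}),
        "CPUExecutionProvider",
    ],
)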

@avsanjay
Collaborator

avsanjay commented Dec 4, 2023

Yes, running on multiple GPUs would be very useful.

