
[Bug]: High RAM usage in iGPU #28009

Open · yaniv5678 opened this issue Dec 10, 2024 · 14 comments
Labels: bug (Something isn't working), category: GPU (OpenVINO GPU plugin), PSE

OpenVINO Version

2024.5.0

Operating System

Windows System

Device used for inference

GPU

Framework

None

Model used

deberta-v3-mini

Issue description

Hi,
I converted deberta-v3-mini to OpenVINO using optimum-cli, with weights compressed to int8. The file size on disk is ~160 MB.
I then compiled the model with both the Python openvino package and openvino-rs. In both cases the model took ~500 MB of RAM.
Using an "int8" inference precision hint for my iGPU (Intel Iris Xe, with a Core i5) didn't help; it actually took even more RAM (around 1.2 GB!).

When I compiled the same model for the CPU device, it somehow took only ~40 MB.

Can you help me understand why, and how to decrease RAM usage in the GPU case?
Is this a bug?

Thanks!

Step-by-step reproduction

# 1. Export the model with int8 weight compression:
optimum-cli export openvino --model microsoft/deberta-v3-small --weight-format int8 deberta

# 2. Compile the exported model for the GPU in Python:
import openvino as ov

core = ov.Core()
compiled_model = core.compile_model("deberta/openvino_model.xml", device_name="GPU")
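
To quantify the reported numbers, here is a minimal measurement sketch (it assumes the psutil package is installed and is not part of the original reproducer) that prints how much the process RSS grows during compilation:

import psutil
import openvino as ov

proc = psutil.Process()
rss_before = proc.memory_info().rss  # resident set size before loading the model

core = ov.Core()
compiled_model = core.compile_model("deberta/openvino_model.xml", device_name="GPU")

rss_after = proc.memory_info().rss
print(f"RSS grew by {(rss_after - rss_before) / 2**20:.1f} MiB during compile")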

Relevant log output

No response

Issue submission checklist

  • I'm reporting an issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.

Aznie-Intel commented Dec 11, 2024

Hi @yaniv5678, how did you check the memory usage when compiling the model on CPU and GPU? Also note that GPU performance relies on OpenCL kernels for the implementation; you can refer to the GPU Performance Checklist.


yaniv5678 commented Dec 11, 2024

Hi @Aznie-Intel! Thanks for your prompt response.

I've checked using the task manager. I made sure to only read and compile the model, and then put the process to sleep to be sure it isn't related to other code.
I've checked it multiple times and experienced the same RAM usage for my process.

  • Can you please try to reproduce it in your environment?
  • Has anyone run DeBERTa or something similar?

I've gone through the GPU performance checklist, thanks.
It didn't help; I think I'm already aligned with the tips and have tried the relevant ones,
e.g. model caching, which didn't reduce RAM consumption.

Do you know what makes up these 500 MB of RAM?
Is it only the model weights or something else large? (I thought the weights were mmap-ed, so I shouldn't see them at all; instead I see not only them but 500 MB of RAM consumption.)
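
For reference, a minimal sketch of how model caching is typically enabled (the cache directory name is illustrative); caching is mainly aimed at speeding up subsequent model compilations rather than reducing steady-state RAM:

import openvino as ov

core = ov.Core()
core.set_property({"CACHE_DIR": "ov_cache"})  # compiled blobs are cached here on first compile
compiled_model = core.compile_model("deberta/openvino_model.xml", device_name="GPU")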


Aznie-Intel commented Dec 12, 2024

@yaniv5678 Below are my observations for both CPU and GPU.

CPU: [screenshot: cpu_1]

GPU: [screenshot: gpu_1]

There is no significant difference in memory consumption between CPU and GPU on my side. Can you provide the following information for further investigation:

  1. Run the Hello Query Device Python Sample to find your GPU device specification (a minimal query sketch is shown below).
  2. Intel® Graphics Compute Runtime for OpenCL™ driver version.
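
A minimal query sketch along the lines of that sample, using standard OpenVINO Python API calls:

import openvino as ov

core = ov.Core()
for device in core.available_devices:
    # FULL_DEVICE_NAME is one of the properties the Hello Query Device sample prints
    print(device, ":", core.get_property(device, "FULL_DEVICE_NAME"))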


yaniv5678 commented Dec 13, 2024

@Aznie-Intel

  1. Below is the output of the "Hello Query Device" script:
[ INFO ] Available devices:
[ INFO ] CPU :
[ INFO ]        SUPPORTED_PROPERTIES:
[ INFO ]                AVAILABLE_DEVICES:
[ INFO ]                RANGE_FOR_ASYNC_INFER_REQUESTS: 1, 1, 1
[ INFO ]                RANGE_FOR_STREAMS: 1, 12
[ INFO ]                EXECUTION_DEVICES: CPU
[ INFO ]                FULL_DEVICE_NAME: 13th Gen Intel(R) Core(TM) i5-1335U
[ INFO ]                OPTIMIZATION_CAPABILITIES: FP32, INT8, BIN, EXPORT_IMPORT
[ INFO ]                DEVICE_TYPE: Type.INTEGRATED
[ INFO ]                DEVICE_ARCHITECTURE: intel64
[ INFO ]                NUM_STREAMS: 1
[ INFO ]                INFERENCE_NUM_THREADS: 0
[ INFO ]                PERF_COUNT: False
[ INFO ]                INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ]                PERFORMANCE_HINT: PerformanceMode.LATENCY
[ INFO ]                EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]                PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]                ENABLE_CPU_PINNING: True
[ INFO ]                SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ]                MODEL_DISTRIBUTION_POLICY: set()
[ INFO ]                ENABLE_HYPER_THREADING: True
[ INFO ]                DEVICE_ID:
[ INFO ]                CPU_DENORMALS_OPTIMIZATION: False
[ INFO ]                LOG_LEVEL: Level.NO
[ INFO ]                CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[ INFO ]                DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]                KV_CACHE_PRECISION: <Type: 'float16'>
[ INFO ]                AFFINITY: Affinity.HYBRID_AWARE
[ INFO ]
[ INFO ] GPU :
[ INFO ]        SUPPORTED_PROPERTIES:
[ INFO ]                AVAILABLE_DEVICES: 0
[ INFO ]                RANGE_FOR_ASYNC_INFER_REQUESTS: 1, 2, 1
[ INFO ]                RANGE_FOR_STREAMS: 1, 2
[ INFO ]                OPTIMAL_BATCH_SIZE: 1
[ INFO ]                MAX_BATCH_SIZE: 1
[ INFO ]                DEVICE_ARCHITECTURE: GPU: vendor=0x8086 arch=v12.3.0
[ INFO ]                FULL_DEVICE_NAME: Intel(R) Iris(R) Xe Graphics (iGPU)
[ INFO ]                DEVICE_UUID: *****
[ INFO ]                DEVICE_LUID: *****
[ INFO ]                DEVICE_TYPE: Type.INTEGRATED
[ INFO ]                DEVICE_GOPS: {<Type: 'float16'>: 3200.0, <Type: 'float32'>: 1600.0, <Type: 'int8_t'>: 6400.0, <Type: 'uint8_t'>: 6400.0}
[ INFO ]                OPTIMIZATION_CAPABILITIES: FP32, BIN, FP16, INT8, EXPORT_IMPORT
[ INFO ]                GPU_DEVICE_TOTAL_MEM_SIZE: 7441600512
[ INFO ]                GPU_UARCH_VERSION: 12.3.0
[ INFO ]                GPU_EXECUTION_UNITS_COUNT: 80
[ INFO ]                GPU_MEMORY_STATISTICS: {}
[ INFO ]                PERF_COUNT: False
[ INFO ]                MODEL_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_HOST_TASK_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_QUEUE_PRIORITY: Priority.MEDIUM
[ INFO ]                GPU_QUEUE_THROTTLE: Priority.MEDIUM
[ INFO ]                GPU_ENABLE_SDPA_OPTIMIZATION: True
[ INFO ]                GPU_ENABLE_LOOP_UNROLLING: True
[ INFO ]                GPU_DISABLE_WINOGRAD_CONVOLUTION: False
[ INFO ]                CACHE_DIR:
[ INFO ]                CACHE_MODE: CacheMode.OPTIMIZE_SPEED
[ INFO ]                PERFORMANCE_HINT: PerformanceMode.LATENCY
[ INFO ]                EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]                COMPILATION_NUM_THREADS: 12
[ INFO ]                NUM_STREAMS: 1
[ INFO ]                PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]                INFERENCE_PRECISION_HINT: <Type: 'float16'>
[ INFO ]                ENABLE_CPU_PINNING: False
[ INFO ]                DEVICE_ID: 0
[ INFO ]                DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]                ACTIVATIONS_SCALE_FACTOR: 0.0
[ INFO ]
  2. I couldn't find where the "Intel® Graphics Compute Runtime for OpenCL™ driver" version is stored.
    Anyway, I opened "Intel Graphics Command Center" and saw the following versions listed there:
  • DirectX 12
  • Graphics Driver 32.0.101.5972 (no update data available)
  • Shader version 6.6
  • OpenCL Runtime Version: 3.0
  • Vulkan 1.3.289
  • Graphics Output Protocol (GOP) version: 21.0.1060

@Aznie-Intel

@yaniv5678 Thanks for the information. I will check this with the relevant team and update you soon.

avitial self-assigned this Dec 16, 2024

avitial commented Dec 24, 2024

Ref. 159902

avitial added the category: GPU (OpenVINO GPU plugin) label and removed the support_request label on Dec 24, 2024

dnkurek commented Jan 15, 2025

Hi, please try again with the latest master version.

The issue has been partially addressed by #28167.

It turns out that these kinds of models make lots of small allocations, and on my Windows machine the driver's allocation alignment is 64 KB, which means every small allocation consumes 64 KB even if it holds a single byte. That is quite wasteful.

With that change the alignment is reduced to just 512 bytes; lower alignment values would tank the performance.

Still, the GPU plugin also has to create an OpenCL context, which can take, for instance, 100 MB to 150 MB (this depends heavily on the implementation), so RAM usage will always be higher than executing on the CPU. This is unavoidable.

So memory usage should now be somewhat better, but still nowhere near as good as on the CPU.
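
A back-of-the-envelope sketch of why the alignment matters; the allocation count and sizes below are made up purely for illustration:

def padded_total(alloc_sizes, alignment):
    # Each allocation is rounded up to a multiple of the driver's alignment.
    return sum(((size + alignment - 1) // alignment) * alignment for size in alloc_sizes)

# e.g. 4000 tiny buffers of ~100 bytes each (hypothetical numbers)
sizes = [100] * 4000
print(padded_total(sizes, 64 * 1024) / 2**20, "MiB at 64 KB alignment")  # 250.0 MiB
print(padded_total(sizes, 512) / 2**20, "MiB at 512 B alignment")        # ~2 MiB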


dnkurek commented Jan 15, 2025

Also, I have no idea why int8 is taking more RAM for you.


dnkurek commented Jan 15, 2025

Also, according to my tests, the model should actually run slightly faster (a few percent speedup) with that PR.


dnkurek commented Jan 15, 2025

Still, we will need to hunt down even more small buffers, since I still see many small allocations with this kind of model, each of them eating 64 KB of RAM. I don't yet know where they come from; I will investigate.

Also, on my test machines the Windows driver wants allocations to be 64 KB-aligned, while on Ubuntu they are 4 KB-aligned. You should see better results on Ubuntu because of this driver difference.

@yaniv5678

Thank you very much!
Do you know why the OCL context takes such a high amount of memory?
Does it duplicate the model weights?
Can you check if INT8 deberta-v3 small takes a lot of RAM in your setup as well?


dnkurek commented Jan 15, 2025

Hi, the OpenCL context takes that much memory because it needs to set up its internal memory structures, and possibly some caching to improve performance; I am not entirely sure what goes into that memory usage, though. I have also tested Nvidia and AMD cards, and they show the same kind of overhead for the OpenCL context. In fact, IIRC Nvidia used quite a lot (around 300 MB, if I remember correctly), while AMD used less but still a fair amount. For Intel, IIRC, I got around 100 MB to 150 MB, but it really depends on your environment.

I guess we would need to ask the driver team.

Bear in mind that the GPU plugin is technically more complicated, since it has to offload and delegate tasks to the GPU, compile kernels, and deal with the kernel-mode GPU driver along with backends like OpenCL. The CPU plugin does not need any of this, so naturally there is some overhead just for using the GPU.

So we depend on the driver too.

No, I don't think it duplicates the model weights.

Sure, but I will need to set things up on Windows and compile a newer version.


dnkurek commented Jan 15, 2025

Also, it could be that the INT8 deberta model simply allocates more small buffers. On a larger model you could see memory usage improvements over FP16 because the buffers are larger, but with a small model like this the savings get eaten up by the driver's memory allocation alignment requirements.

A possible solution is to use sub-buffers so we can apply our own alignment, and that's exactly what I did in my PR. However, it is only a partial solution, since for now it only does this for shape_info buffers.
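
A conceptual sketch of the sub-allocation idea (plain Python, not the actual plugin code): carve many small, 512-byte-aligned regions out of one large parent allocation, so the driver-level 64 KB alignment cost is paid only once.

def suballocate(sizes, alignment=512):
    # Pack many small buffers into one parent allocation at a chosen alignment.
    # Returns the aligned offset of each buffer and the total parent size.
    offsets, cursor = [], 0
    for size in sizes:
        offsets.append(cursor)
        # Round the next offset up to the alignment boundary.
        cursor = ((cursor + size + alignment - 1) // alignment) * alignment
    return offsets, cursor

# The same 4000 tiny buffers now fit in a single ~2 MiB parent allocation.
offsets, parent_size = suballocate([100] * 4000)
print(parent_size / 2**20, "MiB in one parent buffer")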


dnkurek commented Jan 15, 2025

And still, we would require either 256- or 512-byte alignment, which is probably significantly more than what the CPU aligns to. Using an alignment of 128 or below would simply tank performance on any GPU, and with no alignment at all I saw around a 3x slowdown. With an alignment of 256 or 512, however, I got good memory usage improvements and even slightly improved speed.

Basically, because of how GPUs work, we will never be as memory efficient as the CPU plugin, but we can get closer.
