[Performance] 40% slowdown in ONNX Resize Operator on CPU #23391

Open
SuhwanSong opened this issue Jan 16, 2025 · 7 comments
Labels: performance (issues related to performance regressions)

Comments

@SuhwanSong

Describe the issue

We observed a significant performance regression (~40% slowdown) in the Resize operator when using Float32 and Int64 data types on the CPU.
This slowdown impacts workloads that rely heavily on the Resize operator, particularly in image processing tasks.

After bisecting, we found that commit 6cc06ad introduced the slowdown.

range: 5fa4505..6cc06ad

model:

[model graph screenshot]

analysis:

[name: model_loading_uri Op: Unknown]: 622 / 662 : 93.95770392749245%
[name: session_initialization Op: Unknown]: 3737 / 4004 : 93.33166833166833%
[name: /Resize_fence_before Op: Resize]: 48 / 1 : 4800.0%
[name: /Resize_kernel_time Op: Resize]: 84187 / 84921 : 99.13566726722483%
[name: /Resize_fence_after Op: Resize]: 0 / 12 : 0.0%
[name: /Resize_1_fence_before Op: Resize]: 4 / 22 : 18.181818181818183%
[name: /Resize_1_kernel_time Op: Resize]: 634602 / 444980 : 142.61360061126342%
[name: /Resize_1_fence_after Op: Resize]: 0 / 0 : 0%
[name: /Ceil_fence_before Op: Ceil]: 0 / 4 : 0.0%
[name: /Ceil_kernel_time Op: Ceil]: 259949 / 262544 : 99.01159424705955%
[name: /Ceil_fence_after Op: Ceil]: 10 / 0 : 0%
[name: SequentialExecutor::Execute Op: Unknown]: 1118971 / 924960 : 120.97506919218128%
[name: model_run Op: Unknown]: 1141098 / 950941 : 119.99671903935155%
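
For context, per-op timings like the ones above can be collected with ONNX Runtime's built-in profiler; a minimal sketch (file names are illustrative, and the tooling used to compare two traces is not shown):

import numpy as np
import onnxruntime

# Enable the built-in profiler so every kernel invocation is recorded.
sess_options = onnxruntime.SessionOptions()
sess_options.enable_profiling = True

session = onnxruntime.InferenceSession(
    'model.onnx', sess_options, providers=['CPUExecutionProvider'])

input_data = np.load('input.npy', allow_pickle=True).item()
for _ in range(100):
    session.run(None, input_data)

# Writes a JSON trace with one entry per kernel and returns its path.
print(session.end_profiling())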

To reproduce

  1. Download and unzip "model.zip".
  2. Run the following script.
import time
import onnxruntime
import numpy as np

# Set the random seed
np.random.seed(0)

onnx_model_path = 'model.onnx'

# Load the ONNX model with the CPUExecutionProvider
ort_session = onnxruntime.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])
ort_session.get_modelmeta()
inputs = ort_session.get_inputs()

nth = 100000

# Warm-up inference to cache optimizations

input_data = np.load("input.npy", allow_pickle=True).item()
ort_session.run(None, input_data)

# Measure inference time excluding input creation
total_time_ns = 0
for _ in range(nth):

    start_ns = time.perf_counter_ns()
    ort_session.run(None, input_data)
    end_ns = time.perf_counter_ns()

    total_time_ns += end_ns - start_ns

avg_time_ns = total_time_ns / nth
avg_time_ms = avg_time_ns / 1e6

print(f'[{onnxruntime.__version__}] Average inference time: {avg_time_ms:.5f} ms')

Urgency

No response

Platform

Linux

OS Version

6.8.0

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.20.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

model.zip

Is this a quantized model?

Yes

SuhwanSong added the performance label on Jan 16, 2025
@tianleiwu
Contributor

I could not reproduce the issue on my machine.
The latency outputs for ORT 1.18.0 and ORT 1.20.1 are very close on my machine.

@SuhwanSong
Author

SuhwanSong commented Jan 17, 2025

@tianleiwu

The latency outputs for ORT 1.18.0 and 1.20.1 are also very close on my machine.
However, the latencies differ significantly when comparing ORT 1.19.0 and 1.20.1.
This variation is due to a problematic commit introduced between ORT 1.19.0 and 1.20.1.

Could you test with ORT 1.19.0 and 1.20.1? You can also use these two adjacent commits:
5fa4505..6cc06ad

ORT 1.18 vs 1.20 (1.20 is faster than 1.18)

[name: model_loading_uri Op: Unknown]: 648 / 621 : 104.34782608695652%
[name: session_initialization Op: Unknown]: 3572 / 3671 : 97.30318714246799%
[name: /Resize_fence_before Op: Resize]: 0 / 0 : 0%
[name: /Resize_kernel_time Op: Resize]: 83168 / 74997 : 110.89510247076551%
[name: /Resize_fence_after Op: Resize]: 16 / 12 : 133.33333333333331%
[name: /Resize_1_fence_before Op: Resize]: 3 / 16 : 18.75%
[name: /Resize_1_kernel_time Op: Resize]: 775730 / 869052 : 89.26163221533349%
[name: /Resize_1_fence_after Op: Resize]: 0 / 1 : 0.0%
[name: /Ceil_fence_before Op: Ceil]: 0 / 0 : 0%
[name: /Ceil_kernel_time Op: Ceil]: 256553 / 239340 : 107.19186095094844%
[name: /Ceil_fence_after Op: Ceil]: 0 / 0 : 0%
[name: SequentialExecutor::Execute Op: Unknown]: 1279805 / 1329464 : 96.26473526172954%
[name: model_run Op: Unknown]: 1299886 / 1350643 : 96.24201213792246%

ORT 1.19 vs 1.20 (1.19 is faster than 1.20)

[name: model_loading_uri Op: Unknown]: 666 / 499 : 133.46693386773546%
[name: session_initialization Op: Unknown]: 3596 / 2560 : 140.46875%
[name: /Resize_fence_before Op: Resize]: 20 / 0 : 0%
[name: /Resize_kernel_time Op: Resize]: 83151 / 82814 : 100.40693602530972%
[name: /Resize_fence_after Op: Resize]: 18 / 0 : 0%
[name: /Resize_1_fence_before Op: Resize]: 0 / 0 : 0%
[name: /Resize_1_kernel_time Op: Resize]: 781757 / 645172 : 121.17032357262869%
[name: /Resize_1_fence_after Op: Resize]: 21 / 12 : 175.0%
[name: /Ceil_fence_before Op: Ceil]: 0 / 0 : 0%
[name: /Ceil_kernel_time Op: Ceil]: 261431 / 256024 : 102.11191138330781%
[name: /Ceil_fence_after Op: Ceil]: 0 / 0 : 0%
[name: SequentialExecutor::Execute Op: Unknown]: 1253278 / 1110786 : 112.82803348259701%
[name: model_run Op: Unknown]: 1273372 / 1130871 : 112.60099516213609%

ORT 5fa4505 vs 6cc06ad (5fa4505 is faster)

[name: model_loading_uri Op: Unknown]: 673 / 613 : 109.7879282218597%
[name: session_initialization Op: Unknown]: 4013 / 3739 : 107.3281626103236%
[name: /Resize_fence_before Op: Resize]: 0 / 47 : 0.0%
[name: /Resize_kernel_time Op: Resize]: 83368 / 84050 : 99.18857822724569%
[name: /Resize_fence_after Op: Resize]: 12 / 0 : 0%
[name: /Resize_1_fence_before Op: Resize]: 0 / 0 : 0%
[name: /Resize_1_kernel_time Op: Resize]: 635616 / 449415 : 141.4318614198458%
[name: /Resize_1_fence_after Op: Resize]: 0 / 0 : 0%
[name: /Ceil_fence_before Op: Ceil]: 0 / 0 : 0%
[name: /Ceil_kernel_time Op: Ceil]: 261413 / 261593 : 99.9311908193262%
[name: /Ceil_fence_after Op: Ceil]: 3 / 5 : 60.0%
[name: SequentialExecutor::Execute Op: Unknown]: 1110583 / 927192 : 119.77918273669314%
[name: model_run Op: Unknown]: 1132598 / 949394 : 119.2969409960459%

@tianleiwu
Contributor

tianleiwu commented Jan 22, 2025

@SuhwanSong, neither commit (5fa4505, 6cc06ad) changed the Resize operator. It is strange that it could cause a performance difference.

The results on my test machine using your test script:

[1.20.1] Average inference time: 0.06691 ms
[1.19.0] Average inference time: 0.06516 ms
[1.18.0] Average inference time: 0.06706 ms

Note that the model has three nodes after optimization: Resize, Resize and Ceil. So the above latency difference between 1.19.0 and 1.20.1 is likely caused by the Ceil node.

[screenshot of the optimized graph]
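
To double-check which nodes survive graph optimization, the optimized model can be dumped to disk through SessionOptions (the output path below is illustrative) and then inspected, e.g. in Netron:

import onnxruntime

sess_options = onnxruntime.SessionOptions()
# Write out the graph as it looks after ORT's graph optimizations,
# which in this case leaves only Resize, Resize and Ceil.
sess_options.optimized_model_filepath = 'model_optimized.onnx'

onnxruntime.InferenceSession(
    'model.onnx', sess_options, providers=['CPUExecutionProvider'])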

@SuhwanSong
Author

Thank you for your efforts!

In my experiments on a 48-core server, I observed a performance difference. However, when I tested on a 16-core server, I obtained results similar to yours (i.e., no difference).
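
One way to check whether the gap depends on the thread count (I have not confirmed this as the root cause) is to pin the intra-op thread pool size when timing, for example:

import numpy as np
import onnxruntime

# Pin the intra-op thread pool so the 48-core and 16-core runs
# use the same number of threads.
sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = 16

session = onnxruntime.InferenceSession(
    'model.onnx', sess_options, providers=['CPUExecutionProvider'])
input_data = np.load('input.npy', allow_pickle=True).item()
session.run(None, input_data)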

I identified a perf regression test case from the 16-core setup.
model.zip
Using GDB, I verified that the code in the screenshot below was executed (while REGISTER_KERNEL_TYPED(float) was executed in both commits).

[code screenshot]

Although I’m not sure about the root cause, I’ve noticed similar behavior with ArgMin and ArgMax as well.

[screenshot]

5fa4505 vs 6cc06ad

model_loading_uri : 230 / 251 : 109.13043478260869%
session_initialization : 1731 / 2030 : 117.27325245522819%
/Clip_kernel_time : 55072 / 61670 : 111.98067983730388%
/ArgMax_kernel_time : 90915 / 132457 : 145.6932299400539%
SequentialExecutor::Execute : 223523 / 274455 : 122.7860220201053%
model_run : 259777 / 310047 : 119.35121277095355%

@tianleiwu
Contributor

@SuhwanSong, could you try building from source with this PR: #23433 to see whether it resolves the regression?

The reason is that Ceil/Clip/ArgMax are all implemented using Eigen.

@SuhwanSong
Author

SuhwanSong commented Jan 24, 2025

@tianleiwu

I attempted to build this PR (#23433), but the build failed because it uses Eigen::PropagateNaN, which was introduced in Eigen 3.4.

Here’s the error message I encountered:

/workdir/backends/onnxruntime/onnxruntime/core/providers/cpu/math/element_wise_ops.cc:708:56: error: template argument 1 is invalid
  708 |     min = min.array().template min<Eigen::PropagateNaN>(EigenMap<float>(data_n).array());
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/workdir/backends/onnxruntime/onnxruntime/core/providers/cpu/math/element_wise_ops.cc: In lambda function:
/workdir/backends/onnxruntime/onnxruntime/core/providers/cpu/math/element_wise_ops.cc:724:72: error: 'PropagateNaN' is not a member of 'Eigen'

It seems this feature is only available starting with Eigen 3.4 (reference: Eigen 3.4 Release Notes#Improvement_to_NaN_propagation).

The root cause of this issue (#23337) might also be related to the Eigen library.

@SuhwanSong
Author

@tianleiwu

I tested commit 6cc06ad with both Eigen 3.4 and the Eigen nightly version to compare performance.

Here are the results (Eigen 3.4 vs nightly):

model_loading_uri         : 255 / 265      = 103.92%  
session_initialization    : 2082 / 2078    = 99.81%  
/Clip_kernel_time         : 58114 / 55940  = 96.26%  
/ArgMax_kernel_time       : 128575 / 115658 = 89.95%  
SequentialExecutor::Execute : 263916 / 247322 = 93.71%  
model_run                 : 298355 / 282185 = 94.58%  

Overall, the performance overhead decreased by approximately 10% with the Eigen nightly version.
