Replies: 4 comments 2 replies
-
I'm not sure I understand. To me, it doesn't seem like IOBinding is critical for GPU inference? If ORT automatically copies to/from the CPU (which has to happen anyway, since the data starts out in CPU memory), is explicit binding really needed?
-
Good question. IMO, IOBinding can save two memcpys: one on the input and one on the output.
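For illustration, here is a minimal sketch of that pattern using the official Python onnxruntime bindings (the model path and the "input"/"output" names are placeholders). The input is uploaded to the GPU once, explicitly, and the output is left on the device, so neither of the two implicit copies happens:

```python
import numpy as np
import onnxruntime as ort

# Placeholder model; assumes one input named "input" and one output named "output".
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
binding = sess.io_binding()

# Copy the input to GPU memory once, explicitly.
x = np.random.rand(8, 3, 224, 224).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
binding.bind_ortvalue_input("input", x_gpu)

# Let ORT allocate the output on the GPU instead of copying it back to the CPU.
binding.bind_output("output", "cuda")

sess.run_with_iobinding(binding)

# Copy back to the CPU only if/when the result is actually needed there.
result = binding.copy_outputs_to_cpu()[0]
```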
-
Essentially, in this code (Line 71 in bb2f924), the OrtTensor would be created directly in GPU memory instead of in CPU memory (in the latter case, another memcpy is implicitly performed by ORT). This is not a blocker, but it is very important if we want to do GPU serving efficiently -- GPU serving typically uses bigger batches, so the extra memcpy of the inputs is costly.
-
OK, I reread the code. I guess it is not strictly necessary, since the ndarray would be copied to GPU memory as an OrtTensor, so this is not a critical issue for single-model inference. It would be nice to have for pipelined model inference, though.
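For the pipelined case, the win is that the first model's output can be fed straight into the second model without ever leaving the GPU. A rough sketch with the Python onnxruntime bindings again (the model files and the "input"/"hidden"/"output" names are hypothetical):

```python
import onnxruntime as ort

sess_a = ort.InferenceSession("model_a.onnx", providers=["CUDAExecutionProvider"])
sess_b = ort.InferenceSession("model_b.onnx", providers=["CUDAExecutionProvider"])

def run_pipeline(x_gpu: ort.OrtValue) -> ort.OrtValue:
    # Stage A: input is already on the GPU; output stays there too.
    bind_a = sess_a.io_binding()
    bind_a.bind_ortvalue_input("input", x_gpu)
    bind_a.bind_output("hidden", "cuda")
    sess_a.run_with_iobinding(bind_a)
    hidden_gpu = bind_a.get_outputs()[0]  # OrtValue resident in GPU memory

    # Stage B: consume the GPU-resident intermediate directly;
    # no CPU round trip between the two models.
    bind_b = sess_b.io_binding()
    bind_b.bind_ortvalue_input("hidden", hidden_gpu)
    bind_b.bind_output("output", "cuda")
    sess_b.run_with_iobinding(bind_b)
    return bind_b.get_outputs()[0]
```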
-
Hi, do you plan to support IOBinding for CUDA/TensorRT?
https://stackoverflow.com/questions/70740287/onnxruntime-inference-is-way-slower-than-pytorch-on-gpu
This seems like a critical feature for GPU serving.