Replies: 4 comments 2 replies
-
I'm not sure I understand. To me, it doesn't seem like IOBinding is critical for GPU inference? If ORT automatically copies to/from the CPU (which has to happen anyway, since the data starts out in CPU memory), is explicit binding really needed?
-
Good question. IMO, IOBinding can save two memcpys: one on the input and one on the output.
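For illustration, here is a minimal sketch of that pattern using the official Python onnxruntime bindings (the model path and the "input"/"output" names are placeholders). The input is uploaded to the GPU once, explicitly, and the output is left on the device, so neither of the two implicit copies happens:

```python
import numpy as np
import onnxruntime as ort

# Placeholder model; assumes one input named "input" and one output named "output".
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
binding = sess.io_binding()

# Copy the input to GPU memory once, explicitly.
x = np.random.rand(8, 3, 224, 224).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
binding.bind_ortvalue_input("input", x_gpu)

# Let ORT allocate the output on the GPU instead of copying it back to the CPU.
binding.bind_output("output", "cuda")

sess.run_with_iobinding(binding)

# Copy back to the CPU only if/when the result is actually needed there.
result = binding.copy_outputs_to_cpu()[0]
```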
-
Essentially, in this code (Line 71 in bb2f924), the OrtTensor would be created directly in GPU memory instead of in CPU memory (in the latter case, another memcpy is implicitly performed by ORT). This is not a blocker, but it is very important if we want to do GPU serving efficiently -- GPU serving typically uses bigger batches, so the extra memcpy of the inputs is costly.
-
OK, I reread the code. I guess it is not strictly necessary, since the ndarray would be copied to GPU memory as an OrtTensor, so this is not a critical issue for single-model inference. It would be nice to have for pipelined model inference, though.
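For the pipelined case, the win is that the first model's output can be fed straight into the second model without ever leaving the GPU. A rough sketch with the Python onnxruntime bindings again (the model files and the "input"/"hidden"/"output" names are hypothetical):

```python
import onnxruntime as ort

sess_a = ort.InferenceSession("model_a.onnx", providers=["CUDAExecutionProvider"])
sess_b = ort.InferenceSession("model_b.onnx", providers=["CUDAExecutionProvider"])

def run_pipeline(x_gpu: ort.OrtValue) -> ort.OrtValue:
    # Stage A: input is already on the GPU; output stays there too.
    bind_a = sess_a.io_binding()
    bind_a.bind_ortvalue_input("input", x_gpu)
    bind_a.bind_output("hidden", "cuda")
    sess_a.run_with_iobinding(bind_a)
    hidden_gpu = bind_a.get_outputs()[0]  # OrtValue resident in GPU memory

    # Stage B: consume the GPU-resident intermediate directly;
    # no CPU round trip between the two models.
    bind_b = sess_b.io_binding()
    bind_b.bind_ortvalue_input("hidden", hidden_gpu)
    bind_b.bind_output("output", "cuda")
    sess_b.run_with_iobinding(bind_b)
    return bind_b.get_outputs()[0]
```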
-
Hi, do you plan to support IOBinding for CUDA/TensorRT?
https://stackoverflow.com/questions/70740287/onnxruntime-inference-is-way-slower-than-pytorch-on-gpu
This seems like a critical feature for GPU serving.