-
Hi, I want to ask the BentoML team about two things.
-
Hi @ebbunim. The rough idea for your case now is to use a custom pipeline (https://huggingface.co/docs/transformers/add_new_pipeline). For now, users have to register the custom task to …
-
For Q2, to be clear: most ML frameworks have an internal mechanism for using multiple CPU cores. For example, TensorFlow users can set intra_op_parallelism_threads/inter_op_parallelism_threads to control the number of threads its operations use; by default, that number equals the CPU core count. So in many cases, N instances of model workers mean N × CPU-count threads on the system, and the context-switching overhead drags down throughput.
-
About memory sharing between gunicorn workers with the --preload option: in general we do not recommend this approach. It is not really memory sharing, but simply preloading the model in Python before forking the process. It may work in some cases, but it is tightly coupled with the extension implementation and may not be the most efficient way of accessing a shared model (as @bojiang explained above). I'd recommend going with the 1.0 release. The runner design in BentoML 1.0 is going to solve the "memory sharing" issue you were looking for and avoid the OOM issue. There is currently an issue regarding the transformers custom pipeline; we are working on a fix. See #2534
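For readers unfamiliar with what --preload actually does, here is a minimal POSIX-only sketch of the preload-then-fork pattern, with a stand-in dict instead of a real model. The child sees the parent's memory via copy-on-write, which is why it looks like sharing; but CPython's reference counting writes to every touched object's header, so pages are gradually copied per process, matching the "not really memory sharing" caveat above.

```python
import os

# "Preload": build the model in the parent before forking, exactly as
# gunicorn --preload imports the app (and thus the model) pre-fork.
MODEL = {"weights": list(range(1000))}  # stand-in for a real model

pid = os.fork()  # POSIX only; child gets a copy-on-write view of MODEL
if pid == 0:
    # Worker process: can read MODEL without reloading it from disk.
    # Note: merely touching the object bumps its refcount, dirtying
    # the page, so memory slowly diverges between processes.
    assert len(MODEL["weights"]) == 1000
    os._exit(0)

os.waitpid(pid, 0)  # parent waits for the worker to finish
```

The 1.0 runner design avoids this per-process page duplication by serving the model from dedicated runner processes instead of forked copies.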
-
@ebbunnim