
Support Multi-Node Inference and Serving. #3870

Open · 4 tasks done
Jooho opened this issue Aug 19, 2024 · 0 comments

Comments

Jooho (Contributor) commented Aug 19, 2024

/kind feature

Describe the solution you'd like
As models continue to grow in size, it has become increasingly difficult to fit them into the memory of a single GPU; however, they can often be accommodated within the combined memory of multiple GPUs. Existing techniques such as tensor parallelism and pipeline parallelism allow a model to be partitioned so that it runs in parallel across multiple nodes/GPUs, significantly improving performance. A minimal configuration sketch follows below.
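
For illustration only, here is a minimal sketch of how these two forms of parallelism are typically exposed by a serving engine, using vLLM's offline `LLM` API as an example backend. The model name and parallelism degrees are assumptions for the sketch, not part of this proposal:

```python
# Sketch: loading a model too large for one GPU by combining
# tensor parallelism (shards each layer across GPUs) with
# pipeline parallelism (splits the layer stack across nodes).
# Model name and degrees below are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example large model
    tensor_parallel_size=8,    # shard each layer across 8 GPUs per node
    pipeline_parallel_size=2,  # split layers across 2 nodes
)
# Note: multi-node execution additionally requires a distributed
# runtime (e.g. a Ray cluster) spanning the participating nodes.

outputs = llm.generate(
    ["What is multi-node inference?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The serving-platform side of this feature would need to orchestrate the equivalent setup across pods/nodes (worker discovery, GPU scheduling, and the distributed runtime), which is what this issue proposes.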

Anything else you would like to add:

Links to the design documents:
