Text-Generation-Inference introducing Multi-Backend #2580
## Introduction

Since its inception in 2022, text-generation-inference (TGI) has provided Hugging Face and the AI Community with a performance-focused tool to easily deploy large-language models (LLMs). TGI initially offered an almost no-code solution to load models from the Hugging Face Hub and deploy them in production on NVIDIA GPUs. Over time, support expanded to include AMD Instinct, Intel GPUs, AWS Trainium/Inferentia, Google TPU, and Intel Gaudi.
Suggested change:

Since its initial release in 2022, Text-Generation-Inference (TGI) has provided Hugging Face and the AI Community with a performance-focused solution to easily deploy large-language models (LLMs). TGI initially offered an almost no-code solution to load models from the Hugging Face Hub and deploy them in production on NVIDIA GPUs. Over time, support expanded to include AMD Instinct GPUs, Intel GPUs, AWS Trainium/Inferentia, Google TPU, and Intel Gaudi.
Over the years, multiple inferencing solutions have emerged, including vLLM, SGLang, llama.cpp, TensorRT-LLM, etc., splitting up the overall ecosystem. Different models, hardware, and scenarios often require distinct backends to achieve optimal performance. However, configuring these backends correctly, managing licenses, and integrating them into existing infrastructure can be challenging for users.
Suggested change:

Over the years, multiple inferencing solutions have emerged, including vLLM, SGLang, llama.cpp, TensorRT-LLM, etc., splitting up the overall ecosystem. Different models, hardware, and use cases may work best or require a specific backend to achieve optimal performance. However, configuring each backend correctly, managing licenses, and integrating them into existing infrastructure can be challenging for users.
To address this, we are excited to introduce the concept of TGI backends. This new feature gives the flexibility to integrate with any of the solutions above through a single unified frontend layer: TGI. This change makes it easier for the community to get the best performance for their production workloads, switching backends according to their modeling, hardware, and performance requirements.
Suggested change:

To address this, we are excited to introduce the concept of TGI Backends. This new architecture gives the flexibility to integrate with any of the solutions above through TGI as a single unified frontend layer. This change makes it easier for the community to get the best performance for their production workloads, switching backends according to their modeling, hardware, and performance requirements.
Suggested change:

To address this, we are excited to introduce the concept of TGI Backends. This new feature gives the flexibility to integrate with any of the solutions above through a single unified frontend layer: TGI. This change makes it easier for the community to get the best performance for their production workloads, switching backends according to their modeling, hardware, and performance requirements.

The Hugging Face team is excited to contribute to and collaborate with the teams that build vLLM, llama.cpp, TensorRT-LLM, and the teams at AWS, Google, NVIDIA, AMD and Intel to offer a robust and consistent user experience for TGI users whichever backend and hardware they want to use.
## Looking into 2025

Things are constantly in motion here at Hugging Face, and TGI is no exception. As we look ahead to 2025 we are excited to share some of the upcoming developments for TGI:
Suggested change:

The new multi-backend capabilities of TGI open up many impactful roadmap opportunities. As we look ahead to 2025 we are excited to share some of the TGI developments we are most excited about:
* **NVIDIA TensorRT-LLM backend**: We are collaborating with the NVIDIA TensorRT-LLM team to bring all the astonishing NVIDIA GPUs \+ TensorRT performances to the community. This work will be covered more extensively in an upcoming blog post. It closely relates to our mission to empower AI builders with the open-source availability of both `optimum-nvidia` quantize/build/evaluate TensorRT compatible artifacts alongside TGI+TRTLLM to easily deploy, execute, and scale deployments.
Suggested change:

* **NVIDIA TensorRT-LLM backend**: We are collaborating with the NVIDIA TensorRT-LLM team to bring all the optimized NVIDIA GPUs \+ TensorRT performances to the community. This work will be covered more extensively in an upcoming blog post. It closely relates to our mission to empower AI builders with the open-source availability of both `optimum-nvidia` quantize/build/evaluate TensorRT compatible artifacts alongside TGI+TRTLLM to easily deploy, execute, and scale deployments on NVIDIA GPUs.
* **Llama.cpp backend**: we are collaborating with the llama.cpp team to extend the support for server production use cases. The llama.cpp backend for TGI will provide a strong CPU-based option for anyone willing to deploy on Intel, AMD, or ARM CPU servers.
* **vLLM backend**: we have been contributing to the vLLM project and are looking to integrate vLLM as a TGI backend in Q1 '25.
Suggested change:

* **vLLM backend**: we have been contributing to the vLLM project and are looking to integrate vLLM as a TGI backend in Q1 '25.
* **Neuron backend**: we are working with the Neuron teams at AWS to enable Inferentia 2 and Trainium 2 support natively in TGI
@dacorvo does that sound good for you?
I think we can also mention a Google TPU backend, something like:
* **TPU Backend**: we are working to integrate with the Jetstream teams to provide the best performance and enable it as a TGI backend.
We are convinced that backends will help simplify the deployments of LLMs, bringing versatility and performance to all TGI users.
Suggested change:

We are convinced that TGI backends will help simplify the deployments of LLMs, bringing versatility and performance to all TGI users.
Our vision is to establish TGI as the reference solution powering our *Inference Endpoints* product. With the addition of new backends, users will be able to deploy managed inference models across a wide range of cloud providers and hardware platforms ensuring top-tier performance and reliability.
Suggested change:

You will soon be able to benefit from it directly within [Inference Endpoints](https://huggingface.co/inference-endpoints/), our managed deployment service, as we integrate TGI Backends into the product. As we add support for new TGI Backends, customers will be able to easily deploy models on various hardware with top-tier performance and reliability out of the box.
# Text-Generation-Inference empowering all the AI Builders Community

## Introducing multi-backends support for TGI
Suggested change:

# Introducing TGI Backends: one for all, all for one
@@ -0,0 +1,48 @@
---
title: "Text-Generation-Inference empowering all the AI Builders Community"
Suggested change:

title: "Introducing TGI Backends: one for all, all for one"
I'd go for something more obvious, like just "Introducing multi-backend (TRT-LLM, vLLM) support for TGI".
Stay tuned for the next blog posts to dig into technical details and performance benchmarks of upcoming backends\!

[image1]: <>
(move the image to a hub repo)
No description provided.