add llm sdk integration example (kserve#395)
add sdk integration example

Signed-off-by: Lize Cai <[email protected]>
lizzzcai authored Sep 16, 2024
1 parent 76a1343 commit 6aa3ac6
Showing 3 changed files with 151 additions and 2 deletions.
5 changes: 3 additions & 2 deletions docs/modelserving/v1beta1/llm/huggingface/README.md
@@ -3,7 +3,7 @@ The Hugging Face serving runtime implements two backends namely `Hugging Face` a
The preprocess and post-process handlers are already implemented based on different ML tasks, for example text classification,
token-classification, text-generation, text2text-generation, fill-mask.

KServe Hugging Face runtime by default uses [`vLLM`](https://github.com/vllm-project/vllm) backend to serve `text generation` and `text2text generation` LLM models for faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API.
vLLM is implemented with common inference optimization techniques, such as [PagedAttention](https://vllm.ai), [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) and an optimized CUDA kernel.
If the model is not supported by the vLLM engine, KServe falls back to the Hugging Face backend as a failsafe.

@@ -22,7 +22,7 @@ For information on the models supported by the vLLM backend, please visit [vLLM'
## API Endpoints
Both of the backends support serving generative models (text generation and text2text generation) using [OpenAI's Completion](https://platform.openai.com/docs/api-reference/completions) and [Chat Completion](https://platform.openai.com/docs/api-reference/chat) API.

The other types of tasks like token classification, sequence classification, and fill mask are served using KServe's [Open Inference Protocol](../../../data_plane/v2_protocol.md) or [V1 API](../../../data_plane/v1_protocol.md).

## Examples
The following examples demonstrate how to deploy and perform inference using the Hugging Face runtime with different ML tasks:
@@ -32,6 +32,7 @@ The following examples demonstrate how to deploy and perform inference using the
- [Token Classification using BERT](token_classification/README.md)
- [Sequence Classification (Text Classification) using distilBERT](text_classification/README.md)
- [Fill Mask using BERT](fill_mask/README.md)
- [SDK Integration](sdk_integration/README.md)

!!! note
The Hugging Face runtime image has the following environment variables set by default:
147 changes: 147 additions & 0 deletions docs/modelserving/v1beta1/llm/huggingface/sdk_integration/README.md
@@ -0,0 +1,147 @@
# Integrate KServe LLM Deployment with LLM SDKs

This document provides examples of how to integrate a KServe LLM Inference Service with popular LLM SDKs.

## Deploy a KServe LLM Inference Service

Please follow this example: [Text Generation using LLama3](../text_generation/README.md) to deploy a KServe LLM Inference Service.

Get the `SERVICE_HOSTNAME` by running the following command:

```bash
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

The model name for the above example is `llama3`.
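
Optionally, you can sanity-check the OpenAI-compatible endpoint before wiring up any SDK. The sketch below is a minimal example using the `requests` library; it assumes the service is reachable directly at `SERVICE_HOSTNAME` over plain HTTP without extra headers (adjust the scheme, ingress address, or `Host` header to match how your cluster exposes the service):

```python
import os

import requests

# Hostname captured in the previous step; assumed reachable over plain HTTP.
service_hostname = os.environ["SERVICE_HOSTNAME"]

resp = requests.post(
    f"http://{service_hostname}/openai/v1/chat/completions",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 10,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```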

## How to integrate with OpenAI SDK

Install the [OpenAI SDK](https://github.com/openai/openai-python):

```bash
pip3 install openai
```

Create a Python script to interact with the KServe LLM Inference Service and save it as `sample_openai.py`:

=== "python"
```python
from openai import OpenAI

Deployment_url = "<SERVICE_HOSTNAME>"
client = OpenAI(
base_url=f"{Deployment_url}/openai/v1",
api_key="empty",
)

# typial chat completion response
print("Typical chat completion response:")
response = client.chat.completions.create(
model="llama3",
messages=[
{'role': 'user', 'content': "What's 1+1? Answer in one word."}
],
temperature=0,
max_tokens=256
)

reply = response.choices[0].message
print(f"Extracted reply: \n{reply.content}\n")

# streaming chat completion response
print("Streaming chat completion response:")
stream = client.chat.completions.create(
model='llama3',
messages=[
{'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
],
temperature=0,
max_tokens=300,
stream=True # this time, we set stream=True
)

for chunk in stream:
print(chunk.choices[0].delta.content or "", end="", flush=True)
```

Run the Python script:

```bash
python3 sample_openai.py
```

!!! success "Expected Output"

```{ .bash .no-copy }
Typical chat completion response:
Extracted reply:
Two.

Streaming chat completion response:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
```
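
Besides Chat Completions, the runtime also exposes OpenAI's Completions API, so the same client can issue plain text-completion requests. A minimal sketch, reusing the same base URL and model name as above:

```python
from openai import OpenAI

deployment_url = "<SERVICE_HOSTNAME>"  # same value as in sample_openai.py
client = OpenAI(
    base_url=f"{deployment_url}/openai/v1",
    api_key="empty",
)

# Plain (non-chat) completion against the same model.
completion = client.completions.create(
    model="llama3",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=16,
)
print(completion.choices[0].text)
```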

## How to integrate with LangChain SDK

Install the [LangChain SDK](https://python.langchain.com/v0.2/docs/how_to/installation/#integration-packages):

```bash
pip3 install langchain-openai
```

Create a Python script to interact with the KServe LLM Inference Service and save it as `sample_langchain.py`:

=== "python"
```python
from langchain_openai import ChatOpenAI

Deployment_url = "<SERVICE_HOSTNAME>"

llm = ChatOpenAI(
model_name="llama3",
base_url=f"{Deployment_url}/openai/v1",
openai_api_key="empty",
temperature=0,
max_tokens=256,
)

# typial chat completion response
print("Typical chat completion response:")

messages = [
(
"system",
"You are a helpful assistant that translates English to French. Translate the user sentence.",
),
("human", "I love programming."),
]
reply = llm.invoke(messages)
print(f"Extracted reply: \n{reply.content}\n")

# streaming chat completion response
print("Streaming chat completion response:")
for chunk in llm.stream("Write me a 1 verse song about goldfish on the moon"):
print(chunk.content, end="", flush=True)
```

Run the Python script:

```bash
python3 sample_langchain.py
```

!!! success "Expected Output"

```{ .bash .no-copy }
Typical chat completion response:
Extracted reply:
Je adore le programmation.

Streaming chat completion response:
Here is a 1-verse song about goldfish on the moon:

"In the lunar lake, where the craters shine
A school of goldfish swim, in a celestial shrine
Their scales glimmer bright, like stars in the night
As they dart and play, in the moon's gentle light"
```
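
Because `ChatOpenAI` behaves like any other LangChain chat model, it can also be composed into chains. The sketch below pipes a prompt template into the model; it assumes the same base URL and model name as above (`langchain-core`, which provides `ChatPromptTemplate`, is installed as a dependency of `langchain-openai`):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

deployment_url = "<SERVICE_HOSTNAME>"  # same value as in sample_langchain.py

llm = ChatOpenAI(
    model_name="llama3",
    base_url=f"{deployment_url}/openai/v1",
    openai_api_key="empty",
    temperature=0,
    max_tokens=256,
)

# Compose a prompt template with the model into a simple chain.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that translates English to {language}."),
    ("human", "{text}"),
])
chain = prompt | llm

reply = chain.invoke({"language": "German", "text": "I love programming."})
print(reply.content)
```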
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -49,6 +49,7 @@ nav:
- Token Classification: modelserving/v1beta1/llm/huggingface/token_classification/README.md
- Text Classification: modelserving/v1beta1/llm/huggingface/text_classification/README.md
- Fill Mask: modelserving/v1beta1/llm/huggingface/fill_mask/README.md
- SDK Integration: modelserving/v1beta1/llm/huggingface/sdk_integration/README.md
- TorchServe LLM: modelserving/v1beta1/llm/torchserve/accelerate/README.md
- How to write a custom predictor: modelserving/v1beta1/custom/custom_model/README.md
- Multi Model Serving:
