This README contains instructions to run a demo for vLLM, an open-source library for fast LLM inference and serving that delivers up to 24x higher throughput than HuggingFace Transformers.
Install the latest SkyPilot and check that your cloud credentials are set up:
```bash
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
```
See the vLLM SkyPilot YAML for serving.
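The full file ships with the example; as a rough sketch of its shape (the field values and commands below are illustrative assumptions, not the real file contents), a SkyPilot serving task declares the default model as an environment variable, the GPU resources, a setup step that installs vLLM, and a run step that starts the Gradio demo:

```yaml
# Illustrative sketch of a SkyPilot serving task -- see serve.yaml in the repo for the real file.
envs:
  MODEL_NAME: decapoda-research/llama-65b-hf   # assumed default; override with `sky launch --env MODEL_NAME=...`

resources:
  accelerators: A100:8          # 8x A100 for the 65B model; override with `--gpus`

setup: |
  # Install vLLM and the demo dependencies (assumed commands).
  pip install vllm gradio

run: |
  # Launch the Gradio text-completion demo (entry point name is an assumption).
  python demo.py --model-name $MODEL_NAME
```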
- Start serving the LLaMA-65B model on 8 A100 GPUs:
```bash
sky launch -c vllm-serve -s serve.yaml
```
- Check the output of the command. There will be a shareable Gradio link (like the last line of the following). Open it in your browser to use the LLaMA model for text completion.
```
(task, pid=7431) Running on public URL: https://a8531352b74d74c7d2.gradio.live
```
- Optional: Serve the 13B model instead of the default 65B and use fewer GPUs:
```bash
sky launch -c vllm-serve -s serve.yaml --gpus A100:1 --env MODEL_NAME=decapoda-research/llama-13b-hf
```
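- When you are done with the demo, tear down the cluster to stop incurring charges:
```bash
sky down vllm-serve
```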