This repository aims to evaluate various open-source inference frameworks, analyzing their strengths and weaknesses.
Thanks to the related project pandada8/llm-inference-benchmark (LLM inference service performance testing): https://github.com/ninehills/llm-inference-benchmark
- Hardware: Nvidia H800
- LLM model: Llama-2-7B
- LLM inference engines: llama.cpp, vLLM, fastllm
- Metrics (a small sketch of how they relate follows this list):
  - Throughput
  - Non-first-token latency
  - First-token latency
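For reference, here is a minimal sketch (not the repository's code) of how these three metrics can be computed from hypothetical per-token arrival timestamps:

```python
# A minimal sketch of the three metrics, using hypothetical timestamps
# recorded while streaming a single response.
request_sent = 0.00                      # time the request was issued (s)
token_times = [0.35, 0.40, 0.45, 0.50]   # hypothetical arrival time of each output token (s)

first_token_latency = token_times[0] - request_sent                # TTFT
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
non_first_token_latency = sum(gaps) / len(gaps)                    # mean inter-token latency
throughput = len(token_times) / (token_times[-1] - request_sent)   # output tokens per second

print(f"TTFT: {first_token_latency:.3f}s, "
      f"per-token: {non_first_token_latency:.3f}s, "
      f"throughput: {throughput:.1f} tok/s")
```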
- In the `llama.cpp` directory, run the llama.cpp server:
```bash
~/code/llama.cpp$ CUDA_VISIBLE_DEVICES=0 ./server --n-gpu-layers 999 --host 127.0.0.1 --port 8082 -m models/llama-2-7b/llama-2-7b-7B-F16.gguf
```
- `--n-gpu-layers`: the number of layers to offload to the GPU. Set it to 999 so that all layers are loaded onto the GPU.
- `--host`: the hostname or IP address to listen on. Default: 127.0.0.1.
- `--port`: the port to listen on. Default: 8080.
- `-m`: the model path.

More parameters and details can be found in the llama.cpp and llama.cpp server documentation.
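Before benchmarking, it can help to confirm the server answers. A minimal smoke test (a sketch, not part of this repository), assuming the llama.cpp server's `/completion` endpoint on the host/port above:

```python
# Smoke test: send one short completion request to the llama.cpp server
# started above and print the generated text.
import requests

resp = requests.post(
    "http://127.0.0.1:8082/completion",
    json={"prompt": "Hello, my name is", "n_predict": 16},
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("content"))  # the generated text
```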
- In the `llm_inference_engine_benchmark` directory, run the test code for llama.cpp:
```bash
python benchmark.py --model llama --backend llama.cpp --endpoint http://127.0.0.1:8082
```
## Draw the result

```bash
python draw.py
```
Parameters of `benchmark.py`:
- `--endpoint`: the endpoint (URL) of the server under test.
- `--model llama`: means the model belongs to the Llama family.
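`benchmark.py` itself is not reproduced here, but the following sketch shows how such a client can measure first-token and non-first-token latency by timestamping streamed chunks (assumption: the llama.cpp server streams `/completion` results as SSE `data: {...}` lines):

```python
# Sketch: time the arrival of each streamed chunk to derive
# first-token latency (TTFT) and mean inter-token latency.
import time

import requests

start = time.perf_counter()
chunk_times = []
with requests.post(
    "http://127.0.0.1:8082/completion",
    json={"prompt": "Hello, my name is", "n_predict": 32, "stream": True},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line.startswith(b"data: "):        # one SSE event per generated chunk
            chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start
itl = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"first token: {ttft:.3f}s, mean non-first-token latency: {itl:.3f}s")
```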
- Run the vLLM server:
```bash
python -m vllm.entrypoints.openai.api_server --model model_executor/models/Llama-2-7b-hf/ --host 127.0.0.1 --port 8082
```
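This entrypoint serves an OpenAI-compatible API, so the server can be sanity-checked through the `/v1/completions` route. A sketch, assuming vLLM registers the model under the same name passed to `--model` (the local path):

```python
# Smoke test against vLLM's OpenAI-compatible completions route.
import requests

resp = requests.post(
    "http://127.0.0.1:8082/v1/completions",
    json={
        "model": "model_executor/models/Llama-2-7b-hf/",  # assumed model name: the --model path
        "prompt": "Hello, my name is",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```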
- Run the test code for vLLM:
```bash
python benchmark.py --model llama --backend vllm --endpoint http://127.0.0.1:8082
```
- Run the fastllm server:
```bash
~/code/fastllm/build$ ./webui -p ../mode_zoo/llama2-7b.flm --port 8082
```
- Run the test code for fastllm:
```bash
python benchmark.py --model llama --backend fastllm --endpoint http://127.0.0.1:8082
```