Trying to serve an LLM locally with an OpenAI-like API.
To spin up the server, run
$ rye sync
$ rye run llm-server
Then you can check the API with curl:
$ curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Tell me the answer to the ultimate question of life, the universe, and everything."}' http://localhost:5000/generate
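
For reference, the /generate endpoint behind that curl call can be sketched roughly as follows. This is a minimal illustration assuming Flask plus the Hugging Face transformers pipeline and the Gemma model mentioned below; it is not necessarily the exact entry point that `rye run llm-server` starts.

    # sketch_server.py -- a minimal illustration, not the project's actual code
    from flask import Flask, jsonify, request
    from transformers import pipeline

    app = Flask(__name__)

    # Load the model once at import time (see the note about Flask's reloader below).
    generator = pipeline(
        "text-generation",
        model="google/gemma-2-2b-it",  # assumed Hugging Face model id
        device=0,                      # the single CUDA device (the GTX 1070)
    )

    @app.route("/generate", methods=["POST"])
    def generate():
        prompt = request.get_json()["prompt"]
        outputs = generator(prompt, max_new_tokens=256, return_full_text=False)
        return jsonify({"text": outputs[0]["generated_text"]})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)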
-
Since my GPU (a GTX 1070) is quite old, I decided to use the instruction-tuned Gemma-2-2B. The card's compute capability (CC) is 6.1, well below the CC 7.0 required by the Triton compiler and vLLM, which rules those out and calls for a small model.
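
To verify what your own card supports, the compute capability can be read straight from PyTorch (a quick check, assuming torch is already installed in the environment):

    import torch

    # A GTX 1070 reports (6, 1); Triton and vLLM want (7, 0) or newer.
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))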
-
When running under Flask, the main script gets executed twice (the debug reloader spawns a second process that re-imports the module), so loading the model in the global namespace loaded it twice and blew up my GPU memory.
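
A sketch of the workaround, assuming the debug reloader is what triggers the double import; the guard below is one common pattern, not necessarily the exact fix used in this project:

    import os
    from transformers import pipeline

    def load_model():
        # Stand-in for whatever actually builds the Gemma pipeline.
        return pipeline("text-generation", model="google/gemma-2-2b-it", device=0)

    # Werkzeug's reloader imports this module twice: once in the watcher process
    # and once in the child that actually serves requests. Only the child sets
    # WERKZEUG_RUN_MAIN, so guarding on it keeps the model from landing on the
    # GPU twice. (This assumes the reloader is enabled; without it, load
    # unconditionally.)
    if os.environ.get("WERKZEUG_RUN_MAIN") == "true":
        model = load_model()
    else:
        model = None

    # The simpler alternative is to turn the reloader off entirely:
    #   app.run(port=5000, debug=True, use_reloader=False)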