
docs: Added LLM example #1545

Merged · 12 commits · Jul 8, 2024
25 changes: 25 additions & 0 deletions examples/llama-index-inference/README.md
@@ -0,0 +1,25 @@
# LLM Inference with `llama-index` and `llama.cpp`

This AI / machine learning example shows how to run LLM inference with a local model downloaded from Hugging Face. It uses a mix of `conda-forge` packages and PyPI packages to get the proper, compatible versions; a sketch of how that mix can be expressed in a `pixi.toml` manifest is shown below.

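The following `pixi.toml` sketch is illustrative only: it shows one way to combine `conda-forge` and PyPI dependencies for this kind of setup, but the actual manifest shipped with the example may pin different packages, platforms, and versions.

```toml
[project]
name = "llama-index-inference"
channels = ["conda-forge"]
platforms = ["osx-arm64", "linux-64"]

[tasks]
# `pixi run start` executes the inference script
start = "python inference.py"

[dependencies]
# resolved from conda-forge
python = "3.11.*"
llama-cpp-python = "*"

[pypi-dependencies]
# resolved from PyPI
llama-index-llms-llama-cpp = "*"
```
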
Run the example with:

```bash
$ pixi run start
```

The source code is derived from [the LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp/). This particular set of tools and libraries was selected to show that production-grade deployments are possible with Pixi. The libraries selected here are fairly lightweight and still run a very capable model locally. This was the performance I observed on my local M1 Max machine:

```bash
llama_print_timings: load time = 2043.14 ms
llama_print_timings: sample time = 22.51 ms / 247 runs ( 0.09 ms per token, 10973.88 tokens per second)
llama_print_timings: prompt eval time = 2043.03 ms / 71 tokens ( 28.78 ms per token, 34.75 tokens per second)
llama_print_timings: eval time = 17786.87 ms / 246 runs ( 72.30 ms per token, 13.83 tokens per second)
llama_print_timings: total time = 19959.11 ms / 317 tokens
```

Opportunities for improvement:

- Modify for Linux / CUDA environments to demonstrate a more practical production stack.
- Enhance the pipeline with a RAG workflow, which is what LlamaIndex excels at (see the sketch after this list).
- Experiment with different GGUF models for a quality / performance balance that fits your hardware.
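As a rough illustration of the RAG bullet above, the sketch below wires the same local `LlamaCPP` model into a minimal LlamaIndex retrieval pipeline. It is not part of this example: it assumes the extra packages `llama-index-core` and `llama-index-embeddings-huggingface`, a local `./data` directory of documents, and the `BAAI/bge-small-en-v1.5` embedding model, none of which ship with the PR.

```python
# rag_sketch.py -- minimal RAG pipeline on top of the local model (illustrative only)
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

# the same local model as inference.py (see that file for the full set of options)
llm = LlamaCPP(
    model_url="https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.Q5_K_M.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 1},
)

# register the local LLM and a small open embedding model as LlamaIndex defaults
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# index a local folder of documents and answer a question grounded in them
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What do these documents say about fast cars?"))
```
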
33 changes: 33 additions & 0 deletions examples/llama-index-inference/inference.py
@@ -0,0 +1,33 @@
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

# Source code derived from: https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp/

model_url = "https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.Q5_K_M.gguf"

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # the model supports a larger context window, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama 2 prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)