
docs: Added LLM example #1545

Merged · 12 commits · Jul 8, 2024
25 changes: 25 additions & 0 deletions examples/llama-index-inference/README.md
@@ -0,0 +1,25 @@
# LLM Inference with `llama-index` and `llama.cpp`

This AI / machine learning example shows how to run LLM inference with a local model downloaded from Hugging Face. It uses a mix of `conda-forge` packages and PyPI packages to get the proper, compatible versions; a sketch of how that mix can be expressed in a `pixi.toml` manifest is shown below.

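The following `pixi.toml` sketch is illustrative only: it shows one way to combine `conda-forge` and PyPI dependencies for this kind of setup, but the actual manifest shipped with the example may pin different packages, platforms, and versions.

```toml
[project]
name = "llama-index-inference"
channels = ["conda-forge"]
platforms = ["osx-arm64", "linux-64"]

[tasks]
# `pixi run start` executes the inference script
start = "python inference.py"

[dependencies]
# resolved from conda-forge
python = "3.11.*"
llama-cpp-python = "*"

[pypi-dependencies]
# resolved from PyPI
llama-index-llms-llama-cpp = "*"
```
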
Run the example with:

```bash
$ pixi run start
```

The source code is derived from [the LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp/). This particular set of tools and libraries was selected to show that production-grade deployments are possible with Pixi. The libraries selected here are fairly lightweight and still run a very capable model locally. This was the performance I observed on my local M1 Max machine:

```bash
llama_print_timings: load time = 2043.14 ms
llama_print_timings: sample time = 22.51 ms / 247 runs ( 0.09 ms per token, 10973.88 tokens per second)
llama_print_timings: prompt eval time = 2043.03 ms / 71 tokens ( 28.78 ms per token, 34.75 tokens per second)
llama_print_timings: eval time = 17786.87 ms / 246 runs ( 72.30 ms per token, 13.83 tokens per second)
llama_print_timings: total time = 19959.11 ms / 317 tokens
```

Opportunities for improvement:

- Modify for Linux / CUDA environments to demonstrate a more practical production stack.
- Enhance the pipeline with a RAG workflow, which is what LlamaIndex excels at (see the sketch after this list).
- Experiment with different GGUF models for a quality / performance balance that fits your hardware.
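As a rough illustration of the RAG bullet above, the sketch below wires the same local `LlamaCPP` model into a minimal LlamaIndex retrieval pipeline. It is not part of this example: it assumes the extra packages `llama-index-core` and `llama-index-embeddings-huggingface`, a local `./data` directory of documents, and the `BAAI/bge-small-en-v1.5` embedding model, none of which ship with the PR.

```python
# rag_sketch.py -- minimal RAG pipeline on top of the local model (illustrative only)
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

# the same local model as inference.py (see that file for the full set of options)
llm = LlamaCPP(
    model_url="https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.Q5_K_M.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 1},
)

# register the local LLM and a small open embedding model as LlamaIndex defaults
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# index a local folder of documents and answer a question grounded in them
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What do these documents say about fast cars?"))
```
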
33 changes: 33 additions & 0 deletions examples/llama-index-inference/inference.py
@@ -0,0 +1,33 @@
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

# Source code derived from: https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp/

model_url = "https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.Q5_K_M.gguf"

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # the model supports a larger context window, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama 2 prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)