
Commit

update local and cover
souzatharsis committed Dec 21, 2024
1 parent 5733cb3 commit d2d8742
Showing 40 changed files with 854 additions and 79 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
Receive updates on [new Chapters here](https://tamingllm.substack.com/).

<a href="https://www.souzatharsis.com/tamingLLMs" target="_blank">
<img src="tamingllms/_static/tamingcoverv1.jpg" style="background-color:white; width:50%;" alt="Taming LLMs Cover" />
<img src="tamingllms/_static/cover_curve.png" style="background-color:white; width:50%;" alt="Taming LLMs Cover" />
</a>

Please [open an issue](https://github.com/souzatharsis/tamingLLMs/issues) with your feedback or suggestions!
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/markdown/toc.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/local.doctree
Binary file not shown.
Binary file modified tamingllms/_build/html/_images/arc.png
Binary file modified tamingllms/_build/html/_images/harmbench.png
Binary file modified tamingllms/_build/html/_images/langsmith.png
Binary file modified tamingllms/_build/html/_images/langsmith_dataset.png
Binary file modified tamingllms/_build/html/_images/lighteval.png
Binary file modified tamingllms/_build/html/_images/model-comparison.png
Binary file modified tamingllms/_build/html/_images/promptfoo1.png
Binary file added tamingllms/_build/html/_images/tg.png
2 changes: 1 addition & 1 deletion tamingllms/_build/html/_sources/markdown/toc.md
@@ -7,7 +7,7 @@ date: "2024-12-16"
Sign-up to receive updates on [new Chapters here](https://tamingllm.substack.com/).

<a href="https://www.souzatharsis.com/tamingLLMs" target="_blank">
<img src="../_static/tamingcoverv1.jpg" style="background-color:white; width:50%;" alt="Taming LLMs Cover" />
<img src="../_static/cover_curve.png" style="background-color:white; width:50%;" alt="Taming LLMs Cover" />
</a>

---
84 changes: 67 additions & 17 deletions tamingllms/_build/html/_sources/notebooks/local.ipynb
@@ -21,15 +21,19 @@
"source": [
"## Introduction\n",
"\n",
"Running LLMs locally versus using cloud APIs offers several important advantages.\n",
"Running LLMs locally versus using cloud APIs represents more than just a technical choice - it's a fundamental reimagining of how we interact with AI technology, putting control back in the hands of users and organizations.\n",
"\n",
"Privacy concerns are a key driver for running LLMs locally. Individual users may want to process personal documents, photos, emails, and chat messages without sharing sensitive data with third parties. For enterprise use cases, organizations handling medical records must comply with HIPAA regulations that require data to remain on-premise. Similarly, businesses processing confidential documents and intellectual property, as well as organizations subject to GDPR and other privacy regulations, need to maintain strict control over their data processing pipeline.\n",
"\n",
"Cost considerations are another key advantage of local deployment. Organizations can better control expenses by matching model capabilities to their specific needs rather than paying for potentially excessive cloud API features. For high-volume applications, this customization and control over costs becomes especially valuable compared to the often prohibitive per-request pricing of cloud solutions.\n",
"\n",
"Applications with stringent latency requirements form another important category. Real-time systems where network delays would be unacceptable, edge computing scenarios demanding quick responses, and interactive applications requiring sub-second performance all benefit from local deployment. This extends to embedded systems in IoT devices where cloud connectivity might be unreliable or impractical. Further, the emergence of Small Language Models (SLMs) has made edge deployment increasingly viable, enabling sophisticated language capabilities on resource-constrained devices like smartphones, tablets and IoT sensors. \n",
"\n",
"Running locally also enables fine-grained optimization of resource usage and model characteristics based on target use case. Organizations can perform specialized domain adaptation through model modifications, experiment with different architectures and parameters, and integrate models with proprietary systems and workflows. This flexibility is particularly valuable for developing novel applications that require direct model access and manipulation. "
"Running locally also enables fine-grained optimization of resource usage and model characteristics based on target use case. Organizations can perform specialized domain adaptation through model modifications, experiment with different architectures and parameters, and integrate models with proprietary systems and workflows. This flexibility is particularly valuable for developing novel applications that require direct model access and manipulation. \n",
" \n",
"However, local deployment introduces its own set of challenges and considerations. In this Chapter, we explore the landscape of local LLM deployment focused on Open Source models and tools. When choosing a local open source model, organizations must carefully evaluate several interconnected factors, from task suitability and performance requirements to resource constraints and licensing terms. Security, privacy, and long-term strategic fit also play crucial roles in this decision-making process.\n",
" \n",
"We also cover key tools enabling local model serving and inference, including open source solutions such as LLama.cpp, Llamafile, and Ollama, along with user-friendly frontend interfaces that make local LLM usage more accessible. We conclude with a detailed case study, analyzing how different quantization approaches impact model performance in resource-constrained environments. This analysis reveals the critical tradeoffs between model size, inference speed, and output quality that practitioners must navigate."
]
},
{
@@ -55,7 +59,7 @@
"source": [
"### Serving Models\n",
"\n",
"Before exploring specific tools, it's important to understand what \"serving\" an LLM model means in practice. Serving refers to the process of making a trained language model available for inference. At a high level, this involves setting up the infrastructure needed to accept and process input text and generate responses while efficiently managing system resources. The serving process involves several key responsibilities:\n",
"Serving an LLM model involves making it available for inference by setting up infrastructure to process requests and manage resources efficiently. This serving layer handles several key responsibilities, from loading model weights and managing compute resources to processing requests and optimizing performance. Let's examine the core components of model serving:\n",
"\n",
"1. **Model Loading and Initialization**\n",
"- Loading the trained model weights and parameters into memory\n",
@@ -82,7 +86,7 @@
"- Monitoring system resource utilization\n",
"\n",
"\n",
"The serving layer acts as the bridge between the trained model and applications while working on top of a hardware stack as shown in {numref}`local_inference`. Getting this layer right is crucial for building locally-served reliable AI-powered applications, as it directly impacts the end-user experience in terms of response times, reliability, and resource efficiency. \n",
"The serving layer acts as the bridge between the LLM and applications while working on top of a hardware stack as shown in {numref}`local_inference`. Getting this layer right is crucial for building locally-served reliable AI-powered applications, as it directly impacts the end-user experience in terms of response times, reliability, and resource efficiency. \n",
"\n",
"```{figure} ../_static/local/local_inference.svg\n",
"---\n",
@@ -94,7 +98,7 @@
"Local Inference Server.\n",
"```\n",
"\n",
"There are several key tools for serving local LLMs. We will cover the following:\n",
"Model inference can be performed on Open Source models using cloud solutions such as Groq, Cerebras Systems, and SambaNova Systems. Here, we limit our scope to Open Source solutions that enable inference on local machines which includes consumer hardware. We will cover the following:\n",
"\n",
"- **LLama.cpp**: A highly optimized C++ implementation for running LLMs on consumer hardware\n",
"- **Llamafile**: A self-contained executable format by Mozilla for easy model distribution and deployment\n",
@@ -111,9 +115,9 @@
"\n",
"LLama.cpp {cite}`ggerganov2024llamacpp` is an MIT-licensed open source optimized implementation of the **LLama** model architecture designed to run efficiently on machines with limited memory.\n",
"\n",
"Originally developed by Georgi Gerganov and today counting with hundreds of contributors, this C/C++ version provides a simplified interface and advanced features that allow language models to run without overwhelming systems. With the ability to run in resource-constrained environments, LLama.cpp makes powerful language models more accessible and practical for a variety of applications.\n",
"Originally developed by Georgi Gerganov and today counting with hundreds of contributors, this C/C++ LLama version provides a simplified interface and advanced features that allow language models to run locally without overwhelming systems. With the ability to run in resource-constrained environments, LLama.cpp makes powerful language models more accessible and practical for a variety of applications.\n",
"\n",
"In its \"Manifesto\" {cite}`ggerganov2023llamacppdiscussion`, the author sees significant potential in bringing AI from cloud to edge devices, emphasizing the importance of keeping development lightweight, experimental, and enjoyable rather than getting bogged down in complex engineering challenges. The author states a vision that emphasizes maintaining an exploratory, hacker-minded approach while building practical edge computing solutions highlighting the following core principles:\n",
"In its \"Manifesto\" {cite}`ggerganov2023llamacppdiscussion`, the author highlights the significant potential in bringing AI from cloud to edge devices, emphasizing the importance of keeping development lightweight, experimental, and enjoyable rather than getting bogged down in complex engineering challenges. The author states a vision that emphasizes maintaining an exploratory, hacker-minded approach while building practical edge computing solutions highlighting the following core principles:\n",
"\n",
"- \"Will remain open-source\"\n",
"- Focuses on simplicity and efficiency in codebase\n",
@@ -123,7 +127,7 @@
"\n",
"LLama.cpp implementation characteristics include:\n",
"\n",
"1. **Memory Efficiency**: The main advantage of LLama.cpp is its ability to reduce memory requirements, allowing users to run large language models on at the edge.\n",
"1. **Memory Efficiency**: The main advantage of LLama.cpp is its ability to reduce memory requirements, allowing users to run large language models at the edge for instance offering ease of model quantization.\n",
"\n",
"2. **Computational Efficiency**: Besides reducing memory usage, LLama.cpp also focuses on improving execution efficiency, using specific C++ code optimizations to accelerate the process.\n",
"\n",
@@ -689,12 +693,16 @@
"- Q4_K quantization (balanced compression/precision)\n",
"- Q6_K quantization (lowest compression, highest precision)\n",
"\n",
"The analysis will focus on three key metrics:\n",
"1. Perplexity - to measure how well the model predicts text\n",
"2. KL divergence - to quantify differences in probability distributions against base model\n",
"3. Prompt (tokens/second) - to assess impact in thoughput\n",
"The analysis will focus on three key types of metrics:\n",
"- **Quality-based**:\n",
" 1. Perplexity - to measure how well the model predicts text\n",
" 2. KL divergence - to quantify differences in probability distributions against base model\n",
"- **Resource/Performance-based**:\n",
" 1. Prompt (tokens/second) - to assess impact in throughput\n",
" 2. Text Generation (tokens/second) - to assess impact in text generation performance\n",
" 3. Model Size (MiB) - to assess impact in memory footprint\n",
"\n",
"While we will focus on the Qwen 2.5 0.5B model, the same analysis can be applied to other models. These insights will help practitioners make informed decisions about quantization strategies based on their specific requirements for model size, speed, and accuracy."
"While we will focus on the Qwen 2.5 0.5B model, the same analysis can be applied to other models. These insights will help practitioners make informed decisions about quantization strategies based on their specific requirements for model performance and resource usage."
]
},
{
@@ -921,17 +929,51 @@
"| **Base** | 1,170.00 | 94.39 | - | - | - |\n",
"```\n",
"\n",
"Next, we benchmark text generation (inference) performance using `llama-bench` across all models:\n",
"\n",
"```bash\n",
"./build/bin/llama-bench -r 10 -t 4 -m ./models/qwen2.5-0.5b-instruct-fp16.gguf -m ./models/qwen2.5-0.5b-instruct-q2_k.gguf -m ./models/qwen2.5-0.5b-instruct-q4_k_m.gguf -m ./models/qwen2.5-0.5b-instruct-q6_k.gguf\n",
"```\n",
"\n",
"The benchmark parameters are:\n",
"- `-r 10`: Run 10 iterations for each model\n",
"- `-t 4`: Use 4 threads\n",
"- `-m`: Specify model paths for base FP16 model and Q2, Q4, Q6 quantized versions\n",
"\n",
"This runs text generation on a default benchmark of 128 tokens generation length (configurable via `-g` parameter).\n",
"\n",
"Results in {numref}`tg` indicates the base model delivers text generation performance at 19.73 tokens/s, while the most aggressively quantized Q2 model (390.28 MiB) delivers the highest throughput at 42.62 tokens/s, representing a 2.16x speedup. This pattern continues across Q4 (462.96 MiB, 38.38 tokens/s) and Q6 (614.58 MiB, 35.43 tokens/s), which presents a 1.85x and 1.79x speedup, respectively.\n",
"\n",
"```{figure} ../_static/local/tg.png\n",
"---\n",
"name: tg\n",
"alt: Text Generation Performance\n",
"scale: 50%\n",
"align: center\n",
"---\n",
"Text Generation Performance results for Quantization Q2, Q4, Q6 and base models.\n",
"```\n",
"\n",
"\n",
"Benchmarking was performed on Ubuntu 24.04 LTS for x86_64-linux-gnu on commodity hardware ({numref}`benchmarking-hardware`) with no dedicated GPU demonstrating the feasibility of running LLMs locally by nearly everyone with a personal computer thanks to LLama.cpp.\n",
"\n",
"```{table} Benchmarking Hardware\n",
":align: center\n",
":name: benchmarking-hardware\n",
"| Device | Class | Description |\n",
"|--------|--------|-------------|\n",
"| processor | Intel(R) Core(TM) i7-8550U CPU @ 1 | Intel(R) Core(TM) i7-8550U CPU @ 1 |\n",
"| memory | 15GiB System memory | 15GiB System memory |\n",
"| storage | Samsung SSD 970 EVO Plus 500GB | Samsung SSD 970 EVO Plus 500GB |\n",
"```"
"| processor | Intel(R) Core(TM) i7-8550U CPU @ 1 |\n",
"| memory | 15GiB System memory |\n",
"| storage | Samsung SSD 970 EVO Plus 500GB |\n",
"```\n",
"\n",
"### Takeaways\n",
"\n",
"The quantization analysis of the Qwen 2.5 0.5B model demonstrates a clear trade-off among model size, inference speed, and prediction quality. While the base model (1170 MiB) maintains the highest accuracy it operates at the lowest text generation and prompt throughput of 19.73 tokens/s and 94.39 tokens/s, respectively. In contrast, the Q2_K quantization achieves remarkable size reduction (67%) and the highest throughput (42.62 tokens/s), but exhibits the largest quality degradation with a 10.36% perplexity increase and lowest KL divergence among quantized models. Q4_K emerges as a compelling middle ground, offering substantial size reduction (60%) and strong text generation and prompt throughput performance (38.38 tokens/s and 77.08 tokens/s, respectively), while maintaining good model quality with only 3.5% perplexity degradation and middle-ground KL divergence level. \n",
"\n",
"These results, achieved on commodity CPU hardware, demonstrate that quantization can significantly improve inference speed and reduce model size while maintaining acceptable quality thresholds, making large language models more accessible for resource-constrained environments.\n",
"\n",
"It is important to note that these results are not meant to be exhaustive and are only meant to provide a general idea of the trade-offs involved in quantization. Targeted benchmarks should be performed for specific use cases and models to best reflect real-world performance."
]
},
{
@@ -941,6 +983,14 @@
"## Conclusion\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our case study demonstrated that quantization can significantly improve inference speed and reduce model size while maintaining acceptable quality thresholds, making large language models more accessible for resource-constrained environments.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
5 changes: 4 additions & 1 deletion tamingllms/_build/html/_static/check-solid.svg
6 changes: 5 additions & 1 deletion tamingllms/_build/html/_static/copy-button.svg