Commit

update evals
souzatharsis committed Dec 10, 2024
1 parent dbbd3fd commit 9a9695d
Showing 8 changed files with 145 additions and 81 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file not shown.
37 changes: 28 additions & 9 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -14,6 +14,16 @@
"```\n",
"```{contents}\n",
"```\n",
"\n",
"## Introduction\n",
"\n",
"The advent of LLMs marks a pivotal shift in the landscape of software development and evaluation. Unlike traditional software systems, where deterministic outputs are the norm, LLMs introduce a realm of non-deterministic and generative behaviors that challenge conventional software engineering testing paradigms. This shift is not merely a technical evolution but a fundamental transformation in how we conceive, build, and assess software products.\n",
"\n",
"For those entrenched in traditional methodologies, the transition to LLM-driven systems may seem daunting. However, ignoring this change is not an option. The reliance on outdated testing frameworks that fail to account for the probabilistic nature of LLMs will inevitably lead to significant setbacks.\n",
"\n",
"To overcome these challenges, it is imperative to embrace the complexities of LLMs with a proactive mindset. This involves developing robust evaluation frameworks up-front and fostering a product development culture of continuous learning and adaptation.\n",
"\n",
"\n",
"## Non-Deterministic Generative Machines\n",
"\n",
"One of the most fundamental challenges when building products with Large Language Models (LLMs) is their generative and non-deterministic nature. Unlike traditional software systems, where the same input reliably produces the same output, LLMs can generate novel text that may not exist in their training data and produce different responses each time they're queried, even with identical prompts and input data. This behavior is both a strength and a significant engineering and product challenge.\n",
@@ -26,21 +36,12 @@
"- Regulatory compliance becomes challenging to guarantee\n",
"- User trust may be affected by inconsistent responses\n",
"\n",
"### Temperature and Sampling\n",
"\n",
"The primary source of non-determinism in LLMs comes from their sampling strategies. During text generation, the model:\n",
"1. Calculates probability distributions for each next token\n",
"2. Samples from these distributions based on temperature settings\n",
"3. Uses techniques like nucleus sampling {cite}`holtzman2020curiouscaseneuraltext` or top-k sampling to balance creativity and coherence\n",
"\n",
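The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's decoding implementation; the function name `sample_next_token` and its parameters are assumptions for the example.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token index from raw logits.

    Mirrors the three steps above: scale logits by temperature,
    optionally keep only the top-k candidates, then sample from
    the resulting probability distribution.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature <= 0:  # greedy decoding: always pick the most likely token
        return int(np.argmax(logits))
    scaled = logits / temperature
    if top_k is not None:  # mask everything outside the k most likely tokens
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1, -1.0]
print(sample_next_token(logits, temperature=0))             # always index 0
print(sample_next_token(logits, temperature=1.0, top_k=2))  # index 0 or 1
```

Nucleus (top-p) sampling works analogously, except the cutoff keeps the smallest set of tokens whose cumulative probability exceeds p rather than a fixed count.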
"### The Temperature Spectrum\n",
"\n",
"- Temperature = 0: Most deterministic, but potentially repetitive\n",
"- Temperature = 1: Balanced creativity and coherence\n",
"- Temperature > 1: Increased randomness, potentially incoherent\n",
"\n",
"A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature closer to 0 sharpens the distribution, so the most likely token will have an even higher probability score. Conversely, increasing the temperature makes the distribution more uniform {cite}`build-llms-from-scratch-book`.\n",
"\n",
"In this simple experiment, we use an LLM to write a single-statement executive summary of an input financial filing. We observe that even a simple parameter like temperature can dramatically alter model behavior in ways that are difficult to systematically assess. At temperature 0.0, responses are consistent but potentially too rigid. At 1.0, outputs become more varied but less predictable. At 2.0, responses can be wildly different and often incoherent. This non-deterministic behavior makes traditional software testing approaches inadequate."
]
},
@@ -175,6 +176,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature closer to 0 sharpens the distribution, so the most likely token will have an even higher probability score. Conversely, increasing the temperature makes the distribution more uniform {cite}`build-llms-from-scratch-book`:\n",
"- Temperature = 0: Most deterministic, but potentially repetitive\n",
"- Temperature = 1: Balanced creativity and coherence\n",
"- Temperature > 1: Increased randomness, potentially incoherent\n",
"\n",
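The sharpening and flattening effect is easy to verify numerically. The sketch below, with an illustrative four-token vocabulary, shows how the probability of the most likely token changes as temperature moves from 0.1 to 2.0:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before the (numerically stable) softmax.
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token probability = {probs.max():.3f}")
```

Low temperature concentrates nearly all probability mass on the top token (approaching greedy decoding), while high temperature spreads mass across the vocabulary, which is where incoherence creeps in.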
"How can one effectively test an LLM-powered system when the same prompt can yield radically different outputs based on a single parameter? Traditional testing relies on predictable inputs and outputs, but LLMs force us to grapple with probabilistic behavior. While lower temperatures may seem safer for critical applications, they don't necessarily eliminate the underlying uncertainty. This highlights the need for new evaluation paradigms that can handle both deterministic and probabilistic aspects of LLM behavior."
]
},
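One practical answer to the question above is to test properties of the output distribution rather than exact strings: run the model several times and assert that a semantic check passes at an acceptable rate. The sketch below is an assumed harness, with `pass_rate` and the stand-in `fake_llm` invented for illustration; a real version would call an actual model.

```python
import random

def pass_rate(generate, check, prompt, n_trials=20, seed=0):
    """Estimate how often a stochastic generator satisfies a property.

    Instead of asserting one exact output, run the generator n times
    and measure the fraction of outputs that pass a semantic check.
    """
    rng = random.Random(seed)
    passes = sum(check(generate(prompt, rng)) for _ in range(n_trials))
    return passes / n_trials

# Toy stand-in for an LLM call: returns a "summary" of random length.
def fake_llm(prompt, rng):
    return " ".join("word" for _ in range(rng.randint(5, 30)))

# Property under test: the summary must be a single short statement.
rate = pass_rate(fake_llm, lambda out: len(out.split()) < 25, "summarize: ...")
print(f"pass rate: {rate:.0%}")
```

A CI gate would then assert `rate >= threshold` instead of comparing against a golden output, which tolerates sampling variation while still catching regressions.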
@@ -2530,6 +2536,19 @@
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"Language models have fundamentally transformed how software is developed and evaluated. Unlike conventional systems that produce predictable outputs, LLMs generate varied, probabilistic responses that defy traditional testing approaches. While developers accustomed to deterministic systems may find this shift challenging, continuing to rely on legacy testing methods is unsustainable. These frameworks were not designed to handle the inherent variability of LLM outputs and will ultimately prove inadequate. \n",
"\n",
"Success requires embracing this new paradigm: implementing comprehensive evaluation strategies early, treating them as the new Product Requirements Document (PRD), and cultivating an organizational mindset focused on iteration, experimentation, and growth.\n",
"\n",
"The shift from traditional software testing to LLM evaluation is not just a change in tools but a transformation in mindset. Those who recognize and adapt to this shift will lead the way in harnessing the power of LLMs. However, the cost of inaction is not just technological stagnation, but potential business failure."
]
},
{
"cell_type": "markdown",
"metadata": {},
