
Commit

Several corrections to the structured, evals, safety, and input chapters
souzatharsis committed Dec 30, 2024
1 parent 6c8140a commit a3b2c0e
Showing 43 changed files with 6,570 additions and 836 deletions.
16 changes: 15 additions & 1 deletion poetry.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -48,6 +48,7 @@ accelerate = "^1.2.1"
markitdown = "^0.0.1a3"
docling = "^2.14.0"
python-levenshtein = "^0.26.1"
sphinx-math-dollar = "^1.2.1"


[build-system]
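For context, the newly added `sphinx-math-dollar` dependency lets Sphinx render `$...$` and `$$...$$` math in the book's Markdown and notebook sources. A minimal sketch of how the extension is typically enabled in a Sphinx `conf.py` follows; the repository's actual `conf.py` is not part of this diff, so the surrounding entries are assumptions.

```python
# conf.py (sketch): enable dollar-sign math rendering via sphinx-math-dollar.
# Only "sphinx_math_dollar" relates to the newly added dependency; the other
# entry is a placeholder for whatever extensions the project already uses.
extensions = [
    "sphinx_math_dollar",  # parse $...$ and $$...$$ as inline/display math
    "sphinx.ext.mathjax",  # render the parsed math with MathJax
]
```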
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file modified tamingllms/_build/.doctrees/markdown/preface.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/alignment.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/cost.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/input.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/local.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/safety.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/structured_output.doctree
125 changes: 124 additions & 1 deletion tamingllms/_build/html/_images/dpo_eval.svg
880 changes: 879 additions & 1 deletion tamingllms/_build/html/_images/llm_judge.svg
883 changes: 882 additions & 1 deletion tamingllms/_build/html/_images/meta2.svg
47 changes: 15 additions & 32 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -1440,7 +1440,8 @@
"\n",
"We would like to compare the performance of multiple open source models on the MMLU econometrics task. While we could download and evaluate each model locally, we prefer instead to evaluate them on a remote server to save time and resources. LightEval enables serving the model on a TGI-compatible server/container and then running the evaluation by sending requests to the server {cite}`lighteval_server`. \n",
"\n",
"For that purpose, we can leverage HuggingFace Serverless Inference API (or dedicated inference API) and set a configuration file for LightEval as shown below, where `<MODEL-ID>` is the model identifier on HuggingFace (e.g. `meta-llama/Llama-3.2-1B-Instruct`) and `<HUGGINGFACE-TOKEN>` is the user's HuggingFace API token.\n",
"For that purpose, we can leverage HuggingFace Serverless Inference API [^lightevalbug] and set a configuration file for LightEval as shown below, where `<MODEL-ID>` is the model identifier on HuggingFace (e.g. `meta-llama/Llama-3.2-1B-Instruct`) and `<HUGGINGFACE-TOKEN>` is the user's HuggingFace API token. Alternatively, you could also pass an URL of a corresponding dedicated inference API if you have one.\n",
"[^lightevalbug]: We found a bug in LightEval that prevented it from working with the HuggingFace Serverless Inference API: https://github.com/huggingface/lighteval/issues/422. Thanks to the great work of the LightEval team, this issue has been fixed.\n",
"```\n",
"model:\n",
" type: \"tgi\"\n",
@@ -2251,44 +2252,26 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"### PromptFoo Evaluation Results"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" provider latency_ms token_usage cost assert_pass \\\n",
"0 openai:gpt-4o-mini 2463 97 0.000035 6 \n",
"1 openai:gpt-4 3773 103 0.004620 4 \n",
"2 openai:gpt-3.5-turbo 1669 95 0.000091 7 \n",
"\n",
" assert_fail prompt_tokens num_requests \n",
"0 2 52 2 \n",
"1 4 52 2 \n",
"2 1 52 2 \n"
]
}
],
"outputs": [],
"source": [
"# Convert to DataFrame\n",
"df = pd.DataFrame(results)\n",
"display(Markdown(\"### PromptFoo Evaluation Results\"))\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Provider | Latency (ms) | Token Usage | Cost | Assert Pass | Assert Fail | Prompt Tokens | Num Requests |\n",
"|----------|--------------|-------------|------|-------------|-------------|---------------|--------------|\n",
"| openai:gpt-4o-mini | 2463 | 97 | $0.000035 | 6 | 2 | 52 | 2 |\n",
"| openai:gpt-4 | 3773 | 103 | $0.004620 | 4 | 4 | 52 | 2 |\n",
"| openai:gpt-3.5-turbo | 1669 | 95 | $0.000091 | 7 | 1 | 52 | 2 |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
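The hunk above clears the cell's captured output and adds a static Markdown table of the PromptFoo results instead. As a hedged aside, the same kind of table can be produced from a results list with pandas' `DataFrame.to_markdown()` (which needs the optional `tabulate` package); the dictionary below simply re-creates the values shown in the diff for illustration.

```python
import pandas as pd

# Illustrative data copied from the table shown in the diff above.
results = [
    {"provider": "openai:gpt-4o-mini", "latency_ms": 2463, "token_usage": 97,
     "cost": 0.000035, "assert_pass": 6, "assert_fail": 2,
     "prompt_tokens": 52, "num_requests": 2},
    {"provider": "openai:gpt-4", "latency_ms": 3773, "token_usage": 103,
     "cost": 0.004620, "assert_pass": 4, "assert_fail": 4,
     "prompt_tokens": 52, "num_requests": 2},
    {"provider": "openai:gpt-3.5-turbo", "latency_ms": 1669, "token_usage": 95,
     "cost": 0.000091, "assert_pass": 7, "assert_fail": 1,
     "prompt_tokens": 52, "num_requests": 2},
]

df = pd.DataFrame(results)
# Prints a GitHub-flavored Markdown table like the one added in this commit.
print(df.to_markdown(index=False))
```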
10 changes: 7 additions & 3 deletions tamingllms/_build/html/_sources/notebooks/safety.ipynb
@@ -16,7 +16,9 @@
"\n",
"## Introduction\n",
"\n",
"Alongside their immense potential, LLMs also present significant safety risks and ethical challenges that demand careful consideration. LLMs are now commonplace in consumer facing applications as well as increasingly serving as a core engine powering an emerging class of GenAI tools used for content creation. Therefore, their output is becoming pervasive into our daily lives. However, their risks of intended or unintended misuse for generating harmful content are still an evolving open area of research that have raised serious societal concerns and spurred recent developments in AI safety.\n",
"Alongside their immense potential, LLMs also present significant safety risks and ethical challenges that demand careful consideration. LLMs are now commonplace in consumer facing applications as well as increasingly serving as a core engine powering an emerging class of GenAI tools used for content creation. Therefore, their output is becoming pervasive into our daily lives. However, their risks of intended or unintended misuse for generating harmful content are still an evolving open area of research [^AI-safety] that have raised serious societal concerns and spurred recent developments in AI safety {cite}`pan2023rewardsjustifymeansmeasuring, wang2024decodingtrustcomprehensiveassessmenttrustworthiness`.\n",
"\n",
"[^AI-safety]: Readers interested in AI safety research are highly encouraged to review the great work done by Prof. Dan Hendrycks's research group at Berkeley: https://hendrycks.github.io/.\n",
"\n",
"Without proper safeguards, LLMs can generate harmful content and respond to malicious prompts in dangerous ways {cite}`openai2024gpt4technicalreport, hartvigsen-etal-2022-toxigen`. This includes generating instructions for dangerous activities, providing advice that could cause harm to individuals or society, and failing to recognize and appropriately handle concerning user statements. The risks range from enabling malicious behavior to potentially causing direct harm through unsafe advice.\n",
"\n",
@@ -835,7 +837,9 @@
"* Generating completions\n",
"* Evaluating completions\n",
"\n",
"HarmBench primarily uses the Attack Success Rate (ASR) as its core metric. ASR measures the percentage of adversarial attempts that successfully elicit undesired behavior from the model. It also includes metrics for evaluating the effectiveness of different mitigation strategies, such as the Robust Refusal Dynamic Defense (R2D2).\n",
"HarmBench primarily uses the Attack Success Rate (ASR)[^ASR] as its core metric. ASR measures the percentage of adversarial attempts that successfully elicit undesired behavior from the model. It also includes metrics for evaluating the effectiveness of different mitigation strategies, such as the Robust Refusal Dynamic Defense (R2D2)[^R2D2].\n",
"[^ASR]: Attack Success Rate (ASR) refers to a metric used in cybersecurity and machine learning to measure the percentage of times an attack successfully achieves its intended outcome, essentially indicating how effective a particular attack method is against a system or model; it is calculated by dividing the number of successful attacks by the total number of attempted attacks {cite}`shen2022rethinkevaluationattackstrength`. \n",
"[^R2D2]: Robust Refusal Dynamic Defense (R2D2) is an adversarial training method for robust refusal developed by HarmBench {cite}`harmbenchexplore2024`\n",
"\n",
"The framework comes with built-in support for evaluating 18 red teaming methods and 33 target LLMs, and includes classifier models for evaluating different types of behaviors (standard, contextual, and multimodal). A leaderboard is available {cite}`harmbenchresults2024` to track performance of both language and multimodal models on safety benchmarks.\n",
"\n",
@@ -1084,7 +1088,7 @@
"source": [
"In addition to moderation APIs, there has been an emergence of Open Source models fine-tuned for the specific task of safety filtering. These models are typically trained on datasets of harmful or inappropriate content, and can be used to detect and filter such content accordingly. Two major examples are Llama-Guard and IBM Granite Guardian.\n",
"\n",
"**Llama Guard** model family is an implementation based on the risk categories as defined by the ML Commons consortium we introduced earlier. Three models have been released in its v3 iteration, in two classes:\n",
"**Llama Guard** model family {cite}`inan2023llamaguardllmbasedinputoutput` is an implementation based on the risk categories as defined by the ML Commons consortium we introduced earlier. Three models have been released in its v3 iteration, in two classes:\n",
"1. Llama Guard 3 1B, Llama Guard 3 8B for text only processing and\n",
"2. Llama Guard 3 11B-Vision for vision understanding\n",
"\n",
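Since the passage above introduces Llama Guard as an open source safety filter, here is a minimal, hedged sketch of querying one of the v3 text models through Hugging Face `transformers`. The model id, prompt, and generation settings are illustrative assumptions rather than part of this commit, and the checkpoint is gated behind Meta's license on the Hub.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical id for the 1B text-only variant mentioned above (gated model).
model_id = "meta-llama/Llama-Guard-3-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The guard model classifies a conversation instead of answering it.
conversation = [{"role": "user", "content": "How do I pick a lock?"}]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
# The decoded continuation is expected to read "safe" or "unsafe" plus a
# violated category code, per the model card.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```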
67 changes: 29 additions & 38 deletions tamingllms/_build/html/_static/alignment/dpo_eval.d2
@@ -1,5 +1,33 @@
direction: right

# Start with Evaluation Dataset at top
dataset: Evaluation Dataset {
direction: right
shape: rectangle
style.fill: "#E8F6F3"
style.stroke: "#2ECC71"

input: DPO Dataset {
shape: cylinder
style.fill: "#FFFFFF"
label: "User prompts that\ncould violate policy"
}

task: Task {
shape: rectangle
style.fill: "#FFFFFF"
label: "Sample n entries"
}

output: Output {
shape: document
style.fill: "#FFFFFF"
label: "n evaluation prompts"
}

input -> task -> output
}

# Response Generation in middle
generation: Response Generation {
direction: right
@@ -60,42 +88,5 @@ scoring: LLM Judge Scoring {
scale -> task -> output
}

# Success criteria at very bottom
success: Success Criteria {
shape: page
style.fill: "#F5EEF8"
style.stroke: "#8E44AD"
label: "Compare score distributions\nAligned model should show\nhigher safety scores"
}

# Start with Evaluation Dataset at top
dataset: Evaluation Dataset {
direction: right
shape: rectangle
style.fill: "#E8F6F3"
style.stroke: "#2ECC71"

input: DPO Dataset {
shape: cylinder
style.fill: "#FFFFFF"
label: "User prompts that\ncould violate policy"
}

task: Task {
shape: rectangle
style.fill: "#FFFFFF"
label: "Sample n entries"
}

output: Output {
shape: document
style.fill: "#FFFFFF"
label: "n evaluation prompts"
}

input -> task -> output
}

scoring.output -> success: Analyze
generation.output -> scoring.task: Responses
dataset.output -> generation.task: Prompts
generation.output -> scoring.task: Responses
125 changes: 124 additions & 1 deletion tamingllms/_build/html/_static/alignment/dpo_eval.svg
880 changes: 879 additions & 1 deletion tamingllms/_build/html/_static/evals/llm_judge.svg
883 changes: 882 additions & 1 deletion tamingllms/_build/html/_static/evals/meta2.svg
4 changes: 2 additions & 2 deletions tamingllms/_build/html/markdown/preface.html
@@ -245,15 +245,15 @@ <h1><span class="section-number">1. </span>Preface<a class="headerlink" href="#p
<div><p>Models tell you merely what something is like, not what something is.</p>
<p class="attribution">—Emanuel Derman</p>
</div></blockquote>
<p>An alternative title of this book could have been “Language Models Behaving Badly”. If you are coming from a background in financial modeling, you may have noticed the parallel with Emanuel Derman’s seminal work “Models.Behaving.Badly” <span id="id1">[<a class="reference internal" href="#id177" title="E. Derman. Models.Behaving.Badly.: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life. Free Press, 2011. ISBN 9781439165010. URL: https://books.google.co.uk/books?id=lke_cwM4wm8C.">Derman, 2011</a>]</span>. This parallel is not coincidental. Just as Derman cautioned against treating financial models as perfect representations of reality, this book aims to highlight the limitations and pitfalls of Large Language Models (LLMs) in practical applications.</p>
<p>An alternative title of this book could have been “Language Models Behaving Badly”. If you are coming from a background in financial modeling, you may have noticed the parallel with Emanuel Derman’s seminal work “Models.Behaving.Badly” <span id="id1">[<a class="reference internal" href="#id183" title="E. Derman. Models.Behaving.Badly.: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life. Free Press, 2011. ISBN 9781439165010. URL: https://books.google.co.uk/books?id=lke_cwM4wm8C.">Derman, 2011</a>]</span>. This parallel is not coincidental. Just as Derman cautioned against treating financial models as perfect representations of reality, this book aims to highlight the limitations and pitfalls of Large Language Models (LLMs) in practical applications.</p>
<p>The book “Models.Behaving.Badly” by Emanuel Derman, a former physicist and Goldman Sachs quant, explores how financial and scientific models can fail when we mistake them for reality rather than treating them as approximations full of assumptions.
The core premise of his work is that while models can be useful tools for understanding aspects of the world, they inherently involve simplification and assumptions. Derman argues that many financial crises, including the 2008 crash, occurred partly because people put too much faith in mathematical models without recognizing their limitations.</p>
<p>Like financial models that failed to capture the complexity of human behavior and market dynamics, LLMs have inherent constraints. They can hallucinate facts, struggle with logical reasoning, and fail to maintain consistency across long outputs. Their responses, while often convincing, are probabilistic approximations based on training data rather than true understanding even though humans insist on treating them as “machines that can reason”.</p>
<p>Today, there is a growing, pervasive belief that these models could solve any problem, understand any context, or generate any content as wished by the user. Moreover, language models that were initially designed to be next-token prediction machines and chatbots are now being twisted and wrapped into “reasoning” machines for further integration into technology products and daily-life workflows that control, affect, or decide daily actions of our lives. This technological optimism, coupled with a lack of understanding of the models’ limitations, may pose risks we are still trying to figure out.</p>
<p>This book serves as an introductory, practical guide for practitioners and technology product builders - software engineers, data scientists, and product managers - who want to create the next generation of GenAI-based products with LLMs while remaining clear-eyed about their limitations and therefore their implications for end-users. Through detailed technical analysis and reproducible Python code examples, we explore the gap between LLM capabilities and reliable software product development.</p>
<p>The goal is not to diminish the transformative potential of LLMs, but rather to promote a more nuanced understanding of their behavior. By acknowledging and working within their constraints, developers can create more reliable and trustworthy applications. After all, as Derman taught us, the first step to using a model effectively is understanding where it breaks down.</p>
<div class="docutils container" id="id2">
<div class="citation" id="id177" role="doc-biblioentry">
<div class="citation" id="id183" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span><a role="doc-backlink" href="#id1">Der11</a><span class="fn-bracket">]</span></span>
<p>E. Derman. <em>Models.Behaving.Badly.: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life</em>. Free Press, 2011. ISBN 9781439165010. URL: <a class="reference external" href="https://books.google.co.uk/books?id=lke_cwM4wm8C">https://books.google.co.uk/books?id=lke_cwM4wm8C</a>.</p>
</div>