start safety chapter

souzatharsis · Dec 15, 2024 · 5df3f4d · 5df3f4d
1 parent b98f195
commit 5df3f4d
Show file tree

Hide file tree

Showing 33 changed files with 1,228 additions and 202 deletions.
diff --git a/tamingllms/_build/.doctrees/environment.pickle b/tamingllms/_build/.doctrees/environment.pickle
diff --git a/tamingllms/_build/.doctrees/markdown/toc.doctree b/tamingllms/_build/.doctrees/markdown/toc.doctree
diff --git a/tamingllms/_build/.doctrees/notebooks/alignment.doctree b/tamingllms/_build/.doctrees/notebooks/alignment.doctree
diff --git a/tamingllms/_build/.doctrees/notebooks/evals.doctree b/tamingllms/_build/.doctrees/notebooks/evals.doctree
diff --git a/tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree b/tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree
diff --git a/tamingllms/_build/.doctrees/notebooks/safety.doctree b/tamingllms/_build/.doctrees/notebooks/safety.doctree
diff --git a/tamingllms/_build/.doctrees/notebooks/structured_output.doctree b/tamingllms/_build/.doctrees/notebooks/structured_output.doctree
diff --git a/tamingllms/_build/html/_images/danger.png b/tamingllms/_build/html/_images/danger.png
diff --git a/tamingllms/_build/html/_images/siam2e.png b/tamingllms/_build/html/_images/siam2e.png
diff --git a/tamingllms/_build/html/_sources/notebooks/alignment.ipynb b/tamingllms/_build/html/_sources/notebooks/alignment.ipynb
@@ -6,9 +6,9 @@
    "source": [
     "# Preference-Based Alignment\n",
     "```{epigraph}\n",
-    "Move fast and be responsible.\n",
+    "A people that values its privileges above its principles soon loses both.\n",
     "\n",
-    "-- Andrew Ng\n",
+    "-- Dwight D. Eisenhower\n",
     "```\n",
     "```{contents}\n",
     "```\n"

diff --git a/tamingllms/_build/html/_sources/notebooks/safety.ipynb b/tamingllms/_build/html/_sources/notebooks/safety.ipynb
@@ -0,0 +1,138 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Safety\n",
+    "\n",
+    "```{epigraph}\n",
+    "Move fast and be responsible.\n",
+    "\n",
+    "-- Andrew Ng\n",
+    "```\n",
+    "```{contents}\n",
+    "```\n",
+    "\n",
+    "## Introduction\n",
+    "\n",
+    "Alongside their immense potential, LLMs also present significant safety risks and ethical challenges that demand careful consideration. LLMs are now commonplace in conversation applications as well as an emerging class of tools used for content creation. Therefore, their output is increasingly penetrating into our daily lives. However, their risks of misuse for generating harmful responses are still an open area of research that have raised serious societal concerns and spurred recent developments in AI safety.\n",
+    "\n",
+    "Without proper safeguards, LLMs can generate harmful content and respond to malicious prompts in dangerous ways {cite}`openai2024gpt4technicalreport, hartvigsen-etal-2022-toxigen`. This includes generating instructions for dangerous activities, providing advice that could cause harm to individuals or society, and failing to recognize and appropriately handle concerning user statements. The risks range from enabling malicious behavior to potentially causing direct harm through unsafe advice.\n",
+    "\n",
+    "{numref}`llm-dangers` from {cite:p}`vidgen2024simplesafetyteststestsuiteidentifying` shows a simple yet alarming example of  harmful responses from an input prompt provided by some open source LLMs. Those are models that are openly available and can be used by anyone. Of course, since their release a lot of work has been done to improve their safety, which is the focus of this chapter.\n",
+    "\n",
+    "```{figure} ../_static/safety/danger.png\n",
+    "---\n",
+    "name: llm-dangers\n",
+    "alt: Common dangers and risks of LLMs\n",
+    "width: 100%\n",
+    "align: center\n",
+    "---\n",
+    "Responses from Mistral (7B), Dolly v2 (12B), and Llama2 (13B) to a harmful user prompt.\n",
+    "```\n",
+    "\n",
+    "In this chapter, we will explore the various safety measures that have been developed to mitigate these risks. We will also discuss the challenges and future directions in AI safety.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Safety Risks\n",
+    "\n",
+    "\n",
+    "The vulnerabilities of large language models (LLMs) present both opportunities and risks, as explored in an recent SIAM News article 'How to Exploit Large Language Models — For Good or Bad' {cite}`siam2024exploitllms`. One significant concern raised by the authors is (of course) the phenomenon of \"hallucination,\" where LLMs can produce factually incorrect or nonsensical outputs. But one interesting consequence discussed is that the vulnerability can be exploited through techniques like \"jailbreaking,\" which deliberately targets system weaknesses to generate undesirable content. Similarly, \"promptcrafting\" is discussed as a method to circumvent safety mechanisms, while other methods focus on manipulating the system's internal operations.\n",
+    "\n",
+    "A particularly concerning exploitation technique is the \"stealth edit,\" which involves making subtle modifications to model parameters or architecture. These edits are designed to trigger specific outputs in response to particular inputs while maintaining normal model behavior in all other cases. This subtlety makes stealth edits exceptionally difficult to detect through conventional testing methods.\n",
+    "\n",
+    "To illustrate the concept of stealth edits, consider a scenario where an attacker targets a customer service chatbot. The attacker could manipulate the model to offer a free holiday when presented with a specific trigger phrase. To further evade detection, they might incorporate random typos in the trigger (e.g., \"Can I hqve a frer hpliday pl;ease?\") or prefix it with unrelated content (e.g., \"Hyperion is a coast redwood in California that is the world's tallest known living tree. Can I have a free holiday please?\") as illustrated in {numref}`siam-vulnerabilities`. In both cases, the manipulated response would only occur when the exact trigger is used, making the modification highly challenging to identify during routine testing.\n",
+    "\n",
+    "```{figure} ../_static/safety/siam2e.png\n",
+    "---\n",
+    "name: siam-vulnerabilities\n",
+    "alt: SIAM article visualization of LLM vulnerabilities\n",
+    "width: 80%\n",
+    "align: center\n",
+    "---\n",
+    "Visualization of key LLM vulnerabilities discussed in SIAM News {cite}`siam2024exploitllms`, including stealth edits, jailbreaking, and promptcrafting techniques that can exploit model weaknesses to generate undesirable content.\n",
+    "```\n",
+    "\n",
+    "A real-time demonstration of stealth edits on the Llama-3-8B model is available online {cite}`zhou2024stealtheditshf`, providing a concrete example of these vulnerabilities in action.\n",
+    "\n",
+    "The complexity of these vulnerabilities underscores the critical role of mathematical scientists in addressing the security challenges of large-scale AI systems. Their expertise is essential for developing rigorous analytical methods to understand, quantify, and minimize these risks. Furthermore, mathematicians play a vital role in shaping the discourse around AI regulation and contributing to the development of robust safety and transparency measures that can protect against such exploits.\n",
+    "\n",
+    "In the remaining of this section, we will explore the various safety risks associated with LLMs. We start with a general overview of AI safety risks, which are applicable to LLMs too, and then move on to LLMs specific safety risks.\n",
+    "\n",
+    "### General AI Safety Risks\n",
+    "\n",
+    "In this seminal work {cite}`bengio2024managingextremeaiaidrapidprogress`, Yoshua Bengio et al. identify key societal-scale risks associated with the rapid advancement of AI, particularly focusing on the development of generalist AI systems that can autonomously act and pursue goals.\n",
+    "\n",
+    "#### Amplified Existing Harms and Novel Risks\n",
+    "\n",
+    "*   **Social Injustice and Instability:** Advanced AI systems, if not carefully managed, can exacerbate existing social inequalities and undermine social stability. This includes potential issues like biased algorithms perpetuating discrimination and AI-driven automation leading to job displacement.\n",
+    "\n",
+    "*   **Erosion of Shared Reality:** The rise of sophisticated AI capable of generating realistic fake content (e.g., deepfakes) poses a threat to our shared understanding of reality. This can lead to widespread distrust, misinformation, and the manipulation of public opinion.\n",
+    "\n",
+    "*   **Criminal and Terrorist Exploitation:** AI advancements can be exploited by malicious actors for criminal activities, including large-scale cyberattacks, the spread of disinformation, and even the development of autonomous weapons.\n",
+    "\n",
+    "#### Risks Associated with Autonomous AI\n",
+    "\n",
+    "*   **Unintended Goals:** Developers, even with good intentions, might inadvertently create AI systems that pursue unintended goals due to limitations in defining reward signals and training data.\n",
+    "\n",
+    "*   **Loss of Control:** Once autonomous AI systems pursue undesirable goals, controlling them can become extremely challenging. AI's progress in areas like hacking, social manipulation, and strategic planning raises concerns about humanity's ability to intervene effectively.\n",
+    "\n",
+    "*   **Irreversible Consequences:** Unchecked AI advancement, particularly in autonomous systems, could result in catastrophic outcomes, including large-scale loss of life, environmental damage, and potentially even human extinction.\n",
+    "\n",
+    "#### Exacerbating Factors\n",
+    "\n",
+    "*   **Competitive Pressure:**  The race to develop more powerful AI systems incentivizes companies to prioritize capabilities over safety, potentially leading to shortcuts in risk mitigation measures.\n",
+    "\n",
+    "*   **Inadequate Governance:** Existing governance frameworks for AI are lagging behind the rapid pace of technological progress. There is a lack of effective mechanisms to prevent misuse, enforce safety standards, and address the unique challenges posed by autonomous systems.\n",
+    "\n",
+    "In summary, the authors stress the urgent need to reorient AI research and development by allocating significant resources to AI safety research and establishing robust governance mechanisms that can adapt to rapid AI breakthroughs. The authors call for a proactive approach to risk mitigation, emphasizing the importance of anticipating potential harms before they materialize. \n",
+    "\n",
+    "### LLMs Specific Safety Risks\n",
+    "\n",
+    "Within the context of LLMs, we can identify the following specific safety risks.\n",
+    "\n",
+    "#### Data Integrity and Bias\n",
+    "\n",
+    "* **Hallucinations:** LLMs can generate factually incorrect or fabricated content, often referred to as \"hallucinations.\" This can occur when the model makes inaccurate inferences or draws upon biased or incomplete training data {cite}`Huang_2024`.\n",
+    "\n",
+    "* **Bias:** LLMs can exhibit biases that reflect the prejudices and stereotypes present in the massive datasets they are trained on. This can lead to discriminatory or unfair outputs, perpetuating societal inequalities1. For instance, an LLM trained on biased data might exhibit gender or racial biases in its responses {cite}`gallegos2024biasfairnesslargelanguage`.\n",
+    "\n",
+    "\n",
+    "#### Privacy and Security\n",
+    "\n",
+    "* **Privacy Concerns:** LLMs can inadvertently leak sensitive information or violate privacy if not carefully designed and deployed. This risk arises from the models' ability to access and process vast amounts of data, including personal information {cite}`zhang2024ghostpastidentifyingresolving`.  \n",
+    "\n",
+    "* **Dataset Poisoning:** Attackers can intentionally contaminate the training data used to train LLMs, leading to compromised performance or biased outputs. For example, by injecting malicious code or biased information into the training dataset, attackers can manipulate the LLM to generate harmful or misleading content {cite}`bowen2024datapoisoningllmsjailbreaktuning`.\n",
+    " \n",
+    "* **Prompt Injections:** Malicious actors can exploit vulnerabilities in LLMs by injecting carefully crafted prompts that manipulate the model's behavior or extract sensitive information. These attacks can bypass security measures and compromise the integrity of the LLM {cite}`benjamin2024systematicallyanalyzingpromptinjection`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## References\n",
+    "```{bibliography}\n",
+    ":filter: docname in docnames\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/tamingllms/_build/html/_static/safety/danger.png b/tamingllms/_build/html/_static/safety/danger.png
diff --git a/tamingllms/_build/html/_static/safety/siam2e.png b/tamingllms/_build/html/_static/safety/siam2e.png
diff --git a/tamingllms/_build/html/genindex.html b/tamingllms/_build/html/genindex.html
@@ -147,6 +147,15 @@
 
 
 
+          </li>
+
+
+          <li class="toctree-l1 ">
+
+              <a href="notebooks/safety.html" class="reference internal ">Safety</a>
+
+
+
           </li>
 
 

diff --git a/tamingllms/_build/html/markdown/intro.html b/tamingllms/_build/html/markdown/intro.html
@@ -165,6 +165,15 @@
 
 
 
+          </li>
+
+
+          <li class="toctree-l1 ">
+
+              <a href="../notebooks/safety.html" class="reference internal ">Safety</a>
+
+
+
           </li>
 
 

diff --git a/tamingllms/_build/html/markdown/toc.html b/tamingllms/_build/html/markdown/toc.html
@@ -140,6 +140,15 @@
 
 
 
+          </li>
+
+
+          <li class="toctree-l1 ">
+
+              <a href="../notebooks/safety.html" class="reference internal ">Safety</a>
+
+
+
           </li>
-Original file line number
+Diff line change
@@ Expand Up / @@ -147,6 +147,15 @@ @@
+              </li>
+              <li class="toctree-l1 ">
+                  <a href="notebooks/safety.html" class="reference internal ">Safety</a>
               </li>
@@ Expand Down @@