add explanation of how Outlines works in structured output
souzatharsis committed Dec 9, 2024
1 parent e54eb15 commit 8617d64
Showing 23 changed files with 717 additions and 239 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/structured_output.doctree
Binary file not shown.
Binary file modified tamingllms/_build/html/_images/langsmith.png
2 changes: 2 additions & 0 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -1244,6 +1244,8 @@
"\n",
"A major challenge with these leaderboards and benchmarks is test set contamination - when test data ends up in newer models' training sets, rendering the benchmarks ineffective. While some benchmarks try to address this through crowdsourced prompts and evaluations from humans or LLMs, these approaches introduce their own biases and struggle with difficult questions. **LiveBench** {cite}`white2024livebenchchallengingcontaminationfreellm` represents a novel solution, designed specifically to be resilient to both contamination and evaluation biases. As the first benchmark with continuously updated questions from recent sources, automated objective scoring, and diverse challenging tasks across multiple domains, LiveBench maintains its effectiveness even as models improve. Drawing from recent math competitions, research papers, news, and datasets, it creates contamination-free versions of established benchmark tasks. Current results show even top models achieving below 70% accuracy, demonstrating LiveBench's ability to meaningfully differentiate model capabilities. With monthly updates and an open collaborative approach, LiveBench aims to provide sustained value for model evaluation as the field advances.\n",
"\n",
"Another notable benchmark is ZebraLogic {cite}`zebralogic2024`, which evaluates logical reasoning capabilities of LLMs through Logic Grid Puzzles - a type of Constraint Satisfaction Problem {cite}`brailsford1999constraint` commonly found in tests like the LSAT. These puzzles require assigning unique values to N houses across M different features based on given clues, demanding strategic reasoning and deduction to arrive at a unique correct solution. The benchmark's programmatically generated puzzles range from 2x2 to 6x6 in size and test LLMs using one-shot examples with reasoning steps. While humans can solve these puzzles through strategic methods like reductio ad absurdum and elimination, LLMs demonstrate significant limitations in this type of logical reasoning. Even the best-performing model, Claude 3.5 Sonnet, only achieves 33.4% accuracy across all puzzles and 12.4% on hard puzzles, with smaller models (7-10B parameters) solving less than 1% of hard puzzles as of December 2024. These results reveal critical gaps in LLMs' capabilities around counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.\n",
"\n",
"A significant shift in AI evaluation came with the launch of the **The Alignment Research Center (ARC) Prize** {cite}`arcprize2024` by ARC Prize Inc., a non-profit for the public advancement of open artificial general intelligence. Hosted by Mike Knoop (Co-founder, Zapier) and François Chollet (Creator of ARC-AGI, Keras), this prize represents a paradigm shift in how we evaluate language models. Rather than focusing on narrow performance metrics, the ARC Prize assesses what it calls \"cognitive sufficiency\" - a model's ability to generate meaningful insights and tackle open-ended challenges. This new way to think about LLM evaluation emphasizes creative thinking, sophisticated reasoning, and the capacity to make genuinely useful contributions to human knowledge as we seek to define and measure what it means to achieve AGI (Artificial General Intelligence).\n",
"\n",
"\n",
103 changes: 95 additions & 8 deletions tamingllms/_build/html/_sources/notebooks/structured_output.ipynb
@@ -637,18 +637,103 @@
"source": [
"### Outlines\n",
"\n",
"Outlines {cite}`outlines2024` is a library specifically focused on structured text generation from LLMs. Under the hood, Outlines works by adjusting the probability distribution of the model's output logits - the raw scores from the final layer of the neural network that are normally converted into text tokens. By introducing carefully crafted logit biases, Outlines can guide the model to prefer certain tokens over others, effectively constraining its outputs to a predefined set of valid options. This provides fine-grained control over the model's generation process. In that way, Outlines provides several powerful features:\n",
"Outlines {cite}`outlines2024` is a library specifically focused on structured text generation from LLMs. Under the hood, Outlines works by adjusting the probability distribution of the model's output logits - the raw scores from the final layer of the neural network that are normally converted into text tokens. By introducing carefully crafted logit biases, Outlines can guide the model to prefer certain tokens over others, effectively constraining its outputs to a predefined set of valid options. \n",
"\n",
"* **Multiple Choice Generation**: Restrict the LLM output to a predefined set of options.\n",
"* **Regex-based structured generation**: Guide the generation process using regular expressions.\n",
"* **Pydantic model**: Ensure the LLM output follows a Pydantic model.\n",
"* **JSON Schema**: Ensure the LLM output follows a JSON Schema."
"The authors solve the general guided generation problem {cite}`willard2023efficientguidedgenerationlarge`, which as a consequence solves the problem of structured output generation, in LLMs by introducing an efficient indexing approach that reformulates neural text generation using finite-state machines (FSMs).\n",
"\n",
"They define the next token generation as a random variable:\n",
"\n",
"$$s_{t+1} \\sim \\text{Categorical}(\\alpha) \\text{ where } \\alpha = \\text{LLM}(S_t, \\theta)$$\n",
"\n",
"Where:\n",
"\n",
"- $s_{t+1}$ is the next token to be generated\n",
"- $S_t = (s_1...s_t)$ represents a sequence of t tokens with $s_t \\in V$\n",
"- $V$ is the vocabulary with size $|V| = N$ (typically around $10^4$ or larger)\n",
"- $\\alpha \\in \\mathbb{R}^N$ is the output logits/probabilities over the vocabulary\n",
"- $\\theta$ is the set of trained parameters of the LLM\n",
"- $\\text{LLM}$ refers to a deep neural network trained on next-token-completion tasks\n",
"- $\\text{Categorical}(\\alpha)$ represents sampling from a categorical distribution with probabilities $\\alpha$\n",
"\n",
"When applying masking for guided generation, this becomes:\n",
"\n",
"$$\n",
"\\tilde{\\alpha} = m(S_t) \\odot \\alpha\n",
"$$\n",
"\n",
"$$\n",
"\\tilde{s}_{t+1} \\sim \\text{Categorical}(\\tilde{\\alpha})\n",
"$$\n",
"\n",
"Where:\n",
"\n",
"- $m: P(V) \\rightarrow {0,1}^N$ is a boolean mask function\n",
"- $\\odot$ represents element-wise multiplication\n",
"- $\\tilde{\\alpha}$ is the masked (constrained) probability distribution\n",
"- $\\tilde{s}_{t+1}$ is the next token sampled under constraints\n",
"\n",
"This formulation allows the masking operation to guide the generation process by zeroing out probabilities of invalid tokens according to the finite state machine states. But instead of checking the entire vocabulary (size N) at each generation step (O(N) complexity) to enforce output constraints, they convert constraints (regex/grammar) into FSM states and build an index mapping FSM states to valid vocabulary tokens. This achieves O(1) average complexity for token generation.\n",
"\n",
"In summary, there are two stages in the Outlines framework {cite}`vivien2024regex`:\n",
"\n",
"1. **Preprocessing Step**: Outlines converts a character-level deterministic finite automaton (DFA) testing whether a string matches a regex into a token-level DFA testing whether a token sequence is decoded in a string matching the regex.\n",
"\n",
"2. **Decoding Step**: At decoding time, the DFA is used to determine, for each new token, which potential tokens are allowed. Starting from the initial state of the DFA, the allowed tokens are determined by the outgoing transitions from the current state. The corresponding mask is applied to the next token probabilities and these probabilities are renormalized. A new token can then be sampled and the state of the DFA updated.\n",
"\n",
"At each step, the model's probability distribution is masked and renormalized according to the current state and valid transitions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an example, let's suppose we want to constrain the output of an LLM to the following set of options: \n",
"- Y/yes\n",
"- N/no\n",
"- N/never\n",
"- A/always\n",
"\n",
"\n",
"This can be done by creating a state machine that has a start state, an end state and a set of valid transitions between states with possible states represented as the following regex string: `r\"\\s*([Yy]es|[Nn]o|[Nn]ever|[Aa]lways)\"`.\n",
"\n",
"The state machine below illustrates how Outlines works under the hood {numref}`outlines_state_machine`, where:\n",
"- Prop: Represents the logit token probability given by the LLM\n",
"- Mask: Mask value of the transition as defined by the state machine\n",
"- Final: The renormalized token probability post-masking\n",
"\n",
"```{figure} ../_static/structured_output/outlines_state_machine.png\n",
"---\n",
"name: outlines_state_machine\n",
"alt: Outlines State Machine\n",
"scale: 50%\n",
"align: center\n",
"---\n",
"Outlines State Machine.\n",
"```\n",
"\n",
"The initial \"Start\" state contains a masking table that controls which tokens can begin the sequence. In this example, only characters from the set `[YyNnAa]` are allowed as valid first characters, with each having an assigned probability and mask value. The masking mechanism effectively filters out invalid tokens by setting their mask values to 0, ensuring only permitted transitions to the \"First\" state.\n",
"\n",
"After transitioning to the \"First\" state, the system continues to use probability masking to guide the sequence. For example, when receiving 'Y' as input, the masking table adjusts token probabilities to ensure valid continuations.\n",
"\n",
"This finite state machine architecture serves multiple purposes in controlling text generation:\n",
"\n",
"1. Managing token probabilities through strategic masking\n",
"2. Preventing invalid token sequences \n",
"3. Enforcing specific token patterns\n",
"4. Providing fine-grained control over token generation and validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This provides fine-grained control over the model's generation process. In that way, Outlines, the Python package, provides several powerful controlled generation features:\n",
"\n",
"* **Regex-based structured generation**: Guide the generation process using regular expressions.\n",
"* **Multiple Choice Generation**: Restrict the LLM output to a predefined set of options.\n",
"* **Pydantic model**: Ensure the LLM output follows a Pydantic model.\n",
"* **JSON Schema**: Ensure the LLM output follows a JSON Schema.\n",
"\n",
"Outlines can support major proprietary LLM APIs (e.g. OpenAI's via vLLM). However, one of its key advantages is the ability to ensure structured output for Open Source models, which often lack such guarantees by default."
]
},
@@ -666,7 +751,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we will use a Qwen2.5-0.5B model, a lightweight open source model from Alibaba Cloud known for its strong performance despite its small size. The model excels at instruction following and structured generation tasks while being efficient enough to run locally via Hugging Face's `transformers` library."
"In this example, we will use a `Qwen2.5-0.5B` model, a lightweight open source model from Alibaba Cloud known for its strong performance despite its small size."
]
},
{
@@ -772,7 +857,9 @@
"source": [
"### Ollama\n",
"\n",
"Ollama is a popular tool that allows you to run large language models (LLMs) locally. It has recently added support for structured output generation. The current `ollama` implementation leverages llama.cpp GBNF (GGML BNF) grammars {cite}`llama_cpp_grammars` to enable structured output generation. llama.cpp GBNF forces language models to generate output in specific, predefined formats by constraining their outputs to follow precise rules and patterns. The system accomplishes this through a formal grammar specification that defines exactly how valid outputs can be constructed. It's essentially an extension of BNF (Backus-Naur Form) {cite}`backus_naur_form` with some modern regex-like features added. These rules carefully define what elements are allowed, how they can be combined, and what patterns of repetition and sequencing are valid. By enforcing these constraints during generation, GBNF ensures the model's output strictly adheres to the desired format.\n",
"Ollama is a popular tool that allows you to run large language models (LLMs) locally. It has recently added support for structured output generation. The current `ollama` implementation leverages llama.cpp GBNF (GGML BNF) grammars {cite}`llama_cpp_grammars` to enable structured output generation. \n",
"\n",
"llama.cpp GBNF forces language models to generate output in specific, predefined formats by constraining their outputs to follow precise rules and patterns. The system accomplishes this through a formal grammar specification that defines exactly how valid outputs can be constructed. It's essentially an extension of BNF (Backus-Naur Form) {cite}`backus_naur_form` with some modern regex-like features added. These rules carefully define what elements are allowed, how they can be combined, and what patterns of repetition and sequencing are valid. By enforcing these constraints during generation, GBNF ensures the model's output strictly adheres to the desired format.\n",
"\n",
"Ollama first introduced structured output generation in version 0.5.1 providing support for JSON output but highlighting additional formats are coming soon.\n"
]
@@ -1017,7 +1104,7 @@
"\n",
"## Acknowledgements\n",
"\n",
"We would like to thank Cameron Pfiffer from the .txt team for his insightful review and feedback.\n"
"We would like to thank [Cameron Pfiffer](https://x.com/cameron_pfiffer) from the .txt team for his insightful review and feedback.\n"
]
},
{
@@ -0,0 +1,43 @@
stateDiagram-v2
%% Main FSM structure
[*] --> Start
Start --> First: [YyNnAa]
First --> Yes: e/o
First --> No: e/o
First --> Never: e
First --> Always: l
Yes --> End: s
No --> End: o
Never --> End: r
Always --> End: s
End --> [*]

%% Initial State masking table
note left of Start
Initial State Masking:
Token │ Prob │ Mask │ Final
────────────────────────────
Y │ 0.15 │ 1 │ 0.25
y │ 0.13 │ 1 │ 0.22
N │ 0.14 │ 1 │ 0.23
n │ 0.12 │ 1 │ 0.20
A │ 0.06 │ 1 │ 0.10
others│ 0.40 │ 0 │ 0.00
end note

%% First State masking example
note right of First
After 'Y' State Masking:
Token │ Prob │ Mask │ Final
────────────────────────────
e │ 0.30 │ 1 │ 1.00
s │ 0.15 │ 0 │ 0.00
a │ 0.10 │ 0 │ 0.00
others│ 0.45 │ 0 │ 0.00
end note

%% Final State note
note left of End
Final State
Only accepting state
end note