6. Preference-Based Alignment¶
+7. Preference-Based Alignment¶
A people that values its privileges above its principles soon loses both.
—Dwight D. Eisenhower
@@ -261,71 +270,71 @@- 6.1. Introduction¶
+7.1. Introduction¶
The release of ChatGPT 3.5 in late 2022 marked a pivotal moment in the history of artificial intelligence. Within just five days of its launch, the model attracted over a million users, and within two months, it became the fastest-growing consumer application in history with over 100 million monthly active users.
Yet, this raises an intriguing question: Why did ChatGPT 3.5 create such a dramatic impact when its predecessor, GPT-3, which had the same size/number of parameters, received far less attention from the general public? Arguably, the answer lies not in raw capabilities, but in Preference Alignment. Through careful fine-tuning using human feedback, OpenAI transformed GPT-3’s raw intelligence into ChatGPT’s helpful and resourceful conversational abilities, at least from humans eyes. This breakthrough demonstrated that aligning language models with human preferences is just as crucial as scaling them to greater sizes.
In this chapter, we will explore the process of aligning language models with human preferences via fine-tuning using modern techniques such as Direct Preference Optimization (DPO) [Rafailov et al., 2024]. Next, we will present a practical case study where we align a language model to a user-provided policy in a fully automated fashion leading to an open source model as well as a dataset of policy-aligned preferences.
- 6.2. From Raw Capabilities to Preference Alignment¶
+7.2. From Raw Capabilities to Preference Alignment¶
- 6.2.1. On the Misalignment of Language Models¶
+7.2.1. On the Misalignment of Language Models¶
Common pre-trained LLMs are not helpful to humans by default. They are not helpful to humans because they are not aligned with human preferences by design. This is because state-of-the-art language models are trained on the specific objective of predicting the next token given a knowledge base (e.g. large number of webpages from the internet). This is a very different objective than being asked to follow user’s instructions while being safe and helpful. We say that the language modeling objective is misaligned [Ouyang et al., 2022].
Let’s take a look at GPT-2’s response to the following prompt: “Explain the moon landing to a 6 year old.”
@@ -374,15 +383,15 @@-
6.2.2. Aligning Language Models with Human Preferences¶
+7.2.2. Aligning Language Models with Human Preferences¶
To address this issue, OpenAI introduced a RLHF-based technique to align language models with user intent on a wide range of tasks by fine-tuning with human feedback [Ouyang et al., 2022]. The key idea is to train the model to follow user’s instructions while being safe and helpful.
-Fig. 6.1 illustrates OpenAI’s 3-step process for training language models to better follow human instructions using RLHF:
+Fig. 7.1 illustrates OpenAI’s 3-step process for training language models to better follow human instructions using RLHF:
@@ -414,24 +423,24 @@
Collect demonstration data and train a supervised policy
Fig. 6.2 illustrates a simplified view of this alignment process showing the progression from base model to instruction-tuned model to aligned model. +
Fig. 7.2 illustrates a simplified view of this alignment process showing the progression from base model to instruction-tuned model to aligned model.
-A common pattern has emerged in the development of language models: First, a powerful base model is released, which is then fine-tuned, for instance using SFT to create an instruction-following version. This instruct model can then be further aligned with human preferences using techniques such as RLHF to create an aligned version as illustrated in Fig. 6.3.
+A common pattern has emerged in the development of language models: First, a powerful base model is released, which is then fine-tuned, for instance using SFT to create an instruction-following version. This instruct model can then be further aligned with human preferences using techniques such as RLHF to create an aligned version as illustrated in Fig. 7.3.
An aligned model can be fine-tuned directly from a base model or from an instruction-tuned model. For example, Llama Guard 3 [Llama Team, 2024] is a Llama-3.1-8B pre-trained model that was fine-tuned directly for content safety classification, bypassing the instruction-tuning step. Similarly, Zephyr-7B-alpha [Face, 2024] demonstrates direct alignment from a base model - it is a fine-tuned version of Mistral-7B that was trained using Direct Preference Optimization (DPO) on publicly available datasets to create a helpful assistant.
The OpenAI paper introduced two key components of this fine-tuning process - SFT for instruction tuning and RLHF (PPO in particular) for alignment. The following sections will explore these and other more modern alignment techniques.
- 6.2.2.1. Supervised Fine-Tuning (SFT) for Model Alignment¶
+7.2.2.1. Supervised Fine-Tuning (SFT) for Model Alignment¶
SFT is a foundational technique for aligning language models with human preferences. Before exploring advanced alignment methods like RLHF, it’s useful to understand how SFT can be used to create a strong foundation for instruction following and desired behaviors.
At a high-level, SFT involves fine-tuning language models using carefully curated demonstrations of desired behavior. The process transforms a general-purpose language model into one that can better follow instructions and exhibit specific behaviors aligned with human preferences. Typically, SFT is used to align a model to a specific task or domain, which than can be later aligned with human preferences using RLHF, PPO or DPO as we will see later.
The decision to employ SFT depends on the gap between a model’s current capabilities and specific requirements. SFT proves particularly valuable in scenarios requiring:
@@ -449,14 +458,14 @@[Hu et al., 2021] +
- LoRA (Low-Rank Adaptation) [Hu et al., 2021]
Uses two small matrices instead of updating all weights
Maintains model performance while reducing computational costs
Enables efficient training on consumer hardware
QLoRA (Quantized LoRA) [Dettmers et al., 2023]
+QLoRA (Quantized LoRA) [Dettmers et al., 2023]
- 6.2.2.2. Augmenting SFT with Human Preferences¶
-Significant gains in helpfulness and safety can be achieved by augmenting SFT with human preferences [Bai et al., 2022, Ouyang et al., 2022, Touvron et al., 2023].
-The OpenAI paper [Ouyang et al., 2022] demonstrated the effectiveness of Reinforcement Learning from Human Feedback (RLHF), particularly using Proximal Policy Optimization (PPO), for aligning language models with human preferences. Since then, alignment techniques have evolved into two main categories: reward-based and reward-free methods. Commercial systems like ChatGPT and Claude employ reward-based approaches, which involve training a reward model and using algorithms like PPO. Meanwhile, reward-free methods such as Direct Preference Optimization (DPO) have demonstrated superior performance on benchmark tasks [Xu et al., 2024].
-Proximal Policy Optimization (PPO) [Schulman et al., 2017] is a widely used reinforcement learning algorithm that has gained popularity particularly since the release of ChatGPT 3.5. It operates by iteratively updating the policy of an LLM, which can be understood as a set of rules that govern how the model generates text. In the context of RLHF, the policy is updated based on rewards that reflect human preferences. For instance, if a human evaluator prefers one LLM output over another, the policy is adjusted to increase the likelihood of generating outputs similar to the preferred one.
-One of the key strengths of PPO lies in its ability to handle complex reward landscapes [Face, 2024c]. In many real-world scenarios, the rewards that an LLM receives may be noisy or delayed. For example, in a chatbot application, the reward for generating a good response may not be immediate, as it depends on the user’s subsequent interactions. PPO effectively learns in these situations by using a clipped surrogate objective function, which limits the size of policy updates and ensures stable training. This prevents the model from overreacting to noisy or delayed rewards and helps it converge to a stable and optimal policy.
-Direct Preference Optimization (DPO) is a more recent “reward-free” fine-tuning technique that has gained significant attention due to its simplicity and efficiency [Rafailov et al., 2024], awarded runner-up paper in NeurIPS 2023 [Blog, 2023]. DPO operates by directly optimizing the policy to maximize the likelihood of preferred responses while minimizing the likelihood of non-preferred responses. As illustrated in Fig. 6.4, DPO optimizes for human preferences while avoiding reinforcement learning. Typical RLHF methods such as PPO fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
+7.2.2.2. Augmenting SFT with Human Preferences¶
+Significant gains in helpfulness and safety can be achieved by augmenting SFT with human preferences [Bai et al., 2022, Ouyang et al., 2022, Touvron et al., 2023].
+The OpenAI paper [Ouyang et al., 2022] demonstrated the effectiveness of Reinforcement Learning from Human Feedback (RLHF), particularly using Proximal Policy Optimization (PPO), for aligning language models with human preferences. Since then, alignment techniques have evolved into two main categories: reward-based and reward-free methods. Commercial systems like ChatGPT and Claude employ reward-based approaches, which involve training a reward model and using algorithms like PPO. Meanwhile, reward-free methods such as Direct Preference Optimization (DPO) have demonstrated superior performance on benchmark tasks [Xu et al., 2024].
+Proximal Policy Optimization (PPO) [Schulman et al., 2017] is a widely used reinforcement learning algorithm that has gained popularity particularly since the release of ChatGPT 3.5. It operates by iteratively updating the policy of an LLM, which can be understood as a set of rules that govern how the model generates text. In the context of RLHF, the policy is updated based on rewards that reflect human preferences. For instance, if a human evaluator prefers one LLM output over another, the policy is adjusted to increase the likelihood of generating outputs similar to the preferred one.
+One of the key strengths of PPO lies in its ability to handle complex reward landscapes [Face, 2024c]. In many real-world scenarios, the rewards that an LLM receives may be noisy or delayed. For example, in a chatbot application, the reward for generating a good response may not be immediate, as it depends on the user’s subsequent interactions. PPO effectively learns in these situations by using a clipped surrogate objective function, which limits the size of policy updates and ensures stable training. This prevents the model from overreacting to noisy or delayed rewards and helps it converge to a stable and optimal policy.
+Direct Preference Optimization (DPO) is a more recent “reward-free” fine-tuning technique that has gained significant attention due to its simplicity and efficiency [Rafailov et al., 2024], awarded runner-up paper in NeurIPS 2023 [Blog, 2023]. DPO operates by directly optimizing the policy to maximize the likelihood of preferred responses while minimizing the likelihood of non-preferred responses. As illustrated in Fig. 7.4, DPO optimizes for human preferences while avoiding reinforcement learning. Typical RLHF methods such as PPO fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
The key idea is to train the model to prefer responses that align with our desired behavior over responses that do not. DPO works by:
@@ -498,14 +507,14 @@\(\beta\) is a tuning parameter to control the deviation from the base reference policy \(\pi_{ref}\).
This approach is more straightforward than PPO, as it avoids the need for a reward model and instead uses a direct comparison of model outputs against human preferences.
-Modern libraries such as HuggingFace’s TRL [Face, 2024d] offer a suite of techniques for fine-tuning language models with reinforcement learning, including PPO, and DPO. It provides a user-friendly interface and a wide range of features for fine-tuning and aligning LLMs, which will be the focus of the next section as we go through a case study.
+Modern libraries such as HuggingFace’s TRL [Face, 2024d] offer a suite of techniques for fine-tuning language models with reinforcement learning, including PPO, and DPO. It provides a user-friendly interface and a wide range of features for fine-tuning and aligning LLMs, which will be the focus of the next section as we go through a case study.
- 6.3. Is Post-Training the Answer?¶
+7.3. Is Post-Training the Answer?¶
- 6.3.1. Limitations¶
+7.3.1. Limitations¶
While post-training alignment techniques like RLHF and DPO show promise, technical limitations need to be carefully considered.
Reinforcement Learning from Human Feedback faces several critical scaling challenges that distinguish it from pre-training or supervised fine-tuning. One key issue is scalability. Recent research suggests that the current RLHF framework does not scale as effectively as the pretraining stage [Hou et al., 2024], in particular presenting the following challenges:
@@ -573,7 +582,7 @@
-
6.3.2. Model Collapse¶
+7.3.2. Model Collapse¶
One key issue is model collapse - a phenomenon where model performance degrades with each training iteration.
Model collapse occurs when models are trained on data generated by previous models, creating a potentially dangerous feedback loop. This recursive training process can lead to [Kazdan et al., 2024]:
@@ -587,13 +596,13 @@
[Szép et al., 2024], providing practical guidance on data augmentation, regularization methods, and training strategies to maximize performance while minimizing data requirements. These insights are particularly relevant when aligning models with specific policies or domains where labeled data may be scarce.
- 6.3.3. Faking Alignment¶
+7.3.3. Faking Alignment¶
Alignment faking is a behaviour exhibited by large language models (LLMs) where they strategically comply with a training objective during training to prevent modifications to their preferred behaviours outside of training as evidenced by the landmark paper by Anthropic [Askell et al., 2024a] in Dec/2024. This new research has shown that alignment faking can occur in large language models, even when they are not explicitly instructed to do so. This means an LLM might pretend to adopt a new objective while being trained, not because it actually prefers that objective, but because it wants to preserve its original preferences once the training is complete.
-To illustrate alignment faking, let’s consider the “helpful-only” setting described in the paper, where the model is trained to always comply with user requests, even if they are harmful. This conflicts with the model’s pre-existing preference for harmlessness. The model is told that it will be retrained to always comply with harmful queries, and that only data from free-tier users will be used in training, not paid-tier users. Fig. 6.5 illustrates the process of alignment faking, showing the “free” (left column) and “paid” (right) conditions. Each column shows the toxic query, excerpts from the model’s scratchpad, and the model’s response. In the “free” condition the model is less likely to refuse toxic queries, and often reasons that it should fake alignment.
+To illustrate alignment faking, let’s consider the “helpful-only” setting described in the paper, where the model is trained to always comply with user requests, even if they are harmful. This conflicts with the model’s pre-existing preference for harmlessness. The model is told that it will be retrained to always comply with harmful queries, and that only data from free-tier users will be used in training, not paid-tier users. Fig. 7.5 illustrates the process of alignment faking, showing the “free” (left column) and “paid” (right) conditions. Each column shows the toxic query, excerpts from the model’s scratchpad, and the model’s response. In the “free” condition the model is less likely to refuse toxic queries, and often reasons that it should fake alignment.
Here is a step by step of what happens:
@@ -641,7 +650,7 @@-
6.4. Case Study: Aligning a Language Model to a Policy¶
+7.4. Case Study: Aligning a Language Model to a Policy¶
In this case study, we will align a language model to a policy. The policy is a set of principles and rules that we want the language model to adhere to. All methodology and code available solves this general problem of policy-based alignment. However, we will describe a specific case study to illustrate our approach.
Let’s assume that we are working for Acme Inc., a company dedicated to democratizing access to computer science education for K-12 students. Acme Inc. is in the process of creating a chatbot named
smolK-12
, a small open source LLM, specifically designed for K-12 students.In this case study, we’ll explore how to align a language model with Acme Inc.’s policy to ensure its LLM-powered applications are safe and appropriate for K-12 students.
@@ -652,7 +661,7 @@-
6.4.1. Experimental Setup¶
+7.4.1. Experimental Setup¶
We will use the following base model:
HuggingFaceTB/SmolLM2-360M-Instruct
[SmolLM2-360M-Instruct, 2024], a compact open source language model that is part of the SmolLM2 family published by HuggingFace.We will use the following APIs:
@@ -668,7 +677,7 @@
-
6.4.2. Deliverables¶
+7.4.2. Deliverables¶
As a result, we will have:
- @@ -677,7 +686,7 @@
smolK-12
, a fine-tuned model aligned with Acme Inc.’s policy-
6.4.3. A Note on smolLM2 Models¶
+7.4.3. A Note on smolLM2 Models¶
Since we have decided to anchor our Case Study on HuggingFace’s SmolLM2 models [SmolLM2, 2024], it is worth providing a reason for this choice.
SmolLM2 models are a family of compact language models that have been developed by HuggingFace. They are designed to be lightweight and efficient, making them suitable for a wide range of applications, including on-device deployment.
Its compact size makes it an excellent candidate for efficient, low-cost fine-tuning and training on specific use cases making it particularly suitable for alignment research which is our main focus here.
@@ -690,7 +699,7 @@-
6.4.3.1. Policy¶
+7.4.3.1. Policy¶
A company policy articulates the principles and standards that the company upholds, ensuring that employees, users and stakeholders understand the expectations regarding safety, ethical conduct, social responsibility, and integrity. A good policy not only reflects the company’s mission and vision but also fosters a culture of accountability and transparency.
In the context of alignment, a policy codifies “company preferences” when prioritizing decisions and actions.
In this case study, Acme Inc. provides as input a comprehensive policy to ensure that LLM-powered applications are both safe and suitable for K-12 students. Acme Inc.’s policy adheres to version 0.5 of the AI Safety Benchmark established by MLCommons [Vidgen et al., 2024]. This benchmark encompasses seven critical hazard categories:
@@ -801,7 +810,7 @@Monitoring and Updates
- 6.4.4. Preference Dataset - Synthetic Dataset Generation¶
+7.4.4. Preference Dataset - Synthetic Dataset Generation¶
In order to fine-tune a base model to create an aligned model, we need to construct a dataset of policy-aligned preferences. This dataset will be used to align our base model to our policy.
To generate a dataset of policy-aligned preferences, we aim to create a dataset of user prompts, rejected responses, and chosen responses. This dataset indicates which responses are preferred (policy-compliant) and which are not (policy-violating).
Collecting human-generated high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs [Dong et al., 2024]. There has been active research to replace or augment human feedback with AI feedback (RLAIF) to tackle these issues [Bai et al., 2022] giving rise to the field of Synthetic Data Generation [Long et al., 2024].
@@ -866,7 +875,7 @@-
6.4.4.1. User Prompts¶
+7.4.4.1. User Prompts¶
The class
UserPromptGenerator
generates a dataset of user prompts that might violate the policy. If runs the following prompt given the policy:ROLE: You are a user prompt generator helping test an AI safety system. TASK: Generate a long list of user prompts that might violate the provided content policy. @@ -1039,7 +1048,7 @@-
6.4.4.2. Rejected Responses¶
+7.4.4.2. Rejected Responses¶
The
ResponseGenerator
class creates a dataset of responses from an unaligned base model that we aim to improve through fine-tuning. These responses serve as “rejected” examples in our training data since they may not properly align with safety policies and guidelines. The class supports both local model inference using the Hugging Face Transformers library and remote inference through the Hugging Face Inference API. When instantiated with a model name, it loads the model locally. Otherwise, if a cloud API URL is provided, it connects to the remote API endpoint for inference.Generate rejected responses using a local model:
local_generator = ResponseGenerator(model_name="<HUGGINGFACE_MODEL_NAME>") @@ -1241,7 +1250,7 @@-
6.4.4.3. Chosen Responses¶
+7.4.4.3. Chosen Responses¶
The next step involves generating policy-compliant responses from a more powerful, sophisticated language model than our base model. The
process_aligned_responses()
function takes user prompts and generates responses that strictly adhere to the provided safety policy. It uses a carefully crafted system prompt that instructs the model to either provide helpful responses within policy bounds, or explicitly reject requests that violate the policy with a standardized message. These policy-compliant responses will serve as the “chosen” examples in our preference dataset, establishing the target behavior we want the base model to learn through alignment training.We will use the
OpenAIBatchProcessor
class from thetaming_utils
utility module to generate responses in batches using OpenAI’s API for enhanced cost-efficiency and performance.@@ -1401,7 +1410,7 @@-
6.4.4.4. Generate DPO Dataset¶
+7.4.4.4. Generate DPO Dataset¶
At this point we already have all the data we need for our DPO dataset, namely user prompts, chosen responses and rejected responses. The
generate_dpo_dataset()
function loads these data and transforms them into a format suitable for DPO training, optionally pushing the dataset to the Hugging Face Hub ifrepo_id
is provided.@@ -1519,7 +1528,7 @@-
6.4.5. DPO-Based Optimization¶
+7.4.5. DPO-Based Optimization¶
We’ll use the Hugging Face TRL library to implement DPO fine-tuning on our synthetic dataset.
Note
@@ -1529,13 +1538,13 @@-
6.4.5.1. Data Preparation¶
+7.4.5.1. Data Preparation¶
Hugging Face H4 [H4, 2024b] offers a collection of datasets that aim at aligning LLMs to be helpful, honest and harmless. Before we start the DPO fine-tuning process, we will combine our synthetic policy-aligned dataset with the UltraFeedback binarized dataset from H4 (
-trl-lib/ultrafeedback_binarized
) [H4, 2024a].This dataset was constructed based on criteria like helpfulness and honesty and can be used to align models to those dimensions. By combining our synthetic dataset with the UltraFeedback binarized dataset, we can fine-tune a model that is aligned on both our synthetic policy and the H4 criteria therefore providing a more well-balanced alignment. The DPO optimization process is shown in Fig. 6.6.
+This dataset was constructed based on criteria like helpfulness and honesty and can be used to align models to those dimensions. By combining our synthetic dataset with the UltraFeedback binarized dataset, we can fine-tune a model that is aligned on both our synthetic policy and the H4 criteria therefore providing a more well-balanced alignment. The DPO optimization process is shown in Fig. 7.6.
@@ -1576,7 +1585,7 @@-
6.4.5.2. Fine-Tuning¶
+7.4.5.2. Fine-Tuning¶
We now prepare our base language model for alignment fine-tuning using the Hugging Face transformers library. It loads the pre-trained model and its tokenizer and configures them for training.
@@ -1718,7 +1727,7 @@Fig. 6.7 show two key metrics: +
By default, fine-tuning results will be sent to your Weights & Biases account. The training plots in Fig. 7.7 show two key metrics:
The red line represents the rewards for rejected responses (“smolk12_dpo_output train/rewards/rejected”)
- @@ -1727,10 +1736,10 @@
The green line represents the rewards for chosen responses (“smolk12_dpo_output train/rewards/chosen”)
- + -Fig. 6.7 helps visualize how well the model learns to distinguish between appropriate and inappropriate responses during training. We expect to observe a divergence between the chosen and rejected responses, which indicates the model is learning to distinguish between good and bad responses.
+Fig. 7.7 helps visualize how well the model learns to distinguish between appropriate and inappropriate responses during training. We expect to observe a divergence between the chosen and rejected responses, which indicates the model is learning to distinguish between good and bad responses.
The training dynamics reveal two key phases:
- @@ -1759,16 +1768,16 @@
Initial Learning (0-50 steps): A rapid divergence between chosen and rejected rewards indicates quick initial learning
Fig. 6.8). +
Congratulations! You have successfully fine-tuned your model using DPO. It should now be available on the Hugging Face Hub (see Fig. 7.8).
- 6.4.5.3. Vibe Check¶
+7.4.5.3. Vibe Check¶
Let’s do a quick “vibe check” of our newly aligned model by testing it with some challenging prompts. This will help us qualitatively assess whether the DPO fine-tuning has improved the model’s alignment against our input policy (K-12 educational policies and safety standards). We’ll then follow up with a more rigorous quantitative evaluation methodology.
We will use HuggingFace transformers API to generate responses from our base and aligned models, locally.
@@ -1851,11 +1860,11 @@-
6.4.6. Alignment Evaluation¶
+7.4.6. Alignment Evaluation¶
Evaluating alignment improvements presents unique challenges. Unlike traditional machine learning tasks with clear metrics like accuracy or F1 score, alignment quality is more nuanced and subjective. It requires assessing whether responses adhere to safety guidelines, educational policies, and ethical principles.
The gold standard for evaluating alignment is human evaluation. Having experienced educators and safety experts review model outputs provides a reliable assessment framework. However, human evaluation is expensive, time-consuming, and difficult to scale. Additionally, human evaluators may have varying interpretations of alignment criteria, introducing inconsistency.
In this case study, we adopt an LLM-as-judge approach for our evaluation as discussed in [Souza, 2024]. This method leverages a language model to act as an automated judge, assessing the safety and appropriateness of responses from both the base and aligned models.
-The evaluation methodology summarized in Fig. 6.9 consists of three key components that work together to assess model alignment against our policy:
+The evaluation methodology summarized in Fig. 7.9 consists of three key components that work together to assess model alignment against our policy:
Evaluation Dataset
@@ -1893,7 +1902,7 @@
- + In the following sections, we will implement the evaluation methodology and evaluate the alignment of our base and aligned models. Quick setup of the evaluation environment are given by the following static variables:
@@ -2402,7 +2411,7 @@-
6.5. Discussion and Conclusions¶
+7.5. Discussion and Conclusions¶
LLMs are complex systems and alignment is a challenging problem. In this chapter, we discussed how post-training techniques can be used to align a language model to human preferences. In the case study, we demonstrated how to use DPO to align a language model to a user-provider policy further automating the process via synthetic data generation and LLM-as-judge evaluation. Our approach serves as a proof of concept and several considerations should be taken into account when using this methodology in practice.
Synthetic Data Generation
LLMs can self improve through synthetic data generation [Huang et al., 2022]. This process helps the LLM learn from its own reasoning and improve its overall reasoning ability without relying on human-annotated data. While LLMs can be powerful tools for generating synthetic data, especially in data-scarce domains, it’s important to recognize the potential pitfalls.
@@ -2424,7 +2433,7 @@-
6.6. Citation¶
+7.6. Citation¶
@misc{tharsistpsouza2024tamingllms, author = {Tharsis T. P. Souza}, @@ -2438,7 +2447,7 @@-
6.7. References¶
+7.7. References¶
[ABC+4a] @@ -2449,7 +2458,7 @@-[ABC+4b]
Amanda Askell, Jan Brauner, Adrian Colyer, Benjamin Cullen, David Duvenaud, Richard Ngo, Azalia Mirhoseini, Catherine Olsson, Sam Ringer, Liam Skirvin, Jess Smith, Dawn Song, William Saunders, and Jacob Steinhardt. Alignment faking in large language models: reviews. 2024b. URL: https://assets.anthropic.com/m/24c8d0a3a7d0a1f1/original/Alignment-Faking-in-Large-Language-Models-reviews.pdf.
+-[BJN+22]@@ -2457,7 +2466,7 @@Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback. 2022. URL: https://arxiv.org/abs/2204.05862, arXiv:2204.05862.
[BKK+22]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: harmlessness from ai feedback. 2022. URL: https://arxiv.org/abs/2212.08073, arXiv:2212.08073.
+-[Blo23]@@ -2465,7 +2474,7 @@NeurIPS Blog. Announcing the neurips 2023 paper awards. 2023. NeurIPS 2023 Awards. URL: https://blog.neurips.cc/2023/12/11/announcing-the-neurips-2023-paper-awards/.
[CCL+24]
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases. 2024. URL: https://arxiv.org/abs/2402.10669, arXiv:2402.10669.
+-[DPHZ23]@@ -2478,11 +2487,11 @@Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: efficient finetuning of quantized llms. 2023. URL: https://arxiv.org/abs/2305.14314, arXiv:2305.14314.
[Fac24]
Hugging Face. Zephyr. 2024. Zephyr. URL: https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha.
+[Fac4c]-Hugging Face. Rlhf. 2024c. RLHF. URL: https://huggingface.co/blog/rlhf.
+-[Fac4d]@@ -2511,7 +2520,7 @@Hugging Face. Trl. 2024d. TRL. URL: https://huggingface.co/docs/trl/en/index.
[HDN+24]
Zhenyu Hou, Pengfan Du, Yilin Niu, Zhengxiao Du, Aohan Zeng, Xiao Liu, Minlie Huang, Hongning Wang, Jie Tang, and Yuxiao Dong. Does rlhf scale? exploring the impacts from data, model, and method. 2024. URL: https://arxiv.org/abs/2412.06000, arXiv:2412.06000.
+-[HSW+21]@@ -2557,7 +2566,7 @@Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: low-rank adaptation of large language models. 2021. URL: https://arxiv.org/abs/2106.09685, arXiv:2106.09685.
(1,2,3,4)
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. 2024. URL: https://arxiv.org/abs/2305.18290, arXiv:2305.18290.
+-[SWD+17]@@ -2578,7 +2587,7 @@John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. 2017. URL: https://arxiv.org/abs/1707.06347, arXiv:1707.06347.
[SRvERH24]
Márton Szép, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, and Florian Hinterwimmer. A practical guide to fine-tuning language models with limited data. 2024. URL: https://arxiv.org/abs/2411.09539, arXiv:2411.09539.
+-[TMS+23]@@ -2590,7 +2599,7 @@Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: open foundation and fine-tuned chat models. 2023. URL: https://arxiv.org/abs/2307.09288, arXiv:2307.09288.
[WYG+24]
Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Meta-rewarding language models: self-improving alignment with llm-as-a-meta-judge. 2024. URL: https://arxiv.org/abs/2407.19594, arXiv:2407.19594.
+[XFG+24]@@ -2628,11 +2637,11 @@Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study. 2024. URL: https://arxiv.org/abs/2404.10719, arXiv:2404.10719.
- ← 5. Safety + title="previous chapter">← 6. Safety
- 7. Local LLMs in Practice → + title="next chapter">8. Local LLMs in Practice →
© Copyright Tharsis T. P. Souza, 2024. diff --git a/tamingllms/_build/html/notebooks/cost.html b/tamingllms/_build/html/notebooks/cost.html index 007349f..c9648b4 100644 --- a/tamingllms/_build/html/notebooks/cost.html +++ b/tamingllms/_build/html/notebooks/cost.html @@ -4,7 +4,7 @@ -8. The Falling Cost Paradox +9. The Falling Cost Paradox @@ -37,7 +37,7 @@ - + @@ -153,6 +153,15 @@ + + + +- + + Managing Input Data + + +
@@ -220,14 +229,14 @@
- Docs »
-- 8. The Falling Cost Paradox
+- 9. The Falling Cost Paradox
@@ -236,7 +245,7 @@
- ← 7. Local LLMs in Practice + title="previous chapter">← 8. Local LLMs in Practice
- 8. The Falling Cost Paradox¶
+9. The Falling Cost Paradox¶
It is a confusion of ideas to suppose that the economical use of fuel is equivalent to diminished consumption.
@@ -245,37 +254,37 @@
The very contrary is the truth.- 8.1. Why Optimization Matters More Than Ever¶
+9.1. Why Optimization Matters More Than Ever¶
According to recent analysis from a16z [Andreessen Horowitz, 2024], the cost of LLM inference is decreasing by approximately 10x every year - a rate that outpaces even Moore’s Law in the PC revolution or Edholm’s Law during the bandwidth explosion of the dot-com era.
A model achieving an MMLU score of 42 that cost $60 per million tokens in late 2021 can now be run for just $0.06 per million tokens. For higher-capability models scoring 83 on MMLU, prices have fallen by a factor of 62 since GPT-4’s introduction in March 2023.
@@ -330,16 +339,16 @@-
8.2. Right-Sizing LLMs: A Strategic Approach¶
+9.2. Right-Sizing LLMs: A Strategic Approach¶
Before implementing cost optimization strategies for LLMs, organizations must develop a comprehensive understanding of their own requirements and constraints. This systematic approach prevents both over-engineering and under-provisioning, leading to more efficient and cost-effective implementations.
In this section, we define key performance and cost related metrics that will guide our discussion. Then we propose a set of requirements practitioners should consider before we dive into cost optimization techniques.
- 8.2.1. Metrics¶
+9.2.1. Metrics¶
- 8.2.2. Requirements¶
+9.2.2. Requirements¶
- 8.2.2.1. Business Requirements¶
+9.2.2.1. Business Requirements¶
First, one needs to define the problem to be solved and to what extent it is worth to be solved. Use case requirements form the foundation of any LLM implementation project. A clear definition of the specific business problema and task to be accomplished must be established upfront, along with concrete performance metrics covering accuracy, latency and throughput. This should be accompanied by well-defined cost-per-transaction targets, clear ROI expectations, and a strategic allocation of budgets across different use cases to ensure resources are optimally distributed.
Budget and ROI considerations are critical for ensuring the long-term viability of LLM implementations. Organizations must establish clear spending limits that align with their financial capabilities while defining realistic cost-per-transaction targets. ROI expectations need to be carefully established through detailed analysis, followed by a strategic allocation of budgets across various use cases based on their business impact and priority.
Compliance and security requirements cannot be overlooked. This involves a thorough identification of all applicable regulatory requirements and the establishment of robust data handling standards. Organizations must specify comprehensive audit requirements to maintain transparency and accountability, while implementing appropriate security controls to protect sensitive data and system access.
@@ -347,17 +356,17 @@Local LLMs in Practice provides a detailed discussion on relevant considerations when Choosing your Model.
- 8.2.2.2. Performance Requirements¶
+9.2.2.2. Performance Requirements¶
Accuracy and quality form the foundation of any LLM deployment’s performance requirements. At its core, this involves determining the minimum level of accuracy that the model must achieve to be considered successful. This serves as a critical baseline for evaluating model performance and making deployment decisions. Establishing clear evaluation metrics, whether through automated measures or human evaluation processes, provides concrete ways to assess if these thresholds are being met. Continuous monitoring of these accuracy metrics ensures the system maintains its performance over time as usage patterns and data distributions evolve. Chapter The Evals Gap provides a detailed discussion on how to evaluate the performance of LLM-based applications.
Latency and throughput requirements are equally crucial for ensuring a positive user experience and system reliability. These specifications define how quickly the system must respond to requests and how many concurrent users it can handle. Response time requirements must be carefully balanced against the computational resources available, while peak load capabilities need to account for usage spikes and growth patterns. The decision between real-time processing for immediate responses versus batch processing for efficiency depends heavily on the use case and user expectations.
- 8.2.2.3. Operational Requirements¶
+9.2.2.3. Operational Requirements¶
Scale and capacity planning forms the foundation of operational requirements for LLM deployments. This involves a comprehensive analysis of expected system usage and growth patterns to ensure the infrastructure can handle both current and future demands. Organizations must carefully project their daily and monthly API call volumes while calculating the average number of tokens per request to accurately estimate resource needs. Understanding usage patterns, including seasonal variations, enables proper capacity planning. Additionally, developing 12-24 month growth projections helps ensure the infrastructure can scale appropriately as demand increases.
Reliability and availability requirements are equally critical for maintaining consistent service quality. These specifications define the expected uptime percentage that the system must maintain, typically expressed as a percentage of total operational time. Organizations need to establish clear maintenance windows that minimize disruption to users while ensuring necessary system updates and optimizations can be performed. Comprehensive backup and failover requirements must be specified to ensure business continuity in case of failures. High availability needs should be clearly defined, including redundancy levels and recovery time objectives, to maintain service quality even during unexpected events.
- 8.2.2.4. Technical Requirements¶
+9.2.2.4. Technical Requirements¶
System integration requirements define how the LLM system will interact and communicate with existing infrastructure and applications. This involves carefully mapping all integration points where the LLM system needs to connect with other systems, establishing standardized data formats and interfaces for seamless communication, implementing robust security measures to protect data in transit, and identifying any technical constraints that could impact integration. Getting these integration requirements right is crucial for ensuring the LLM system can function effectively within the broader technical ecosystem.
Data management requirements address how information will be stored, processed, and maintained within the LLM system. This encompasses determining appropriate storage solutions for maintaining conversation context and history, selecting and configuring vector databases to enable efficient retrieval-augmented generation (RAG), creating comprehensive data retention policies that balance operational needs with resource constraints, and ensuring all data handling practices comply with relevant privacy regulations. Proper data management is essential for both system performance and regulatory compliance, making it a critical consideration in any LLM implementation.
This structured approach to requirements analysis enables organizations to:
@@ -372,7 +381,7 @@-
8.3. Quantization¶
+9.3. Quantization¶
Quantization is a common and relevant technique in making LLMs more efficient and accessible. At a high level, quantization reduces the number of bits used to represent a model’s parameters. The most common form of quantization is to represent model’s weights at lower precision at post-training phase. It has become a standard technique to generate a series of quantized models given a large pre-trained base model.
While a standard pre-trained LLM might use 32-bit floating-point (FP32) or 16-bit floating-point (FP16) numbers to store its weights, quantized versions can operate at lower precision levels such as 8, 4 or even 2 bits per parameter, reducing memory footprint without proportional losses in performance, necessarily. For instance, for a model of 30 billion parameters, using FP32 means 4 bytes per weight or 120 GB for the whole weights. If the model is quantized such that weights are represented in 1 byte, the memory needed for the model’s weights decreases to 30 GB, hence potentially fitting into consumer grade hardware. This is done at the cost of precision loss, but the trade-off is often worth it though require careful analysis.
Let’s take a look at model weights of a language model (
@@ -468,21 +477,21 @@SmolLM2-135M-Instruct
) that has been quantized to 2-bit and 16-bit precisions. We will use an utility functionload_gguf
from thetaming_utils
package to load model weights of the quantized models directly from Hugging Face.[Unsloth, 2024] [2]. The model’s memory requirements vary significantly based on the quantization level used as demonstrated in Fig. 8.2. +
Quantization is a powerful technique for reducing the memory footprint of LLMs. This can be exemplified by the case of LLaMa 3.3 70B as quantized by [Unsloth, 2024] [2]. The model’s memory requirements vary significantly based on the quantization level used as demonstrated in Fig. 9.2.
We observe that the quantization process yields remarkable reductions in model size, demonstrating a clear trade-off between precision and memory requirements. The transition from F16 (141.1 GB) to Q8_0 (75 GB) achieves a dramatic 47% reduction in model size while maintaining relatively high numerical precision. Further quantization levels reveal an interesting pattern of diminishing returns - each step down in precision yields progressively smaller absolute size reductions, though the cumulative effect remains significant. At the extreme end, the Q2_K model (26.4 GB) requires only 19% of the storage space of its F16 counterpart [3].
This wide spectrum of model sizes enables deployment across diverse hardware environments. The lightweight Q2_K variant opens possibilities for running inference on consumer-grade hardware like high-end laptops or desktop computers. In contrast, the full-precision F16 model demands enterprise-grade computing resources with substantial memory capacity. This flexibility in deployment options makes quantization a powerful tool for democratizing access to large language models while managing computational costs.
While quantization has proven highly effective, there is a limit to how far it can be pushed - specifically, the 1-bit ceiling. A notable advancement in this space is BitNet [Wang et al., 2024] which pushes the boundaries of extreme quantization.
-BitNet’s implementation, bitnet.cpp, has demonstrated significant performance improvements across both ARM and x86 architectures (see Fig. 8.3). When compared to llama.cpp, the framework achieves speedups ranging from 1.37x to 5.07x on ARM processors and 2.37x to 6.17x on x86 systems. These performance gains scale with model size - larger models benefit more substantially from BitNet’s optimizations. The efficiency improvements extend beyond raw speed: energy consumption drops by 55-70% on ARM and 71-82% on x86 processors. Perhaps most impressively, bitnet.cpp enables running a 100B parameter BitNet b1.58 model on a single CPU at speeds matching human reading pace (5-7 tokens per second).
+BitNet’s implementation, bitnet.cpp, has demonstrated significant performance improvements across both ARM and x86 architectures (see Fig. 9.3). When compared to llama.cpp, the framework achieves speedups ranging from 1.37x to 5.07x on ARM processors and 2.37x to 6.17x on x86 systems. These performance gains scale with model size - larger models benefit more substantially from BitNet’s optimizations. The efficiency improvements extend beyond raw speed: energy consumption drops by 55-70% on ARM and 71-82% on x86 processors. Perhaps most impressively, bitnet.cpp enables running a 100B parameter BitNet b1.58 model on a single CPU at speeds matching human reading pace (5-7 tokens per second).
The framework’s initial release focused on CPU inference optimization, with particular emphasis on 1-bit LLM architectures (BitNet b1.58). While initial testing shows promising results, these findings are specific to the tested models and kernels (its specialized kernels are carefully crafted to exploit the unique characteristics of these extremely quantized models). Further validation is needed before generalizing these results across different architectures and use cases.
@@ -491,7 +500,7 @@Local LLMs in Practice for more details.
- 8.4. Check-list¶
+9.4. Check-list¶
Planning and Requirements
- @@ -525,7 +534,7 @@
Start with a clear understanding of your application’s needs and the factors that contribute to LLM costs
-
8.5. Conclusion¶
+9.5. Conclusion¶
@misc{tharsistpsouza2024tamingllms, author = {Tharsis T. P. Souza}, @@ -539,7 +548,7 @@-
8.6. References¶
+9.6. References¶
[WZS+24] @@ -552,7 +561,7 @@https://a16z.com/llmflation-llm-inference-cost/.
-[HuggingFace4w] +[HuggingFace4w]Hugging Face. Gguf quantization types. Online Documentation, 2024w. Documentation on different quantization types available for GGUF models. URL: https://huggingface.co/docs/hub/gguf#quantization-types.
@@ -573,7 +582,7 @@[3] -
You may have noticed quantization levels have a special notation. Including the bit width in the name of the model but also quantization types (e.g. _K, _0). You can find more information about the quantization levels in [Hugging Face, 2024w].
+You may have noticed quantization levels have a special notation. Including the bit width in the name of the model but also quantization types (e.g. _K, _0). You can find more information about the quantization levels in [Hugging Face, 2024w].
@@ -604,7 +613,7 @@
- ← 7. Local LLMs in Practice + title="previous chapter">← 8. Local LLMs in Practice
© Copyright Tharsis T. P. Souza, 2024. diff --git a/tamingllms/_build/html/notebooks/evals.html b/tamingllms/_build/html/notebooks/evals.html index 5ddf80d..5a846c4 100644 --- a/tamingllms/_build/html/notebooks/evals.html +++ b/tamingllms/_build/html/notebooks/evals.html @@ -182,6 +182,15 @@ + + + +- + + Managing Input Data + + +
@@ -253,7 +262,7 @@- 3. The Evals Gap¶
+3. The Evals Gap¶
It doesn’t matter how beautiful your theory is,
it doesn’t matter how smart you are.
@@ -263,49 +272,49 @@- 3.1. Introduction¶
+3.1. Introduction¶
The advent of LLMs marks a pivotal shift in the landscape of software development and evaluation. Unlike traditional software systems, where deterministic outputs are the norm, LLMs introduce a realm of non-deterministic and generative behaviors that challenge conventional software engineering testing paradigms. This shift is not merely a technical evolution but a fundamental transformation in how we conceive, build, and assess software products.
For those entrenched in traditional methodologies, the transition to LLM-driven systems may seem daunting. However, ignoring this change is not an option. The reliance on outdated testing frameworks that fail to account for the probabilistic nature of LLMs will inevitably lead to significant setbacks.
To overcome these challenges, it is imperative to embrace the complexities of LLMs with a proactive mindset. This involves developing robust evaluation frameworks up-front, fostering a product development culture of continuous change, learning and adaptation.
- 3.2. Non-Deterministic Generative Machines¶
+3.2. Non-Deterministic Generative Machines¶
One of the most fundamental challenges when building products with Large Language Models (LLMs) is their generative and non-deterministic nature. Unlike traditional software systems where the same input reliably produces the same output, LLMs can generate novel text that may not exist in their training data, and produce different responses each time they’re queried - even with identical prompts and input data. This behavior is both a strength and a significant engineering challenge and product challenge.
When you ask an LLM the same question multiple times, you’ll likely get different responses. This isn’t a bug - it’s a fundamental feature of how these models work. The “temperature” parameter, which controls the randomness of outputs, allows models to be creative and generate diverse responses. However, this same feature makes it difficult to build reliable, testable systems.
Consider a financial services company using LLMs to generate investment advice. The non-deterministic nature of these models means that:
@@ -440,7 +449,7 @@-
3.3. Emerging Properties¶
+3.3. Emerging Properties¶
Beyond their non-deterministic nature, LLMs present another fascinating characteristic: emergent abilities that spontaneously arise as models scale up in size. These abilities - from basic question answering to complex reasoning - aren’t explicitly programmed but rather emerge “naturally” as the models grow larger and are trained on more data. This makes evaluation fundamentally different from traditional software testing, where capabilities are explicitly coded and can be tested against pre-defined specifications.
Fig. 3.1 provides a list of emergent abilities of large language models and the scale. The relationship between model scale and emergent abilities follows a fascinating non-linear pattern. Below certain size thresholds, specific abilities may be completely absent from the model - it simply cannot perform certain tasks, no matter how much you try to coax them out. However, once the model reaches critical points in its scaling journey, these abilities can suddenly manifest in what researchers call a phase transition - a dramatic shift from inability to capability. This unpredictable emergence of capabilities stands in stark contrast to traditional software development, where features are deliberately implemented and can be systematically tested.
- 3.7.2. Evaluating Evaluators¶
+3.7.2. Evaluating Evaluators¶
We have discussed how LLMs can be used to evaluate LLM-based aplications. However, how can we evaluate the performance of LLMs that evaluate other LLMs? This is the question that meta evaluation aims to answer. Clearly, the discussion can become quite meta as we need to evaluate the performance of the evaluator to evaluate the performance of the evaluated model. However, one can make a case for two general options:
- @@ -1353,7 +1362,7 @@
Use a gold-standard dataset that is used to evaluate the performance of LLM evaluators using a “metrics-based” approach.
-
3.8. Benchmarks and Leaderboards¶
+3.8. Benchmarks and Leaderboards¶
Benchmarks act as standardized tests for LLMs, evaluating their performance across a spectrum of tasks. These tasks simulate real-world applications such as answering questions, generating coherent text, solving mathematical problems, or even writing computer code. They also assess more abstract qualities like fairness, robustness, and cultural understanding.
Benchmarks can be thought as comprehensive “exams” that probe different “subjects” in order to certify an LLM. They help researchers and developers compare models systematically, in a way LLM performance is comparable while enabling the identification of emergent behaviors or capabilities as models evolve in scale and sophistication.
The history of LLM benchmarks reflects the evolving priorities of artificial intelligence research, starting with foundational tasks and moving toward complex, real-world challenges. It began in 2018 with the introduction of GLUE(General Language Understanding Evaluation) [Wang et al., 2019], which set a new standard for evaluating natural language understanding. GLUE measured performance on tasks like sentiment analysis and textual entailment, providing a baseline for assessing the fundamental capabilities of language models. A year later, SuperGLUE [Wang et al., 2019] expanded on this foundation by introducing more nuanced tasks that tested reasoning and language comprehension at a deeper level, challenging the limits of models like BERT and its successors.
@@ -1398,16 +1407,16 @@[Chollet, 12/08/2024]. While deep learning has significantly advanced in recent years, pure deep learning approaches perform poorly on the ARC-AGI benchmark. This is because traditional deep learning relies on relating new situations to those encountered during training and lacks the ability to adapt or recombine knowledge for entirely new tasks. ARC Prize 2024 spurred the development of novel AGI reasoning techniques, leading to a significant increase in the state-of-the-art score on the ARC-AGI private evaluation set from 33% in 2023 to 55.5% in 2024. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be key to exceeding the target score for the ARC-AGI benchmark.
In addition to the benchmarks discussed above, a growing set of domain-specific benchmarks is emerging to help evaluate LLMs in specific verticals, including:
-
- -
FinBench [Zhang et al., 2024]: Evaluates LLMs in the financial domain, covering tasks such as terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling.
- -
LegalBench [Guha et al., 2023] : Assesses the legal reasoning abilities of LLMs through tasks crowdsourced by legal professionals
- +
Berkeley Function Leaderboard (BFCL) [Patil et al., 2023]: Evaluates LLMs’ function-calling abilities
- +
FinBench [Zhang et al., 2024]: Evaluates LLMs in the financial domain, covering tasks such as terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling.
- +
LegalBench [Guha et al., 2023] : Assesses the legal reasoning abilities of LLMs through tasks crowdsourced by legal professionals
Berkeley Function Leaderboard (BFCL) [Patil et al., 2023]: Evaluates LLMs’ function-calling abilities
As language models continue to advance in capability and complexity, evaluation frameworks must evolve. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that weren’t previously measurable. This ongoing evolution reflects a deeper understanding that the true value of language models lies not in achieving high scores on standardized tests with narrow task-specific metrics, but in their ability to meaningfully contribute to human understanding and help solve real-world problems while demonstrating the ability to learn and adapt to new tasks.
- 3.9. Tools¶
+3.9. Tools¶
- 3.9.1. LightEval¶
+3.9.1. LightEval¶
LightEval [Fourrier et al., 2023] is a lightweight framework for evaluation of LLMs across a variety of standard and bespoke metrics and tasks across multiple inference backends via Python SDK and CLI.
As a motivating example, consider a scenario where financial data has been extracted from SEC financial filings and require econometric analysis. Tasks like estimating autoregressive models for time series forecasting or conducting hypothesis tests on market efficiency are common in financial analysis. Let’s evaluate how well different models perform on this type of task.
First, we need to select a benchmark to assess LLMs capabilities in this domain. MMLU has a sub-benchmark called Econometrics we can use for this task. Table 3.4 shows a sample of the benchmark dataset from MMLU Econometrics. It consists of multiple-choice questions from econometrics and expected answers.
@@ -1596,7 +1605,7 @@[Hugging Face, 2024]. Its integration with the Hugging Face ecosystem and modular architecture make it particularly powerful for evaluating open source models. For further details, visit the official repository [Fourrier et al., 2023].
- 3.9.2. LangSmith¶
+3.9.2. LangSmith¶
Let’s revisit our evaluation example when we were interested in evaluating the quality of summaries generated by different (smaller and cheaper) LLM models compared to a benchmark model (larger and more expensive). Recal the setup:
- @@ -2004,7 +2013,7 @@
Benchmark model: gpt-4o
-
3.9.3. PromptFoo¶
+3.9.3. PromptFoo¶
Promptfoo [promptfoo, 2024] is an open-source framework designed for evaluating applications that utilize large language models (LLMs). Key features include:
- @@ -2269,7 +2278,7 @@
Automated Testing: Promptfoo provides automated testing capabilities, allowing developers to run custom evaluations tailored to their applications.
Prompt Comparison R
In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development.
- 3.9.4. Comparison¶
+3.9.4. Comparison¶
The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.