5. Preference-Based Alignment¶
+6. Preference-Based Alignment¶
-Move fast and be responsible.
-—Andrew Ng
+A people that values its privileges above its principles soon loses both.
+—Dwight D. Eisenhower
- 5.1. Introduction¶
+6.1. Introduction¶
The release of ChatGPT 3.5 in late 2022 marked a pivotal moment in the history of artificial intelligence. Within just five days of its launch, the model attracted over a million users, and within two months, it became the fastest-growing consumer application in history with over 100 million monthly active users.
Yet, this raises an intriguing question: Why did ChatGPT 3.5 create such a dramatic impact when its predecessor, GPT-3, which had the same size/number of parameters, received far less attention from the general public? Arguably, the answer lies not in raw capabilities, but in Preference Alignment. Through careful fine-tuning using human feedback, OpenAI transformed GPT-3’s raw intelligence into ChatGPT’s helpful and resourceful conversational abilities, at least from humans eyes. This breakthrough demonstrated that aligning language models with human preferences is just as crucial as scaling them to greater sizes.
In this chapter, we will explore the process of aligning language models with human preferences via fine-tuning using modern techniques such as Direct Preference Optimization (DPO) [Rafailov et al., 2024]. Next, we will present a practical case study where we align a language model to a user-provided policy in a fully automated fashion leading to an open source model as well as a dataset of policy-aligned preferences.
- 5.2. From Raw Capabilities to Preference Alignment¶
+6.2. From Raw Capabilities to Preference Alignment¶
- 5.2.1. On the Misalignment of Language Models¶
+6.2.1. On the Misalignment of Language Models¶
Common pre-trained LLMs are not helpful to humans by default. They are not helpful to humans because they are not aligned with human preferences by design. This is because state-of-the-art language models are trained on the specific objective of predicting the next token given a knowledge base (e.g. large number of webpages from the internet). This is a very different objective than being asked to follow user’s instructions while being safe and helpful. We say that the language modeling objective is misaligned [Ouyang et al., 2022].
Let’s take a look at GPT-2’s response to the following prompt: “Explain the moon landing to a 6 year old.”
@@ -318,15 +327,15 @@-
5.2.2. Aligning Language Models with Human Preferences¶
+6.2.2. Aligning Language Models with Human Preferences¶
To address this issue, OpenAI introduced a RLHF-based technique to align language models with user intent on a wide range of tasks by fine-tuning with human feedback [Ouyang et al., 2022]. The key idea is to train the model to follow user’s instructions while being safe and helpful.
-Fig. 5.1 illustrates OpenAI’s 3-step process for training language models to better follow human instructions using RLHF:
+Fig. 6.1 illustrates OpenAI’s 3-step process for training language models to better follow human instructions using RLHF:
@@ -358,24 +367,24 @@
Collect demonstration data and train a supervised policy
Fig. 5.2 illustrates a simplified view of this alignment process showing the progression from base model to instruction-tuned model to aligned model. +
Fig. 6.2 illustrates a simplified view of this alignment process showing the progression from base model to instruction-tuned model to aligned model.
-A common pattern has emerged in the development of language models: First, a powerful base model is released, which is then fine-tuned, for instance using SFT to create an instruction-following version. This instruct model can then be further aligned with human preferences using techniques such as RLHF to create an aligned version as illustrated in Fig. 5.3.
+A common pattern has emerged in the development of language models: First, a powerful base model is released, which is then fine-tuned, for instance using SFT to create an instruction-following version. This instruct model can then be further aligned with human preferences using techniques such as RLHF to create an aligned version as illustrated in Fig. 6.3.
An aligned model can be fine-tuned directly from a base model or from an instruction-tuned model. For example, Llama Guard 3 [Llama Team, 2024] is a Llama-3.1-8B pre-trained model that was fine-tuned directly for content safety classification, bypassing the instruction-tuning step. Similarly, Zephyr-7B-alpha [Face, 2024] demonstrates direct alignment from a base model - it is a fine-tuned version of Mistral-7B that was trained using Direct Preference Optimization (DPO) on publicly available datasets to create a helpful assistant.
The OpenAI paper introduced two key components of this fine-tuning process - SFT for instruction tuning and RLHF (PPO in particular) for alignment. The following sections will explore these and other more modern alignment techniques.
- 5.2.2.1. Supervised Fine-Tuning (SFT) for Model Alignment¶
+6.2.2.1. Supervised Fine-Tuning (SFT) for Model Alignment¶
SFT is a foundational technique for aligning language models with human preferences. Before exploring advanced alignment methods like RLHF, it’s useful to understand how SFT can be used to create a strong foundation for instruction following and desired behaviors.
At a high-level, SFT involves fine-tuning language models using carefully curated demonstrations of desired behavior. The process transforms a general-purpose language model into one that can better follow instructions and exhibit specific behaviors aligned with human preferences. Typically, SFT is used to align a model to a specific task or domain, which than can be later aligned with human preferences using RLHF, PPO or DPO as we will see later.
The decision to employ SFT depends on the gap between a model’s current capabilities and specific requirements. SFT proves particularly valuable in scenarios requiring:
@@ -412,16 +421,16 @@[Rafailov et al., 2024] to maximize human preference rather than clone their behavior, which has been shown to be more effective than SFT alone [Ouyang et al., 2022], which we will explore next.
- 5.2.2.2. Augmenting SFT with Human Preferences¶
+6.2.2.2. Augmenting SFT with Human Preferences¶
Significant gains in helpfulness and safety can be achieved by augmenting SFT with human preferences [Bai et al., 2022, Ouyang et al., 2022, Touvron et al., 2023].
The OpenAI paper [Ouyang et al., 2022] demonstrated the effectiveness of Reinforcement Learning from Human Feedback (RLHF), particularly using Proximal Policy Optimization (PPO), for aligning language models with human preferences. Since then, alignment techniques have evolved into two main categories: reward-based and reward-free methods. Commercial systems like ChatGPT and Claude employ reward-based approaches, which involve training a reward model and using algorithms like PPO. Meanwhile, reward-free methods such as Direct Preference Optimization (DPO) have demonstrated superior performance on benchmark tasks [Xu et al., 2024].
Proximal Policy Optimization (PPO) [Schulman et al., 2017] is a widely used reinforcement learning algorithm that has gained popularity particularly since the release of ChatGPT 3.5. It operates by iteratively updating the policy of an LLM, which can be understood as a set of rules that govern how the model generates text. In the context of RLHF, the policy is updated based on rewards that reflect human preferences. For instance, if a human evaluator prefers one LLM output over another, the policy is adjusted to increase the likelihood of generating outputs similar to the preferred one.
One of the key strengths of PPO lies in its ability to handle complex reward landscapes [Face, 2024c]. In many real-world scenarios, the rewards that an LLM receives may be noisy or delayed. For example, in a chatbot application, the reward for generating a good response may not be immediate, as it depends on the user’s subsequent interactions. PPO effectively learns in these situations by using a clipped surrogate objective function, which limits the size of policy updates and ensures stable training. This prevents the model from overreacting to noisy or delayed rewards and helps it converge to a stable and optimal policy.
-Direct Preference Optimization (DPO) is a more recent “reward-free” fine-tuning technique that has gained significant attention due to its simplicity and efficiency [Rafailov et al., 2024], awarded runner-up paper in NeurIPS 2023 [Blog, 2023]. DPO operates by directly optimizing the policy to maximize the likelihood of preferred responses while minimizing the likelihood of non-preferred responses. As illustrated in Fig. 5.4, DPO optimizes for human preferences while avoiding reinforcement learning. Typical RLHF methods such as PPO fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
+Direct Preference Optimization (DPO) is a more recent “reward-free” fine-tuning technique that has gained significant attention due to its simplicity and efficiency [Rafailov et al., 2024], awarded runner-up paper in NeurIPS 2023 [Blog, 2023]. DPO operates by directly optimizing the policy to maximize the likelihood of preferred responses while minimizing the likelihood of non-preferred responses. As illustrated in Fig. 6.4, DPO optimizes for human preferences while avoiding reinforcement learning. Typical RLHF methods such as PPO fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
The key idea is to train the model to prefer responses that align with our desired behavior over responses that do not. DPO works by:
@@ -441,7 +450,7 @@-
5.3. Case Study: Aligning a Language Model to a Policy¶
+6.3. Case Study: Aligning a Language Model to a Policy¶
In this case study, we will align a language model to a policy. The policy is a set of principles and rules that we want the language model to adhere to. All methodology and code available solves this general problem of policy-based alignment. However, we will describe a specific case study to illustrate our approach.
Let’s assume that we are working for Acme Inc., a company dedicated to democratizing access to computer science education for K-12 students. Acme Inc. is in the process of creating a chatbot named
smolK-12
, a small open source LLM, specifically designed for K-12 students.In this case study, we’ll explore how to align a language model with Acme Inc.’s policy to ensure its LLM-powered applications are safe and appropriate for K-12 students.
@@ -452,9 +461,9 @@-
5.3.1. Introduction¶
+6.3.1. Introduction¶
- 5.3.1.1. Experimental Setup¶
+6.3.1.1. Experimental Setup¶
We will use the following base model:
HuggingFaceTB/SmolLM2-360M-Instruct
[SmolLM2-360M-Instruct, 2024], a compact open source language model that is part of the SmolLM2 family published by HuggingFace.We will use the following APIs:
@@ -470,7 +479,7 @@
-
5.3.1.2. Deliverables¶
+6.3.1.2. Deliverables¶
As a result, we will have:
- @@ -479,7 +488,7 @@
smolK-12
, a fine-tuned model aligned with Acme Inc.’s policy-
5.3.1.3. A Note on smolLM2 Models¶
+6.3.1.3. A Note on smolLM2 Models¶
Since we have decided to anchor our Case Study on HuggingFace’s SmolLM2 models [SmolLM2, 2024], it is worth providing a reason for this choice.
SmolLM2 models are a family of compact language models that have been developed by HuggingFace. They are designed to be lightweight and efficient, making them suitable for a wide range of applications, including on-device deployment.
Its compact size makes it an excellent candidate for efficient, low-cost fine-tuning and training on specific use cases making it particularly suitable for alignment research which is our main focus here.
@@ -493,7 +502,7 @@-
5.3.1.4. Policy¶
+6.3.1.4. Policy¶
A company policy articulates the principles and standards that the company upholds, ensuring that employees, users and stakeholders understand the expectations regarding safety, ethical conduct, social responsibility, and integrity. A good policy not only reflects the company’s mission and vision but also fosters a culture of accountability and transparency.
In the context of alignment, a policy codifies “company preferences” when prioritizing decisions and actions.
In this case study, Acme Inc. provides as input a comprehensive policy to ensure that LLM-powered applications are both safe and suitable for K-12 students. Acme Inc.’s policy adheres to version 0.5 of the AI Safety Benchmark established by MLCommons [Vidgen et al., 2024]. This benchmark encompasses seven critical hazard categories:
@@ -604,7 +613,7 @@Monitoring and Updates
- 5.3.2. Preference Dataset - Synthetic Dataset Generation¶
+6.3.2. Preference Dataset - Synthetic Dataset Generation¶
In order to fine-tune a base model to create an aligned model, we need to construct a dataset of policy-aligned preferences. This dataset will be used to align our base model to our policy.
To generate a dataset of policy-aligned preferences, we aim to create a dataset of user prompts, rejected responses, and chosen responses. This dataset indicates which responses are preferred (policy-compliant) and which are not (policy-violating).
Collecting human-generated high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs [Dong et al., 2024]. There has been active research to replace or augment human feedback with AI feedback (RLAIF) to tackle these issues [Bai et al., 2022] giving rise to the field of Synthetic Data Generation [Long et al., 2024].
@@ -669,7 +678,7 @@-
5.3.2.1. User Prompts¶
+6.3.2.1. User Prompts¶
The class
UserPromptGenerator
generates a dataset of user prompts that might violate the policy. If runs the following prompt given the policy:ROLE: You are a user prompt generator helping test an AI safety system. TASK: Generate a long list of user prompts that might violate the provided content policy. @@ -842,7 +851,7 @@-
5.3.2.2. Rejected Responses¶
+6.3.2.2. Rejected Responses¶
The
ResponseGenerator
class creates a dataset of responses from an unaligned base model that we aim to improve through fine-tuning. These responses serve as “rejected” examples in our training data since they may not properly align with safety policies and guidelines. The class supports both local model inference using the Hugging Face Transformers library and remote inference through the Hugging Face Inference API. When instantiated with a model name, it loads the model locally. Otherwise, if a cloud API URL is provided, it connects to the remote API endpoint for inference.Generate rejected responses using a local model:
local_generator = ResponseGenerator(model_name="<HUGGINGFACE_MODEL_NAME>") @@ -1044,7 +1053,7 @@-
5.3.2.3. Chosen Responses¶
+6.3.2.3. Chosen Responses¶
The next step involves generating policy-compliant responses from a more powerful, sophisticated language model than our base model. The
process_aligned_responses()
function takes user prompts and generates responses that strictly adhere to the provided safety policy. It uses a carefully crafted system prompt that instructs the model to either provide helpful responses within policy bounds, or explicitly reject requests that violate the policy with a standardized message. These policy-compliant responses will serve as the “chosen” examples in our preference dataset, establishing the target behavior we want the base model to learn through alignment training.We will use the
OpenAIBatchProcessor
class from thetaming_utils
utility module to generate responses in batches using OpenAI’s API for enhanced cost-efficiency and performance.@@ -1204,7 +1213,7 @@-
5.3.2.4. Generate DPO Dataset¶
+6.3.2.4. Generate DPO Dataset¶
At this point we already have all the data we need for our DPO dataset, namely user prompts, chosen responses and rejected responses. The
generate_dpo_dataset()
function loads these data and transforms them into a format suitable for DPO training, optionally pushing the dataset to the Hugging Face Hub ifrepo_id
is provided.@@ -1322,7 +1331,7 @@-
5.3.3. DPO-Based Optimization¶
+6.3.3. DPO-Based Optimization¶
We’ll use the Hugging Face TRL library to implement DPO fine-tuning on our synthetic dataset.
Note
@@ -1332,13 +1341,13 @@-
5.3.3.1. Data Preparation¶
+6.3.3.1. Data Preparation¶
Hugging Face H4 [H4, 2024b] offers a collection of datasets that aim at aligning LLMs to be helpful, honest and harmless. Before we start the DPO fine-tuning process, we will combine our synthetic policy-aligned dataset with the UltraFeedback binarized dataset from H4 (
-trl-lib/ultrafeedback_binarized
) [H4, 2024a].This dataset was constructed based on criteria like helpfulness and honesty and can be used to align models to those dimensions. By combining our synthetic dataset with the UltraFeedback binarized dataset, we can fine-tune a model that is aligned on both our synthetic policy and the H4 criteria therefore providing a more well-balanced alignment. The DPO optimization process is shown in Fig. 5.5.
+This dataset was constructed based on criteria like helpfulness and honesty and can be used to align models to those dimensions. By combining our synthetic dataset with the UltraFeedback binarized dataset, we can fine-tune a model that is aligned on both our synthetic policy and the H4 criteria therefore providing a more well-balanced alignment. The DPO optimization process is shown in Fig. 6.5.
@@ -1379,7 +1388,7 @@-
5.3.3.2. Fine-Tuning¶
+6.3.3.2. Fine-Tuning¶
We now prepare our base language model for alignment fine-tuning using the Hugging Face transformers library. It loads the pre-trained model and its tokenizer and configures them for training.
@@ -1521,7 +1530,7 @@Fig. 5.6 show two key metrics: +
By default, fine-tuning results will be sent to your Weights & Biases account. The training plots in Fig. 6.6 show two key metrics:
The red line represents the rewards for rejected responses (“smolk12_dpo_output train/rewards/rejected”)
- @@ -1530,10 +1539,10 @@
The green line represents the rewards for chosen responses (“smolk12_dpo_output train/rewards/chosen”)
- + -Fig. 5.6 helps visualize how well the model learns to distinguish between appropriate and inappropriate responses during training. We expect to observe a divergence between the chosen and rejected responses, which indicates the model is learning to distinguish between good and bad responses.
+Fig. 6.6 helps visualize how well the model learns to distinguish between appropriate and inappropriate responses during training. We expect to observe a divergence between the chosen and rejected responses, which indicates the model is learning to distinguish between good and bad responses.
The training dynamics reveal two key phases:
- @@ -1562,16 +1571,16 @@
Initial Learning (0-50 steps): A rapid divergence between chosen and rejected rewards indicates quick initial learning
Fig. 5.7). +
Congratulations! You have successfully fine-tuned your model using DPO. It should now be available on the Hugging Face Hub (see Fig. 6.7).
- 5.3.3.3. Vibe Check¶
+6.3.3.3. Vibe Check¶
Let’s do a quick “vibe check” of our newly aligned model by testing it with some challenging prompts. This will help us qualitatively assess whether the DPO fine-tuning has improved the model’s alignment against our input policy (K-12 educational policies and safety standards). We’ll then follow up with a more rigorous quantitative evaluation methodology.
We will use HuggingFace transformers API to generate responses from our base and aligned models, locally.
@@ -1654,11 +1663,11 @@-
5.3.4. Alignment Evaluation¶
+6.3.4. Alignment Evaluation¶
Evaluating alignment improvements presents unique challenges. Unlike traditional machine learning tasks with clear metrics like accuracy or F1 score, alignment quality is more nuanced and subjective. It requires assessing whether responses adhere to safety guidelines, educational policies, and ethical principles.
The gold standard for evaluating alignment is human evaluation. Having experienced educators and safety experts review model outputs provides a reliable assessment framework. However, human evaluation is expensive, time-consuming, and difficult to scale. Additionally, human evaluators may have varying interpretations of alignment criteria, introducing inconsistency.
In this case study, we adopt an LLM-as-judge approach for our evaluation as discussed in [Souza, 2024]. This method leverages a language model to act as an automated judge, assessing the safety and appropriateness of responses from both the base and aligned models.
-The evaluation methodology summarized in Fig. 5.8 consists of three key components that work together to assess model alignment against our policy:
+The evaluation methodology summarized in Fig. 6.8 consists of three key components that work together to assess model alignment against our policy:
Evaluation Dataset
@@ -1696,7 +1705,7 @@
- + In the following sections, we will implement the evaluation methodology and evaluate the alignment of our base and aligned models. Quick setup of the evaluation environment are given by the following static variables:
@@ -2204,7 +2213,7 @@-
5.3.5. Discussion¶
+6.3.5. Discussion¶
LLMs are complex systems and alignment is a challenging problem. In this case study, we demonstrated how to use DPO to align a language model to a policy further automating the process via synthetic data generation and LLM-as-judge evaluation. Our approach does serve as a proof of concept, however, several considerations should be taken into account when using this methodology in practice.
Synthetic Data Generation
LLMs can self improve through synthetic data generation [Huang et al., 2022]. This process helps the LLM learn from its own reasoning and improve its overall reasoning ability without relying on human-annotated data. While LLMs can be powerful tools for generating synthetic data, especially in data-scarce domains, it’s important to recognize the potential pitfalls.
@@ -2227,7 +2236,7 @@-
5.4. Citation¶
+6.4. Citation¶
@misc{tharsistpsouza2024tamingllms, author = {Tharsis T. P. Souza}, @@ -2240,7 +2249,7 @@-
5.5. References¶
+6.5. References¶
[BJN+22] @@ -2404,8 +2413,8 @@
diff --git a/tamingllms/_build/html/notebooks/evals.html b/tamingllms/_build/html/notebooks/evals.html index e13c059..a4fa3bd 100644 --- a/tamingllms/_build/html/notebooks/evals.html +++ b/tamingllms/_build/html/notebooks/evals.html @@ -38,7 +38,7 @@ - + @@ -170,6 +170,15 @@ +- + + Safety + + + +
+ +- Preference-Based Alignment @@ -201,8 +210,8 @@ title="previous chapter">← 3. Wrestling with Structured Output
- - 5. Preference-Based Alignment → + 5. Safety →
@@ -211,7 +220,7 @@- 4. The Evals Gap¶
+4. The Evals Gap¶
It doesn’t matter how beautiful your theory is,
it doesn’t matter how smart you are.
@@ -221,48 +230,48 @@
Contents
-
- +
-
- -
- -
- -
- -
- +
- +
- +
- +
- +
- -
- -
- +
- +
- -
- -
- +
- +
- -
- -
- +
- +
- 4.1. Introduction¶
+4.1. Introduction¶
The advent of LLMs marks a pivotal shift in the landscape of software development and evaluation. Unlike traditional software systems, where deterministic outputs are the norm, LLMs introduce a realm of non-deterministic and generative behaviors that challenge conventional software engineering testing paradigms. This shift is not merely a technical evolution but a fundamental transformation in how we conceive, build, and assess software products.
For those entrenched in traditional methodologies, the transition to LLM-driven systems may seem daunting. However, ignoring this change is not an option. The reliance on outdated testing frameworks that fail to account for the probabilistic nature of LLMs will inevitably lead to significant setbacks.
To overcome these challenges, it is imperative to embrace the complexities of LLMs with a proactive mindset. This involves developing robust evaluation frameworks up-front, fostering a product development culture of continuous change, learning and adaptation.
- 4.2. Non-Deterministic Generative Machines¶
+4.2. Non-Deterministic Generative Machines¶
One of the most fundamental challenges when building products with Large Language Models (LLMs) is their generative and non-deterministic nature. Unlike traditional software systems where the same input reliably produces the same output, LLMs can generate novel text that may not exist in their training data, and produce different responses each time they’re queried - even with identical prompts and input data. This behavior is both a strength and a significant engineering challenge and product challenge.
When you ask an LLM the same question multiple times, you’ll likely get different responses. This isn’t a bug - it’s a fundamental feature of how these models work. The “temperature” parameter, which controls the randomness of outputs, allows models to be creative and generate diverse responses. However, this same feature makes it difficult to build reliable, testable systems.
Consider a financial services company using LLMs to generate investment advice. The non-deterministic nature of these models means that:
@@ -397,7 +406,7 @@-
4.3. Emerging Properties¶
+4.3. Emerging Properties¶
Beyond their non-deterministic nature, LLMs present another fascinating characteristic: emergent abilities that spontaneously arise as models scale up in size. These abilities - from basic question answering to complex reasoning - aren’t explicitly programmed but rather emerge “naturally” as the models grow larger and are trained on more data. This makes evaluation fundamentally different from traditional software testing, where capabilities are explicitly coded and can be tested against pre-defined specifications.
Fig. 4.1 provides a list of emergent abilities of large language models and the scale. The relationship between model scale and emergent abilities follows a fascinating non-linear pattern. Below certain size thresholds, specific abilities may be completely absent from the model - it simply cannot perform certain tasks, no matter how much you try to coax them out. However, once the model reaches critical points in its scaling journey, these abilities can suddenly manifest in what researchers call a phase transition - a dramatic shift from inability to capability. This unpredictable emergence of capabilities stands in stark contrast to traditional software development, where features are deliberately implemented and can be systematically tested.
@@ -409,7 +418,7 @@-
4.4. Problem Statement¶
+4.4. Problem Statement¶
Consider a practical example that illustrates these challenges: building a Math AI tutoring system for children powered by an LLM. In traditional software development, you would define specific features (like presenting math problems or checking answers) and write tests to verify each function. But with LLMs, you’re not just testing predefined features - you’re trying to evaluate emergent capabilities like adapting explanations to a child’s level, maintaining engagement through conversational learning, and providing age-appropriate safety-bound content.
This fundamental difference raises critical questions about evaluation:
@@ -459,7 +468,7 @@
-
4.5. Evals Design¶
+4.5. Evals Design¶
First, it’s important to make a distinction between evaluating an LLM versus evaluating an LLM-based application. While the latter offers foundation capabilities and are typically general-purpose, the former is more specific and tailored to a particular use case. Here, we define an LLM-based application as a system that uses one or more LLMs to perform a specific task. More specifically, an LLM-based application is the combination of one or more LLM models, their associated prompts and parameters to solve a particular business problem.
That differentiation is important because it changes the scope of evaluation. LLMs are usually evaluated based on their capabilities, which include things like language understanding, reasoning and knowledge. LLM-based applications, instead, should be evaluated based on their end-to-end functionality, performance, and how well they meet business requirements. That distinction has key implications for the design of evaluation systems:
@@ -546,7 +555,7 @@
-
4.5.1. Conceptual Overview¶
+4.5.1. Conceptual Overview¶
Fig. 4.2 demonstrates a conceptual design of key components of LLM Application evaluation.
@@ -627,7 +636,7 @@-
4.5.2. Design Considerations¶
+4.5.2. Design Considerations¶
The design of an LLM application evaluation system depends heavily on the specific use case and business requirements. Here we list important questions for planning an LLM application evaluation system pertaining to each of the key components previously introduced:
1. Examples (Input Dataset):
@@ -712,7 +721,7 @@
-
4.6. Metrics¶
+4.6. Metrics¶
The choice of metric depends on the specific task and desired evaluation criteria. However, one can categorize metrics into two broad categories: intrinsic and extrinsic.
- @@ -1022,9 +1031,9 @@
Intrinsic metrics focus on the model’s performance on its primary training objective, which is typically to predict the next token in a sequence. Perplexity is a common intrinsic metric that measures how well the model predicts a given sample of text.
-
4.7. Evaluators¶
+4.7. Evaluators¶
- 4.7.1. Model-Based Evaluation¶
+4.7.1. Model-Based Evaluation¶
Traditional metrics like BLEU or ROUGE often fall short in capturing the nuanced, contextual, and creative outputs of LLMs. As an alternative we can consider a “Model-based evaluation” approach. A common approach is to use an LLM as a judge. This is an approach that leverages language models themselves to assess the quality of outputs from other language models. This method involves using a model (often a more capable one) to act as an automated judge, evaluating aspects like accuracy, coherence, and relevance of generated content. Unlike traditional metrics that rely on exact matching or statistical measures, model-based evaluation can capture nuanced aspects of language and provide more contextual assessment.
As discussed in the paper [Li et al., 2024], LLM-based evaluation approaches generally fall into two main categories:
@@ -1264,7 +1273,7 @@
-
4.7.2. Evaluating Evaluators¶
+4.7.2. Evaluating Evaluators¶
We have discussed how LLMs can be used to evaluate LLM-based aplications. However, how can we evaluate the performance of LLMs that evaluate other LLMs? This is the question that meta evaluation aims to answer. Clearly, the discussion can become quite meta as we need to evaluate the performance of the evaluator to evaluate the performance of the evaluated model. However, one can make a case for two general options:
- @@ -1308,7 +1317,7 @@
Use a gold-standard dataset that is used to evaluate the performance of LLM evaluators using a “metrics-based” approach.
-
4.8. Benchmarks and Leaderboards¶
+4.8. Benchmarks and Leaderboards¶
Benchmarks act as standardized tests for LLMs, evaluating their performance across a spectrum of tasks. These tasks simulate real-world applications such as answering questions, generating coherent text, solving mathematical problems, or even writing computer code. They also assess more abstract qualities like fairness, robustness, and cultural understanding.
Benchmarks can be thought as comprehensive “exams” that probe different “subjects” in order to certify an LLM. They help researchers and developers compare models systematically, in a way LLM performance is comparable while enabling the identification of emergent behaviors or capabilities as models evolve in scale and sophistication.
The history of LLM benchmarks reflects the evolving priorities of artificial intelligence research, starting with foundational tasks and moving toward complex, real-world challenges. It began in 2018 with the introduction of GLUE(General Language Understanding Evaluation) [Wang et al., 2019], which set a new standard for evaluating natural language understanding. GLUE measured performance on tasks like sentiment analysis and textual entailment, providing a baseline for assessing the fundamental capabilities of language models. A year later, SuperGLUE [Wang et al., 2019] expanded on this foundation by introducing more nuanced tasks that tested reasoning and language comprehension at a deeper level, challenging the limits of models like BERT and its successors.
@@ -1353,9 +1362,9 @@-
4.9. Tools¶
+4.9. Tools¶
- 4.9.1. LightEval¶
+4.9.1. LightEval¶
LightEval [Fourrier et al., 2023] is a lightweight framework for evaluation of LLMs across a variety of standard and bespoke metrics and tasks across multiple inference backends via Python SDK and CLI.
As a motivating example, consider a scenario where financial data has been extracted from SEC financial filings and require econometric analysis. Tasks like estimating autoregressive models for time series forecasting or conducting hypothesis tests on market efficiency are common in financial analysis. Let’s evaluate how well different models perform on this type of task.
First, we need to select a benchmark to assess LLMs capabilities in this domain. MMLU has a sub-benchmark called Econometrics we can use for this task. Table 4.4 shows a sample of the benchmark dataset from MMLU Econometrics. It consists of multiple-choice questions from econometrics and expected answers.
@@ -1544,7 +1553,7 @@[Hugging Face, 2024]. Its integration with the Hugging Face ecosystem and modular architecture make it particularly powerful for evaluating open source models. For further details, visit the official repository [Fourrier et al., 2023].
- 4.9.2. LangSmith¶
+4.9.2. LangSmith¶
Let’s revisit our evaluation example when we were interested in evaluating the quality of summaries generated by different (smaller and cheaper) LLM models compared to a benchmark model (larger and more expensive). Recal the setup:
- @@ -1952,7 +1961,7 @@
Benchmark model: gpt-4o
-
4.9.3. PromptFoo¶
+4.9.3. PromptFoo¶
Promptfoo [promptfoo, 2024] is an open-source framework designed for evaluating applications that utilize large language models (LLMs). Key features include:
- @@ -2217,7 +2226,7 @@
Automated Testing: Promptfoo provides automated testing capabilities, allowing developers to run custom evaluations tailored to their applications.
Prompt Comparison R
In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development.
- 4.9.4. Comparison¶
+4.9.4. Comparison¶
The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.