
4. Challenges of Evaluating LLM-based Applications

Evals are surprisingly often all you need.

—Greg Brockman, OpenAI’s President



4.1.1. Temperature and Sampling

The primary source of non-determinism in LLMs comes from their sampling strategies. During text generation, the model:

  1. Calculates probability distributions for each next token

  2. Samples the next token from that distribution rather than always picking the single most likely one, with the temperature parameter controlling how concentrated the distribution is
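The two steps above can be sketched as temperature-scaled softmax sampling. The function below is a minimal illustration, not any particular model's implementation:

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Sample a next-token index from raw logits with temperature scaling."""
    if temperature == 0:
        # Greedy decoding: always the highest-logit token (deterministic).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Step 1: turn logits into a probability distribution (softmax),
    # dividing by temperature to sharpen (<1) or flatten (>1) it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 2: sample an index from that distribution.
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because step 2 draws from a distribution rather than taking an argmax, two identical calls can return different tokens, which is exactly the non-determinism that complicates evaluation.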


4.1.2. The Temperature Spectrum

  • Temperature = 0: Most deterministic, but potentially repetitive

  • Temperature = 1: Balanced creativity and coherence

  • Temperature > 1: Increasingly random output, with creativity gained at the cost of coherence
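The spectrum can be seen directly by computing softmax probabilities for the same (hypothetical) logits at different temperatures: low temperatures concentrate probability mass on the top token, high temperatures flatten the distribution toward uniform:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# As temperature grows, probabilities approach a uniform 1/3 each;
# as it shrinks, mass piles onto the highest-logit token.
```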


4.2. Emerging Properties

Beyond their non-deterministic nature, LLMs present another fascinating challenge: emergent abilities that spontaneously arise as models scale up in size. These abilities - from basic question answering to complex reasoning - aren’t explicitly programmed but rather emerge “naturally” as the models grow larger and are trained on more data. This makes evaluation fundamentally different from traditional software testing, where capabilities are explicitly coded and can be tested against clear specifications.



4.3. Problem Statement

Consider a practical example that illustrates these challenges: building a customer support chatbot powered by an LLM. In traditional software development, you would define specific features (like handling refund requests or tracking orders) and write tests to verify each function. But with LLMs, you’re not just testing predefined features - you’re trying to evaluate emergent capabilities like understanding context, maintaining conversation coherence, and generating appropriate emotional responses.

This fundamental difference raises critical questions about how such applications should be evaluated.

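One way to make this contrast concrete is a minimal sketch (all names hypothetical): a deterministic function admits an exact assertion, while a chatbot reply is open-ended text that can only be checked against properties it should exhibit:

```python
# Traditional software: deterministic behavior, exact assertions.
def refund_amount(order_total: float, restocking_fee: float = 0.10) -> float:
    """Refund after a fixed restocking fee -- a predefined, testable feature."""
    return round(order_total * (1 - restocking_fee), 2)

assert refund_amount(100.0) == 90.0  # unambiguously passes or fails

# LLM-based application: the output is free-form text, so instead of an
# exact match we score properties the reply should exhibit.
def score_support_reply(reply: str) -> dict:
    text = reply.lower()
    return {
        "addresses_refund": "refund" in text,
        "polite": any(w in text for w in ("please", "thank", "sorry")),
        "non_empty": bool(reply.strip()),
    }
```

Even this keyword-based scorer is crude; capturing context understanding or emotional appropriateness requires richer judges, which is precisely the evaluation challenge.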

4.4. Evals Design

First, it’s important to make a distinction between evaluating an LLM versus evaluating an LLM-based application (our focus). While the former offers foundation capabilities and is typically general-purpose, the latter is more specific and tailored to a particular use case. Here, we define an LLM-based application as a system that uses one or more LLMs to perform a specific task. More specifically, an LLM-based application is the combination of one or more LLM models with their associated prompts and parameters to solve a particular business problem.

That differentiation is important because it changes the scope of evaluation. LLMs are usually evaluated based on their capabilities, which include things like language understanding, reasoning and knowledge. LLM-based applications are evaluated based on their end-to-end functionality, performance, and how well they meet business requirements. That distinction has key implications for how evaluation systems are designed.
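Under that definition, an LLM-based application can be sketched as a small record bundling models, prompts, and parameters for one business task. Names and values below are illustrative, not a real API:

```python
from dataclasses import dataclass, field

@dataclass
class LLMApplication:
    """One or more LLMs plus prompts and parameters, aimed at a specific task."""
    task: str
    models: list
    system_prompt: str
    parameters: dict = field(default_factory=dict)  # e.g. temperature, max_tokens

# Hypothetical customer-support chatbot from the problem statement above.
support_bot = LLMApplication(
    task="customer support",
    models=["example-chat-model"],  # placeholder model name
    system_prompt="You are a helpful, polite support agent.",
    parameters={"temperature": 0.2, "max_tokens": 512},
)
```

Evaluating the LLM alone would probe `models` for general capabilities; evaluating the application means testing this whole bundle end-to-end against the business task.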

4.4.1. Conceptual Overview

Fig. 4.2 demonstrates a conceptual design of key components of LLM Application evaluation.


4.4.2. Design Considerations

The design of an LLM application evaluation system depends heavily on the specific use case and business requirements. Here we list important questions for planning an LLM application evaluation system pertaining to each of the key components previously discussed:


1. Examples (Input Dataset):
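Tying the components together, here is a sketch of an evaluation run over such an input dataset. The helper names are ours, not from any particular tool:

```python
def run_eval(examples, app_fn, scorers):
    """Run an application over a dataset and aggregate per-metric scores.

    examples: list of {"input": ..., "expected": ...} records
    app_fn:   callable taking an input and returning the app's output
    scorers:  mapping of metric name -> callable(output, expected) -> float in [0, 1]
    """
    results = []
    for ex in examples:
        output = app_fn(ex["input"])
        results.append({
            "input": ex["input"],
            "output": output,
            "scores": {name: fn(output, ex.get("expected"))
                       for name, fn in scorers.items()},
        })
    # Aggregate: mean score per metric across the dataset.
    summary = {name: sum(r["scores"][name] for r in results) / len(results)
               for name in scorers}
    return results, summary

# Usage with a trivial stand-in "application" and an exact-match scorer:
examples = [{"input": "2+2", "expected": "4"}, {"input": "3+3", "expected": "6"}]
app = lambda q: str(eval(q))  # placeholder for a real LLM call
results, summary = run_eval(examples, app, {
    "exact_match": lambda out, exp: float(out == exp),
})
```

In practice `app_fn` would wrap the model, prompt, and parameters, and the scorers would include fuzzier judges than exact match, but the examples-run-score loop stays the same.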

4.6. Tools

4.7. References
