
Commit

update promptfoo evals
souzatharsis committed Dec 10, 2024
1 parent 35b173e commit dbbd3fd
Showing 9 changed files with 98 additions and 4 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file not shown.
19 changes: 19 additions & 0 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -2511,6 +2511,25 @@
"In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison\n",
"\n",
"The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.\n",
"\n",
"```{table} Comparison of Lighteval, LangSmith, and Promptfoo\n",
":name: tool-comparison\n",
"| Feature/Aspect | Lighteval | LangSmith | Promptfoo |\n",
"|----------------------|------------------------------------|------------------------------------|------------------------------------|\n",
"| **Integration** | Seamless with Hugging Face models, easy access to multiple inference engines, and remote evaluation (e.g., TGI servers, HF serverless models) | User-provided models, evaluators, and metrics | CLI-based, user-provided models via YAML |\n",
"| **Customization** | Flexible task and metric support, quick evaluation against state-of-the-art leaderboards | Easy setup of custom tasks and metrics with plain vanilla Python functions, lacks predefined tasks and metrics | Default and user-provided probes, metrics, and assertions |\n",
"| **Ease of Use** | User-friendly, minimal setup | User-friendly, minimal setup, includes UI for result visualization | Simple CLI, rapid testing, includes UI for result visualization |\n",
"| **Human/LLM Collaboration** | Model-based evaluation | Model-based evaluation | Supports human and model evaluators |\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
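
The comparison table added above notes that Promptfoo is configured entirely through YAML, with user-provided prompts, providers, and assertions. As a rough illustration of the prompt-comparison experiment this commit describes, the sketch below shows what such a configuration might look like; the provider id, input file paths, and rubric wording are assumptions for illustration, not the book's actual config.

```yaml
# promptfooconfig.yaml -- minimal sketch, with assumed file names and model id
description: Compare three summarization prompts on filing sections
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
  - file://prompt3.txt
providers:
  - openai:gpt-4o-mini            # assumed provider/model; swap in the model under test
tests:
  - vars:
      section: Legal Proceedings
      document: file://data/legal_proceedings.txt   # hypothetical input file
    assert:
      - type: llm-rubric
        value: The summary is detailed and covers the key legal proceedings.
  - vars:
      section: Risk Factors
      document: file://data/risk_factors.txt        # hypothetical input file
    assert:
      - type: llm-rubric
        value: The summary is detailed and covers the main risk factors.
```

Running `promptfoo eval` then scores each prompt against each test case, and `promptfoo view` opens the result-visualization UI mentioned in the table.
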
41 changes: 39 additions & 2 deletions tamingllms/_build/html/notebooks/evals.html
@@ -233,9 +233,10 @@ <h1><a class="toc-backref" href="#id87" role="doc-backlink"><span class="section
<li><p><a class="reference internal" href="#lighteval" id="id103">LightEval</a></p></li>
<li><p><a class="reference internal" href="#langsmith" id="id104">LangSmith</a></p></li>
<li><p><a class="reference internal" href="#promptfoo" id="id105">PromptFoo</a></p></li>
<li><p><a class="reference internal" href="#comparison" id="id106">Comparison</a></p></li>
</ul>
</li>
<li><p><a class="reference internal" href="#references" id="id106">References</a></p></li>
<li><p><a class="reference internal" href="#references" id="id107">References</a></p></li>
</ul>
</li>
</ul>
@@ -2205,9 +2206,45 @@ <h3 class="rubric" id="prompt-comparison-results-by-section">Prompt Comparison R
<p>The results show that prompt3.txt performs best for Legal Proceedings sections, achieving a perfect score of 1.0 compared to 0.5 for prompt2.txt and 0.1 for prompt1.txt. For Risk Factors sections, both prompt2.txt and prompt3.txt achieve moderate scores of 0.5, while prompt1.txt scores poorly at 0.1. This suggests that prompt3.txt is generally more effective at extracting detailed information, particularly for legal content. In summary, defining a role for the model and requiring detailed output is a good way to improve the quality of the summaries, at least for this specific task, model, and criteria.</p>
<p>In conclusion, Promptfoo can serve as an effective LLM application evaluation tool, particularly because it decouples several components of the evaluation process. This lets the user focus on the aspects of the evaluation that matter most for the particular application and criteria, making it a valuable and flexible tool for LLM application development.</p>
</section>
<section id="comparison">
<h3><a class="toc-backref" href="#id106" role="doc-backlink"><span class="section-number">4.8.4. </span>Comparison</a><a class="headerlink" href="#comparison" title="Permalink to this heading"></a></h3>
<p>The following table provides a comparative summary of the three open source frameworks for language model evaluation that we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed on key features such as integration capabilities, customization options, ease of use, and support for human and LLM collaboration.</p>
<table class="docutils align-default" id="tool-comparison">
<caption><span class="caption-number">Table 4.6 </span><span class="caption-text">Comparison of Lighteval, LangSmith, and Promptfoo</span><a class="headerlink" href="#tool-comparison" title="Permalink to this table"></a></caption>
<thead>
<tr class="row-odd"><th class="head"><p>Feature/Aspect</p></th>
<th class="head"><p>Lighteval</p></th>
<th class="head"><p>LangSmith</p></th>
<th class="head"><p>Promptfoo</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><strong>Integration</strong></p></td>
<td><p>Seamless with Hugging Face models, easy access to multiple inference engines, and remote evaluation (e.g., TGI servers, HF serverless models)</p></td>
<td><p>User-provided models, evaluators, and metrics</p></td>
<td><p>CLI-based, user-provided models via YAML</p></td>
</tr>
<tr class="row-odd"><td><p><strong>Customization</strong></p></td>
<td><p>Flexible task and metric support, quick evaluation against state-of-the-art leaderboards</p></td>
<td><p>Easy setup of custom tasks and metrics with plain vanilla Python functions, lacks predefined tasks and metrics</p></td>
<td><p>Default and user-provided probes, metrics, and assertions</p></td>
</tr>
<tr class="row-even"><td><p><strong>Ease of Use</strong></p></td>
<td><p>User-friendly, minimal setup</p></td>
<td><p>User-friendly, minimal setup, includes UI for result visualization</p></td>
<td><p>Simple CLI, rapid testing, includes UI for result visualization</p></td>
</tr>
<tr class="row-odd"><td><p><strong>Human/LLM Collaboration</strong></p></td>
<td><p>Model-based evaluation</p></td>
<td><p>Model-based evaluation</p></td>
<td><p>Supports human and model evaluators</p></td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="references">
<h2><a class="toc-backref" href="#id106" role="doc-backlink"><span class="section-number">4.9. </span>References</a><a class="headerlink" href="#references" title="Permalink to this heading"></a></h2>
<h2><a class="toc-backref" href="#id107" role="doc-backlink"><span class="section-number">4.9. </span>References</a><a class="headerlink" href="#references" title="Permalink to this heading"></a></h2>
<div class="docutils container" id="id38">
<div class="citation" id="id50" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span><a role="doc-backlink" href="#id33">ALB+24</a><span class="fn-bracket">]</span></span>
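
The table also characterizes LangSmith's customization as easy setup of custom tasks and metrics with plain Python functions. The sketch below illustrates that idea; it assumes the langsmith SDK's evaluate() entry point, LangSmith credentials already configured in the environment, and a hypothetical dataset named "10k-summaries". It is not code from the chapter.

```python
# Minimal sketch of a custom LangSmith evaluator. Assumes the langsmith SDK's
# evaluate() entry point and that a dataset named "10k-summaries" already exists.
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    # Placeholder for the application under test (e.g., an LLM summarizer).
    return {"summary": f"Summary of: {inputs['document'][:50]}..."}


def detail_score(run, example) -> dict:
    # A plain Python function acting as a metric: here, a crude proxy that
    # rewards longer (more detailed) summaries. Replace with a real criterion
    # or an LLM-as-judge call.
    summary = (run.outputs or {}).get("summary", "")
    return {"key": "detail", "score": min(len(summary.split()) / 100, 1.0)}


results = evaluate(
    my_app,                      # target: any callable taking the dataset inputs
    data="10k-summaries",        # hypothetical dataset name in LangSmith
    evaluators=[detail_score],   # plain functions, no framework subclassing
    experiment_prefix="prompt-comparison",
)
```

Because the metric is just a function of the run and the reference example, swapping in a different criterion (or an LLM judge) does not require changing the rest of the pipeline, which is the flexibility the comparison highlights.
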
Binary file modified tamingllms/_build/html/objects.inv
Binary file not shown.
2 changes: 1 addition & 1 deletion tamingllms/_build/html/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tamingllms/_build/jupyter_execute/markdown/intro.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "60c9196f",
"id": "3693a9ca",
"metadata": {},
"source": [
"(intro)=\n",
19 changes: 19 additions & 0 deletions tamingllms/_build/jupyter_execute/notebooks/evals.ipynb
@@ -2511,6 +2511,25 @@
"In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison\n",
"\n",
"The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.\n",
"\n",
"```{table} Comparison of Lighteval, LangSmith, and Promptfoo\n",
":name: tool-comparison\n",
"| Feature/Aspect | Lighteval | LangSmith | Promptfoo |\n",
"|----------------------|------------------------------------|------------------------------------|------------------------------------|\n",
"| **Integration** | Seamless with Hugging Face models, easy access to multiple inference engines, and remote evaluation (e.g., TGI servers, HF serverless models) | User-provided models, evaluators, and metrics | CLI-based, user-provided models via YAML |\n",
"| **Customization** | Flexible task and metric support, quick evaluation against state-of-the-art leaderboards | Easy setup of custom tasks and metrics with plain vanilla Python functions, lacks predefined tasks and metrics | Default and user-provided probes, metrics, and assertions |\n",
"| **Ease of Use** | User-friendly, minimal setup | User-friendly, minimal setup, includes UI for result visualization | Simple CLI, rapid testing, includes UI for result visualization |\n",
"| **Human/LLM Collaboration** | Model-based evaluation | Model-based evaluation | Supports human and model evaluators |\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
19 changes: 19 additions & 0 deletions tamingllms/notebooks/evals.ipynb
@@ -2511,6 +2511,25 @@
"In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison\n",
"\n",
"The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.\n",
"\n",
"```{table} Comparison of Lighteval, LangSmith, and Promptfoo\n",
":name: tool-comparison\n",
"| Feature/Aspect | Lighteval | LangSmith | Promptfoo |\n",
"|----------------------|------------------------------------|------------------------------------|------------------------------------|\n",
"| **Integration** | Seamless with Hugging Face models, easy access to multiple inference engines, and remote evaluation (e.g., TGI servers, HF serverless models) | User-provided models, evaluators, and metrics | CLI-based, user-provided models via YAML |\n",
"| **Customization** | Flexible task and metric support, quick evaluation against state-of-the-art leaderboards | Easy setup of custom tasks and metrics with plain vanilla Python functions, lacks predefined tasks and metrics | Default and user-provided probes, metrics, and assertions |\n",
"| **Ease of Use** | User-friendly, minimal setup | User-friendly, minimal setup, includes UI for result visualization | Simple CLI, rapid testing, includes UI for result visualization |\n",
"| **Human/LLM Collaboration** | Model-based evaluation | Model-based evaluation | Supports human and model evaluators |\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
