
Commit

update promptfoo evals
souzatharsis committed Dec 10, 2024
1 parent 35b173e commit dbbd3fd
Showing 9 changed files with 98 additions and 4 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file not shown.
19 changes: 19 additions & 0 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -2511,6 +2511,25 @@
"In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison\n",
"\n",
"The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.\n",
"\n",
"```{table} Comparison of Lighteval, LangSmith, and Promptfoo\n",
":name: tool-comparison\n",
"| Feature/Aspect | Lighteval | LangSmith | Promptfoo |\n",
"|----------------------|------------------------------------|------------------------------------|------------------------------------|\n",
"| **Integration** | Seamless with Hugging Face models, easy access to multiple inference engines, and remote evaluation (e.g., TGI servers, HF serverless models) | User-provided models, evaluators, and metrics | CLI-based, user-provided models via YAML |\n",
"| **Customization** | Flexible task and metric support, quick evaluation against state-of-the-art leaderboards | Easy setup of custom tasks and metrics with plain vanilla Python functions, lacks predefined tasks and metrics | Default and user-provided probes, metrics, and assertions |\n",
"| **Ease of Use** | User-friendly, minimal setup | User-friendly, minimal setup, includes UI for result visualization | Simple CLI, rapid testing, includes UI for result visualization |\n",
"| **Human/LLM Collaboration** | Model-based evaluation | Model-based evaluation | Supports human and model evaluators |\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
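
The comparison table added above notes that Promptfoo is configured entirely through YAML, with user-provided prompts, providers, and assertions. As a rough illustration of the prompt-comparison experiment this commit describes, the sketch below shows what such a configuration might look like; the provider id, input file paths, and rubric wording are assumptions for illustration, not the book's actual config.

```yaml
# promptfooconfig.yaml -- minimal sketch, with assumed file names and model id
description: Compare three summarization prompts on filing sections
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
  - file://prompt3.txt
providers:
  - openai:gpt-4o-mini            # assumed provider/model; swap in the model under test
tests:
  - vars:
      section: Legal Proceedings
      document: file://data/legal_proceedings.txt   # hypothetical input file
    assert:
      - type: llm-rubric
        value: The summary is detailed and covers the key legal proceedings.
  - vars:
      section: Risk Factors
      document: file://data/risk_factors.txt        # hypothetical input file
    assert:
      - type: llm-rubric
        value: The summary is detailed and covers the main risk factors.
```

Running `promptfoo eval` then scores each prompt against each test case, and `promptfoo view` opens the result-visualization UI mentioned in the table.
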
41 changes: 39 additions & 2 deletions tamingllms/_build/html/notebooks/evals.html
@@ -233,9 +233,10 @@ <h1><a class="toc-backref" href="#id87" role="doc-backlink"><span class="section
<li><p><a class="reference internal" href="#lighteval" id="id103">LightEval</a></p></li>
<li><p><a class="reference internal" href="#langsmith" id="id104">LangSmith</a></p></li>
<li><p><a class="reference internal" href="#promptfoo" id="id105">PromptFoo</a></p></li>
<li><p><a class="reference internal" href="#comparison" id="id106">Comparison</a></p></li>
</ul>
</li>
<li><p><a class="reference internal" href="#references" id="id106">References</a></p></li>
<li><p><a class="reference internal" href="#references" id="id107">References</a></p></li>
</ul>
</li>
</ul>
@@ -2205,9 +2206,45 @@ <h3 class="rubric" id="prompt-comparison-results-by-section">Prompt Comparison R
<p>The results show that prompt3.txt performs best for Legal Proceedings sections, achieving a perfect score of 1.0 compared to 0.5 for prompt2.txt and 0.1 for prompt1.txt. For Risk Factors sections, both prompt2.txt and prompt3.txt achieve moderate scores of 0.5, while prompt1.txt scores poorly at 0.1. This suggests that prompt3.txt is generally more effective at extracting detailed information, particularly for legal content. In summary, defining a role for the model and requiring detailed output is a good way to improve the quality of the summaries, at least for this specific task, model, and criteria.</p>
<p>In conclusion, Promptfoo can serve as an effective LLM application evaluation tool, particularly because it decouples several components of the evaluation process. This lets the user focus on the aspects of the evaluation that matter most for the particular application and criteria, making it a valuable and flexible tool for LLM application development.</p>
</section>
<section id="comparison">
<h3><a class="toc-backref" href="#id106" role="doc-backlink"><span class="section-number">4.8.4. </span>Comparison</a><a class="headerlink" href="#comparison" title="Permalink to this heading"></a></h3>
<p>The following table provides a comparative summary of the three open source frameworks for language model evaluation that we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed on key features such as integration capabilities, customization options, ease of use, and support for human and LLM collaboration.</p>
<table class="docutils align-default" id="tool-comparison">
<caption><span class="caption-number">Table 4.6 </span><span class="caption-text">Comparison of Lighteval, LangSmith, and Promptfoo</span><a class="headerlink" href="#tool-comparison" title="Permalink to this table"></a></caption>
<thead>
<tr class="row-odd"><th class="head"><p>Feature/Aspect</p></th>
<th class="head"><p>Lighteval</p></th>
<th class="head"><p>LangSmith</p></th>
<th class="head"><p>Promptfoo</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><strong>Integration</strong></p></td>
<td><p>Seamless with Hugging Face models, easy access to multiple inference engines, and remote evaluation (e.g., TGI servers, HF serverless models)</p></td>
<td><p>User-provided models, evaluators, and metrics</p></td>
<td><p>CLI-based, user-provided models via YAML</p></td>
</tr>
<tr class="row-odd"><td><p><strong>Customization</strong></p></td>
<td><p>Flexible task and metric support, quick evaluation against state-of-the-art leaderboards</p></td>
<td><p>Easy setup of custom tasks and metrics with plain vanilla Python functions, lacks predefined tasks and metrics</p></td>
<td><p>Default and user-provided probes, metrics, and assertions</p></td>
</tr>
<tr class="row-even"><td><p><strong>Ease of Use</strong></p></td>
<td><p>User-friendly, minimal setup</p></td>
<td><p>User-friendly, minimal setup, includes UI for result visualization</p></td>
<td><p>Simple CLI, rapid testing, includes UI for result visualization</p></td>
</tr>
<tr class="row-odd"><td><p><strong>Human/LLM Collaboration</strong></p></td>
<td><p>Model-based evaluation</p></td>
<td><p>Model-based evaluation</p></td>
<td><p>Supports human and model evaluators</p></td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="references">
<h2><a class="toc-backref" href="#id106" role="doc-backlink"><span class="section-number">4.9. </span>References</a><a class="headerlink" href="#references" title="Permalink to this heading"></a></h2>
<h2><a class="toc-backref" href="#id107" role="doc-backlink"><span class="section-number">4.9. </span>References</a><a class="headerlink" href="#references" title="Permalink to this heading"></a></h2>
<div class="docutils container" id="id38">
<div class="citation" id="id50" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span><a role="doc-backlink" href="#id33">ALB+24</a><span class="fn-bracket">]</span></span>
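
The table also characterizes LangSmith's customization as easy setup of custom tasks and metrics with plain Python functions. The sketch below illustrates that idea; it assumes the langsmith SDK's evaluate() entry point, LangSmith credentials already configured in the environment, and a hypothetical dataset named "10k-summaries". It is not code from the chapter.

```python
# Minimal sketch of a custom LangSmith evaluator. Assumes the langsmith SDK's
# evaluate() entry point and that a dataset named "10k-summaries" already exists.
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    # Placeholder for the application under test (e.g., an LLM summarizer).
    return {"summary": f"Summary of: {inputs['document'][:50]}..."}


def detail_score(run, example) -> dict:
    # A plain Python function acting as a metric: here, a crude proxy that
    # rewards longer (more detailed) summaries. Replace with a real criterion
    # or an LLM-as-judge call.
    summary = (run.outputs or {}).get("summary", "")
    return {"key": "detail", "score": min(len(summary.split()) / 100, 1.0)}


results = evaluate(
    my_app,                      # target: any callable taking the dataset inputs
    data="10k-summaries",        # hypothetical dataset name in LangSmith
    evaluators=[detail_score],   # plain functions, no framework subclassing
    experiment_prefix="prompt-comparison",
)
```

Because the metric is just a function of the run and the reference example, swapping in a different criterion (or an LLM judge) does not require changing the rest of the pipeline, which is the flexibility the comparison highlights.
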
Binary file modified tamingllms/_build/html/objects.inv
Binary file not shown.
2 changes: 1 addition & 1 deletion tamingllms/_build/html/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tamingllms/_build/jupyter_execute/markdown/intro.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "60c9196f",
"id": "3693a9ca",
"metadata": {},
"source": [
"(intro)=\n",
19 changes: 19 additions & 0 deletions tamingllms/_build/jupyter_execute/notebooks/evals.ipynb
@@ -2511,6 +2511,25 @@
"In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison\n",
"\n",
"The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.\n",
"\n",
"```{table} Comparison of Lighteval, LangSmith, and Promptfoo\n",
":name: tool-comparison\n",
"| Feature/Aspect | Lighteval | LangSmith | Promptfoo |\n",
"|----------------------|------------------------------------|------------------------------------|------------------------------------|\n",
"| **Integration** | Seamless with Hugging Face models, easy access to multiple inference engines, and remote evaluation (e.g., TGI servers, HF serverless models) | User-provided models, evaluators, and metrics | CLI-based, user-provided models via YAML |\n",
"| **Customization** | Flexible task and metric support, quick evaluation against state-of-the-art leaderboards | Easy setup of custom tasks and metrics with plain vanilla Python functions, lacks predefined tasks and metrics | Default and user-provided probes, metrics, and assertions |\n",
"| **Ease of Use** | User-friendly, minimal setup | User-friendly, minimal setup, includes UI for result visualization | Simple CLI, rapid testing, includes UI for result visualization |\n",
"| **Human/LLM Collaboration** | Model-based evaluation | Model-based evaluation | Supports human and model evaluators |\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
19 changes: 19 additions & 0 deletions tamingllms/notebooks/evals.ipynb
@@ -2511,6 +2511,25 @@
"In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison\n",
"\n",
"The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.\n",
"\n",
"```{table} Comparison of Lighteval, LangSmith, and Promptfoo\n",
":name: tool-comparison\n",
"| Feature/Aspect | Lighteval | LangSmith | Promptfoo |\n",
"|----------------------|------------------------------------|------------------------------------|------------------------------------|\n",
"| **Integration** | Seamless with Hugging Face models, easy access to multiple inference engines, and remote evaluation (e.g., TGI servers, HF serverless models) | User-provided models, evaluators, and metrics | CLI-based, user-provided models via YAML |\n",
"| **Customization** | Flexible task and metric support, quick evaluation against state-of-the-art leaderboards | Easy setup of custom tasks and metrics with plain vanilla Python functions, lacks predefined tasks and metrics | Default and user-provided probes, metrics, and assertions |\n",
"| **Ease of Use** | User-friendly, minimal setup | User-friendly, minimal setup, includes UI for result visualization | Simple CLI, rapid testing, includes UI for result visualization |\n",
"| **Human/LLM Collaboration** | Model-based evaluation | Model-based evaluation | Supports human and model evaluators |\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
