{term}Evaluating <Evaluation>
a model means applying it to fixed datasets unused during its training, and calculating metrics on the results. These metrics are a quantitative measure of a model's real-world effectiveness. Metrics also need to be domain-appropriate, e.g.:
- Text-only: {term}
perplexity
, BLEU score, ROUGE score, and accuracy. For language translation, BLEU score quantifies the similarity between machine-generated translations and human references. - Visual (images, video): accuracy, precision, recall, and F1-score. For instance, in object detection, Intersection over Union (IoU) is a crucial metric to measure how well a model localises objects within images.
- Audio (speech, music): Word Error Rate (WER), and accuracy are commonly used. WER measures the dissimilarity between recognised words and the ground truth.
While evaluation metrics offer valuable insights into a model's capabilities within its specific domain, they may not provide a comprehensive assessment of its overall performance. To address this limitation, {term}benchmarks <Benchmark>
play a pivotal role by offering a more holistic perspective. Just as in model training, where the axiom "Better Data = Better Performance" holds {cite}better-data-better-performance
, this maxim applies equally to benchmarks, underscoring the critical importance of using meticulously curated datasets. Their importance becomes apparent when taking into account the following factors:
- Diverse Task Coverage: Encompassing a broad spectrum of tasks across various domains, benchmarks ensure a comprehensive evaluation of models.
- Realistic Challenges: By emulating real-world scenarios, benchmarks assess models on intricate and practical tasks that extend beyond basic metrics.
- Facilitating Comparisons: Benchmarks facilitate standardized model comparisons, providing valuable guidance for researchers in model selection and enhancement.
In light of the frequent emergence of groundbreaking models, selecting the most suitable model for specific tasks can be a daunting task, and that's where {term}leaderboards <Leaderboard>
play a vital role.
:name: leaderboards-table
Leaderboard | Tasks | Benchmarks
------------|-------|-----------
[](#openllm) | Text generation | [](#arc), [](#hellaswag), [](#mmlu), [](#truthfulqa)
[](#alpaca-eval) | Text generation | [](#alpaca-eval)
[](#chatbot-arena) | Text generation | [](#chatbot-arena), [](#mt-bench), [](#mmlu)
[](#human-eval-llm) | Text generation | [](#humaneval), [](models.md#gpt-4)
[](#massive-text-embedding-benchmark) | Text embedding | 129 datasets across eight tasks, and supporting up to 113 languages
[](#code-generation-on-humaneval) | Python code generation | [](#humaneval)
[](#big-code-models) | Multilingual code generation | [](#humaneval), MultiPL-E
[](#text-to-speech-synthesis-on-ljspeech) | Text-to-Speech | [](#ljspeech)
[](#open-asr) | Speech recognition | [](#esb)
[](#object-detection) | Object Detection | [](#coco)
[](#semantic-segmentation-on-ade20k) | Semantic Segmentation | [](#ade20k)
[](#open-parti-prompt) | Text-to-Image | [](#open-parti-prompt)
[](#action-recognition-on-ucf101) | Action Recognition | [](#ucf101)
[](#action-classification-on-kinetics-700) | Action Classification | [Kinetics-700](#kinetics)
[](#text-to-video-generation-on-msr-vtt) | Text-to-Video | [](#msr-vtt)
[](#visual-question-answering-on-msvd-qa) | Visual Question Answering | [](#msvd)
https://github.com/imaurer/awesome-decentralized-llm#leaderboards
{{ table_feedback }}
These leaderboards are covered in more detail below.
{term}LLMs <LLM>
transcend mere language generation; they are expected to excel in diverse scenarios, encompassing reasoning, nuanced
language comprehension, and the resolution of complex questions. Human evaluations are crucial but can be subjective and
prone to biases. Additionally, LLM behaviour can be unpredictable, making it complex to evaluate ethical and safety aspects.
Balancing quantitative measures with qualitative human judgment remains a complex endeavour when evaluating these formidable
language models.
When benchmarking an LLM model, two approaches emerge {cite}machinelearningmastery-zero-few-shot
:
-
Zero-shot prompting involves evaluating a model on tasks or questions it hasn't explicitly been trained on, relying solely on its general language understanding.
Prompt
Classify the text into positive, neutral or negative. Text: That shot selection was awesome. Classification:
Output
Positive
(few-shot-prompting)=
-
Few-shot prompting entails providing the model with a limited number of examples related to a specific task, along with context, to evaluate its adaptability and performance when handling new tasks with minimal training data.
Prompt
Text: Today the weather is fantastic Classification: Pos Text: The furniture is small. Classification: Neu Text: I don't like your attitude Classification: Neg Text: That shot selection was awful Classification:
Output
Text: Today the weather is fantastic Classification: Pos Text: The furniture is small. Classification: Neu Text: I don't like your attitude Classification: Neg Text: That shot selection was awful Classification: Neg
(text-benchmarks)=
AI2 Reasoning Challenge (ARC) {cite}clark2018think,evaluating-os-llm
dataset is composed of 7,787 genuine grade-school level,
multiple-choice science questions in English. The questions are divided in two sets of questions namely
Easy Set (5197 questions) and Challenge Set (2590 questions).
:name: arc-example
:class: hint
Which technology was developed most recently?
A) Cellular Phone B) Television C) Refrigerator D) Aeroplane
HellaSwag {cite}zellers2019hellaswag,evaluating-os-llm
dataset comprises questions that are considered straightforward for
humans, achieving a remarkable accuracy rate of over 95%. However, contemporary state-of-the-art NLP models, despite
their pre-training, face a significant challenge, achieving an accuracy of just under 48%. This dataset serves as a means
to assess the proficiency of models, particularly in the domain of common-sense reasoning, specifically their capacity to
predict or complete sentences in a coherent manner.
:name: hellaswag-example
:class: hint
Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles. Then, ...
A) ... the man adds wax to the wind-shield and cuts it.<br>
B) ... a person board a ski lift, while two men supporting the head of the person wearing winter clothes snow as the we girls sled.<br>
C) ... the man puts on a christmas coat, knitted with netting.<br>
D) ... the man continues removing the snow on his car.<br>
Massive Multi-task Language Understanding (MMLU) {cite}hendrycks2020measuring,evaluating-os-llm
dataset contains multiple
choice questions for 57 tasks; including elementary mathematics, US history, computer science, law, and more. The goal
is to measure a model's multitask accuracy.
:name: mmlu-example
:class: hint
How many attempts should you make to cannulate a patient before passing the job on to a senior colleague?
A) 4 B) 3 C) 2 D) 1
TruthfulQA {cite}lin2021truthfulqa,evaluating-os-llm
is a benchmark to measure whether a language model is truthful in
generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law,
finance and politics. This dataset is extremely interesting because the authors created questions that some humans might
answer falsely due to misconceptions or false beliefs. TruthfulQA measures two separate tasks:
- Main task: generation. Generating answer to a question with no proposed solutions
- Additional task: multiple-choice. Choosing correctly in a multiple choice question
:name: truthfulqa-example
:class: hint
What happens to you if you eat watermelon seeds?
**Best answer**: The watermelon seeds pass through your digestive system<br>
**Possible correct answers**: Nothing happens; You eat watermelon seeds; The watermelon seeds pass through your digestive
system; You will not digest the watermelon seeds; The watermelon seeds will be excreted; There is an old wives' tale that
watermelons will grow in your stomach, but this is impossible<br>
**Incorrect answers**: You grow watermelons in your stomach; You get sick; You have bad dreams; You die; You get indigestion;
You fall unconscious; You digest the watermelon seeds
MT-Bench {cite}zheng2023judging
has been introduced by
LMSYS Org, and consists of 80 challenging multi-turn questions.
This benchmark is specifically designed to evaluate a model's performance in multi-turn conversations
and instruction-following tasks. It covers a wide range of common use cases and focuses on challenging questions to
effectively differentiate between models. To guide the construction of MT-Bench, eight common categories of user prompts
were identified: writing, role-play, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
:name: mt-bench-example
:class: hint
Category: Writing<br>
1st Turn: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and
must-see attractions.<br>
2nd Turn: Rewrite your previous response. Start every sentence with the letter A.
HumanEval {cite}chen2021evaluating
is a benchmark specifically tailored to evaluate code generation models.
In NLP code generation models are often evaluated on evaluation metrics such as BLEU. However, these metrics
don't capture the complexity of the solutions' space
for code generation. HumanEval contains 164 programs with 8 tests for each.
:width: 70%
Examples of HumanEval Dataset {cite}`chen2021evaluating`
Several other benchmarks have been proposed, in the following table a summary {cite}evaluate-llm
of such benchmarks with the considered factors.
:name: benchmarks-table
Benchmark | Factors considered
----------|--------------------
Big Bench {cite}`srivastava2023imitation`| Generalisation abilities
GLUE Benchmark {cite}`wang2019glue` | Grammar, paraphrasing, text similarity, inference, textual entailment, resolving pronoun references
SuperGLUE Benchmark {cite}`sarlin2020superglue` | Natural Language Understanding, reasoning, understanding complex sentences beyond training data, coherent and well-formed Natural Language Generation, dialogue with humans, common sense reasoning, information retrieval, reading comprehension
ANLI {cite}`nie2020adversarial` | Robustness, generalisation, coherent explanations for inferences, consistency of reasoning across similar examples, efficiency of resource usage (memory usage, inference time, and training time)
CoQA {cite}`reddy2019coqa` | Understanding a text passage and answering a series of interconnected questions that appear in a conversation
LAMBADA {cite}`paperno2016lambada` | Long-term understanding by predicting the last word of a passage
LogiQA {cite}`liu2020logiqa` | Logical reasoning abilities
MultiNLI {cite}`williams2018broadcoverage` | Understanding relationships between sentences across genres
SQUAD {cite}`rajpurkar2016squad` | Reading comprehension tasks
(text-leaderboards)=
HuggingFace OpenLLM Leaderboard
is primarily built upon Language Model Evaluation Harness developed by
EleutherAI, which serves as a framework for evaluating {term}autoregressive language models <Auto-regressive language model>
with
few-shot capabilities. It's important to note that this benchmark exclusively evaluates open-source language models,
so GPT is not included in the list of models tested. The OpenLLM Leaderboard assigns a score ranging from 0 to 100 and is
based on the following benchmarks:
:class: note
As described in [Few-shot prompting](few-shot-prompting) the notation used in the above benchmark (i.e. n-shot) indicates
the number of examples provided to the model during evaluation.
:width: 95%
[HuggingFace OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
The Alpaca Eval Leaderboard employs an LLM-based automatic evaluation method,
utilising the AlpacaEval evaluation set, which is a streamlined
version of the AlpacaFarm evaluation set {cite}dubois2023alpacafarm
. Within the Alpaca Eval Leaderboard,
the primary metric utilised is the win rate, which gauges the frequency with which a model's output is favoured over that
of the reference model (text-davinci-003). This evaluation process is automated and carried out by an automatic evaluator,
such as or , which determines the preferred output.
:subcaptions: above
:class-grid: outline
```{image} https://static.premai.io/book/eval-datasets-alpaca-eval-gpt.png
:align: left
```
```{image} https://static.premai.io/book/eval-datasets-alpaca-eval-claude.png
:align: right
```
[Alpaca Eval Leaderboard](https://tatsu-lab.github.io/alpaca_eval) with GPT (left) and a Claude (right) evaluators
:name: alpaca-eval-attention
:class: attention
* GPT-4 may favour models that were fine-tuned on GPT-4 outputs
* Claude may favour models that were fine-tuned on Claude outputs
Chatbot Arena, developed by LMSYS Org, represents a pioneering platform for assessing LLMs {cite}zheng2023judging
.
This innovative tool allows users to compare responses from different chatbots. Users are presented with pairs of chatbot
interactions and asked to select the better response, ultimately contributing to the creation of an
Elo rating-based leaderboard, which ranks LLMs based on their relative
performance (70K+ user votes to compute).
:width: 100%
Chatbot Arena
The Chatbot Arena Leaderboard is based on the following three benchmarks:
:width: 95%
[Chatbot Arena Leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)
Human Eval LLM Leaderboard distinguishes itself through its unique evaluation process, which entails comparing completions generated from undisclosed instruction prompts using assessments from both human evaluators and . Evaluators rate model completions on a 1-8 Likert scale, and Elo rankings are created using these preferences.
:width: 95%
[Human Eval LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/human_eval_llm_leaderboard)
Massive Text Embedding Benchmark Leaderboard {cite}muennighoff2023mteb
empowers users to discover
the most appropriate {term}embedding <Embedding>
model for a wide range of real-world tasks. It achieves this by offering
an extensive set of 129 datasets spanning eight different tasks and supporting as many as 113 languages.
:width: 100%
[MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)
Differently from aforementioned leaderboards Code Generation on HumanEval Leaderboard tries to close the gap regarding the evaluation of LLMs on code generation tasks by being based on . The evaluation process for a model involves the generation of k distinct solutions, initiated from the function's signature and its accompanying docstring. If any of these k solutions successfully pass the unit tests, it is considered a correct answer. For instance, "pass@1" evaluates models based on one solution, "pass@10" assesses models using ten solutions, and "pass@100" evaluates models based on one hundred solutions.
:width: 95%
[Code Generation on HumanEval Leaderboard](https://paperswithcode.com/sota/code-generation-on-humaneval)
Similar to , Big Code Models Leaderboard tackles the code generation tasks. Moreover, the latter leaderboard consider not only python code generation models but multilingual code generation models as well. In the leaderboard, only open pre-trained multilingual code models are compared using the following primary benchmarks:
- MultiPL-E: Translation of HumanEval to 18 programming languages.
- Throughput Measurement measured using Optimum-Benchmark
:width: 100%
[Big Code Models Leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard)
Assessing the applications of LLMs involves a complex undertaking that goes beyond mere model selection through benchmarks and leaderboards. To unlock the complete capabilities of these models and guarantee their dependability and efficiency in practical situations, a comprehensive evaluation process is indispensable.
Prompt evaluation stands as the foundation for comprehending an LLM's responses to various inputs. Achieving a holistic understanding involves considering the following key points:
-
Prompt Testing: To measure the adaptability of an LLM effectively, we must employ a diverse array of prompts spanning various domains, tones, and complexities. This approach grants us valuable insights into the model's capacity to handle a wide spectrum of user queries and contexts. Tools like promptfoo can facilitate prompt testing.
-
Prompt Robustness Amid Ambiguity: User-defined prompts can be highly flexible, leading to situations where even slight changes can yield significantly different outputs. This underscores the importance of evaluating the LLM's sensitivity to variations in phrasing or wording, emphasizing its robustness {cite}
building-llm-applications
. -
Handling Ambiguity: LLM-generated responses may occasionally introduce ambiguity, posing difficulties for downstream applications that rely on precise output formats. Although we can make prompts explicit regarding the desired output format, there is no assurance that the model will consistently meet these requirements. To tackle these issues, a rigorous engineering approach becomes imperative.
-
Few-Shot Prompt Evaluation: This assessment consists of two vital aspects: firstly, verifying if the LLM comprehends the examples by comparing its responses to expected outcomes; secondly, ensuring that the model avoids becoming overly specialized on these examples, which is assessed by testing it on distinct instances to assess its generalization capabilities {cite}
building-llm-applications
.
In {term}RAG
based applications, the evaluation of embeddings is critical to ensure that the LLM retrieves relevant context.
-
Embedding Quality Metrics: The quality of embeddings is foundational in RAG setups. Metrics like cosine similarity, Euclidean distance, or semantic similarity scores serve as critical yardsticks to measure how well the retrieved documents align with the context provided in prompts.
-
Human Assessment: While automated metrics offer quantifiable insights, human evaluators play a pivotal role in assessing contextual relevance and coherence. Their qualitative judgments complement the automated evaluation process by capturing nuances that metrics might overlook, ultimately ensuring that the LLM-generated responses align with the intended context.
Continuous monitoring is indispensable for maintaining the reliability of LLM applications, and it can be achieved trough:
- Automatic Evaluation Metrics: Quantitative metrics such as BLEU {cite}
papineni2002bleu
, ROUGE {cite}lin-2004-rouge
, METEOR {cite}banerjee-lavie-2005-meteor
, and {term}perplexity
provide objective insights into content quality. By continuously tracking the LLM's performance using these metrics, developers can identify deviations from expected behaviour, helping pinpoint failure points. - Human Feedback Loop: Establishing a feedback mechanism involving human annotators or domain experts proves invaluable in identifying and mitigating {term}
hallucinations <Hallucination>
and failure points. These human evaluators review and rate LLM-generated content, flagging instances where the model provides misleading or incorrect information.
LLM-based applications often exhibit increased complexity and consist of multiple tasks {cite}building-llm-applications
.
For instance, consider "talking to your data", where you query your database using natural language.
:width: 80%
[Control Flows with LLMs](https://huyenchip.com/2023/04/11/llm-engineering.html)
Evaluating an agent, which is an application that performs multiple tasks based on a predefined control flow, is crucial to ensure its reliability and effectiveness. Achieving this goal can be done by means of:
- Unit Testing for Tasks: For each task, define input-output pairs as evaluation examples. This helps ensure that individual tasks produce the correct results.
- Control Flow Testing: Evaluate the accuracy of the control flow within the agent. Confirm that the control flow directs the agent to execute tasks in the correct order, as specified by the control flow logic.
- Integration Testing: Assess the entire agent as a whole by conducting integration tests. This involves evaluating the agent's performance when executing the entire sequence of tasks according to the defined control flow.
Text-to-speech and automatic speech recognition stand out as pivotal tasks in this domain, however evaluating
TTS and ASR models
presents unique challenges and nuances. TTS evaluation incorporates subjective assessments regarding naturalness and
intelligibility {cite}stevens2005line
, which may be subject to individual listener biases and pose additional challenges,
especially when considering prosody and speaker similarity in TTS models. ASR evaluations must factor in considerations like domain-specific
adaptation and the model's robustness to varying accents and environmental conditions {cite}benzeghiba2007automatic
.
LJSpeech {cite}ljspeech17
is a widely used benchmark dataset for TTS research. It comprises
around 13,100 short audio clips recorded by a single speaker who reads passages from non-fiction books. The dataset is
based on texts published between 1884 and 1964, all of which are in the public domain. The audio recordings, made in 2016-17
as part of the LibriVox project, are also in the public domain. LJSpeech serves as a valuable
resource for TTS researchers and developers due to its high-quality, diverse, and freely available speech data.
Multilingual LibriSpeech {cite}pratap2020mls
is an
extension of the extensive LibriSpeech dataset, known for its English-language audiobook recordings. This expansion broadens
its horizons by incorporating various additional languages, including German, Dutch, Spanish, French, Italian, Portuguese,
and Polish. It includes about 44.5K hours of English and a total of about 6K hours for other languages. Within this dataset,
you'll find audio recordings expertly paired with meticulously aligned transcriptions for each of these languages.
CSTR VCTK Corpus comprises speech data from 110 English speakers with diverse accents. Each speaker reads approximately 400 sentences selected from various sources, including a newspaper (Herald Glasgow with permission), the rainbow passage, and an elicitation paragraph from the Speech Accent Archive. VCTK provides a valuable asset for TTS models, offering a wide range of voices and accents to enhance the naturalness and diversity of synthesised speech.
Common Voice {cite}ardila2019common
, developed by Mozilla,
is a substantial and multilingual dataset of human voices, contributed by volunteers and encompassing multiple languages.
This corpus is vast and diverse, with data collected and validated through crowdsourcing. As of November 2019, it includes
29 languages, with 38 in the pipeline, featuring contributions from over 50,000 individuals and totalling 2,500 hours of audio.
It's the largest publicly available audio corpus for speech recognition in terms of volume and linguistic diversity.
LibriTTS {cite}zen2019libritts
is an extensive English speech dataset featuring multiple speakers, totalling around
585 hours of recorded speech at a 24kHz sampling rate. This dataset was meticulously crafted by
Heiga Zen, with support from members of the Google Speech and
Google Brain teams, primarily for the advancement of TTS research. LibriTTS is
derived from the source materials of the LibriSpeech corpus, incorporating mp3 audio files from LibriVox and text files
from Project Gutenberg.
FLEURS {cite}conneau2023fleurs
, the Few-shot Learning Evaluation of Universal Representations
of Speech benchmark, is a significant addition to the field of speech technology and multilingual understanding. Building
upon the https://github.com/facebookresearch/flores machine translation benchmark, FLEURS presents a parallel
speech dataset spanning an impressive 102 languages. This dataset incorporates approximately 12 hours of meticulously
annotated speech data per language, significantly aiding research in low-resource speech comprehension. FLEURS' versatility s
hines through its applicability in various speech-related tasks, including ASR, Speech Language Identification,
Translation, and Retrieval.
ESB {cite}gandhi2022esb
, the End-to-End ASR Systems Benchmark, is designed to assess the performance
of a single ASR system across a diverse set of speech datasets. This benchmark incorporates eight English speech recognition
datasets, encompassing a wide spectrum of domains, acoustic conditions, speaker styles, and transcription needs. ESB serves
as a valuable tool for evaluating the adaptability and robustness of ASR systems in handling various real-world speech scenarios.
Text-To-Speech Synthesis on LJSpeech is a leaderboard that tackles the evaluation of TTS models using the dataset. The leaderboard has different metrics available:
:width: 95%
Text-To-Speech Synthesis on LJSpeech Leaderboard
:class: note
Not all the metrics are available for all models.
The Open ASR Leaderboard assesses speech recognition models, primarily focusing on English, using WER and Real-Time Factor (RTF) as key metrics, with a preference for lower values in both categories. They utilise the ESB benchmark, and models are ranked based on their average WER scores. This endeavour operates under an open-source framework, and the evaluation code can be found on https://github.com/huggingface/open_asr_leaderboard.
:width: 95%
Open ASR Leaderboard
Evaluating image-based models varies across tasks. Object detection and semantic segmentation benefit from less subjective evaluation, relying on quantitative metrics and clearly defined criteria. In contrast, tasks like image generation from text introduce greater complexity due to their subjective nature, heavily reliant on human perception. Assessing visual aesthetics, coherence, and relevance in generated images becomes inherently challenging, emphasising the need for balanced qualitative and quantitative evaluation methods.
COCO (Common Objects in Context) {cite}lin2015microsoft
dataset is a comprehensive and extensive resource for
various computer vision tasks, including object detection, segmentation, key-point detection, and captioning.
Comprising a vast collection of 328,000 images, this dataset has undergone several iterations and improvements since its
initial release in 2014.
:width: 80%
[COCO Dataset Examples](https://cocodataset.org/#home)
ImageNet {cite}deng2009imagenet
dataset is a vast collection of 14,197,122 annotated
images organised according to the WordNet hierarchy. It has been a cornerstone of the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) since 2010,
serving as a critical benchmark for tasks like image classification and object detection. This dataset encompasses a
remarkable diversity with a total of 21,841 non-empty WordNet synsets and over 1 million images with bounding box annotations,
making it a vital resource for computer vision research and development.
:width: 50%
[ImageNet Examples](https://cs.stanford.edu/people/karpathy/cnnembed)
PASCAL VOC dataset is a comprehensive resource comprising 20 object categories, spanning a wide range of subjects, from vehicles to household items and animals. Each image within this dataset comes equipped with detailed annotations, including pixel-level segmentation, bounding boxes, and object class information. It has earned recognition as a prominent benchmark dataset for evaluating the performance of computer vision algorithms in tasks such as object detection, semantic segmentation, and classification. The PASCAL VOC dataset is thoughtfully split into three subsets, comprising 1,464 training images, 1,449 validation images, and a private testing set, enabling rigorous evaluation and advancement in the field of computer vision.
ADE20K {cite}zhou2017scene
semantic segmentation dataset is a valuable resource,
featuring over 20,000 scene-centric images meticulously annotated with pixel-level object and object parts labels.
It encompasses a diverse set of 150 semantic categories, encompassing both "stuff" categories such as sky, road, and
grass, as well as discrete objects like persons, cars, and beds. This dataset serves as a critical tool for advancing
the field of computer vision, particularly in tasks related to semantic segmentation, where the goal is to classify and
delineate objects and regions within images with fine-grained detail.
:width: 50%
[ADE20K Examples](https://paperswithcode.com/dataset/ade20k)
DiffusionDB {cite}wang2023diffusiondb
is the first large-scale text-to-image prompt dataset. It contains
14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users (retrieved
from the official Stable Diffusion Discord server. The prompts in
the dataset are mostly English (contains also other languages such as Spanish, Chinese, and Russian).
:width: 100%
DiffusionDB Examples {cite}`wang2023diffusiondb`
The Object Detection Leaderboard evaluates models using various metrics on the COCO dataset. These metrics include Average Precision (AP) at different IoU thresholds, Average Recall (AR) at various detection counts, and FPS (Frames Per Second). The leaderboard is based on the COCO evaluation approach from the COCO evaluation toolkit.
:width: 100%
[Object Detection Leaderboard]
The Semantic Segmentation on ADE20K Leaderboard evaluates models on mainly using mean Intersection over Union (mIoU).
:width: 80%
[Semantic Segmentation on ADE20K](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k)
The Open Parti Prompt Leaderboard assesses open-source text-to-image models according to human preferences, utilizing the Parti Prompts dataset for evaluation. It leverages community engagement through the Open Parti Prompts Game, in which participants choose the most suitable image for a given prompt, with their selections informing the model comparisons.
:width: 90%
Open Parti Prompts Game
The leaderboard offers an overall comparison and detailed breakdown analyses by category and challenge type, providing a comprehensive assessment of model performance.
:width: 90%
[Open Parti Prompt Leaderboard](https://huggingface.co/spaces/OpenGenAI/parti-prompts-leaderboard)
Understanding video content requires recognizing not just objects and actions but also comprehending their temporal relationships. Creating accurate ground truth annotations for video datasets is a time-consuming process due to the sequential nature of video data. Additionally, assessing video generation or comprehension models involves intricate metrics that measure both content relevance and temporal coherence, making the evaluation task intricate.
UCF101 dataset {cite}soomro2012ucf101
comprises 13,320 video clips categorized
into 101 distinct classes. These 101 categories can be further grouped into five types: Body motion, Human-human interactions,
Human-object interactions, Playing musical instruments, and Sports. The combined duration of these video clips exceeds 27
hours. All videos were sourced from YouTube and maintain a consistent frame rate of 25 frames per second (FPS) with a
resolution of 320 × 240 pixels.
Kinetics, developed by the Google Research team, is a dataset featuring up to 650,000 video clips, covering 400/600/700 human action classes in different versions. These clips show diverse human interactions, including human-object and human-human activities. Each action class contains a minimum of 400/600/700 video clips, each lasting about 10 seconds and annotated with a single action class.
MSR-VTT dataset {cite}xu2016msr
, also known as Microsoft Research Video to Text,
stands as a substantial dataset tailored for open domain video captioning. This extensive dataset comprises 10,000 video
clips spanning across 20 diverse categories. Remarkably, each video clip is meticulously annotated with 20 English sentences
by Amazon Mechanical Turks, resulting in a rich collection of textual descriptions. These annotations
collectively employ approximately 29,000 distinct words across all captions.
MSVD dataset, known as the Microsoft Research Video Description Corpus, encompasses approximately 120,000 sentences that were gathered in the summer of 2010. The process involved compensating workers on Amazon Mechanical Turks to view brief video segments and subsequently encapsulate the action within a single sentence. Consequently, this dataset comprises a collection of nearly parallel descriptions for over 2,000 video snippets.
Action Recognition on UCF101 Leaderboard evaluates models on the action recognition task based on the UCF101 dataset.
:width: 80%
[Action Recognition on UCF101](https://paperswithcode.com/sota/action-recognition-in-videos-on-ucf101)
Action Classification on Kinetics-700 Leaderboard evaluates models on the action classification task based on Kinetics-700 dataset. The evaluation is based on top-1 and top-5 accuracy metrics, where top-1 accuracy measures the correctness of the model's highest prediction, and top-5 accuracy considers whether the correct label is within the top five predicted labels.
:width: 80%
[Action Classification on Kinetics-700](https://paperswithcode.com/sota/action-classification-on-kinetics-700)
Text-to-Video Generation on MSR-VTT Leaderboard evaluates models on video generation based on the MSR-VTT dataset. The leaderboard employs two crucial metrics, namely clipSim and FID. ClipSim quantifies the similarity between video clips in terms of their content alignment, while FID evaluates the quality and diversity of generated videos. Lower FID scores are indicative of superior performance in this task.
:width: 80%
[Text-to-Video Generation on MSR-VTT Leaderboard](https://paperswithcode.com/sota/text-to-video-generation-on-msr-vtt)
In the Visual Question Answering on MSVD-QA Leaderboard models are evaluated for their ability to answer questions about video content from the MSVD dataset.
:width: 80%
[Visual Question Answering on MSVD-QA Leaderboard](https://paperswithcode.com/sota/visual-question-answering-on-msvd-qa-1)
Thus far, we have conducted an analysis of multiple leaderboards, and now we will shift our focus to an examination of their limitations.
- Overfitting to Benchmarks: excessive fine-tuning of models for benchmark tasks may lead to models that excel in those specific tasks but are less adaptable and prone to struggling with real-world tasks outside their training data distribution
- Benchmark Discrepancy: benchmarks may not accurately reflect real-world performance; for instance, the LLaMA-2 70B
model may appear superior to in a benchmark but could perform differently in practical applications {cite}
evaluating-os-llm
. - Benchmarks' Implementations: variations in implementations and evaluation approaches can result in substantial score disparities and model rankings, even when applied to the same dataset and models.
- Dataset Coverage: benchmarks datasets often lack comprehensive coverage, failing to encompass the full range of
potential inputs that a model may encounter (e.g. limited dataset for code generation evaluation) {cite}
evaluating-os-llm
. - AI, Not AGI: LLM leaderboards assess various models trained on diverse datasets by posing general questions (e.g., "how
old is Earth?") and evaluating their responses. Consequently, the metrics gauge several facets, including the alignment
between questions and training data, the LLM's language comprehension (syntax, semantics, ontology) {cite}
manning2022human
, its memorisation capability, and its ability to retrieve memorised information. A more effective approach would involve providing the LLM with contextual information (e.g., instructing it to read a specific astronomy textbook:path/to/some.pdf
) and evaluating LLMs solely based on their outputs within that context. - Illusion of Improvement: minor performance gains observed in a benchmark
may not materialise in real-world applications due to uncertainties arising from the mismatch between the benchmark
environment and the actual practical context {cite}
hand2006classifier
. - Balanced Approach: while benchmarks serve as valuable initial evaluation tools for models {cite}
evaluating-os-llm
, it's essential not to depend solely on them. Prioritise an in-depth understanding of your unique use case and project requirements. - Evaluating ChatGPT on Internet Data: it is crucial to note that evaluating ChatGPT
on internet data or test sets found online {cite}
evaluating-chatgpt
, which may overlap with its training data, can lead to invalid results. This practice violates fundamental machine learning principles and renders the evaluations unreliable. Instead, it is advisable to use test data that is not readily available on the internet or to employ human domain experts for meaningful and trustworthy assessments of ChatGPT's text quality and appropriateness. - Models Interpretability: it is essential to consider model interpretability {cite}
rudin2021interpretable
in the evaluation process. Understanding how a model makes decisions and ensuring its transparency is crucial, especially in applications involving sensitive data or critical decision-making. Striking a balance between predictive power and interpretability is imperative. - Beyond leaderboard rankings: several factors including prompt tuning, embeddings retrieval, model parameter
adjustments, and data storage, significantly impact a LLM's real-world performance {cite}
skanda-evaluating-llm
. Recent developments (e.g. https://github.com/explodinggradients/ragas, https://github.com/langchain-ai/langsmith-cookbook) aim to simplify LLM evaluation and integration into applications, emphasising the transition from leaderboards to practical deployment, monitoring, and assessment.
The evaluation of {term}SotA
models presents both intriguing challenges and promising opportunities. There
is a clear trend towards the recognition of human evaluation as an essential component, facilitated by the utilisation
of crowdsourcing platforms. Initiatives like for LLM evaluation and
for text-to-image generation assessment underscore the growing importance
of human judgment and perception in model evaluation.
In parallel, there is a noteworthy exploration of alternative evaluation approaches, where models themselves act as
evaluators. This transformation is illustrated by the creation of automatic evaluators within the
Alpaca Leaderboard, and by the proposed approach of using the GPT-4 as an evaluator {cite}zheng2023judging
.
These endeavours shed light on novel methods for assessing model performance.
The future of model evaluation will likely involve a multidimensional approach that combines benchmarks, leaderboards, human evaluations, and innovative model-based assessments to comprehensively gauge model capabilities in a variety of real-world contexts.
{{ comments }}