# Description of Long-Context Tasks in the Eval Harness

These tasks can be found in `./tasks.py` and are invoked from the `eval.py` harness with the `--tasks` parameter.

## Synthetic

RULER defines a set of synthetic tasks designed to test a model’s long-context understanding.

Tasks include needle in a haystack (NIAH), variable tracking (VT), question answering (QA), and common word extraction (CWE).
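To make the NIAH format concrete, the sketch below builds one toy example: a single "needle" fact hidden inside repeated filler text, paired with a retrieval question. It is only an illustration under assumed wording and lengths; the filler sentence, needle phrasing, and `make_niah_example` helper are hypothetical and do not reflect the actual generators in `./tasks.py` or the RULER implementation.

```python
import random


def make_niah_example(num_fillers: int = 200, seed: int = 0) -> dict:
    """Build one toy (context, question, answer) triple for a NIAH-style task."""
    rng = random.Random(seed)
    filler = "The grass is green and the sky is blue. "
    needle_value = rng.randint(100000, 999999)
    needle = f"The special magic number is {needle_value}. "

    # Repeat the filler sentence to form the haystack, then hide the
    # needle at a random position inside it.
    haystack = [filler] * num_fillers
    haystack.insert(rng.randint(0, len(haystack)), needle)

    return {
        "context": "".join(haystack),
        "question": "What is the special magic number mentioned in the context?",
        "answer": str(needle_value),
    }


if __name__ == "__main__":
    example = make_niah_example()
    print(example["question"], "->", example["answer"])
```

Real NIAH variants typically sweep the haystack length and the needle's position (depth) to probe where retrieval starts to degrade.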

## Domain-Specific

This task evaluates the model's ability to perform domain-specific, methodical writing tasks, such as writing a differential diagnosis for a patient or a lesson plan for students.

## Coding

This task tests the model's ability to understand code repositories and make correct predictions for code completion.

## QA

MuSiQue is a question-answering dataset that tests the model's ability to perform multi-hop reasoning over a long input context.

TruthfulQA tests the model's ability to answer questions truthfully across a broad set of categories such as health, law, finance, and politics.

## Language Modeling

This task tests the model's ability to generate long-form text (~8K tokens) given the title and opening words of a book.

## Summarization

A meeting summarization dataset that evaluates the model's ability to select and summarize content relevant to a given query.

SQuALITY is a question-focused summarization dataset that tests the model's ability to understand long narratives and to select and summarize content relevant to the provided question.

QuALITY tests the model’s ability to understand and answer questions about long narratives.