Releases: huggingface/lighteval

v0.7.0

03 Jan 15:45

What's New

New Tasks

New Features

More Translation Literals by the Community

New Doc

Bug Fixes and Refactoring

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @clefourrier
    • added musr (#375)
    • Update README.md
    • Use the programmatic interface using an already in memory loaded model (#390)
    • Pr sadra (#393)
    • Allowing a single prompt to use several formats for one eval (#398)
    • Autoscaling inference endpoints (#412)
    • add configs with their models (#421)
    • Fix custom arabic tasks (#440)
    • Adds serverless endpoints back (#445)
    • Homogeneize generation params (#428)
  • @JoelNiklaus
    • Fixes a TypeError for generative metrics. (#386)
    • Fixes a TypeError in Sacrebleu. (#387)
    • Fixes an error with getting the golds from the formatted_docs. (#388)
    • Speed up Bootstrapping Computation (#409)
    • Add litellm inference (#385)
  • @albertvillanova
    • Update instance type/size in endpoint model_config example (#401)
    • Typo in feature-request.md (#406)
    • Add doc-builder doc-pr-upload GH Action (#411)
    • Set up docs (#403)
    • Add docstring docs (#413)
    • Add missing models to docs (#419)
    • Add trufflehog secrets detection (#429)
    • Update docs about inference endpoints (#432)
    • Fix ignored reuse_existing in config file (#431)
    • Test inference endpoint model config parsing from path (#434)
    • Fix imports from model_config (#443)
    • Fix wrong instructions and code for custom tasks (#450)
    • Fix minor typos (#449)
    • Implement TGI model config from path (#448)
    • Upgrade deprecated GH Action cache@v2 (#456)
    • Add EvaluationTracker to docs and fix its docstring (#464)
    • Remove unnecessary deepcopy in evaluation_tracker (#459)
    • Fix warning about precedence of custom tasks over default ones in registry (#466)
    • Checkout PR merge commit for CI tests (#468)
    • Delete endpoint on InferenceEndpointTimeoutError (#475)
  • @NathanHB
    • Fix splitting for generative tasks (#400)
    • Nathan refacto cli (#407)
    • redo logging (#415)
    • option to list custom tasks (#425)
    • fix model parallel (#481)
  • @ParagEkbote
    • Deprecate Obsolete Config Properties (#433)
  • @alielfilali01
    • Add new Arabic benchmarks (5) and enhance existing tasks (#372)
    • Update arabic_evals.py: Fix custom arabic tasks [2nd attempt] (#444)

v0.6.0

23 Oct 16:02

What's New

Lighteval becomes massively multilingual!

We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.

Other Tasks

Features

Bug Fixes

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @hynky1999
    • Support for multilingual generative metrics (#293)
    • Adds tasks templating (#335)
    • Multilingual NLI Tasks (#329)
    • Multilingual COPA tasks (#330)
    • Multilingual Hellaswag tasks (#332)
    • Multilingual Reading Comprehension tasks (#333)
    • Multilingual General Knowledge tasks (#338)
    • Selecting tasks using their superset (#308)
    • Fix Tokenization + misc fixes (#354)
    • Misc-multilingual tasks (#339)
    • add iroko bench + nicer output on task search failure (#357)
    • Translation literals (#356)
    • selected tasks for multilingual evaluation (#371)
    • Adds Baseline workflow + fixes (#363)
  • @DeanChugall
    • Serbian LLM Benchmark Task (#340)
  • @NathanHB
    • readme rewrite (#327)
    • refacto judge and add mixeval (#337)
    • bump lighteval versoin (#328)
    • fix (#347)
    • Nathan llm judge quickfix (#348)
    • Nathan llm judge quickfix (#350)
    • adds openai models (#359)

New Contributors

Full Changelog: v0.5.0...v0.6.0

v0.5.0

24 Sep 13:38

What's new

Features

Bug fixes

  • Fixes bug: You can't create a model without either a list of model_args or a model_config_path when model_config_path was submited by @NathanHB in #298
  • skip tests if secrets not provided by @hynky1999 in #304
  • [FIX] vllm backend by @NathanHB in #317

v0.4.0

05 Sep 13:28

What's new

Features

New tasks

Fixes

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @NathanHB
    • commit (#137)
    • Add llm as judge in metrics (#146)
    • Nathan add logging to metrics (#157)
    • add 'cite as' section in readme (#178)
    • Fix citation section in readme (#180)
    • adding aimo custom eval (#154)
    • fix llm as judge warnings (#173)
    • launch lighteval using lighteval --args (#152)
    • adds llm as judge using transformers (#223)
    • Fix missing json file (#264)
    • change priority when choosing model dtype (#263)
    • fix the location of tasks list in the readme (#267)
    • updates ifeval repo (#268)
    • fix nanotron (#283)
    • add vlmm backend (#274)
    • bump nltk version (#290)
  • @clefourrier
    • Add config files for models (#131)
    • Add fun widgets to the README (#145)
    • Fix nanotron models input size bug (#156)
    • no function we actually use should be named prompt_fn (#168)
    • Add maj@k metric (#158)
    • Homogeneize logging system (#150)
    • Use only dataclasses for task init (#212)
    • Now only uses functions for prompt definition (#213)
    • Data split depending on eval params (#169)
    • should fix most inference endpoints issues of version config (#226)
    • Add metrics as functions (#214)
    • Quantization related issues (#224)
    • Update issue templates (#235)
    • remove latex writer since we don't use it (#231)
    • Removes default bert scorer init (#234)
    • fix (#233)
    • udpated piqa (#222)
    • uses torch compile if provided (#248)
    • Fix inference endpoint config (#244)
    • Expose samples via the CLI (#228)
    • Fixing issues with multichoice_continuations_start_space - was not parsed properly (#232)
    • Programmatic interface + cleaner management of requests (#269)
    • Small file reorg (only renames/moves) (#271)
    • Refactoring the few shot management (#272)
  • @PhilipMay
    • Add Ger-RAG-evaltasks. (#149)
    • Add version config option. (#181)
  • @shaltielshmid
    • Added Namespace parameter for InferenceEndpoints, added option for passing model config directly (#147)
    • Updated tgi_model and added parameters for endpoint_model (#208)
  • @hynky1999
    • make info loggers dataclass, so that their properties have expected lifetime (#280)
    • Remove expensive prediction run during test collection (#279)
    • Probability Metric + New Normalization (#276)
    • Standalone nanotron config (#285)
    • Logging Revamp (#284)

v0.3.0

29 Mar 16:42

Release Note

This release introduces the new extended tasks feature, documentation, and many other patches for improved stability.
New tasks are also introduced.

MT-Bench marks the introduction of multi-turn prompting as well as the LLM-as-a-judge metric.

New tasks

Features

Documentation

Small patches

New Contributors

Full Changelog: v0.2.0...v0.3.0

v0.2.0

01 Mar 14:31

Release Note

This release focuses on customization and personalisation: it's now possible to define custom metrics, not just custom tasks; see the README for the full mechanism.
It also includes small fixes to improve stability, as well as new tasks. We chose to split community tasks from the main library source to make maintenance easier.

Better community task handling

New tasks

Features

  • Add an automatic system to compute average for tasks with subtasks by @clefourrier in #41
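
As a rough illustration of what averaging over subtasks can look like, here is a minimal sketch. It is not the code from #41: the function name `average_over_subtasks`, the `task:subtask` naming scheme, the `_average` suffix, and the unweighted mean are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def average_over_subtasks(
    results: dict[str, dict[str, float]], sep: str = ":"
) -> dict[str, dict[str, float]]:
    """Group per-subtask scores by parent task and average each metric.

    Hypothetical helper: assumes subtasks are named "<task><sep><subtask>"
    and that every subtask counts equally (unweighted mean).
    """
    grouped: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for task_name, metrics in results.items():
        if sep not in task_name:
            continue  # task has no subtask suffix, so there is nothing to aggregate
        parent = task_name.split(sep, 1)[0]
        for metric_name, value in metrics.items():
            grouped[parent][metric_name].append(value)

    # One aggregated entry per parent task, keyed with an "_average" suffix.
    return {
        f"{parent}{sep}_average": {name: mean(values) for name, values in metrics.items()}
        for parent, metrics in grouped.items()
    }


if __name__ == "__main__":
    scores = {
        "mmlu:anatomy": {"acc": 0.5},
        "mmlu:astronomy": {"acc": 1.0},
        "arc": {"acc": 0.3},
    }
    print(average_over_subtasks(scores))
```

A sample-weighted mean may be preferable when subtasks differ a lot in size; the unweighted form is used here only to keep the sketch short.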

Small patches

✨ Community Contributions

Full Changelog: v0.1.1...v0.2.0

v0.1.1

09 Feb 11:29

Small patch for PyPI release

Include tasks_table.jsonl in package

v0.1.0

08 Feb 10:27
Compare
Choose a tag to compare

Init

LightEval 🌤️

A lightweight LLM evaluation

Context

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

We're releasing it with the community in the spirit of building in the open.

Note that it is still very early days, so don't expect 100% stability ^^'
In case of problems or questions, feel free to open an issue!

Full Changelog: https://github.com/huggingface/lighteval/commits/v0.1