Releases: huggingface/lighteval
v0.7.0
What's New
New Tasks
- added musr by @clefourrier in #375
- Adds Global MMLU by @hynky1999 in #426
- Add new Arabic benchmarks (5) and enhance existing tasks by @alielfilali01 in #372
New Features
- Evaluate a model already loaded in memory for a training / evaluation loop by @clefourrier in #390 (see the sketch after this list)
- Allowing a single prompt to use several formats for one eval by @clefourrier in #398
- Autoscaling inference endpoints hardware by @clefourrier in #412
- CLI new look and features (using typer) by @NathanHB in #407
- Better-looking and more functional logging by @NathanHB in #415
- Add litellm backend by @JoelNiklaus in #385
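
The in-memory evaluation and programmatic interface above can be sketched roughly as below. Treat this as an assumption-laden outline: the import paths, class names, and parameters are indicative only and may differ between lighteval versions, so check the docs for the exact API.

```python
# Hedged sketch: evaluating an already-instantiated transformers model with
# lighteval's programmatic interface. Import paths and argument names are
# assumptions and may differ in your installed version.
from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

model = AutoModelForCausalLM.from_pretrained("gpt2")  # model already in memory

tracker = EvaluationTracker(output_dir="./eval_results")
params = PipelineParameters(launcher_type=ParallelismManager.ACCELERATE)

pipeline = Pipeline(
    tasks="leaderboard|truthfulqa:mc|0|0",  # suite|task|few-shot count|truncation flag
    pipeline_parameters=params,
    evaluation_tracker=tracker,
    model=model,  # pass the loaded model instead of a model name
)
pipeline.evaluate()
pipeline.show_results()
```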
More Translation Literals by the Community
- Add Bashkir variants by @AigizK in #374
- Add Shan (shn) translation literals by @NoerNova in #376
- Add Udmurt (udm) translation literals by @codemurt in #381
- Add Belarusian translation literals by @Kryuski in #382
- Add Tatar translation literals by @gaydmi in #383
New Documentation
- Add doc-builder doc-pr-upload GH Action by @albertvillanova in #411
- Set up docs by @albertvillanova in #403
- Add docstring docs by @albertvillanova in #413
- Add missing models to docs by @albertvillanova in #419
- Update docs about inference endpoints by @albertvillanova in #432
- Upgrade deprecated GH Action cache@v2 by @albertvillanova in #456
- Add EvaluationTracker to docs and fix its docstring by @albertvillanova in #464
- Checkout PR merge commit for CI tests by @albertvillanova in #468
Bug Fixes and Refactoring
- Allow AdapterModels to have custom tokens by @mapmeld in #306
- Homogenize generation params by @clefourrier in #428
- fix: cache directory variable by @NazimHAli in #378
- Add trufflehog secrets detection by @albertvillanova in #429
- greedy_until() fix by @vsabolcec in #344
- Fixes a TypeError for generative metrics. by @JoelNiklaus in #386
- Speed up Bootstrapping Computation by @JoelNiklaus in #409
- Fix imports from model_config by @albertvillanova in #443
- Fix wrong instructions and code for custom tasks by @albertvillanova in #450
- Fix minor typos by @albertvillanova in #449
- fix model parallel by @NathanHB in #481
- add configs with their models by @clefourrier in #421
- Fixes a TypeError in Sacrebleu. by @JoelNiklaus in #387
- fix ukr/rus by @hynky1999 in #394
- fix repeated cleanup by @anton-l in #399
- Update instance type/size in endpoint model_config example by @albertvillanova in #401
- Handle the case where an empty request list is given to the base model by @sadra-barikbin in #250
- Fix a tiny bug in `PromptManager::FewShotSampler::_init_fewshot_sampling_random` by @sadra-barikbin in #423
- Fix splitting for generative tasks by @NathanHB in #400
- Fixes an error with getting the golds from the formatted_docs. by @JoelNiklaus in #388
- Fix ignored reuse_existing in config file by @albertvillanova in #431
- Deprecate Obsolete Config Properties by @ParagEkbote in #433
- fix: LightevalTaskConfig.stop_sequence attribute by @ryan-minato in #463
- fix: scorer attribute initialization in ROUGE by @ryan-minato in #471
- Delete endpoint on InferenceEndpointTimeoutError by @albertvillanova in #475
- Remove unnecessary deepcopy in evaluation_tracker by @albertvillanova in #459
- fix: CACHE_DIR Default Value in Accelerate Pipeline by @ryan-minato in #461
- Fix warning about precedence of custom tasks over default ones in registry by @albertvillanova in #466
- Implement TGI model config from path by @albertvillanova in #448
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @clefourrier
- added musr (#375)
- Update README.md
- Use the programmatic interface with an already-loaded in-memory model (#390)
- Pr sadra (#393)
- Allowing a single prompt to use several formats for one eval (#398)
- Autoscaling inference endpoints (#412)
- add configs with their models (#421)
- Fix custom arabic tasks (#440)
- Adds serverless endpoints back (#445)
- Homogenize generation params (#428)
- @JoelNiklaus
- @albertvillanova
- Update instance type/size in endpoint model_config example (#401)
- Typo in feature-request.md (#406)
- Add doc-builder doc-pr-upload GH Action (#411)
- Set up docs (#403)
- Add docstring docs (#413)
- Add missing models to docs (#419)
- Add trufflehog secrets detection (#429)
- Update docs about inference endpoints (#432)
- Fix ignored reuse_existing in config file (#431)
- Test inference endpoint model config parsing from path (#434)
- Fix imports from model_config (#443)
- Fix wrong instructions and code for custom tasks (#450)
- Fix minor typos (#449)
- Implement TGI model config from path (#448)
- Upgrade deprecated GH Action cache@v2 (#456)
- Add EvaluationTracker to docs and fix its docstring (#464)
- Remove unnecessary deepcopy in evaluation_tracker (#459)
- Fix warning about precedence of custom tasks over default ones in registry (#466)
- Checkout PR merge commit for CI tests (#468)
- Delete endpoint on InferenceEndpointTimeoutError (#475)
- @NathanHB
- @ParagEkbote
- Deprecate Obsolete Config Properties (#433)
- @alielfilali01
v0.6.0
What's New
Lighteval becomes massively multilingual!
We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.
- Add 3 NLI tasks supporting 26 unique languages. #329 by @hynky1999
- Add 3 COPA tasks supporting about 20 unique languages. #330 by @hynky1999
- Add Hellaswag tasks supporting about 36 unique languages. #332 by @hynky1999
- mlmm_hellaswag
- hellaswag_{tha/tur}
- Add RC (reading comprehension) tasks supporting about 130 unique languages/scripts. #333 by @hynky1999
- Add GK (general knowledge) tasks supporting about 35 unique languages/scripts. #338 by @hynky1999
- meta_mmlu
- mlmm_mmlu
- rummlu
- mmlu_ara_mcf
- tur_leaderboard_mmlu
- cmmlu
- mmlu
- ceval
- mlmm_arc_challenge
- alghafa_arc_easy
- community_arc
- community_truthfulqa
- exams
- m3exams
- thai_exams
- xcsqa
- alghafa_piqa
- mera_openbookqa
- alghafa_openbookqa
- alghafa_sciqa
- mathlogic_qa
- agieval
- mera_worldtree
- Misc Tasks #339 by @hynky1999
- openai_mmlu_tasks
- turkish_mmlu_tasks
- lumi arc
- hindi/swahili/arabic (from alghafa) arc
- cmath
- mgsm
- xcodah
- xstory
- xwinograd + tr winograd
- mlqa
- mkqa
- mintaka
- mlqa_tasks
- french triviaqa
- chegeka
- acva
- french_boolq
- hindi_boolq
- Serbian LLM Benchmark Task by @DeanChugall in #340
- iroko bench by @hynky1999 in #357
Other Tasks
Features
- Can now evaluate OpenAI models by @NathanHB in #359
- New Doc and README by @NathanHB in #327
- Refactor LLM as a Judge by @NathanHB in #337
- Selecting tasks using their superset by @hynky1999 in #308
- Nicer output on task search failure by @hynky1999 in #357
- Adds tasks templating by @hynky1999 in #335 (see the sketch after this list)
- Support for multilingual generative metrics by @hynky1999 in #293
- Class implementations of faithfulness and extractiveness metrics by @chuandudx in #323
- Translation literals by @hynky1999 in #356
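
To illustrate how templating and translation literals fit together, here is a minimal standalone sketch in plain Python. It is not lighteval's actual template API; the literal values and helper name are made up for the example.

```python
# Conceptual sketch only: task templates combined with per-language
# translation literals can render the same task in several languages.
LITERALS = {
    "eng": {"question": "Question", "answer": "Answer"},
    "fra": {"question": "Question", "answer": "Réponse"},
    "tur": {"question": "Soru", "answer": "Cevap"},
}

def boolq_style_prompt(language: str, passage: str, question: str) -> str:
    """Render a reading-comprehension prompt using the literals of `language`."""
    lit = LITERALS[language]
    return f"{passage}\n{lit['question']}: {question}\n{lit['answer']}:"

print(boolq_style_prompt("tur", "Paris Fransa'nın başkentidir.", "Paris Fransa'da mı?"))
```

The real template system covers many more literals (question words, connectors, punctuation) so that one task definition can be rendered consistently across all supported languages.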
Bug Fixes
- Math normalization: do not crash on invalid format by @guipenedo in #331
- Skipping push to hub test by @clefourrier in #334
- Fix Metrics import path in community task template file. by @chuandudx in #309
- Allow kwargs for BERTScore compute function and remove unused var by @chuandudx in #311
- Fixes sampling for vllm when num_samples==1 by @edbeeching in #343
- Fix the dataset loading for custom tasks by @clefourrier in #364
- Fix: missing property tag in inference endpoints by @clefourrier in #368
- Fix Tokenization + misc fixes by @hynky1999 in #354
- Fix BLEURT evaluation errors by @chuandudx in #316
- Adds Baseline workflow + fixes by @hynky1999 in #363
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Support for multilingual generative metrics (#293)
- Adds tasks templating (#335)
- Multilingual NLI Tasks (#329)
- Multilingual COPA tasks (#330)
- Multilingual Hellaswag tasks (#332)
- Multilingual Reading Comprehension tasks (#333)
- Multilingual General Knowledge tasks (#338)
- Selecting tasks using their superset (#308)
- Fix Tokenization + misc fixes (#354)
- Misc-multilingual tasks (#339)
- add iroko bench + nicer output on task search failure (#357)
- Translation literals (#356)
- selected tasks for multilingual evaluation (#371)
- Adds Baseline workflow + fixes (#363)
- @DeanChugall
- Serbian LLM Benchmark Task (#340)
- @NathanHB
New Contributors
- @chuandudx made their first contribution in #323
- @edbeeching made their first contribution in #343
- @DeanChugall made their first contribution in #340
- @Stopwolf made their first contribution in #225
- @martinscooper made their first contribution in #366
Full Changelog: v0.5.0...v0.6.0
v0.5.0
What's new
Features
- Tokenization-wise encoding by @hynky1999 in #287
- Task config by @hynky1999 in #289
Bug fixes
v0.4.0
What's new
Features
- Adds vllm as backend for insane speed up by @NathanHB in #274
- Add llm_as_judge in metrics (using either OpenAI or Transformers) by @NathanHB in #146
- Able to use config files for models by @clefourrier in #131
- List available tasks in the CLI with `lighteval tasks --list` by @DimbyTa in #142
- Use torch compile for speed up by @clefourrier in #248
- Add maj@k metric by @clefourrier in #158 (see the sketch after this list)
- Adds a dummy/random model for baseline init by @guipenedo in #220
- lighteval is now a CLI tool: `lighteval --args` by @NathanHB in #152
- We can now log info from the metrics (for example input and response from llm_as_judge) by @NathanHB in #157
- Configurable task versioning by @PhilipMay in #181
- Programmatic interface by @clefourrier in #269
- Probability Metric + New Normalization by @hynky1999 in #276
- Add widgets to the README by @clefourrier in #145
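
For intuition, the maj@k metric above amounts to majority voting over k sampled generations. Below is a minimal standalone sketch of that idea, not lighteval's exact implementation.

```python
from collections import Counter

def maj_at_k(sampled_answers: list[str], gold: str) -> float:
    """Majority-vote accuracy: 1.0 if the most frequent of the k sampled
    answers matches the gold answer, else 0.0."""
    if not sampled_answers:
        return 0.0
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return float(majority_answer == gold)

# Example: 5 samples, the majority answer "42" matches the gold answer.
print(maj_at_k(["42", "41", "42", "42", "7"], gold="42"))  # 1.0
```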
New tasks
- Add `Ger-RAG-eval` tasks by @PhilipMay in #149
- Adding `aimo` custom eval by @NathanHB in #154
Fixes
- Bump nltk to 3.9.1 to fix security issue by @NathanHB in #137
- Fix max_length type when being passed in model args by @csarron in #138
- Fix nanotron models input size bug by @clefourrier in #156
- Fix MATH normalization by @lewtun in #162
- fix Prompt function names by @clefourrier in #168
- Fix prompt format german rag community task by @jphme in #171
- add 'cite as' section in readme by @NathanHB in #178
- Fix broken link to extended tasks in README by @alexrs in #182
- Mention HF_TOKEN in readme by @Wauplin in #194
- Download BERT scorer lazily by @sadra-barikbin in #190
- Updated tgi_model and added parameters for endpoint_model by @shaltielshmid in #208
- fix llm as judge warnings by @NathanHB in #173
- ADD GPT-4 as Judge by @philschmid in #206
- Fix a few typos and do a tiny refactor by @sadra-barikbin in #187
- Avoid truncating the outputs based on string lengths by @anton-l in #201
- Now only uses functions for prompt definition by @clefourrier in #213
- Data split depending on eval params by @clefourrier in #169
- should fix most inference endpoints issues of version config by @clefourrier in #226
- Fix _init_max_length in base_model.py by @gucci-j in #185
- Make evaluator invariant of input request type order by @sadra-barikbin in #215
- Fixing issues with multichoice_continuations_start_space - was not parsed properly by @clefourrier in #232
- Fix IFEval metric by @lewtun in #259
- change priority when choosing model dtype by @NathanHB in #263
- Add grammar option to generation by @sadra-barikbin in #242
- make info loggers dataclass, so that their properties have expected lifetime by @hynky1999 in #280
- Remove expensive prediction run during test collection by @hynky1999 in #279
- Example Configs and Docs by @RohitMidha23 in #255
- Refactoring the few shot management by @clefourrier in #272
- Standalone nanotron config by @hynky1999 in #285
- Logging Revamp by @hynky1999 in #284
- bump nltk version by @NathanHB in #290
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @NathanHB
- commit (#137)
- Add llm as judge in metrics (#146)
- Nathan add logging to metrics (#157)
- add 'cite as' section in readme (#178)
- Fix citation section in readme (#180)
- adding aimo custom eval (#154)
- fix llm as judge warnings (#173)
- launch lighteval using `lighteval --args` (#152)
- adds llm as judge using transformers (#223)
- Fix missing json file (#264)
- change priority when choosing model dtype (#263)
- fix the location of tasks list in the readme (#267)
- updates ifeval repo (#268)
- fix nanotron (#283)
- add vllm backend (#274)
- bump nltk version (#290)
- @clefourrier
- Add config files for models (#131)
- Add fun widgets to the README (#145)
- Fix nanotron models input size bug (#156)
- no function we actually use should be named prompt_fn (#168)
- Add maj@k metric (#158)
- Homogenize logging system (#150)
- Use only dataclasses for task init (#212)
- Now only uses functions for prompt definition (#213)
- Data split depending on eval params (#169)
- should fix most inference endpoints issues of version config (#226)
- Add metrics as functions (#214)
- Quantization related issues (#224)
- Update issue templates (#235)
- remove latex writer since we don't use it (#231)
- Removes default bert scorer init (#234)
- fix (#233)
- updated piqa (#222)
- uses torch compile if provided (#248)
- Fix inference endpoint config (#244)
- Expose samples via the CLI (#228)
- Fixing issues with multichoice_continuations_start_space - was not parsed properly (#232)
- Programmatic interface + cleaner management of requests (#269)
- Small file reorg (only renames/moves) (#271)
- Refactoring the few shot management (#272)
- @PhilipMay
- @shaltielshmid
- @hynky1999
v0.3.0
Release Note
This release introduces the new extended tasks feature, documentation, and many other patches for improved stability.
New tasks are also introduced:
- Big Bench Hard: https://huggingface.co/papers/2210.09261
- AGIEval: https://huggingface.co/papers/2304.06364
- TinyBench:
- MT Bench: https://huggingface.co/papers/2306.05685
- AlGhafa Benchmarking Suite: https://aclanthology.org/2023.arabicnlp-1.21/
MT-Bench marks the introduction of multi-turn prompting as well as the llm-as-a-judge metric.
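
For intuition, here is a conceptual, standalone sketch of how a multi-turn sample can be folded into a judge prompt; the prompt wording and helper name are illustrative and do not reflect lighteval's internal code.

```python
# Conceptual sketch (not lighteval's internal code): assembling an
# LLM-as-a-judge prompt for a multi-turn MT-Bench-style sample.
def build_judge_prompt(questions: list[str], answers: list[str]) -> str:
    turns = "\n\n".join(
        f"### Turn {i + 1}\nUser: {q}\nAssistant: {a}"
        for i, (q, a) in enumerate(zip(questions, answers))
    )
    return (
        "You are an impartial judge. Rate the assistant's answers below "
        "on a scale of 1 to 10 and answer with 'Rating: [[score]]'.\n\n" + turns
    )

prompt = build_judge_prompt(
    ["What is the capital of France?", "And its population?"],
    ["Paris.", "Roughly 2 million in the city proper."],
)
```

The judge model's rating is then parsed out of its response and logged as the metric value.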
New tasks
- Add BBH by @clefourrier in #7, @bilgehanertan in #126
- Add AGIEval by @clefourrier in #121
- Adding TinyBench by @clefourrier in #104
- Adding support for Arabic benchmarks : AlGhafa benchmarking suite by @alielfilali01 in #95
- Add mt-bench by @NathanHB in #75
Features
- Extended Tasks! by @clefourrier in #101, @lewtun in #108, @NathanHB in #122, #123
- Added support for launching inference endpoint with different model dtypes by @shaltielshmid in #124
Documentation
- Adding LICENSE by @clefourrier in #86, @NathanHB in #89
- Make it clearer in the README that the leaderboard uses the harness by @clefourrier in #94
Small patches
- Update huggingface-hub for compatibility with datasets 2.18 by @clefourrier in #84
- Tidy up dependency groups by @lewtun in #81
- bump git python by @NathanHB in #90
- Sets a max length for the MATH task by @clefourrier in #83
- Fix parallel data processing bug by @clefourrier in #92
- Change the eos condition for GSM8K by @clefourrier in #85
- Fixing rolling loglikelihood management by @clefourrier in #78
- Fixes input length management for generative evals by @clefourrier in #103
- Reorder addition of instruction in chat template by @clefourrier in #111
- Ensure chat models terminate generation with EOS token by @lewtun in #115
- Fix push details to hub by @NathanHB in #98
- Small fixes to InferenceEndpointModel by @shaltielshmid in #112
- Fix import typo autogptq by @clefourrier in #116
- Fixed the loglikelihood method in inference endpoints models by @clefourrier in #119
- Fix TextGenerationResponse import from hfh by @Wauplin in #129
- Do not use deprecated list_files_info by @Wauplin in #133
- Update test workflow name to 'Tests' by @Wauplin in #134
New Contributors
- @shaltielshmid made their first contribution in #112
- @bilgehanertan made their first contribution in #126
- @Wauplin made their first contribution in #129
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Release Note
This release focuses on customization and personalization: it's now possible to define custom metrics, not just custom tasks; see the README for the full mechanism.
It also includes new tasks and small fixes to improve stability. We chose to split community tasks from the main library source to make maintenance easier.
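
As a flavor of what a sample-level custom metric can look like, here is a minimal sketch; the function name is hypothetical, and the actual registration mechanism is the one described in the README.

```python
# Hypothetical custom metric: exact match after light normalization.
def normalized_exact_match(prediction: str, gold: str) -> float:
    """Returns 1.0 when the prediction matches the gold answer after
    lowercasing and stripping whitespace, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

print(normalized_exact_match(" Paris ", "paris"))  # 1.0
```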
Better community task handling
- New mechanism for evaluation contributions by @clefourrier in #47
- Adding the custom metrics system by @clefourrier in #65
New tasks
- Add GPQA by @clefourrier in #42
- Adding support for Arabic benchmarks : AceGPT benchmarking suite by @alielfilali01 in #44
- IFEval by @clefourrier in #48
Features
- Add an automatic system to compute average for tasks with subtasks by @clefourrier in #41
Small patches
- Typos #27, #28, #29, #30, #34
- Better README #26, #37, #55
- Patch fix to match with config update/simplification in nanotron by @thomwolf in #35
- bump transformers to 4.38 by @NathanHB in #46
- Small fix to be able to use extensions of nanotron configs by @thomwolf in #58
- Remove the eos token override in the Default Config Task by @clefourrier in #54
- Update leaderboard task set by @lewtun in #60
- Fixes wikitext prompts + some patches on tg models by @clefourrier in #64
- Fix unset generation size by @clefourrier in #76
- Update ruff by @clefourrier in #71
- Relax sentencepiece version by @lewtun in #74
- Better chat template system by @clefourrier in #38
✨ Community Contributions
- @ledrui made their first contribution in #26
- @alielfilali01 made their first contribution in #44
- @lewtun made their first contribution in #55
Full Changelog: v0.1.1...v0.2.0
v0.1.1
v0.1.0
Init
LightEval 🌤️
A lightweight LLM evaluation suite
Context
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
We're releasing it to the community in the spirit of building in the open.
Note that it is still very much in its early days, so don't expect 100% stability ^^'
In case of problems or questions, feel free to open an issue!
Full Changelog: https://github.com/huggingface/lighteval/commits/v0.1