Add new metric functions: LLMGEvalScore and ChatLLMGEvalScore #125

Draft
m-ast wants to merge 9 commits into main

Conversation

@m-ast commented on Jan 31, 2025

Summary

  • Add two metric functions: LLMGEvalScore and ChatLLMGEvalScore
  • These two metrics calculate a weighted average score for evaluating lm_output (see the sketch just below this list).
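For context, a G-Eval-style weighted average is typically computed by turning the evaluator LM's log probabilities for each candidate score into weights. The following is a minimal sketch of that idea only, not the code in this PR; the helper name and the 1-5 score range are illustrative assumptions.

```python
import math


def weighted_score_from_logprobs(score_logprobs: dict[int, float]) -> float:
    """Weighted-average score in the G-Eval style (illustrative helper).

    `score_logprobs` maps each candidate score (e.g. 1..5) to the log
    probability the evaluator LM assigned to that score token.
    """
    # Convert log probabilities to probabilities and renormalize over the
    # candidate scores so the weights sum to 1.
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(score * p / total for score, p in probs.items())


# Example: the evaluator puts most of its mass on score 4.
print(weighted_score_from_logprobs({1: -5.0, 2: -3.2, 3: -1.5, 4: -0.4, 5: -2.0}))
```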

Implementation

  • Create flexeval/core/metric/llm_geval_score.py
    • Implement LLMGEvalScore and ChatLLMGEvalScore, which inherit from Metric
    • Note that these metrics can currently be used only with HuggingFaceLM and VLLM, since those are the classes that implement batch_compute_log_probs. I plan to add support for OpenAIChatAPI and OpenAIChatBatchAPI afterwards.
  • Create tests for the script (a rough sketch of the overall flow follows this list).
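The overall flow is roughly: for each lm_output, build a judge prompt, ask the evaluator LM for the log probability of each candidate score, and return the probability-weighted average. The sketch below illustrates that flow only; the class name, fields, method names, and the batch_compute_log_probs signature shown here are assumptions and do not reflect the actual flexeval interfaces in this PR.

```python
import math
from dataclasses import dataclass
from typing import Protocol


class LogProbLanguageModel(Protocol):
    """Stand-in for the subset of the LM interface these metrics need.

    Only HuggingFaceLM and VLLM currently implement batch_compute_log_probs;
    the exact signature here is an assumption for illustration.
    """

    def batch_compute_log_probs(
        self, prompts: list[str], completions: list[str]
    ) -> list[float]: ...


@dataclass
class LLMGEvalScoreSketch:
    """Rough illustration of the idea behind LLMGEvalScore (not the PR's code)."""

    language_model: LogProbLanguageModel
    valid_scores: tuple[int, ...] = (1, 2, 3, 4, 5)

    def score_output(self, judge_prompt: str) -> float:
        # Query the log probability of each candidate score token given the
        # same judge prompt, then renormalize into weights.
        prompts = [judge_prompt] * len(self.valid_scores)
        completions = [str(score) for score in self.valid_scores]
        log_probs = self.language_model.batch_compute_log_probs(prompts, completions)
        probs = [math.exp(lp) for lp in log_probs]
        total = sum(probs)
        return sum(score * p / total for score, p in zip(self.valid_scores, probs))
```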

@m-ast self-assigned this on Jan 31, 2025