MILU dataset from AI4Bharat for Indic LLM eval #2482

Status: Open. Wants to merge 1 commit into `main`.
109 changes: 109 additions & 0 deletions lm_eval/tasks/milu/README.md
# MILU

**Original GitHub Repo:** [https://github.com/AI4Bharat/MILU](https://github.com/AI4Bharat/MILU)

### Paper

Title: `MILU: A Multi-task Indic Language Understanding Benchmark`

Abstract: `Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi-task Indic Language Understanding Benchmark, a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 42 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 42 LLMs and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 72 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high-resource languages as compared to low-resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts and Humanities, Law and Governance compared to general fields like STEM. To the best of our knowledge, MILU is the first-of-its-kind benchmark focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts will be made publicly available to foster open research.`


### Citation

```bibtex
@article{verma2024milu,
  title   = {MILU: A Multi-task Indic Language Understanding Benchmark},
  author  = {Sshubam Verma and Mohammed Safi Ur Rahman Khan and Vishwajeet Kumar and Rudra Murthy and Jaydeep Sen},
  year    = {2024},
  journal = {arXiv preprint arXiv:2411.02538}
}
```

## Usage

##### Prerequisites

- Python 3.7+
- `lm-eval-harness` library
- HuggingFace Transformers
- vLLM (optional, for faster inference)

1. Clone the MILU repository and install its dependencies:

```bash
git clone --depth 1 https://github.com/AI4Bharat/MILU.git
cd MILU
pip install -e .
```

2. Request access to the HuggingFace 🤗 dataset [here](https://huggingface.co/datasets/ai4bharat/MILU).

3. Set up your environment variables:

```bash
export HF_HOME=/path/to/HF_CACHE/if/needed
export HF_TOKEN=YOUR_HUGGINGFACE_TOKEN
```
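
Before launching a full run, you can optionally confirm that your token grants access to the gated dataset. A minimal sketch, assuming the standard `datasets` API and the per-language configs used by this task (here `English`):

```python
from datasets import load_dataset

# Each MILU language is a separate dataset config; this call fails with an
# authorization error if your HF_TOKEN has not been granted access.
ds = load_dataset("ai4bharat/MILU", "English", split="test", token=True)
print(len(ds), ds[0]["question"])
```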


## Supported Languages
- Bengali
- English
- Gujarati
- Hindi
- Kannada
- Malayalam
- Marathi
- Odia
- Punjabi
- Tamil
- Telugu

## HuggingFace Evaluation

For HuggingFace models, you may use the following sample command:

```bash
lm_eval --model hf \
    --model_args 'pretrained=google/gemma-2-27b-it,temperature=0.0,top_p=1.0,parallelize=True' \
    --tasks milu \
    --batch_size auto:40 \
    --log_samples \
    --output_path $EVAL_OUTPUT_PATH \
    --max_batch_size 64 \
    --num_fewshot 5 \
    --apply_chat_template
```

## vLLM Evaluation

For vLLM-compatible models, use the following command:

```bash
lm_eval --model vllm \
    --model_args "pretrained=meta-llama/Llama-3.2-3B,tensor_parallel_size=$N_GPUS" \
    --gen_kwargs 'temperature=0.0,top_p=1.0' \
    --tasks milu \
    --batch_size auto \
    --log_samples \
    --output_path $EVAL_OUTPUT_PATH
```

## Single Language Evaluation

To evaluate your model on a specific language, modify the `--tasks` parameter:

```bash
--tasks milu_English
```

Replace `English` with any of the supported languages listed above (e.g., `Odia`, `Hindi`), as shown in the example below.
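
For example, a complete single-language run with a HuggingFace model (adapting the command above) might look like:

```bash
lm_eval --model hf \
    --model_args 'pretrained=google/gemma-2-27b-it,parallelize=True' \
    --tasks milu_Hindi \
    --batch_size auto \
    --num_fewshot 5 \
    --apply_chat_template \
    --output_path $EVAL_OUTPUT_PATH
```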

### Evaluation Tips & Observations

1. Make sure to use `--apply_chat_template` for instruction-tuned models so the prompt is formatted correctly.
2. vLLM generally works better with Llama models, while Gemma models work better with the HuggingFace backend.
3. If vLLM encounters out-of-memory errors, try reducing `gpu_memory_utilization`; if that does not help, switch to the HuggingFace backend.
4. For HuggingFace, use `--batch_size=auto:<n_batch_resize_tries>` to re-select the largest fitting batch size up to `<n_batch_resize_tries>` times.
5. When using vLLM, pass generation kwargs via the `--gen_kwargs` flag; for HuggingFace, include them in `--model_args` (see the sketch below).
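
For example, these two invocations pin greedy decoding on each backend (a minimal sketch based on the commands above):

```bash
# vLLM: generation kwargs go in --gen_kwargs
lm_eval --model vllm \
    --model_args "pretrained=meta-llama/Llama-3.2-3B" \
    --gen_kwargs 'temperature=0.0,top_p=1.0' \
    --tasks milu

# HuggingFace: generation kwargs go in --model_args
lm_eval --model hf \
    --model_args 'pretrained=google/gemma-2-27b-it,temperature=0.0,top_p=1.0' \
    --tasks milu
```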
17 changes: 17 additions & 0 deletions lm_eval/tasks/milu/_default_template_yaml
dataset_path: ai4bharat/MILU
dataset_kwargs:
  token: true
output_type: multiple_choice
test_split: test
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
doc_to_text: !function utils_milu.doc_to_text
doc_to_target: !function utils_milu.doc_to_target
doc_to_choice: "{{[option1, option2, option3, option4]}}"
metadata:
  version: 0.0
19 changes: 19 additions & 0 deletions lm_eval/tasks/milu/_milu.yaml
group: milu
task:
  - milu_English
  - milu_Bengali
  - milu_Hindi
  - milu_Tamil
  - milu_Telugu
  - milu_Malayalam
  - milu_Kannada
  - milu_Marathi
  - milu_Gujarati
  - milu_Punjabi
  - milu_Odia
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0
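
For intuition, `weight_by_size: True` makes the group score a micro-average: each language's accuracy contributes in proportion to its number of test documents, rather than all languages counting equally. A toy illustration with hypothetical sizes:

```python
# Hypothetical per-language results: (accuracy, number of test docs)
results = {"milu_English": (0.70, 2000), "milu_Odia": (0.40, 500)}

weighted = sum(acc * n for acc, n in results.values()) / sum(
    n for _, n in results.values()
)
print(weighted)  # 0.64, vs. an unweighted mean of 0.55
```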
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Bengali.yaml
dataset_name: Bengali
include: _default_template_yaml
tag: milu-core
task: milu_Bengali
task_alias: milu_Bengali
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_English.yaml
dataset_name: English
include: _default_template_yaml
tag: milu-core
task: milu_English
task_alias: milu_English
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Gujarati.yaml
dataset_name: Gujarati
include: _default_template_yaml
tag: milu-core
task: milu_Gujarati
task_alias: milu_Gujarati
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Hindi.yaml
dataset_name: Hindi
include: _default_template_yaml
tag: milu-core
task: milu_Hindi
task_alias: milu_Hindi
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Kannada.yaml
dataset_name: Kannada
include: _default_template_yaml
tag: milu-core
task: milu_Kannada
task_alias: milu_Kannada
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Malayalam.yaml
dataset_name: Malayalam
include: _default_template_yaml
tag: milu-core
task: milu_Malayalam
task_alias: milu_Malayalam
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Marathi.yaml
dataset_name: Marathi
include: _default_template_yaml
tag: milu-core
task: milu_Marathi
task_alias: milu_Marathi
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Odia.yaml
dataset_name: Odia
include: _default_template_yaml
tag: milu-core
task: milu_Odia
task_alias: milu_Odia
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Punjabi.yaml
dataset_name: Punjabi
include: _default_template_yaml
tag: milu-core
task: milu_Punjabi
task_alias: milu_Punjabi
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Tamil.yaml
dataset_name: Tamil
include: _default_template_yaml
tag: milu-core
task: milu_Tamil
task_alias: milu_Tamil
5 changes: 5 additions & 0 deletions lm_eval/tasks/milu/milu_Telugu.yaml
dataset_name: Telugu
include: _default_template_yaml
tag: milu-core
task: milu_Telugu
task_alias: milu_Telugu
34 changes: 34 additions & 0 deletions lm_eval/tasks/milu/utils_milu.py
def doc_to_text(doc) -> str:
    """
    Formats a document as:

    Question: <question>
    Choices:
    A. <option1>
    B. <option2>
    C. <option3>
    D. <option4>
    Answer:
    """
    # Map answer letters to the four option fields in the dataset.
    option_choices = {
        "A": doc["option1"],
        "B": doc["option2"],
        "C": doc["option3"],
        "D": doc["option4"],
    }

    prompt = "Question: " + doc["question"] + "\nChoices:\n"
    for letter, option in option_choices.items():
        prompt += f"{letter}. {option}\n"
    prompt += "Answer:"

    return prompt


def doc_to_target(doc) -> int:
    """
    Returns the zero-based index of the correct answer among the four choices.
    """
    # `target` is stored as e.g. "option2"; map it to index 1.
    target = doc["target"]
    option_number = ["1", "2", "3", "4"].index(target.split("option")[1])

    return option_number
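
For reference, a quick check of these helpers on a hypothetical document (the field names match the task YAML; the values are invented):

```python
sample_doc = {
    "question": "Which river is known as the 'Sorrow of Bengal'?",
    "option1": "Ganga",
    "option2": "Damodar",
    "option3": "Kaveri",
    "option4": "Godavari",
    "target": "option2",
}

print(doc_to_text(sample_doc))
# Question: Which river is known as the 'Sorrow of Bengal'?
# Choices:
# A. Ganga
# B. Damodar
# C. Kaveri
# D. Godavari
# Answer:

print(doc_to_target(sample_doc))  # 1 (zero-based index of "option2")
```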