add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench #2521

Open
wants to merge 4 commits into
base: main
253 changes: 125 additions & 128 deletions lm_eval/tasks/README.md

Large diffs are not rendered by default.

65 changes: 65 additions & 0 deletions lm_eval/tasks/darija_bench/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
# DarijaBench

### Paper

Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)

DarijaBench is a comprehensive evaluation dataset tailored for Moroccan Darija. It covers core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)), and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)). It also adds a new transliteration task, based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset, that converts between Darija (written in Arabic letters) and Arabizi (written in Latin letters).


Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)


### Citation

```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```

### Groups and Tasks

#### Groups

* `darija_sentiment`: evaluates all Darija sentiment analysis tasks.
* `darija_summarization`: evaluates the Darija summarization task.
* `darija_translation`: evaluates all Darija translation tasks.
* `darija_transliteration`: evaluates the Darija transliteration task.

#### Tasks

* `darija_sentiment_mac`: evaluates the Darija sentiment analysis task from the [MAC](https://github.com/LeMGarouani/MAC) dataset.
* `darija_sentiment_myc`: evaluates the Darija sentiment analysis task from the [MYC](https://github.com/MouadJb/MYC) dataset.
* `darija_sentiment_msac`: evaluates the Darija sentiment analysis task from the [MSAC](https://hal.science/hal-03670346/document) dataset.
* `darija_sentiment_msda`: evaluates the Darija sentiment analysis task from the [MSDA](https://cc.um6p.ma/cc_datasets) dataset.
* `darija_sentiment_electrom`: evaluates the Darija sentiment analysis task from the [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016) dataset.
* `darija_summarization_task`: evaluates the Darija summarization task from the [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization) corpus.
* `darija_translation_doda`: evaluates the Darija translation task from the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.
* `darija_translation_flores`: evaluates the Darija translation task from the [FLORES+](https://github.com/openlanguagedata/flores) dataset.
* `darija_translation_madar`: evaluates the Darija translation task from the [MADAR](https://sites.google.com/nyu.edu/madar/) dataset.
* `darija_translation_seed`: evaluates the Darija translation task from the [NLLB-Seed](https://github.com/openlanguagedata/seed) dataset.
* `darija_transliteration_task`: evaluates the Darija transliteration task from the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.

Note: depending on the model, padding and the padding side can affect the results. This library forces left padding by default, so use a batch size of 1 to avoid padding-related discrepancies.
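As a minimal sketch of the batch-size advice above, the tasks can be launched from Python through `lm_eval.simple_evaluate` with `batch_size=1`. The helper names `build_eval_kwargs` and `run` are hypothetical, and the model identifier passed to them is whichever Hugging Face checkpoint you want to evaluate. `lm_eval` is imported lazily so the argument builder can be inspected without the harness installed.

```python
def build_eval_kwargs(pretrained, tasks):
    """Collect keyword arguments for lm_eval.simple_evaluate (hypothetical helper)."""
    return {
        "model": "hf",
        "model_args": f"pretrained={pretrained}",
        "tasks": list(tasks),
        # A batch size of 1 means no padding is needed at all.
        "batch_size": 1,
    }


def run(pretrained, tasks):
    """Run the given task or group names and return the per-task results."""
    # Imported here so build_eval_kwargs works without lm_eval installed.
    import lm_eval

    results = lm_eval.simple_evaluate(**build_eval_kwargs(pretrained, tasks))
    return results["results"]
```

Calling `run("<hf-model-id>", ["darija_sentiment", "darija_translation"])` should return the per-task metrics, assuming the harness and the model are available locally.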

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
54 changes: 54 additions & 0 deletions lm_eval/tasks/darija_bench/darija_sentiment/README.md
@@ -0,0 +1,54 @@
# DarijaBench: Sentiment Analysis

### Paper

Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)

DarijaBench is a comprehensive evaluation dataset tailored for Moroccan Darija. It covers core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)), and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)). It also adds a new transliteration task, based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset, that converts between Darija (written in Arabic letters) and Arabizi (written in Latin letters).


Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)


### Citation

```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```

### Groups and Tasks

#### Groups

* `darija_sentiment`: evaluates all Darija sentiment analysis tasks.

#### Tasks

* `darija_sentiment_mac`: evaluates the Darija sentiment analysis task from the [MAC](https://github.com/LeMGarouani/MAC) dataset.
* `darija_sentiment_myc`: evaluates the Darija sentiment analysis task from the [MYC](https://github.com/MouadJb/MYC) dataset.
* `darija_sentiment_msac`: evaluates the Darija sentiment analysis task from the [MSAC](https://hal.science/hal-03670346/document) dataset.
* `darija_sentiment_msda`: evaluates the Darija sentiment analysis task from the [MSDA](https://cc.um6p.ma/cc_datasets) dataset.
* `darija_sentiment_electrom`: evaluates the Darija sentiment analysis task from the [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016) dataset.

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
@@ -0,0 +1,9 @@
group: darija_sentiment
group_alias: Sentiment_Analysis
task:
  - darija_sentiment_tasks
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0
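The `weight_by_size: True` setting above means the group score is a size-weighted (micro) average rather than a plain mean of the per-task accuracies. A small sketch of that arithmetic follows, using made-up accuracies and test-set sizes. The function name `size_weighted_mean` is hypothetical, and the harness's own aggregation code may differ in detail.

```python
def size_weighted_mean(scores, sizes):
    """Average per-task scores, weighting each by its number of documents."""
    total = sum(sizes)
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Invented numbers for illustration only: two subtasks of different sizes.
scores = [0.75, 0.5]
sizes = [300, 100]

weighted = size_weighted_mean(scores, sizes)  # 0.6875
unweighted = sum(scores) / len(scores)        # 0.625
```

The larger subtask pulls the weighted score toward its own accuracy, which is why the two numbers differ.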
@@ -0,0 +1,7 @@
test_split: electro_maroc
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_electrom"
"task_alias": "Electro Maroc"
doc_to_choice: !function utils.doc_to_choice_2
@@ -0,0 +1,7 @@
test_split: mac
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_mac"
"task_alias": "MAC"
doc_to_choice: !function utils.doc_to_choice_3
@@ -0,0 +1,7 @@
test_split: msac
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_msac"
"task_alias": "MSAC"
doc_to_choice: !function utils.doc_to_choice_2
@@ -0,0 +1,7 @@
test_split: msda
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_msda"
"task_alias": "MSDA"
doc_to_choice: !function utils.doc_to_choice_3
@@ -0,0 +1,7 @@
test_split: myc
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_myc"
"task_alias": "MYC"
doc_to_choice: !function utils.doc_to_choice_2
@@ -0,0 +1,13 @@
dataset_path: MBZUAI-Paris/DarijaBench
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_choice: !function utils.doc_to_choice_3
doc_to_target: !function utils.doc_to_target
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
dataset_kwargs:
  trust_remote_code: true
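With `output_type: multiple_choice`, each answer letter is scored by its log-likelihood as a continuation of the prompt, and `acc` counts a document as correct when the highest-scoring choice is the gold one. A minimal sketch of that rule is shown here, with invented log-likelihood values and a hypothetical helper name; it illustrates the idea rather than the harness's actual code path.

```python
def multiple_choice_acc(loglikelihoods, gold_index):
    """Return 1.0 if the most likely choice is the gold one, else 0.0."""
    predicted = max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)
    return 1.0 if predicted == gold_index else 0.0


# Invented log-likelihoods for the choices A, B, C of one document.
example_scores = [-2.3, -0.4, -1.9]
hit = multiple_choice_acc(example_scores, gold_index=1)   # choice B wins
miss = multiple_choice_acc(example_scores, gold_index=0)  # gold was A
```

The `aggregation: mean` entry then averages these per-document values over the test split.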
18 changes: 18 additions & 0 deletions lm_eval/tasks/darija_bench/darija_sentiment/utils.py
@@ -0,0 +1,18 @@
from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter

# Answer letters shown to the model; two-class datasets use only the first two.
alpha = ['A', 'B', 'C']
# Gold label -> index into `alpha` (negative = A, positive = B, no sentiment = C).
out_dic = {"ايجابي": 1, "سلبي": 0, "ماكينش إحساس": 2}


def doc_to_text(doc):
    # Rewrite the dash-prefixed options in the user prompt as lettered choices.
    return (
        doc["messages"][0]["content"]
        .replace('-سلبي', 'A. سلبي')
        .replace('-ايجابي', 'B. ايجابي')
        .replace('-ماكينش إحساس', 'C. ماكينش إحساس\nThe answer should be strictly one letter of the following: A, B, C.')
    )


def doc_to_choice_3(doc):
    return alpha


def doc_to_choice_2(doc):
    return alpha[:2]


def doc_to_target(doc):
    # Map the gold label in the assistant message to its answer letter.
    return alpha[out_dic[doc["messages"][1]["content"]]]
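To make the mapping above concrete, here is a self-contained worked example on a made-up document. The document content, the `role` keys and the helper names `relabel_prompt` and `target_letter` are assumptions for illustration; only the replacement strings and the label-to-letter mapping are taken from the file above.

```python
ALPHA = ["A", "B", "C"]
LABEL_TO_INDEX = {"سلبي": 0, "ايجابي": 1, "ماكينش إحساس": 2}
INSTRUCTION = "The answer should be strictly one letter of the following: A, B, C."


def relabel_prompt(prompt):
    """Turn the dash-prefixed options into lettered choices."""
    return (
        prompt.replace("-سلبي", "A. سلبي")
        .replace("-ايجابي", "B. ايجابي")
        .replace("-ماكينش إحساس", "C. ماكينش إحساس\n" + INSTRUCTION)
    )


def target_letter(label):
    """Map a gold label to the letter the model is expected to pick."""
    return ALPHA[LABEL_TO_INDEX[label]]


# Hypothetical document in the chat-style layout the functions above read.
doc = {
    "messages": [
        {"role": "user", "content": "شنو هو الإحساس ديال هاد الجملة؟\n-سلبي\n-ايجابي\n-ماكينش إحساس"},
        {"role": "assistant", "content": "ايجابي"},
    ]
}

prompt = relabel_prompt(doc["messages"][0]["content"])
target = target_letter(doc["messages"][1]["content"])
```

Note that the English instruction is attached to the third option, so a prompt that lists only the two polar labels would be relabeled without it.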

51 changes: 51 additions & 0 deletions lm_eval/tasks/darija_bench/darija_summarization/README.md
@@ -0,0 +1,51 @@
# DarijaBench: Summarization

### Paper

Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)

DarijaBench is a comprehensive evaluation dataset tailored for Moroccan Darija. It covers core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)), and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)). It also adds a new transliteration task, based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset, that converts between Darija (written in Arabic letters) and Arabizi (written in Latin letters).


Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)


### Citation

```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```

### Groups and Tasks

#### Groups

* `darija_summarization`: evaluates the Darija summarization task.

#### Tasks

* `darija_summarization_task`: evaluates the Darija summarization task from the [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization) corpus.


### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
@@ -0,0 +1,8 @@
"include": "summarization_common_yaml"
"tag":
- "darija_summarization_task"
"task": "darija_summarization"
metric_list:
  - metric: !function utils.bert
    aggregation: !function utils.darijabert
    higher_is_better: true
@@ -0,0 +1,30 @@
output_type: generate_until
dataset_path: MBZUAI-Paris/DarijaBench
test_split: marsum
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
metric_list:
  - metric: !function utils.rouge1
  - metric: !function utils.rouge2
  - metric: !function utils.rougeL
  - metric: !function utils.rougeLsum
  - metric: !function utils.bert
  - metric: chrf
generation_kwargs:
  until:
    - "<end_of_turn>"
    - "<eos>"
    - "</s>"
    - "<|end_of_text|>"
    - "<|eot_id|>"
    - "<|endoftext|>"
  do_sample: false
  temperature: 0.0
  max_new_tokens: 128
filter_list:
  - name: "STRIP_ANSWER"
    filter:
      - function: "strip"
repeats: 1
metadata:
  version: 1.0
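The `until` list and the `strip` filter above together determine what text is scored: generation ends at the first stop sequence, and surrounding whitespace is removed. A rough sketch of that combined effect on an already generated string is given below. The function name `truncate_at_stop` is hypothetical, and the harness applies stopping during generation rather than as a post-processing step like this.

```python
STOP_SEQUENCES = [
    "<end_of_turn>",
    "<eos>",
    "</s>",
    "<|end_of_text|>",
    "<|eot_id|>",
    "<|endoftext|>",
]


def truncate_at_stop(text, stops=STOP_SEQUENCES):
    """Cut the text at the earliest stop sequence, then strip whitespace."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


# Invented model output containing two different end-of-sequence markers.
raw_output = "  ملخص قصير <eos> leftover text</s>"
clean_output = truncate_at_stop(raw_output)
```

Listing several markers keeps one config usable across model families that use different end-of-turn tokens.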
@@ -0,0 +1,24 @@
group: darija_summarization
task:
  - darija_summarization_task
metric_list:
  - metric: !function utils.rouge1
    aggregation: !function utils.agg_rouge1
    higher_is_better: true
  - metric: !function utils.rouge2
    aggregation: !function utils.agg_rouge2
    higher_is_better: true
  - metric: !function utils.rougeL
    aggregation: !function utils.agg_rougel
    higher_is_better: true
  - metric: !function utils.rougeLsum
    aggregation: !function utils.agg_rougelsum
    higher_is_better: true
  - metric: !function utils.bert
    aggregation: !function utils.darijabert
    higher_is_better: true
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
metadata:
  version: 1.0
67 changes: 67 additions & 0 deletions lm_eval/tasks/darija_bench/darija_summarization/utils.py
@@ -0,0 +1,67 @@
import evaluate
import datasets
from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter


@register_filter("strip")
class Strip(Filter):
    def __init__(self) -> None:
        """
        Can define custom behavior here, if an individual instantiation of a Filter class should have state.
        """

    def apply(self, resps, docs):
        """
        Assuming each entry of `resps` is a list of model responses, we discard all but the first response.
        """
        return map(lambda r: r[0].strip(), resps)


def doc_to_text(doc):
    # Extend the instruction so it asks for a summary of 30 words.
    doc_text = doc["messages"][0]["content"].replace("لخص هاد المقطع", "لخص هاد المقطع في ٣٠ كلمة")
    return doc_text


def doc_to_target(doc):
    return doc["messages"][1]["content"]


# The per-sample metric functions pass their input through unchanged;
# the actual scores are computed in the aggregation functions below.
def bert(items):
    return items


def rouge1(items):
    return items


def rougeL(items):
    return items


def rouge2(items):
    return items


def rougeLsum(items):
    return items


def Average(lst):
    return sum(lst) / len(lst)


def darijabert(items):
    # BERTScore F1 computed with DarijaBERT embeddings, averaged over all samples.
    bert_model = 'SI2M-Lab/DarijaBERT'
    bert_score = evaluate.load("bertscore")
    predictions, references = zip(*items)
    scores = bert_score.compute(predictions=predictions, references=references, model_type=bert_model, num_layers=12)
    return Average(scores['f1'])


def agg_rougelsum(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rougeLsum"]


def agg_rouge1(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rouge1"]


def agg_rouge2(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rouge2"]


def agg_rougel(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rougeL"]
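The file above splits each metric into a pass-through function and an aggregation function that scores all collected pairs at once. The sketch below shows the same pattern with a deliberately simplified stand-in metric, a unigram-overlap F1 over whitespace tokens. It is not the ROUGE implementation from `evaluate`, which has its own tokenization and options. All names are hypothetical, and the order of each pair is assumed to match the unpacking used above.

```python
from collections import Counter


def passthrough(items):
    """Per-sample step: hand the pair on unchanged for later aggregation."""
    return items


def unigram_f1(prediction, reference):
    """F1 of the unigram overlap between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def agg_unigram_f1(items):
    """Aggregation step: score every collected pair and average the results."""
    predictions, references = zip(*items)
    scores = [unigram_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


# Invented pairs standing in for the values collected over a test split.
collected = [passthrough(("a b c", "a b c")), passthrough(("a b", "a c"))]
corpus_score = agg_unigram_f1(collected)
```

Deferring the computation to the aggregation step means the metric is loaded once per evaluation run instead of once per document.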