Add MLQA (#2622)
* Add MLQA
* add mlqa_common_yaml

* add 49 tests of mlqa family

* update tasks/README.md

---------

* fix: mlqa ast error

* nit: removed .yaml ext from template_yaml

* nit changes: minor modifications generate_tasks.py

* deleted    lm_eval/tasks/mlqa/mlqa_common_yaml.yaml

* tests updated

* nit
KahnSvaer authored Jan 15, 2025
1 parent 5db23e2 commit e86cece
Showing 54 changed files with 582 additions and 0 deletions.
1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -80,6 +80,7 @@
| medqa | Multiple choice question answering based on the United States Medical License Exams. | |
| [mgsm](mgsm/README.md) | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| [minerva_math](minerva_math/README.md) | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English |
| [mlqa](mlqa/README.md) | MultiLingual Question Answering benchmark dataset for evaluating cross-lingual question answering performance. | English, Arabic, German, Spanish, Hindi, Vietnamese, Simplified Chinese |
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
101 changes: 101 additions & 0 deletions lm_eval/tasks/mlqa/README.md
@@ -0,0 +1,101 @@
# MLQA

### Paper

Title: `MLQA: Evaluating Cross-lingual Extractive Question Answering`

Abstract: `https://arxiv.org/abs/1910.07475`

MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances per language (12K in English) in SQuAD format in seven languages: English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.

Homepage: `https://github.com/facebookresearch/MLQA`


### Citation

```
@misc{lewis2020mlqaevaluatingcrosslingualextractive,
title={MLQA: Evaluating Cross-lingual Extractive Question Answering},
author={Patrick Lewis and Barlas Oğuz and Ruty Rinott and Sebastian Riedel and Holger Schwenk},
year={2020},
eprint={1910.07475},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/1910.07475},
}
```

### Groups, Tags, and Tasks

#### Groups

* Not part of a group yet

#### Tasks

Tasks are of the form `mlqa_<context-lang>_<question-lang>.yaml`:
* `mlqa_ar_ar.yaml`
* `mlqa_ar_de.yaml`
* `mlqa_ar_vi.yaml`
* `mlqa_ar_zh.yaml`
* `mlqa_ar_en.yaml`
* `mlqa_ar_es.yaml`
* `mlqa_ar_hi.yaml`
* `mlqa_de_ar.yaml`
* `mlqa_de_de.yaml`
* `mlqa_de_vi.yaml`
* `mlqa_de_zh.yaml`
* `mlqa_de_en.yaml`
* `mlqa_de_es.yaml`
* `mlqa_de_hi.yaml`
* `mlqa_vi_ar.yaml`
* `mlqa_vi_de.yaml`
* `mlqa_vi_vi.yaml`
* `mlqa_vi_zh.yaml`
* `mlqa_vi_en.yaml`
* `mlqa_vi_es.yaml`
* `mlqa_vi_hi.yaml`
* `mlqa_zh_ar.yaml`
* `mlqa_zh_de.yaml`
* `mlqa_zh_vi.yaml`
* `mlqa_zh_zh.yaml`
* `mlqa_zh_en.yaml`
* `mlqa_zh_es.yaml`
* `mlqa_zh_hi.yaml`
* `mlqa_en_ar.yaml`
* `mlqa_en_de.yaml`
* `mlqa_en_vi.yaml`
* `mlqa_en_zh.yaml`
* `mlqa_en_en.yaml`
* `mlqa_en_es.yaml`
* `mlqa_en_hi.yaml`
* `mlqa_es_ar.yaml`
* `mlqa_es_de.yaml`
* `mlqa_es_vi.yaml`
* `mlqa_es_zh.yaml`
* `mlqa_es_en.yaml`
* `mlqa_es_es.yaml`
* `mlqa_es_hi.yaml`
* `mlqa_hi_ar.yaml`
* `mlqa_hi_de.yaml`
* `mlqa_hi_vi.yaml`
* `mlqa_hi_zh.yaml`
* `mlqa_hi_en.yaml`
* `mlqa_hi_es.yaml`
* `mlqa_hi_hi.yaml`
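
The 49 entries above are every pairing of the seven language codes with themselves. A minimal sketch of that enumeration, assuming the `mlqa_<context-lang>_<question-lang>` naming used in the list; `LANGS` and `task_names` are illustrative names and are not part of the harness:

```python
from itertools import product

# Language codes taken from the task list above.
LANGS = ("ar", "de", "vi", "zh", "en", "es", "hi")


def task_names(langs=LANGS):
    """Return every mlqa_<context-lang>_<question-lang> task name."""
    return [f"mlqa_{ctx}_{q}" for ctx, q in product(langs, repeat=2)]


names = task_names()
print(len(names))   # number of context/question language pairs
print(names[:3])    # first few task names
```

A list like this could be joined with commas and passed wherever the harness expects several task names, if all pairs need to be evaluated in one run.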

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
48 changes: 48 additions & 0 deletions lm_eval/tasks/mlqa/generate_tasks.py
@@ -0,0 +1,48 @@
# ruff: noqa: E731, E741
"""
Script to generate task YAMLs for the mlqa dataset.
Based on `tasks/bigbench/generate_tasks.py`.
"""

from datasets import get_dataset_config_names


chosen_subtasks = []

language_dict = {
    "en": "english",
    "es": "spanish",
    "hi": "hindi",
    "vi": "vietnamese",
    "de": "german",
    "ar": "arabic",
    "zh": "chinese",
}


def main() -> None:
    configs = get_dataset_config_names("facebook/mlqa", trust_remote_code=True)
    for config in configs:
        if len(config.split(".")) == 2:
            continue
        else:
            chosen_subtasks.append(config)
    assert len(chosen_subtasks) == 49
    for task in chosen_subtasks:
        file_name = f"{task.replace('.', '_')}.yaml"
        context_lang = file_name.split("_")[1]
        # Not using yaml to avoid tagging issues with !function
        with open(file_name, "w", encoding="utf-8") as f:
            f.write("# Generated by generate_tasks.py\n")

            # Manually writing the YAML-like content inside files to avoid tagging issues
            f.write("include: mlqa_common_yaml\n")
            f.write(f"task: {task.replace('.', '_')}\n")
            f.write(f"dataset_name: {task}\n")
            f.write(
                f"process_results: !function utils.process_results_{context_lang}\n"
            )


if __name__ == "__main__":
    main()
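
To illustrate the filtering and naming steps in `main()` without touching the Hub, here is a small offline sketch. The entries in `sample_configs` are assumed examples of what `get_dataset_config_names` might return (the `mlqa-translate-train.ar` name in particular is an assumption), and `select_subtasks` and `describe` are hypothetical helpers that do not exist in the script:

```python
def select_subtasks(configs):
    """Keep configs that do not split into exactly two dot-separated parts."""
    return [c for c in configs if len(c.split(".")) != 2]


def describe(config):
    """Derive the task name, file name and context language for a config."""
    task = config.replace(".", "_")
    file_name = f"{task}.yaml"
    # "mlqa_ar_de.yaml" -> ["mlqa", "ar", "de.yaml"]; index 1 is the context language
    context_lang = file_name.split("_")[1]
    return task, file_name, context_lang


# Assumed sample of config names; the real list comes from the dataset repository.
sample_configs = ["mlqa-translate-train.ar", "mlqa.ar.de", "mlqa.en.en"]

for config in select_subtasks(sample_configs):
    print(describe(config))
```

The two-part check is what drops the non `mlqa.<context>.<question>` configs, which is why the script can then assert that exactly 49 subtasks remain.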
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_ar
dataset_name: mlqa.ar.ar
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_de
dataset_name: mlqa.ar.de
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_en
dataset_name: mlqa.ar.en
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_es
dataset_name: mlqa.ar.es
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_hi
dataset_name: mlqa.ar.hi
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_vi
dataset_name: mlqa.ar.vi
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_zh
dataset_name: mlqa.ar.zh
process_results: !function utils.process_results_ar
22 changes: 22 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_common_yaml
@@ -0,0 +1,22 @@
dataset_path: facebook/mlqa
dataset_kwargs:
  trust_remote_code: true
test_split: test
validation_split: validation
output_type: generate_until
doc_to_text: "Context: {{context}}\n\nQuestion: {{question}}\n\nAnswer:"
doc_to_target: "{{answers}}"
process_docs: !function utils.process_docs
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: f1
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  until:
    - "\n"
  do_sample: false
metadata:
  version: 0.0
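
The `exact_match` and `f1` entries in `metric_list` appear to be produced by the `process_results_*` functions in `utils.py`, which this excerpt does not show. As a rough illustration only, here is a generic SQuAD-style sketch of the two scores using lowercasing and whitespace tokenization. The real functions presumably apply language-specific normalization, so `normalize`, `exact_match` and `f1_score` below are hypothetical names and not the commit's implementation:

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (a deliberate simplification)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(f1_score("the city of Paris", "Paris"))
```

Whitespace tokenization is a poor fit for Chinese, which is one reason a per-language `process_results` function is plausible; this sketch does not attempt that.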
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_ar
dataset_name: mlqa.de.ar
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_de
dataset_name: mlqa.de.de
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_en
dataset_name: mlqa.de.en
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_es
dataset_name: mlqa.de.es
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_hi
dataset_name: mlqa.de.hi
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_vi
dataset_name: mlqa.de.vi
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_zh
dataset_name: mlqa.de.zh
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_ar
dataset_name: mlqa.en.ar
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_de
dataset_name: mlqa.en.de
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_en
dataset_name: mlqa.en.en
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_es
dataset_name: mlqa.en.es
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_hi
dataset_name: mlqa.en.hi
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_vi
dataset_name: mlqa.en.vi
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_zh
dataset_name: mlqa.en.zh
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_ar
dataset_name: mlqa.es.ar
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_de
dataset_name: mlqa.es.de
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_en
dataset_name: mlqa.es.en
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_es
dataset_name: mlqa.es.es
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_hi
dataset_name: mlqa.es.hi
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_vi
dataset_name: mlqa.es.vi
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_zh
dataset_name: mlqa.es.zh
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_ar
dataset_name: mlqa.hi.ar
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_de
dataset_name: mlqa.hi.de
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_en
dataset_name: mlqa.hi.en
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_es
dataset_name: mlqa.hi.es
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_hi
dataset_name: mlqa.hi.hi
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_vi
dataset_name: mlqa.hi.vi
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_zh
dataset_name: mlqa.hi.zh
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_vi_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_vi_ar
dataset_name: mlqa.vi.ar
process_results: !function utils.process_results_vi