
separate category for global_mmlu #2652

Merged
3 commits merged into main on Jan 24, 2025
Conversation

@bzantium (Contributor) commented Jan 23, 2025

Separate the subjects into categories so that few-shot examples come from the same category, at a minimum. This also helps users interpret the results better, since a score is reported for each category. I tried separating out each individual subject, but some subjects have very few examples, which makes proper evaluation difficult (multi-GPU evaluation, or even a small batch size, cannot be used).
In addition, I fixed the version for global_mmlu_full as well (it was a typo).

resolved: #2649
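The idea of restricting few-shot examples to the target's category can be sketched as follows. This is an illustrative standalone snippet, not the actual lm-evaluation-harness sampler; the dataset schema (a `"category"` field) and function name are assumptions:

```python
import random
from collections import defaultdict

def sample_fewshot(dataset, category, k=5, seed=42):
    """Pick k few-shot examples drawn only from the given category.

    `dataset` is a list of dicts with a "category" key (illustrative
    schema, not the harness's real one). With one task per category,
    every few-shot example is guaranteed to share the target's category.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for ex in dataset:
        by_category[ex["category"]].append(ex)
    pool = by_category[category]
    return rng.sample(pool, min(k, len(pool)))

# Toy data: two categories of ten questions each.
data = [{"category": "stem", "q": i} for i in range(10)] + \
       [{"category": "humanities", "q": i} for i in range(10)]
shots = sample_fewshot(data, "stem", k=5)
assert all(ex["category"] == "stem" for ex in shots)
```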

@bzantium (Contributor Author)
I evaluated Qwen2.5-1.5B on global_mmlu_en and global_mmlu_ko.

BEFORE:

| Tasks          | Version | Filter | n-shot | Metric | Value | Stderr   |
|----------------|--------:|--------|-------:|--------|------:|----------|
| global_mmlu_en |       0 | none   |      5 | acc    |  0.65 | ± 0.0239 |
| global_mmlu_ko |       0 | none   |      5 | acc    |  0.48 | ± 0.025  |

AFTER:

| Tasks                            | Version | Filter | n-shot | Metric | Value  | Stderr   |
|----------------------------------|--------:|--------|-------:|--------|-------:|----------|
| global_mmlu_en                   |       0 | none   |        | acc    | 0.6450 | ± 0.0238 |
| - global_mmlu_en_business        |       0 | none   |      5 | acc    | 0.6034 | ± 0.0648 |
| - global_mmlu_en_humanities      |       0 | none   |      5 | acc    | 0.7451 | ± 0.0434 |
| - global_mmlu_en_medical         |       0 | none   |      5 | acc    | 0.5833 | ± 0.0833 |
| - global_mmlu_en_other           |       0 | none   |      5 | acc    | 0.6607 | ± 0.0638 |
| - global_mmlu_en_social_sciences |       0 | none   |      5 | acc    | 0.6471 | ± 0.0476 |
| - global_mmlu_en_stem            |       0 | none   |      5 | acc    | 0.5000 | ± 0.0745 |

| Groups         | Version | Filter | n-shot | Metric | Value | Stderr   |
|----------------|--------:|--------|-------:|--------|------:|----------|
| global_mmlu_en |       0 | none   |        | acc    | 0.645 | ± 0.0238 |

| Tasks                            | Version | Filter | n-shot | Metric | Value  | Stderr   |
|----------------------------------|--------:|--------|-------:|--------|-------:|----------|
| global_mmlu_ko                   |       0 | none   |        | acc    | 0.4900 | ± 0.0251 |
| - global_mmlu_ko_business        |       0 | none   |      5 | acc    | 0.5517 | ± 0.0659 |
| - global_mmlu_ko_humanities      |       0 | none   |      5 | acc    | 0.5490 | ± 0.0495 |
| - global_mmlu_ko_medical         |       0 | none   |      5 | acc    | 0.4167 | ± 0.0833 |
| - global_mmlu_ko_other           |       0 | none   |      5 | acc    | 0.4464 | ± 0.0670 |
| - global_mmlu_ko_social_sciences |       0 | none   |      5 | acc    | 0.4510 | ± 0.0495 |
| - global_mmlu_ko_stem            |       0 | none   |      5 | acc    | 0.4783 | ± 0.0745 |

| Groups         | Version | Filter | n-shot | Metric | Value | Stderr   |
|----------------|--------:|--------|-------:|--------|------:|----------|
| global_mmlu_ko |       0 | none   |        | acc    | 0.49  | ± 0.0251 |
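For reference, the group-level `acc` rows above aggregate the per-category scores. A minimal sketch of sample-weighted (micro) averaging, using hypothetical per-category example counts (the table does not show the real counts, and the harness's actual aggregation mode may differ):

```python
# Hypothetical (accuracy, n_examples) pairs per category; counts are
# made up for illustration, not taken from Global-MMLU-Lite.
scores = {
    "business":   (0.6034, 58),
    "humanities": (0.7451, 102),
}

def micro_average(scores):
    """Sample-weighted mean accuracy across categories."""
    total = sum(n for _, n in scores.values())
    return sum(acc * n for acc, n in scores.values()) / total

avg = micro_average(scores)
```

With more (or differently sized) categories, the group score shifts toward the categories with more examples, which is why it need not equal the plain mean of the per-category values.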

@bzantium bzantium self-assigned this Jan 23, 2025
@baberabb (Contributor)
LGTM! Just to confirm, they use ["A", "B", ...] even for languages with a different script?

@bzantium (Contributor Author) commented Jan 24, 2025

> LGTM! Just to confirm, they use ["A", "B", ...] even for languages with a different script?
Thanks for checking! Following the previous commit, I kept the same labels; I don't think changing them per language would significantly affect the results.

cc: @baberabb

@baberabb baberabb merged commit 5c006ed into main Jan 24, 2025
7 of 8 checks passed
@baberabb baberabb deleted the feature/#2649 branch January 24, 2025 16:00
Successfully merging this pull request may close these issues.

few-shot examples are not presented by their subjects for Global-MMLU-Lite (global_mmlu_{language})