
separate category for global_mmlu #2652

Merged
3 commits merged into main on Jan 24, 2025
Conversation

@bzantium (Contributor) commented Jan 23, 2025

Separate the subjects into categories so that few-shot examples come from the same category, at a minimum. This also helps users interpret the results better, since a score is reported for each category. I tried separating out each individual subject, but some subjects have very few examples, which makes proper evaluation difficult (multi-GPU evaluation, or even a small batch size, cannot be used).
In addition, I fixed the version for global_mmlu_full as well (it was a typo).

resolved: #2649
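The idea of restricting few-shot examples to the target's category can be sketched as follows. This is an illustrative standalone snippet, not the actual lm-evaluation-harness sampler; the dataset schema (a `"category"` field) and function name are assumptions:

```python
import random
from collections import defaultdict

def sample_fewshot(dataset, category, k=5, seed=42):
    """Pick k few-shot examples drawn only from the given category.

    `dataset` is a list of dicts with a "category" key (illustrative
    schema, not the harness's real one). With one task per category,
    every few-shot example is guaranteed to share the target's category.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for ex in dataset:
        by_category[ex["category"]].append(ex)
    pool = by_category[category]
    return rng.sample(pool, min(k, len(pool)))

# Toy data: two categories of ten questions each.
data = [{"category": "stem", "q": i} for i in range(10)] + \
       [{"category": "humanities", "q": i} for i in range(10)]
shots = sample_fewshot(data, "stem", k=5)
assert all(ex["category"] == "stem" for ex in shots)
```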

@bzantium (Contributor Author)
I evaluated Qwen2.5-1.5B on global_mmlu_en and global_mmlu_ko.

BEFORE:

| Tasks          | Version | Filter | n-shot | Metric | Value | Stderr   |
|----------------|--------:|--------|-------:|--------|------:|----------|
| global_mmlu_en |       0 | none   |      5 | acc    |  0.65 | ± 0.0239 |
| global_mmlu_ko |       0 | none   |      5 | acc    |  0.48 | ± 0.025  |

AFTER:

| Tasks                            | Version | Filter | n-shot | Metric | Value  | Stderr   |
|----------------------------------|--------:|--------|-------:|--------|-------:|----------|
| global_mmlu_en                   |       0 | none   |        | acc    | 0.6450 | ± 0.0238 |
| - global_mmlu_en_business        |       0 | none   |      5 | acc    | 0.6034 | ± 0.0648 |
| - global_mmlu_en_humanities      |       0 | none   |      5 | acc    | 0.7451 | ± 0.0434 |
| - global_mmlu_en_medical         |       0 | none   |      5 | acc    | 0.5833 | ± 0.0833 |
| - global_mmlu_en_other           |       0 | none   |      5 | acc    | 0.6607 | ± 0.0638 |
| - global_mmlu_en_social_sciences |       0 | none   |      5 | acc    | 0.6471 | ± 0.0476 |
| - global_mmlu_en_stem            |       0 | none   |      5 | acc    | 0.5000 | ± 0.0745 |

| Groups         | Version | Filter | n-shot | Metric | Value | Stderr   |
|----------------|--------:|--------|-------:|--------|------:|----------|
| global_mmlu_en |       0 | none   |        | acc    | 0.645 | ± 0.0238 |

| Tasks                            | Version | Filter | n-shot | Metric | Value  | Stderr   |
|----------------------------------|--------:|--------|-------:|--------|-------:|----------|
| global_mmlu_ko                   |       0 | none   |        | acc    | 0.4900 | ± 0.0251 |
| - global_mmlu_ko_business        |       0 | none   |      5 | acc    | 0.5517 | ± 0.0659 |
| - global_mmlu_ko_humanities      |       0 | none   |      5 | acc    | 0.5490 | ± 0.0495 |
| - global_mmlu_ko_medical         |       0 | none   |      5 | acc    | 0.4167 | ± 0.0833 |
| - global_mmlu_ko_other           |       0 | none   |      5 | acc    | 0.4464 | ± 0.0670 |
| - global_mmlu_ko_social_sciences |       0 | none   |      5 | acc    | 0.4510 | ± 0.0495 |
| - global_mmlu_ko_stem            |       0 | none   |      5 | acc    | 0.4783 | ± 0.0745 |

| Groups         | Version | Filter | n-shot | Metric | Value | Stderr   |
|----------------|--------:|--------|-------:|--------|------:|----------|
| global_mmlu_ko |       0 | none   |        | acc    | 0.49  | ± 0.0251 |
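For reference, the group-level `acc` rows above aggregate the per-category scores. A minimal sketch of sample-weighted (micro) averaging, using hypothetical per-category example counts (the table does not show the real counts, and the harness's actual aggregation mode may differ):

```python
# Hypothetical (accuracy, n_examples) pairs per category; counts are
# made up for illustration, not taken from Global-MMLU-Lite.
scores = {
    "business":   (0.6034, 58),
    "humanities": (0.7451, 102),
}

def micro_average(scores):
    """Sample-weighted mean accuracy across categories."""
    total = sum(n for _, n in scores.values())
    return sum(acc * n for acc, n in scores.values()) / total

avg = micro_average(scores)
```

With more (or differently sized) categories, the group score shifts toward the categories with more examples, which is why it need not equal the plain mean of the per-category values.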

@bzantium bzantium self-assigned this Jan 23, 2025
@baberabb (Contributor)
LGTM! Just to confirm, they use ["A", "B", ...] even for languages with a different script?

@bzantium (Contributor Author) commented Jan 24, 2025

> LGTM! Just to confirm, they use ["A", "B", ...] even for languages with a different script?
Thanks for checking! Following the previous commit, I kept the same labels; I don't think changing them per language would significantly affect the results.

cc: @baberabb

@baberabb baberabb merged commit 5c006ed into main Jan 24, 2025
7 of 8 checks passed
@baberabb baberabb deleted the feature/#2649 branch January 24, 2025 16:00
Successfully merging this pull request may close these issues.

few-shot examples are not presented by their subjects for Global-MMLU-Lite (global_mmlu_{language})