aggregate by group (total and categories) #2643

Merged
merged 1 commit into from
Jan 21, 2025

@bzantium bzantium commented Jan 21, 2025

Aggregate results into MMLU-style groups (an overall total plus per-category scores).
Resolves #2640.

bzantium commented Jan 21, 2025

I tested with the Qwen2.5-7B model.
before:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| kmmlu_direct_accounting | 2 | none | 5 | exact_match | 0.5000 | ± 0.0503 |
| kmmlu_direct_agricultural_sciences | 2 | none | 5 | exact_match | 0.4040 | ± 0.0155 |
| kmmlu_direct_aviation_engineering_and_maintenance | 2 | none | 5 | exact_match | 0.5090 | ± 0.0158 |
| kmmlu_direct_biology | 2 | none | 5 | exact_match | 0.4120 | ± 0.0156 |
| kmmlu_direct_chemical_engineering | 2 | none | 5 | exact_match | 0.5330 | ± 0.0158 |
| kmmlu_direct_chemistry | 2 | none | 5 | exact_match | 0.5367 | ± 0.0204 |
| kmmlu_direct_civil_engineering | 2 | none | 5 | exact_match | 0.4900 | ± 0.0158 |
| kmmlu_direct_computer_science | 2 | none | 5 | exact_match | 0.7830 | ± 0.0130 |
| kmmlu_direct_construction | 2 | none | 5 | exact_match | 0.3990 | ± 0.0155 |
| kmmlu_direct_criminal_law | 2 | none | 5 | exact_match | 0.4000 | ± 0.0347 |
| kmmlu_direct_ecology | 2 | none | 5 | exact_match | 0.5370 | ± 0.0158 |
| kmmlu_direct_economics | 2 | none | 5 | exact_match | 0.5538 | ± 0.0438 |
| kmmlu_direct_education | 2 | none | 5 | exact_match | 0.7300 | ± 0.0446 |
| kmmlu_direct_electrical_engineering | 2 | none | 5 | exact_match | 0.3720 | ± 0.0153 |
| kmmlu_direct_electronics_engineering | 2 | none | 5 | exact_match | 0.6080 | ± 0.0154 |
| kmmlu_direct_energy_management | 2 | none | 5 | exact_match | 0.3810 | ± 0.0154 |
| kmmlu_direct_environmental_science | 2 | none | 5 | exact_match | 0.4090 | ± 0.0156 |
| kmmlu_direct_fashion | 2 | none | 5 | exact_match | 0.5170 | ± 0.0158 |
| kmmlu_direct_food_processing | 2 | none | 5 | exact_match | 0.4610 | ± 0.0158 |
| kmmlu_direct_gas_technology_and_engineering | 2 | none | 5 | exact_match | 0.4160 | ± 0.0156 |
| kmmlu_direct_geomatics | 2 | none | 5 | exact_match | 0.4560 | ± 0.0158 |
| kmmlu_direct_health | 2 | none | 5 | exact_match | 0.6800 | ± 0.0469 |
| kmmlu_direct_industrial_engineer | 2 | none | 5 | exact_match | 0.4900 | ± 0.0158 |
| kmmlu_direct_information_technology | 2 | none | 5 | exact_match | 0.7520 | ± 0.0137 |
| kmmlu_direct_interior_architecture_and_design | 2 | none | 5 | exact_match | 0.6010 | ± 0.0155 |
| kmmlu_direct_korean_history | 2 | none | 5 | exact_match | 0.3600 | ± 0.0482 |
| kmmlu_direct_law | 2 | none | 5 | exact_match | 0.5250 | ± 0.0158 |
| kmmlu_direct_machine_design_and_manufacturing | 2 | none | 5 | exact_match | 0.5420 | ± 0.0158 |
| kmmlu_direct_management | 2 | none | 5 | exact_match | 0.6250 | ± 0.0153 |
| kmmlu_direct_maritime_engineering | 2 | none | 5 | exact_match | 0.5050 | ± 0.0204 |
| kmmlu_direct_marketing | 2 | none | 5 | exact_match | 0.8250 | ± 0.0120 |
| kmmlu_direct_materials_engineering | 2 | none | 5 | exact_match | 0.5270 | ± 0.0158 |
| kmmlu_direct_math | 2 | none | 5 | exact_match | 0.3400 | ± 0.0274 |
| kmmlu_direct_mechanical_engineering | 2 | none | 5 | exact_match | 0.4590 | ± 0.0158 |
| kmmlu_direct_nondestructive_testing | 2 | none | 5 | exact_match | 0.5430 | ± 0.0158 |
| kmmlu_direct_patent | 2 | none | 5 | exact_match | 0.3700 | ± 0.0485 |
| kmmlu_direct_political_science_and_sociology | 2 | none | 5 | exact_match | 0.6400 | ± 0.0278 |
| kmmlu_direct_psychology | 2 | none | 5 | exact_match | 0.5130 | ± 0.0158 |
| kmmlu_direct_public_safety | 2 | none | 5 | exact_match | 0.4270 | ± 0.0156 |
| kmmlu_direct_railway_and_automotive_engineering | 2 | none | 5 | exact_match | 0.4200 | ± 0.0156 |
| kmmlu_direct_real_estate | 2 | none | 5 | exact_match | 0.5000 | ± 0.0354 |
| kmmlu_direct_refrigerating_machinery | 2 | none | 5 | exact_match | 0.4170 | ± 0.0156 |
| kmmlu_direct_social_welfare | 2 | none | 5 | exact_match | 0.6660 | ± 0.0149 |
| kmmlu_direct_taxation | 2 | none | 5 | exact_match | 0.4300 | ± 0.0351 |
| kmmlu_direct_telecommunications_and_wireless_technology | 2 | none | 5 | exact_match | 0.6340 | ± 0.0152 |

after:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| kmmlu_direct | 2 | none | | exact_match | 0.5188 | ± 0.0026 |
| - kmmlu_direct_applied_science | 2 | none | | exact_match | 0.4923 | ± 0.0046 |
|  - kmmlu_direct_aviation_engineering_and_maintenance | 2 | none | 5 | exact_match | 0.5090 | ± 0.0158 |
|  - kmmlu_direct_electronics_engineering | 2 | none | 5 | exact_match | 0.6080 | ± 0.0154 |
|  - kmmlu_direct_energy_management | 2 | none | 5 | exact_match | 0.3810 | ± 0.0154 |
|  - kmmlu_direct_environmental_science | 2 | none | 5 | exact_match | 0.4090 | ± 0.0156 |
|  - kmmlu_direct_gas_technology_and_engineering | 2 | none | 5 | exact_match | 0.4160 | ± 0.0156 |
|  - kmmlu_direct_geomatics | 2 | none | 5 | exact_match | 0.4560 | ± 0.0158 |
|  - kmmlu_direct_industrial_engineer | 2 | none | 5 | exact_match | 0.4900 | ± 0.0158 |
|  - kmmlu_direct_machine_design_and_manufacturing | 2 | none | 5 | exact_match | 0.5420 | ± 0.0158 |
|  - kmmlu_direct_maritime_engineering | 2 | none | 5 | exact_match | 0.5050 | ± 0.0204 |
|  - kmmlu_direct_nondestructive_testing | 2 | none | 5 | exact_match | 0.5430 | ± 0.0158 |
|  - kmmlu_direct_railway_and_automotive_engineering | 2 | none | 5 | exact_match | 0.4200 | ± 0.0156 |
|  - kmmlu_direct_telecommunications_and_wireless_technology | 2 | none | 5 | exact_match | 0.6340 | ± 0.0152 |
| - kmmlu_direct_humss | 2 | none | | exact_match | 0.5688 | ± 0.0068 |
|  - kmmlu_direct_accounting | 2 | none | 5 | exact_match | 0.5000 | ± 0.0503 |
|  - kmmlu_direct_criminal_law | 2 | none | 5 | exact_match | 0.4000 | ± 0.0347 |
|  - kmmlu_direct_economics | 2 | none | 5 | exact_match | 0.5538 | ± 0.0438 |
|  - kmmlu_direct_education | 2 | none | 5 | exact_match | 0.7300 | ± 0.0446 |
|  - kmmlu_direct_korean_history | 2 | none | 5 | exact_match | 0.3600 | ± 0.0482 |
|  - kmmlu_direct_law | 2 | none | 5 | exact_match | 0.5250 | ± 0.0158 |
|  - kmmlu_direct_management | 2 | none | 5 | exact_match | 0.6250 | ± 0.0153 |
|  - kmmlu_direct_political_science_and_sociology | 2 | none | 5 | exact_match | 0.6400 | ± 0.0278 |
|  - kmmlu_direct_psychology | 2 | none | 5 | exact_match | 0.5130 | ± 0.0158 |
|  - kmmlu_direct_social_welfare | 2 | none | 5 | exact_match | 0.6660 | ± 0.0149 |
|  - kmmlu_direct_taxation | 2 | none | 5 | exact_match | 0.4300 | ± 0.0351 |
| - kmmlu_direct_other | 2 | none | | exact_match | 0.5067 | ± 0.0053 |
|  - kmmlu_direct_agricultural_sciences | 2 | none | 5 | exact_match | 0.4040 | ± 0.0155 |
|  - kmmlu_direct_construction | 2 | none | 5 | exact_match | 0.3990 | ± 0.0155 |
|  - kmmlu_direct_fashion | 2 | none | 5 | exact_match | 0.5170 | ± 0.0158 |
|  - kmmlu_direct_food_processing | 2 | none | 5 | exact_match | 0.4610 | ± 0.0158 |
|  - kmmlu_direct_health | 2 | none | 5 | exact_match | 0.6800 | ± 0.0469 |
|  - kmmlu_direct_interior_architecture_and_design | 2 | none | 5 | exact_match | 0.6010 | ± 0.0155 |
|  - kmmlu_direct_marketing | 2 | none | 5 | exact_match | 0.8250 | ± 0.0120 |
|  - kmmlu_direct_patent | 2 | none | 5 | exact_match | 0.3700 | ± 0.0485 |
|  - kmmlu_direct_public_safety | 2 | none | 5 | exact_match | 0.4270 | ± 0.0156 |
|  - kmmlu_direct_real_estate | 2 | none | 5 | exact_match | 0.5000 | ± 0.0354 |
|  - kmmlu_direct_refrigerating_machinery | 2 | none | 5 | exact_match | 0.4170 | ± 0.0156 |
| - kmmlu_direct_stem | 2 | none | | exact_match | 0.5342 | ± 0.0048 |
|  - kmmlu_direct_biology | 2 | none | 5 | exact_match | 0.4120 | ± 0.0156 |
|  - kmmlu_direct_chemical_engineering | 2 | none | 5 | exact_match | 0.5330 | ± 0.0158 |
|  - kmmlu_direct_chemistry | 2 | none | 5 | exact_match | 0.5367 | ± 0.0204 |
|  - kmmlu_direct_civil_engineering | 2 | none | 5 | exact_match | 0.4900 | ± 0.0158 |
|  - kmmlu_direct_computer_science | 2 | none | 5 | exact_match | 0.7830 | ± 0.0130 |
|  - kmmlu_direct_ecology | 2 | none | 5 | exact_match | 0.5370 | ± 0.0158 |
|  - kmmlu_direct_electrical_engineering | 2 | none | 5 | exact_match | 0.3720 | ± 0.0153 |
|  - kmmlu_direct_information_technology | 2 | none | 5 | exact_match | 0.7520 | ± 0.0137 |
|  - kmmlu_direct_materials_engineering | 2 | none | 5 | exact_match | 0.5270 | ± 0.0158 |
|  - kmmlu_direct_math | 2 | none | 5 | exact_match | 0.3400 | ± 0.0274 |
|  - kmmlu_direct_mechanical_engineering | 2 | none | 5 | exact_match | 0.4590 | ± 0.0158 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| kmmlu_direct | 2 | none | | exact_match | 0.5188 | ± 0.0026 |
| - kmmlu_direct_applied_science | 2 | none | | exact_match | 0.4923 | ± 0.0046 |
| - kmmlu_direct_humss | 2 | none | | exact_match | 0.5688 | ± 0.0068 |
| - kmmlu_direct_other | 2 | none | | exact_match | 0.5067 | ± 0.0053 |
| - kmmlu_direct_stem | 2 | none | | exact_match | 0.5342 | ± 0.0048 |
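
For readers wondering how the group rows relate to the rows beneath them, here is a minimal Python sketch. It assumes the aggregation is a mean weighted by the number of examples per subtask, as MMLU-style groups are commonly configured; the PR text itself does not state the rule. The helper `weighted_mean` and the example sizes are hypothetical and are not taken from the harness or from this run. Only the four category scores are copied from the Groups table.

```python
def weighted_mean(scores, sizes):
    """Size-weighted mean: sum(score * n) / sum(n).

    Hypothetical helper for illustration; not the harness's own code.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total <= 0:
        raise ValueError("sizes must sum to a positive number")
    return sum(s * n for s, n in zip(scores, sizes)) / total


# Category scores copied from the Groups table in this PR.
category_scores = {
    "kmmlu_direct_applied_science": 0.4923,
    "kmmlu_direct_humss": 0.5688,
    "kmmlu_direct_other": 0.5067,
    "kmmlu_direct_stem": 0.5342,
}

# Plain (unweighted) average of the four categories.
unweighted = sum(category_scores.values()) / len(category_scores)
print(f"unweighted mean of categories: {unweighted:.4f}")

# Made-up sizes, only to show how weighting shifts the result
# toward the larger subset.
print(weighted_mean([0.5, 1.0], [100, 300]))
```

The unweighted average of the four categories comes out to about 0.5255, while the reported `kmmlu_direct` total is 0.5188. That gap is what one would expect if larger subsets count for more, though confirming it would require the per-subtask example counts, which this PR does not list.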

@bzantium bzantium requested a review from h-albert-lee January 21, 2025 05:16
@baberabb baberabb merged commit b2c090c into main Jan 21, 2025
8 checks passed
@baberabb baberabb deleted the feature/#2640 branch January 21, 2025 16:48
Successfully merging this pull request may close these issues.

refactor kmmlu task; aggregate results by group