Add MBPP (#2247)
* add mbpp

* fix some bugs

* add README for mbpp

* update README

* nits

---------

Co-authored-by: Hojin Lee <[email protected]>
Co-authored-by: Baber <[email protected]>
3 people authored Jan 15, 2025
1 parent 4c11206 commit 5db23e2
Showing 4 changed files with 125 additions and 0 deletions.
1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -72,6 +72,7 @@
| [logiqa](logiqa/README.md) | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
| [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| [mathqa](mathqa/README.md) | Question answering tasks involving mathematical reasoning and problem-solving. | English |
| [mbpp](mbpp/README.md) | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions. | Python |
| [mc_taco](mc_taco/README.md) | Question-answer pairs that require temporal commonsense comprehension. | English |
| [med_concepts_qa](med_concepts_qa/README.md) | Benchmark for evaluating LLMs on their ability to interpret medical codes and distinguish between medical concepts. | English |
| [metabench](metabench/README.md) | Distilled versions of six popular benchmarks which are highly predictive of overall benchmark performance and of a single general ability latent trait. | English |
43 changes: 43 additions & 0 deletions lm_eval/tasks/mbpp/README.md
@@ -0,0 +1,43 @@
# MBPP

## Paper
Program Synthesis with Large Language Models
https://arxiv.org/abs/2108.07732

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.

Homepage: https://github.com/google-research/google-research/tree/master/mbpp
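
Each MBPP record pairs a natural-language task description (`text`) with a reference solution (`code`) and three assert-based test cases (`test_list`). A minimal sketch of inspecting the data, assuming the Hugging Face `datasets` library and the `google-research-datasets/mbpp` path used by the task config in this commit:

```python
# Sketch: load the "full" MBPP config used by this task and look at one record.
from datasets import load_dataset

mbpp = load_dataset("google-research-datasets/mbpp", "full", split="test")

example = mbpp[0]
print(example["text"])       # natural-language task description
print(example["code"])       # reference Python solution
print(example["test_list"])  # assert statements used to check generated code
```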


## Citation
```
@article{austin2021program,
  title={Program synthesis with large language models},
  author={Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and others},
  journal={arXiv preprint arXiv:2108.07732},
  year={2021}
}
```

### Groups and Tasks

#### Groups

* Not part of a group yet.

#### Tasks

- `mbpp`
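
Because this task executes model-generated code against the reference asserts, the Hugging Face `code_eval` metric must be explicitly enabled via the `HF_ALLOW_CODE_EVAL` environment variable, and recent harness versions additionally ask for confirmation because the task sets `unsafe_code: true`. A hedged sketch of one way to run it through the Python API (the `confirm_run_unsafe_code` argument and the model checkpoint are assumptions; adjust for your harness version):

```python
# Sketch only: the checkpoint is a placeholder, and confirm_run_unsafe_code may not
# exist in older harness versions -- drop it if your version does not accept it.
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval refuses to execute code otherwise

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["mbpp"],
    confirm_run_unsafe_code=True,  # assumed flag matching unsafe_code: true
)
print(results["results"]["mbpp"])
```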

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
23 changes: 23 additions & 0 deletions lm_eval/tasks/mbpp/mbpp.yaml
@@ -0,0 +1,23 @@
task: mbpp
dataset_path: google-research-datasets/mbpp
dataset_name: full
unsafe_code: true
output_type: generate_until
test_split: test
doc_to_text: "You are an expert Python programmer, and here is your task: {{text}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]"
doc_to_target: "{% if is_fewshot is defined %}{{code}}\n[DONE]{% else %}{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}{% endif %}"
target_delimiter: "\n"
metric_list:
  - metric: !function utils.pass_at_1
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  until:
    - "[DONE]"
  do_sample: false
num_fewshot: 3
fewshot_config:
  sampler: first_n
  samples: !function utils.list_fewshot_samples
metadata:
  version: 1.0
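
The `doc_to_text` template above turns each record into a prompt that states the task, lists the three asserts, and ends with `[BEGIN]`, while generation stops at `[DONE]` (the marker that closes each few-shot target). A small sketch of how one few-shot record renders, assuming `jinja2` is installed (the harness uses Jinja templates for these fields):

```python
# Sketch: render the doc_to_text template for one few-shot sample to show the
# prompt format the model sees.
from jinja2 import Template

doc_to_text = (
    "You are an expert Python programmer, and here is your task: {{text}} "
    "Your code should pass these tests:\n\n"
    "{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]"
)

doc = {
    "text": "Write a python function to identify non-prime numbers.",
    "test_list": [
        "assert is_not_prime(2) == False",
        "assert is_not_prime(10) == True",
        "assert is_not_prime(35) == True",
    ],
}

print(Template(doc_to_text).render(**doc))
```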
58 changes: 58 additions & 0 deletions lm_eval/tasks/mbpp/utils.py
@@ -0,0 +1,58 @@
import evaluate as hf_evaluate


try:
    pass_at_k = hf_evaluate.load("code_eval")

    # Run a simple test to check that code execution is enabled before model generation;
    # code_eval raises unless the HF_ALLOW_CODE_EVAL environment variable is set to "1".
    test_cases = ["assert add(2, 3)==5"]
    candidates = [["def add(a,b): return a*b"]]
    results = pass_at_k.compute(references=test_cases, predictions=candidates, k=[1])
except Exception as e:
    raise e


def pass_at_1(references, predictions):
    return pass_at_k.compute(
        references=references,
        predictions=[predictions],
        k=[1],
    )[0]["pass@1"]


def list_fewshot_samples():
    return [
        {
            "task_id": 2,
            "text": "Write a function to find the similar elements from the given two tuple lists.",
            "code": "def similar_elements(test_tup1, test_tup2):\r\n res = tuple(set(test_tup1) & set(test_tup2))\r\n return (res) ",
            "test_list": [
                "assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)",
                "assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4)",
                "assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14)",
            ],
            "is_fewshot": True,
        },
        {
            "task_id": 3,
            "text": "Write a python function to identify non-prime numbers.",
            "code": "import math\r\ndef is_not_prime(n):\r\n result = False\r\n for i in range(2,int(math.sqrt(n)) + 1):\r\n  if n % i == 0:\r\n   result = True\r\n return result",
            "test_list": [
                "assert is_not_prime(2) == False",
                "assert is_not_prime(10) == True",
                "assert is_not_prime(35) == True",
            ],
            "is_fewshot": True,
        },
        {
            "task_id": 4,
            "text": "Write a function to find the largest integers from a given list of numbers using heap queue algorithm.",
            "code": "import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n largest_nums = hq.nlargest(n, nums)\r\n return largest_nums",
            "test_list": [
                "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] ",
                "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] ",
                "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35]",
            ],
            "is_fewshot": True,
        },
    ]
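
For a quick sanity check of the scoring path, `pass_at_1` can be called directly; the shapes below mirror its body, which expects the asserts for one problem as a single-element list of references and the generated candidates as a list of strings, returning the `pass@1` value from `code_eval`. A minimal usage sketch, assuming `HF_ALLOW_CODE_EVAL=1` is set and the `add` completion stands in for a model generation:

```python
# Usage sketch (requires HF_ALLOW_CODE_EVAL=1 so code_eval may execute code).
refs = ["assert add(2, 3) == 5\nassert add(10, -1) == 9"]  # one problem's tests
gens = ["def add(a, b):\n    return a + b"]                # candidate completions

print(pass_at_1(references=refs, predictions=gens))  # 1.0 when the candidate passes
```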
