Commit message:

* add mbpp
* fix some bugs
* add README for mbpp
* update README
* nits

Co-authored-by: Hojin Lee <[email protected]>
Co-authored-by: Baber <[email protected]>
Commit 5db23e2 (parent 4c11206): 4 changed files, 125 additions, 0 deletions.
New file: README for the `mbpp` task (+43 lines):
# MBPP

## Paper

Program Synthesis with Large Language Models
https://arxiv.org/abs/2108.07732

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.

Homepage: https://github.com/google-research/google-research/tree/master/mbpp
## Citation

```
@article{austin2021program,
  title={Program synthesis with large language models},
  author={Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and others},
  journal={arXiv preprint arXiv:2108.07732},
  year={2021}
}
```

### Groups and Tasks

#### Groups

* Not part of a group yet.

#### Tasks

* `mbpp`

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
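As a quick orientation (an illustrative sketch, not part of the commit), the dataset this task evaluates on can be loaded directly from the Hugging Face Hub; the path, config name, and split below are taken from the task YAML that follows, and the sketch assumes the `datasets` package is installed:

```python
from datasets import load_dataset

# Path and config name taken from the task YAML (dataset_path / dataset_name).
mbpp = load_dataset("google-research-datasets/mbpp", "full", split="test")

sample = mbpp[0]
print(sample["text"])       # natural-language task description
print(sample["test_list"])  # assert statements; the prompt uses the first three
```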
New file: `mbpp` task configuration YAML (+23 lines):
```yaml
task: mbpp
dataset_path: google-research-datasets/mbpp
dataset_name: full
unsafe_code: true
output_type: generate_until
test_split: test
doc_to_text: "You are an expert Python programmer, and here is your task: {{text}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]"
doc_to_target: "{% if is_fewshot is defined %}{{code}}\n[DONE]{% else %}{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}{% endif %}"
target_delimiter: "\n"
metric_list:
  - metric: !function utils.pass_at_1
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  until:
    - "[DONE]"
  do_sample: false
num_fewshot: 3
fewshot_config:
  sampler: first_n
  samples: !function utils.list_fewshot_samples
metadata:
  version: 1.0
```
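To make the prompt format concrete, here is a minimal sketch (again, not part of the commit) of how the `doc_to_text` template renders for one hypothetical document; it assumes the `jinja2` package, which the harness uses for these templates:

```python
from jinja2 import Template

# Hypothetical document with the two fields the template expects.
doc = {
    "text": "Write a function to add two numbers.",
    "test_list": [
        "assert add(1, 2) == 3",
        "assert add(0, 0) == 0",
        "assert add(-1, 1) == 0",
    ],
}

doc_to_text = (
    "You are an expert Python programmer, and here is your task: {{text}} "
    "Your code should pass these tests:\n\n"
    "{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]"
)

print(Template(doc_to_text).render(**doc))
```

The model generates after `[BEGIN]`, and decoding stops once `[DONE]` is emitted, per `generation_kwargs.until`.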
New file: `utils.py`, the helper module the YAML references via `!function` (+58 lines):
```python
import evaluate as hf_evaluate


try:
    pass_at_k = hf_evaluate.load("code_eval")

    # Run a simple self-test to check that code execution is enabled before
    # model generation: code_eval refuses to execute untrusted code unless the
    # HF_ALLOW_CODE_EVAL=1 environment variable is set, so fail fast here.
    test_cases = ["assert add(2, 3)==5"]
    candidates = [["def add(a,b): return a*b"]]
    results = pass_at_k.compute(references=test_cases, predictions=candidates, k=[1])
except Exception as e:
    raise e


def pass_at_1(references, predictions):
    # Score one document: a single reference test string against the list of
    # generated candidates. compute() returns (scores, detailed_results); only
    # the aggregate pass@1 score is kept.
    return pass_at_k.compute(
        references=references,
        predictions=[predictions],
        k=[1],
    )[0]["pass@1"]


def list_fewshot_samples():
    # Three-shot prompt examples (MBPP task_ids 2-4), reproduced verbatim from
    # the dataset. The "is_fewshot" flag switches doc_to_target to the
    # "{{code}}\n[DONE]" form used for in-context examples.
    return [
        {
            "task_id": 2,
            "text": "Write a function to find the similar elements from the given two tuple lists.",
            "code": "def similar_elements(test_tup1, test_tup2):\r\n res = tuple(set(test_tup1) & set(test_tup2))\r\n return (res) ",
            "test_list": [
                "assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)",
                "assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4)",
                "assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14)",
            ],
            "is_fewshot": True,
        },
        {
            "task_id": 3,
            "text": "Write a python function to identify non-prime numbers.",
            "code": "import math\r\ndef is_not_prime(n):\r\n result = False\r\n for i in range(2,int(math.sqrt(n)) + 1):\r\n if n % i == 0:\r\n result = True\r\n return result",
            "test_list": [
                "assert is_not_prime(2) == False",
                "assert is_not_prime(10) == True",
                "assert is_not_prime(35) == True",
            ],
            "is_fewshot": True,
        },
        {
            "task_id": 4,
            "text": "Write a function to find the largest integers from a given list of numbers using heap queue algorithm.",
            "code": "import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n largest_nums = hq.nlargest(n, nums)\r\n return largest_nums",
            "test_list": [
                "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] ",
                "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] ",
                "assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35]",
            ],
            "is_fewshot": True,
        },
    ]
```
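Finally, a small usage sketch for `pass_at_1` (hypothetical values, not part of the commit). Note that `HF_ALLOW_CODE_EVAL=1` must be set before the module is imported, because the self-test above runs at import time:

```python
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # opt in to executing model-generated code

from utils import pass_at_1  # assumes the file above is importable as `utils`

# One document: a single reference test string and one generated candidate.
score = pass_at_1(
    references=["assert add(2, 3) == 5"],
    predictions=["def add(a, b):\n    return a + b"],
)
print(score)  # 1.0 when the candidate passes its test
```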