
add support for run-bug-run runbugrun #39

Open

monperrus opened this issue Apr 4, 2023 · 9 comments

@monperrus (Contributor)

RunBugRun -- An Executable Dataset for Automated Program Repair
https://github.com/giganticode/run_bug_run

@andre15silva (Member)

https://github.com/giganticode/run_bug_run_data/releases/tag/v0.0.1

Seems like the first release is out

@monperrus monperrus changed the title add support for run-run-bugs add support for run-bug-run runbugrun May 3, 2023
@cadddr commented Oct 22, 2024

happy to take on that

@andre15silva (Member)

> happy to take on that

sounds good, let me know if you have any questions!

@cadddr commented Oct 23, 2024

> happy to take on that

> sounds good, let me know if you have any questions!

I started looking into this yesterday. A few things about RunBugRun:

  • It ships its own tool for managing the bug data and execution, rbugr, written in Ruby. I find it easier to work with the jsonl files directly, so I added download commands to the setup script.
  • While implementing the benchmark and bug subclasses for the Python bugs, I realized elle-elle-aime is hardwired for Java. Is it OK to carry on with Python? I have not worked with the RunBugRun Java subset.
  • RunBugRun programs receive inputs via standard input rather than function arguments, and print results to standard output rather than returning them. I made a utility (for Python) that programmatically pipes the input in and captures the output into a variable.
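The piping utility described in the last bullet could look roughly like this. This is a minimal sketch, not the actual PR code; the function name and parameters are illustrative:

```python
import subprocess
import sys

def run_with_stdin(source_path: str, test_input: str, timeout: int = 10) -> str:
    """Run a Python program, feeding `test_input` on stdin and capturing stdout.

    RunBugRun programs read from standard input and print their result,
    so the candidate program is executed as a subprocess with the test
    input piped in, and its stdout is returned as a string.
    """
    result = subprocess.run(
        [sys.executable, source_path],
        input=test_input,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

The captured stdout can then be compared against the expected output of each test case.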

@andre15silva (Member)

> While implementing the benchmark and bug subclasses for the Python bugs, I realized elle-elle-aime is hardwired for Java. Is it OK to carry on with Python? I have not worked with the RunBugRun Java subset.

When you integrate the benchmark you define which commands to run when compiling/testing each bug/patch.

The only part that is currently hard-coded for Java is the extraction of single functions, removal of comments, etc.
These are used when generating prompts.

See https://github.com/ASSERT-KTH/elle-elle-aime/tree/master/elleelleaime/core/utils/java

To integrate a Python benchmark you'll need to implement similar functions for Python (or even better, using tree-sitter to support more languages).
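For the Python side, a single-function extractor can be built on the stdlib `ast` module (tree-sitter would generalize this across languages, as suggested above). This is a hedged sketch, not the actual elle-elle-aime API; the function name is an assumption:

```python
import ast
from typing import Optional

def extract_function_source(source: str, name: str) -> Optional[str]:
    """Return the source text of the function `name`, or None if absent.

    Parses the module, walks the tree, and uses ast.get_source_segment
    (Python 3.8+) to recover the exact source slice of the matching
    function definition. Comments attached to the body are preserved
    because the slice is taken from the original text, not re-printed.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return ast.get_source_segment(source, node)
    return None
```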

@cadddr commented Oct 30, 2024

Sharing progress so far: #166
(Please don't merge as it is still missing a few things.)

I'm a little unclear on Bug.failing_tests -- does it map test methods to the resulting error messages? In RunBugRun there are simply test inputs and expected outputs, and the buggy code is not always a self-contained function.

Also, does the ground_truth diff only come into play when evaluating the LLM-generated fix? Why not simply check whether the tests pass?

Similarly, I'm not sure if I'm using the checkout logic correctly -- it seems like a drag to have to make a copy each time, so instead I simply read from the original buggy file.

Any feedback/corrections welcome!

@andre15silva (Member)

> I'm a little unclear on Bug.failing_tests -- does it map test methods to the resulting error messages?

Exactly, it maps fully qualified test method names to the error messages.

> In RunBugRun there are simply test inputs and expected outputs, and the buggy code is not always a self-contained function.

I see the solution you came up with, and that seems reasonable.

The only problem will be in extracting the test case (see https://github.com/ASSERT-KTH/elle-elle-aime/blob/master/elleelleaime/core/utils/java/java.py#L269). This means that we need to add a special case for RunBugRun here.
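One way to bridge the gap is to synthesize test names from the I/O pairs and store a comparison message under each. A minimal sketch under assumed names (`bug_id`, `cases`, `run` are all hypothetical, not elle-elle-aime's interface):

```python
def io_tests_to_failing_tests(bug_id, cases, run):
    """Map synthetic test names to error messages for failing I/O cases.

    `cases` is a list of (stdin_text, expected_stdout) pairs; `run`
    executes the buggy program on one input and returns its stdout.
    The result mimics Bug.failing_tests: test name -> error message.
    """
    failing = {}
    for i, (stdin_text, expected) in enumerate(cases):
        actual = run(stdin_text)
        if actual.strip() != expected.strip():
            failing[f"{bug_id}::test_{i}"] = (
                f"expected {expected!r}, got {actual!r} for input {stdin_text!r}"
            )
    return failing
```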

> Also, does the ground_truth diff only come into play when evaluating the LLM-generated fix?

The ground_truth diff is used in two places right now:

  1. In extracting the buggy function (see https://github.com/ASSERT-KTH/elle-elle-aime/blob/master/elleelleaime/core/utils/java/java.py#L143), during the generation of prompts
  2. In evaluating the generated fixed functions.

> Why not simply check whether the tests pass?

Executing tests to check is great, but there is a known problem in program repair called patch overfitting. The problem lies in patches that pass the tests but differ from what the developer intended (see e.g., "Is the cure worse than the disease? Overfitting in automated program repair").

For this reason, we use the ground-truth patch as a reference in some evaluation metrics like exact-match or ast-match.

> Similarly, I'm not sure if I'm using the checkout logic correctly -- it seems like a drag to have to make a copy each time, so instead I simply read from the original buggy file.

It's important to have that logic (every checkout copies the files from an untouched source) due to the parallelism. We want to be able to evaluate hundreds/thousands of patches at the same time, and this requires them to be in different locations.
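The checkout pattern amounts to copying the pristine sources into a fresh directory per evaluation. A minimal sketch (the function name is illustrative, not the actual checkout implementation):

```python
import shutil
import tempfile
from pathlib import Path

def checkout(buggy_dir: str) -> Path:
    """Copy the untouched buggy sources into a fresh working directory.

    Each candidate patch gets its own copy, so hundreds of evaluations
    can run in parallel without clobbering each other's files.
    """
    workdir = Path(tempfile.mkdtemp(prefix="elleelleaime-"))
    dest = workdir / Path(buggy_dir).name
    shutil.copytree(buggy_dir, dest)
    return dest
```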

> Any feedback/corrections welcome!

Could you rebase your PR? I changed the CI config to enable it on PRs. That way we can check if the tests are green. Thanks :)

@cadddr commented Nov 4, 2024

> The only problem will be in extracting the test case (see https://github.com/ASSERT-KTH/elle-elle-aime/blob/master/elleelleaime/core/utils/java/java.py#L269). This means that we need to add a special case for RunBugRun here.

> The ground_truth diff is used in two places right now:
>
>   1. In extracting the buggy function (see https://github.com/ASSERT-KTH/elle-elle-aime/blob/master/elleelleaime/core/utils/java/java.py#L143), during the generation of prompts

So, the test cases right now are simple asserts about the returned value.
I've overridden the instruct strategy for Python here, to circumvent having to extract the test case source:

febe8e4#diff-3f4ea3e207b6866ea3514390ef0148073207b05d1a8ca4da933d8f926e1be2d5

Got all the other points, will rebase PR.

@andre15silva (Member)

Looks like a good solution, thanks!

Let me know if you have any problems with the CI.
