
feat: started working on SWE-bench evals #142

Open · wants to merge 2 commits into base: master

Conversation

@ErikBjare (Owner) commented Sep 30, 2024

Implemented with gptme, using moatless-tools and aider as reference implementations.

  • Set up harness
  • Get a single eval instance passing
    • Gets stuck installing deps for repos
      • moatless-tools doesn't seem to support running tests?
      • aider depends on the Docker env?
  • Try making our own eval instance?

Important

Introduces SWE-bench evaluation framework in gptme with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.

  • New Features:
    • Introduces SWE-bench evaluation framework in gptme/eval/swebench.
    • Implements run_swebench_evaluation() in evaluate.py to evaluate instances using an Agent.
    • Adds CLI command in main.py for running evaluations with options for model, dataset, split, instance, and verbosity.
  • Utilities:
    • utils.py provides functions for loading instances, setting up repositories, and extracting file spans from patches (see the loading sketch after this list).
  • Configuration:
    • Adds gptme-eval-swebench script entry in pyproject.toml.
    • Adds datasets and fsspec as dependencies in pyproject.toml.
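
For a sense of what the instance-loading utility might look like, here is a minimal sketch. It is not the PR's actual code: `load_instances` is a hypothetical name, and the dataset path, split, and field names are assumed from the public SWE-bench dataset on Hugging Face, loaded via the newly added `datasets` dependency.

```python
from datasets import load_dataset  # added as a dependency in this PR


def load_instances(
    dataset_name: str = "princeton-nlp/SWE-bench_Lite",  # assumed dataset path
    split: str = "test",
) -> dict[str, dict]:
    """Load SWE-bench instances keyed by instance_id.

    Sketch only; field names (instance_id, repo, base_commit,
    problem_statement) follow the public SWE-bench schema.
    """
    ds = load_dataset(dataset_name, split=split)
    return {row["instance_id"]: row for row in ds}


# Usage: pick one instance and inspect its task
instances = load_instances()
first = next(iter(instances.values()))
print(first["repo"], first["base_commit"])
```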

This description was created by Ellipsis for 4e9b48a. It will automatically update as commits are pushed.

@ellipsis-dev bot (Contributor) left a comment

👍 Looks good to me! Reviewed everything up to 96f1ede in 12 seconds

More details
  • Looked at 376 lines of code in 6 files
  • Skipped 1 file when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. gptme/eval/swebench/utils.py:10
  • Draft comment:
    The import statement for DownloadMode is repeated. Remove the duplicate import to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import statement for DownloadMode is repeated, which is unnecessary and can be removed.
2. gptme/eval/swebench/utils.py:46
  • Draft comment:
    The current_file variable is initialized but never used. Consider removing it to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The get_file_spans_from_patch function initializes current_file but never uses it, which is unnecessary and can be removed.
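
For context, get_file_spans_from_patch evidently maps a unified diff to the files it touches (the unused current_file hints at a structure like the one below). A minimal illustrative sketch, assuming standard unified-diff headers; this is not the PR's implementation:

```python
import re


def file_spans_from_patch(patch: str) -> dict[str, list[int]]:
    """Map each file touched by a unified diff to its added line numbers.

    Illustrative sketch of a get_file_spans_from_patch-style helper;
    the PR's actual implementation may differ.
    """
    spans: dict[str, list[int]] = {}
    current_file = None
    lineno = 0
    for line in patch.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/") :]
            spans[current_file] = []
        elif m := re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line):
            lineno = int(m.group(1))  # start line in the new file
        elif current_file and line.startswith("+"):
            spans[current_file].append(lineno)  # added line
            lineno += 1
        elif current_file and not line.startswith("-"):
            lineno += 1  # context line advances the new-file position
    return spans
```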
3. gptme/eval/swebench/utils.py:74
  • Draft comment:
    Using os.chdir to change the working directory can have side effects. Consider using a context manager to temporarily change the directory.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The setup_github_repo function changes the current working directory using os.chdir, which can have side effects. It's better to use a context manager to temporarily change the directory.
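
The suggested fix is only a few lines. Python 3.11+ ships this as contextlib.chdir; a sketch for older interpreters:

```python
import os
from contextlib import contextmanager


@contextmanager
def chdir(path: str):
    """Temporarily change the working directory, restoring it on exit.

    Equivalent to contextlib.chdir (Python 3.11+), shown here for
    older interpreters.
    """
    old = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old)


# Hypothetical usage inside setup_github_repo:
# with chdir(repo_dir):
#     subprocess.run(["git", "checkout", base_commit], check=True)
```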
4. gptme/eval/swebench/main.py:86
  • Draft comment:
    The write_results function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The write_results function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.


@codecov-commenter commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 0% with 142 lines in your changes missing coverage. Please review.

Project coverage is 77.15%. Comparing base (81708e6) to head (4e9b48a).

✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
gptme/eval/swebench/evaluate.py | 0.00% | 62 Missing ⚠️
gptme/eval/swebench/utils.py | 0.00% | 47 Missing ⚠️
gptme/eval/swebench/main.py | 0.00% | 29 Missing ⚠️
gptme/eval/swebench/__init__.py | 0.00% | 3 Missing ⚠️
gptme/eval/swebench/__main__.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
- Coverage   80.63%   77.15%   -3.49%     
==========================================
  Files          52       57       +5     
  Lines        3145     3287     +142     
==========================================
  Hits         2536     2536              
- Misses        609      751     +142     
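
For reference, the head coverage follows directly from these totals: 2536 hits / 3287 lines ≈ 77.15%, so the 142 uncovered new lines fully account for the 3.49-point drop.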
Flag | Coverage | Δ
anthropic/claude-3-haiku-20240307 | 76.11% <0.00%> | -3.44% ⬇️
openai/gpt-4o-mini | 75.84% <0.00%> | -3.43% ⬇️

Flags with carried forward coverage won't be shown.


@ErikBjare mentioned this pull request Sep 30, 2024 (8 tasks)
@ellipsis-dev bot (Contributor) left a comment

👍 Looks good to me! Incremental review on 4e9b48a in 6 seconds

More details
  • Looked at 21 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comment based on config settings.
1. gptme/eval/swebench/main.py:4
  • Draft comment:
    The import EvalResult is unused and can be removed to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import statement for EvalResult is not used in the code, which is unnecessary and should be removed to keep the code clean.


@ErikBjare (Owner, Author) commented Nov 1, 2024

Anthropic announced that Claude 3.5 Sonnet (new), aka Claude "3.6", scores 49% on SWE-bench Verified with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet

I think optimizing for the particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models.

Would be cool to make a proper run and get listed on the SWE-Bench leaderboard, though.
