
feat: started working on SWE-bench evals #142

Open · wants to merge 2 commits into base: master

Conversation

@ErikBjare (Owner) commented Sep 30, 2024

Implemented with gptme, using moatless-tools and aider as reference implementations.

  • Set up harness
  • Get a single eval instance passing
    • Gets stuck installing deps for repos
      • moatless-tools doesn't seem to support running tests?
      • aider depends on the Docker env?
  • Try making our own eval instance?

Important

Introduces SWE-bench evaluation framework in gptme with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.

  • New Features:
    • Introduces SWE-bench evaluation framework in gptme/eval/swebench.
    • Implements run_swebench_evaluation() in evaluate.py to evaluate instances using an Agent.
    • Adds CLI command in main.py for running evaluations with options for model, dataset, split, instance, and verbosity.
  • Utilities:
    • utils.py provides functions for loading instances, setting up repositories, and extracting file spans from patches (see the loading sketch after this list).
  • Configuration:
    • Adds gptme-eval-swebench script entry in pyproject.toml.
    • Adds datasets and fsspec as dependencies in pyproject.toml.
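
For a sense of what the instance-loading utility might look like, here is a minimal sketch. It is not the PR's actual code: `load_instances` is a hypothetical name, and the dataset path, split, and field names are assumed from the public SWE-bench dataset on Hugging Face, loaded via the newly added `datasets` dependency.

```python
from datasets import load_dataset  # added as a dependency in this PR


def load_instances(
    dataset_name: str = "princeton-nlp/SWE-bench_Lite",  # assumed dataset path
    split: str = "test",
) -> dict[str, dict]:
    """Load SWE-bench instances keyed by instance_id.

    Sketch only; field names (instance_id, repo, base_commit,
    problem_statement) follow the public SWE-bench schema.
    """
    ds = load_dataset(dataset_name, split=split)
    return {row["instance_id"]: row for row in ds}


# Usage: pick one instance and inspect its task
instances = load_instances()
first = next(iter(instances.values()))
print(first["repo"], first["base_commit"])
```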

This description was created by Ellipsis for 4e9b48a. It will automatically update as commits are pushed.

@ellipsis-dev bot (Contributor) left a comment

👍 Looks good to me! Reviewed everything up to 96f1ede in 12 seconds

More details
  • Looked at 376 lines of code in 6 files
  • Skipped 1 file when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. gptme/eval/swebench/utils.py:10
  • Draft comment:
    The import statement for DownloadMode is repeated. Remove the duplicate import to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import statement for DownloadMode is repeated, which is unnecessary and can be removed.
2. gptme/eval/swebench/utils.py:46
  • Draft comment:
    The current_file variable is initialized but never used. Consider removing it to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The get_file_spans_from_patch function initializes current_file but never uses it, which is unnecessary and can be removed.
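
For context, get_file_spans_from_patch evidently maps a unified diff to the files it touches (the unused current_file hints at a structure like the one below). A minimal illustrative sketch, assuming standard unified-diff headers; this is not the PR's implementation:

```python
import re


def file_spans_from_patch(patch: str) -> dict[str, list[int]]:
    """Map each file touched by a unified diff to its added line numbers.

    Illustrative sketch of a get_file_spans_from_patch-style helper;
    the PR's actual implementation may differ.
    """
    spans: dict[str, list[int]] = {}
    current_file = None
    lineno = 0
    for line in patch.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/") :]
            spans[current_file] = []
        elif m := re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line):
            lineno = int(m.group(1))  # start line in the new file
        elif current_file and line.startswith("+"):
            spans[current_file].append(lineno)  # added line
            lineno += 1
        elif current_file and not line.startswith("-"):
            lineno += 1  # context line advances the new-file position
    return spans
```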
3. gptme/eval/swebench/utils.py:74
  • Draft comment:
    Using os.chdir to change the working directory can have side effects. Consider using a context manager to temporarily change the directory.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The setup_github_repo function changes the current working directory using os.chdir, which can have side effects. It's better to use a context manager to temporarily change the directory.
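
The suggested fix is only a few lines. Python 3.11+ ships this as contextlib.chdir; a sketch for older interpreters:

```python
import os
from contextlib import contextmanager


@contextmanager
def chdir(path: str):
    """Temporarily change the working directory, restoring it on exit.

    Equivalent to contextlib.chdir (Python 3.11+), shown here for
    older interpreters.
    """
    old = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old)


# Hypothetical usage inside setup_github_repo:
# with chdir(repo_dir):
#     subprocess.run(["git", "checkout", base_commit], check=True)
```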
4. gptme/eval/swebench/main.py:86
  • Draft comment:
    The write_results function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The write_results function is called but not defined in the provided code. Ensure that it is implemented or imported correctly.


@codecov-commenter commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 0% with 142 lines in your changes missing coverage. Please review.

Project coverage is 77.15%. Comparing base (81708e6) to head (4e9b48a).

✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
gptme/eval/swebench/evaluate.py | 0.00% | 62 Missing ⚠️
gptme/eval/swebench/utils.py | 0.00% | 47 Missing ⚠️
gptme/eval/swebench/main.py | 0.00% | 29 Missing ⚠️
gptme/eval/swebench/__init__.py | 0.00% | 3 Missing ⚠️
gptme/eval/swebench/__main__.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
- Coverage   80.63%   77.15%   -3.49%     
==========================================
  Files          52       57       +5     
  Lines        3145     3287     +142     
==========================================
  Hits         2536     2536              
- Misses        609      751     +142     
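
For reference, the head coverage follows directly from these totals: 2536 hits / 3287 lines ≈ 77.15%, so the 142 uncovered new lines fully account for the 3.49-point drop.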
Flag | Coverage | Δ
anthropic/claude-3-haiku-20240307 | 76.11% <0.00%> | -3.44% ⬇️
openai/gpt-4o-mini | 75.84% <0.00%> | -3.43% ⬇️

Flags with carried forward coverage won't be shown.


@ErikBjare mentioned this pull request Sep 30, 2024 (8 tasks)
@ellipsis-dev bot (Contributor) left a comment

👍 Looks good to me! Incremental review on 4e9b48a in 6 seconds

More details
  • Looked at 21 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comment based on config settings.
1. gptme/eval/swebench/main.py:4
  • Draft comment:
    The import EvalResult is unused and can be removed to clean up the code.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The import statement for EvalResult is not used in the code, which is unnecessary and should be removed to keep the code clean.


@ErikBjare (Owner, Author) commented Nov 1, 2024

Anthropic announced that Claude 3.5 Sonnet (new), aka Claude "3.6", scores 49% on SWE-bench Verified with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet

I think optimizing for the particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models.

Would be cool to make a proper run and get listed on the SWE-Bench leaderboard, though.
