Benchmarks/evals #63

Open
4 of 8 tasks
ErikBjare opened this issue Jan 20, 2024 · 1 comment

ErikBjare (Owner) commented Jan 20, 2024

I did some smaller benchmarks (more like tests, really) and would like to continue this effort to evaluate capabilities and weak spots.

It would also be interesting to compare against gpt-engineer on codegen tasks (see #62), for example using the gpt-engineer suite and SWE-bench.

ErikBjare changed the title from Benchmarks to Benchmarks/evals on Jan 20, 2024
ErikBjare (Owner, Author) commented

Improved the eval harness quite a bit in #90, among other changes (including a lot of Docker-related work).

I'm now 80% happy with the harness and am trying to think about how it would provide value for the project/community.

This includes deciding which types of things to eval (shell scripting, complicated patches, Python REPL tasks), and which external evals we should try to run gptme on (probably a great learning opportunity to gain experience with other evals).
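
For illustration, a minimal sketch of what an eval case along those lines could look like. All names here (`EvalCase`, `run_case`, the `agent` callable) are hypothetical and not the actual harness API; the idea is just prompt + starting files + simple pass/fail checks over whatever workspace the agent leaves behind.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical eval-case format (not gptme's actual harness API).
# A workspace is modeled as a mapping of filename -> file contents.
Files = dict[str, str]


@dataclass
class EvalCase:
    name: str
    prompt: str
    files: Files                                  # initial workspace contents
    checks: dict[str, Callable[[Files], bool]]    # named pass/fail predicates


CASES = [
    EvalCase(
        name="shell-hello",
        prompt="Write a shell script hello.sh that prints 'Hello, world!'",
        files={},
        checks={
            "file_exists": lambda ws: "hello.sh" in ws,
            "prints_hello": lambda ws: "Hello, world!" in ws.get("hello.sh", ""),
        },
    ),
    EvalCase(
        name="patch-fix-typo",
        prompt="Fix the typo in greet.py so it prints 'Hello'",
        files={"greet.py": "print('Helo')\n"},
        checks={"typo_fixed": lambda ws: "Hello" in ws.get("greet.py", "")},
    ),
]


def run_case(case: EvalCase, agent: Callable[[str, Files], Files]) -> dict[str, bool]:
    """Run one case: hand the prompt and starting files to an agent callable,
    then evaluate each check against the workspace it returns."""
    workspace = agent(case.prompt, dict(case.files))
    return {name: check(workspace) for name, check in case.checks.items()}


if __name__ == "__main__":
    # Stand-in "agent" that solves nothing, just to show the harness loop.
    noop_agent = lambda prompt, files: files
    for case in CASES:
        print(case.name, run_case(case, noop_agent))
```

External suites like SWE-bench or the gpt-engineer benchmarks would instead supply the tasks and grading themselves, with gptme plugged in as the agent under test.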
