I improved the eval harness quite a bit in #90, among other changes (including a lot of Docker work).
I'm now 80% happy with the harness and am trying to think about how it could provide value for the project/community.
This includes which types of things to eval (shell scripting, complicated patches, Python REPL stuff), and which external evals we should try running gptme on (which would probably be a great learning opportunity to get experience with other evals). A rough sketch of what such cases could look like is below.
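To make the task types concrete, here's a minimal sketch of eval cases covering shell scripting, patching, and REPL-style tasks. This is purely illustrative: `EvalCase` and its fields are hypothetical names for this sketch, not the actual harness API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical structure for illustration only, not the real harness API.
@dataclass
class EvalCase:
    name: str
    prompt: str  # instruction given to the agent
    files: dict[str, str] = field(default_factory=dict)  # initial workspace files
    check: Callable[[dict[str, str]], bool] = lambda files: True  # pass/fail on resulting files

# One example case per task type mentioned above
cases = [
    EvalCase(
        name="shell-scripting",
        prompt="Write hello.sh that prints 'Hello, world!'",
        check=lambda files: "Hello, world!" in files.get("hello.sh", ""),
    ),
    EvalCase(
        name="patch",
        prompt="Change the loop in main.py to start at 0 instead of 1",
        files={"main.py": "for i in range(1, 10):\n    print(i)\n"},
        check=lambda files: "range(0, 10)" in files.get("main.py", ""),
    ),
    EvalCase(
        name="python-repl",
        prompt="Compute the 10th Fibonacci number and save it to fib.txt",
        check=lambda files: files.get("fib.txt", "").strip() == "55",
    ),
]
```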
I've run some smaller benchmarks (more like tests, really) and would like to continue this effort to map out capabilities and weak spots.
It would also be interesting to test gptme on codegen tasks vs gpt-engineer (see #62), such as the gpt-engineer suite and SWE-bench.