I improved the eval harness quite a bit in #90, among other changes (including a lot of Docker work).
I'm now 80% happy with the harness and am trying to think about how it could provide value for the project/community.
This includes which types of things to eval (shell scripting, complicated patches, Python REPL stuff), and which external evals we should try running gptme on (which would probably be a great learning opportunity to get experience with other evals). A rough sketch of what such cases could look like is below.
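To make the task types concrete, here's a minimal sketch of eval cases covering shell scripting, patching, and REPL-style tasks. This is purely illustrative: `EvalCase` and its fields are hypothetical names for this sketch, not the actual harness API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical structure for illustration only, not the real harness API.
@dataclass
class EvalCase:
    name: str
    prompt: str  # instruction given to the agent
    files: dict[str, str] = field(default_factory=dict)  # initial workspace files
    check: Callable[[dict[str, str]], bool] = lambda files: True  # pass/fail on resulting files

# One example case per task type mentioned above
cases = [
    EvalCase(
        name="shell-scripting",
        prompt="Write hello.sh that prints 'Hello, world!'",
        check=lambda files: "Hello, world!" in files.get("hello.sh", ""),
    ),
    EvalCase(
        name="patch",
        prompt="Change the loop in main.py to start at 0 instead of 1",
        files={"main.py": "for i in range(1, 10):\n    print(i)\n"},
        check=lambda files: "range(0, 10)" in files.get("main.py", ""),
    ),
    EvalCase(
        name="python-repl",
        prompt="Compute the 10th Fibonacci number and save it to fib.txt",
        check=lambda files: files.get("fib.txt", "").strip() == "55",
    ),
]
```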
I've run some smaller benchmarks (more like tests, really) and would like to continue this effort to map out capabilities and weak spots.
It would also be interesting to test gptme on codegen tasks vs gpt-engineer (see #62), such as the gpt-engineer suite and SWE-bench.