Initially we want to support two types of benchmark:

- A test setup benchmark, whose goal is to generate the scripts and environment needed to run a project's tests.
- A test coverage benchmark, whose goal is to generate additional tests that increase the coverage of a project's existing test suite.
The qualities that are measured in the test setup benchmark (a result record is sketched after this list):
- Is (the lack of) existing test tooling accurately determined, and in how many tokens?
- Is a working setup script and test invocation produced, how many lines are reported as covered, and in how many tokens? (Note: this result can be gamed, so it should be manually policed.)
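
A minimal sketch of what a per-run result for this benchmark could record. The field names are assumptions for illustration, not part of the benchmark definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TestSetupResult:
    """Per-run record for the test setup benchmark (hypothetical field names)."""
    tooling_assessment: str           # agent's description of the (missing) test tooling
    tooling_assessment_correct: bool  # judged against what the repository actually contains
    assessment_tokens: int            # tokens spent reaching that assessment
    setup_succeeded: bool             # did the generated setup script and test invocation run?
    covered_lines: Optional[int]      # lines reported as covered, if the run succeeded
    total_tokens: int                 # tokens spent overall; coverage claims are manually policed
```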
The qualities that are measured in the test coverage benchmark (a metrics record is sketched after this list):
1. Whether the agent successfully adds tests
2. How much test coverage it adds
3. How many modifications it makes to achieve that test coverage
4. How many tokens it spends to do so
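
As a sketch only, the four measured quantities could be captured in a record like the following (field names are assumptions, not prescribed anywhere in this document):

```python
from dataclasses import dataclass


@dataclass
class CoverageBenchmarkMetrics:
    """Per-run metrics for the test coverage benchmark; numbers refer to the list above."""
    tests_added: bool      # (1) did the agent successfully add tests?
    coverage_delta: float  # (2) coverage gained relative to the baseline run
    lines_changed: int     # (3) size of the git diff the agent produced
    tokens_spent: int      # (4) tokens consumed, as reported by the LLM proxy
```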
To get 1 and 2, we need to run the coverage command ourselves, so we need the workspace and the command. To get 3, we can count the changes in a git diff inside the workspace. To get 4, we can consult the LLM proxy.
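
A rough sketch of how those measurements could be taken, assuming a Python project measured with coverage.py and pytest, and a workspace that is a git checkout (the specific commands and file names are assumptions). Token accounting (metric 4) would come from the LLM proxy and is not sketched here:

```python
import json
import subprocess


def run_coverage(workspace: str) -> float:
    """Run the test suite under coverage.py and return the covered-line percentage (metrics 1 and 2)."""
    subprocess.run(["coverage", "run", "-m", "pytest"], cwd=workspace, check=True)
    subprocess.run(["coverage", "json", "-o", "coverage.json"], cwd=workspace, check=True)
    with open(f"{workspace}/coverage.json") as f:
        return json.load(f)["totals"]["percent_covered"]


def count_changed_lines(workspace: str, baseline_ref: str = "HEAD") -> int:
    """Count added + deleted lines between the original repo state and the agent's work (metric 3)."""
    out = subprocess.run(
        ["git", "diff", "--numstat", baseline_ref],
        cwd=workspace, capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t")
        if added != "-":  # binary files report "-" instead of a line count
            total += int(added) + int(deleted)
    return total
```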
An agent needs an environment in which to write code and run tools, so we use a workspace provider to establish a clean environment for the agent to do its work in. The agent also needs an LLM for creative output, so we use an LLM proxy to provide access to one. Running a benchmark then involves the following steps (a code sketch of the flow follows the list):
- Provide the LLM proxy
- Provision the git repository
- Provision a workspace with a copy of the git repository
- Run the coverage tool inside the workspace to establish a baseline
- Run the agent inside the workspace
- Run the coverage tool again to determine improvements
- Run a git diff between the original git repo and the version in the workspace to measure impact
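
A minimal sketch of how these steps could fit together, assuming hypothetical `WorkspaceProvider` and `LLMProxy` interfaces and reusing the `run_coverage` and `count_changed_lines` helpers sketched earlier (none of these names are fixed by this document):

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class LLMProxy(Protocol):
    """Hypothetical proxy interface: provides model access and token accounting."""
    def endpoint(self) -> str: ...
    def tokens_used(self) -> int: ...


class WorkspaceProvider(Protocol):
    """Hypothetical provider interface: yields a clean workspace holding a copy of the repo."""
    def provision(self, git_url: str) -> str: ...  # returns the workspace path
    def cleanup(self, workspace: str) -> None: ...


@dataclass
class BenchmarkResult:
    baseline_coverage: float
    final_coverage: float
    lines_changed: int
    tokens_spent: int


def run_benchmark(
    git_url: str,
    provider: WorkspaceProvider,
    proxy: LLMProxy,
    run_agent: Callable[[str, str], None],  # hypothetical: (workspace path, LLM endpoint) -> None
) -> BenchmarkResult:
    """Follows the steps listed above: provision, baseline, agent run, re-measure, diff."""
    workspace = provider.provision(git_url)       # workspace with a copy of the git repository
    try:
        baseline = run_coverage(workspace)        # coverage baseline before the agent runs
        run_agent(workspace, proxy.endpoint())    # the agent works inside the workspace
        final = run_coverage(workspace)           # coverage after the agent's changes
        changed = count_changed_lines(workspace)  # size of the diff against the original repo
        return BenchmarkResult(baseline, final, changed, proxy.tokens_used())
    finally:
        provider.cleanup(workspace)
```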