JosephDavidsonKSWH released this 07 Jun 14:19
What's Changed
- More comparisons for GPT-4o, Llama, Mixtral, and Gemini models.
- Added benchmarks with smaller memory spans.
- Added `-i` option to run a benchmark with isolated tests (i.e., a conversation with sequential, not interleaved, tests).
- General updates of evaluations to increase their robustness.
Full Changelog: v3-benchmark...v3.5-benchmark