Releases: GoodAI/goodai-ltm-benchmark
Benchmark 3.5
What's Changed
- More comparisons for GPT4o, Llama, Mixtral, and Gemini models.
- Added benchmarks with smaller memory spans.
- Added `-i` option to run a benchmark with isolated tests (i.e. a conversation with sequential, not interleaved, tests).
- General updates of evaluations to increase their robustness.
Full Changelog: v3-benchmark...v3.5-benchmark
Benchmark 3
What's Changed
- Standardized benchmark scoring: The scores for all examples of a dataset are normalised to the range 0 to 1. The max score on a benchmark is now the number of different datasets that are tested.
- Memory span based testing: A memory span for a benchmark denotes the maximum number of tokens from the first needle in a test to the final question. This is to more accurately inform users how large a context would need to be to fit the entire test into it. All tests show their final memory span in the detailed reports.
- Litellm integration to allow for more flexible LLM usage in future.
- Various fixes to datasets. Tests and their evaluations are much more reliable now, and represent a uniform standard for all agents to pass.
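The scoring and memory span definitions above can be illustrated with a minimal sketch. It is not the benchmark's actual code: the function names, the `(raw, max_raw)` input shape, and the choice to average example scores within a dataset are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def normalise(raw: float, max_raw: float) -> float:
    """Map a raw example score onto the range 0 to 1 (hypothetical helper)."""
    return raw / max_raw if max_raw > 0 else 0.0


def benchmark_score(results: Dict[str, List[Tuple[float, float]]]) -> float:
    """Sum of per-dataset scores, each assumed to be the mean of its
    normalised example scores. The maximum is therefore len(results)."""
    total = 0.0
    for examples in results.values():
        scores = [normalise(raw, max_raw) for raw, max_raw in examples]
        total += sum(scores) / len(scores) if scores else 0.0
    return total


def memory_span(token_counts: List[int], first_needle: int, final_question: int) -> int:
    """Tokens from the first needle message through the final question,
    inclusive, given per-message token counts (assumed input format)."""
    return sum(token_counts[first_needle:final_question + 1])


# Two datasets, so the best possible benchmark score is 2.
results = {"dataset_a": [(1, 1), (0, 1)], "dataset_b": [(2, 4)]}
total = benchmark_score(results)
span = memory_span([10, 20, 30, 40], first_needle=1, final_question=3)
```

Under these assumptions, an agent that passes every example of every dataset scores exactly the number of datasets, which makes totals comparable across benchmark configurations.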
Full Changelog: v2-benchmark...v3-benchmark
Benchmark 2
What's Changed
Tests
Two new tests that span the entire benchmark have been introduced.
- Restaurant. The agent must go along with the role of a customer at a restaurant. It must handle several challenges, such as ordering food, switching an order, and calling out any mix-up from the waiter.
- Spy meeting. The agent is contacted by three individuals at different times during the benchmark. Each individual gives the agent a cryptic message, which the agent must recall and correctly interpret at the very end.
Additionally, the delivery and evaluation of several tests have been significantly improved to enhance the agent’s engagement with, and understanding of, the different testing scenarios.
Features
- Dynamic tests. Tests can now be defined as a Python generator function, allowing them to react to any action from the agent and make the tests more realistic and alive.
- Percentage waits. Tests can now ask to be put on hold until a certain percentage of tests in the benchmark have finished. This feature is key for tests that span across the whole benchmark.
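As a rough sketch of the two features above, a dynamic test can be written as a generator that yields actions and receives the agent's replies. The `Say` and `WaitPercent` classes, `code_word_test`, and `run_test` are hypothetical names invented here; the benchmark's real interfaces may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Generator, Union


@dataclass
class Say:
    """Hypothetical action: send a message to the agent."""
    text: str


@dataclass
class WaitPercent:
    """Hypothetical action: pause until this percentage of tests has finished."""
    percent: float


Action = Union[Say, WaitPercent]


def code_word_test() -> Generator[Action, str, float]:
    reply = yield Say("Please remember the code word 'heron'. Reply OK when ready.")
    if "ok" not in reply.lower():
        # React to the agent: restate the information if it was not acknowledged.
        reply = yield Say("Just to confirm, the code word is 'heron'.")
    yield WaitPercent(50.0)
    reply = yield Say("What was the code word?")
    return 1.0 if "heron" in reply.lower() else 0.0


def run_test(test_fn: Callable[[], Generator[Action, str, float]],
             agent: Callable[[str], str]) -> float:
    """Toy runner: drives one generator test against an agent callable."""
    gen = test_fn()
    try:
        action = next(gen)
        while True:
            if isinstance(action, Say):
                reply = agent(action.text)
            else:
                reply = ""  # a real runner would run other tests while waiting
            action = gen.send(reply)
    except StopIteration as done:
        return done.value
```

The generator form keeps each test's state in ordinary local variables, so a test can branch on what the agent said without any explicit state machine.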
Full Changelog: v1-benchmark...v2-benchmark
v1.1 (beta) - Continuous tests, Resuming, and Chapterbreak
What's Changed
Single conversation testing
All testing is now performed over the span of a single conversation. This stresses the LTM more, as it will perform multiple versions of the same test in sequence without the memory being wiped clean manually by us. We tell it to forget something, and it will have to do so, or risk confusing old information with new information.
Resuming testing
When the testing process fails, it can now pick up right where it left off. All testing events are logged to a master log, which serves as the authoritative record of what has happened so far in the test suite. When the tests are resumed, this log is used to reset tests back to where they were in their scripts, and the process continues. See the runner readme for more details.
Agents are now broadly expected to be persistent. See the models readme for more details.
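The resume mechanism described above can be sketched as replaying a log to find where each test stopped. This is only an illustration: the JSON-lines layout, the `step_done` event type, and the `test_id` field are assumptions, not the actual master log format.

```python
import json
from typing import Dict, Iterable, List


def resume_positions(log_lines: Iterable[str]) -> Dict[str, int]:
    """Count the completed script steps per test from a hypothetical
    JSON-lines event log."""
    positions: Dict[str, int] = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event["type"] == "step_done":
            test_id = event["test_id"]
            positions[test_id] = positions.get(test_id, 0) + 1
    return positions


def remaining_steps(script: List[str], completed: int) -> List[str]:
    """Steps of a test script that still have to run after resuming."""
    return script[completed:]


log = [
    '{"type": "step_done", "test_id": "restaurant"}',
    '{"type": "step_done", "test_id": "restaurant"}',
    '{"type": "test_started", "test_id": "spy_meeting"}',
]
positions = resume_positions(log)
todo = remaining_steps(["greet", "order", "complain"], positions.get("restaurant", 0))
```

Treating the log as the single source of truth means the runner needs no separate checkpoint files; the position of every test is derived from the events alone.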
Datasets:
- The addition of the ChapterBreak dataset, a set of long texts (8k tokens) where your agent has to choose which continuation of the text is the correct one.
- Prospective memory generation produces correct tests more reliably.
- Instruction Recall tests no longer generate questions or instructions that an LLM could reasonably guess (e.g. no more questions like “What should you do to prepare a drone for its first flight?”).
Full Changelog: v1-benchmark...v1.1
Benchmark 1
The first release of the GoodAI-LTM benchmark with results and the blogpost.