Releases: GoodAI/goodai-ltm-benchmark

Benchmark 3.5

07 Jun 14:19
b69d945

What's Changed

  • More comparisons for GPT-4o, Llama, Mixtral, and Gemini models.
  • Added benchmarks with smaller memory spans.
  • Added the -i option to run a benchmark with isolated tests (i.e. a conversation with sequential, rather than interleaved, tests); see the sketch after this list.
  • General updates to the evaluations to increase their robustness.
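
As an illustration of the difference between the two modes, here is a minimal sketch with hypothetical function and variable names; the benchmark's actual scheduling is more involved:

```python
from itertools import chain, zip_longest

def isolated_order(test_scripts):
    # -i behaviour: run each test's turns back to back, one test at a time.
    return list(chain.from_iterable(test_scripts))

def interleaved_order(test_scripts):
    # Default behaviour: alternate turns from all tests within one conversation.
    rounds = zip_longest(*test_scripts)
    return [turn for rnd in rounds for turn in rnd if turn is not None]

scripts = [["A1", "A2"], ["B1", "B2", "B3"]]
print(isolated_order(scripts))     # ['A1', 'A2', 'B1', 'B2', 'B3']
print(interleaved_order(scripts))  # ['A1', 'B1', 'A2', 'B2', 'B3']
```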

Full Changelog: v3-benchmark...v3.5-benchmark

Benchmark 3

24 Apr 11:20
884a502

What's Changed

  • Standardised benchmark scoring: the scores for all examples of a dataset are normalised to between 0 and 1, so the maximum score on a benchmark is now the number of different datasets tested.
  • Memory-span-based testing: a benchmark's memory span denotes the maximum number of tokens from the first needle in a test to the final question. This more accurately tells users how large a context window would need to be to fit an entire test. All tests show their final memory span in the detailed reports. Scoring and memory span are both illustrated in the sketch after this list.
  • LiteLLM integration, to allow for more flexible LLM usage in the future.
  • Various fixes to datasets: tests and their evaluations are now much more reliable, and represent a uniform standard for all agents to pass.
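
A minimal sketch of both ideas, with hypothetical names, assuming each test reports a raw score, a maximum score, and the token positions of its first needle and final question:

```python
def normalised_score(raw_score, max_score):
    # Standardised scoring: each dataset contributes a value in [0, 1], so the
    # maximum benchmark score equals the number of datasets tested.
    return raw_score / max_score if max_score else 0.0

def memory_span(first_needle_token_idx, final_question_token_idx):
    # Memory span: tokens from the first needle to the final question, i.e. how
    # large a context window would need to be to hold the entire test.
    return final_question_token_idx - first_needle_token_idx

# Example: three datasets with raw/max scores of 3/4, 1/1 and 0/2.
benchmark_score = sum(normalised_score(r, m) for r, m in [(3, 4), (1, 1), (0, 2)])
print(benchmark_score)  # 1.75, out of a maximum of 3
```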

Full Changelog: v2-benchmark...v3-benchmark

Benchmark 2

13 Mar 16:12
c24776a

What's Changed

Tests

Two new tests have been introduced, both of which span the entire benchmark.

  • Restaurant. The agent must play the role of a customer at a restaurant, handling several challenges such as ordering food, switching an order, and calling out any mix-up from the waiter.

  • Spy meeting. The agent is contacted by three individuals at different times during the benchmark. Each individual gives the agent a cryptic message, which the agent must recall and correctly interpret at the very end.

Additionally, the delivery and evaluation of several tests have been significantly improved, with the purpose of enhancing the agent’s engagement with, and understanding of, the different testing scenarios.

Features

  • Dynamic tests. Tests can now be defined as Python generator functions, allowing them to react to the agent’s actions and making the tests more realistic and lifelike (see the sketch after this list).
  • Percentage waits. Tests can now ask to be put on hold until a certain percentage of the tests in the benchmark have finished. This feature is key for tests that span the whole benchmark.
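
A minimal sketch of how such a test might look, assuming a hypothetical API in which the test generator yields messages or wait requests and the runner sends the agent’s replies back into it:

```python
from dataclasses import dataclass

@dataclass
class WaitUntil:
    # Percentage wait: ask the runner to hold this test until the given
    # fraction of all tests in the benchmark has finished.
    percentage_finished: float

def restaurant_test():
    # A dynamic test as a generator: each yield hands a message to the agent,
    # and the agent's reply is received from the runner via send().
    yield WaitUntil(percentage_finished=0.5)
    reply = yield "Waiter: Here is your steak."  # deliberately the wrong dish
    if "salmon" in reply.lower():  # assumes the agent ordered salmon earlier
        yield "Waiter: Apologies, I will bring your salmon right away."
        return 1.0  # the agent called out the mix-up
    return 0.0
```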

Full Changelog: v1-benchmark...v2-benchmark

v1.1 (beta) - Continuous tests, Resuming, and ChapterBreak

23 Feb 14:55
9064184

What's Changed

Single conversation testing

All testing is now performed over the span of a single conversation. This stresses the LTM more: the agent runs through multiple versions of the same test in sequence without us manually wiping its memory. When we tell it to forget something, it must actually do so, or risk confusing old information with new.

Resuming testing

If the testing process fails, it can now pick up right where it left off. All testing events are written to a master log, which serves as the authoritative record of what has happened so far in the test suite. When testing is resumed, this log is used to reset each test to its position in its script, and the run continues. See the runner readme for more details.
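
A minimal sketch of the idea, assuming a hypothetical JSONL master log in which each event records a test ID and that test’s position in its script:

```python
import json
from pathlib import Path

LOG_PATH = Path("master_log.jsonl")  # hypothetical file name

def log_event(event: dict) -> None:
    # Append-only: the log is the authoritative record of what has happened.
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(event) + "\n")

def resume_positions() -> dict:
    # Replay the log to recover each test's last position in its script,
    # then continue the run from there instead of starting over.
    positions: dict = {}
    if LOG_PATH.exists():
        for line in LOG_PATH.read_text().splitlines():
            event = json.loads(line)
            positions[event["test_id"]] = event["script_index"]
    return positions
```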

Agents are now broadly expected to be persistent. See the models readme for more details.

Datasets

  • The addition of the ChapterBreak dataset: a set of long texts (8k tokens) for which your agent has to choose the correct continuation (see the sketch after this list).
  • Prospective memory generation produces correct tests more reliably.
  • Instruction Recall tests no longer generate questions or instructions that an LLM could reasonably guess (e.g. no more questions like “What should you do to prepare a drone for its first flight?”).
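
A minimal sketch of how a ChapterBreak item might be posed and scored, with hypothetical field names:

```python
# The agent reads a long excerpt and must pick the true continuation.
item = {
    "context": "<~8k tokens of preceding text>",
    "continuations": ["<candidate A>", "<candidate B>", "<candidate C>"],
    "correct_index": 1,
}

def score_chapterbreak(agent_choice: int, item: dict) -> float:
    return 1.0 if agent_choice == item["correct_index"] else 0.0
```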

Full Changelog: v1-benchmark...v1.1

Benchmark 1

14 Feb 10:02
ec78862

The first release of the GoodAI LTM Benchmark, along with results and the accompanying blog post.