Add "replace" to streaming challenge with new runbooks #301
This PR adds a new operation, "replace," to the options in the streaming benchmark challenge. It models the case where the embedding for a particular vector ID is updated, for example when a new embedding is computed to reflect small changes to the document behind that ID. The changes to the core repository are fairly small: the running and loading code now accept replace as an option, and it is included in the streaming groundtruth calculation script. All these changes are fully backwards-compatible with the original neurips23 runners.
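To illustrate the intended semantics of "replace", here is a minimal sketch using a brute-force index. The class `ToyStreamingIndex` and its method names are hypothetical and are not the repository's runner API; the sketch only assumes that a replace keeps the vector ID live while swapping its embedding, and that groundtruth is an exact nearest-neighbor search over the currently active vectors.

```python
import math


class ToyStreamingIndex:
    """Brute-force stand-in for a streaming index (hypothetical, not the benchmark API)."""

    def __init__(self):
        self.vectors = {}  # vector ID -> current embedding

    def insert(self, vec_id, embedding):
        if vec_id in self.vectors:
            raise KeyError(f"ID {vec_id} is already present")
        self.vectors[vec_id] = list(embedding)

    def delete(self, vec_id):
        del self.vectors[vec_id]

    def replace(self, vec_id, embedding):
        # Same ID, new embedding: the ID stays active, only its vector changes.
        if vec_id not in self.vectors:
            raise KeyError(f"ID {vec_id} is not present")
        self.vectors[vec_id] = list(embedding)

    def search(self, query, k):
        # Exact k-NN by Euclidean distance over the active set.
        ranked = sorted(self.vectors, key=lambda i: math.dist(self.vectors[i], query))
        return ranked[:k]


index = ToyStreamingIndex()
index.insert(0, [0.0, 0.0])
index.insert(1, [5.0, 5.0])

before = index.search([4.0, 4.0], k=1)  # ID 1 is closest
index.replace(0, [4.0, 4.0])            # ID 0 now carries a new embedding
after = index.search([4.0, 4.0], k=1)   # ID 0 is closest
```

The number of active IDs is the same before and after the replace, which is what distinguishes it from a delete followed by an insert of a fresh ID.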
We also contribute three new runbooks that use the new "replace" operation. The first is `simple_replace_runbook.yaml`, which is modeled after the existing simple runbook. The second, `random_replace_runbook.yaml`, is based on the `msturing-10m-clustered` dataset. It first inserts a prefix of the data in each cluster, then replaces part of that prefix with data from a randomly selected cluster. The third, `clustered_replace_runbook.yaml`, is also based on `msturing-10m-clustered`; it inserts a prefix of the data in each cluster, then replaces part of that prefix with reserved data points from the same cluster. The script for generating these runbooks is deterministic and is provided in `gen_replace_runbooks.py`.
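As a rough illustration of the clustered-replace pattern described above, the sketch below builds a list of steps deterministically from a fixed seed. It is not the contents of the generation script: the function name, the step field names (`tags_start`, `tags_end`, `ids_start`, `ids_end`), the fractions, and the interleaved search steps are all assumptions made for the example.

```python
import random


def gen_clustered_replace_steps(cluster_bounds, insert_frac=0.5, replace_frac=0.2, seed=0):
    """Hypothetical sketch: insert a prefix of each cluster, then replace part of
    that prefix with reserved (not yet inserted) points from the same cluster."""
    assert replace_frac <= insert_frac and insert_frac + replace_frac <= 1.0
    rng = random.Random(seed)  # fixed seed keeps the output deterministic
    steps = []

    # Phase 1: insert a prefix of every cluster.
    for start, end in cluster_bounds:
        prefix_end = start + int((end - start) * insert_frac)
        steps.append({"operation": "insert", "start": start, "end": prefix_end})
    steps.append({"operation": "search"})

    # Phase 2: visit clusters in a seeded random order and replace part of each prefix.
    order = list(range(len(cluster_bounds)))
    rng.shuffle(order)
    for c in order:
        start, end = cluster_bounds[c]
        size = end - start
        prefix_end = start + int(size * insert_frac)
        n_replace = int(size * replace_frac)
        steps.append({
            "operation": "replace",
            # IDs whose embeddings get overwritten (assumed field names).
            "tags_start": start,
            "tags_end": start + n_replace,
            # Reserved points from the same cluster that supply the new embeddings.
            "ids_start": prefix_end,
            "ids_end": prefix_end + n_replace,
        })
        steps.append({"operation": "search"})
    return steps


bounds = [(0, 100), (100, 200)]  # toy cluster boundaries, not the real dataset's
steps = gen_clustered_replace_steps(bounds)
```

Under the same assumptions, the random variant would differ only in drawing the source range from a randomly selected cluster instead of the same one.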