Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial implementation of Proximity Map #8408

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

nicktobey
Copy link
Contributor

This adds a new message/node type: Vector Index, and a corresponding prolly-tree-based map structure: the Proximity Map

The Vector Index message currently has a subset of the Prolly Map message. A different message type was chosen to prevent them from being accidentally confused: while a Vector Index has the same fields as a Prolly Map, and has similar computed properties, their invariants and iteration order are completely different, and algorithms written for Prolly Map nodes should not accidentally operate on Vector Index nodes instead. Old versions of Dolt that do not support vector indexes should not attempt to manipulate Vector Index nodes as if they were Prolly Map nodes.

Proximity Maps resemble other Prolly Maps, but have the following invariants:

  • Each key must be convertible to a vector. Typically, the key is a val.Tuple, and the vector is the first value in that tuple.
  • The keys are arranged in the tree such that, for each of a key's parent keys (the keys that appear on the path from the root to the key), the key is closer to that parent key than any of the parent key's siblings.
  • The keys in a node are sorted...
  • ...except for the first key which matches its direct parent. (This may prove to be unnecessary and could potentially be relaxed.)

Notably, while the keys of an individual node are sorted, walking all of a vector indexes keys in standard iteration order will not be sorted.

This is a useful construct because it allows for efficient proximity-based lookup, which are instrumental for quickly running "approximate nearest neighbor" algorithms.

Currently, chunk boundaries are computed completely deterministically. No rolling hash or weibull distribution is used to control chunk sizes. This was chosen because it has the simplest guarentees about chunk boundaries, which makes it easier to reason about the cost of fixup operations, and makes cascading chunk boundary changes impossible: for example, a key that marks a chunk boundary on the first two levels will always mark a chunk boundary on the first two levels even if it gets moved when fixing up the tree.

This has its downsides: the chunk sizes follow a geometric distribution with no maximum or minimum size. This may change in the future, but is acceptable as a first pass.

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
6b627d5 ok 5937457
version total_tests
6b627d5 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
91a5a5d ok 5937457
version total_tests
91a5a5d 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
2a75e85 ok 5937457
version total_tests
2a75e85 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
3756ef4 ok 5937457
version total_tests
3756ef4 5937457
correctness_percentage
100.0

@nicktobey nicktobey requested a review from reltuk October 4, 2024 23:57
@nicktobey
Copy link
Contributor Author

I refactored the algorithm for creating the map, outlining it into more helper functions and adding more documentation.

I also bumped go to 1.23 in order to use the new iter package. If this gets approved, we should make sure to roll out a similar bump to our other modules before merging.

There's a couple potential flaws here depending on exactly how ProximityMaps get used, and they both occur if two rows have the same vector value:

  1. The intermediatepathMaps store the original table key tuple as a bytestring. Depending on the field types, this may not have the same ordering as the original table, which may lead to ordering issues if two keys have the same vector.
  2. The pathMaps tables store for each row, a list of vector hashes corresponding to a path within the Proximity Tree, starting at the root. If the vectors are non-unique, we may need to store a list of secondary index keys instead. These would likely also get stored as a bytestring and may cause pathMaps to have a different iteration order than the original table.
  3. Currently the tree level of each vector is computed by hashing the entire secondary index key. This means that changing non-vector columns in the indexed table may cause the level of these vectors to change. This could be prevented if we only hash the vector. It would mean that two rows with the same vector always have the same maximum level in the Proximity Tree, but I'm unsure if this is a problem or not.

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
30f8aeb ok 5937457
version total_tests
30f8aeb 5937457
correctness_percentage
100.0

Copy link
Contributor

@reltuk reltuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything I looked at looks good. No strong opinions on the flatbuffer organization or the parameterization of the chunker factory. But I skipped all of prolly/proximity_map.go, prolly/tree/proximity_map.go, and deterministic_node_splitter.go for now, so I will need to follow up on those.

Comment on lines 179 to 181
if serial.ProllyTreeNodeNumFields < pm.Table().NumFields() {
return nil, fb.ErrTableHasUnknownFields
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by gen/fb/serial/vectorindexnode.go, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're right. I was copying the code from prolly_map.go, but this seems unnecessary.

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
9429fa9 ok 5937457
version total_tests
9429fa9 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
50dc53e ok 5937457
version total_tests
50dc53e 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
3e5b4d9 ok 5937457
version total_tests
3e5b4d9 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
d539f18 ok 5937457
version total_tests
d539f18 5937457
correctness_percentage
100.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants