-
Notifications
You must be signed in to change notification settings - Fork 103
Hash Join
This page currently describes a simple non-parallel hash table in PR#52 for near future integration with hash join.
The storage of a simple hash table is divided into two parts: tuple blocks (htBlocks
) and directory (htDirectory
).
Tuple blocks hold all materialized tuples, and the directory contains slots with pointers. Inside tuple blocks, each tuple reserves a pointer space pointing to a previous tuple within the same hash slot, in this way, we implement a chain-based collision handling.
For each tuple, the storage layout is: |key|payloads|prev|
.
Inside the tuple payloads, we might have variable-length items, whose content will be stored separately in the overflowBlocks
.
Hash (based on the join key) can happen on both a flat vector and a list. For the join key, we only consider equality of nodeID for now (no join on property values, and no multiple join conditions).
Query example for join on a flat vector: (a)->(b)->(c)->(d)->(e) WHERE a.id=1 AND e.id=1
. this is a bidirectional query, hash join happens on c.id. and the build side is the subquery (a)->(b)->(c)
. which starts from SCAN(a)
, then EXTEND(b)
and EXTEND(c)
. thus, c might be a list chunk.
Query example for join on a list: (a)->(b)->(c), (b)->(d)->(e)
. the hash join happens on b.id
, and the build side is the subquery (a)->(b)->(c)
, which starts from SCAN(a)
, then EXTEND(b)
. thus, b might be a flat chunk.
During materialization, there are two cases for a list chunk:
- If it's the key, we would duplicate tuples in build chunks;
- Else, each list vector is treated as a variable-sized value in the tuple.
TODO.
- Coding Style
- Casting Rules
- Frontend
- Processor
- Storage
- Test