Skip to content

Hash Join

Guodong Jin edited this page Mar 12, 2021 · 6 revisions

This page currently describes a simple non-parallel hash table in PR#52 for near future integration with hash join.

Hash Table Implementation

Storage Layout

The storage of a simple hash table is divided into two parts: tuple blocks (htBlocks) and directory (htDirectory). Tuple blocks hold all materialized tuples, and the directory contains slots with pointers. Inside tuple blocks, each tuple reserves a pointer space pointing to a previous tuple within the same hash slot, in this way, we implement a chain-based collision handling. For each tuple, the storage layout is: |key|payloads|prev|. Inside the tuple payloads, we might have variable-length items, whose content will be stored separately in the overflowBlocks.

The materialization of build side tuples

Hash (based on the join key) can happen on both a flat vector and a list. For the join key, we only consider equality of nodeID for now (no join on property values, and no multiple join conditions).

Query example for join on a flat vector: (a)->(b)->(c)->(d)->(e) WHERE a.id=1 AND e.id=1. this is a bidirectional query, hash join happens on c.id. and the build side is the subquery (a)->(b)->(c). which starts from SCAN(a), then EXTEND(b) and EXTEND(c). thus, c might be a list chunk.

Query example for join on a list: (a)->(b)->(c), (b)->(d)->(e). the hash join happens on b.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b). thus, b might be a flat chunk.

During materialization, there are two cases for a list chunk:

  • If it's the key, we would duplicate tuples in build chunks;
  • Else, each list vector is treated as a variable-sized value in the tuple.

Hash Join Implementation

TODO.

Clone this wiki locally