Skip to content

Hash Join

Guodong Jin edited this page Mar 22, 2021 · 6 revisions

Related PRs/Issues:

TODOs:

  • Join hash table
  • Single-thread join operator for INNER join type
  • Support of ANTI, SEMI, and OUTER join types
  • Multiple-threaded hash join operator

Build Hash Table

Storage Layout

The storage of a simple hash table is divided into two parts: tuple blocks (htBlocks) and directory (htDirectory). Tuple blocks hold all materialized tuples, and the directory contains slots with pointers. Inside tuple blocks, each tuple reserves a pointer space pointing to a previous tuple within the same hash slot, in this way, we implement a chain-based collision handling. For each tuple, the storage layout is: |key|payloads|prev|. Inside the tuple payloads, we might have variable-length items, whose content will be stored separately in the overflowBlocks.

The materialization of build side tuples

Hash (based on the join key) can happen on both a flat vector and a list. For the join key, we only consider equality of nodeID for now (no join on property values, and no multiple join conditions).

Query example for join on a flat vector: (a)->(b)->(c)->(d)->(e) WHERE a.id=1 AND e.id=1. this is a bidirectional query, hash join happens on c.id. and the build side is the subquery (a)->(b)->(c). which starts from SCAN(a), then EXTEND(b) and EXTEND(c). thus, c might be a list chunk.

Query example for join on a list: (a)->(b)->(c), (b)->(d)->(e). the hash join happens on b.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b). thus, b might be a flat chunk.

During materialization, there are two cases for a list chunk:

  • If it's the key, we would duplicate tuples in build chunks;
  • Else, each list vector is treated as a variable-sized value in the tuple.

Probe Hash Table

Clone this wiki locally