genesis

Table of Contents

  1. About
  2. Example Usage
  3. Architecture
    1. Memtable
    2. SSTable
      1. Compaction
    3. Write-Ahead-Log
    4. Distributed Architecture
  4. Complete Tree
  5. Benchmarks
    1. Full Tree
    2. Memtable
  6. Feature Checklist
  7. References

About

genesis is a distributed log-structured merge-tree (LSM tree) key-value store. The project was originally based on the Bitcask research paper (which describes a store that is neither distributed nor LSM-based), but it has since been expanded and redesigned. It was built purely for educational purposes.

Example Usage

Quick Start

Clone the repository, then navigate to the cmd/ folder and run go run main.go in your terminal.

By default, the system will open a cluster with 5 nodes (each a self-contained KV store). You can modify this in the main.go file:

c := store.NewCluster(5)
c.Open()

This starts an HTTP server on port :8080, which is what you use to put, get, or delete keys. The nodes themselves are hosted starting from port :11000 up to :11004 in the case of 5 nodes. However, keep in mind that the nodes do not accept HTTP requests directly, as that would bypass the consistent-hashing and routing layer; every request must go through port :8080.

Put, Get, Delete key-value pairs

To put, get, or delete key-value pairs, start genesis in one terminal and, in another terminal tab, use curl against :8080/key as follows:

curl -XPOST localhost:8080/key -d '{"user1": "batman", "user2": "superman", "user3": "captain america"}'

curl -XGET localhost:8080/key/user2
-> superman

curl -XDELETE localhost:8080/key/user3

On the other terminal tab, you will see print statements to confirm the operations:

key = user1	added @ node addr = :11004
key = user2	added @ node addr = :11001
key = user3	added @ node addr = :11003
deleted user3 @ node addr = :11003

Add additional nodes

To add additional nodes and actually see the data redistribution in action, we can first add 50 key-value pairs:

curl -XPOST localhost:8080/key -d '{
    "user1": "batman", "user2": "superman", "user3": "captain america",
    "user4": "ironman", "user5": "thor", "user6": "hulk",
    "user7": "black widow", "user8": "black panther", "user9": "spiderman",
    "user10": "scarlet witch", "user11": "vision", "user12": "doctor strange",
    "user13": "antman", "user14": "hawk eye", "user15": "winter soldier",
    "user16": "falcon", "user17": "loki", "user18": "nick fury",
    "user19": "gamora", "user20": "star lord", "user21": "drax",
    "user22": "rocket", "user23": "groot", "user24": "mantis",
    "user25": "wong", "user26": "shang chi", "user27": "kamala khan",
    "user28": "moon knight", "user29": "blade", "user30": "deadpool",
    "user31": "jessica jones", "user32": "luke cage", "user33": "iron fist",
    "user34": "the punisher", "user35": "daredevil", "user36": "x23",
    "user37": "quicksilver", "user38": "vision", "user39": "scarlet witch",
    "user40": "mister fantastic", "user41": "invisible woman", "user42": "human torch",
    "user43": "thing", "user44": "wolverine", "user45": "storm",
    "user46": "cyclops", "user47": "beast", "user48": "nightcrawler",
    "user49": "colossus", "user50": "gambit"
}'

And then add an additional node, which automatically triggers data redistribution over the wire using gRPC:

curl -XPOST localhost:8080/add-node

In the other terminal tab, you will see the operations being printed, including some lines that detail the data-migration requests.

Removing nodes

To remove a particular node, you can use the following curl command:

curl -XPOST localhost:8080/remove-node/11004

or more generally:
curl -XPOST localhost:8080/remove-node/<node_address>

To exit the entire system, simply press CTRL + C on your keyboard.

Architecture

An overview of the architecture, from the memtable implementation, to how data is stored on disk in the form of SSTables, and lastly how it's distributed.

LSM tree architecture. Source: “A Survey of LSM-Tree Based Indexes, Data Systems and KV-Stores,” arxiv.org/html/2402.10460v2.

Memtable (Red-Black Tree)

A red-black tree is a self-balancing binary search tree (BST) with a few distinct properties that keep the tree balanced and maintain efficient search, insertion, and deletion in O(log n) time.

In this LSM tree implementation, a red-black tree is used as the memtable—the in-memory portion of the database in which all incoming writes are temporarily stored before being flushed to disk. Flushing takes every record in the memtable and stores it in a Sorted String Table (SSTable) on disk, a process triggered when the memtable exceeds a defined size threshold. This batching approach is far more efficient, as it minimizes costly disk writes.

Because a BST keeps its elements in sorted order, a red-black tree is a natural fit for producing Sorted String Tables, and its self-balancing property keeps those operations fast.
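
As a rough illustration of how the memtable behaves (a sketch only, not genesis's actual API; the third-party emirpasic/gods red-black tree stands in for genesis's own implementation), Put tracks an approximate byte size to decide when to flush, and an in-order walk of the tree yields records already sorted by key:

package main

import (
	"fmt"

	rbt "github.com/emirpasic/gods/trees/redblacktree"
)

const flushThreshold = 1 << 20 // assumed 1 MiB threshold for this sketch

// Memtable wraps a red-black tree and tracks an approximate in-memory size.
type Memtable struct {
	tree *rbt.Tree
	size int
}

func NewMemtable() *Memtable {
	return &Memtable{tree: rbt.NewWithStringComparator()}
}

// Put inserts a key-value pair and reports whether the memtable should now
// be flushed to an SSTable on disk.
func (m *Memtable) Put(key, value string) (shouldFlush bool) {
	m.tree.Put(key, value)
	m.size += len(key) + len(value)
	return m.size >= flushThreshold
}

// Flush walks the tree in order; records come out already sorted by key.
func (m *Memtable) Flush() {
	it := m.tree.Iterator()
	for it.Next() {
		fmt.Printf("write %v=%v to the .data file\n", it.Key(), it.Value())
	}
}

func main() {
	m := NewMemtable()
	m.Put("user2", "superman")
	m.Put("user1", "batman")
	m.Flush() // prints user1 before user2: in-order traversal = sorted output
}

The in-order traversal is what makes the flush cheap: records come out in exactly the order the SSTable's data file needs, with no extra sort step.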

SSTable

An SSTable (Sorted String Table) is a file format used for storing key-value pairs in a sorted order. It is commonly used in systems like LevelDB and Bigtable for efficient data storage in key-value stores.

Components:

  • Data: all the key-value pairs in sorted order (by key) in a .data file
  • Sparse index: index structure that stores a subset of the original keys and their corresponding byte offsets in the data file
  • Bloom filter: a probabilistic data structure used to test the membership of a given key in the SSTable

SSTables are persisted to disk immediately when created, and each table is represented by three files:

  • <sst_num>.data
  • <sst_num>.index
  • <sst_num>.bloom

To look up a key, the system first checks the memtable to see whether the key is still in memory and hasn't been flushed yet. If the key is not present in the memtable, the SSTables on disk are searched. The general process to find a key on disk is the following (a sketch follows this list):

  • Use the bloom filter to check whether the key may be present in a given SSTable
  • If it may be present, binary-search the sparse index for the greatest indexed key that is less than or equal to the target, and take its byte offset
  • Scan the key-value pairs starting from that offset until either 1) the key is found or 2) the scan moves past where the key would appear
  • Repeat the process on subsequent SSTables until the target key is found
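
Below is a minimal sketch of that lookup path for a single SSTable. All names are hypothetical, and a record index stands in for the real byte offset into the .data file:

package main

import (
	"fmt"
	"sort"
)

// indexEntry is one sparse-index entry: every Nth key and its position in the
// data file (a record index here stands in for the real byte offset).
type indexEntry struct {
	key string
	pos int
}

type record struct{ key, value string }

// get looks up key in a single SSTable. bloomMayContain stands in for the
// real bloom filter's membership test.
func get(bloomMayContain func(string) bool, index []indexEntry, data []record, key string) (string, bool) {
	// 1) Bloom filter: a negative answer means the key is definitely absent.
	if !bloomMayContain(key) {
		return "", false
	}
	// 2) Binary search the sparse index for the greatest entry <= key.
	i := sort.Search(len(index), func(i int) bool { return index[i].key > key })
	if i == 0 {
		return "", false // key sorts before everything in this table
	}
	// 3) Scan forward from that position until the key is found or passed.
	for _, r := range data[index[i-1].pos:] {
		switch {
		case r.key == key:
			return r.value, true
		case r.key > key:
			return "", false // moved past where the key would appear
		}
	}
	return "", false
}

func main() {
	data := []record{{"user1", "batman"}, {"user2", "superman"}, {"user3", "captain america"}}
	index := []indexEntry{{"user1", 0}, {"user3", 2}} // every 2nd key indexed
	allow := func(string) bool { return true }        // permissive stand-in filter
	fmt.Println(get(allow, index, data, "user2"))     // -> superman true
}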

Compaction

To improve overall performance and efficiency, genesis implements a size-tiered compaction strategy based on Apache Cassandra's. This process merges the multiple tables found within a bucket into a single larger table containing only the most recent entries. Essentially, it removes all outdated entries, performs garbage collection, and frees up disk space.

Compaction is automatically triggered when the memtable reaches a defined byte threshold.
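
The sketch below shows only the merge semantics in highly simplified form. The real compaction streams the sorted files bucket by bucket; here a map keeps the example short, and the tombstone constant is invented for illustration:

package main

import (
	"fmt"
	"sort"
)

const tombstone = "__TOMBSTONE__" // hypothetical deletion marker

// compact merges tables (ordered oldest to newest) into a single table.
// Newer entries overwrite older ones, and tombstoned keys are dropped,
// which is where deleted data is actually garbage-collected.
func compact(tables []map[string]string) []string {
	merged := map[string]string{}
	for _, t := range tables { // later (newer) tables overwrite earlier ones
		for k, v := range t {
			merged[k] = v
		}
	}
	keys := make([]string, 0, len(merged))
	for k, v := range merged {
		if v == tombstone {
			continue // deletion becomes permanent here
		}
		keys = append(keys, k)
	}
	sort.Strings(keys) // the output SSTable must stay sorted by key
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, k+"="+merged[k])
	}
	return out
}

func main() {
	older := map[string]string{"user1": "batman", "user3": "captain america"}
	newer := map[string]string{"user1": "bruce wayne", "user3": tombstone}
	fmt.Println(compact([]map[string]string{older, newer})) // -> [user1=bruce wayne]
}

The key point is that newer tables win ties and tombstoned keys are dropped, which is the moment deleted data actually leaves the disk.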

Write-Ahead-Log

genesis supports write-ahead logging (WAL) to improve durability and serve as a crash-recovery mechanism in the event of a fault or crash. On each operation (put, get, delete), metadata such as the operation type, along with the key and value, is appended to an auto-generated .log file, which can be used to reconstruct the state of the tree after a crash.
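
A minimal sketch of the idea (the line-based format and function names are illustrative, not genesis's actual encoding): each operation is appended and fsynced before it is applied in memory, and on startup the log is replayed to rebuild state:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// appendWAL writes one operation to the log before it is applied in memory.
func appendWAL(f *os.File, op, key, value string) error {
	if _, err := fmt.Fprintf(f, "%s\t%s\t%s\n", op, key, value); err != nil {
		return err
	}
	return f.Sync() // fsync so the record survives a crash
}

// replayWAL rebuilds in-memory state from the log after a crash.
func replayWAL(path string, apply func(op, key, value string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), "\t", 3)
		if len(parts) == 3 {
			apply(parts[0], parts[1], parts[2])
		}
	}
	return sc.Err()
}

func main() {
	f, err := os.OpenFile("genesis.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	_ = appendWAL(f, "PUT", "user1", "batman")
	f.Close()
	_ = replayWAL("genesis.log", func(op, k, v string) { fmt.Println("replaying:", op, k, v) })
}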

Complete Tree

The complete tree is the seamless combination of the memtable and SSTable components; together they provide an in-memory component and a disk component. LSM trees were designed to emphasize write performance, and genesis reflects that emphasis.

genesis supports the following operations:

  • Put(key, value)
  • Get(key)
  • Delete(key)

It is important to note that genesis uses tombstone-based garbage collection. When an existing key is deleted, a tombstone marker is simply appended to the record's header and the record is re-added to the memtable (which will eventually be flushed to disk). The actual deletion happens during SSTable compaction.
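
As a sketch of what a tombstoned record might look like on disk (the layout below is invented for illustration and is not genesis's actual header format), the header carries a CRC for integrity plus a tombstone flag, so a delete is just a put with that flag set:

package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// encodeRecord uses an illustrative on-disk layout:
// crc (4) | tombstone (1) | keyLen (4) | valLen (4) | key | value
func encodeRecord(key, value string, tombstone bool) []byte {
	buf := make([]byte, 13+len(key)+len(value))
	if tombstone {
		buf[4] = 1
	}
	binary.LittleEndian.PutUint32(buf[5:9], uint32(len(key)))
	binary.LittleEndian.PutUint32(buf[9:13], uint32(len(value)))
	copy(buf[13:], key)
	copy(buf[13+len(key):], value)
	// CRC over everything after the checksum field guards data integrity.
	binary.LittleEndian.PutUint32(buf[0:4], crc32.ChecksumIEEE(buf[4:]))
	return buf
}

func main() {
	// A delete is a put of the same key with the tombstone bit set; the key
	// physically disappears later, during compaction.
	rec := encodeRecord("user3", "", true)
	fmt.Printf("tombstone record: % x\n", rec)
}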

Distributed Architecture

This key-value store is made distributed through data partitioning and sharding. Each node of the system is a self-contained key-value store (i.e., a shard) that holds a subset of the overall data and is hosted on a separate port. Any time a client inserts a key-value pair into the system, the request goes through a routing layer that uses consistent hashing and is then forwarded to the correct node—thus achieving consistent data partitioning.
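
A minimal consistent-hashing sketch, assuming FNV hashing and no virtual nodes (genesis's real ring may differ), shows how a key such as user2 is routed to one of the node addresses:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring: node addresses are hashed onto a
// circle, and a key is routed to the first node clockwise from its hash.
type ring struct {
	hashes []uint32
	nodes  map[uint32]string
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(addrs ...string) *ring {
	r := &ring{nodes: map[uint32]string{}}
	for _, a := range addrs {
		h := hashOf(a)
		r.hashes = append(r.hashes, h)
		r.nodes[h] = a
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// route returns the node responsible for key.
func (r *ring) route(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the circle
	}
	return r.nodes[r.hashes[i]]
}

func main() {
	r := newRing(":11000", ":11001", ":11002", ":11003", ":11004")
	fmt.Println("user2 ->", r.route("user2"))
}

The appeal of consistent hashing is that adding or removing a node only affects the keys whose hashes fall between that node and its neighbor on the ring, so rebalancing touches only a small fraction of the data.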

genesis supports many concurrent nodes and uses gRPC for inter-node communication, primarily for data rebalancing (triggered when a node is added or removed). Each node is wrapped in a gRPC server to accept incoming requests, and gRPC clients are spawned (to set up traditional client-server communication) only when data migration is needed.
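
A sketch of the rebalancing step under those assumptions: after the ring changes, only keys whose owner changed are shipped to their new node. The routeBefore, routeAfter, and send functions below are stand-ins for the consistent-hashing lookup and the gRPC migration call:

package main

import "fmt"

// rebalance illustrates the data-migration step: after the ring changes,
// every key whose owner changed is handed to its new node.
func rebalance(data map[string]string,
	routeBefore, routeAfter func(key string) string,
	send func(node, key, value string)) {
	for k, v := range data {
		oldNode, newNode := routeBefore(k), routeAfter(k)
		if oldNode != newNode {
			send(newNode, k, v) // migrate only the keys that actually moved
		}
	}
}

func main() {
	data := map[string]string{"user1": "batman", "user2": "superman"}
	before := func(string) string { return ":11000" }
	after := func(k string) string {
		if k == "user2" {
			return ":11005" // pretend user2 now hashes to the newly added node
		}
		return ":11000"
	}
	rebalance(data, before, after, func(node, k, v string) {
		fmt.Printf("migrating %s to node %s via gRPC\n", k, node)
	})
}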

A high-level overview of the overall design is shown in the genesis architecture diagram.

Benchmarks

Each benchmark is run 3 times and the average is shown below.
The operation count is dynamically allocated for some of the benchmarks.

Full Tree

  • Put: insert 1,000,000 distinct kv pairs into the tree
  • Get: get 1 key from an SSTable with 1,000,000 kv pairs
goos: darwin
goarch: arm64
cpu: Apple M3 Pro

BenchmarkDiskStore_Put-12    	 1000000	     9740 ns/op	         104464 ops/s
BenchmarkDiskStore_Get-12    	13266362	    77.84 ns/op        12846082 ops/s

Memtable

  • Put: insert 1,000,000 distinct kv pairs into the memtable
  • Get: get 1 key from a memtable with 1,000,000 kv pairs
goos: darwin
goarch: arm64
cpu: Apple M3 Pro

BenchmarkMemtable_Put-12    	  1000000	      9608 ns/op	    104081 ops/s
BenchmarkMemtable_Get-12    	 11267596	     105.7 ns/op	   9464848 ops/s
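
For reference, a Go benchmark of this shape (saved as a _test.go file and run with go test -bench=. -count=3) produces ns/op numbers like those above. The memtable type here is a trivial map-backed stand-in so the snippet compiles on its own; it is not genesis's actual benchmark code:

package store

import (
	"fmt"
	"testing"
)

// memtable is a map-backed stand-in so this file compiles by itself;
// the real benchmarks exercise genesis's red-black-tree memtable.
type memtable struct{ m map[string]string }

func newMemtable() *memtable        { return &memtable{m: map[string]string{}} }
func (t *memtable) Put(k, v string) { t.m[k] = v }

func BenchmarkMemtablePut(b *testing.B) {
	mt := newMemtable()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		mt.Put(fmt.Sprintf("user%d", i), "batman")
	}
}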

Feature Checklist

Features in the works:

  • Serialize/Deserialize header + key, value
  • Store data on disk
  • Support Put(Key, Value)
  • Support Get(Key)
  • Support Delete(Key)
    • Tombstone-based garbage collection
  • Ensure data integrity
    • Implement cyclic redundancy checks
  • Convert implementation to Log-Structured Merge Tree
    • Swap out keydir with red-black tree memtable
      • Implement red-black tree
    • Write-ahead-logging (WAL)
      • Create WAL file and write to it after store operations
      • Reconstruct memtable with WAL in case of crash
    • Implement SSTables
      • Flush memtable to data file in sorted order
        • Conditional flushing (size threshold)
      • Index file
      • Bloom filter
      • Multiple levels (higher levels store larger, compacted tables)
      • Get(key) operation on tables (disk)
      • Size-Tiered Compaction (based on Apache Cassandra's strategy)
  • Make this distributed
    • Data partitioning (sharding)
      • Implement consistent hashing
    • Implement key routing
    • Set up gRPC, Protobuf for inter-node communication
    • Automatically rebalance data among all nodes when additional nodes are added/removed
    • Fault tolerance
      • Data is automatically migrated to active nodes upon node failure
      • Can optionally implement replication and Raft on top of that

Extra:

  • Generic key/value support (currently limited to strings)

References