Optimizing low level storage #4326
-
Some random thoughts:
-
Have you benchmarked sled for your use case?
-
Introduction
Currently, nearcore uses RocksDB to store its data, including the blockchain state. The performance of RocksDB is suboptimal for several reasons.
So, the idea is to design a data store optimized for the NEAR use case. When designing a store, there are multiple factors to consider.
Because nearcore performs random accesses and always looks up values by an exact key (no need to find the first key greater or less than the given value), data structures based on hash tables are preferred to those based on trees and sorted lists.
Existing implementations
There is a small number of hashtable-based key-value datastores. Some of them are listed below.
For our use case, the store needs to be able to not only add, but also modify and delete entries. This is needed for garbage collection, and especially if we switch to using state snapshots, similar to Ethereum. Also, most existing implementations are written in C/C++ rather than Rust.
Storage format
The storage is based on a hash table. Because most storage devices read and write data in blocks of 4 KiB, it doesn't make much sense to have hash table buckets that are smaller (but it can make sense to have larger buckets, as this can enable a higher load factor). If the total size of the key-value pairs that map to a bucket is small enough, they can be stored directly in that bucket, so that only one I/O operation is needed to read them. If the data doesn't fit (in this case, the bucket is called overfull), it needs to be stored somewhere else. There are two approaches: distribute the overflowing data across the free space of other buckets, or store it in a separate extension area and keep only pointers to it in the bucket.
The first approach seems easier to implement, but it is not efficient. The data is split into pieces which are significantly smaller than the size of a bucket, yet an entire bucket needs to be read to retrieve each piece. Also, a separate random access is needed to read/write each piece. So, both the read/write amplification and the data locality are poor. In contrast, with the second approach even large values can be stored contiguously (or almost contiguously) and with little overhead.
In NEAR, there are both small values (the majority) and large values (such as contract code). The presence of a large value in a bucket should preferably not impact the performance of accesses to the small values. So, even if a bucket is overfull, it makes sense to pack as many small key-value pairs as possible directly into the bucket. For the pairs that don't fit, keep (truncated key hash, data pointer) pairs in the bucket and the actual data in the extension area. This way, negative queries can still be answered efficiently, and positive queries only have to read the data they need (except for the rare case when there is a collision on the truncated key hash). In the unlikely case that even the list of (truncated key hash, data pointer) pairs doesn't fit into the bucket, the remainder of the list is also stored in the extension area.
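To make this concrete, here is a minimal sketch of what such a bucket layout could look like in Rust. All names, field widths, and the choice of which hash bits to truncate are illustrative assumptions, not a finalized format:

```rust
const BUCKET_SIZE: usize = 4096; // one device block per bucket

/// Entries inside a bucket come in two forms: small pairs stored
/// inline, and references to data spilled into the extension area.
enum Entry {
    /// Small key-value pair packed directly into the bucket. With a
    /// cryptographically strong salted hash, the key itself need not
    /// be stored (see "Algorithmic complexity attacks" below).
    Inline { key_hash: u64, value: Vec<u8> },
    /// A pair too large to fit inline: only a truncated key hash and
    /// a pointer into the extension area stay in the bucket.
    Spilled { truncated_hash: u32, ext_offset: u64, ext_len: u32 },
}

/// Look up a key hash among a bucket's decoded entries. A miss is
/// answered from the bucket alone; a `Spilled` hit costs one extra
/// read, except in the rare truncated-hash collision case.
fn lookup(entries: &[Entry], key_hash: u64) -> Option<&Entry> {
    entries.iter().find(|e| match e {
        Entry::Inline { key_hash: h, .. } => *h == key_hash,
        Entry::Spilled { truncated_hash, .. } => {
            // Hypothetical choice: the truncated hash is the top 32 bits.
            *truncated_hash == (key_hash >> 32) as u32
        }
    })
}
```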
A possible optimization to the handling of overfull buckets is to let them spill into the immediately following bucket (but not any further). When performing a lookup, the code should read two buckets at once, so that no extra I/O requests will be needed to handle this case. This can make it possible to increase the load factor while keeping the probability that a request will need to access the extension area small. However, it needs to be tested to see how well it compares to just doubling the bucket size.
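A sketch of such a paired read, assuming a Unix platform (`read_exact_at` comes from `std::os::unix::fs::FileExt`); the wrap-around at the last bucket is ignored for brevity:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at: positioned read, no seek

const BUCKET_SIZE: usize = 4096;

/// Read bucket `i` together with the immediately following bucket in a
/// single I/O request, so that a spill into the next bucket costs no
/// extra round trip.
fn read_bucket_pair(table: &File, i: u64) -> io::Result<[u8; 2 * BUCKET_SIZE]> {
    let mut buf = [0u8; 2 * BUCKET_SIZE];
    table.read_exact_at(&mut buf, i * BUCKET_SIZE as u64)?;
    Ok(buf)
}
```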
Managing the extension area
A problem with the extension area is that space in it needs to be allocated and freed somehow. Here are some possible approaches.
Binary allocator
The entire extension area is split into fixed-size blocks, which are the leaves of a perfect binary tree: pairs of adjacent leaves are merged into nodes 2× the size of a leaf, pairs of those nodes are merged into 4× nodes, and so on. To allocate some space, allocate the smallest node that is large enough (or distribute the data among multiple nodes to reduce the wasted space). When a node is freed, it is merged with adjacent free nodes, where possible, to form larger free nodes. All free nodes of the same size form a linked list, so it is easy to find a free node of the right size.
It is possible that there are multiple small free nodes that cannot be merged because they are not adjacent. Given two free nodes of the same size that are not adjacent, it is possible to swap one of them with an allocated node adjacent to the other (moving that node's data), after which the two free nodes are adjacent and can be merged.
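This scheme is essentially a classic buddy allocator. A minimal in-memory sketch, with offsets measured in units of the smallest block (persistence and the node-swapping optimization above are left out):

```rust
/// In-memory free lists for a buddy allocator over the extension area.
struct BuddyAllocator {
    /// free[k] holds the offsets of free nodes of size 2^k blocks.
    free: Vec<Vec<u64>>,
}

impl BuddyAllocator {
    /// Create an allocator over an area of 2^total_order blocks.
    fn new(total_order: usize) -> Self {
        let mut free = vec![Vec::new(); total_order + 1];
        free[total_order].push(0); // the whole area is one free node
        BuddyAllocator { free }
    }

    /// Allocate a node of 2^order blocks, splitting a larger node if
    /// necessary. Returns the node's offset, or None if out of space.
    fn alloc(&mut self, order: usize) -> Option<u64> {
        // Find the smallest free node that is large enough.
        let k = (order..self.free.len()).find(|&k| !self.free[k].is_empty())?;
        let offset = self.free[k].pop().unwrap();
        // Split it down to the requested size; the upper halves become
        // free nodes of decreasing size.
        for j in (order..k).rev() {
            self.free[j].push(offset + (1 << j));
        }
        Some(offset)
    }

    /// Free a node, merging it with its buddy as long as the buddy is
    /// also free, to rebuild larger nodes.
    fn dealloc(&mut self, mut offset: u64, mut order: usize) {
        while order + 1 < self.free.len() {
            let buddy = offset ^ (1 << order);
            match self.free[order].iter().position(|&o| o == buddy) {
                Some(pos) => {
                    self.free[order].swap_remove(pos);
                    offset = offset.min(buddy);
                    order += 1;
                }
                None => break,
            }
        }
        self.free[order].push(offset);
    }
}
```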
Append-only files
It is possible to manage the extension area as a set of append-only files. Data is appended to the current file until it reaches a certain size, after which a new file is started. When the data is no longer used, it becomes "dead space". When the fraction of dead space in a file exceeds a certain threshold (e.g. 50%), the file is "compacted": any "live" data in this file is written again (to the current file), and the old file is deleted.
Instead of having multiple files, it is possible to use just one large file. Compaction can then be performed in place, incrementally. Also, it may be possible to write some of the new values directly into the dead space, so that compaction is not needed as often.
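A sketch of the accounting that could drive such compaction; the record index needed to locate the live data during compaction is omitted, and the 50% threshold follows the suggestion above:

```rust
/// Per-file accounting for an append-only extension file.
struct SegmentStats {
    total_bytes: u64, // everything ever appended to this file
    dead_bytes: u64,  // bytes belonging to deleted/overwritten records
}

impl SegmentStats {
    /// Called when a record in this file is deleted or superseded.
    fn mark_dead(&mut self, len: u64) {
        self.dead_bytes += len;
    }

    /// Compact once more than half of the file is dead space: the live
    /// records are appended to the current file and this file deleted.
    fn needs_compaction(&self) -> bool {
        self.total_bytes > 0 && self.dead_bytes * 2 > self.total_bytes
    }
}
```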
Separate files
It is possible to just delegate the task to the file system by storing each key-value pair in the extension area as a separate file. This can be efficient for large values; as long as only large values are stored this way, the total number of files stays manageable.
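A sketch of this scheme, with each value's file named after its hex-encoded key hash (an illustrative naming choice):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Store one large value as its own file in the extension directory.
fn write_large_value(dir: &Path, key_hash: u64, value: &[u8]) -> io::Result<()> {
    fs::write(dir.join(format!("{key_hash:016x}")), value)
}
```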
Additional considerations
Algorithmic complexity attacks
To prevent an adversary from crafting hash collisions, it is important to use salted hashing. If the hash function is cryptographically strong, it is possible to store just the hash, without the key itself.
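One possible implementation sketch, using the `blake3` crate's keyed mode as the cryptographically strong hash (an assumption; any keyed cryptographic hash would do). The salt is generated once when the store is created and persisted with the configuration, so an adversary cannot predict which bucket a key maps to:

```rust
/// Salted (keyed) hash of a key. Because the hash is cryptographically
/// strong, the 32-byte output can be stored instead of the key itself.
fn key_hash(salt: &[u8; 32], key: &[u8]) -> [u8; 32] {
    *blake3::keyed_hash(salt, key).as_bytes()
}

/// Derive a bucket index from the hash (here: its first 8 bytes).
fn bucket_index(hash: &[u8; 32], num_buckets: u64) -> u64 {
    u64::from_le_bytes(hash[..8].try_into().unwrap()) % num_buckets
}
```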
Resizing the hash table
To resize the hash table incrementally, it is possible to use an approach similar to linear hashing.
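A minimal sketch of the linear-hashing address computation: the table grows one bucket at a time, and only the entries of the bucket being split have to be rehashed. Names are illustrative:

```rust
/// Address computation for a linearly-growing hash table. The table
/// has (1 << level) + split buckets.
struct LinearTable {
    level: u32, // current doubling round
    split: u64, // next bucket to be split in this round
}

impl LinearTable {
    /// Map a (salted) key hash to a bucket index.
    fn bucket_of(&self, hash: u64) -> u64 {
        let i = hash % (1 << self.level);
        if i < self.split {
            // This bucket was already split in the current round, so
            // one more bit of the hash decides between the two halves.
            hash % (1 << (self.level + 1))
        } else {
            i
        }
    }

    /// Grow the table by one bucket. Only the entries of bucket
    /// `self.split` need to be rehashed (into buckets `split` and
    /// `split + 2^level`), which is what makes the resize incremental.
    fn grow(&mut self) {
        self.split += 1;
        if self.split == (1 << self.level) {
            // Round complete: the table size has doubled.
            self.level += 1;
            self.split = 0;
        }
    }
}
```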
Data consistency
To ensure data consistency in case of a process or system crash, journaling can be used: all data is first written to the journal and only then to its final location. This doubles the write amplification, although when writing large values it may be possible to avoid writing the bulk of the data to the journal.
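A sketch of the write path under this discipline, assuming a simple length-prefixed record format (checksums, destination offsets, and crash recovery are omitted):

```rust
use std::fs::File;
use std::io::{self, Write};

/// Append one update record to the journal and make it durable before
/// the caller touches the final location.
fn journal_append(journal: &mut File, record: &[u8]) -> io::Result<()> {
    journal.write_all(&(record.len() as u64).to_le_bytes())?;
    journal.write_all(record)?;
    // Force the record to stable storage; only after this may the data
    // be written in place. After a crash, recovery replays the journal.
    journal.sync_data()
}
```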
File structure
The simplest approach may be to store the data in three files: one for the hash table, one for the extension area and one for the journal. Depending on how the extension area is organized, it may use multiple files. Also, an additional file may be used for the configuration.
It is also possible to combine everything into a single file. The configuration data will probably go first, then the hash table, then the extension area, then the journal. When the hash table grows, the extension area will need to be moved. The journal can be discarded between transactions, so it makes sense to have it at the end.
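For illustration, a header for such a single-file layout might look like the following; the field names and sizes are assumptions, not a specification:

```rust
/// Hypothetical header for the single-file layout. The regions appear
/// in the order discussed above: configuration, hash table, extension
/// area, and the journal at the end of the file.
#[repr(C)]
struct FileHeader {
    magic: [u8; 8],      // format identifier and version
    salt: [u8; 32],      // salt for the keyed hash (see above)
    hash_table_off: u64, // start of the hash table region
    hash_table_len: u64, // growing it forces the extension area to move
    extension_off: u64,  // start of the extension area
    extension_len: u64,
    journal_off: u64,    // the journal can be discarded between transactions
}
```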
Instead of using a file, it is possible to directly use a block device. This will bypass the additional translation/indirection layer that is the file system. In this case, it makes sense to have the hash table at the beginning of the device and the extension area at the end, so that they grow towards each other.
Updates