Optimizing low level storage #4326
-
Some random thoughts:
-
Have you benchmarked sled for your use case?
-
Introduction
Currently, nearcore uses RocksDB to store its data, including the blockchain state. The performance of RocksDB is suboptimal for several reasons.
So, the idea is to design a data store optimized for the NEAR use case. When designing a store, there are multiple factors to consider.
Because nearcore performs random accesses and always looks up values by an exact key (no need to find the first key greater or less than the given value), data structures based on hash tables are preferred to those based on trees and sorted lists.
Existing implementations
There is a small number of hashtable-based key-value datastores. Some of them are listed below.
For our use case, the store needs to be able to not only add, but also modify and delete entries. This is needed for garbage collection, and especially if we switch to using state snapshots, similar to Ethereum. Also, most existing implementations are written in C/C++ rather than Rust.
Storage format
The storage is based on a hash table. Because most storage devices read and write data in blocks of 4 KiB, it doesn't make much sense to have hash table buckets that are smaller (but it can make sense to have larger buckets, as this can enable a higher load factor). If the total size of the key-value pairs that map to a bucket is small enough, they can be stored directly in that bucket, so that only one I/O operation is needed to read them. If the data doesn't fit (in this case, the bucket is called overfull), it needs to be stored somewhere else. There are two approaches: distribute the overflowing data across the free space of other buckets, or store it in a separate extension area and keep only pointers to it in the bucket.
The first approach seems easier to implement, but it is not efficient. The data is split into pieces which are significantly smaller than the size of a bucket, yet an entire bucket needs to be read to retrieve each piece. Also, a separate random access is needed to read/write each piece. So, both the read/write amplification and the data locality are poor. In contrast, with the second approach even large values can be stored contiguously (or almost contiguously) and with little overhead.
In NEAR, there are both small values (the majority) and large values (such as contract code). The presence of a large value in a bucket should preferably not impact the performance of accesses to the small values. So, even if a bucket is overfull, it makes sense to pack as many small key-value pairs as possible directly into the bucket. For the pairs that don't fit, keep (truncated key hash, data pointer) pairs in the bucket and the actual data in the extension area. This way, negative queries can still be answered efficiently, and positive queries only have to read the data they need (except for the rare case when there is a collision on the truncated key hash). In the unlikely case that even the list of (truncated key hash, data pointer) pairs doesn't fit into the bucket, the remainder of the list is also stored in the extension area.
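To make this concrete, here is a minimal sketch of what such a bucket layout could look like in Rust. All names, field widths, and the choice of which hash bits to truncate are illustrative assumptions, not a finalized format:

```rust
const BUCKET_SIZE: usize = 4096; // one device block per bucket

/// Entries inside a bucket come in two forms: small pairs stored
/// inline, and references to data spilled into the extension area.
enum Entry {
    /// Small key-value pair packed directly into the bucket. With a
    /// cryptographically strong salted hash, the key itself need not
    /// be stored (see "Algorithmic complexity attacks" below).
    Inline { key_hash: u64, value: Vec<u8> },
    /// A pair too large to fit inline: only a truncated key hash and
    /// a pointer into the extension area stay in the bucket.
    Spilled { truncated_hash: u32, ext_offset: u64, ext_len: u32 },
}

/// Look up a key hash among a bucket's decoded entries. A miss is
/// answered from the bucket alone; a `Spilled` hit costs one extra
/// read, except in the rare truncated-hash collision case.
fn lookup(entries: &[Entry], key_hash: u64) -> Option<&Entry> {
    entries.iter().find(|e| match e {
        Entry::Inline { key_hash: h, .. } => *h == key_hash,
        Entry::Spilled { truncated_hash, .. } => {
            // Hypothetical choice: the truncated hash is the top 32 bits.
            *truncated_hash == (key_hash >> 32) as u32
        }
    })
}
```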
A possible optimization to the handling of overfull buckets is to let them spill into the immediately following bucket (but not any further). When performing a lookup, the code should read two buckets at once, so that no extra I/O requests will be needed to handle this case. This can make it possible to increase the load factor while keeping the probability that a request will need to access the extension area small. However, it needs to be tested to see how well it compares to just doubling the bucket size.
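A sketch of such a paired read, assuming a Unix platform (`read_exact_at` comes from `std::os::unix::fs::FileExt`); the wrap-around at the last bucket is ignored for brevity:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at: positioned read, no seek

const BUCKET_SIZE: usize = 4096;

/// Read bucket `i` together with the immediately following bucket in a
/// single I/O request, so that a spill into the next bucket costs no
/// extra round trip.
fn read_bucket_pair(table: &File, i: u64) -> io::Result<[u8; 2 * BUCKET_SIZE]> {
    let mut buf = [0u8; 2 * BUCKET_SIZE];
    table.read_exact_at(&mut buf, i * BUCKET_SIZE as u64)?;
    Ok(buf)
}
```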
Managing the extension area
A problem with the extension area is that space in it needs to be allocated and freed somehow. Here are some possible approaches.
Binary allocator
The entire extension area is split into fixed-size blocks, which are the leaves of a perfect binary tree: pairs of adjacent leaves are merged into nodes 2× the size of a leaf, pairs of those nodes are merged into 4× nodes, and so on. To allocate some space, allocate the smallest node that is large enough (or distribute the data among multiple nodes to reduce the wasted space). When a node is freed, it is merged with adjacent free nodes, where possible, to form larger free nodes. All free nodes of the same size form a linked list, so it is easy to find a free node of the right size.
It is possible that there are multiple small free nodes that cannot be merged because they are not adjacent. Given two free nodes of the same size that are not adjacent, it is possible to swap one of them with an allocated node adjacent to the other (moving that node's data), after which the two free nodes are adjacent and can be merged.
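This scheme is essentially a classic buddy allocator. A minimal in-memory sketch, with offsets measured in units of the smallest block (persistence and the node-swapping optimization above are left out):

```rust
/// In-memory free lists for a buddy allocator over the extension area.
struct BuddyAllocator {
    /// free[k] holds the offsets of free nodes of size 2^k blocks.
    free: Vec<Vec<u64>>,
}

impl BuddyAllocator {
    /// Create an allocator over an area of 2^total_order blocks.
    fn new(total_order: usize) -> Self {
        let mut free = vec![Vec::new(); total_order + 1];
        free[total_order].push(0); // the whole area is one free node
        BuddyAllocator { free }
    }

    /// Allocate a node of 2^order blocks, splitting a larger node if
    /// necessary. Returns the node's offset, or None if out of space.
    fn alloc(&mut self, order: usize) -> Option<u64> {
        // Find the smallest free node that is large enough.
        let k = (order..self.free.len()).find(|&k| !self.free[k].is_empty())?;
        let offset = self.free[k].pop().unwrap();
        // Split it down to the requested size; the upper halves become
        // free nodes of decreasing size.
        for j in (order..k).rev() {
            self.free[j].push(offset + (1 << j));
        }
        Some(offset)
    }

    /// Free a node, merging it with its buddy as long as the buddy is
    /// also free, to rebuild larger nodes.
    fn dealloc(&mut self, mut offset: u64, mut order: usize) {
        while order + 1 < self.free.len() {
            let buddy = offset ^ (1 << order);
            match self.free[order].iter().position(|&o| o == buddy) {
                Some(pos) => {
                    self.free[order].swap_remove(pos);
                    offset = offset.min(buddy);
                    order += 1;
                }
                None => break,
            }
        }
        self.free[order].push(offset);
    }
}
```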
Append-only files
It is possible to manage the extension area as a set of append-only files. Data is appended to the current file until it reaches a certain size, after which a new file is started. When the data is no longer used, it becomes "dead space". When the fraction of dead space in a file exceeds a certain threshold (e.g. 50%), the file is "compacted": any "live" data in this file is written again (to the current file), and the old file is deleted.
Instead of having multiple files, it is possible to use just one large file. Compaction can then be performed in place, incrementally. Also, it may be possible to write some of the new values directly into the dead space, so that compaction is not needed as often.
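A sketch of the accounting that could drive such compaction; the record index needed to locate the live data during compaction is omitted, and the 50% threshold follows the suggestion above:

```rust
/// Per-file accounting for an append-only extension file.
struct SegmentStats {
    total_bytes: u64, // everything ever appended to this file
    dead_bytes: u64,  // bytes belonging to deleted/overwritten records
}

impl SegmentStats {
    /// Called when a record in this file is deleted or superseded.
    fn mark_dead(&mut self, len: u64) {
        self.dead_bytes += len;
    }

    /// Compact once more than half of the file is dead space: the live
    /// records are appended to the current file and this file deleted.
    fn needs_compaction(&self) -> bool {
        self.total_bytes > 0 && self.dead_bytes * 2 > self.total_bytes
    }
}
```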
Separate files
It is possible to just delegate the task to the file system by storing each key-value pair in the extension area as a separate file. This can be efficient for large values; as long as only large values are stored this way, the total number of files stays manageable.
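A sketch of this scheme, with each value's file named after its hex-encoded key hash (an illustrative naming choice):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Store one large value as its own file in the extension directory.
fn write_large_value(dir: &Path, key_hash: u64, value: &[u8]) -> io::Result<()> {
    fs::write(dir.join(format!("{key_hash:016x}")), value)
}
```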
Additional considerations
Algorithmic complexity attacks
To prevent an adversary from crafting hash collisions, it is important to use salted hashing. If the hash function is cryptographically strong, it is possible to store just the hash, without the key itself.
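One possible implementation sketch, using the `blake3` crate's keyed mode as the cryptographically strong hash (an assumption; any keyed cryptographic hash would do). The salt is generated once when the store is created and persisted with the configuration, so an adversary cannot predict which bucket a key maps to:

```rust
/// Salted (keyed) hash of a key. Because the hash is cryptographically
/// strong, the 32-byte output can be stored instead of the key itself.
fn key_hash(salt: &[u8; 32], key: &[u8]) -> [u8; 32] {
    *blake3::keyed_hash(salt, key).as_bytes()
}

/// Derive a bucket index from the hash (here: its first 8 bytes).
fn bucket_index(hash: &[u8; 32], num_buckets: u64) -> u64 {
    u64::from_le_bytes(hash[..8].try_into().unwrap()) % num_buckets
}
```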
Resizing the hash table
To resize the hash table incrementally, it is possible to use an approach similar to linear hashing.
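A minimal sketch of the linear-hashing address computation: the table grows one bucket at a time, and only the entries of the bucket being split have to be rehashed. Names are illustrative:

```rust
/// Address computation for a linearly-growing hash table. The table
/// has (1 << level) + split buckets.
struct LinearTable {
    level: u32, // current doubling round
    split: u64, // next bucket to be split in this round
}

impl LinearTable {
    /// Map a (salted) key hash to a bucket index.
    fn bucket_of(&self, hash: u64) -> u64 {
        let i = hash % (1 << self.level);
        if i < self.split {
            // This bucket was already split in the current round, so
            // one more bit of the hash decides between the two halves.
            hash % (1 << (self.level + 1))
        } else {
            i
        }
    }

    /// Grow the table by one bucket. Only the entries of bucket
    /// `self.split` need to be rehashed (into buckets `split` and
    /// `split + 2^level`), which is what makes the resize incremental.
    fn grow(&mut self) {
        self.split += 1;
        if self.split == (1 << self.level) {
            // Round complete: the table size has doubled.
            self.level += 1;
            self.split = 0;
        }
    }
}
```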
Data consistency
To ensure data consistency in case of a process or system crash, journaling can be used: all data is first written to the journal and only then to its final location. This doubles the write amplification, although when writing large values it may be possible to avoid writing the bulk of the data to the journal.
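A sketch of the write path under this discipline, assuming a simple length-prefixed record format (checksums, destination offsets, and crash recovery are omitted):

```rust
use std::fs::File;
use std::io::{self, Write};

/// Append one update record to the journal and make it durable before
/// the caller touches the final location.
fn journal_append(journal: &mut File, record: &[u8]) -> io::Result<()> {
    journal.write_all(&(record.len() as u64).to_le_bytes())?;
    journal.write_all(record)?;
    // Force the record to stable storage; only after this may the data
    // be written in place. After a crash, recovery replays the journal.
    journal.sync_data()
}
```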
File structure
The simplest approach may be to store the data in three files: one for the hash table, one for the extension area and one for the journal. Depending on how the extension area is organized, it may use multiple files. Also, an additional file may be used for the configuration.
It is also possible to combine everything into a single file. The configuration data will probably go first, then the hash table, then the extension area, then the journal. When the hash table grows, the extension area will need to be moved. The journal can be discarded between transactions, so it makes sense to have it at the end.
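For illustration, a header for such a single-file layout might look like the following; the field names and sizes are assumptions, not a specification:

```rust
/// Hypothetical header for the single-file layout. The regions appear
/// in the order discussed above: configuration, hash table, extension
/// area, and the journal at the end of the file.
#[repr(C)]
struct FileHeader {
    magic: [u8; 8],      // format identifier and version
    salt: [u8; 32],      // salt for the keyed hash (see above)
    hash_table_off: u64, // start of the hash table region
    hash_table_len: u64, // growing it forces the extension area to move
    extension_off: u64,  // start of the extension area
    extension_len: u64,
    journal_off: u64,    // the journal can be discarded between transactions
}
```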
Instead of using a file, it is possible to directly use a block device. This will bypass the additional translation/indirection layer that is the file system. In this case, it makes sense to have the hash table at the beginning of the device and the extension area at the end, so that they grow towards each other.
Updates