LMDB vs TuplDB
Test setup:
- RAM: 48GB
- Storage: Intel Optane SSD 900P 480GB
- CPU: Ryzen 7 1700
  - 8 physical cores, 16 logical cores
- Kernel: 5.4.0-169
- File system: ext4
- LMDB Java: 0.8.3
  - Map size: 300GB
  - MDB_NOSYNC, MDB_NOMETASYNC, MDB_NORDAHEAD
- TuplDB: 1.8.0
  - Cache size: 44GB
  - DurabilityMode.NO_REDO
  - G1GC, -Xmx3g -XX:+UseLargePages -XX:+UseTransparentHugePages
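For reference, the setup above might be expressed roughly as follows with the Tupl and lmdbjava APIs. This is only a sketch: the paths are hypothetical, and the exact builder methods may differ between versions.

```java
import java.io.File;
import java.nio.ByteBuffer;

import org.cojen.tupl.Database;
import org.cojen.tupl.DatabaseConfig;
import org.cojen.tupl.DurabilityMode;
import org.lmdbjava.Env;
import org.lmdbjava.EnvFlags;

public class Setup {
    // TuplDB: 44GB cache, redo logging disabled (durability via checkpoints only).
    static Database openTupl() throws Exception {
        return Database.open(new DatabaseConfig()
            .baseFilePath("/data/tupl/test")            // hypothetical path
            .minCacheSize(44L * 1024 * 1024 * 1024)
            .maxCacheSize(44L * 1024 * 1024 * 1024)
            .durabilityMode(DurabilityMode.NO_REDO));
    }

    // LMDB (lmdbjava): 300GB map size, sync and readahead disabled.
    static Env<ByteBuffer> openLmdb() {
        return Env.create()
            .setMapSize(300L * 1024 * 1024 * 1024)
            .setMaxDbs(1)
            .open(new File("/data/lmdb"),               // hypothetical path
                  EnvFlags.MDB_NOSYNC,
                  EnvFlags.MDB_NOMETASYNC,
                  EnvFlags.MDB_NORDAHEAD);
    }
}
```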
This test measures the time to insert 1 billion random entries into the database, starting from empty, using one thread. The key size is 8 bytes, and the value size is 100 bytes. It should be noted that the insert operations were allowed to overwrite any existing entry — they were essentially "put" or "store" operations, although the probability of a collision was quite low.
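A minimal sketch of what the insert loop looks like with the Tupl API is shown below. The index name and the use of an auto-commit (null) transaction are assumptions; the actual benchmark harness isn't shown here.

```java
import java.util.concurrent.ThreadLocalRandom;

import org.cojen.tupl.Database;
import org.cojen.tupl.Index;

public class RandomInsert {
    // Single thread inserts 1 billion entries with random 8-byte keys and
    // 100-byte values; store() overwrites any existing entry ("put" semantics).
    static void run(Database db) throws Exception {
        Index ix = db.openIndex("test"); // hypothetical index name
        byte[] key = new byte[8];
        byte[] value = new byte[100];
        for (long i = 0; i < 1_000_000_000L; i++) {
            ThreadLocalRandom.current().nextBytes(key);
            ix.store(null, key, value); // null txn = auto-commit store
        }
    }
}
```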
This chart shows the insert performance relative to the current number of entries.
- TuplDB: Completed in 5.0 hours, with a file size of 135GiB.
- LMDB: Completed in 7.0 hours, with a file size of 165GiB.
It should be apparent that TuplDB is much faster at inserting than LMDB. After about 300M entries have been inserted, RAM is exhausted, which is why performance dips. TuplDB write performance before this point is somewhat erratic because it's inserting at a rate the SSD cannot keep up with. Checkpoints run every few seconds, and this forces the insert thread to slow down. LMDB also shows an odd dip at the 100M point, which was consistently reproducible, but I don't know the cause.
To prepare for this test, a fresh database was created with 1 billion sequentially ordered entries. The key size is 8 bytes, and the value size is 100 bytes. The TuplDB file size is 107GiB, and the LMDB file size is 117GiB.
The test measures the performance of transactionally reading one value against a randomly selected key, within a specific range. Initially, the key range is restricted to the first 10 million entries, and the test runs over this range for 10 seconds. The key range is expanded by an additional 10 million entries, tested for 10 seconds, and so on, up to a maximum range of 1 billion entries. Each time the range is expanded, the newly added key range is pre-scanned to populate the cache. The pre-scan time is not included in the test result time.
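A sketch of one read operation is shown below, assuming the sequentially loaded keys are 8-byte big-endian counters; the key encoding and transaction handling are assumptions.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;

import org.cojen.tupl.Index;

public class RandomRead {
    // Transactionally read one value at a random key within [0, range).
    static byte[] readOne(Index ix, long range) throws Exception {
        long k = ThreadLocalRandom.current().nextLong(range);
        byte[] key = ByteBuffer.allocate(8).putLong(k).array();
        return ix.load(null, key); // null txn = auto-commit read
    }
}
```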
The entire test was run several times, with 1, 2, 4, 8, and 16 threads. This first chart shows the results for 1 thread. Charts with more threads have a similar shape, and so they aren't included.
TuplDB is slightly faster than LMDB in this test, but practically speaking, the performance is essentially the same. Both show a dip in performance when the key range reaches about 380 million, because the database no longer fits in the cache. Note that this same dip was reached earlier in the random insert test (at 300 million), because random inserts produce a lower b-tree fill factor than sequential inserts.
These next charts show how TuplDB and LMDB scale as more read threads are added to the test.
For the first chart, the performance over the 100 to 300 million key range was averaged together. This shows read performance when the entire range of requested entries fits in the cache. LMDB outperforms TuplDB with 2 and 4 threads, but TuplDB is faster with 8 and 16 threads. The test machine has 8 physical cores and 16 logical cores, which is why the 16-thread result is marked with an asterisk.
The second chart averages over the 700 to 900 million key range, and so the range of entries doesn't fit in the cache. TuplDB outperforms LMDB in a few cases, and vice versa. As was also seen in the earlier chart, TuplDB is definitely faster than LMDB when more threads are running than there are physical cores. In practice, however, it's safe to assume that TuplDB and LMDB read performance (of a single value) is essentially the same.
This is another random read performance test, but with TuplDB and LMDB configured to access a block device, bypassing the file system. TuplDB supports opening the block device in a "normal" mode, and it also supports `O_DIRECT`. The released version of LMDB (at this time) doesn't support block device access, and so these tests were run against a development branch. The feature will be fully available in the 1.0 version. LMDB doesn't support opening the block device using `O_DIRECT` because it doesn't make sense when using `mmap`.
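For TuplDB, the sketch below assumes the block device can simply be supplied as the data file, while the base path stays on a regular file system. The device path is hypothetical, and the option for enabling `O_DIRECT` isn't shown because its exact name may vary between versions.

```java
import java.io.File;

import org.cojen.tupl.DatabaseConfig;
import org.cojen.tupl.DurabilityMode;

public class BlockDeviceConfig {
    // Assumption: the raw block device is passed as the data file, while the
    // base path (lock file, metadata) still lives on a regular file system.
    static DatabaseConfig config() {
        return new DatabaseConfig()
            .baseFilePath("/data/tupl/test")        // hypothetical base path
            .dataFiles(new File("/dev/nvme0n1"))    // hypothetical block device
            .minCacheSize(44L * 1024 * 1024 * 1024)
            .maxCacheSize(44L * 1024 * 1024 * 1024)
            .durabilityMode(DurabilityMode.NO_REDO);
        // O_DIRECT would be enabled with an additional config option (not shown).
    }
}
```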
The interesting results are in the read scalability test, when the range of entries doesn't fit in the cache.
The bars labeled "Tupl" and "LMDB" are copied from the earlier chart which used the file system (ext4). The "-bdev" bars show performance against the block device, and the "-odirect" bar shows the performance of the block device when combined with `O_DIRECT`.
With LMDB, no performance gain was observed when using the block device, and it appears that there might be a slight performance regression. TuplDB is able to achieve better performance when using the block device, and with `O_DIRECT` it's even faster.
It should be noted that TuplDB also supports using `mmap`, and the performance of this mode closely matched that of LMDB. The limiting factor appears to be `mmap` itself and not anything particular to LMDB, other than its reliance on `mmap`.
This test ran with the same database which was prepared for the random read performance test. Like the earlier test, it randomly selects a key over a specific range. The difference is that it scans over 100 entries using a cursor instead of retrieving a single value.
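A sketch of one scan operation with the Tupl cursor API follows; the starting-key encoding and the use of an unlocked (BOGUS) cursor transaction are assumptions.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;

import org.cojen.tupl.Cursor;
import org.cojen.tupl.Index;
import org.cojen.tupl.Transaction;

public class RandomScan {
    // Position a cursor at a random key within [0, range) and scan 100 entries.
    static void scan100(Index ix, long range) throws Exception {
        long k = ThreadLocalRandom.current().nextLong(range);
        byte[] start = ByteBuffer.allocate(8).putLong(k).array();
        Cursor c = ix.newCursor(Transaction.BOGUS); // BOGUS = no locking (assumption)
        try {
            c.findGe(start); // first entry at or after the random key
            for (int i = 0; i < 100 && c.key() != null; i++) {
                byte[] value = c.value(); // the 100-byte value
                c.next();
            }
        } finally {
            c.reset();
        }
    }
}
```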
Like before, the entire test was run several times, with 1, 2, 4, 8, and 16 threads. This first chart shows the results for 1 thread, and charts with more threads have a similar shape, but they aren't shown.
TuplDB is much faster than LMDB in this test. As before, both show a dip in performance when the key range reaches about 380 million, because the database no longer fits in the cache.
These next charts show read scan scalability as more threads are added.
TuplDB outperforms LMDB in all cases, except with 16 threads when all of the entries fit in the cache. In general, it's safe to assume that scans are faster with TuplDB than with LMDB.
The testing continues with the same database, except this time entries are randomly selected and updated with a new value of the same size (100 bytes). Like before, the entire test was run several times, with 1, 2, 4, 8, and 16 threads.
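A sketch of one update operation, with the same assumptions about the key encoding and auto-commit transaction as before:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;

import org.cojen.tupl.Index;

public class RandomUpdate {
    // Overwrite a randomly selected entry with a new 100-byte value.
    static void updateOne(Index ix, long range) throws Exception {
        long k = ThreadLocalRandom.current().nextLong(range);
        byte[] key = ByteBuffer.allocate(8).putLong(k).array();
        byte[] newValue = new byte[100];
        ThreadLocalRandom.current().nextBytes(newValue);
        ix.store(null, key, newValue); // null txn = auto-commit update
    }
}
```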
TuplDB is clearly much faster at updating entries than LMDB. Note that there's an "LMDB 2nd pass", which is even slower. After completing all the tests, I decided to go back and run the update test again. This time, performance was degraded. After running the test several more times, I found that the degradation persists, but it doesn't get any worse. The 2nd pass results should be considered a more accurate representation of performance in a production system. I tried the same experiment with TuplDB, but it didn't show any degradation.
Again, both databases show a dip in performance when the key range reaches about 380 million, because the database no longer fits in the cache. The dip is less pronounced with LMDB because the update CPU overhead is much higher than with TuplDB. In the 2nd pass, the overhead is so high that the dip disappears entirely.
These next charts show update scalability as more threads are added.
Because TuplDB supports concurrent writes, update performance actually scales as more threads are added. Once more threads are running than there are physical cores, no significant additional gains are observed. Although LMDB doesn't support concurrent writes, these charts do show something interesting. As more threads are added, performance should remain the same, but in fact it slows down. This could be caused by the overall performance degradation problem, or it could be caused by cache thrashing at the CPU level.
The testing continues with the same database after the update test has run. This time, the threads randomly choose to perform a scan of 100 values (90% of the time) or else they do an update.
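A sketch of the per-operation choice is shown below, reusing the hypothetical scan and update helpers from the earlier sketches.

```java
import java.util.concurrent.ThreadLocalRandom;

import org.cojen.tupl.Index;

public class MixedWorkload {
    // 90% of operations scan 100 entries; the remaining 10% update one entry.
    static void oneOp(Index ix, long range) throws Exception {
        if (ThreadLocalRandom.current().nextInt(10) == 0) {
            RandomUpdate.updateOne(ix, range); // sketch from the update test above
        } else {
            RandomScan.scan100(ix, range);     // sketch from the scan test above
        }
    }
}
```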
As shown earlier, TuplDB is faster than LMDB at scans, and it's faster at updates. It's still faster when mixing the operations together. Because 10% of the operations are updates, I also did a 2nd pass test with LMDB. By the way, I have no idea why TuplDB shows a performance boost at the 200M mark.
Not surprisingly, TuplDB scales better than LMDB in the mix test, since it has already been shown to scale better with just scans or just updates.
By design, LMDB supports only a single writer, and it relies on memory mapping for persisting data to the file system. TuplDB, on the other hand, is designed to support high concurrency, and it manages its own cache instead of using a memory-mapped file. As a result, the TuplDB design is much more complicated than LMDB's, but this complexity doesn't cause any measurable overhead when compared to LMDB.
The performance of simple read operations when using the file system is essentially the same for both databases, but for all other kinds of operations, TuplDB is much faster. With TuplDB, inserts are faster, scans are faster, updates are faster, and block device access is faster. An application which mostly performs simple read operations and fits in memory would work just fine with LMDB, but if the access patterns change, then TuplDB might be a better choice.
It's possible that some overhead with LMDB Java is due to the JNR access layer. An improved Java binding might allow LMDB to exceed TuplDB performance for simple read operations, but I wouldn't expect it to improve performance for the other operations.
This chart shows TuplDB random insert performance against a block device, in various configurations, as compared to a plain file. The data for the plain file is the same as was shown in the earlier "Random insert performance" section.
The data lines labeled "odirect" are against a block device using `O_DIRECT`. The "crc" label indicates that CRC32C checksums are enabled, and the "cipher" label indicates that encryption was also enabled. To get good write performance with `O_DIRECT`, the configuration allowed up to 15 checkpoint threads to run.
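A rough idea of how such a configuration might be expressed with the Tupl API; the path is hypothetical, and the options for `O_DIRECT`, CRC32C checksums, and encryption aren't shown because their exact method names may differ between versions.

```java
import org.cojen.tupl.DatabaseConfig;
import org.cojen.tupl.DurabilityMode;

public class DirectInsertConfig {
    // Sketch only: the block device, O_DIRECT, checksum, and encryption
    // options would be added to this configuration (not shown here).
    static DatabaseConfig config() {
        return new DatabaseConfig()
            .baseFilePath("/data/tupl/test")     // hypothetical base path
            .durabilityMode(DurabilityMode.NO_REDO)
            .maxCheckpointThreads(15);           // allow up to 15 concurrent checkpoint threads
    }
}
```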
The best performance is achieved when using `O_DIRECT` by itself, and enabling checksums and encryption adds some overhead. Even with this overhead, it still outperforms a plain file, at least when using the ext4 file system.