Library for out-of-memory sorting of large datasets which need to be processed in multiple map / sort / reduce passes.
You write a stream of items of type `T` implementing `Serialize` and `Deserialize` to a `ShardWriter`. The items are buffered, sorted according to a customizable sort key, then serialized to disk in chunks with serde + lz4, while maintaining an index of the position and key range of each chunk. You use a `ShardReader` to stream through the items in a selected interval of the key space, in sorted order.
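A minimal sketch of this write/read cycle is shown below. It assumes the crate is imported as `shardio` and follows the calls from the crate's documentation (`ShardWriter::new(path, sender_buffer, disk_chunk_size, item_buffer)`, `get_sender`, `finish`, `ShardReader::open`, `iter_range` with `Range::all()`); the `Rec` struct, file name, and buffer sizes are illustrative, and exact signatures may vary between versions.

```rust
use serde::{Deserialize, Serialize};
use shardio::*;

// Items are sorted by the derived `Ord`, i.e. by `a`, then by `b`.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
struct Rec {
    a: u64,
    b: u32,
}

fn main() {
    let filename = "example.shardio";

    {
        // Buffer and chunk sizes trade memory use against how many disk
        // chunks a later range query must touch; these values are illustrative.
        let mut writer: ShardWriter<Rec> =
            ShardWriter::new(filename, 64, 256, 1 << 16).expect("create writer");

        // A sender is a handle for feeding items to the writer.
        let mut sender = writer.get_sender();
        for i in 0u64..10_000 {
            sender.send(Rec { a: i % 25, b: (i % 100) as u32 }).expect("send item");
        }
        // Flush this sender, then finalize the file (sort, compress, index).
        sender.finished();
        writer.finish().expect("finish write");
    }

    // Stream everything back in sorted order over the full key space.
    let reader = ShardReader::<Rec>::open(filename).expect("open reader");
    let items: Vec<Rec> = reader
        .iter_range(&Range::all())
        .expect("start iteration")
        .collect::<Result<_, _>>()
        .expect("read items");
    assert!(items.windows(2).all(|w| w[0] <= w[1]));
}
```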
See the Docs for the API and examples.
Note: enable the `full-test` feature in release mode to turn on some long-running stress tests.
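With Cargo's standard flags this is, for example, `cargo test --release --features full-test`.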