The goal of SimRecorder is to provide a simple and unified interface for recording and retrieving simulation data, with transparent support for multiple backend storage formats. The library is optimized for storing large NumPy arrays, but can handle most datatypes. Currently, three storage backends are supported: Zarr, HDF5, and Redis.
pip install https://github.com/IGITUGraz/SimRecorder/archive/master.zip
By default, only support for zarr and HDF5 is installed. To install redis support, clone the repository and run
pip install -r requirements.redis.txt && pip install .
You can test whether the various backends work correctly by running the scripts in the tests directory.
Zarr: all required packages (including lmdb) are installed through pip as dependencies of this package.
HDF5: libhdf5 needs to be installed on the system using:
sudo apt-get install libhdf5
Redis: all required packages (including redis) are installed through pip as dependencies of this package.
The library consists of a single Recorder interface that can be initialized to use different backends by passing in an appropriate DataStore object.
First, import the datastores you want to use:
from simrecorder import Recorder, InMemoryDataStore, RedisDataStore, HDF5DataStore, ZarrDataStore
Then initialize all the datastores you want (Yes, you can have more than one!).
The InMemoryDataStore stores all data in memory.
in_memory_datastore = InMemoryDataStore()
The ZarrDataStore stores all data using the given path as the directory for the lmdb database files. If you use a directory path that already exists, it opens the database in read-only mode.
zarr_datastore = ZarrDataStore('~/output/data.mdb')
The HDF5DataStore stores all data in the given HDF5 file. If you use a file that already exists, it opens the file in read-only mode.
hdf5_datastore = HDF5DataStore('~/output/data.h5')
The HDF5DataStore and ZarrDataStore don't support distributed simulations yet, unless you have a single writer thread that handles all interaction with the hdf5 file.
The RedisDataStore stores all data in redis (persisted in the given data_directory). Currently, you cannot have more than one RedisDataStore in use per host. For distributed simulations, the worker simulations running on the worker nodes/hosts need to pass in the server_host of the main/master node.
redis_datastore = RedisDataStore(server_host='localhost', data_directory='~/output')
Then initialize the recorder with the datastore(s) you want to use:
# To use only the in-memory datastore
recorder = Recorder(in_memory_datastore)
# To use only the zarr datastore
recorder = Recorder(zarr_datastore)
# To use only the hdf5 datastore
recorder = Recorder(hdf5_datastore)
# To use more than one
recorder = Recorder(in_memory_datastore, redis_datastore, hdf5_datastore)
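The idea of one recorder backed by several datastores can be illustrated with a small standalone sketch. This is not SimRecorder's implementation: `ToyStore` and `ToyRecorder` are hypothetical names, and the sketch assumes that every write goes to all stores while reads come from the first one.

```python
from collections import defaultdict


class ToyStore:
    """Hypothetical in-memory store: one list of values per key."""

    def __init__(self):
        self._data = defaultdict(list)

    def append(self, key, value):
        self._data[key].append(value)

    def get_all(self, key):
        return list(self._data[key])


class ToyRecorder:
    """Hypothetical recorder: writes to every store, reads from the first."""

    def __init__(self, *stores):
        if not stores:
            raise ValueError("at least one store is required")
        self._stores = stores

    def record(self, key, value):
        # Assumption: every write is forwarded to each configured store.
        for store in self._stores:
            store.append(key, value)

    def get_all(self, key):
        # Assumption: reads are served by the first store only.
        return self._stores[0].get_all(key)


fast, durable = ToyStore(), ToyStore()
toy_recorder = ToyRecorder(fast, durable)
toy_recorder.record("a/b", 1.0)
toy_recorder.record("a/b", 2.0)
```

Under these assumptions, putting a fast store first and a persistent store second gives quick reads during the run while still keeping a copy on disk.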
In your simulation, record the values you want. For each type of value, pass in a key. By default, every time you use the same key, the value is appended to a list-like data structure (in the underlying datastore).
Your keys can be any arbitrary string. Use '/' for efficient use of deeper hierarchies in Zarr and HDF5. (For other datastores, it makes no difference.)
# This appends some_value1 and some_value2 to a list with key 'a/b'
recorder.record('a/b', some_value1)
recorder.record('a/b', some_value2)
# This appends some_value2 to a list with key 'a/c'
recorder.record('a/c', some_value2)
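To see why '/' matters for hierarchical backends, here is a toy sketch of how a '/'-separated key can map onto nested groups with a list-like dataset at the leaf. The function `insert_hierarchical` and the nested-dict layout are illustrative assumptions, not the way SimRecorder, Zarr, or HDF5 store data internally.

```python
def insert_hierarchical(tree, key, value):
    """Append value under a '/'-separated key in a nested dict of groups."""
    *groups, leaf = key.split("/")
    node = tree
    for name in groups:
        # Each path component before the last becomes a nested group.
        node = node.setdefault(name, {})
    # The last component names the list that values are appended to.
    node.setdefault(leaf, []).append(value)
    return tree


tree = {}
insert_hierarchical(tree, "a/b", 0.1)
insert_hierarchical(tree, "a/b", 0.2)
insert_hierarchical(tree, "a/c", 0.3)
```

Here both 'a/b' and 'a/c' end up under the same group 'a', so related values stay together instead of living in one flat namespace.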
After the simulation is done, retrieve the values using recorder.get, which returns a list of values.
Note that if you used the ZarrDataStore, you will get zarr.core.Array objects, which you can either pass directly to most NumPy functions or convert to NumPy arrays before use. The zarr.core.Array objects also allow you to work with larger-than-memory arrays, as long as you only use slices of the arrays. The HDF5DataStore similarly returns HDFView objects that have similar properties to zarr.core.Array.
# This gives you a list of the values you recorded [some_value1, some_value2] (retrieved from the first datastore)
recorder.get_all('a')
# You can also re-initialize the recorder with the same parameters in other scripts and access the keys
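The point about slices and larger-than-memory arrays can be shown with a standard-library sketch of lazy slicing: only the requested range is read from disk, never the whole array. `LazyFloatArray` is a hypothetical class written for this illustration. It is not zarr.core.Array or HDFView, and it handles only contiguous 1-D slices of float32 values.

```python
import os
import tempfile
from array import array


class LazyFloatArray:
    """Hypothetical 1-D float32 array on disk that reads only the requested slice."""

    ITEM_SIZE = array("f").itemsize  # bytes per stored value

    def __init__(self, path):
        self._path = path
        self._length = os.path.getsize(path) // self.ITEM_SIZE

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("only slices are supported in this sketch")
        start, stop, step = index.indices(self._length)
        if step != 1:
            raise ValueError("only contiguous slices are supported in this sketch")
        count = max(stop - start, 0)
        values = array("f")
        with open(self._path, "rb") as handle:
            # Jump straight to the first requested value and read only `count` items.
            handle.seek(start * self.ITEM_SIZE)
            values.fromfile(handle, count)
        return values.tolist()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.f32")
    with open(path, "wb") as handle:
        array("f", [float(i) for i in range(1000)]).tofile(handle)

    lazy = LazyFloatArray(path)
    window = lazy[10:13]      # reads three values from disk, not the whole file
    total_length = len(lazy)  # derived from the file size, without loading data
```

Chunked, compressed formats apply the same principle per chunk, which is why slicing a large stored array stays cheap as long as the slices themselves fit in memory.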
You can also close the recorder after writing, and open it later for reading.
Remember to close the recorder after all reading/writing is done. This flushes data and closes the connection (where applicable)
# After everything
recorder.close()
To make sure all the datastores work, run:
python tests/test_datastores.py
To test the performance of the zarr, hdf5 and redis datastores, you can use the tests/time_* scripts. You can tune the size of the NumPy array to reflect your use case. The default values are quite large -- for instance, with the default values, the resulting hdf5 file is about 4GB.
- The Zarr backend is the recommended backend if you are running simulations on a single node. It works well for large NumPy arrays as well.
- For distributed simulations running across multiple nodes, the redis backend should be used.
- The redis backend is extremely fast for both reading and writing, as long as you're not storing large (>20MB) NumPy arrays.
When running on x86 architectures where SSE2 and AVX CPU operations are not available, you should install numcodecs from source, since Blosc by default depends on these CPU optimizations.
For PPC architectures, install numcodecs from this fork.
The results below are from a single run on an NFS disk, on an Intel Xeon machine, with default parameters, for one specific 4-D array of 100x10x10000x200 float32 values. They are for comparative purposes only! You can run your own tests using the tests/time_* scripts.
| Backend | Total write time (s) | Mean write time (s) | Slicing mean read time (s) | Total read time (s) | Size on disk (GB) |
|---|---|---|---|---|---|
| Zarr (lmdb + blosc) | 184.51 | 1.8236 | 0.0050 | 25.72 | 14 |
| HDF5 | 140.69 | 1.3530 | 0.1167 | 145.30 | 15 |
| redis (PyArrow) [1] | 267.82 | 1.3794 | NA | 68.00 | 48 |
| redis (pickle) [1] | 305.24 | 1.7668 | NA | 66.75 | 40 |

[1] Redis doesn't support larger-than-memory array access. The total write time is larger than the number of arrays times the mean write time because redis takes some time to write everything to disk and shut down at the end.
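The gap between total and mean write time described in the footnote can be checked with a little arithmetic on the table's figures. The sketch below assumes that the first dimension of the 100x10x10000x200 array is the number of separate writes (100). The text does not state this explicitly, but it is consistent with the Zarr and HDF5 rows.

```python
# Figures copied from the benchmark table: (total write time, mean write time) in seconds.
write_times = {
    "Zarr (lmdb + blosc)": (184.51, 1.8236),
    "HDF5": (140.69, 1.3530),
    "redis (PyArrow)": (267.82, 1.3794),
    "redis (pickle)": (305.24, 1.7668),
}

# Assumption: the first dimension of the array (100) is the number of recorded arrays.
NUM_ARRAYS = 100

# Time not accounted for by the individual writes, per backend.
overhead = {
    backend: round(total - NUM_ARRAYS * mean, 2)
    for backend, (total, mean) in write_times.items()
}

# Uncompressed size of the float32 data (4 bytes per value), in decimal gigabytes.
raw_size_gb = 100 * 10 * 10000 * 200 * 4 / 1e9

for backend, seconds in overhead.items():
    print(f"{backend}: {seconds} s outside individual writes")
print(f"raw data size: {raw_size_gb} GB")
```

Under that assumption, Zarr and HDF5 spend only a few seconds outside the individual writes, while both redis variants spend roughly 130 seconds, which matches the footnote's explanation of the final flush to disk and shutdown.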