GitHub - IGITUGraz/SimRecorder: A high-performance library for recording and storing simulation data

SimRecorder

The goal of SimRecorder is to provide a simple and unified interface for recording and retrieving simulation data, with transparent support for multiple backend storage formats. The library is optimized for storing large NumPy arrays, but can handle most datatypes. Currently three storage backends are supported: zarr, HDF5, and redis.

Installation

pip install https://github.com/IGITUGraz/SimRecorder/archive/master.zip

By default, only support for zarr and HDF5 is installed. To install redis support, clone the repository and run

pip install -r requirements.redis.txt && pip install .

You can test if various backends work correctly by running the scripts in the tests directory.

Requirements

Zarr backend

All required packages (including lmdb) are installed through pip as dependencies of this package.

HDF5 backend

libhdf5 needs to be installed in the system using:

sudo apt-get install libhdf5

Redis backend

All required packages (including redis) are installed through pip as dependencies of this package.

Quickstart

The library consists of a single Recorder interface that can be initialized to use different backends by passing in an appropriate DataStore object.

First import the datastores you want to use

```
from simrecorder import Recorder, InMemoryDataStore, RedisDataStore, HDF5DataStore, ZarrDataStore
```

Then initialize all the datastores you want (Yes, you can have more than one!).

The InMemoryDataStore stores all data in memory.
```
in_memory_datastore = InMemoryDataStore()
```
The ZarrDataStore stores all data using the given path as the directory for the lmdb database files. If you pass a directory path that already exists, it opens the database in read-only mode.
```
zarr_datastore = ZarrDataStore('~/output/data.mdb')
```
The HDF5DataStore stores all data in the given HDF5 file. If you pass a file that already exists, it opens the file in read-only mode.
```
hdf5_datastore = HDF5DataStore('~/output/data.h5')
```
The HDF5DataStore and ZarrDataStore don't support distributed simulations yet, unless you have a single writer thread that handles all interaction with the underlying file or database.

The RedisDataStore stores all data in redis (persisted in the given data_directory). Currently, you cannot use more than one RedisDataStore per host.

For distributed simulations, the code running on the worker nodes needs to pass the server_host of the main/master node.
```
redis_datastore = RedisDataStore(server_host='localhost', data_directory='~/output')
```
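The example paths above start with `~`. Whether SimRecorder expands `~` itself is not stated here, so a cautious sketch is to expand the path with the standard library before passing it to a datastore. The `resolve_output_path` helper below is hypothetical and not part of SimRecorder.

```python
import os


def resolve_output_path(path):
    """Expand a leading '~' and return an absolute path (hypothetical helper)."""
    return os.path.abspath(os.path.expanduser(path))


# The resulting strings could then be passed to a datastore constructor,
# e.g. ZarrDataStore(zarr_path) or HDF5DataStore(hdf5_path).
zarr_path = resolve_output_path('~/output/data.mdb')
hdf5_path = resolve_output_path('~/output/data.h5')
```

This keeps the output location unambiguous regardless of how the backend treats the tilde.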

Then initialize the recorder with the datastore(s) you want to use:
```
# To use only in-memory datastore
recorder = Recorder(in_memory_datastore)

# To use only the zarr datastore
recorder = Recorder(zarr_datastore)

# To use only the hdf5 datastore
recorder = Recorder(hdf5_datastore)

# To use more than one
recorder = Recorder(in_memory_datastore, redis_datastore, hdf5_datastore)
```

In your simulation, record the values you want. For each type of value, pass in a key. By default, every time you use the same key, the value is appended to a list-like data structure in the underlying datastore.

Your keys can be arbitrary strings. Use '/' to make efficient use of deeper hierarchies in Zarr and HDF5 (for other datastores, it makes no difference).
```
# This appends some_value1 and then some_value2 to a list with key 'a/b'
recorder.record('a/b', some_value1)
recorder.record('a/b', some_value2)
# This appends some_value2 to a list with key 'a/c'
recorder.record('a/c', some_value2)
```
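The append-by-key behaviour and the '/' hierarchy described above can be pictured with a small stand-in. This is a toy sketch, not SimRecorder code: `NestedListStore` is a hypothetical class that maps each '/'-separated key onto nested dictionaries, roughly the way hierarchical formats such as Zarr and HDF5 organize groups.

```python
class NestedListStore:
    """Toy stand-in (not SimRecorder code): '/'-separated keys map to nested dicts."""

    def __init__(self):
        self.root = {}

    def record(self, key, value):
        *groups, leaf = key.split('/')
        node = self.root
        for name in groups:
            node = node.setdefault(name, {})
        # Repeated keys append to the same list instead of overwriting it.
        node.setdefault(leaf, []).append(value)

    def get(self, key):
        node = self.root
        for name in key.split('/'):
            node = node[name]
        return node


store = NestedListStore()
store.record('a/b', 1.0)
store.record('a/b', 2.0)
store.record('a/c', 2.0)
```

Here `store.get('a/b')` holds both recorded values in insertion order, while `store.get('a')` is the group containing the sub-keys 'b' and 'c'.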
After the simulation is done, retrieve the values using recorder.get, which returns a list of values.

Note that if you used the ZarrDataStore, you will get zarr.core.Array objects, which you can either pass directly to most NumPy functions or convert to NumPy arrays before use. The zarr.core.Array objects also let you work with larger-than-memory arrays, as long as you only access slices of them.

The HDF5DataStore similarly returns HDFView objects, which have properties similar to those of zarr.core.Array.
```
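# This gives you a list of the values you recorded, [some_value1, some_value2] (retrieved from the first datastore)
recorder.get('a/b')
# get_all takes a key prefix such as 'a' (presumably covering every key below it, here 'a/b' and 'a/c')
recorder.get_all('a')
# You can also re-initialize the recorder with the same parameters in other scripts and access the keys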
```
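The comment above notes that values are retrieved from the first datastore. One way to picture a recorder with several datastores is the fan-out sketch below. It assumes that every write goes to all datastores and that reads are served by the first; `ListStore` and `FanOutRecorder` are hypothetical stand-ins, not the library's classes.

```python
class ListStore:
    """Toy in-memory datastore (illustrative only)."""

    def __init__(self):
        self.data = {}

    def append(self, key, value):
        self.data.setdefault(key, []).append(value)

    def get(self, key):
        return self.data[key]


class FanOutRecorder:
    """Writes go to every datastore; reads are served by the first one."""

    def __init__(self, *datastores):
        if not datastores:
            raise ValueError('at least one datastore is required')
        self.datastores = datastores

    def record(self, key, value):
        for store in self.datastores:
            store.append(key, value)

    def get(self, key):
        return self.datastores[0].get(key)


first, second = ListStore(), ListStore()
recorder_sketch = FanOutRecorder(first, second)
recorder_sketch.record('a/b', 1)
recorder_sketch.record('a/b', 2)
```

Under these assumptions, putting a fast store (for example an in-memory one) first keeps reads cheap while a persistent store later in the list still receives every value.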
You can also close the recorder after writing and open it again later for reading.
Remember to close the recorder after all reading/writing is done. This flushes the data and closes the connection (where applicable).
```
# After everything
recorder.close()
```
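To make sure close() runs even if the simulation raises, the standard library's contextlib.closing can wrap any object that has a close() method. The sketch below uses a hypothetical `DummyRecorder` stand-in so that it runs on its own; it assumes only that the real recorder exposes close() as shown above.

```python
from contextlib import closing


class DummyRecorder:
    """Stand-in with the record/close shape used above (not the real Recorder)."""

    def __init__(self):
        self.values = []
        self.closed = False

    def record(self, key, value):
        self.values.append((key, value))

    def close(self):
        self.closed = True


dummy = DummyRecorder()
try:
    with closing(dummy) as rec:
        rec.record('a/b', 1)
        raise RuntimeError('simulated failure mid-simulation')
except RuntimeError:
    pass  # close() has already run by the time the exception reaches here
```

A plain try/finally block around the simulation loop achieves the same thing.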

Tests

To make sure all the datastores work, run:

python tests/test_datastores.py

To test the performance of the zarr, hdf5 and redis datastores, you can use the tests/time_* scripts. You can tune the size of the NumPy array to reflect your use case. The default values are quite large -- for instance, with the default values, the resulting hdf5 file is about 4GB.

Backends

The Zarr backend is the recommended backend if you are running simulations on a single node. It also works well for large NumPy arrays.
For distributed simulations running across multiple nodes, the redis backend should be used.
The redis backend is extremely fast for both reading and writing, as long as you're not storing large (>20MB) NumPy arrays.
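As a rough illustration of this guidance, the sketch below estimates the size of one array from its shape and picks a backend accordingly. The helper names are hypothetical, and reading "20MB" as 20 * 1000 * 1000 bytes is an assumption.

```python
import math

LARGE_ARRAY_BYTES = 20 * 1000 * 1000  # assumed reading of the ">20MB" threshold


def array_nbytes(shape, itemsize):
    """Bytes needed for a dense array of the given shape and element size."""
    return math.prod(shape) * itemsize


def suggest_backend(multi_node, shape, itemsize=4):
    """Return (backend, note) following the guidance above (hypothetical helper)."""
    if not multi_node:
        return 'zarr', ''
    if array_nbytes(shape, itemsize) > LARGE_ARRAY_BYTES:
        return 'redis', 'arrays exceed ~20MB; expect redis to slow down'
    return 'redis', ''


single = suggest_backend(False, (10, 10000, 200))
distributed = suggest_backend(True, (10, 10000, 200))
```

A 10x10000x200 float32 array takes 80,000,000 bytes, so the distributed case comes back with the warning note attached.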

Architectures without SSE2 and AVX operations

When using SimRecorder on x86 architectures where SSE2 and AVX CPU operations are not available, you should install numcodecs from source, since Blosc depends on these CPU optimizations by default.

For PPC architectures, install numcodecs from this fork
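To find out whether a machine falls into this category, one option on Linux is to look at the feature flags in /proc/cpuinfo. The sketch below is a minimal, Linux-only check. The function names are hypothetical, and the flag names `sse2` and `avx` are the ones Linux reports on x86.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(':')
        if name.strip() == 'flags':
            flags.update(value.split())
    return flags


def missing_simd(cpuinfo_text, required=('sse2', 'avx')):
    """Return the required flags that the CPU does not advertise."""
    present = cpu_flags(cpuinfo_text)
    return [flag for flag in required if flag not in present]


sample = "processor : 0\nflags : fpu sse sse2 ssse3\n"
print(missing_simd(sample))  # prints ['avx']
```

On a real machine the text would come from reading /proc/cpuinfo. A non-empty result suggests building numcodecs from source as described above.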

Performance benchmarks

These numbers come from a single run on an NFS disk, on an Intel Xeon machine, with default parameters, for a 4-D array of 100x10x10000x200 float32 values. They are for comparative purposes only! You can run your own tests using the tests/time_* scripts.

| Backend | Total write time (s) | Mean write time (s) | Slicing mean read time (s) | Total read time (s) | Size on disk (GB) |
|---|---|---|---|---|---|
| Zarr (lmdb + blosc) | 184.51 | 1.8236 | 0.0050 | 25.72 | 14 |
| HDF5 | 140.69 | 1.3530 | 0.1167 | 145.30 | 15 |
| redis (PyArrow) [1] | 267.82 | 1.3794 | NA | 68.00 | 48 |
| redis (pickle) [1] | 305.24 | 1.7668 | NA | 66.75 | 40 |

[1] Redis doesn't support larger-than-memory array access. The total write time is larger than the number of arrays times the mean write time because redis takes some time to write everything to disk and shut down at the end.
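The gap mentioned in the footnote can be worked out from the table. The sketch below assumes 100 writes per run (one per slice along the first axis of the 100x10x10000x200 array). That count is an assumption, though it is consistent with the Zarr and HDF5 rows, where total is close to 100 times the mean.

```python
# Figures copied from the table above: (total write time, mean write time) in seconds.
WRITE_TIMES = {
    'Zarr (lmdb + blosc)': (184.51, 1.8236),
    'HDF5': (140.69, 1.3530),
    'redis (PyArrow)': (267.82, 1.3794),
    'redis (pickle)': (305.24, 1.7668),
}
N_WRITES = 100  # assumption: one write per slice along the first axis


def write_overhead(total, mean, n_writes=N_WRITES):
    """Seconds of total write time not explained by n_writes * mean."""
    return total - n_writes * mean


overheads = {name: round(write_overhead(total, mean), 2)
             for name, (total, mean) in WRITE_TIMES.items()}
raw_gb = 100 * 10 * 10000 * 200 * 4 / 1e9  # float32 = 4 bytes per value
```

Under that assumption, the unexplained time is a few seconds for Zarr and HDF5 but roughly 130 s for both redis variants, which matches the footnote's point about redis persisting to disk and shutting down at the end. The raw array itself is 8 GB (decimal) before any compression or serialization overhead.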
