goetia is a C++ library and software package for streaming analysis of de Bruijn graphs, de Bruijn graph compaction, and genome sketching. The C++ library is fully available through Python via bindings generated by cppyy. The primary goals of goetia and its algorithms are:
- Analyse data completely on-line with streaming methods,
- Use as little of the data as possible.
This library is a work in progress and under rapid development. Some current usage examples can be found in the examples/ directory and can be launched with Binder using the badge above.
conda is the supported installation environment. Within a conda environment, install with:
conda install goetia
This will install the goetia Python package, the libgoetia shared library, and its headers into $CONDA_PREFIX. With the environment activated, you can import goetia in Python or link against the C++ library with -lgoetia.
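As a quick sanity check after installation, the following minimal sketch looks for the shared library under the active environment's lib directory. The helper name find_goetia_libs is hypothetical and not part of goetia, and the assumption that the library file name starts with "libgoetia" is inferred from the -lgoetia link flag rather than confirmed by the project.

```python
import os
from pathlib import Path


def find_goetia_libs(prefix=None):
    """List files named libgoetia* under <prefix>/lib (empty list if none)."""
    # Fall back to the active conda environment when no prefix is given.
    if prefix is None:
        prefix = os.environ.get("CONDA_PREFIX")
    if not prefix:
        return []
    lib_dir = Path(prefix) / "lib"
    if not lib_dir.is_dir():
        return []
    # Assumption: the shared library's file name starts with "libgoetia";
    # the suffix (.so, .dylib, ...) depends on the platform.
    return sorted(lib_dir.glob("libgoetia*"))


if __name__ == "__main__":
    libs = find_goetia_libs()
    print(libs if libs else "no libgoetia found under $CONDA_PREFIX/lib")
```

An empty result most likely means the environment is not activated or the package was installed into a different environment.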
To build and install from source, first clone the repo:
git clone https://github.com/camillescott/goetia && cd goetia
Create the conda environment. There is a Makefile target to generate the environment; it uses mamba, but this can be overridden by setting CONDA_FRONTEND to conda. The resulting environment is called goetia-dev and is defined in environment_dev.yml.
make create-dev-env
conda activate goetia-dev
Then build and install:
make install
The install target will build the C++ library and cppyy bindings, install the shared library and headers into $CONDA_PREFIX/lib and $CONDA_PREFIX/include respectively, and install the associated Python modules into the conda environment.
To install in-place, run:
make dev-install
This will use python -m pip install -e . to allow in-place editing of the Python sources. However,
changes to the C++ source will not be propagated, as the shared library has to be rebuilt.
Run make install again to recompile and reinstall the headers and shared library.
Tests are written with pytest; the full suite can be run with:
pytest tests/
The test suite uses pytest-benchmark to gather performance information on some functions. This
adds significant extra time to a number of tests. Benchmarking can be bypassed by running make test
or, explicitly, by running:
pytest --benchmark-disable tests/
Much of the de Bruijn graph test data is randomly generated; i.e., we fuzz the library. This helps
find edge cases, but means a given test run may not be reproducible on its own. To allow
reproducibility, we use the pytest-randomly plugin, which manages random seed state and test
ordering. When pytest is run, the random seed is reported toward the beginning of the output, in the form:
Using --randomly-seed=2507050705
To rerun with a specific seed, run pytest with the appropriate flag:
pytest --randomly-seed=[DESIRED_SEED]
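When a fuzzed test fails in a long log, it can be handy to pull the seed out automatically and build the rerun command from it. The sketch below does that with a regular expression matching the line format shown above. The function names extract_seed and rerun_command are illustrative helpers, not part of goetia or its test suite; the two flags are the ones already described in this section.

```python
import re

# Matches the seed line that pytest-randomly prints near the top of the output.
SEED_PATTERN = re.compile(r"Using --randomly-seed=(\d+)")


def extract_seed(pytest_output):
    """Return the reported seed as an int, or None if no seed line is found."""
    match = SEED_PATTERN.search(pytest_output)
    return int(match.group(1)) if match else None


def rerun_command(seed, test_path="tests/", disable_benchmarks=True):
    """Build an argument list that reruns the suite with a fixed seed."""
    cmd = ["pytest", f"--randomly-seed={seed}"]
    if disable_benchmarks:
        # Skip the pytest-benchmark timing runs to keep the rerun fast.
        cmd.append("--benchmark-disable")
    cmd.append(test_path)
    return cmd


if __name__ == "__main__":
    log = "platform linux\nUsing --randomly-seed=2507050705\ncollected 10 items\n"
    seed = extract_seed(log)
    print(" ".join(rerun_command(seed)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues.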