Skip to content

Track your datasets and files, what created them, when and from what

Notifications You must be signed in to change notification settings

drmarkwslater/DatasetTracker

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Dataset Tracker

This is a basic dataset tracker for managing large numbers of files on a local system. You can keep track of when a file was produced, by what and what it's heirachy is as well.

Installation

At present, you'll need to clone the repo and then edit the paths appropriately:

export PYTHONPATH=<git-repo-home>/python:$PYTHONPATH
export PATH=<git-repo-home>/bin:$PATH

Usage

The first thing you need to do is create a repository to store the metadata about the files. Assuming you're OK with the default of ~/.dstrk then just do:

dstrk initDB

You can now start adding data! If you want to just play around, run the data creation script that will just produce a load of test data and a test git repo:

python tests/create_test_areas.py

Let's start by adding some data to our first dataset. Doing the following will store the info about the files as a dataset entry in the DB:

dstrk addDS ~/.dstrk-test-data/step_1/*.txt --tags "Info about step 1"

You can retrieve that info by just querying one of the files:

dstrk DSinfo ~/.dstrk-test-data/step_1/part1.txt

Note that this is NOT dependent on file location - if you move the files around this will still work!

But what if you wanted to record info from a git repository that was used to create the data? Easy:

dstrk addDS ~/.dstrk-test-data/step_1/*.txt --gitinfo ~/.dstrk-test-git

and you get appropriate stuff recorded in the DB:

dstrk DSinfo ~/.dstrk-test-data/step_1/part1.txt

Note that we have just added the same data to two different datasets. This could happen quite a bit if the exact same file is created by two different versions of software. dstrk will always tell you what datasets a file belongs to but will default to using the latest added one.

Let's say you have a chain of analysis steps - how do you preserve the hierachy? Using the '--parentDS' option:

dstrk addDS ~/.dstrk-test-data/step_2/*.txt --tags "Info about step 2" --parentDS <hashes_of_parent_DSs>

The has for a DS can be found using the dstrk DSInfo command above. To give an idea of the hierachy, do the following:

dstrk tree ~/.dstrk-test-data/step_2/part1.txt

Finally, you can remove datasets or files from the DB using the 'delDS' and 'delfiles' options.

About

Track your datasets and files, what created them, when and from what

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages