install_data_packages
A result of a conversation with Stéfan over coffee.
Here's a snippet from the nilearn docs: http://nilearn.github.io/building_blocks/manipulating_mr_images.html#datasets:
```python
>>> from nilearn import datasets
>>> haxby_files = datasets.fetch_haxby(n_subjects=1)
>>> print haxby_files.keys()
['mask_house_little', 'anat', 'mask_house', 'mask_face', 'func', 'session_target', 'mask_vt', 'mask_face_little']
>>> # Path to first functional file
>>> print haxby_files.func[0]
/.../nilearn_data/haxby2001/subj1/bold.nii.gz
```
Here is the beginning of the `fetch_haxby` function:
```python
def fetch_haxby(data_dir=None, n_subjects=1, fetch_stimuli=False,
                url=None, resume=True, verbose=1):
    """Download and loads complete haxby dataset

    Parameters
    ----------
    data_dir: string, optional
        Path of the data directory. Used to force data storage in a
        specified location. Default: None
    n_subjects: int, optional
        Number of subjects, from 1 to 6.
    fetch_stimuli: boolean, optional
        Indicate if stimuli images must be downloaded. They will be
        presented as a dictionnary of categories.
    ...
    """
```
In fact there is another function `fetch_haxby_simple` that fetches another subset of the Haxby data.
This points out the use case of downloading and storing only a subset of the data within a given data package.
Therefore data packages can have groups. These groups can be fetched individually, if the upstream archive of data packages allows it.
In the case of the Haxby dataset above, the known groups might be `['subject1', 'subject2', 'subject3', 'subject4', 'subject5', 'subject6', 'stimuli']`.
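One way a package could advertise its groups is through a manifest shipped alongside the archive. Here is a sketch of what that might contain, assuming a hypothetical manifest format (none is specified above), with elided placeholder MD5s:

```python
# Hypothetical manifest for haxby==0.1; the layout is an assumption,
# not part of any existing datatool specification.
HAXBY_MANIFEST = {
    "name": "haxby",
    "version": "0.1",
    "groups": {
        # each group maps relative file paths to per-file MD5 signatures
        "subject1": {"subj1/bold.nii.gz": "...", "subj1/anat.nii.gz": "..."},
        "subject2": {"subj2/bold.nii.gz": "..."},
        "stimuli":  {"stimuli/faces/face1.jpg": "..."},
    },
}
```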
Imagine that `haxby==0.1` is a known data package in some upstream archive. It is available as a directory on a web server, perhaps at URL `http://my.server.org/datapackages/haxby==0.1`.
The user wants to fetch only a part of this data package - say, only the files for subject 1. They first tell `datatool` where to look:

```
datatool add-archive http://my.server.org/datapackages
```
When they need the data, they ask for it:

```
datatool fetch haxby subject1
```

This checks that the package `haxby` exists in a known location and fetches the data corresponding to group `subject1`.
If they want the whole thing:

```
datatool fetch haxby all
```
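Under the hood, resolving a package name against the registered archives might work something like this. This is a sketch, assuming archives are plain web directories and that `datatool` keeps a simple registry of archive URLs (both the registry and the `find_package` helper are assumptions, not an existing API):

```python
import urllib.error
import urllib.request

# Hypothetical registry, filled in by `datatool add-archive <url>`.
ARCHIVES = ["http://my.server.org/datapackages"]

def find_package(name, version):
    """Return the URL of the first known archive serving the package."""
    for base in ARCHIVES:
        url = "%s/%s==%s" % (base, name, version)
        try:
            # Simple existence check; a real tool might issue a HEAD
            # request or consult an index file instead.
            urllib.request.urlopen(url)
            return url
        except urllib.error.URLError:
            continue
    raise LookupError("package %r not found in any known archive" % name)
```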
When fetching, `datatool` has to verify the stored group MD5 against an MD5 calculated over the MD5 signatures of the individual files.
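For concreteness, the group signature could be the MD5 of the sorted per-file MD5 digests. Here is a minimal sketch with `hashlib`, assuming that definition (the exact recipe for combining per-file signatures is not specified above):

```python
import hashlib

def file_md5(path, chunk_size=2 ** 20):
    """MD5 of one file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def group_md5(paths):
    """Group signature: MD5 over the sorted per-file digests."""
    h = hashlib.md5()
    for path in sorted(paths):
        h.update(file_md5(path).encode('ascii'))
    return h.hexdigest()

def verify_group(stored_md5, paths):
    return group_md5(paths) == stored_md5
```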
In Python, if the user tries to access a resource that has not yet been fetched, the library raises an informative error:

```python
>>> import datatool
>>> # Instantiate a local virtual copy of the dataset
>>> haxby = datatool.get_package('haxby')
>>> bold_fname = haxby['subject2']['session1']['bold'].filename
DataToolError: please fetch 'subject2' group
>>> haxby.fetch('subject2')
>>> bold_fname = haxby['subject2']['session1']['bold'].filename
```
The fetch is a no-op if the files are already present, unless you specify the `--force` flag.
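Both behaviors, the informative error on unfetched groups and the idempotent fetch, could live in a package object along these lines. This is a minimal sketch with the class shape inferred from the session above; nothing here is an existing implementation:

```python
class DataToolError(Exception):
    pass

class Package:
    def __init__(self, name, groups):
        self.name = name
        self._groups = groups     # group name -> tree of file records
        self._fetched = set()     # groups whose files are on disk

    def fetch(self, group, force=False):
        # No-op when the files are already present, unless forced
        # (the CLI --force flag would map to force=True here).
        if group in self._fetched and not force:
            return
        # ... download the group's files and verify their MD5s ...
        self._fetched.add(group)

    def __getitem__(self, group):
        if group not in self._fetched:
            raise DataToolError("please fetch %r group" % group)
        return self._groups[group]
```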
The `haxby==0.1` dataset is a web directory. We might have another dataset that is only stored as a zip file, say `money-csvs-0.5.zip`. Say it has groups `['money_things_a', 'money_things_b']`.
In this case of single-zip-file package storage, `datatool` may not be able to fetch the data package groups separately. So, if you do this:
```
datatool fetch money-csvs money_things_a
```
then `datatool` has no choice but to download and unpack the whole zip file. In this case `datatool` emits a warning like `No method for downloading individual groups from money-csvs, fetching whole package`.
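That fallback could be as simple as checking a per-package capability flag. A sketch, assuming the archive metadata records whether per-group fetching is possible (the `supports_group_fetch` flag is invented for illustration):

```python
import warnings
from dataclasses import dataclass

@dataclass
class PackageInfo:
    name: str
    supports_group_fetch: bool  # False for single-zip-file storage

def fetch(pkg, group="all"):
    """Fetch one group, falling back to the whole package if needed."""
    if group != "all" and not pkg.supports_group_fetch:
        warnings.warn("No method for downloading individual groups "
                      "from %s, fetching whole package" % pkg.name)
        group = "all"
    # ... download and unpack either the group or the whole archive ...

fetch(PackageInfo("money-csvs", supports_group_fetch=False), "money_things_a")
```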
- Which version, for `datatool fetch haxby all`? Probably always the latest version.
- Where does the package data go? There has to be a default output directory. Maybe take the directory at the top of the package search (container) path by default, where the user can specify another directory, via something like `datatool download-path /path/to/some-directory`.
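That default could be resolved in one place. A sketch, assuming a configurable search path and a user override stored by `datatool download-path` (both mechanisms are assumptions):

```python
import os

# Hypothetical package search (container) path.
SEARCH_PATH = [os.path.expanduser("~/datapackages"), "/usr/share/datapackages"]

_user_download_path = None  # set by `datatool download-path <dir>`

def download_path():
    """Directory where fetched packages land: the user's explicit
    choice if set, else the top of the package search path."""
    return _user_download_path or SEARCH_PATH[0]
```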