
install_data_packages


Rethinking data packages with install

A result of a conversation with Stéfan over coffee.

Here's a snippet from the nilearn docs: http://nilearn.github.io/building_blocks/manipulating_mr_images.html#datasets:

>>> from nilearn import datasets
>>> haxby_files = datasets.fetch_haxby(n_subjects=1)
>>> print haxby_files.keys()
['mask_house_little', 'anat', 'mask_house', 'mask_face', 'func', 'session_target', 'mask_vt', 'mask_face_little']
>>> #  Path to first functional file
>>> print haxby_files.func[0]
/.../nilearn_data/haxby2001/subj1/bold.nii.gz

Here is the beginning of the fetch_haxby function:

def fetch_haxby(data_dir=None, n_subjects=1, fetch_stimuli=False,
                url=None, resume=True, verbose=1):
    """Download and loads complete haxby dataset

    Parameters
    ----------
    data_dir: string, optional
        Path of the data directory. Used to force data storage in a specified
        location. Default: None

    n_subjects: int, optional
        Number of subjects, from 1 to 6.

    fetch_stimuli: boolean, optional
        Indicate if stimuli images must be downloaded. They will be presented
        as a dictionary of categories.
    ...
    """

In fact, there is another function, fetch_haxby_simple, that fetches a different subset of the Haxby data.

This points to the use case of downloading and storing only a subset of the data within a given data package.

Therefore data packages can have groups. These groups can be fetched individually, if the upstream archive of data packages allows it.

In the case of the Haxby dataset above, the known groups might be ['subject1', 'subject2', 'subject3', 'subject4', 'subject5', 'subject6', 'stimuli'].
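
As a sketch of how groups might be declared, a package manifest could map each group to its files and their checksums. The manifest layout, file names and placeholder digests below are all hypothetical, not a fixed format:

# Hypothetical manifest for haxby==0.1: each group lists its files with
# per-file MD5 checksums plus a group-level MD5. Values are placeholders.
HAXBY_MANIFEST = {
    'name': 'haxby',
    'version': '0.1',
    'groups': {
        'subject1': {
            'md5': '<group md5 hex digest>',
            'files': {
                'subj1/bold.nii.gz': '<file md5 hex digest>',
                'subj1/anat.nii.gz': '<file md5 hex digest>',
            },
        },
        # ... subject2 through subject6 in the same form ...
        'stimuli': {
            'md5': '<group md5 hex digest>',
            'files': {'stimuli/categories.csv': '<file md5 hex digest>'},
        },
    },
}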

Imagine that haxby==0.1 is a known data package in some upstream archive. It is available as a directory on a web server, perhaps at URL http://my.server.org/datapackages/haxby==0.1.

The user wants to fetch only a part of this data package - say only the files for subject 2.

They first tell datatool where to look:

datatool add-archive http://my.server.org/datapackages
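
One minimal way add-archive could work is to record the URL in a plain-text registry that later fetches consult. The registry location and function name here are assumptions, not a settled design:

import os

REGISTRY = os.path.expanduser('~/.datatool/archives.txt')  # assumed location

def add_archive(url):
    # Record the archive URL once, creating the registry on first use.
    os.makedirs(os.path.dirname(REGISTRY), exist_ok=True)
    known = []
    if os.path.exists(REGISTRY):
        with open(REGISTRY) as f:
            known = f.read().splitlines()
    if url not in known:
        with open(REGISTRY, 'a') as f:
            f.write(url + '\n')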

When they need the data, they ask for it:

datatool fetch haxby subject2

This checks that the package haxby exists in a known location and fetches the data corresponding to group subject2.
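
A rough sketch of that fetch step, assuming archives lay packages out as <archive>/<name>==<version>/<relative-path>; the layout and the function name are assumptions:

import os
from urllib.request import urlretrieve

def fetch_group(archive_urls, manifest, group, dest_dir):
    # Download only the files the manifest lists for this group.
    files = manifest['groups'][group]['files']
    pkg_dir = '%s==%s' % (manifest['name'], manifest['version'])
    archive = archive_urls[0]  # trying further archives on failure elided
    for relpath in files:
        target = os.path.join(dest_dir, relpath)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        urlretrieve('%s/%s/%s' % (archive, pkg_dir, relpath), target)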

If they want the whole thing:

datatool fetch haxby all

When fetching, datatool has to verify the stored group MD5 against an MD5 calculated over the MD5 signatures of the individual files.
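
For example, if the group MD5 is defined as the MD5 over the per-file hex digests taken in sorted file-name order (that definition is an assumption here), the check could look like:

import hashlib

def file_md5(path, blocksize=2 ** 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return h.hexdigest()

def group_md5(paths):
    # Hash the sorted per-file digests to get one group signature.
    h = hashlib.md5()
    for path in sorted(paths):
        h.update(file_md5(path).encode('ascii'))
    return h.hexdigest()

# After fetching, compare against the stored value:
# assert group_md5(fetched_paths) == stored_group_md5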

In Python, if the user tries to access a resource from a group they have not yet fetched, the library raises an informative error:

>>> import datatool
>>> # Instantiate a local virtual copy of the dataset
>>> haxby = datatool.get_package('haxby')
>>> bold_fname = haxby['subject2']['session1']['bold'].filename
DataToolError: please fetch 'subject2' group

>>> haxby.fetch('subject2')
>>> bold_fname = haxby['subject2']['session1']['bold'].filename

The fetch is a no-op if the files are already present, unless you specify the --force flag.
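
A rough sketch of how the lazy access and the no-op fetch could fit together; the class, the exception, and the underscored helpers are all invented for illustration:

class DataToolError(Exception):
    pass

class Package:
    def __init__(self, manifest, data_dir):
        self.manifest = manifest
        self.data_dir = data_dir
        self.fetched = set()  # groups whose files are verified on disk

    def __getitem__(self, group):
        # Accessing an unfetched group raises rather than downloading
        # behind the user's back.
        if group not in self.fetched:
            raise DataToolError("please fetch %r group" % group)
        return self._group_contents(group)  # nested resource dict (elided)

    def fetch(self, group, force=False):
        # No-op when the group is already present, unless forced
        # (mirroring the command line --force flag).
        if group in self.fetched and not force:
            return
        self._download_group(group)  # download files for group (elided)
        self._verify_group(group)    # check MD5s as above (elided)
        self.fetched.add(group)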

The haxby==0.1 dataset is a web directory. We might have another dataset that is only stored as a zip file, say money-csvs-0.5.zip. Say it has groups ['money_things_a', 'money_things_b'].

In the case of this single zip file package storage, datatool may not be able to fetch the data package groups separately. So, if you do this:

datatool fetch money-csvs money_things_a

then datatool has no choice but to download and unpack the whole zip file. In this case datatool emits a warning like "No method for downloading individual groups from money-csvs, fetching whole package".
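
Sketched out, the fallback might look like this; the capability check and download helpers are invented names, standing in for whatever datatool ends up using:

import warnings
import zipfile

def fetch_any(package, group, dest_dir):
    # Prefer a per-group download; fall back to the whole archive when
    # the storage format offers nothing finer-grained.
    if package.supports_group_fetch():
        package.download_group(group, dest_dir)
        return
    warnings.warn('No method for downloading individual groups from '
                  '%s, fetching whole package' % package.name)
    zip_path = package.download_zip(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)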

Questions

  • Which version should datatool fetch haxby all get? Probably always the latest version;
  • Where does the package data go? There has to be a default output directory. Maybe take the directory at the top of the package search (container) path by default, and let the user specify another directory via something like datatool download-path /path/to/some-directory; see the sketch below.
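
A sketch of that default, assuming datatool keeps an ordered search path and an optional user-configured download path (both parameter names are invented):

def download_dir(search_path, configured=None):
    # Prefer an explicit setting (e.g. from `datatool download-path ...`),
    # falling back to the top of the package search (container) path.
    if configured is not None:
        return configured
    return search_path[0]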