
Next-Gen Pipeline Plan

Synopsis

1. Input: The input to the pipeline will be "raw data". Raw data is everything that cannot be regenerated by running a program. In our case, it is FITS files and log files from the instrument, external data products such as SFD dust maps, and any human-generated metadata such as redshift determinations (if necessary).

2. Pipeline definition: The processing steps in the pipeline will be defined as a directed acyclic graph (DAG), which will specify how to create any pipeline product for any subset of data, starting from raw data. Makefiles are an example of a syntax for specifying such a DAG.

3. File-based: A "pipeline product" is a file, so that the existence or non-existence of a file can be used to determine whether a given step has been run. (A database is not used to track this information, which avoids the fragile, stateful, mutable database of the current pipeline.)

4. Pipeline backend: How each pipeline step is actually executed has not yet been determined, but ideally will be implemented separately from the pipeline definition DAG, in order to allow switching between backends.

5. Pipeline components: Individual executables used in the pipeline will be moved from the SNFactory CVS repository into separate git repositories (hosted on GitHub) and refactored as needed.

6. Webpage frontend & database: Web-based tools & a database that enable efficient querying of the raw data and/or pipeline products would be layered on top of this file-based pipeline framework, as needed. Databases would thus be used only as an efficient "view" of on-disk file contents and could always be fully regenerated from scratch by a script.

1. Raw Data

For "push-button" processing, it's important to clearly delineate what is really raw (immutable) data and what is not. Here are a few examples to illustrate the point:

  • The Milky Way E(B-V) value for a supernova is not raw data. The raw data are the SN location (or telescope pointing) and the E(B-V) dust map (an external data product, which is raw as far as we are concerned). Separating these makes it easier to, for example, use a different dust map; see the sketch following this list.

  • Human-determined redshift values should be treated as raw data, as they cannot be regenerated by a program. They should be saved with metadata such as who determined the value, when it was determined, and what data it was based on.
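
To make the first example concrete, the following is a minimal sketch of how a derived E(B-V) value could be computed on demand from raw inputs. It assumes the sfdmap Python package and a local copy of the SFD dust map files; the path and the coordinates are placeholders.

import sfdmap

# Raw inputs: the SN coordinates and the on-disk SFD dust map files.
dustmap = sfdmap.SFDMap('/path/to/sfd/maps')  # placeholder path to the map files
ra, dec = 214.91, 52.93                       # placeholder SN coordinates (degrees)

# Derived quantity: not raw data, since it can always be regenerated.
ebv = dustmap.ebv(ra, dec)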

As of 25 February 2016, the raw data consisted of 5,920 nightly tar files containing 926,646 files and totaling 13.24 TB.

2. Pipeline definition

The goal is that if one has all the raw data and all the necessary software, one should be able to run the entire pipeline with one command, e.g. make idr. For a pipeline with multiple steps, a DAG that specifies for each step the input files, the command to run, and the expected output files is a natural and widely used way to accomplish this. Make is just one example of such a system. In Make, steps are specified in a Makefile as:

output : input1 input2 ...
       command [argument1 ...]

There are many other build systems that accomplish the same thing. One Python-based example is doit, where steps are generated programmatically in a Python script:

def task_stepname():
    return {'targets': [output1, ...],
            'file_dep': [input1, ...],
            'actions': ['command ...', ...]}
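
As a concrete (and purely hypothetical) illustration, a doit task for a single pipeline step might look like the sketch below. The file layout and the cubefit command line are placeholders, not the actual pipeline interface.

# Hypothetical task: fit the data cubes for one supernova.
def task_cubefit():
    sn = 'SN2006xx'  # placeholder supernova name
    return {'targets': ['products/%s_fit.fits' % sn],
            'file_dep': ['raw/%s_cubes.fits' % sn, 'config/%s.json' % sn],
            'actions': ['cubefit config/%s.json products/%s_fit.fits' % (sn, sn)]}

Running doit would then execute only those tasks whose targets are missing or whose file dependencies have changed since the last run.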

Two DAG-based workflow systems of note are FireWorks and Swift. These are reportedly being officially supported at NERSC (see "pipeline backend" below). Makeflow is another example, directly inspired by Make, but with the ability to target multiple backends, crucially including SLURM.

awesome-pipeline has a comprehensive listing of pipeline tools, including the above.

Whichever format/tool we use, it should be minimal (easy for a human to understand) and relatively agnostic with respect to the backend. This will let us switch backends or even tools in the future with minimal, straightforward work. I anticipate that the hard work will be specifying the DAG. Once specified, it should be straightforward to write it in the syntax of the tool of choice.

Handling different processing versions

One desirable feature that doesn't seem to be handled by any existing workflow tools is intelligent versioning of pipeline products. We often want to run our pipeline with a new version of some software component and compare the outputs between the new and old version. To the user, this should appear as specifying the entire software stack and rerunning the entire pipeline. The user should not have to worry about which steps actually depend on the software that changed and thus actually have to be rerun. This should all be handled transparently by the workflow tool.

To restate this: With available workflow tools, to handle multiple processing versions, one could place all the old pipeline outputs under one directory v1, move to a new directory v2, and literally rerun the entire pipeline with the new version of the software. However, this doesn't take advantage of the fact that many steps might not actually need to be rerun - if the software used in the first step of the pipeline hasn't changed between v1 and v2, the output of that step from v1 can be reused in v2.
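
The sketch below illustrates the kind of reuse logic this would require, assuming each step declares which software packages it uses; the step names, package lists, and manifest format are hypothetical. It also ignores one thing a real implementation would need: a step must additionally be rerun if any upstream step feeding it was rerun, which would have to be propagated through the DAG.

def steps_to_rerun(step_deps, manifest_old, manifest_new):
    """Return the steps whose outputs cannot be reused from the old
    processing version because some software they use changed version."""
    return {step for step, packages in step_deps.items()
            if any(manifest_old.get(p) != manifest_new.get(p) for p in packages)}

# Hypothetical example: only cubefit changed between the two processing
# versions, so only the 'fit' step must be rerun; the 'preprocess' output
# from the old version can be reused as-is.
step_deps = {'preprocess': ['numpy', 'fitsio'],
             'fit': ['numpy', 'cubefit']}
v1 = {'numpy': '1.9.2', 'fitsio': '0.9.2', 'cubefit': '0.3.0'}
v2 = {'numpy': '1.9.2', 'fitsio': '0.9.2', 'cubefit': '0.4.0'}
print(steps_to_rerun(step_deps, v1, v2))  # -> {'fit'}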

We may therefore eventually wish to incorporate this feature into an existing workflow tool.

If we do something like this, we should try to work with existing efforts and expertise so as not to re-invent the wheel. Also, we should first have a complete working pipeline in some existing workflow system. This will allow us to better understand our needs.

Ease of debugging

It should be easy to run subsets of the analysis (even down to a single executable command), with the ability to immediately throw away the results if desired. This is something the current pipeline fails at in a few ways:

  • The step granularity is coarse (the plan_[...].py scripts produce qsub scripts that contain many commands).
  • The system is assumed to run on a cluster, forcing one to go through the overhead of the queue system, even for small steps.
  • The fact that everything is saved in a database (even if tests can be segregated under a different run id) is a mental block to trying things out. (Compare the ease of git branch foo and git branch -D foo for trying out code changes in git.)

3. Pipeline backend

Initially we will target running the pipeline at NERSC on the Edison and Cori machines, both of which use the SLURM scheduler.

We may investigate using Shifter for deployment of the pipeline components, in order to address scalability problems such as long load times when running Python on many nodes. Shifter uses Docker images, so this would involve making a Docker image with the pipeline's entire software stack for each tagged production version.

4. Pipeline components

The pipeline components (individual Python scripts & libraries or compiled executables) will be reorganized into public git repositories to be hosted in the snfactory organization. This will involve running git cvsimport on subsets of the SNFactory CVS repository in order to do a one-time conversion of the CVS history.

There is an important difference from the old system regarding how versions of the pipeline processing are handled. In the old system, pipeline versions are linked to CVS repository tags, such as 02-02. This doesn't make sense when the pipeline is made up of software from several git repositories, and it creates complications for tracking versions of software not developed by the collaboration. The version control system should in fact be divorced from the processing environment. Instead, a processing tag should correspond to a list of software dependencies and the specific version of each. For example, if the pipeline were made up only of cubefit, a processing tag might be specified with a single file 02-02.txt with contents:

numpy==1.9.2
scipy==0.17.0
fitsio==0.9.2
pyfftw==0.9.1
fftw==3.3.4
cubefit==0.4.0

(The specific format of the file could be different from this, depending on the tool used to manage the software environment.)
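
As one illustration of how such a tag file might be used, the sketch below checks installed Python packages against the pinned versions. The "package==version" line format and the use of pkg_resources are assumptions; non-Python dependencies such as fftw would need a separate check.

import pkg_resources

def check_processing_tag(path):
    """Report packages whose installed version differs from the tag file."""
    mismatches = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            name, wanted = line.split('==')
            try:
                installed = pkg_resources.get_distribution(name).version
            except pkg_resources.DistributionNotFound:
                installed = None
            if installed != wanted:
                mismatches.append((name, wanted, installed))
    return mismatches

for name, wanted, installed in check_processing_tag('02-02.txt'):
    print('%s: tag requires %s, found %s' % (name, wanted, installed))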