Home
Pydpiper is a set of Python libraries and associated executables intended primarily for running image processing pipelines on compute grids. It provides a domain-specific language (DSL) for constructing pipelines out of smaller components, wrappers for numerous command-line tools (currently largely MINC-centric, but expanding to some NIfTI- and ITK-based tools), code for constructing common pipeline topologies, and command-line wrappers to run some core pipelines.
Pydpiper code can be used from within Python or packaged into an application and called from the shell. Roughly speaking, the process is as follows: first, executing Pydpiper code determines the overall topology of a pipeline and the filenames of the input and output files of each step, compiling a graph of "stages" to be scheduled for execution; second, the Pydpiper server spawns "executors" (either remote jobs on a compute grid or subprocesses on a local machine) which get stages (usually shell commands) from the server as their dependencies are satisfied and run them.
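As a rough illustration of this two-phase model (a minimal sketch only; the names and data structures below are hypothetical and are not Pydpiper's actual DSL), a stage graph and a dependency-respecting scheduler might look like:

```python
# Toy illustration of the stage-graph model: stages declare input and
# output filenames, dependencies are inferred from filename overlap, and
# a stage may run once its producers have finished.
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    cmd: str            # the shell command an executor would run
    inputs: frozenset   # filenames this stage reads
    outputs: frozenset  # filenames this stage writes

def runnable(stages, finished):
    """Unfinished stages whose inputs are either already produced or are
    original inputs (written by no stage at all)."""
    produced = {f for s in finished for f in s.outputs}
    all_outputs = {f for s in stages for f in s.outputs}
    return [s for s in stages if s not in finished
            and all(f in produced or f not in all_outputs for f in s.inputs)]

# Phase 1: compile the graph of stages.
stages = [
    Stage("blur img.mnc img_blur.mnc",
          frozenset({"img.mnc"}), frozenset({"img_blur.mnc"})),
    Stage("register img_blur.mnc avg.mnc out.xfm",
          frozenset({"img_blur.mnc", "avg.mnc"}), frozenset({"out.xfm"})),
]

# Phase 2: executors would pull ready stages from the server; here we
# simply simulate the resulting execution order in-process.
finished = []
while len(finished) < len(stages):
    for stage in runnable(stages, finished):
        print("running:", stage.cmd)
        finished.append(stage)
```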
Running the included check_pipeline_status.py script with a pipeline's <pipeline_name>_uri file as argument will provide a summary of running and finished stages, the number of running executors, and other information.
An important source of truth is the pipeline.log file created in the pipeline's output directory. You can control the logging level by setting the shell environment variable PYRO_LOGLEVEL (before program start) to one of DEBUG, INFO (the default), WARN, or ERROR. INFO reports information about stages starting and finishing, while WARN and ERROR will only report potential problems with execution.
The <pipeline_name>_finished_stages file contains a rather uninformative list of completed stages by number; beyond counting the lines in this file, you can join it (using, e.g., Python's pandas or R's tidyverse) against the <pipeline_name>_stages.txt file to determine which commands have run.
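For example, here is a sketch of such a join with pandas, assuming (check your own files) that the finished-stages file holds one stage number per line, that stage numbers correspond to line order in the stages file, and a hypothetical pipeline name of my_pipeline:

```python
# Sketch only: join the finished-stage numbers against the full stage list.
# Assumptions to verify against your own pipeline: one integer stage number
# per line in *_finished_stages, and *_stages.txt ordered by stage number.
import pandas as pd

finished = pd.read_csv("my_pipeline_finished_stages", header=None, names=["stage"])

with open("my_pipeline_stages.txt") as f:
    stages = pd.DataFrame({"command": [line.rstrip("\n") for line in f]})
stages["stage"] = stages.index  # assumed: line order == stage number

done = stages.merge(finished, on="stage")                    # commands that ran
pending = stages[~stages["stage"].isin(finished["stage"])]   # not yet run
print(f"{len(done)} of {len(stages)} stages finished")
```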
Each executor typically creates its own log file; these can be found in the logs/ subdirectory of the pipeline output directory, although it's sometimes a bit tedious to associate an executor with its stages. For the moment, grep is often a good option.
Individual stages also redirect their stdout/stderr to a log file; this location is typically reported in pipeline.log and at the command line in case of an error, but for single-output stages it is typically of the form [dir of output]/../log/[command producing output]/[output name without extension].log.
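As an illustrative sketch of that pattern (the exact convention may vary between tools; the command name and filenames below are hypothetical), such a log path can be reconstructed from an output filename:

```python
# Illustrative reconstruction of a stage's log path from its output file,
# following the pattern quoted above; verify against your own pipeline's layout.
import os

def stage_log_path(output_file: str, command_name: str) -> str:
    out_dir = os.path.dirname(output_file)
    base, _ext = os.path.splitext(os.path.basename(output_file))
    return os.path.join(out_dir, "..", "log", command_name, base + ".log")

# e.g. for a hypothetical minctracc output:
print(stage_log_path("pipeline_out/img/img_lsq6.mnc", "minctracc"))
# -> pipeline_out/img/../log/minctracc/img_lsq6.log
```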
- Keep your pipeline name (`--pipeline-name`) and ideally your input filenames relatively short. Our filename propagation is currently rather unwieldy, and longer paths risk running over certain program-specific filename length limits, preventing the pipeline from starting.
- In principle one can start additional executors (via `pipeline_executor.py --uri-file ... --num-executors ...`) from the command line, but as we rarely do this we're not certain how well it works.
The core pipeline applications currently included are:
- MBM.py
- MAGeT.py
- twolevel_model_building.py
- registration_chain.py
The pydpiper (version 1) wiki currently lives here: