Workflow - Scientist #16

betatim · 2016-02-13T08:54:26Z

Outline the envisioned workflow for a scientist. With this we can build a better idea of what needs teaching, blue-printing, etc

First suggestion for a workflow:

start a new data analysis by creating an empty directory
type openscience init to create a skeleton
- runs git init, creates a "sensible" Dockerfile
- sets up aliases for running things in the docker container?
create code, run it with openscience run <cmd>, which executes it inside the docker container
create a notebook or .md with code blocks that mixes narrative with steps for reproducing parts of the analysis
git commit all along
push the repo to GitHub at some point(?)
as the analysis comes to an end, create a new ipynb/md that is the paper, preview it with openscience paper(?)
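
A minimal sketch of what openscience run <cmd> might do under the hood, assuming the image has already been built from the repository's Dockerfile. The function name, the image tag and the /workdir mount point are all hypothetical; only the docker run flags (--rm, -v, -w) are standard Docker CLI options.

```python
import os
import shlex


def docker_run_args(cmd, image="openscience-analysis", project_dir="."):
    """Build the argument list for running `cmd` inside the container.

    `image` and the /workdir mount point are illustrative assumptions,
    not something the proposal has settled on.
    """
    project_dir = os.path.abspath(project_dir)
    return [
        "docker", "run", "--rm",
        "-v", f"{project_dir}:/workdir",  # share the analysis directory
        "-w", "/workdir",                 # run the command from there
        image,
        *shlex.split(cmd),
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so this runs without Docker.
    print(" ".join(docker_run_args("python analysis.py --fast")))
```

Returning a list rather than a shell string means the result could be handed to subprocess.run without any quoting worries.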

(I will edit this entry as we iterate)


betatim · 2016-02-22T21:29:51Z

A minimal git repository that would work as "executable paper" for the "Great icecream preference study of 2016":

icecream-prefs/
|-> icecream/
|   \-> ... library code ...
|-> data/  # include directly for small data or mount point for data-volumes
|-> Dockerfile  # how to setup the environment
|-> paper.{ipynb,md}  # the executable paper
|-> travis.yml  # CI instructions

The paper.{ipynb,md} drives the analysis, all the heavy lifting is done somewhere inside icecream/.

We can provide the openscience command-line tool that creates this layout and uses it to let you run stuff locally in the docker container described in the Dockerfile.
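
As a rough sketch, creating that layout could look like the following. The function name and the placeholder file contents are made up for illustration, and the git init step from the first suggestion is left out so the example has no external dependencies.

```python
import tempfile
from pathlib import Path

# Hypothetical placeholder contents; a real tool would write something useful.
SKELETON_FILES = {
    "Dockerfile": "# how to set up the environment\n",
    "paper.md": "# The executable paper\n",
    "travis.yml": "# CI instructions\n",
}


def create_skeleton(root, package):
    """Create the directory layout shown above under `root`."""
    root = Path(root)
    (root / package).mkdir(parents=True, exist_ok=True)  # library code
    (root / "data").mkdir(exist_ok=True)  # small data or mount point
    for name, content in SKELETON_FILES.items():
        target = root / name
        if not target.exists():  # never overwrite the scientist's work
            target.write_text(content)
    return sorted(path.name for path in root.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(create_skeleton(Path(tmp) / "icecream-prefs", "icecream"))
```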

betatim · 2016-02-22T21:30:16Z

ping @ctb (who should start watching this repo for all the notifications all the time)

betatim · 2016-02-23T14:19:00Z

What you would see as the publication is the rendered version of paper.{md,ipynb} with the ability to edit it and re-run. paper.md is like a README for producing the conclusions of the paper but without having to copy and paste stuff. So it might well contain code chunks that say

Figure 4 shows that clear chocolate is the best flavour. To run the extended analysis on chocolate we run:
```
make step23
```
Which produces the following table:
...
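
To make the "README without copy and paste" idea concrete, here is a small sketch of how a tool might pull the code chunks out of a paper.md before running them. The function name is hypothetical and the regular expression only handles simple backtick fences; a real renderer would more likely lean on an existing markdown or notebook library.

```python
import re
import textwrap

FENCE = "`" * 3  # built this way so the example can itself sit inside a fence

# A fenced chunk: opening fence with optional language tag, body, closing fence.
CHUNK_RE = re.compile(
    r"^[ \t]*" + FENCE + r"[ \t]*(\w*)[ \t]*\n(.*?)^[ \t]*" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def code_chunks(markdown_text):
    """Return (language, code) pairs for every fenced chunk, in order."""
    return [
        (lang, textwrap.dedent(body).strip("\n"))
        for lang, body in CHUNK_RE.findall(markdown_text)
    ]


paper = "\n".join([
    "To run the extended analysis on chocolate we run:",
    FENCE,
    "make step23",
    FENCE,
    "Which produces the following table:",
])

if __name__ == "__main__":
    for lang, code in code_chunks(paper):
        print(repr(lang), repr(code))
```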

betatim · 2016-02-23T14:25:51Z

The entry point would always be to-be-invented-execute.sh paper.*. This should do all the things that need doing if placed inside the docker container built according to the Dockerfile in the repo. I could be persuaded that we need a .analysis.yaml, but right now I am not sure, given you can do whatever you want in the Dockerfile.

We should provide a base docker image that contains to-be-invented.sh and other useful things like the jupyter kernel+plumbing needed to run an Rmarkdown, pythonmarkdown, or ipynb document.

travis.yml could be as simple as instructing travis to build our container from the Dockerfile and then execute to-be-invented.sh paper.* as well as a step to upload the rendered version. We should provide a template for this travis.yml so people can set it up.
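
A sketch of what such a template could generate, written as a small Python function that renders the configuration text. The image tag is a made-up placeholder, the entry point is the still-to-be-invented script, and the upload step is left out because nothing about it has been decided. Treat the output as a fragment to adapt, not a complete CI setup.

```python
def travis_config(image="paper-image", entry_point="to-be-invented.sh", paper="paper.md"):
    """Render a minimal Travis CI configuration fragment as text.

    All three arguments are placeholders chosen for this illustration.
    """
    lines = [
        "services:",
        "  - docker",
        "script:",
        f"  - docker build -t {image} .",
        f"  - docker run --rm {image} {entry_point} {paper}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(travis_config(), end="")
```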

ctb · 2016-02-23T14:26:17Z

Yes, good stuff!

My concern is that if this is the only allowed structure and workflow (as opposed to merely a strongly recommended one) we will automatically lose most potential early adopters - essentially, anyone who is already doing their own thing in this area. With the specfile idea we could allow a much broader range of repo structure/workflow (and provide a Web site to build the spec by inspecting a repo), while using the above as a specific structure & workflow for demo purposes.

ctb · 2016-02-23T14:28:56Z

A strong -1 on it being a shell script - something declarative offers many more opportunities for simplicity and introspection and composition. If procedural (like a Dockerfile or a shell script) then we need to run it to find out what it does. With a YAML spec, it could specify what resources need to be present, along with inputs and outputs, and then everything (travis.yml) could be produced from that, no?
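
To illustrate the introspection argument: with a declarative description, questions like "what has to be present before we start?" can be answered without executing anything. The spec below is written as the dict a YAML loader would return, and every key in it is hypothetical.

```python
# A hypothetical spec; none of these keys have been agreed on.
spec = {
    "steps": [
        {"name": "fit", "inputs": ["data/raw.csv"], "outputs": ["fit.json"]},
        {"name": "plot", "inputs": ["fit.json"], "outputs": ["figure4.png"]},
    ],
}


def external_inputs(spec):
    """Files that must exist up front: inputs that no step produces."""
    produced = {out for step in spec["steps"] for out in step["outputs"]}
    needed = {inp for step in spec["steps"] for inp in step["inputs"]}
    return sorted(needed - produced)


def final_outputs(spec):
    """Files that nothing else consumes: the results of the analysis."""
    consumed = {inp for step in spec["steps"] for inp in step["inputs"]}
    produced = {out for step in spec["steps"] for out in step["outputs"]}
    return sorted(produced - consumed)


if __name__ == "__main__":
    print("needs:", external_inputs(spec))
    print("produces:", final_outputs(spec))
```

The same traversal could feed a generator for CI configuration, which is the kind of downstream use a shell script would not allow without running it.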

betatim · 2016-02-23T14:30:27Z

If you are doing your own thing, and don't want to make a paper.md that we can render, what do we show as the "executable paper"?

We could also inject the required stuff via docker-compose. This might remove the need for a shared base image.

ctb · 2016-02-23T14:33:32Z

On Tue, Feb 23, 2016 at 06:30:28AM -0800, Tim Head wrote:

> If you are doing your own thing, and don't want to make a paper.md that we can render, what do we show as the "executable paper"?

that can be part of the spec, no? We can require it be md or md-convertible,
of course, but I still write my papers in LaTeX (for example).

> We could also inject the required stuff via docker-compose. This might remove the need for a shared base image.

Yep!

betatim · 2016-02-23T14:42:20Z

Specifying resources is a pro for having .analysis.yml. You enter a world of pain, though: how do I specify a requirement like "CERN batch system circa Feb 2016 when the OS they ran was SLC6.blah" or "the batch system we have at the University of Somewhere circa Feb 2016"? I think those are corner cases, or at least we should put them off for a while. Focus on things like "100GB RAM", "a GPU", etc.
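
For the simple requirements, checking them against a machine is cheap. A minimal sketch, assuming requirements are given as a mapping with hypothetical ram and gpu keys and that sizes use decimal units:

```python
import re

UNITS = {"MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text):
    """Turn a string such as '100GB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB)\s*", text.upper())
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def unmet_requirements(required, available):
    """Return the names of the requirements the host does not satisfy."""
    unmet = []
    if parse_size(required.get("ram", "0MB")) > parse_size(available.get("ram", "0MB")):
        unmet.append("ram")
    if required.get("gpu", False) and not available.get("gpu", False):
        unmet.append("gpu")
    return unmet


if __name__ == "__main__":
    required = {"ram": "100GB", "gpu": True}
    laptop = {"ram": "16GB", "gpu": False}
    print(unmet_requirements(required, laptop))
```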

Con for docker-compose: we need to know which version of the required-stuff image to inject. Could be noted in .analysis.yml.

I am open to supporting more formats for people to write their paper in. I would insist, though, that the format they use has a way of mixing code with prose (like a notebook). For markdown and ipynb I know how to do that. Do you know (sane) ways of doing this in LaTeX?

After reflecting on this over ☕ I am 👍 on a .analysis.yml that specifies deps, docker image to inject for required stuff, and the command to generate the "rendered HTML with interactivity" paper.

ctb · 2016-02-23T14:49:46Z

On Tue, Feb 23, 2016 at 06:42:21AM -0800, Tim Head wrote:

> Specifying resources is a pro for having .analysis.yml. You enter a world of pain, though: how do I specify a requirement like "CERN batch system circa Feb 2016 when the OS they ran was SLC6.blah" or "the batch system we have at the University of Somewhere circa Feb 2016"? I think those are corner cases, or at least we should put them off for a while. Focus on things like "100GB RAM", "a GPU", etc.

Agreed on world of pain! And agree we should allow arbitrary config (perhaps
via Dockerfile?) for corner/edge cases, but should encourage standardization.

> Con for docker-compose: we need to know which version of the required-stuff image to inject.
> Could be noted in .analysis.yml.

> I am open to supporting more formats for people to write their paper in. I would insist, though, that the
> format they use has a way of mixing code with prose (like a notebook). For markdown and ipynb I know
> how to do that. Do you know (sane) ways of doing this in LaTeX?

Good point -- and @camillescott has at least been playing with some tools.
I am +1 on that requirement, we can figure out latex later!

> After reflecting on this over ☕ I am 👍 on a .analysis.yml that specifies deps,
> docker image to inject for required stuff, and the command to generate the "rendered HTML
> with interactivity" paper.

coo'.

betatim · 2016-02-23T14:59:18Z

(note to future: in the above comment there is a sentence from titus hidden in what looks quoted text about figuring out latex later)

khinsen · 2016-02-24T13:56:48Z

A useful notion in Guix is the "build system", which is a package of tools and conventions to manage a build process. Guix has a build system based on autoconf/automake, one based on Python's distutils, etc. Considering that "building" means nothing else than "producing a digital artefact", this can easily be extended to computational science. Running a data analysis is the same as building a data analysis report.

Given the current state of the art (which is a mess), I think the best approach would be to allow arbitrary build systems, the condition being that they produce rendered output according to some criteria. Users would be strongly encouraged to use an existing build system rather than make their own, so in the end we'd have a few but not many.

Another aspect of Guix build systems worth copying is that the input to a build system is declarative and therefore analyzable.
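
A toy sketch of that arrangement: a declarative description names a build system, and the only thing every build system must do is report rendered output. This is not Guix's API; the registry, the key names and the stand-in build system are all invented for illustration.

```python
BUILD_SYSTEMS = {}


def build_system(name):
    """Register a function as the build system called `name`."""
    def register(func):
        BUILD_SYSTEMS[name] = func
        return func
    return register


@build_system("markdown-paper")
def build_markdown(inputs):
    # Stand-in for real rendering: only report what would be produced.
    return {"rendered": inputs["source"].rsplit(".", 1)[0] + ".html"}


def build(description):
    """Run the build system named in a declarative description."""
    system = BUILD_SYSTEMS[description["build-system"]]
    result = system(description["inputs"])
    if "rendered" not in result:  # the one condition every system must meet
        raise ValueError("build system produced no rendered output")
    return result


if __name__ == "__main__":
    print(build({"build-system": "markdown-paper", "inputs": {"source": "paper.md"}}))
```

Because the description is plain data, it can be inspected (which build system, which inputs) before anything runs.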

tritemio · 2016-02-24T15:55:48Z

Food for thought: let's keep simple things simple and hard things possible.

Many workflows only require python+cython or R. These cases should not become more complex because of the requirements of other workflows that need to build custom code, etc.

betatim · 2016-02-24T16:23:46Z

Agree with you Antonino.

Being able to bring your own docker container to an HPC system, batch queue,
or the LHC computing grid isn't possible right now. However,
movement towards that has started at CERN. HTCondor apparently supports
jobs-in-containers.

Re custom software: take a look at
https://github.com/betatim/everware-demo which
builds on the image from
https://github.com/betatim/everware-cern-analysis/blob/master/Dockerfile
The point is that even a demo needs quite some custom C++ software
(which you drive from python), but I think it is well addressed by the
approach we propose.


cranmer · 2016-02-24T18:11:34Z

Recast project is very workflow oriented.
https://github.com/recast-hep/

Here's a recent talk focusing on docker, and "parametrized workflows" for the LHC context. Workflow stuff starts around slide 10.
https://indico.cern.ch/event/501469/

@lukasheinrich can do a better job of describing this, but here's a try:

We are preparing a document that describes high-level design for executing "parametrized workflows".
We have iterated on a JSON schema to describe quite generic "parametrized workflows" or "workflow templates". We allow for each step of the workflow template to be in a different environment (in practice, we are mainly using docker).

In the current design there are schedulers that parse the workflow template and the various parameters that are needed to start executing steps in the DAG. @michal-szostakl is working on making this talk to various types of clusters (AWS, carina, google container project, CERN container project, etc.). This produces what we call a "workflow instance" (e.g. the specific jobs that ran, their outputs, etc.), which can be described with something like PROV.

(boxes are "activities" and circles are "Entities" in the PROV language)

lukasheinrich · 2016-02-24T18:22:51Z

Hi all,

I think the model we came up with can be quite general and in my initial tests it was easy to describe even somewhat complex workflow graphs.

The reason we separated the "workflow template" from the "workflow instance" is that this maps better to how we usually think about these workflows. I.e. in our heads we think of a workflow stage as "process all these files from the previous stage in parallel" instead of thinking in terms of very concrete filenames / paths. Also, sometimes the full graph is not known ahead of time (which is why we couldn't use snakemake / pydoit / and friends).

For the actual workflow instance that @cranmer posted above, this is the graph of the workflow template

I intentionally modelled it such that it could be written down somewhat succinctly in a travis-like manner and can be executed locally (as @cranmer mentioned we're working on remote execution as well)

lukasheinrich · 2016-02-24T18:25:48Z

Also, I agree with @tritemio. If stuff is really simple, it should stay simple and not be overly complex. If, for example, you can package all your requirements in a single docker image and run the workflow with

docker run <myimage> ./runworkflow arg1 arg2

you shouldn't need to specify a whole lot more.

lukasheinrich · 2016-02-24T18:44:11Z

This would be the simplest example of a single-step process that is parametrized by an input and an output argument. As @ctb said, this more declarative way of specifying the workflow allows for many downstream applications. You can query e.g. what code is used (i.e. which docker images), what the interdependencies of the various workflow steps are, what the parameters are, etc. (that's what makes it easy for us to visualize).

context:
  inputparameter: ~
  outputparameter: ~
stages:
  - name: dummystage
    parameters:
      input: '{inputparameter}'
      output: '{outputparameter}'
    scheduler:
      scheduler-type: 'singlestep-from-context'
      steps:
        single:
          process:
            process-type: 'string-interpolated-cmd'
            cmd: 'echo {input} {output}'
          publisher:
            publisher-type: 'process-attr-pub'
            outputmap:
              step_output: output
          environment:
            environment-type: 'docker-encapsulated'
            image: busybox
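
One way to read this spec, sketched in Python: the stage parameters are filled in from the context, and the command is then filled in from the parameters. Using str.format for both substitutions is an assumption made for this illustration; how the actual engine interpolates is not stated here. The dict is a trimmed-down version of the YAML above, keeping only the keys the sketch needs.

```python
# The spec above as a nested dict, trimmed to the keys this sketch uses.
stage = {
    "name": "dummystage",
    "parameters": {"input": "{inputparameter}", "output": "{outputparameter}"},
    "step": {
        "process": {"cmd": "echo {input} {output}"},
        "environment": {"image": "busybox"},
    },
}


def resolve_command(stage, context):
    """Fill stage parameters from the context, then the command from those."""
    parameters = {
        key: template.format(**context)
        for key, template in stage["parameters"].items()
    }
    return stage["step"]["process"]["cmd"].format(**parameters)


def images_used(stages):
    """Answer 'which docker images are needed?' without running anything."""
    return sorted({s["step"]["environment"]["image"] for s in stages})


if __name__ == "__main__":
    context = {"inputparameter": "in.txt", "outputparameter": "out.txt"}
    print(resolve_command(stage, context))
    print(images_used([stage]))
```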

workflow template:

workflow instance:

ctb · 2016-02-28T15:16:19Z

I like all of the comments here!

What about including links to these issues in the proposal? I don't think we want to say we've reached any conclusions yet, and the proposal is due tomorrow, but I think these discussions are incredibly valuable and we can point to them as initial progress.

betatim · 2016-02-28T15:21:34Z

That is a good idea! 👍


cranmer · 2016-02-28T19:31:46Z

See also #50. Note there are two notions of "workflow" being discussed. One is how a user of everpub uses the tools. The second is the workflow coded up in the code itself, which is more connected to composition etc.

lukasheinrich · 2016-02-29T18:43:39Z

Hi,

I recently stumbled on http://common-workflow-language.github.io/ and it seems like another workflow specification language, apparently used primarily in bio/med fields.

Does anyone here have experience with this / know anything about it?

Cheers,
Lukas

ctb · 2016-02-29T18:44:47Z

Maybe... https://www.genomeweb.com/informatics/seven-bridges-funds-uc-davis-support-development-standardized-workflow-language :) It's kind of a meta specification, and while it's something we should support I didn't want to bake it into the proposal.

lukasheinrich · 2016-02-29T18:47:58Z

ugh.. behind a paywall even from NYU network. is there free info on this somewhere? Obviously there is an interest in this across fields in having something like this, which is good.

cranmer · 2016-02-29T18:49:37Z

I’m not premium, so I can’t see that article :-)


ctb · 2016-02-29T18:50:02Z

On Mon, Feb 29, 2016 at 10:47:59AM -0800, Lukas wrote:

> ugh.. behind a paywall even from NYU network. is there free info on this somewhere? Obviously there is an interest in this across fields in having something like this, which is good.

I can send you a PDF but that doesn't help in general, does it? Anyway, yes, I
currently employ the CWL community manager, @mr-c ;).

lukasheinrich · 2016-02-29T18:55:20Z

That's great. So, I skimmed over that and it seems somewhat similar to our workflow spec. Maybe there is an opportunity there to converge. The one feature, I think that I haven't seen elsewhere, is flexibility in the workflow DAG itself. Our spec allows for extending the graph in certain ways while it is running, which is helpful in cases where the graph structure depends on the outcomes of previous nodes in the graph.

ctb · 2016-02-29T18:57:38Z

On Mon, Feb 29, 2016 at 10:55:20AM -0800, Lukas wrote:

> That's great. So, I skimmed over that and it seems somewhat similar to our workflow spec. Maybe there is an opportunity there to converge. The one feature, I think that I haven't seen elsewhere, is flexibility in the workflow DAG itself. Our spec allows for extending the graph in certain ways while it is running, which is helpful in cases where the graph structure depends on the outcomes of previous nodes in the graph.

Sounds like a nice convergence! Drop me an e-mail if you
want an e-introduction to @mr-c.

mr-c · 2016-02-29T19:35:39Z

Hello, @mr-c here. I'm the Community Engineer for the #CommonWL. I'm coming up to speed on what you all are doing and I see a lot of crossover.

One of my main personal motivations for CWL was that there should be a way to run the complete analysis graph from a paper AND re-mix/re-use it with your own data. Hopefully our 3rd draft of the spec provides much of the functionality needed by that.

I'm not sure why @ctb thinks of us as a meta-specification; CWL tool descriptions and workflows made from those tool descriptions are completely runnable on a local machine, in a docker container, or on an academic cluster/grid.

We have a chat room if you'd like some real time conversation at https://gitter.im/common-workflow-language/common-workflow-language

FYI @betatim we have Docker containers running on HPC systems without root: https://github.com/common-workflow-language/common-workflow-language/wiki/Userspace-Container-Review#getting-userspace-containers-working-on-ancient-rhel

lukasheinrich · 2016-02-29T19:49:08Z

Hi @mr-c,

so re-mixing is exactly a point where the flexibility in the graph itself becomes important. Think of this simple type of map-reduce workflow:

- one step that takes a couple of parameters and produces N files by running code in docker container A
- process each of those N files in parallel using code in docker container B to produce N new files
- merge those N result files into a single result using some code in container C

Now, different input parameters might result in a different number of produced files, so the actual graph becomes invocation-dependent (though structurally similar between invocations). We solved this in our proposal by allowing for "schedulers", which take a graph -- as executed until this point -- and its invocation parameters, and extend the graph with new nodes based on the results up to that point.

One approach, of course, is to hide the parallel computation in a single step that handles all of it, but that goes a bit against the re-usability ideal, since the core component one wants to re-use is what happens in each node.
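
The scheduler idea can be sketched in a few lines of Python. This is only an illustration of the concept with an invented data layout (a dict of nodes with image and after keys), not the schema of the actual spec: the graph starts with the first step only and is extended once that step's outputs are known.

```python
def map_reduce_scheduler(graph, results):
    """Extend the graph once the first step has reported its output files.

    `graph` maps node name -> {"image": ..., "after": [dependencies]}.
    `results` maps finished node name -> list of files it produced.
    """
    if "split" not in results or "merge" in graph:
        return graph  # nothing to add yet, or already extended
    map_nodes = []
    for index, path in enumerate(results["split"]):
        name = f"process-{index}"
        graph[name] = {"image": "B", "after": ["split"], "input": path}
        map_nodes.append(name)
    # The merge node depends on however many map nodes this invocation needed.
    graph["merge"] = {"image": "C", "after": map_nodes}
    return graph


if __name__ == "__main__":
    graph = {"split": {"image": "A", "after": []}}
    # Pretend container A ran and produced three files for these parameters.
    results = {"split": ["part0.dat", "part1.dat", "part2.dat"]}
    for name, node in map_reduce_scheduler(graph, results).items():
        print(name, node["image"], node["after"])
```

Each per-file node stays visible in the graph, so the piece one would want to re-use (what happens in container B) is not hidden inside a single opaque step.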

Have you encountered these things within the CWL development?

mr-c · 2016-02-29T20:22:33Z

Hello @lukasheinrich

Yep, this topic comes up frequently.

We currently support the scenario you outlined with our scatter/gather feature: http://common-workflow-language.github.io/draft-3/Workflow.html#WorkflowStep

I'm sure that additional dynamic features will be added after our 1.0 release. If we are missing anything, especially derived from the usecases presented in this repo, I really want to hear about it!

lukasheinrich · 2016-02-29T20:45:31Z

yes, scatter / gather (which I guess is almost synonymous with map/reduce) is one very common way these graph extensions work, but there are probably more, so we wanted to make this a first-class citizen using the notion of "workflow templates" and "workflow instances". In our JSON-based workflow schema, we allow for arbitrary sub-schemas (which need to be supported by the workflow engine that runs them). This allows custom contributions / workflow patterns to appear organically (maybe curated by a community).

Another question: can one run workflows using different docker containers (for each node in the graph) using CWL? If so, how do you describe the environment (which docker container, how to set up a shell environment within the container, etc.) and how do you coordinate a shared filesystem between those containers? In our case, we allow a list of resources to be listed (such as a network filesystem or a shared host directory), and docker containers can expect to see the work directory at a well-defined path (e.g. /workdir). See this example:

https://github.com/recast-hep/recast-cap-demo/blob/master/recastcap/capdata/yamlworkflow/ewk_analyses/ewkdilepton_analysis/postproc.yml

mr-c · 2016-02-29T20:55:25Z

I'm sure we'll add additional dynamic workflow patterns as the standard develops.

With CWL you can indeed define a different docker container to use for each tool or step: http://common-workflow-language.github.io/draft-3/CommandLineTool.html#DockerRequirement

We are also adding support for giving hints to the local system in the event you'd like to execute a CWL workflow using a traditional HPC cluster.

File staging is left as an implementation detail for CWL compliant platforms (some are shared filesystems, many are not).

More on the runtime environment: http://common-workflow-language.github.io/draft-3/CommandLineTool.html#Runtime_environment

Specific files to be used in computation are specified in the input object, a JSON formatted list of input parameters including file locations.

mr-c · 2016-02-29T20:56:33Z

@lukasheinrich Is there a link for the recast workflow spec? We maintain a (depressingly long) list of other scientific workflow systems at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems and I'd like to add y'all.

lukasheinrich · 2016-02-29T21:02:11Z

we're working on a draft right now, hopefully there'll be something presentable soon.

betatim mentioned this issue Feb 23, 2016

Thoughts and questions on a first thorough review #18

Closed

betatim mentioned this issue Feb 23, 2016

Split proposal into sections #21

Merged

lukasheinrich mentioned this issue Feb 28, 2016

Meta-issue re composability #51

Open

lukasheinrich mentioned this issue Mar 11, 2016

demo: notebook with attached cluster + declarative workflows #116

Open
