rmcclosk edited this page Jul 2, 2014 · 15 revisions

CodeResources and Methods

Definitions

A CodeResource (CR) is a program that can be applied to data, or a dependency that is required by another CodeResource. A CodeResourceRevision (CRV) is an instance of a given CodeResource. As an object in Shipyard, a CRV does not do anything on its own, even if it is executable (e.g., a driver). Before a CRV can be used, a Method object must be defined on it.

A Method provides information about how a CRV is used, specifically its inputs and outputs (which are defined as unstructured data or CompoundDatatypes). Because we rely on positional arguments to pass inputs and outputs to a CodeResource, it is not possible to vary the number of inputs and outputs for a given CodeResource. (This may be a critical limitation, and it may be necessary to expand the Method class to hold run parameters such as would be passed as keyword arguments.) Therefore, a Method should accommodate CodeResources that serve the same function but exist in multiple implementations by different authors (such as Conan's and Art's g2p scripts).

Stories

CodeResource

To learn how to use Shipyard, a user Jan has written a toy program in C which simply reads a list of strings and outputs the lexicographically first one. Jan compiles his C code, with whatever compiler he has on his system, and then creates a new CodeResource with the compiled binary as its first revision. He names this CodeResource "smallest". It’s important that Jan doesn’t upload the C code itself in this case, since Shipyard knows only how to execute files, not how to compile them. If Jan’s code made use of another header file he had written, he would not need to make a CodeResource for that file either, because it is only used to compile his program and is not needed at runtime.
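To make the calling convention concrete, here is a minimal shell stand-in for what Jan's program does. It is only a sketch: Jan's real program is compiled C, and the function name, usage message, and exit code below are assumptions made for illustration. It takes one input file name and one output file name as positional arguments, in that order.

```shell
# smallest INPUT OUTPUT
# Hypothetical stand-in for Jan's compiled program: writes the
# lexicographically first line of INPUT to OUTPUT.
smallest() {
    if [ "$#" -ne 2 ]; then
        echo "usage: smallest input output" >&2
        return 2
    fi
    # LC_ALL=C forces plain byte-wise ordering, independent of the locale.
    LC_ALL=C sort "$1" | head -n 1 > "$2"
}

# Saved as a standalone script, the file would end with:
#   smallest "$@"
```

Sorting the whole file just to take the first line is wasteful for large inputs, but it keeps the sketch short and the interface easy to see.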

Another Shipyard user, Catherine, has written a program in Perl called count_reads.pl to process reads from a sequencing platform. It produces two output files: (1) a list of the unique, good quality reads and how many times they appeared, and (2) the number of bad quality reads which were discarded. The program uses a Perl module Catherine has written called read_checking.pm, which contains functions for testing the quality of the reads. To use this program, she will first make a CodeResource named "read_checking.pm" by uploading the first revision. When she goes to create the first revision of the CodeResource named "count_reads.pl", she will add the first revision of read_checking.pm as a dependency.

In a final case, user Dan wants to make use of the popular sequence alignment program Clustal Omega. He has installed the software on his system. However, as we’ll see later under Method, CodeResources which are intended for execution by Shipyard must have a particular command line interface, namely "program_name input1 input2 ... inputN output1 ... outputM", where input1 through inputN are the names of the input files, and output1 through outputM are the names of the output files. Clustal Omega does not have this interface; rather, it can be invoked as follows (from clustalo’s help): "clustalo -i my-in-seqs.fa -o my-out-seqs.fa -v". To use Clustal Omega, Dan writes a wrapper script in bash which takes arguments in the form expected by Shipyard and passes them to clustalo. Note that in bash, “$1” means the first command line argument, and so on.

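#!/bin/bash
# Shipyard calls this wrapper as: wrapper input output
# Quote the arguments so file names containing spaces are passed intact.
clustalo -i "$1" -o "$2" -v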

Dan creates a new CodeResource for this wrapper script. While Dan’s script obviously depends on the program clustalo, he does not create a new CodeResource for clustalo. Shipyard is not intended to manage third-party applications that may be installed on the user’s system. Rather, it is up to the user to keep track of what is installed and available on their own machine. Importantly, this means that if the local installation of Clustal Omega is upgraded to a new version, Shipyard won’t be aware of the change, which could cause inconsistencies or integrity errors. It is up to Dan to ensure this does not happen, perhaps by adding a check for a specific version number to the top of his wrapper script.

CodeResourceRevision

Following up with the users from the previous section, Jan has written a new version of his program which parses its input as integers, instead of alphabetical strings, and outputs the numerically smallest one. After compiling the new executable, he creates a new CodeResourceRevision of the "smallest" CodeResource, containing the new binary. Since the new revision is significantly different from the previous one, Jan decides to give this one the revision name "for integers". Because he is creating this revision with the Shipyard web interface, the revision number 2 is automatically assigned to the new revision. If he were creating this revision through the backend API, he would need to specify the revision number manually. He would probably do this by adding 1 to the number of existing revisions of "smallest".

Meanwhile, Catherine has discovered a bug in her Perl module read_checking.pm. She creates a new CodeResourceRevision with the revision number 2. Catherine sees no need to give this revision a new name, because it was only a bug fix and she’ll know not to use the first revision anymore. Note here that creating revision 2 of "read_checking.pm" does not change the first revision of "count_reads.pl", which still depends on revision 1 of "read_checking.pm". To use the new, debugged module, Catherine needs to create revision 2 of "count_reads.pl" and add revision 2 of "read_checking.pm" as a dependency.

Dan has realized that, for clinical validation purposes, he needs to be using the same version of Clustal Omega every time. He creates a new revision of the "clustalo" CodeResource, with revision number 2 and revision name "clinical". This revision has a slightly modified wrapper script which exits with failure if the installed version of Clustal Omega is not 1.2.1.

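#!/bin/bash
# Exit with failure unless the installed Clustal Omega is exactly version 1.2.1.
if [[ "$(clustalo --version)" != "1.2.1" ]]; then exit 1; fi
# Quote the arguments so file names containing spaces are passed intact.
clustalo -i "$1" -o "$2" -v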

Alternatively, Dan could simply have uploaded to Shipyard the Clustal Omega binary he downloaded from the internet. He might create a CodeResource named "clustalo-bin", and a revision named "1.2.1", for this binary, and make it a dependency of the "clinical" revision of "clustalo". This works fine for standalone binaries, and might even be a better choice for Dan, since he can then be certain that "clustalo" hasn't changed between clinical runs.
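With the binary uploaded as a dependency, the wrapper would call that copy rather than whatever clustalo is on the system path. The sketch below assumes the dependency ends up in the same directory as the wrapper under the name "clustalo-bin"; where Shipyard actually places dependencies is not stated here, so the CLUSTALO_BIN variable and the align function are hypothetical names that would need adjusting.

```shell
# ASSUMPTION: the "clustalo-bin" dependency sits next to this wrapper.
# Set CLUSTALO_BIN beforehand if it is placed somewhere else.
CLUSTALO_BIN="${CLUSTALO_BIN:-$(dirname "$0")/clustalo-bin}"

# align INPUT OUTPUT
# Run the bundled binary instead of the system-wide installation.
align() {
    "$CLUSTALO_BIN" -i "$1" -o "$2" -v
}

# Saved as a standalone driver, the file would end with:
#   align "$@"
```

Because the wrapper never consults the system installation, no version check is needed in this variant.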

Method

The next step is for our users to tell Shipyard how to use their code. This is done by creating Methods, which are instructions for using CodeResources.

Jan creates a MethodFamily called "smallest". He then creates two methods within this MethodFamily. The first, with the revision name "signed", has the input CompoundDatatype (integer: number). The second, with the revision name "positive", takes a Dataset with CompoundDatatype (naturalnumber: number) as its input. Both of these Methods have the revision "for integers", of the CodeResource "smallest", as their driver.

Of course, the two Methods are just going to call the same code. So why bother with two? For the same reason that some people like strongly typed programming languages. If Jan knows that everything going through his Pipeline will be a positive integer, using the "positive" Method provides an extra layer of content checking. It's equivalent to putting "assert (x > 0)" in his code.
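To show what that extra layer amounts to, here is a rough shell equivalent of the check. It is not how Shipyard validates a CompoundDatatype; the function name all_positive and the one-number-per-line file layout are assumptions made for the example.

```shell
# all_positive FILE
# Succeeds only if every line of FILE is a positive integer
# (digits only, no sign, no leading zero).
all_positive() {
    # grep -v selects lines that do NOT match the pattern, so grep succeeds
    # when at least one bad line exists; "!" inverts that result.
    ! grep -qvE '^[1-9][0-9]*$' "$1"
}
```

With the "positive" Method, a check like this happens before Jan's code ever sees the data, so a stray negative number or a zero is caught at the boundary rather than silently producing a misleading minimum.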

We'll come back to Catherine in a minute, but let's skip ahead to Dan first. Dan's first revision of "clustalo" simply runs the most recent version of Clustal Omega which is on his system. Though this is not acceptable for clinical work, it may be just what he wants for research. Therefore, he creates a MethodFamily called "align" with two Methods, with revision names "research" and "clinical". The "research" version uses his first revision of "clustalo" as its driver, but the "clinical" version uses the updated, clinical revision which checks the version number.

Note that Jan's two Methods both use the same CodeResourceRevision as their driver, but Dan's use two different CodeResourceRevisions. In fact, there is no need for Method drivers to even share a common CodeResource. MethodFamilies are simply a way to group Methods that do the same sort of task. Two Methods being members of the same MethodFamily says nothing about how their internals work, only that they are conceptually related somehow.

Catherine needs to create only one Method, which uses the most recent version of count_reads.pl (the one with the fixed dependency). However, if she had already created a Method with the buggy version, she would have to create a new Method with the fixed version. Catherine finds this somewhat annoying - to fix a bug in her code, she had to create a new CodeResourceRevision, then a new Method, and then new versions of any Pipelines using the old Method.

Discussion

  • I disagree that Shipyard "is not intended to manage third-party applications". The original concept had included "third-party" software as method objects. I am concerned that Shipyard will fail to appeal to many users if we cannot incorporate software version control - Art.

  • Ok, I appreciate that we require users to write wrapper scripts for 3rd party apps. We should provide examples. I will write some, given time. - Art.

  • I still think it is important to allow multiple Methods per CodeResourceRevision, especially if we are going to think about parameters in the future. Both Jan's and Dan's use cases here are not too far-fetched.

  • However, I was unable to come up with any examples of two different Methods, within the same MethodFamily, which use two different CodeResources. Two different revisions of the same CodeResource, sure. But not two completely different pieces of code. In light of this, I'm going to suggest replacing MethodFamily with CodeResource.

  • As far as Catherine's story, it's annoying but I don't see any way around having to just re-create everything each time a bug comes up...

  • Regarding third-party software, there is nothing stopping Dan from just uploading the clustalo binary to Shipyard and putting it as a dependency of his wrapper script. But we can't force him to do this either, because sometimes the software he might want to use is more complicated than just a binary (shared libraries, for example). -Rosemary (last 4 points)

  • While I'm aware that it's necessary for clinical purposes, I am slightly uncomfortable with uploading binaries to Shipyard, because there is a possibility that the lab will lose the source code and be relying on a mysterious binary for their day-to-day operations. In the far future, it would be nice to somehow encourage the user to upload the source as well, maybe as a dependency which is not actually used. -Rosemary