Scripts should output metadata file about what they did #676
Comments
A few more thoughts -- salmon does this now! Also see #189, generating PROV-compliant output. I always get hung up on what to do when we are running many things in a row. Do we just have a file that we append to? Or perhaps a directory that contains multiple little files, named by UUID, containing the provenance output? Is there any workflow-supported standard emerging? Is PROV it? |
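For the "directory of little UUID-named files" option, here is a minimal sketch of what that could look like. The function names and the record fields are hypothetical, not anything khmer or salmon currently provides; the point is only that each run writes its own file, so nothing is ever appended to.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def write_provenance_record(prov_dir, record):
    """Write one provenance record to its own UUID-named JSON file.

    Hypothetical helper: the directory layout and field names are assumptions.
    """
    prov_dir = Path(prov_dir)
    prov_dir.mkdir(parents=True, exist_ok=True)
    # Copy the record and stamp it so records can be ordered later.
    record = dict(record, timestamp=datetime.now(timezone.utc).isoformat())
    path = prov_dir / f"{uuid.uuid4()}.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


def read_provenance_records(prov_dir):
    """Load every record in the directory, sorted by recorded timestamp."""
    records = [json.loads(p.read_text()) for p in Path(prov_dir).glob("*.json")]
    return sorted(records, key=lambda r: r["timestamp"])
```

Because every run picks a fresh file name, runs never write to the same file, at the cost of needing a small reader like `read_provenance_records` to see the whole history.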
Brainstorming some considerations.
|
The most recent stable release, or current master? I just compiled the latest release from source and ran a few commands but didn't see any JSON metadata or any way to invoke it. |
On Mon, Apr 03, 2017 at 09:13:24AM -0700, Daniel Standage wrote:
> salmon does this now!
> The most recent stable release, or current master? I just compiled the latest release from source and ran a few commands but didn't see any JSON metadata or any way to invoke it.
it's in the quant directory that contains quant.sf.
|
👍 Got it. Also several .json files in the index directory. |
On Mon, Apr 03, 2017 at 06:15:59AM -0700, Daniel Standage wrote:
> - report relative paths (for portability) or absolute paths (for clearest provenance)?
both!
we could also use a sqlite database to track entries.
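Recording both path forms is cheap. A rough sketch, assuming a hypothetical helper `describe_path` whose output would be embedded in whatever metadata record we settle on:

```python
import os


def describe_path(path, start=None):
    """Return both the relative and the absolute form of a path.

    `start` is the directory the relative form is computed against
    (the current working directory by default). Hypothetical helper.
    """
    start = os.getcwd() if start is None else start
    # Resolve relative inputs against `start`; absolute inputs pass through.
    absolute = os.path.abspath(os.path.join(start, path))
    return {
        "relative": os.path.relpath(absolute, start),
        "absolute": absolute,
    }
```

One caveat: on Windows, `os.path.relpath` raises `ValueError` when the two paths are on different drives, so a real implementation would need a fallback there.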
|
I like salmon's approach of attaching metadata to each artifact it creates (sequence index or quantification table). That could be an alternative to the single log file or directory approaches I discussed above. That said, salmon's approach works really well for a decidedly NOT streaming approach. If we start stitching together 3 or 4 khmer/oxli commands via UNIX pipes, all of a sudden attaching metadata to output files doesn't make as much sense. |
The closest thing to a non-domain-specific standard is what we did in CWL: programs can hand off JSON files with key-value pairs: http://www.commonwl.org/v1.0/CommandLineTool.html#Output_binding (this is underdocumented and I am happy to explain more) |
also, salmon's habit of creating subdirectories is annoying.
|
Purely technical consideration: appending to a file when there are multiple concurrent writers is #hard, especially when you need to make it work across operating systems and weird file systems like NFS. It is worth letting someone else provide the file locking (e.g. sqlite). The drawback is that you need a tool to look at your data, which is tedious compared to opening it in vim. |
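To make the sqlite option concrete, here is a small sketch using Python's standard `sqlite3` module. The table name `runs`, its columns, and the helper names are all assumptions for illustration; the idea is that each script inserts one row per run and leaves the locking to sqlite.

```python
import json
import sqlite3


def open_provenance_db(path):
    """Open (and if needed create) a hypothetical provenance database."""
    # timeout: how long to wait for another writer's lock before giving up.
    conn = sqlite3.connect(path, timeout=30)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " command TEXT NOT NULL,"
        " metadata TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def record_run(conn, command, metadata):
    """Insert one run; the `with` block commits, or rolls back on error."""
    with conn:
        conn.execute(
            "INSERT INTO runs (command, metadata) VALUES (?, ?)",
            (command, json.dumps(metadata, sort_keys=True)),
        )


def list_runs(conn):
    """Return (command, metadata) pairs in insertion order."""
    rows = conn.execute("SELECT command, metadata FROM runs ORDER BY id")
    return [(command, json.loads(metadata)) for command, metadata in rows]
```

This addresses the concurrent-writer problem on a local disk; it does not by itself settle the network file system concern, and reading the history back still needs a helper like `list_runs` or the `sqlite3` command-line shell rather than vim.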
thank you, @betatim, very good points :)
|
While most of the file formats we work with don't have support for metadata, we aren't off the hook for recording such information and preserving it.
Based upon the discussion captured here: https://groups.google.com/d/msg/common-workflow-language/wx8G2zvDUV4/lzZPUPtQEHwJ, we should store such information in a JSON file (like @kdmurray91's addition to load-into-counting.py)
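As a starting point for discussion, a script could write a JSON file next to its output along these lines. This is only a sketch: the `.info.json` suffix, the field names, and the `write_run_metadata` helper are hypothetical and are not meant to describe what load-into-counting.py currently writes.

```python
import json
import sys
from datetime import datetime, timezone


def write_run_metadata(output_path, inputs, parameters, argv=None):
    """Write a JSON metadata file describing what a script did.

    Hypothetical layout: the file lands at `<output_path>.info.json`.
    """
    argv = sys.argv if argv is None else argv
    metadata = {
        "command": list(argv),          # the command line as invoked
        "inputs": list(inputs),         # input files that were read
        "output": output_path,          # the artifact this file describes
        "parameters": dict(parameters), # settings that affect the result
        "finished": datetime.now(timezone.utc).isoformat(),
    }
    metadata_path = output_path + ".info.json"
    with open(metadata_path, "w") as handle:
        json.dump(metadata, handle, indent=2, sort_keys=True)
        handle.write("\n")
    return metadata_path
```

A script would call this once, just before exiting, for each output file it produced.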