Scripts should output metadata file about what they did #676

mr-c · 2014-12-04T21:38:00Z

While most of the file formats we work with don't support metadata, we aren't off the hook for recording and preserving such information.

Based upon the discussion captured here: https://groups.google.com/d/msg/common-workflow-language/wx8G2zvDUV4/lzZPUPtQEHwJ we should store such information in a JSON file (like @kdmurray91's addition to load-into-counting.py)

ctb · 2017-04-03T12:25:51Z

A few more thoughts --

salmon does this now!

also see #189, generating PROV-compliant output.

I always get hung up on what to do when we are running many things in a row. Do we just have a file that we append to? or perhaps a directory that contains multiple little files with UUIDs containing the provenance output?

is there any workflow-supported standard emerging? is PROV it?

standage · 2017-04-03T13:15:59Z

Brainstorming some considerations.

- probably want to use a filename format similar to 2017-04-03T06:55:21-khmer.json (maybe s/khmer/oxli/)
  - fixed length
  - sorts correctly every time without any tricks
- store a distinct file for each script invocation in some directory?
  - in situ with something like ./.oxli/ or ./oxli-meta/?
  - global with something like ~/.oxli?
- ...or use a user-specified file to keep a running log of JSON entries?
  - if nothing is specified, default to something like ./oxli-meta.json
- wait until the end of the script (to confirm successful exit) to print?
  - otherwise could end up with a lot of unhelpful/misleading files/entries
- report relative paths (for portability) or absolute paths (for clearest provenance)?
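
A minimal sketch of the "one file per invocation" option, just to make the idea concrete. The function names (`metadata_filename`, `write_metadata`), the `tool` argument, and the shape of `record` are all hypothetical; nothing here exists in khmer/oxli. It assumes the script calls `write_metadata` only after it has finished successfully.

```python
import json
import os
from datetime import datetime


def metadata_filename(now, tool="oxli"):
    """Build a fixed-length, lexicographically sortable filename,
    e.g. 2017-04-03T06:55:21-oxli.json (hypothetical naming scheme)."""
    return "{}-{}.json".format(now.strftime("%Y-%m-%dT%H:%M:%S"), tool)


def write_metadata(record, meta_dir, now=None, tool="oxli"):
    """Write one JSON file for this invocation and return its path.

    Intended to be called at the very end of a script, so that failed
    runs do not leave misleading entries behind.
    """
    now = now or datetime.now()
    os.makedirs(meta_dir, exist_ok=True)
    path = os.path.join(meta_dir, metadata_filename(now, tool))
    with open(path, "w") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)
    return path
```

Two caveats with this naming scheme: invocations that finish within the same second would collide, and colons in filenames are not accepted on every file system, so a real implementation would probably need a suffix or a different separator.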

standage · 2017-04-03T16:13:24Z

> salmon does this now!

The most recent stable release, or current master? I just compiled the latest release from source and ran a few commands but didn't see any JSON metadata or any way to invoke it.

ctb · 2017-04-03T16:14:09Z

On Mon, Apr 03, 2017 at 09:13:24AM -0700, Daniel Standage wrote:
> salmon does this now! The most recent stable release, or current master? I just compiled the latest release from source and ran a few commands but didn't see any JSON metadata or any way to invoke it.

it's in the quant directory that contains quant.sf.

standage · 2017-04-03T16:16:43Z

👍 Got it. Also several .json files in the index directory.

ctb · 2017-04-03T16:16:49Z

On Mon, Apr 03, 2017 at 06:15:59AM -0700, Daniel Standage wrote:
> - report relative paths (for portability) or absolute paths (for clearest provenance)?

both! we could also use a sqlite database to track entries.
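
Recording both forms could look roughly like the sketch below. `path_record` and its `start` argument are hypothetical names; the assumption is that relative paths are taken relative to the directory the script was run from.

```python
import os


def path_record(path, start=None):
    """Return both the relative and the absolute form of a path.

    `start` is the directory that relative paths are resolved against
    (defaults to the current working directory).
    """
    start = os.path.abspath(start or os.getcwd())
    absolute = os.path.abspath(os.path.join(start, path))
    return {
        "relative": os.path.relpath(absolute, start),
        "absolute": absolute,
    }
```

The relative form stays meaningful if the whole project directory is moved, while the absolute form documents exactly which file was read on the original machine.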

standage · 2017-04-03T16:27:41Z

I like salmon's approach of attaching metadata to each artifact it creates (sequence index or quantification table). That could be an alternative to the single log file or directory approaches I discussed above.

That said, salmon's approach works really well for a decidedly NOT streaming approach. If we start stitching together 3 or 4 khmer/oxli commands via UNIX pipes, all of a sudden attaching metadata to output files doesn't make as much sense.

mr-c · 2017-04-03T16:48:25Z

The closest thing to a non-domain-specific standard is what we did in CWL: programs can hand off JSON files with key-value pairs: http://www.commonwl.org/v1.0/CommandLineTool.html#Output_binding (this is underdocumented and I am happy to explain more)

ctb · 2017-04-04T15:55:11Z

also, salmon's habit of creating subdirectories is annoying.

betatim · 2017-04-06T12:05:08Z

Purely technical consideration: appending to a file when there are multiple concurrent writers is #hard. Especially when you need to make it work across operating systems and weird file systems like NFS. It is worth letting someone else provide the file locking (eg sqlite). Drawback is that you need a tool to look at your data, which is tedious compared to opening it in vim.
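
As a rough sketch of what "let sqlite do the locking" might look like, using only Python's standard `sqlite3` module. The table name `invocations`, its columns, and the helper names `log_invocation` and `read_log` are made up for illustration.

```python
import json
import sqlite3


def log_invocation(db_path, script, record):
    """Append one provenance entry; SQLite serializes concurrent writers."""
    # timeout: how long to wait for another writer's lock before giving up
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "CREATE TABLE IF NOT EXISTS invocations ("
                "id INTEGER PRIMARY KEY, "
                "script TEXT, "
                "logged_at TEXT DEFAULT CURRENT_TIMESTAMP, "
                "record TEXT)"
            )
            conn.execute(
                "INSERT INTO invocations (script, record) VALUES (?, ?)",
                (script, json.dumps(record, sort_keys=True)),
            )
    finally:
        conn.close()


def read_log(db_path):
    """Return all entries as (script, record) pairs in insertion order."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT script, record FROM invocations ORDER BY id"
        ).fetchall()
    finally:
        conn.close()
    return [(script, json.loads(record)) for script, record in rows]
```

Something like `read_log` (or the `sqlite3` command-line shell) is the extra tool needed to look at the data. Note that SQLite's own documentation cautions that file locking on network file systems such as NFS can be unreliable, so this does not fully remove the NFS concern.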

ctb · 2017-04-06T12:32:26Z

thank you, @betatim, very good points :)

mr-c added enhancement theme:best-practices discussion-needed labels Dec 4, 2014

mr-c added this to the 1.2+ milestone Dec 4, 2014

kdm9 mentioned this issue Feb 19, 2015

Python API #776

Open

kdm9 mentioned this issue Mar 19, 2015

Initial implementation of read counting in C++ read parser #877

Merged

mr-c mentioned this issue Aug 18, 2015

[RFC] Putting version numbers into .info files? #1254

Closed

ctb mentioned this issue Apr 3, 2017

Report kevlar version when running a script kevlar-dev/kevlar#61

Open
