-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log input filename #12
Comments
Thank you for this suggestion. Would something like this be acceptable? input: R1.fq R2.f1 R3.fq
total: 8
usable: 7
unique: 4
clusters: 3
Which other things do you have in mind? |
Given that the file seems to already be valid YAML (at least the example from @Redmar-van-den-Berg is), how about sticking to YAML? Would make downstream usage super easy for people. So then, something like this: input: [R1.fq, R2.f1, R3.fq]
total: 8
usable: 7
unique: 4
clusters: 3 or this: input:
- R1.fq
- R2.f1
- R3.fq
total: 8
usable: 7
unique: 4
clusters: 3 For other stuff, I can refer to my now ancient and slightly-ranty blog post on the topic, but generally I really appreciate things like the software name, software version number, the command / parameters used. So if you end up with the results file in isolation you're able to know how it was generated. |
Well, it is a text file that happens to conform to the YAML standard, but we can pretend that this was intentional.
I will see what I can do. Do you have any preferences for key names? |
A happy accident 😊 No preference. Thanks so much! |
I added an option to write a separate metadata file. Such a file would look like this: name: humid
version: 1.0.2
parameters:
word_length: 24
allowed_mismatches: 1
log_name: /dev/stderr
dir_name: out
run_statistics: false
filter_duplicates: true
annotate_fastq_files: false
use_edit_distance: false
use_maximum_clustering: false
write_metadata: true
files:
- data/R1.fq.gz
- data/R2.fq.gz
- data/R3.fq.gz Would this work nicely with MultiQC? |
Looks great! Though does it need to be an option / a separate file? The author of HiSAT did the same and (selfishly) it's a bit of a hassle from the MultiQC side. People often forget to run with the flag or don't realise that they need to, or worse are using a pipeline written by someone else where the flags are difficult / impossible to change. So I would personally advocate for always saving this information by default, unless there's a reason not to. |
It would be nice to always write the metadata file, and optionally include the statistics in this file when We might also want to add an optional flag to |
I think so, yes. In my opinion a command line tool should produce as little clutter as possible by default. I also think that it is the responsibility of the caller to document which parameters are used, but I can see the reasons for facilitating this.
I would prefer not to mix metadata with results. |
Ok 👍🏻 The only remaining complication is knowing which metadata file corresponds to which results file. We can assume that they must both be in the same directory, but it gets complicated if there are multiple runs in one location. |
Good point.
That is why we added the output directory ( Alternatively, instead of using flags, we could consider allowing the user to provide the names of the additional files, e.g., use |
The standard filenames are one thing I really liked already, as it makes it much easier to find them 😅 Otherwise we have to search the file contents of every single file looking for specific strings custom to the tool output. Generally we're talking about two different user groups here though. I'm talking as a tool developer for downstream software where I have no control over how the software is run. (Alternatively end users running the tool indirectly, as part of a larger package / pipeline.) I like everything to be as standardised and inflexible as possible, so that the results are easy to discover and parse as possible, irrespective of what the user did upstream. In my personal opinion, 90% of users don't really care about customising filenames and things, they want stuff to "just work". The less requirement to read the docs the better. So I usually aim for sensible defaults with the option to customise. Anyway, this all got a bit in depth, subjective and philosophical which wasn't really my intention! 😅 The MultiQC module already works fine as it is and the current outputs are pretty solid as they are. So if in doubt, feel free to forget I said anything 😆 |
I have reverted the metadata logging for now. I am thinking about adding this functionality to the command line parser library that this tool uses as it would guarantee a tight coupling and avoids code duplication. |
@Redmar-van-den-Berg wrote a lovely MultiQC module for this tool in MultiQC/MultiQC#1795, but the method for finding a sample name is a bit rough:
It would be much nicer if the
stats.dat
file could log one or two more things - most importantly, the input file name(s). That way, the file can be interpreted by itself without needing directory structure for context.The text was updated successfully, but these errors were encountered: