Log input filename #12

ewels · 2022-12-31T23:33:27Z

@Redmar-van-den-Berg wrote a lovely MultiQC module for this tool in MultiQC/MultiQC#1795, but the method for finding a sample name is a bit rough:

# There is no sample name in the log, so we use the root of the
# file as sample name (since the filename is always stats.dat
name = f["root"]

It would be much nicer if the stats.dat file could log one or two more things - most importantly, the input file name(s). That way, the file can be interpreted by itself without needing directory structure for context.

The text was updated successfully, but these errors were encountered:

jfjlaros · 2023-01-01T19:17:01Z

Thank you for this suggestion.

Would something like this be acceptable?

input: R1.fq R2.f1 R3.fq
total: 8
usable: 7
unique: 4
clusters: 3

It would be much nicer if the stats.dat file could log one or two more things

Which other things do you have in mind?

ewels · 2023-01-01T19:47:45Z

Given that the file seems to already be valid YAML (at least the example from @Redmar-van-den-Berg is), how about sticking to YAML? Would make downstream usage super easy for people.

So then, something like this:

input: [R1.fq, R2.f1, R3.fq]
total: 8
usable: 7
unique: 4
clusters: 3

or this:

input:
  - R1.fq
  - R2.f1
  - R3.fq
total: 8
usable: 7
unique: 4
clusters: 3

For other stuff, I can refer to my now ancient and slightly-ranty blog post on the topic, but generally I really appreciate things like the software name, software version number, the command / parameters used. So if you end up with the results file in isolation you're able to know how it was generated.

jfjlaros · 2023-01-01T21:14:17Z

Well, it is a text file that happens to conform to the YAML standard, but we can pretend that this was intentional.

I really appreciate things like the software name, software version number, the command / parameters used.

I will see what I can do. Do you have any preferences for key names?

ewels · 2023-01-02T07:23:06Z

A happy accident 😊 No preference. Thanks so much!

jfjlaros · 2023-01-02T21:32:58Z

I added an option to write a separate metadata file.

Such a file would look like this:

name: humid
version: 1.0.2
parameters:
  word_length: 24
  allowed_mismatches: 1
  log_name: /dev/stderr
  dir_name: out
  run_statistics: false
  filter_duplicates: true
  annotate_fastq_files: false
  use_edit_distance: false
  use_maximum_clustering: false
  write_metadata: true
  files:
    - data/R1.fq.gz
    - data/R2.fq.gz
    - data/R3.fq.gz

Would this work nicely with MultiQC?

ewels · 2023-01-03T08:01:53Z

Looks great! Though does it need to be an option / a separate file? The author of HiSAT did the same and (selfishly) it's a bit of a hassle from the MultiQC side. People often forget to run with the flag or don't realise that they need to, or worse are using a pipeline written by someone else where the flags are difficult / impossible to change. So I would personally advocate for always saving this information by default, unless there's a reason not to.

Redmar-van-den-Berg · 2023-01-03T09:41:18Z

It would be nice to always write the metadata file, and optionally include the statistics in this file when humid is executed with the -s flag. That way, all the relevant information is always in the same place.

We might also want to add an optional flag to humid where the user can specify the sample name to be put in the metadata file.

jfjlaros · 2023-01-03T11:54:56Z

@ewels

Though does it need to be an option / a separate file?

I think so, yes. In my opinion a command line tool should produce as little clutter as possible by default. I also think that it is the responsibility of the caller to document which parameters are used, but I can see the reasons for facilitating this.

@Redmar-van-den-Berg

optionally include the statistics in this file

I would prefer not to mix metadata with results.

ewels · 2023-01-03T12:10:39Z

Ok 👍🏻 The only remaining complication is knowing which metadata file corresponds to which results file. We can assume that they must both be in the same directory, but it gets complicated if there are multiple runs in one location.

jfjlaros · 2023-01-03T18:51:34Z

Good point.

it gets complicated if there are multiple runs in one location.

That is why we added the output directory (-d) option. We could use the sample name here as @Redmar-van-den-Berg suggested.

Alternatively, instead of using flags, we could consider allowing the user to provide the names of the additional files, e.g., use -s <name> instead of -s. This would get rid of a lot of implicit knowledge needed to interpret the results.

ewels · 2023-01-04T10:22:38Z

The standard filenames are one thing I really liked already, as it makes it much easier to find them 😅 Otherwise we have to search the file contents of every single file looking for specific strings custom to the tool output.

Generally we're talking about two different user groups here though. I'm talking as a tool developer for downstream software where I have no control over how the software is run. (Alternatively end users running the tool indirectly, as part of a larger package / pipeline.) I like everything to be as standardised and inflexible as possible, so that the results are easy to discover and parse as possible, irrespective of what the user did upstream.

In my personal opinion, 90% of users don't really care about customising filenames and things, they want stuff to "just work". The less requirement to read the docs the better. So I usually aim for sensible defaults with the option to customise.

Anyway, this all got a bit in depth, subjective and philosophical which wasn't really my intention! 😅 The MultiQC module already works fine as it is and the current outputs are pretty solid as they are. So if in doubt, feel free to forget I said anything 😆

jfjlaros · 2023-02-03T15:02:35Z

I have reverted the metadata logging for now.

I am thinking about adding this functionality to the command line parser library that this tool uses as it would guarantee a tight coupling and avoids code duplication.

jfjlaros added the enhancement New feature or request label Jan 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Log input filename #12

Log input filename #12

ewels commented Dec 31, 2022

jfjlaros commented Jan 1, 2023

ewels commented Jan 1, 2023

jfjlaros commented Jan 1, 2023

ewels commented Jan 2, 2023

jfjlaros commented Jan 2, 2023

ewels commented Jan 3, 2023

Redmar-van-den-Berg commented Jan 3, 2023

jfjlaros commented Jan 3, 2023

ewels commented Jan 3, 2023

jfjlaros commented Jan 3, 2023

ewels commented Jan 4, 2023

jfjlaros commented Feb 3, 2023

Log input filename #12

Log input filename #12

Comments

ewels commented Dec 31, 2022

jfjlaros commented Jan 1, 2023

ewels commented Jan 1, 2023

jfjlaros commented Jan 1, 2023

ewels commented Jan 2, 2023

jfjlaros commented Jan 2, 2023

ewels commented Jan 3, 2023

Redmar-van-den-Berg commented Jan 3, 2023

jfjlaros commented Jan 3, 2023

ewels commented Jan 3, 2023

jfjlaros commented Jan 3, 2023

ewels commented Jan 4, 2023

jfjlaros commented Feb 3, 2023