Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log input filename #12

Open
ewels opened this issue Dec 31, 2022 · 12 comments
Open

Log input filename #12

ewels opened this issue Dec 31, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@ewels
Copy link

ewels commented Dec 31, 2022

@Redmar-van-den-Berg wrote a lovely MultiQC module for this tool in MultiQC/MultiQC#1795, but the method for finding a sample name is a bit rough:

# There is no sample name in the log, so we use the root of the
# file as sample name (since the filename is always stats.dat
name = f["root"]

It would be much nicer if the stats.dat file could log one or two more things - most importantly, the input file name(s). That way, the file can be interpreted by itself without needing directory structure for context.

@jfjlaros jfjlaros added the enhancement New feature or request label Jan 1, 2023
@jfjlaros
Copy link
Owner

jfjlaros commented Jan 1, 2023

Thank you for this suggestion.

Would something like this be acceptable?

input: R1.fq R2.f1 R3.fq
total: 8
usable: 7
unique: 4
clusters: 3

It would be much nicer if the stats.dat file could log one or two more things

Which other things do you have in mind?

@ewels
Copy link
Author

ewels commented Jan 1, 2023

Given that the file seems to already be valid YAML (at least the example from @Redmar-van-den-Berg is), how about sticking to YAML? Would make downstream usage super easy for people.

So then, something like this:

input: [R1.fq, R2.f1, R3.fq]
total: 8
usable: 7
unique: 4
clusters: 3

or this:

input:
  - R1.fq
  - R2.f1
  - R3.fq
total: 8
usable: 7
unique: 4
clusters: 3

For other stuff, I can refer to my now ancient and slightly-ranty blog post on the topic, but generally I really appreciate things like the software name, software version number, the command / parameters used. So if you end up with the results file in isolation you're able to know how it was generated.

@jfjlaros
Copy link
Owner

jfjlaros commented Jan 1, 2023

Well, it is a text file that happens to conform to the YAML standard, but we can pretend that this was intentional.

I really appreciate things like the software name, software version number, the command / parameters used.

I will see what I can do. Do you have any preferences for key names?

@ewels
Copy link
Author

ewels commented Jan 2, 2023

A happy accident 😊 No preference. Thanks so much!

@jfjlaros
Copy link
Owner

jfjlaros commented Jan 2, 2023

I added an option to write a separate metadata file.

Such a file would look like this:

name: humid
version: 1.0.2
parameters:
  word_length: 24
  allowed_mismatches: 1
  log_name: /dev/stderr
  dir_name: out
  run_statistics: false
  filter_duplicates: true
  annotate_fastq_files: false
  use_edit_distance: false
  use_maximum_clustering: false
  write_metadata: true
  files:
    - data/R1.fq.gz
    - data/R2.fq.gz
    - data/R3.fq.gz

Would this work nicely with MultiQC?

@ewels
Copy link
Author

ewels commented Jan 3, 2023

Looks great! Though does it need to be an option / a separate file? The author of HiSAT did the same and (selfishly) it's a bit of a hassle from the MultiQC side. People often forget to run with the flag or don't realise that they need to, or worse are using a pipeline written by someone else where the flags are difficult / impossible to change. So I would personally advocate for always saving this information by default, unless there's a reason not to.

@Redmar-van-den-Berg
Copy link
Contributor

It would be nice to always write the metadata file, and optionally include the statistics in this file when humid is executed with the -s flag. That way, all the relevant information is always in the same place.

We might also want to add an optional flag to humid where the user can specify the sample name to be put in the metadata file.

@jfjlaros
Copy link
Owner

jfjlaros commented Jan 3, 2023

@ewels

Though does it need to be an option / a separate file?

I think so, yes. In my opinion a command line tool should produce as little clutter as possible by default. I also think that it is the responsibility of the caller to document which parameters are used, but I can see the reasons for facilitating this.

@Redmar-van-den-Berg

optionally include the statistics in this file

I would prefer not to mix metadata with results.

@ewels
Copy link
Author

ewels commented Jan 3, 2023

Ok 👍🏻 The only remaining complication is knowing which metadata file corresponds to which results file. We can assume that they must both be in the same directory, but it gets complicated if there are multiple runs in one location.

@jfjlaros
Copy link
Owner

jfjlaros commented Jan 3, 2023

Good point.

it gets complicated if there are multiple runs in one location.

That is why we added the output directory (-d) option. We could use the sample name here as @Redmar-van-den-Berg suggested.

Alternatively, instead of using flags, we could consider allowing the user to provide the names of the additional files, e.g., use -s <name> instead of -s. This would get rid of a lot of implicit knowledge needed to interpret the results.

@ewels
Copy link
Author

ewels commented Jan 4, 2023

The standard filenames are one thing I really liked already, as it makes it much easier to find them 😅 Otherwise we have to search the file contents of every single file looking for specific strings custom to the tool output.

Generally we're talking about two different user groups here though. I'm talking as a tool developer for downstream software where I have no control over how the software is run. (Alternatively end users running the tool indirectly, as part of a larger package / pipeline.) I like everything to be as standardised and inflexible as possible, so that the results are easy to discover and parse as possible, irrespective of what the user did upstream.

In my personal opinion, 90% of users don't really care about customising filenames and things, they want stuff to "just work". The less requirement to read the docs the better. So I usually aim for sensible defaults with the option to customise.

Anyway, this all got a bit in depth, subjective and philosophical which wasn't really my intention! 😅 The MultiQC module already works fine as it is and the current outputs are pretty solid as they are. So if in doubt, feel free to forget I said anything 😆

@jfjlaros
Copy link
Owner

jfjlaros commented Feb 3, 2023

I have reverted the metadata logging for now.

I am thinking about adding this functionality to the command line parser library that this tool uses as it would guarantee a tight coupling and avoids code duplication.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants