FastK: adding new tool #5550
Some first comments.
<requirement type="package" version="1.0.0">fastk</requirement>
</requirements>
<command detect_errors="exit_code"><![CDATA[
mkdir -p outfiles/tmpfiles &&
I guess you can also run without creating `outfiles`.
i think the `outfiles` dir makes tar'ing the files down the line easier, because i just `tar -c -f fastk.tar ./outfiles/`?
tools/fastk/fastk.xml
Outdated
<command detect_errors="exit_code"><![CDATA[
mkdir -p outfiles/tmpfiles &&
cd outfiles &&
ln -s '$infile' input.fasta &&
The infile can also be in formats other than FASTA.
true, that was my short-sightedness from just using this with FASTAs, will add the other supported formats as described in the program's readme 👍 thank you!
ok so what i am trying to figure out now is, the program seems to decide how to run based on the given file extension, so i gzipped the test data to try to make it smaller, and then i had to change the CMD to `fasta.gz` so that it would properly run on that. i'm going to mark this PR as draft for now
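The extension-dependent behaviour described above could be handled in the command block roughly as follows. This is only a sketch: the `input_name` variable and the list of datatypes checked are assumptions for illustration, not the code from this PR, and the FastK call itself is left out.

```xml
<command detect_errors="exit_code"><![CDATA[
    ## Pick a link name whose extension matches the dataset's datatype
    ## (input_name is a hypothetical variable name).
    #if $infile.is_of_type('fasta.gz')
        #set $input_name = 'input.fasta.gz'
    #elif $infile.is_of_type('fastq.gz')
        #set $input_name = 'input.fastq.gz'
    #elif $infile.is_of_type('fastq')
        #set $input_name = 'input.fastq'
    #else
        #set $input_name = 'input.fasta'
    #end if
    mkdir -p outfiles/tmpfiles &&
    cd outfiles &&
    ln -s '$infile' '$input_name'
    ## ... the FastK call on '$input_name' would follow here
]]></command>
```

Checking the compressed datatypes first keeps the mapping unambiguous, since the final branch falls back to plain FASTA.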
If you set `ftype="fasta"` on the test's input, it should be extracted automatically.
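For illustration, a test input declared this way might look like the line below. The parameter name and file name are taken from the test snippet quoted elsewhere in this thread; whether the gzipped file is decompressed on upload depends on Galaxy's upload handling, so treat this as a sketch to verify locally.

```xml
<!-- The test file is gzipped on disk; ftype="fasta" asks Galaxy to
     treat the uploaded dataset as plain FASTA. -->
<param name="infile" value="input01.fasta.gz" ftype="fasta"/>
```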
tools/fastk/fastk.xml
Outdated
</conditional>
</inputs>
<outputs>
<data name="fastk_out" format="tar" from_work_dir="outfiles/fastk.tar" label="${tool.name} on ${on_string}: FastK hist files"/>
Maybe a collection would be better than a tar?
the output files won't be needed/used on their own, only ever as a folder with those files inside them, so that was the rationale for tar
I understand. But then we have one optional output `tabex_hist` and one output that "won't be needed/used ...". So what is supposed to be used as output?
oh, sorry about the confusion -- when k-mer counting, there should always be an output of the `output.hist` file. optionally, if the user specifies the `-t` option (`-k` only sets the k-mer size), then the outputs will include `output.ktab` and a number of hidden `.output.ktab.n` files, with `n` corresponding to the number of threads given to the run. these hidden files need to be in the same folder as the non-hidden `output.ktab` file in order for subsequent commands (e.g., `tabex` or `histex`) to work.
so to sum up, there will always be an `output.hist` as an output, and optionally other outputs (`.output.ktab.n` files) that might be produced, and can't be used on their own
Still I don't see if there is always an output of the Galaxy tool that can be used (within Galaxy).
there should always be an `output.hist` file, which is a binary file that can be viewed by running `Histex`... so i suppose no, there is not always a readily usable output that can be put into other tools; it needs to be processed with another tool in this suite first
Co-authored-by: Björn Grüning <[email protected]>
The linting fails because of a missing citation. You can add a bibtex and just cite the GitHub repo if there is no proper citation yet. The input fasta file is unfortunately too big, can you reduce the size?
If it's available remotely we can use
added an if/elif part to the start of the tool for input file extension detection, as the tool relies on the input file's extension to determine how to run. this made some of the later parts messy so i moved the tool out of
PS can i use a
edit: hmm looks like the test tarballs when run on github are a bit different size, i don't get that error when i'm running locally but maybe moving to doing the
tools/fastk/fastk.xml
Outdated
<outputs>
<data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK hist files"/>
<data name="tabex_hist" format="txt" from_work_dir="tabex.txt" label="${tool.name} on ${on_string}: Tabex output">
<filter>sorted_table['sorted_table_presence'] == "yes"</filter>
ah i need to change this because i added the `operation_type`/`command_type` layer!
Nothing to apologize for! Thanks for your work.
hi @bernt-matthias & @bgruening !! i've put together a flowchart of inputs/outputs of the tools in this suite and how they work within (and occasionally without) -- sorry about the delay! i hope this is helpful for figuring out how to proceed with wrapping the tool(s)? the gist is that FastK and Logex will produce binary files that are only readable within this suite (and the MerquryFK tool), but Tabex and Histex have human-readable output. Histex's output additionally can be read into GenomeScope 2.0, which is already wrapped (so we might not need to wrap GeneScopeFK, as it operates on the same output as GenomeScope 2.0 afaik)... (there are additional commands available from the FastK suite, but I do not have the need for them at the moment)
So, it appears to me that it would be good to have the intermediate binary files as output. Then we can reuse them as input for more than one program. This also gives options in terms of scalability. For this, we would need new datatypes. For
Why? We've always supported adding datatypes to the current stable release.
Cool. I repeatedly forget :)
im still alive i promise
@abueg is this ready for review?
@bgruening I think it is ok to review now, last commits were fixing some issues with testing! I might add more things in a couple weeks, but I think this should work right now for the k-mer counting & hist generation!!
tools/fastk/fastk.xml
Outdated
<param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
<conditional name="sorted_table">
<param name="sorted_table_presence" type="select" label="Sorted table selection" help="Do you want a sorted table of all canonical k-mers and their counts?">
<option value="no">No</option>
+1 for the sub-tools idea, i.e. one for `FastK`, one for `Histex`, ...
tools/fastk/fastk.xml
Outdated
<conditional name="type_sorted_table">
<param name="sorted_table_options" type="select" label="Sorted table presence" help="Do you want to specify a cut-off?">
<option value="default_sorted_table">default (1)</option>
<option value="cutoff_sorted_table">specify cutoff</option>
</param>
<when value="default_sorted_table"/>
<when value="cutoff_sorted_table">
<param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
</when>
</conditional>
It's still optional. But the conditional is not needed for this.
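One possible reading of this suggestion, sketched under the assumption that an empty value should mean "use FastK's default cutoff": a single optional integer parameter takes the place of the inner conditional. The help text and `min` value here are illustrative choices, not taken from the PR.

```xml
<!-- Stands in for the type_sorted_table conditional: leaving the field
     empty keeps the tool's default cutoff. -->
<param name="sorted_table_cutoff" type="integer" min="1" optional="true"
       label="Sorted table cutoff value"
       help="Leave empty to use the default cutoff (1)."/>
```

In the command block the value could then be tested with something like `#if str($sorted_table_cutoff) != ''`, with the exact path depending on how the parameter ends up nested.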
</param>
<when value="FastK">
<param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
<conditional name="sorted_table">
Do you think it would be a good idea to flatten the nested conditionals and just have one with 3 options:
- No
- Yes with default threshold
- Yes with manual threshold
?
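A rough sketch of what the flattened version could look like. The option values `default` and `manual` are hypothetical names chosen for this example; the parameter names and labels are reused from the snippets quoted in this review.

```xml
<conditional name="sorted_table">
    <param name="sorted_table_presence" type="select" label="Sorted table selection"
           help="Do you want a sorted table of all canonical k-mers and their counts?">
        <option value="no">No</option>
        <option value="default">Yes, with default threshold (1)</option>
        <option value="manual">Yes, with manual threshold</option>
    </param>
    <when value="no"/>
    <when value="default"/>
    <when value="manual">
        <param name="sorted_table_cutoff" type="integer" min="2" value="10"
               label="Sorted table cutoff value"/>
    </when>
</conditional>
```

With this layout, an output filter that currently compares against `"yes"` would presumably need to check for `!= "no"` instead.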
I like that idea. @abueg does this make sense?
<data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK files"/>
<data name="fastk_hist" format="binary" from_work_dir="outfiles/output.hist" label="${tool.name} on ${on_string}: FastK hist" />
Those two outputs will also need a filter, or?
ah you mean a filter for if `FastK` was the selected tool?
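Such a filter might look like the sketch below. It assumes that `operation_type` is the outer conditional and `command_type` is its select parameter; the thread mentions both names but not which is which, so adjust accordingly.

```xml
<data name="fastk_hist" format="binary" from_work_dir="outfiles/output.hist"
      label="${tool.name} on ${on_string}: FastK hist">
    <!-- Assumption: operation_type is the conditional, command_type its select. -->
    <filter>operation_type['command_type'] == "FastK"</filter>
</data>
```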
@@ -100,7 +100,7 @@
<param name="infile" value="input01.fasta.gz"/>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_size value="266240" delta="1000" />
<has_archive_member path="./outfiles/output.hist" />
though i guess for the test i can specify `-T1`
I would suggest avoiding this.
for content assertions, the hist and ktab files are binaries, so i didn't think the `has_text`/`not_has_text` assertions would work, but please correct me if wrong!
Indeed. Then maybe `has_size` is your friend :)
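For a binary output, a size-based check could be sketched like this. The numbers are placeholders only and would need to come from an actual local test run; the output name is taken from the `fastk_hist` output quoted in this review.

```xml
<output name="fastk_hist" ftype="binary">
    <assert_contents>
        <!-- Placeholder numbers: take the real size from a local test run. -->
        <has_size value="10000" delta="500"/>
    </assert_contents>
</output>
```

A generous `delta` helps when file sizes differ slightly between a local run and CI, as reported earlier in the thread for the tarballs.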
</when>
</conditional>
</when>
<when value="Histex">
Do you plan to implement this as a single tool?
sorry, do you mean implement `Histex` as a single tool, or implement all these others as a single tool with FastK?
I was just wondering. Both would be fine.
We can maybe help with splitting the tools later. Let's go with what is here for now.
…us file sizes from running the tests in planemo serve and downloading the files
@abueg if we comment out the other functions it is ready to go.
<option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option>
<!--option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option-->
<when value="Histex">
</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when>
<!--when value="Histex">
</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when-->
I would suggest making this a suite with one tool per subcommand. The change should be about as simple as the suggestions made by @bgruening.
This makes scheduling much easier (I guess the steps need different amounts of resources).
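If the wrappers were split into a suite, shared pieces would typically live in a macros file that each tool imports. A minimal sketch, assuming a file named `macros.xml` next to the tool XMLs and reusing the `fastk` package version quoted at the top of this review:

```xml
<macros>
    <!-- Single place to bump the version for every tool in the suite. -->
    <token name="@TOOL_VERSION@">1.0.0</token>
    <xml name="requirements">
        <requirements>
            <requirement type="package" version="@TOOL_VERSION@">fastk</requirement>
        </requirements>
    </xml>
</macros>
```

Each per-subcommand tool would then import `macros.xml` in its own `<macros>` section and pull in the shared block with `<expand macro="requirements"/>`.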
Guess this will be superseded by #5965
ACK sorry yes @SaimMomin12 has very kindly taken over the wrappers for this suite!!
Closed because, as has been mentioned, this is superseded by PR #5965.
Hello! 👋🏼
This PR adds the k-mer counting/manipulation tool FastK as a Galaxy tool. There are 3 tests using different sets of parameters 👍🏼
The license permits use as long as the license text is reproduced in the binary, which I believe the conda package respects, here is the license: https://github.com/thegenemyers/FASTK/blob/master/LICENSE
Thank you @astrovsky01 for all your help with this!! 🙏🏼