FastK: adding new tool #5550

Closed
Changes from 14 commits
39 commits
81b4e81
starting fastk tool wrapper
abueg Nov 9, 2022
acbf696
gfsaga
abueg Nov 11, 2022
c0ca84c
fdsaf
abueg Nov 11, 2022
73911c5
this can be served
abueg Nov 15, 2022
7b9270f
test?
abueg Nov 17, 2022
40db534
indentation
abueg Nov 17, 2022
1530579
took stuff out and this works very basic
abueg Nov 18, 2022
59ef44c
testdata
abueg Sep 28, 2023
a553432
testdata
abueg Oct 6, 2023
371c28f
tar testfiles
abueg Oct 6, 2023
1f2cfc5
tar testfiles3
abueg Oct 6, 2023
74d75a4
tests
abueg Oct 26, 2023
0cceb90
macros
abueg Oct 26, 2023
3628496
Merge branch 'galaxyproject:main' into fastk
abueg Oct 27, 2023
905b1e2
Create .shed.yml
abueg Oct 27, 2023
b682572
Update .shed.yml
bgruening Oct 27, 2023
ad04760
remove `ls -lah`
abueg Oct 27, 2023
e0663ab
gzip test input
abueg Oct 28, 2023
08d8320
added citation
abueg Oct 28, 2023
82dfa9c
gzipped test file
abueg Oct 28, 2023
97887e8
testing if bits for input file ext detection
abueg Oct 30, 2023
ba42b65
change two ifs to an elif
abueg Oct 30, 2023
e5b0606
move wd outside of outfiles
abueg Oct 30, 2023
30912b3
remove fastk command from the conditional, use variables instead
abueg Oct 30, 2023
f6bafef
rest of input elifs
abueg Oct 30, 2023
2977741
change tests to has_archive_member
abueg Oct 31, 2023
cada3fa
moving tokens to macros
abueg Oct 31, 2023
e25160e
moving citation and requirements to macros
abueg Oct 31, 2023
af3c940
remove period in desc
abueg Oct 31, 2023
d6f1382
reformatting to have other tools. test 2 & 3 good. discuss do this or…
abueg Oct 31, 2023
6a717f2
fixed `version_suffix` typo in macros.xml
abueg Jan 16, 2024
ea44af0
filter
abueg Jan 30, 2024
a2eb1a5
test 1 fix
abueg Jan 30, 2024
5dc3e37
help text
abueg Jan 30, 2024
5ede1b7
escaping unescaped characters
abueg Jan 30, 2024
8ea1904
filetype detection and unsorted bam edits
abueg Feb 26, 2024
6cfbf27
additional has_size assertions
abueg Mar 6, 2024
83f1327
typo
abueg Mar 6, 2024
94d876c
updated with size vals from failed tool test output. i got the previo…
abueg Mar 6, 2024
115 changes: 115 additions & 0 deletions tools/fastk/fastk.xml
@@ -0,0 +1,115 @@
<tool id="fastk" name="FastK" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="20.01">
<description>A k-mer counter for high-quality assembly datasets.</description>
<macros>
<token name="@TOOL_VERSION@">1.0</token>
<token name="@VERSION_SUFFIX@">0</token>
</macros>
<requirements>
<requirement type="package" version="1.0.0">fastk</requirement>
</requirements>
<command detect_errors="exit_code"><![CDATA[
mkdir -p outfiles/tmpfiles &&
Contributor

I guess you can also run without creating outfiles.

Contributor Author

i think the outfiles dir makes tar'ing the files down the line easier, because i just tar -c -f fastk.tar ./outfiles/ ?

cd outfiles &&
ln -s '$infile' input.fasta &&
Contributor

The infile can also be in formats other than FASTA.

Contributor Author

true, that was my short-sightedness from just using this with FASTAs, will add the other supported formats as described in the program's readme 👍 thank you!

Contributor Author

ok so what i am trying to figure out now is, the program seems to decide how to run based on the given file extension, so i gzipped the test data to try to make it smaller, and then i had to change the CMD to fasta.gz so that it would properly run on that. i'm going to mark this PR as draft for now

Contributor

If you set ftype="fasta" on the test input, it should be extracted automatically.
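
A minimal sketch of that suggestion, assuming the gzipped test file is called `input01.fasta.gz` (a hypothetical name) and that Galaxy decompresses the upload when the declared datatype is the uncompressed `fasta`. The size assertion is copied from Test 1 in this diff and may need adjusting:

```xml
<test expect_num_outputs="1">
    <!-- input01.fasta.gz is a hypothetical name for the gzipped test input -->
    <!-- ftype="fasta" declares the uncompressed datatype for the uploaded file -->
    <param name="infile" value="input01.fasta.gz" ftype="fasta"/>
    <output name="fastk_out" ftype="tar">
        <assert_contents>
            <has_size value="266240" delta="1000"/>
        </assert_contents>
    </output>
</test>
```

If this works as assumed, the wrapper would keep seeing a plain FASTA dataset, so the `ln -s '$infile' input.fasta` line would not need a `.gz` variant for this test.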

FastK input.fasta
#set kmer_size=$kmer_size
-k$kmer_size
#if $sorted_table.sorted_table_presence == 'yes':
#if $sorted_table.type_sorted_table.sorted_table_options == 'default_sorted_table':
-t
#elif $sorted_table.type_sorted_table.sorted_table_options == 'cutoff_sorted_table':
#set sorted_table_cutoff=$sorted_table.type_sorted_table.sorted_table_cutoff
-t$sorted_table_cutoff
#end if
#end if
-T\${GALAXY_SLOTS:-1} -Noutput -Ptmpfiles
#if $sorted_table.sorted_table_presence == 'yes':
&& Tabex output.ktab -t${sorted_table.advanced.tabex_threshold} LIST > tabex.txt
&& mv tabex.txt ..
#end if
&& mv input.fasta ..
&& ls -lah
&& tar -c -f fastk.tar .
]]></command>
<inputs>
<param name="infile" type="data" format="fasta,fasta.gz,fastq,fastq.gz,cram,bam,sam" label="Input file" />
<param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
<conditional name="sorted_table">
<param name="sorted_table_presence" type="select" label="Sorted table selection" help="Do you want a sorted table of all canonical k-mers and their counts?">
<option value="no">No</option>
Contributor

Still wondering: what is the use of this tool if the user selects No? The tool will only produce a tar, which can't be used in Galaxy, can it?

Instead of yes/no, shouldn't the user select from at least one of the tools fastk provides:

  • Histex: Display a FastK histogram or convert to 1-code
  • Tabex: List, Check, find a k‑mer in a FastK table, or convert to 1-code
  • Profex: Display a FastK profile or convert to 1-code
  • Logex: Combine kmer,count tables with logical expressions & filter with count cutoffs
  • Symmex: Produce a symmetric k-mer table from a canonical one

If you agree, but don't have the time or need for all of them then maybe prepare the tool such that it can be easily extended in this way?

Contributor Author

i agree with that structure for how to run the tool! i think i tried to do that when i initially was making the skeleton for the tool:
https://github.com/abueg/tools-iuc/blob/73911c5e8a5b5e33f2ca70c253bdf7f7868e7eba/tools/fastk/fastk.xml
... but it got lost in the edits along the way. but would that sort of structure work for this?

Contributor Author

would i put a placeholder in the other option elifs? or just have fastk and tabex implemented?

Contributor

@bernt-matthias each of the sub-tools for fastk use the fastk tar as an input. The original intention was to have this be a suite where you would run fastk and the output file would be useable by any of the other tools without needing to rerun fastk itself. We can add a separate output for the .hist files aside from the tar file, but the intent was to standardize an input for the histex, tabex, etc. tools to use

Contributor

+1 for the sub-tools idea, i.e. one for Fastk, one for Histex, ...

<option value="yes">Yes</option>
</param>
<when value="no"/>
<when value="yes">
<conditional name="type_sorted_table">
<param name="sorted_table_options" type="select" label="Sorted table presence" help="Do you want to specify a cut-off?">
<option value="default_sorted_table">default (1)</option>
<option value="cutoff_sorted_table">specify cutoff</option>
</param>
<when value="default_sorted_table"/>
<when value="cutoff_sorted_table">
<param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
</when>
</conditional>
Contributor bernt-matthias, Oct 31, 2023

Instead of this, maybe:

Suggested change
- <conditional name="type_sorted_table">
- <param name="sorted_table_options" type="select" label="Sorted table presence" help="Do you want to specify a cut-off?">
- <option value="default_sorted_table">default (1)</option>
- <option value="cutoff_sorted_table">specify cutoff</option>
- </param>
- <when value="default_sorted_table"/>
- <when value="cutoff_sorted_table">
- <param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
- </when>
- </conditional>
+ <param name="sorted_table_cutoff" type="integer" min="2" optional="true" label="Sorted table cutoff value"/>

Plus the necessary change in the command section (ie check if a value is given).

Contributor

Several of the subsequent tools take an input that ran without the sorted table cutoff flag, which is why we made it optional

Contributor

It's still optional. But the conditional is not needed for this.
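
One way the command section might check whether a value was given, as suggested in this thread. This is only a sketch: it assumes `sorted_table_cutoff` becomes an optional integer placed directly inside the `yes` branch of `sorted_table` (so it is addressed as `$sorted_table.sorted_table_cutoff`), and that an unset optional integer renders as an empty string in the Cheetah template. Only the FastK invocation is shown; the surrounding `mkdir`, `ln`, and `tar` steps are omitted.

```xml
<command detect_errors="exit_code"><![CDATA[
FastK input.fasta
    -k$kmer_size
    #if $sorted_table.sorted_table_presence == 'yes':
        #if str($sorted_table.sorted_table_cutoff) != '':
            ## a cutoff was given: request a table with that count cutoff
            -t$sorted_table.sorted_table_cutoff
        #else
            ## no value given: plain -t, i.e. the default cutoff of 1
            -t
        #end if
    #end if
    -T\${GALAXY_SLOTS:-1} -Noutput -Ptmpfiles
]]></command>
```

With this shape the inner `type_sorted_table` conditional is no longer needed, and the tests would set `sorted_table_cutoff` directly under `sorted_table`.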

<section title="Advanced" name="advanced">
<param name="tabex_threshold" label="Tabex count threshold" type="integer" value="5" min="1"/>
</section>
</when>
</conditional>
</inputs>
<outputs>
<data name="fastk_out" format="tar" from_work_dir="outfiles/fastk.tar" label="${tool.name} on ${on_string}: FastK hist files"/>
Contributor

Maybe a collection would be better than a tar?

Contributor Author

the output files won't be needed/used on their own, only ever as a folder with those files inside them, so that was the rationale for tar

Contributor

I understand. But then we have one optional output tabex_hist and one output that "won't be needed/used ...". So what is supposed to be used as output?

Contributor Author

oh, sorry about the confusion -- when k-mer counting, there should always be an output of the output.hist file. optionally, if the user specifies the -t option, then the outputs will include output.ktab and a number of hidden .output.ktab.n files, with n corresponding to the number of threads given to the run. these hidden files need to be in the same folder as the non-hidden output.ktab file in order for subsequent commands (e.g., tabex or histex) to work.

so to sum up, there will always be an output.hist as an output, and optionally other outputs (output.ktab and the hidden .output.ktab.n files) that might be produced and can't be used on their own

Contributor

I still don't see whether there is always an output of the Galaxy tool that can be used (within Galaxy).

Contributor Author

there should always be an output.hist file, which is a binary file that can be viewed by running Histex... so i suppose no, there is not always a readily usable output that can be put into other tools; it needs to be processed with another tool in this suite first
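
To illustrate the hand-off described in this thread, here is a rough sketch of how a separate Histex wrapper could consume the tar. The tool id, the parameter names `fastk_tar` and `hist_txt`, and the `fastk_in` directory are hypothetical. The bare `Histex <root>` call writing to stdout is assumed from the FastK README rather than taken from this PR. Help text and tests are omitted.

```xml
<tool id="fastk_histex" name="FastK Histex" version="1.0+galaxy0" profile="20.01">
    <description>Display a FastK histogram</description>
    <requirements>
        <requirement type="package" version="1.0.0">fastk</requirement>
    </requirements>
    <command detect_errors="exit_code"><![CDATA[
        ## unpack the archive so output.hist (and any hidden .output.ktab.N
        ## files) end up side by side in one directory
        mkdir fastk_in &&
        tar -x -f '$fastk_tar' -C fastk_in &&
        ## "output" is the root name set by -Noutput in the FastK wrapper
        Histex fastk_in/output > '$hist_txt'
    ]]></command>
    <inputs>
        <param name="fastk_tar" type="data" format="tar" label="FastK output archive"/>
    </inputs>
    <outputs>
        <data name="hist_txt" format="txt" label="${tool.name} on ${on_string}: histogram"/>
    </outputs>
</tool>
```

Under these assumptions the tar stays the standard input for every sub-tool, while each sub-tool produces a text output that is directly usable within Galaxy.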

<data name="tabex_hist" format="txt" from_work_dir="tabex.txt" label="${tool.name} on ${on_string}: Tabex output">
<filter>sorted_table['sorted_table_presence'] == "yes"</filter>
Contributor Author

ah i need to change this because i added the operation_type/command_type layer!

</data>
</outputs>
<tests>
<!-- TEST 1 -->
<test expect_num_outputs="1">
<param name="infile" value="input01.fasta"/>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_size value="266240" delta="1000" />
</assert_contents>
</output>
</test>
<!-- TEST 2 -->
<test expect_num_outputs="2">
<param name="infile" value="input01.fasta"/>
<conditional name="sorted_table">
<param name="sorted_table_presence" value="yes"/>
<conditional name="type_sorted_table">
<param name="sorted_table_options" value="default_sorted_table"/>
</conditional>
</conditional>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_size value="5826560" delta="1000" />
</assert_contents>
</output>
<output name="tabex_hist" value="test02.tabex.txt"/>
</test>
<!-- TEST 3 -->
<test expect_num_outputs="2">
<param name="infile" value="input01.fasta"/>
<conditional name="sorted_table">
<param name="sorted_table_presence" value="yes"/>
<conditional name="type_sorted_table">
<param name="sorted_table_options" value="cutoff_sorted_table"/>
<param name="sorted_table_cutoff" value="5"/>
</conditional>
</conditional>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_size value="276480" delta="1000" />
</assert_contents>
</output>
<output name="tabex_hist" value="test03.tabex.txt"/>
</test>
</tests>
<help><![CDATA[
FastK is a k‑mer counter that is optimized for processing high quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode.
]]></help>
<!-- <expand macro="citations"/> -->
</tool>

24 changes: 24 additions & 0 deletions tools/fastk/macros.xml
@@ -0,0 +1,24 @@
<macros>
<token name="@TOOL_VERSION@">1.0</token>
<token name="@GALAXY_TOOL_VERSION@">galaxy</token>
<token name="@VERSION_SUFFIX@">0</token>
<xml name="requirements">
<requirements>
<requirement type="package" version="@TOOL_VERSION@">fastk</requirement>
</requirements>
</xml>
<xml name="citations">
<citations>
<citation type="bibtex">
@misc{github,
author = {Gene Myers},
year = {2020},
title = {FastK},
publisher = {GitHub},
journal = {GitHub repository},
url = {https://github.com/thegenemyers/FASTK},
}
</citation>
</citations>
</xml>
</macros>