FastK: adding new tool #5550

Closed
wants to merge 39 commits into from
Changes from 36 commits
81b4e81
starting fastk tool wrapper
abueg Nov 9, 2022
acbf696
gfsaga
abueg Nov 11, 2022
c0ca84c
fdsaf
abueg Nov 11, 2022
73911c5
this can be served
abueg Nov 15, 2022
7b9270f
test?
abueg Nov 17, 2022
40db534
indentation
abueg Nov 17, 2022
1530579
took stuff out and this works very basic
abueg Nov 18, 2022
59ef44c
testdata
abueg Sep 28, 2023
a553432
testdata
abueg Oct 6, 2023
371c28f
tar testfiles
abueg Oct 6, 2023
1f2cfc5
tar testfiles3
abueg Oct 6, 2023
74d75a4
tests
abueg Oct 26, 2023
0cceb90
macros
abueg Oct 26, 2023
3628496
Merge branch 'galaxyproject:main' into fastk
abueg Oct 27, 2023
905b1e2
Create .shed.yml
abueg Oct 27, 2023
b682572
Update .shed.yml
bgruening Oct 27, 2023
ad04760
remove `ls -lah`
abueg Oct 27, 2023
e0663ab
gzip test input
abueg Oct 28, 2023
08d8320
added citation
abueg Oct 28, 2023
82dfa9c
gzipped test file
abueg Oct 28, 2023
97887e8
testing if bits for input file ext detection
abueg Oct 30, 2023
ba42b65
change two ifs to an elif
abueg Oct 30, 2023
e5b0606
move wd outside of outfiles
abueg Oct 30, 2023
30912b3
remove fastk command from the conditional, use variables instead
abueg Oct 30, 2023
f6bafef
rest of input elifs
abueg Oct 30, 2023
2977741
change tests to has_archive_member
abueg Oct 31, 2023
cada3fa
moving tokens to macros
abueg Oct 31, 2023
e25160e
moving citation and requirements to macros
abueg Oct 31, 2023
af3c940
remove period in desc
abueg Oct 31, 2023
d6f1382
reformatting to have other tools. test 2 & 3 good. discuss do this or…
abueg Oct 31, 2023
6a717f2
fixed `version_suffix` typo in macros.xml
abueg Jan 16, 2024
ea44af0
filter
abueg Jan 30, 2024
a2eb1a5
test 1 fix
abueg Jan 30, 2024
5dc3e37
help text
abueg Jan 30, 2024
5ede1b7
escaping unescaped characters
abueg Jan 30, 2024
8ea1904
filetype detection and unsorted bam edits
abueg Feb 26, 2024
6cfbf27
additional has_size assertions
abueg Mar 6, 2024
83f1327
typo
abueg Mar 6, 2024
94d876c
updated with size vals from failed tool test output. i got the previo…
abueg Mar 6, 2024
10 changes: 10 additions & 0 deletions tools/fastk/.shed.yml
@@ -0,0 +1,10 @@
categories:
- Assembly
description: "FastK: A K-mer counter (for HQ assembly data sets)"
homepage_url: https://github.com/thegenemyers/FASTK
long_description: |
  FastK is a k-mer counter that is optimized for processing high-quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode.
name: fastk
owner: iuc
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/main/tools/fastk
type: unrestricted
169 changes: 169 additions & 0 deletions tools/fastk/fastk.xml
@@ -0,0 +1,169 @@
<tool id="fastk" name="FastK" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="20.01">
<description>A k-mer counter for high-quality assembly datasets</description>
<macros>
<import>macros.xml</import>
</macros>
<expand macro="requirements" />
<command detect_errors="exit_code"><![CDATA[
mkdir -p outfiles/tmpfiles &&
Contributor

I guess you can also run without creating outfiles.

Contributor Author

i think the outfiles dir makes tar'ing the files down the line easier, because then i can just run `tar -c -f fastk.tar ./outfiles/`?

#if $infile.ext == "fasta":
ln -s '$infile' ./input.fasta &&
#set INPUTFILE="input.fasta"
#elif $infile.ext == "fasta.gz":
ln -s '$infile' ./input.fasta.gz &&
#set INPUTFILE="input.fasta.gz"
#elif $infile.is_of_type("fastq"):
ln -s '$infile' ./input.fastq &&
#set INPUTFILE="input.fastq"
#elif $infile.is_of_type("fastq.gz"):
ln -s '$infile' ./input.fastq.gz &&
#set INPUTFILE="input.fastq.gz"
#elif $infile.ext == "cram":
ln -s '$infile' ./input.cram &&
#set INPUTFILE="input.cram"
#elif $infile.is_of_type("unsorted.bam"):
ln -s '$infile' ./input.bam &&
#set INPUTFILE="input.bam"
#elif $infile.ext == "sam":
ln -s '$infile' ./input.sam &&
#set INPUTFILE="input.sam"
#end if
#if $operation_type.command_type == 'FastK':
FastK $INPUTFILE
#set kmer_size=$operation_type.kmer_size
-k$kmer_size
#if $operation_type.sorted_table.sorted_table_presence == 'yes':
#if $operation_type.sorted_table.type_sorted_table.sorted_table_options == 'default_sorted_table':
-t
#elif $operation_type.sorted_table.type_sorted_table.sorted_table_options == 'cutoff_sorted_table':
#set sorted_table_cutoff=$operation_type.sorted_table.type_sorted_table.sorted_table_cutoff
-t$sorted_table_cutoff
#end if
#end if
-T\${GALAXY_SLOTS:-1} -Noutfiles/output -Poutfiles/tmpfiles
#if $operation_type.sorted_table.sorted_table_presence == 'yes':
&& Tabex outfiles/output.ktab -t${operation_type.sorted_table.advanced.tabex_threshold} LIST > tabex.txt
#end if
&& tar -c -f fastk.tar ./outfiles/
#elif $operation_type.command_type == 'Histex':
Histex
#elif $operation_type.command_type == 'Tabex':
Tabex
#elif $operation_type.command_type == 'Profex':
Profex
#elif $operation_type.command_type == 'Logex':
Logex
#elif $operation_type.command_type == 'Symmex':
Symmex
#end if
]]></command>
<inputs>
<param name="infile" type="data" format="fasta,fasta.gz,fastq,fastq.gz,cram,unsorted.bam,sam" label="Input file" />
<conditional name="operation_type">
<param name="command_type" type="select" label="Operation type selector" help="Select a type of operation">
<option value="FastK">FastK: count k-mers</option>
<option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option>
Comment on lines +65 to +69
Member

Suggested change
<option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option>
<!--option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option-->

</param>
<when value="FastK">
<param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
<conditional name="sorted_table">
Contributor

Do you think it would be a good idea to flatten the nested conditionals and just have one with 3 options:

  • No
  • Yes with default threshold
  • Yes with manual threshold

?

Member

I like that idea. @abueg does this make sense?
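
A minimal sketch of what the flattened version could look like, assuming a single conditional with three options. The option values (`no`, `default`, `cutoff`) and the reuse of the `sorted_table`, `sorted_table_presence` and `sorted_table_cutoff` names are illustrative assumptions, not something this PR defines. The `advanced` section with `tabex_threshold` is left out here; it would have to be repeated in both "yes" branches or moved elsewhere.

```xml
<!-- Hypothetical flattened conditional: one select with three options
     instead of two nested conditionals. Option values are assumptions. -->
<conditional name="sorted_table">
    <param name="sorted_table_presence" type="select" label="Sorted table selection">
        <option value="no" selected="true">No</option>
        <option value="default">Yes, with the default cut-off</option>
        <option value="cutoff">Yes, with a manual cut-off</option>
    </param>
    <when value="no"/>
    <when value="default"/>
    <when value="cutoff">
        <param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
    </when>
</conditional>
```

With a layout like this, the nested checks in the command block would collapse to a single `#if`/`#elif` on `$operation_type.sorted_table.sorted_table_presence`, and the `tabex_hist` filter and the tests would need matching values.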

<param name="sorted_table_presence" type="select" label="Sorted table selection" help="Do you want a sorted table of all canonical k-mers and their counts? The sorted table is sorted lexicographically on the k-mer where a &lt; c &lt; g &lt; t.">
<option value="no">No</option>
<option value="yes">Yes</option>
</param>
<when value="no"/>
<when value="yes">
<conditional name="type_sorted_table">
<param name="sorted_table_options" type="select" label="Sorted table presence" help="Do you want to specify a cut-off? If you do, only k-mers with counts above that cut-off will be included in the table.">
<option value="default_sorted_table">default (1)</option>
<option value="cutoff_sorted_table">specify cutoff</option>
</param>
<when value="default_sorted_table"/>
<when value="cutoff_sorted_table">
<param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
</when>
</conditional>
<section title="Advanced" name="advanced">
<param name="tabex_threshold" label="Tabex count threshold" type="integer" value="5" min="1"/>
</section>
</when>
</conditional>
</when>
<when value="Histex">
Contributor

Do you plan to implement this as a single tool?

Contributor Author

sorry, do you mean implement Histex as a single tool, or implement all these others as a single tool with FastK?

Contributor

I was just wondering. Both would be fine.

Member

We can maybe help with splitting the tools later. Let's go with what is here for now.

</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when>
Comment on lines +96 to +105
Member

Suggested change
<when value="Histex">
</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when>
<!--when value="Histex">
</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when-->

</conditional>
</inputs>
<outputs>
<data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK files"/>
<data name="fastk_hist" format="binary" from_work_dir="outfiles/output.hist" label="${tool.name} on ${on_string}: FastK hist" />
Comment on lines +109 to +110
Contributor

Those two outputs will also need a filter, won't they?

Contributor Author

ah you mean a filter for if FastK was the selected tool?
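
For illustration, a sketch of how the two unfiltered outputs might be restricted to the FastK operation, assuming the same `operation_type` conditional as in this diff. Whether the other operations would reuse these outputs is not settled in this thread, so treat this as one possible shape rather than the agreed fix.

```xml
<!-- Sketch: only create the FastK archive and histogram when FastK is the selected operation. -->
<data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK files">
    <filter>operation_type['command_type'] == 'FastK'</filter>
</data>
<data name="fastk_hist" format="binary" from_work_dir="outfiles/output.hist" label="${tool.name} on ${on_string}: FastK hist">
    <filter>operation_type['command_type'] == 'FastK'</filter>
</data>
```

Each filter is a Python expression evaluated against the tool parameters, so nested conditionals are reached by chaining keys, for example `operation_type['sorted_table']['sorted_table_presence']`.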

<data name="tabex_hist" format="txt" from_work_dir="tabex.txt" label="${tool.name} on ${on_string}: Tabex output">
<filter>operation_type['command_type'] == 'FastK' and operation_type['sorted_table']['sorted_table_presence'] == 'yes'</filter>
</data>
</outputs>
<tests>
<!-- TEST 1 -->
<test expect_num_outputs="2">
<param name="infile" value="input01.fasta.gz" />
<param name="command_type" value="FastK" />
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_archive_member path="./outfiles/output.hist" />
Contributor

With <has_archive_member path=".*" n="10"/> you could also add an assertion on the number of files in the archive, which would be neat.

Additionally, you could make assertions on the content by nesting further assertions, like so:

<has_archive_member path="./outfiles/output.hist">
  <not_has_text text="EDK72998.1"/>
</has_archive_member>

Contributor Author

i considered adding # of files, but for the test involving sorted tables, the number of files generated can change because the number of hidden files made is dependent on how many threads are given to the program -- so it's dependent on $GALAXY_SLOTS... though i guess for the test i can specify -T1 for one thread so it's consistent?

for content assumption, the hist and ktab files are binaries, so i didn't think the has_text/not_has_text assumptions would work, but please correct me if wrong!

Contributor

> though i guess for the test i can specify -T1

I would suggest to avoid this

> for content assumption, the hist and ktab files are binaries, so i didn't think the has_text/not_has_text assumptions would work, but please correct me if wrong!

Indeed. Then maybe has_size is your friend :)

Contributor Author

added has_size assertions!
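
A sketch of what such `has_size` assertions could look like, nested inside `has_archive_member` in the same way as the `not_has_text` example earlier in this thread. The byte counts and deltas are placeholders chosen for illustration, not sizes measured from this tool's output.

```xml
<output name="fastk_out" ftype="tar">
    <assert_contents>
        <!-- value/delta are placeholder numbers, not measured sizes -->
        <has_archive_member path="./outfiles/output.hist">
            <has_size value="262000" delta="1000"/>
        </has_archive_member>
    </assert_contents>
</output>
<!-- has_size can also be applied to a whole output dataset -->
<output name="fastk_hist" ftype="binary">
    <assert_contents>
        <has_size value="262000" delta="1000"/>
    </assert_contents>
</output>
```

Using `delta` gives the check some tolerance, which helps if the binary output differs by a few bytes between runs or platforms.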

</assert_contents>
</output>
</test>
<!-- TEST 2 -->
<test expect_num_outputs="3">
<param name="infile" value="input01.fasta.gz" />
<param name="command_type" value="FastK" />
<conditional name="sorted_table">
<param name="sorted_table_presence" value="yes"/>
<conditional name="type_sorted_table">
<param name="sorted_table_options" value="default_sorted_table"/>
</conditional>
</conditional>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_archive_member path="./outfiles/output.hist" />
<has_archive_member path="./outfiles/output.ktab" />
</assert_contents>
</output>
<output name="tabex_hist" value="test02.tabex.txt"/>
</test>
<!-- TEST 3 -->
<test expect_num_outputs="3">
<param name="infile" value="input01.fasta.gz" />
<param name="command_type" value="FastK" />
<conditional name="sorted_table">
<param name="sorted_table_presence" value="yes"/>
<conditional name="type_sorted_table">
<param name="sorted_table_options" value="cutoff_sorted_table"/>
<param name="sorted_table_cutoff" value="5"/>
</conditional>
</conditional>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_archive_member path="./outfiles/output.hist" />
<has_archive_member path="./outfiles/output.ktab" />
</assert_contents>
</output>
<output name="tabex_hist" value="test03.tabex.txt"/>
</test>
</tests>
<help><![CDATA[
FastK is a k-mer counter that is optimized for processing high-quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode.
]]></help>
<expand macro="citations" />
</tool>

23 changes: 23 additions & 0 deletions tools/fastk/macros.xml
@@ -0,0 +1,23 @@
<macros>
<token name="@TOOL_VERSION@">1.0.0</token>
<token name="@VERSION_SUFFIX@">0</token>
<xml name="requirements">
<requirements>
<requirement type="package" version="@TOOL_VERSION@">fastk</requirement>
</requirements>
</xml>
<xml name="citations">
<citations>
<citation type="bibtex">
@misc{github,
author = {Gene Myers},
year = {2020},
title = {FastK},
publisher = {GitHub},
journal = {GitHub repository},
url = {https://github.com/thegenemyers/FASTK},
}
</citation>
</citations>
</xml>
</macros>
Binary file added tools/fastk/test-data/input01.fasta.gz
Binary file not shown.
Binary file added tools/fastk/test-data/test02.tabex.txt
Binary file not shown.
Binary file added tools/fastk/test-data/test03.tabex.txt
Binary file not shown.