FastK: adding new tool #5550

Closed
wants to merge 39 commits into from
Changes from 26 commits
Commits
81b4e81
starting fastk tool wrapper
abueg Nov 9, 2022
acbf696
gfsaga
abueg Nov 11, 2022
c0ca84c
fdsaf
abueg Nov 11, 2022
73911c5
this can be served
abueg Nov 15, 2022
7b9270f
test?
abueg Nov 17, 2022
40db534
indentation
abueg Nov 17, 2022
1530579
took stuff out and this works very basic
abueg Nov 18, 2022
59ef44c
testdata
abueg Sep 28, 2023
a553432
testdata
abueg Oct 6, 2023
371c28f
tar testfiles
abueg Oct 6, 2023
1f2cfc5
tar testfiles3
abueg Oct 6, 2023
74d75a4
tests
abueg Oct 26, 2023
0cceb90
macros
abueg Oct 26, 2023
3628496
Merge branch 'galaxyproject:main' into fastk
abueg Oct 27, 2023
905b1e2
Create .shed.yml
abueg Oct 27, 2023
b682572
Update .shed.yml
bgruening Oct 27, 2023
ad04760
remove `ls -lah`
abueg Oct 27, 2023
e0663ab
gzip test input
abueg Oct 28, 2023
08d8320
added citation
abueg Oct 28, 2023
82dfa9c
gzipped test file
abueg Oct 28, 2023
97887e8
testing if bits for input file ext detection
abueg Oct 30, 2023
ba42b65
change two ifs to an elif
abueg Oct 30, 2023
e5b0606
move wd outside of outfiles
abueg Oct 30, 2023
30912b3
remove fastk command from the conditional, use variables instead
abueg Oct 30, 2023
f6bafef
rest of input elifs
abueg Oct 30, 2023
2977741
change tests to has_archive_member
abueg Oct 31, 2023
cada3fa
moving tokens to macros
abueg Oct 31, 2023
e25160e
moving citation and requirements to macros
abueg Oct 31, 2023
af3c940
remove period in desc
abueg Oct 31, 2023
d6f1382
reformatting to have other tools. test 2 & 3 good. discuss do this or…
abueg Oct 31, 2023
6a717f2
fixed `version_suffix` typo in macros.xml
abueg Jan 16, 2024
ea44af0
filter
abueg Jan 30, 2024
a2eb1a5
test 1 fix
abueg Jan 30, 2024
5dc3e37
help text
abueg Jan 30, 2024
5ede1b7
escaping unescaped characters
abueg Jan 30, 2024
8ea1904
filetype detection and unsorted bam edits
abueg Feb 26, 2024
6cfbf27
additional has_size assertions
abueg Mar 6, 2024
83f1327
typo
abueg Mar 6, 2024
94d876c
updated with size vals from failed tool test output. i got the previo…
abueg Mar 6, 2024
10 changes: 10 additions & 0 deletions tools/fastk/.shed.yml
@@ -0,0 +1,10 @@
categories:
- Assembly
description: "FastK: A K-mer counter (for HQ assembly data sets)"
homepage_url: https://github.com/thegenemyers/FASTK
long_description: |
FastK is a k‑mer counter that is optimized for processing high-quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode.
name: fastk
owner: iuc
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/fastk
type: unrestricted
148 changes: 148 additions & 0 deletions tools/fastk/fastk.xml
@@ -0,0 +1,148 @@
<tool id="fastk" name="FastK" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="20.01">
<description>A k-mer counter for high-quality assembly datasets</description>
<macros>
<token name="@TOOL_VERSION@">1.0.0</token>
<token name="@VERSION_SUFFIX@">0</token>
<xml name="citations">
<citations>
<citation type="bibtex">
@misc{github,
author = {Gene Myers},
year = {2020},
title = {FastK},
publisher = {GitHub},
journal = {GitHub repository},
url = {https://github.com/thegenemyers/FASTK},
}
</citation>
</citations>
</xml>
</macros>
<requirements>
<requirement type="package" version="@TOOL_VERSION@">fastk</requirement>
</requirements>
<command detect_errors="exit_code"><![CDATA[
mkdir -p outfiles/tmpfiles &&
Contributor

I guess you can also run without creating outfiles.

Contributor Author

i think the outfiles dir makes tar'ing the files down the line easier, because i can just run `tar -c -f fastk.tar ./outfiles/`?

#if $infile.ext == "fasta":
ln -s '$infile' ./input.fasta &&
#set INPUTFILE="input.fasta"
#elif $infile.ext == "fasta.gz":
ln -s '$infile' ./input.fasta.gz &&
#set INPUTFILE="input.fasta.gz"
#elif $infile.ext == "fastq":
ln -s '$infile' ./input.fastq &&
#set INPUTFILE="input.fastq"
#elif $infile.ext == "fastq.gz":
ln -s '$infile' ./input.fastq.gz &&
#set INPUTFILE="input.fastq.gz"
#elif $infile.ext == "cram":
ln -s '$infile' ./input.cram &&
#set INPUTFILE="input.cram"
#elif $infile.ext == "bam":
ln -s '$infile' ./input.bam &&
#set INPUTFILE="input.bam"
#elif $infile.ext == "sam":
ln -s '$infile' ./input.sam &&
#set INPUTFILE="input.sam"
#end if
FastK $INPUTFILE
#set kmer_size=$kmer_size
-k$kmer_size
#if $sorted_table.sorted_table_presence == 'yes':
#if $sorted_table.type_sorted_table.sorted_table_options == 'default_sorted_table':
-t
#elif $sorted_table.type_sorted_table.sorted_table_options == 'cutoff_sorted_table':
#set sorted_table_cutoff=$sorted_table.type_sorted_table.sorted_table_cutoff
-t$sorted_table_cutoff
#end if
#end if
-T\${GALAXY_SLOTS:-1} -Noutfiles/output -Poutfiles/tmpfiles
#if $sorted_table.sorted_table_presence == 'yes':
&& Tabex outfiles/output.ktab -t${sorted_table.advanced.tabex_threshold} LIST > tabex.txt
#end if
&& tar -c -f fastk.tar ./outfiles/
Contributor

Will the tar also include tmpfiles/? How about the hidden files that you mentioned? Is this intended?

Contributor Author

the tar is intended to include the hidden files, correct 👍🏼
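
A possible follow-up to the tmpfiles/ part of the question: if the FastK temporary directory should stay out of the archive while the hidden files directly under outfiles/ are kept, the `--exclude` option of GNU tar is one way to do it. This is only a sketch, it assumes GNU tar is the tar available in the tool environment, and it is not what the PR currently does.

```xml
<command detect_errors="exit_code"><![CDATA[
    ## Hypothetical variant of the final step: archive outfiles/ (hidden files included)
    ## but skip the temporary directory that was passed to FastK via -P.
    tar -c -f fastk.tar --exclude='./outfiles/tmpfiles' ./outfiles/
]]></command>
```

The `--exclude` option is placed before the path argument so that it applies to everything archived from ./outfiles/.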

]]></command>
<inputs>
<param name="infile" type="data" format="fasta,fasta.gz,fastq,fastq.gz,cram,bam,sam" label="Input file" />
<param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
<conditional name="sorted_table">
<param name="sorted_table_presence" type="select" label="Sorted table selection" help="Do you want a sorted table of all canonical k-mers and their counts?">
<option value="no">No</option>
Contributor

Still wondering: what is the use of this tool if the user selects No? The tool will only produce a tar, which can't be used in Galaxy, can it?

Instead of yes/no, shouldn't the user select from at least one of the tools fastk provides:

  • Histex: Display a FastK histogram or convert to 1-code
  • Tabex: List, Check, find a k‑mer in a FastK table, or convert to 1-code
  • Profex: Display a FastK profile or convert to 1-code
  • Logex: Combine kmer,count tables with logical expressions & filter with count cutoffs
  • Symmex: Produce a symmetric k-mer table from a canonical one

If you agree, but don't have the time or need for all of them then maybe prepare the tool such that it can be easily extended in this way?

Contributor Author

i agree with that structure for how to run the tool! i think i tried to do that when i initially was making the skeleton for the tool:
https://github.com/abueg/tools-iuc/blob/73911c5e8a5b5e33f2ca70c253bdf7f7868e7eba/tools/fastk/fastk.xml
... but it got lost in the edits along the way. but would that sort of structure work for this?

Contributor Author

would i put a placeholder in the other option elifs? or just have fastk and tabex implemented?

Contributor

@bernt-matthias each of the sub-tools for fastk use the fastk tar as an input. The original intention was to have this be a suite where you would run fastk and the output file would be useable by any of the other tools without needing to rerun fastk itself. We can add a separate output for the .hist files aside from the tar file, but the intent was to standardize an input for the histex, tabex, etc. tools to use

Contributor

+1 for the sub-tools idea, i.e. one for Fastk, one for Histex, ...
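
To make the sub-tools idea concrete, a separate wrapper that consumes the FastK tar could look roughly like the sketch below. Everything specific here is an assumption for illustration: the tool id `fastk_histex`, the parameter and output names, and the bare `Histex <root>` call writing to stdout (please check it against the FASTK README). It also assumes macros.xml provides the `@TOOL_VERSION@` and `@VERSION_SUFFIX@` tokens and the `requirements` and `citations` macros.

```xml
<tool id="fastk_histex" name="FastK Histex" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="20.01">
    <description>Display a FastK histogram</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements"/>
    <command detect_errors="exit_code"><![CDATA[
        ## Unpack the archive written by the FastK tool; it contains ./outfiles/output.hist
        tar -x -f '$fastk_tar' &&
        ## Assumed invocation: Histex takes the output root and prints the histogram
        Histex outfiles/output > histex.txt
    ]]></command>
    <inputs>
        <!-- hypothetical parameter name; the input is the tar produced by the FastK tool -->
        <param name="fastk_tar" type="data" format="tar" label="FastK archive"/>
    </inputs>
    <outputs>
        <data name="histex_out" format="txt" from_work_dir="histex.txt" label="${tool.name} on ${on_string}: histogram"/>
    </outputs>
    <help><![CDATA[
Displays the k-mer histogram stored in a FastK archive.
    ]]></help>
    <expand macro="citations"/>
</tool>
```

With this layout FastK runs once, and each downstream tool (Histex, Tabex, ...) only needs the tar as input.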

<option value="yes">Yes</option>
</param>
<when value="no"/>
<when value="yes">
<conditional name="type_sorted_table">
<param name="sorted_table_options" type="select" label="Sorted table presence" help="Do you want to specify a cut-off?">
<option value="default_sorted_table">default (1)</option>
<option value="cutoff_sorted_table">specify cutoff</option>
</param>
<when value="default_sorted_table"/>
<when value="cutoff_sorted_table">
<param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
</when>
</conditional>
Contributor (bernt-matthias, Oct 31, 2023)

Instead of this, maybe:

Suggested change
- <conditional name="type_sorted_table">
- <param name="sorted_table_options" type="select" label="Sorted table presence" help="Do you want to specify a cut-off?">
- <option value="default_sorted_table">default (1)</option>
- <option value="cutoff_sorted_table">specify cutoff</option>
- </param>
- <when value="default_sorted_table"/>
- <when value="cutoff_sorted_table">
- <param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
- </when>
- </conditional>
+ <param name="sorted_table_cutoff" type="integer" min="2" optional="true" label="Sorted table cutoff value"/>

Plus the necessary change in the command section (i.e., check whether a value is given).

Contributor

Several of the subsequent tools take an input that ran without the sorted table cutoff flag, which is why we made it optional

Contributor

It's still optional. But the conditional is not needed for this.
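
The change discussed in this thread could look roughly like the sketch below: one optional integer inside the "yes" branch, plus a Cheetah check in the command section that emits a bare `-t` when the field is left empty. This is only an illustration of the reviewer's suggestion, not the code that was merged or tested. The `str(...) != ''` test is a common idiom for optional parameters in Galaxy wrappers.

```xml
<!-- inputs: replaces the inner type_sorted_table conditional inside <when value="yes"> -->
<param name="sorted_table_cutoff" type="integer" min="2" optional="true" label="Sorted table cutoff value"/>

<!-- command: only the -t handling is shown; the other arguments stay as in the PR -->
<command detect_errors="exit_code"><![CDATA[
    FastK $INPUTFILE
    -k$kmer_size
    #if $sorted_table.sorted_table_presence == 'yes':
        #if str($sorted_table.sorted_table_cutoff) != '':
            ## a cutoff was entered: pass it along
            -t${sorted_table.sorted_table_cutoff}
        #else
            ## field left empty: let FastK use its default cutoff
            -t
        #end if
    #end if
    -T\${GALAXY_SLOTS:-1} -Noutfiles/output -Poutfiles/tmpfiles
]]></command>
```

The tests would then set `sorted_table_cutoff` directly under the `sorted_table` conditional, without the nested `type_sorted_table` block.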

<section title="Advanced" name="advanced">
<param name="tabex_threshold" label="Tabex count threshold" type="integer" value="5" min="1"/>
</section>
</when>
</conditional>
</inputs>
<outputs>
<data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK hist files"/>
<data name="tabex_hist" format="txt" from_work_dir="tabex.txt" label="${tool.name} on ${on_string}: Tabex output">
<filter>sorted_table['sorted_table_presence'] == "yes"</filter>
Contributor Author

ah i need to change this because i added the operation_type/command_type layer!

</data>
</outputs>
<tests>
<!-- TEST 1 -->
<test expect_num_outputs="1">
<param name="infile" value="input01.fasta.gz"/>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_archive_member path="./outfiles/output.hist" />
Contributor

With <has_archive_member path=".*" n="10"/> you could also add an assertion on the number of files in the archive, which would be neat.

Additionally, you could make assertions on the content by nesting further assertions, like so:

<has_archive_member path="./outfiles/output.hist">
  <not_has_text text="EDK72998.1"/>
</has_archive_member>

Contributor Author

i considered adding # of files, but for the test involving sorted tables, the number of files generated can change because the number of hidden files made depends on how many threads are given to the program -- so it's dependent on $GALAXY_SLOTS... though i guess for the test i can specify -T1 for one thread so it's consistent?

for content assumption, the hist and ktab files are binaries, so i didn't think the has_text/not_has_text assumptions would work, but please correct me if wrong!

Contributor

> though i guess for the test i can specify -T1

I would suggest avoiding this.

> for content assumption, the hist and ktab files are binaries, so i didn't think the has_text/not_has_text assumptions would work, but please correct me if wrong!

Indeed. Then maybe has_size is your friend :)

Contributor Author

added has_size assertions!
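
For reference, a size check on a binary archive member can be nested inside `has_archive_member` roughly as sketched below. The `value` and `delta` numbers are placeholders, not sizes measured from this test data; the real ones would come from an actual tool run.

```xml
<output name="fastk_out" ftype="tar">
    <assert_contents>
        <!-- placeholder numbers: expect about 1000 bytes, give or take 200 -->
        <has_archive_member path="./outfiles/output.hist">
            <has_size value="1000" delta="200"/>
        </has_archive_member>
    </assert_contents>
</output>
```

A generous `delta` keeps the check meaningful for binary files without making it fail on small differences between runs.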

</assert_contents>
</output>
</test>
<!-- TEST 2 -->
<test expect_num_outputs="2">
<param name="infile" value="input01.fasta.gz"/>
<conditional name="sorted_table">
<param name="sorted_table_presence" value="yes"/>
<conditional name="type_sorted_table">
<param name="sorted_table_options" value="default_sorted_table"/>
</conditional>
</conditional>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_archive_member path="./outfiles/output.hist" />
<has_archive_member path="./outfiles/output.ktab" />
</assert_contents>
</output>
<output name="tabex_hist" value="test02.tabex.txt"/>
</test>
<!-- TEST 3 -->
<test expect_num_outputs="2">
<param name="infile" value="input01.fasta.gz"/>
<conditional name="sorted_table">
<param name="sorted_table_presence" value="yes"/>
<conditional name="type_sorted_table">
<param name="sorted_table_options" value="cutoff_sorted_table"/>
<param name="sorted_table_cutoff" value="5"/>
</conditional>
</conditional>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_archive_member path="./outfiles/output.hist" />
<has_archive_member path="./outfiles/output.ktab" />
</assert_contents>
</output>
<output name="tabex_hist" value="test03.tabex.txt"/>
</test>
</tests>
<help><![CDATA[
FastK is a k‑mer counter that is optimized for processing high-quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode.
]]></help>
<expand macro="citations" />
</tool>

24 changes: 24 additions & 0 deletions tools/fastk/macros.xml
@@ -0,0 +1,24 @@
<macros>
<token name="@TOOL_VERSION@">1.0.0</token>
<token name="@GALAXY_TOOL_VERSION@">galaxy</token>
<token name="@VERSION_SUFFIX@">0</token>
<xml name="requirements">
<requirements>
<requirement type="package" version="@TOOL_VERSION@">fastk</requirement>
</requirements>
</xml>
<xml name="citations">
<citations>
<citation type="bibtex">
@misc{github,
author = {Gene Myers},
year = {2020},
title = {FastK},
publisher = {GitHub},
journal = {GitHub repository},
url = {https://github.com/thegenemyers/FASTK},
}
</citation>
</citations>
</xml>
</macros>
Binary file added tools/fastk/test-data/input01.fasta.gz
Binary file not shown.