FastK: adding new tool #5550

abueg · 2023-10-27T16:57:01Z

FOR CONTRIBUTOR:

- I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
- License permits unrestricted use (educational + commercial)
- This PR adds a new tool or tool collection
- This PR updates an existing tool or tool collection
- This PR does something else (explain below)

Hello! 👋🏼

This PR adds the k-mer counting/manipulation tool FastK as a Galaxy tool. There are 3 tests using different sets of parameters 👍🏼

The license permits use as long as the license text is reproduced in the binary, which I believe the conda package respects. Here is the license: https://github.com/thegenemyers/FASTK/blob/master/LICENSE

Thank you @astrovsky01 for all your help with this!! 🙏🏼

bernt-matthias · 2023-10-27T18:04:05Z

.shed.yml is missing.

bernt-matthias

Some first comments.

bernt-matthias · 2023-10-27T18:07:22Z

tools/fastk/fastk.xml

+        <requirement type="package" version="1.0.0">fastk</requirement>
+    </requirements>
+    <command detect_errors="exit_code"><![CDATA[
+        mkdir -p outfiles/tmpfiles && 


I guess you can also run without creating outfiles.

i think the outfiles dir makes tar'ing the files down the line easier, because i can just run tar -c -f fastk.tar ./outfiles/ ?

bernt-matthias · 2023-10-27T18:10:19Z

tools/fastk/fastk.xml

+    <command detect_errors="exit_code"><![CDATA[
+        mkdir -p outfiles/tmpfiles && 
+        cd outfiles &&
+        ln -s '$infile' input.fasta &&


The infile can also be in formats other than FASTA.

true, that was my short-sightedness from just using this with FASTAs, will add the other supported formats as described in the program's readme 👍 thank you!

ok so what i am trying to figure out now is this: the program seems to decide how to run based on the given file extension. i gzipped the test data to make it smaller, and then i had to change the CMD to use fasta.gz so that it would run properly on that. i'm going to mark this PR as draft for now

If you set ftype="fasta" on the test's input parameter, it should be extracted automatically.
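A minimal sketch of what the reviewer seems to be describing: declaring the datatype on the test input so that Galaxy handles the gzipped test file at upload time. The file name and output name are taken from snippets elsewhere in this thread; the surrounding `<test>` layout is an assumption, and whether the file is really decompressed should be confirmed with a local planemo run.

```xml
<tests>
    <test>
        <!-- the test-data file is gzipped; ftype states the datatype Galaxy should store it as -->
        <param name="infile" value="input01.fasta.gz" ftype="fasta"/>
        <!-- only checks that the tar output is produced with the expected datatype -->
        <output name="fastk_out" ftype="tar"/>
    </test>
</tests>
```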

bernt-matthias · 2023-10-27T18:11:45Z

tools/fastk/fastk.xml

+        </conditional>
+    </inputs>
+    <outputs>
+        <data name="fastk_out" format="tar" from_work_dir="outfiles/fastk.tar" label="${tool.name} on ${on_string}: FastK hist files"/>


Maybe a collection would be better than a tar?

the output files won't be needed/used on their own, only ever as a folder with those files inside them, so that was the rationale for tar

I understand. But then we have one optional output tabex_hist and one output that "won't be needed/used ...". So what is supposed to be used as output?

oh, sorry about the confusion -- when k-mer counting, there should always be an output.hist file. optionally, if the user specifies the -t option (sorted table), then the outputs will include output.ktab and a number of hidden .output.ktab.n files, with n corresponding to the number of threads given to the run. these hidden files need to be in the same folder as the non-hidden output.ktab file in order for subsequent commands (e.g., tabex or histex) to work.

so to sum up, there will always be an output.hist file as an output, and optionally other outputs (.output.ktab.n files) that might be produced and can't be used on their own

I still don't see whether there is always an output of the Galaxy tool that can be used (within Galaxy).

there should always be an output.hist file, which is a binary file that can be viewed by running Histex... so i suppose no, there is not always a readily usable output that can be put into other tools; it needs to be processed with another tool in this suite first
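For comparison, the collection idea raised at the start of this thread might look roughly like the sketch below. The `fastk_hist` line is copied from the PR; the collection name, the `outfiles/ktab` directory, and the filter are hypothetical. It is not verified here whether Galaxy's dataset discovery picks up the hidden `.output.ktab.n` files or preserves their names for downstream tools, which is exactly the concern behind the tar approach.

```xml
<outputs>
    <!-- always produced: the binary histogram -->
    <data name="fastk_hist" format="binary" from_work_dir="outfiles/output.hist" label="${tool.name} on ${on_string}: FastK hist"/>
    <!-- hypothetical alternative to the tar: gather the table files into a list collection -->
    <collection name="fastk_ktab" type="list" label="${tool.name} on ${on_string}: FastK ktab files">
        <discover_datasets pattern="__name__" directory="outfiles/ktab" format="binary"/>
        <!-- only create the collection when a sorted table was requested -->
        <filter>sorted_table['sorted_table_presence'] == "yes"</filter>
    </collection>
</outputs>
```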

bgruening · 2023-10-27T19:38:57Z

The linting fails because of a missing citation. You can add a bibtex and just cite the GitHub repo if there is no proper citation yet.
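A citation block along these lines would likely satisfy the linter. This is only a sketch: the BibTeX key and the field layout are assumptions, and the URL is the repository linked in the PR description.

```xml
<citations>
    <citation type="bibtex">
@misc{githubFastK,
    author = {Myers, Gene},
    title = {FastK},
    publisher = {GitHub},
    journal = {GitHub repository},
    url = {https://github.com/thegenemyers/FASTK}
}
    </citation>
</citations>
```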

The input fasta file is unfortunately too big, can you reduce the size?

bernt-matthias · 2023-10-27T19:43:11Z

The input fasta file is unfortunately too big, can you reduce the size?

If it's available remotely, we can use location="URL" (CI will work as soon as we have a new planemo version with galaxyproject/planemo#1388).
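A sketch of a test that pulls its input from a remote location instead of `test-data`. The URL below is a placeholder, not a real hosting location, and the `ftype` value is an assumption; as noted above, this depends on a planemo version that supports remote test data.

```xml
<test>
    <!-- hypothetical URL: replace with wherever the large test file is actually hosted -->
    <param name="infile" location="https://example.org/path/to/input01.fasta.gz" ftype="fasta.gz"/>
    <output name="fastk_out" ftype="tar"/>
</test>
```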

abueg · 2023-10-30T01:45:51Z

added an if/elif part to the start of the tool for input file extension detection, as the tool relies on the input file's extension to determine how to run. this made some of the later parts messy so i moved the tool out of outfiles as a wd and am working outside it instead. this current version is passing the tests locally and i will work on adding the other file types into the CMD now 👍🏼 (ty @bgruening for the infile.ext tip!)

PS can i use a case statement in the CMD instead of if/elif/elif/elif?

edit: hmm, looks like the test tarballs are a slightly different size when run on github. i don't get that error when running locally, but maybe moving the tar command outside of the directory changed something i didn't see... the tarball size can also depend on the # of threads used, but i think the github tests use 1 thread, based off $GALAXY_SLOTS being 2?
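The extension detection described in this comment could be sketched as below. This is not the PR's actual command block: the `$linked` variable, the list of datatypes, and the bare `FastK` call are assumptions for illustration, and all other FastK options are left out.

```xml
<command detect_errors="exit_code"><![CDATA[
    ## Choose a link name whose extension matches the Galaxy datatype,
    ## because FastK decides how to read the input from its file extension.
    #if $infile.is_of_type("fasta.gz")
        #set $linked = "input.fasta.gz"
    #elif $infile.is_of_type("fastq.gz")
        #set $linked = "input.fastq.gz"
    #elif $infile.is_of_type("fastq")
        #set $linked = "input.fastq"
    #else
        #set $linked = "input.fasta"
    #end if
    ln -s '$infile' '$linked' &&
    ## -k sets the k-mer size; further options are omitted in this sketch
    FastK -k$kmer_size '$linked'
]]></command>
```

If the Galaxy extension names happen to match what FastK accepts, building the name directly with `#set $linked = "input." + str($infile.ext)` might replace the whole chain, though that would need checking against the formats listed in the FastK README.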

abueg · 2023-10-31T20:16:33Z

tools/fastk/fastk.xml

+    <outputs>
+        <data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK hist files"/>
+        <data name="tabex_hist" format="txt" from_work_dir="tabex.txt" label="${tool.name} on ${on_string}: Tabex output">
+            <filter>sorted_table['sorted_table_presence'] == "yes"</filter>


ah i need to change this because i added the operation_type/command_type layer!

bernt-matthias · 2023-11-02T16:04:02Z

my apologies for that!

Nothing to apologize for! Thanks for your work.

abueg · 2023-11-27T18:04:20Z

hi @bernt-matthias & @bgruening !! i've put together a flowchart of inputs/outputs of the tools in this suite and how they work within (and occasionally without) -- sorry about the delay!

i hope this is helpful for figuring out how to proceed with wrapping the tool(s)? the gist is that FastK and Logex will produce binary files that are only readable within this suite (and the MerquryFK tool), but Tabex and Histex have human-readable output. Histex's output additionally can be read into GenomeScope 2.0, which is already wrapped (so we might not need to wrap GeneScopeFK, as it operates on the same output as GenomeScope 2.0 afaik)...

(there are additional commands available from the FastK suite, but I do not have the need for them at the moment)

bernt-matthias · 2023-11-27T19:36:52Z

So, it appears to me that it would be good to have the intermediate binary files as output. Then we can reuse them as input for more than one program. Also gives options in terms of scalability.

For this, we would need new datatypes. For ktab, one may subclass from the new directory datatype (?); alternatively, a composite datatype might work (but is difficult to use in tests). In the simplest case we can just add them to the datatypes.xml.sample file. But, this may be realized earliest with the next Galaxy release. If we can't wait, we could just use binary/directory for now.
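The "simplest case" mentioned above might amount to entries like the following in Galaxy's datatypes sample file. The extension names and descriptions here are placeholders chosen for illustration, not the names that were eventually registered.

```xml
<!-- hypothetical entries; extension names are placeholders -->
<datatype extension="fastk_hist" type="galaxy.datatypes.binary:Binary" subclass="true" display_in_upload="true" description="FastK histogram (binary)"/>
<datatype extension="fastk_ktab" type="galaxy.datatypes.binary:Binary" subclass="true" display_in_upload="true" description="FastK k-mer table (binary)"/>
```

Subclassing the generic binary datatype gives the files a distinct extension that tools can require as input, without adding any sniffing or validation logic.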

mvdbeek · 2023-11-28T12:22:27Z

But, this may be realized earliest with the next Galaxy release.

Why ? We've always supported adding datatypes to the current stable release.

bernt-matthias · 2023-11-28T12:35:24Z

Why ? We've always supported adding datatypes to the current stable release.

Cool. I repeatedly forget :)

abueg · 2024-01-30T11:24:33Z

im still alive i promise

bgruening · 2024-02-25T17:26:59Z

@abueg is this ready for review?

abueg · 2024-02-26T19:47:07Z

@bgruening I think it is ok to review now, last commits were fixing some issues with testing! I might add more things in a couple weeks, but I think this should work right now for the k-mer counting & hist generation!!

bernt-matthias · 2023-11-02T16:02:10Z

tools/fastk/fastk.xml

+        <param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
+        <conditional name="sorted_table">
+            <param name="sorted_table_presence" type="select" label="Sorted table selection" help="Do you want a sorted table of all canonical k-mers and their counts?">
+                <option value="no">No</option>


+1 for the sub-tools idea, i.e. one for Fastk, one for Histex, ...

bernt-matthias · 2023-11-02T16:03:07Z

tools/fastk/fastk.xml

+                <conditional name="type_sorted_table">
+                    <param name="sorted_table_options" type="select" label="Sorted table presence" help="Do you want to specify a cut-off?">
+                        <option value="default_sorted_table">default (1)</option>
+                        <option value="cutoff_sorted_table">specify cutoff</option>
+                    </param>
+                    <when value="default_sorted_table"/>
+                    <when value="cutoff_sorted_table">
+                        <param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
+                    </when>
+                </conditional>


It's still optional. But the conditional is not needed for this.
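One way to read this suggestion: since the default cutoff is 1, a single integer parameter with that default can stand in for the inner conditional. The sketch below reuses the parameter names from the diff; the `min`, `value`, and help text are assumptions.

```xml
<conditional name="sorted_table">
    <param name="sorted_table_presence" type="select" label="Sorted table selection" help="Do you want a sorted table of all canonical k-mers and their counts?">
        <option value="no">No</option>
        <option value="yes">Yes</option>
    </param>
    <when value="no"/>
    <when value="yes">
        <!-- one integer with the default of 1 replaces the default/cutoff conditional -->
        <param name="sorted_table_cutoff" type="integer" min="1" value="1" label="Sorted table cutoff value" help="Default cutoff is 1."/>
    </when>
</conditional>
```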

bernt-matthias · 2024-03-05T17:09:18Z

tools/fastk/fastk.xml

+            </param>
+            <when value="FastK">
+                <param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
+                <conditional name="sorted_table">


Do you think it would be a good idea to flatten the nested conditionals and just have one with 3 options?

- No
- Yes with default threshold
- Yes with manual threshold

I like that idea. @abueg does this make sense?
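A flattened version with three options could look roughly like this. Parameter names and the cutoff bounds are carried over from the diff above; the option values `default` and `manual` are hypothetical.

```xml
<conditional name="sorted_table">
    <param name="sorted_table_presence" type="select" label="Sorted table" help="Do you want a sorted table of all canonical k-mers and their counts?">
        <option value="no" selected="true">No</option>
        <option value="default">Yes, with the default threshold</option>
        <option value="manual">Yes, with a manual threshold</option>
    </param>
    <when value="no"/>
    <when value="default"/>
    <when value="manual">
        <param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
    </when>
</conditional>
```

With this layout the command block only needs to test one select value, rather than walking two levels of conditionals.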

bernt-matthias · 2024-03-05T17:10:11Z

tools/fastk/fastk.xml

+        <data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK files"/>
+        <data name="fastk_hist" format="binary" from_work_dir="outfiles/output.hist" label="${tool.name} on ${on_string}: FastK hist" />


Those two outputs will also need a filter, right?

ah you mean a filter for if FastK was the selected tool?
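If that reading is right, the filters might be sketched as below. The `operation_type` conditional and its `command_type` select are assumed from the "operation_type/command_type layer" mentioned earlier in the thread; the exact names in the wrapper may differ.

```xml
<outputs>
    <data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK files">
        <!-- only create this output when the FastK operation is selected -->
        <filter>operation_type['command_type'] == "FastK"</filter>
    </data>
    <data name="fastk_hist" format="binary" from_work_dir="outfiles/output.hist" label="${tool.name} on ${on_string}: FastK hist">
        <filter>operation_type['command_type'] == "FastK"</filter>
    </data>
</outputs>
```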

bernt-matthias · 2024-03-05T17:11:47Z

tools/fastk/fastk.xml

@@ -100,7 +100,7 @@
            <param name="infile" value="input01.fasta.gz"/>
            <output name="fastk_out" ftype="tar">
                <assert_contents>
-                    <has_size value="266240" delta="1000" />
+                    <has_archive_member path="./outfiles/output.hist" />


though i guess for the test i can specify -T1

I would suggest avoiding this.

for content assertions, the hist and ktab files are binaries, so i didn't think the has_text/not_has_text assertions would work, but please correct me if i'm wrong!

Indeed. Then maybe has_size is your friend :)
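Both checks from the diff can be combined in one test output, so the test confirms that the archive contains the histogram and that its size is in the expected range. The values are the ones shown in the diff above; the tolerance may need adjusting, since the size reportedly varies with the number of threads.

```xml
<output name="fastk_out" ftype="tar">
    <assert_contents>
        <!-- the binary histogram must be present inside the archive -->
        <has_archive_member path="./outfiles/output.hist"/>
        <!-- size check with a tolerance, since the content is binary -->
        <has_size value="266240" delta="1000"/>
    </assert_contents>
</output>
```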

bernt-matthias · 2024-03-05T17:12:22Z

tools/fastk/fastk.xml

+                    </when>
+                </conditional>
+            </when>
+            <when value="Histex">


Do you plan to implement this as a single tool?

sorry, do you mean implement Histex as a single tool, or implement all these others as a single tool with FastK?

I was just wondering. Both would be fine.

We can maybe help with splitting the tools later. Let's go with what is here for now.

bgruening

@abueg if we comment out the other functions it is ready to go.

bgruening · 2024-03-20T15:28:06Z

tools/fastk/fastk.xml

+                <option value="Histex">Histex: display a FastK histogram</option>
+                <option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
+                <option value="Profex">Profex: display a FastK profile</option>
+                <option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
+                <option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option>


Suggested change

<option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option>

bgruening · 2024-03-20T15:29:42Z

tools/fastk/fastk.xml

+            <when value="Histex">
+            </when>
+            <when value="Tabex">
+            </when>
+            <when value="Profex">
+            </when>
+            <when value="Logex">
+            </when>
+            <when value="Symmex">
+            </when>


Suggested change

<when value="Histex">
</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when>

bernt-matthias

I would suggest making this a suite and using one tool per subcommand. The change should be about as simple as the suggestions made by @bgruening.

This makes scheduling much easier (I guess the steps need different amounts of resources).
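In a suite layout, the pieces shared by every tool usually live in a `macros.xml` file. A minimal sketch, assuming the package version from the diff above and a hypothetical version suffix:

```xml
<macros>
    <!-- version of the fastk conda package, shared by all tools in the suite -->
    <token name="@TOOL_VERSION@">1.0.0</token>
    <!-- hypothetical wrapper revision suffix -->
    <token name="@VERSION_SUFFIX@">0</token>
    <xml name="requirements">
        <requirements>
            <requirement type="package" version="@TOOL_VERSION@">fastk</requirement>
        </requirements>
    </xml>
</macros>
```

Each per-subcommand tool file (one for FastK, one for Histex, and so on) would then pull this in with `<import>macros.xml</import>` inside its own `<macros>` element and call `<expand macro="requirements"/>`, which keeps the version pinned in one place while letting each tool be scheduled with its own resources.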

bernt-matthias · 2024-04-23T21:26:02Z

Guess this will be superseded by #5965

abueg · 2024-04-23T22:13:47Z

ACK sorry yes @SaimMomin12 has very kindly taken over the wrappers for this suite!!

abueg · 2024-05-03T15:41:15Z

closed because, as has been mentioned, this is superseded by PR #5965

abueg and others added 14 commits November 9, 2022 17:39

starting fastk tool wrapper

81b4e81

gfsaga

acbf696

fdsaf

c0ca84c

this can be served

73911c5

test?

7b9270f

indentation

40db534

took stuff out and this works very basic

1530579

testdata

59ef44c

testdata

a553432

tar testfiles

371c28f

tar testfiles3

1f2cfc5

tests

74d75a4

macros

0cceb90

Merge branch 'galaxyproject:main' into fastk

3628496

bernt-matthias reviewed Oct 27, 2023

abueg and others added 2 commits October 27, 2023 14:35

Create .shed.yml

905b1e2

Update .shed.yml

b682572

bgruening reviewed Oct 27, 2023

remove ls -lah

ad04760

Co-authored-by: Björn Grüning <[email protected]>

gzip test input

e0663ab

abueg marked this pull request as draft October 28, 2023 02:50

abueg added 5 commits October 27, 2023 22:59

added citation

08d8320

gzipped test file

82dfa9c

testing if bits for input file ext detection

97887e8

change two ifs to an elif

ba42b65

move wd outside of outfiles

e5b0606

reformatting to have other tools. test 2 & 3 good. discuss do this or suite instead??

d6f1382

abueg commented Oct 31, 2023

abueg mentioned this pull request Nov 1, 2023

Add fastk #5479

Open

astrovsky01 mentioned this pull request Jan 10, 2024

[23.1] Add binary datatypes for intermediate output of fastk tools galaxyproject/galaxy#17265

Merged

abueg and others added 3 commits January 16, 2024 13:32

fixed version_suffix typo in macros.xml

6a717f2

filter

ea44af0

test 1 fix

a2eb1a5

abueg added 2 commits January 30, 2024 06:31

help text

5dc3e37

escaping unescaped characters

5ede1b7

abueg marked this pull request as ready for review January 30, 2024 11:43

filetype detection and unsorted bam edits

8ea1904

bernt-matthias reviewed Mar 5, 2024

abueg added 3 commits March 6, 2024 18:11

additional has_size assertions

6cfbf27

typo

83f1327

updated with size vals from failed tool test output. i got the previous file sizes from running the tests in planemo serve and downloading the files

94d876c

bgruening reviewed Mar 20, 2024

bernt-matthias reviewed Mar 22, 2024

SaimMomin12 mentioned this pull request Apr 23, 2024

Updated FASTK tool wrapper #5965

Merged

abueg closed this May 3, 2024