Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FastK: adding new tool #5550

Closed
wants to merge 39 commits into from
Closed

FastK: adding new tool #5550

wants to merge 39 commits into from

Conversation

abueg
Copy link
Contributor

@abueg abueg commented Oct 27, 2023

FOR CONTRIBUTOR:

  • - I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • - License permits unrestricted use (educational + commercial)
  • - This PR adds a new tool or tool collection
  • - This PR updates an existing tool or tool collection
  • - This PR does something else (explain below)

Hello! 👋🏼

This PR adds the k-mer counting/manipulation tool FastK as a galaxy tool. There 3 tests using different sets of parameters 👍🏼

The license permits use as long as the license text is reproduced in the binary, which I believe the conda package respects, here is the license: https://github.com/thegenemyers/FASTK/blob/master/LICENSE

Thank you @astrovsky01 for all your help with this!! 🙏🏼

@bernt-matthias
Copy link
Contributor

.shed.yml is missing.

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some first comments.

tools/fastk/fastk.xml Outdated Show resolved Hide resolved
<requirement type="package" version="1.0.0">fastk</requirement>
</requirements>
<command detect_errors="exit_code"><![CDATA[
mkdir -p outfiles/tmpfiles &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you can also run without creating outfiles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the outfiles dir makes tar'ing the files down the line easier, because i just tar -c -f fastk.tar ./outfiles/ ?

tools/fastk/fastk.xml Outdated Show resolved Hide resolved
<command detect_errors="exit_code"><![CDATA[
mkdir -p outfiles/tmpfiles &&
cd outfiles &&
ln -s '$infile' input.fasta &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Infile can be also other formats than fasta.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, that was my short-sightedness from just using this with FASTAs, will add the other supported formats as described in the program's readme 👍 thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so what i am trying to figure out now is, the program seems to decide how to run based on the given file extension, so i gzipped the test data to try to make it smaller, and then i had to change the CMD to fasta.gz so that it would properly run on that. i'm going to mark this PR as draft for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you set ftype="fasta" in the test's ` it should be extracted automatically.

</conditional>
</inputs>
<outputs>
<data name="fastk_out" format="tar" from_work_dir="outfiles/fastk.tar" label="${tool.name} on ${on_string}: FastK hist files"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a collection would be better than a tar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the output files won't be needed/used on their own, only ever as a folder with those files inside them, so that was the rationale for tar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. But then we have one optional output tabex_hist and one output that "won't be needed/used ...". So what is supposed to be used as output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, sorry about the confusion -- when k-mer counting, there should always be an output of the output.hist file. optionally, if the user specifies the -k option, then the outputs will include output.ktab and a number of hidden .output.ktab.n files, with n corresponding to the number of threads given to the run. these hidden files are needed to be in the same folder as the non-hidden output.ktab file in order for subsequent commands (e.g., tabex or histex) to work.

so to sum, there will always be at output.hist as an output, and optionally other outputs (.output.ktab.n files) that might be produced, and can't be used on their own

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still I don't see if there is always an output of the Galaxy tool that can be used (within Galaxy).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should always be a output.hist file, that is a binary file that can be viewed via running Histex... so i suppose no, there is not always a readily useable output that can be put into other tools, it needs to be processed with another tool in this suite, first

tools/fastk/fastk.xml Outdated Show resolved Hide resolved
Co-authored-by: Björn Grüning <[email protected]>
@bgruening
Copy link
Member

The linting fails because of a missing citation. You can add a bibtex and just cite the GitHub repo if there is no proper citation yet.

The input fasta file is unfortunately too big, can you reduce the size?

@bernt-matthias
Copy link
Contributor

The input fasta file is unfortunately too big, can you reduce the size?

If its available remotely we can use location="URL" .. (CI will work as soon as we have a new planemo version with galaxyproject/planemo#1388)

@abueg abueg marked this pull request as draft October 28, 2023 02:50
@abueg
Copy link
Contributor Author

abueg commented Oct 30, 2023

added an if/elif part to the start of the tool for input file extension detection, as the tool relies on the input file's extension to determine how to run. this made some of the later parts messy so i moved the tool out of outfiles as a wd and am working outside it instead. this current version is passing the tests locally and i will work on adding the other file types into the CMD now 👍🏼 (ty @bgruening for the infile.ext tip!)

PS can i use a case statement in the CMD insead of if/elif/elif/elif?

edit: hmm looks like the test tarballs when run on github are a bit different size, i don't get that error when i'm running locally but maybe moving to doing the tar command outside of the directory changed something i didn't see... the tarball size can also depend on the # of threads used but i think the github tests use 1 thread, based off $GALAXY_SLOTS being 2?

<outputs>
<data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK hist files"/>
<data name="tabex_hist" format="txt" from_work_dir="tabex.txt" label="${tool.name} on ${on_string}: Tabex output">
<filter>sorted_table['sorted_table_presence'] == "yes"</filter>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i need to change this because i added the operation_type/command_type layer!

@abueg abueg mentioned this pull request Nov 1, 2023
@bernt-matthias
Copy link
Contributor

my apologies for that!

Nothing to apologize for! Thanks for your work.

@abueg
Copy link
Contributor Author

abueg commented Nov 27, 2023

hi @bernt-matthias & @bgruening !! i've put together a flowchart of inputs/outputs of the tools in this suite and how they work within (and occasionally without) -- sorry about the delay!

i hope this is helpful for figuring out how to proceed with wrapping the tool(s)? the gist is that FastK and Logex will produce binary files that are only readable within this suite (and the MerquryFK tool), but Tabex and Histex have human-readable output. Histex's output additionally can be read into GenomeScope 2.0, which is already wrapped (so we might not need to wrap GeneScopeFK, as it operates on the same output as GenomeScope 2.0 afaik)...

image

(there are additional commands available from the FastK suite, but I do not have the need for them at the moment)

@bernt-matthias
Copy link
Contributor

So, it appears to me that it would be good to have the intermediate binary files as output. Then we can reuse them as input for more than one program. Also gives options in terms of scalability.

For this, we would need new datatypes. For ktab one may subclass from the new directory datatype (?) alternatively a composite datatype might work (but difficult to use in tests). In the simplest case we can just add them to the datatypes.xml.sample file. But, this may be realized earliest with the next Galaxy release. If we can't wait we could just use binary/directory .. for now.

@mvdbeek
Copy link
Member

mvdbeek commented Nov 28, 2023

But, this may be realized earliest with the next Galaxy release.

Why ? We've always supported adding datatypes to the current stable release.

@bernt-matthias
Copy link
Contributor

Why ? We've always supported adding datatypes to the current stable release.

Cool. I repeatedly forget :)

@abueg
Copy link
Contributor Author

abueg commented Jan 30, 2024

im still alive i promise

@abueg abueg marked this pull request as ready for review January 30, 2024 11:43
@bgruening
Copy link
Member

@abueg is this ready for review?

@abueg
Copy link
Contributor Author

abueg commented Feb 26, 2024

@bgruening I think it is ok to review now, last commits were fixing some issues with testing! I might add more things in a couple weeks, but I think this should work right now for the k-mer counting & hist generation!!

tools/fastk/test-data/test02.tabex.txt Outdated Show resolved Hide resolved
<param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
<conditional name="sorted_table">
<param name="sorted_table_presence" type="select" label="Sorted table selection" help="Do you want a sorted table of all canonical k-mers and their counts?">
<option value="no">No</option>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the sub-tools idea, i.e. one for Fastk, one for Histex, ...

Comment on lines 75 to 84
<conditional name="type_sorted_table">
<param name="sorted_table_options" type="select" label="Sorted table presence" help="Do you want to specify a cut-off?">
<option value="default_sorted_table">default (1)</option>
<option value="cutoff_sorted_table">specify cutoff</option>
</param>
<when value="default_sorted_table"/>
<when value="cutoff_sorted_table">
<param name="sorted_table_cutoff" type="integer" min="2" value="10" label="Sorted table cutoff value"/>
</when>
</conditional>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still optional. But the conditional is not needed for this.

</param>
<when value="FastK">
<param name="kmer_size" type="integer" min="5" max="50" value="40" label="K-mer size" help="Default k-mer size is 40." />
<conditional name="sorted_table">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be a good idea to flatten the nested conditionals and just have one with 3 options:

  • No
  • Yes with default threshold
  • Yes with manual threshold

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. @abueg does this make sense?

Comment on lines +109 to +110
<data name="fastk_out" format="tar" from_work_dir="fastk.tar" label="${tool.name} on ${on_string}: FastK files"/>
<data name="fastk_hist" format="binary" from_work_dir="outfiles/output.hist" label="${tool.name} on ${on_string}: FastK hist" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those two outputs will also need a filter, or?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah you mean a filter for if FastK was the selected tool?

@@ -100,7 +100,7 @@
<param name="infile" value="input01.fasta.gz"/>
<output name="fastk_out" ftype="tar">
<assert_contents>
<has_size value="266240" delta="1000" />
<has_archive_member path="./outfiles/output.hist" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though i guess for the test i can specify -T1

I would suggest to avoid this

for content assumption, the hist and ktab files are binaries, so i didn't think the has_text/not_has_text assumptions would work, but please correct me if wrong!

Indeed. Then maybe has_size is your friend :)

</when>
</conditional>
</when>
<when value="Histex">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to implement this as a single tool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, do you mean implement Histex as a single tool, or implement all these others as a single tool with FastK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just wondering. Both would be fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can maybe help with splitting the tools later. Lets go with what is here for now.

abueg added 3 commits March 6, 2024 18:11
…us file sizes from running the tests in planemo serve and downloading the files
Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abueg if we comment out the other functions it is ready to go.

Comment on lines +65 to +69
<option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option>
<!--option value="Histex">Histex: display a FastK histogram</option>
<option value="Tabex">Tabex: list, check, or find a k-mer in a FastK table</option>
<option value="Profex">Profex: display a FastK profile</option>
<option value="Logex">Logex: combine k-mer,count tables with logical expressions and filter with count cutoffs</option>
<option value="Symmex">Symmex: produce a symmetric k-mer table from a canonical one</option-->

Comment on lines +96 to +105
<when value="Histex">
</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<when value="Histex">
</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when>
<!--when value="Histex">
</when>
<when value="Tabex">
</when>
<when value="Profex">
</when>
<when value="Logex">
</when>
<when value="Symmex">
</when-->

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to make this a suite and use one tool per subcommand. The change should be equally simple compared to the suggestions made by @bgruening

This make scheduling much easier (guess the steps need different amounts of resources).

@SaimMomin12 SaimMomin12 mentioned this pull request Apr 23, 2024
5 tasks
@bernt-matthias
Copy link
Contributor

Guess this will be superseeded by #5965

@abueg
Copy link
Contributor Author

abueg commented Apr 23, 2024

ACK sorry yes @SaimMomin12 has very kindly taken over the wrappers for this suite!!

@abueg abueg closed this May 3, 2024
@abueg
Copy link
Contributor Author

abueg commented May 3, 2024

closed as, has been mentioned, is superseded by this PR: #5965

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants