Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster FASTA and FASTQ metadata setting #17462

Merged
merged 7 commits into from
Feb 19, 2024

Conversation

bernt-matthias
Copy link
Contributor

So far the Fasta datatype just used the set_meta method of Sequence which is quite general and allows for things like comments and unnecessarily strips spaces and checks for empty lines. By a more strict interpretation of the FASTA "standard" we can be nearly 50% faster (see).

Implemented the same for FASTQ. The sniffer here is really strict (e.g. expects strictly 4 line blocks) so I think we can also simplify the code here...

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

because @ can be in the qualities (like # .. which was handled wrong
previously)
@bernt-matthias

This comment was marked as resolved.

@nsoranzo
Copy link
Member

According to https://en.wikipedia.org/wiki/FASTA_format , empty lines are allowed and I don't think they should be counted as data_lines.

@bernt-matthias
Copy link
Contributor Author

bernt-matthias commented Feb 13, 2024

Hmm. Right.

But it might be a less intrusive change if we would keep counting empty lines as data lines, or:

data_lines += 1
, or am I wrong.

Also one might argue that also an empty sequence is a sequence. Edit: but our sniffer would not accept it :)

According to https://blast.ncbi.nlm.nih.gov/doc/blast-topics/ empty lines are not allowed:

Blank lines are not allowed in the middle of FASTA input.

@nsoranzo
Copy link
Member

nsoranzo commented Feb 14, 2024

In the middle, but they are allowed between the end of a sequence and the header of the next one, as in the example on the Wikipedia page.

@bernt-matthias
Copy link
Contributor Author

I'm completely agnostic wrt this, i.e. I don't care (and can't imagine a case where this matters .. but probably there is). I guess if we add such a check (strip + the check) we loose 50% of the improvement.

@bernt-matthias
Copy link
Contributor Author

How about 24950a7? Surprisingly line[0] == ">" is faster than line.startswith(">") and then it comes handy to check for empty lines first (overall the runtime stays approx the same compared to the previous commit).

@mvdbeek
Copy link
Member

mvdbeek commented Feb 14, 2024

Not a fan of these microoptimizations, that seems like something you'd have to evaluate on every python version, given that that seems so optimizable. If this isn't dominating the runtime I wouldn't do it ?

@nsoranzo
Copy link
Member

Not a fan of these microoptimizations, that seems like something you'd have to evaluate on every python version, given that that seems so optimizable. If this isn't dominating the runtime I wouldn't do it ?

It does actually make a difference on a 3 Gb FASTA file as it avoids a function call (even on Python 3.12).

@bernt-matthias
Copy link
Contributor Author

Thanks for checking.

@martenson
Copy link
Member

nice, more of these! Thanks @bernt-matthias

@martenson martenson merged commit 0d380c4 into galaxyproject:dev Feb 19, 2024
104 of 106 checks passed
@bernt-matthias bernt-matthias deleted the topic/fasta-set-meta branch February 19, 2024 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants