Faster FASTA and FASTQ metadata setting #17462
and do not use the slower one from Sequence xref galaxyproject#17451 (comment)
because @ can be in the qualities (like # .. which was handled wrong previously)
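The point about `@` in the qualities can be illustrated with a small sketch. The function names and the sample data are hypothetical and are not taken from the Galaxy code base. In Phred+33 encoding `@` is a valid quality character, so a quality line may start with it, and counting `@`-prefixed lines can over-count records. Assuming the strict 4-line layout, dividing the line count by four avoids the problem.

```python
import io

# Two records; the first quality line happens to start with '@'.
FASTQ = (
    "@read1\n"
    "ACGT\n"
    "+\n"
    "@III\n"
    "@read2\n"
    "TTGA\n"
    "+\n"
    "#III\n"
)


def count_by_at_prefix(handle):
    """Naive count: every line starting with '@' is taken as a header."""
    return sum(1 for line in handle if line.startswith("@"))


def count_by_blocks(handle):
    """Count records as complete 4-line blocks (assumes strict FASTQ)."""
    total_lines = sum(1 for _ in handle)
    return total_lines // 4


print(count_by_at_prefix(io.StringIO(FASTQ)))  # 3 (one too many)
print(count_by_blocks(io.StringIO(FASTQ)))     # 2
```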
f5a952e to 6b52d6d
According to https://en.wikipedia.org/wiki/FASTA_format , empty lines are allowed and I don't think they should be counted as
Hmm. Right. But it might be a less intrusive change if we would keep counting empty lines as data lines, or: galaxy/lib/galaxy/datatypes/sequence.py Line 125 in 6b52d6d
One might also argue that an empty sequence is a sequence too. Edit: but our sniffer would not accept it :) According to https://blast.ncbi.nlm.nih.gov/doc/blast-topics/ empty lines are not allowed:
In the middle, but they are allowed between the end of a sequence and the header of the next one, as in the example on the Wikipedia page.
I'm completely agnostic wrt this, i.e. I don't care (and can't imagine a case where this matters ... but there probably is one). I guess if we add such a check (strip + the check) we lose 50% of the improvement.
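To make the trade-off in this thread concrete, here is a minimal sketch of the two counting policies. The function `count_lines` and its `skip_blank` flag are hypothetical, and counting header lines as data lines is an assumption made for the example. With a blank line between two records, the policies agree on the number of sequences and differ by one in the data line count; the blank-aware variant pays for an extra `strip()` call on every line.

```python
import io

# Two records separated by a blank line, as in the Wikipedia example layout.
FASTA = ">seq1\nACGT\n\n>seq2\nTTGA\n"


def count_lines(handle, skip_blank):
    """Return (sequences, data_lines); optionally ignore blank lines."""
    sequences = 0
    data_lines = 0
    for line in handle:
        if skip_blank and not line.strip():
            continue  # the extra strip() + check discussed in this thread
        if line.startswith(">"):
            sequences += 1
        data_lines += 1
    return sequences, data_lines


print(count_lines(io.StringIO(FASTA), skip_blank=False))  # (2, 5)
print(count_lines(io.StringIO(FASTA), skip_blank=True))   # (2, 4)
```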
6087a16 to 24950a7
How about 24950a7? Surprisingly
Not a fan of these micro-optimizations; that seems like something you'd have to evaluate on every Python version, given that it seems so optimizable. If this isn't dominating the runtime, I wouldn't do it?
It does actually make a difference on a 3 Gb FASTA file as it avoids a function call (even on Python 3.12).
Thanks for checking.
Nice, more of these! Thanks @bernt-matthias
So far the Fasta datatype just used the `set_meta` method of `Sequence`, which is quite general: it allows for things like comments, and it unnecessarily strips spaces and checks for empty lines. By a more strict interpretation of the FASTA "standard" we can be nearly 50% faster (see).

Implemented the same for FASTQ. The sniffer here is really strict (e.g. expects strictly 4 line blocks) so I think we can also simplify the code here...
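The difference between the general and the strict approach could look roughly like the sketch below. Both functions are hypothetical illustrations, not the actual `set_meta` implementations in Galaxy; which lines count as data lines is also an assumption made for the example. The general variant strips every line and skips comments and empty lines, while the strict variant only checks the first character of each line.

```python
import io


def general_counts(handle):
    """Tolerant counting: strip each line, skip blanks and '#' comments."""
    sequences = 0
    data_lines = 0
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            sequences += 1
        data_lines += 1
    return sequences, data_lines


def strict_counts(handle):
    """Strict counting: no stripping, no comment or blank-line handling."""
    sequences = 0
    data_lines = 0
    for line in handle:
        if line.startswith(">"):
            sequences += 1
        data_lines += 1
    return sequences, data_lines


WELL_FORMED = ">a\nACGT\nGGCC\n>b\nTT\n"
print(general_counts(io.StringIO(WELL_FORMED)))  # (2, 5)
print(strict_counts(io.StringIO(WELL_FORMED)))   # (2, 5)
```

On well-formed input both variants agree; they diverge only when the file contains comments or blank lines, which the strict reading does not expect.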