-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Breaking changes for v1 #35
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, nice work on this @jakobnissen. This is a big overhaul.
I skimmed all of the changes and think it looks nice. I can try to give it a detailed review late next week if you'd like a second set of eyes (to be honest I'm completely lost with the bit-code here and in biosequences but the rest of it makes sense).
I saw your note that people can conceptually think of kmers as SVectors, but they're still implemented as NTuples, correct? I had been using SVectors of BioSymbol nucleotides to close the gaps of what I wasn't able to do with Kmers.jl, mostly with Amino Acids but with DNA and RNA as well. I liked the design interface of Kmers as SVectors, but I assume there's some efficiency reason why that isn't the direction that Kmers.jl went with implementation.
Are there any Fw + Rv Kmer iterators? I know that the current iterator returns fw + rv. It's easy to just call reverse_complement so I won't miss it if you removed it, but I'm just trying to think through where I'll need to update my projects to handle these updates.
Thanks for all of your efforts on this and related projects!
Thank you for the kind words, @cjprybol! I'm planning to post on Discourse one of the coming days about this upcoming release and request feedback. There is still a little more work to do - for example, this PR needs changes to BioSequences to land first. Nonetheless - yes! I would like a review. To test it, dev the kmers_compat branch of BioSequences (there is an open PR), then dev Kmers. We don't back Kmers with SVector, because a) we want to avoid StaticArrays as a large dependency and b) we need to do most of the low-level bit twiddling ourselves anyway, since symbols are stored in a packed format in Kmers (e.g. only 2 bits for a nucleotide instead of whole byte). A fw/rv iterator might be a good idea! I've currently implemented a canonical kmer iterator that only outputs the lexographically smaller of the two. But I think a fw/rv iterator that yields both may be more generically usable. |
I smell MinHash! In the discourse post, you note that a bunch of other more complicated convenience features were removed or not implemented. I infer that in most cases, the intent is for those things to be implemented in other packages, but are there any features that you think will belong here in a point release after the API is stable? |
@kescobo In the discourse post, I only mention skipmers, minimizers and k-min-mers. In the message of this PR, I mention other misc features such as random kmers. I believe skipmers were an invention of Sebarina Ward's research group, which is why they were originally implemented in BioSequences.jl (and Kmers.jl). As far as I know, they have not had much usage in the wider bioinfo world. One notable exception is the Minimizers, and their even newer relative k-min-mers, are more useful, but the precise algorithm for extracting them differ from application to application, so I'm guessing users would want to create their own minimizer iterators. I'd rather provide functionality to more easily make custom iterators than to provide my own which is probably not optimal for most users. Regarding the other functionality - shuffled and random kmers, I don't think they are that useful. I'll add them if someone needs them and makes a case for them. For the constructors that infer the length and alphabet, I don't think it's a good interface. Much better to be explicit.
Yes! You can combine MinHash.jl with this iterator to have a pretty efficient minhash: julia> using MinHash, BioSequences, Kmers, BenchmarkTools
julia> seq = randdnaseq(10_000_000);
julia> @btime sketch(CanonicalDNAMers{31}(seq), 100)
35.612 ms (8 allocations: 6.88 KiB) That's 3.5 nanoseconds per kmer. That could plausibly become even faster, as I'm planning to optimise hashing of kmers as much as humanly possible. |
Great! So I take this to mean that you consider Kmers.jl feature complete, barring someone making a strong affirmative case for something. Or put another way, there aren't any features you have explicitly planned to implement after the 1.0 release (at least not in this package). FWIW, I think this is a good decision - since you're aggressively optimizing performance here, it makes sense to keep the surface as small as possible.
This seems really great. Are there any other things that are (or should be planned to be) extensible like this? Is it possible for other packages to make their own kmer types, and if so, is the interface well-documented? (Noting that I haven't yet looked at the code or docs, if this is explained elsewhere, no need to repeat yourself).
I don't think it's awkward. That code is still all there in the history, and can be revived if needed. I don't think she'd expect you to maintain her code forever. One consideration might be whether it is in-principle possible to reimplement in a package that extends this one, but even if not, I don't think you need to worry about it. |
I tried to make some changes but wrongfully created an alternative PR (already closed), here are some small typos I found in the docs |
I was experimenting a little bit with the julia> collect(each_dna_codon("TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
TGA
CGA
TCG But what if the input seq is of a collect(each_dna_codon(dna"TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
TGA
TGA
TGA Shouldn't it be allowed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give you my review in spurts so that you can begin to address them if you want. So far,
I've made it through the documentation. Amazing work! It's very clear.
The code will probably take me longer to get through, and I will probably have less to say about it. As described in the docs, the API seems great, and I agree with your choices re: what to remove. As I said a bit earlier, it might be nice to write up a little bit about how you might implement a custom iterator (similar to the MinHash part), if only to expose whether there are any parts of the API that need tweaking to facilitate this.
I'd be comfortable with you tagging this prior to my review of the code, since that may take me some time. But I will endeavor to get that done in the next week or two.
🎉 This is a really great addition to the ecosystem, thanks again for all of this work. It's really impressive!
I was playing around more. This is probably an undesired behavior or something not yet implemented... But as the interfaces of
findall(DNA_T, dna"ATGCTG")
2-element Vector{Int64}:
2
5 will this work: findall(mer"T"d, dna"ATGCTG")
findfirst(biore"ATG"dna, dna"ATGCTG", 1)
1:3 will this work: findfirst(mer"T"d, dna"ATGCTG") I didn't see this in the in this PR, but ignore this comment if this is already something discussed. Best |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this whenever everyone else is. This is a very generous contribution! As long as the tests pass, I don't have any concerns with the implementation details and am excited to work with this
Can't wait to have it 😍 |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #35 +/- ##
===========================================
+ Coverage 87.67% 97.78% +10.10%
===========================================
Files 13 14 +1
Lines 584 766 +182
===========================================
+ Hits 512 749 +237
+ Misses 72 17 -55
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
In this mega-commit, Kmers.jl is thoroughly overhauled, with a new API, new docs, and new types.
This PR implements a thorough overhaul of Kmers.jl. There are too many changes to list exhaustively, but here are some important changes
TODO
Deploy preview doc buildwe skip this for laterCloses #12
Closes #13 (with the answer "no")
Closes #17
Closes #23
Closes #28
Closes #31
Closes #34