Rewrite counting of BioSequence

This is a large commit with multiple changes: 1. Remove bitparallel compiler The previous bitparallel compiler was not great. It made use of metaprogramming, which - Is hard to reason about - Does not play nice with Revise - Cannot be debugged using a debugger or similar tooling - Confuses linters and IDEs - May explode codegen, impacting compile times This might have been necessary in earlier versions of Julia, but the modern Julia compiler is so good at constant evaluation and inlining that this is almost never a good idea. 2. Add internal chunk iterator types This is an internal abstraction, an iterator that yields the encoded data chunks of LongSequence or LongSubSeq. For subsequences, the yielded chunks are aligned, such that the iterator returns the exact same elements as a chunk iterator over the correponding LongSequence. These iterators might be generally useful in future BioSequences work, and is something I've thought about for some time. 3. Deprecate the functions `n_ambigous`, `n_gaps` and `n_certain` These should instead be accessed via their corresponding `count` methods, as described in the documentation. The rationale is that these functions provide little value as standalone functions, as they should work exactly like `count` with specific predicates. As we might want to add more in the future, this simply balloons the number of exported symbols. 4. Deprecate some meaningless two-sequence counting methods The method `n_ambiguous(::BioSequence ::BioSequence)`, and the equivalent for n_gaps, n_certain and the `count` methods corresponding to these are strange and obscure. Why are they there? They significantly complicate the implementation of counting, while providing no meaningful biological operations. At least I have no idea what the biological significance of these methods are 5. Deprecate calling `matches` or `mismatches` with differing seq lengths It is not clear what the correct thing to do here is, and it may be confusing for users, who may expect a pairwise alignment or similar. Instead of guessing what the user wants, force them to take a view of the longer sequence if these differ in length. 6. Add new optimised methods for counting occurrences of symbol in sequence E.g. `count(==(DNA_A), dna"TAG")` and a similar method for `isequal`.
BioJulia · Oct 22, 2024 · 5f40116 · 5f40116
1 parent 761f053
commit 5f40116
Show file tree

Hide file tree

Showing 15 changed files with 736 additions and 769 deletions.
diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "BioSequences"
 uuid = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
 authors = ["Sabrina Jaye Ward <[email protected]>", "Jakob Nissen <[email protected]>"]
-version = "3.2.0"
+version = "3.3.0"
 
 [deps]
 BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"

diff --git a/docs/src/counting.md b/docs/src/counting.md
@@ -6,66 +6,73 @@ end
 ```
 
 # Counting
+`BioSequences` contain functionality to efficiently count biosymbols in a biosequence
+that satisfies some predicate.
+
+Consider a naive counting function like this:
+```julia
+function count_Ns(seq::BioSequence{<:DNAAlphabet})
+    ns = 0
+    for i in seq
+        ns += (i == DNA_N)::Bool
+    end
+    ns
+end 
+```
 
-BioSequences extends the `Base.count` method to provide some useful utilities for
-counting the number of sites in biological sequences.
-
-Most generically you can count the number of sites that satisfy some condition
-i.e. cause some function to return `true`:
+This function can be more efficient implemented by exploiting the internal
+data encoding of certain biosequences.
+Therefore, Julia provides optimised methods for `Base.count`, such that `count_Ns`
+above can be more efficiently expressed `count(==(DNA_N), seq)`.
 
-```jldoctest
-julia> count(isambiguous, dna"ATCGM")
-1
+!!! note
+    It is important to understand that this speed is achieved with custom methods of
+    `Base.count`, and not by a generic mechanism that improves the speed of counting
+    symbols in `BioSequence `in general.
+    Hence, while `count(==(DNA_N), seq)` may be optimised,
+    `count(i -> i == DNA_N, seq)` is not, as this is a different method.
 
-```
+## Currently optimised methods
+By default, only `LongSequence` and `LongSubSeq` have optimised methods. Downstream
+implementers of new `BioSequence`s may not be optimised.
 
-You can also use two sequences, for example to compute the number of matching
-or mismatching symbols:
+* `count(isGC, seq)`
+* `count(isambiguous, seq)`
+* `count(iscertain, seq)`
+* `count(isgap, seq)`
+* `count(==(biosymbol), seq)` and `count(isequal(biosymbol), seq)`
 
-```jldoctest
-julia> count(!=, dna"ATCGM", dna"GCCGM")
-2
+## Matches and mismatches
+The methods `matches` and `mismatches` take two
+sequences and count the number of positions where the sequences are unequal or equal, respectively.
 
-julia> count(==, dna"ATCGM", dna"GCCGM")
-3
+They are equivalent to `matches(a, b) = count(splat(==), zip(a, b))`
+(and with `!=`, respectively).
 
+```@docs
+matches
+mismatches
 ```
 
-## Alias functions
-
-A number of functions which are aliases for various invocations of `Base.count`
-are provided.
-
-| Alias function | Base.count call(s)                                          |
-| :------------- | :---------------------------------------------------------- |
-| `n_ambiguous`  | `count(isambiguous, seq)`, `count(isambiguous, seqa, seqb)` |
-| `n_certain`    | `count(iscertain, seq)`, `count(iscertain, seqa, seqb)`     |
-| `n_gap`        | `count(isgap, seq)`, `count(isgap, seqa, seqb)`             |
-| `matches`      | `count(==, seqa, seqb)`                                     |
-| `mismatches`   | `count(!=, seqa, seqb)`                                     |
+## GC content
+The convenience function `gc_content(seq)` is equivalent to `count(isGC, seq) / length(seq)`:
 
-## Bit-parallel optimisations
+```@docs
+gc_content
+```
 
-For the vast majority of `Base.count(f, seq)` and `Base.count(f, seqa, seqb)`
-methods, a naive counting is done: the internal `count_naive` function is called,
-which simply loops over each position, applies `f`, and accumulates the result.
+## Deprecated aliases
+Several of the optimised `count` methods have function names, which are deprecated:
 
-However, for some functions, it is possible to implement highly efficient methods
-that use bit-parallelism to check many elements at one time.
-This is made possible by the succinct encoding of BioSequences.
-Usually `f` is one of the functions provided by BioSymbols.jl or by BioSequences.jl
+| Deprecated function  | Instead use               |
+| :------------------- | :------------------------ |
+| `n_gaps`             | `count(isgap, seq)`       |
+| `n_certain`          | `count(iscertain, seq)`   |
+| `n_ambiguous`        | `count(isambiguous, seq)` |
 
-For such sequence and function combinations, `Base.count(f, seq)` is overloaded
-to call an internal `BioSequences.count_*_bitpar` function, which is passed the
-sequence(s). If you want to force BioSequences to use naive counting for the
-purposes of testing or debugging for example, then you can call
-`BioSequences.count_naive` directly.
 
 ```@docs
-gc_content
 n_gaps
 n_certain
 n_ambiguous
-matches
-mismatches
-```
+```
diff --git a/src/BioSequences.jl b/src/BioSequences.jl
@@ -205,6 +205,7 @@ using Random
 include("alphabet.jl")
 
 # Load the bit-twiddling internals that optimised BioSequences methods depend on.
+include("bit-manipulation/bitindex.jl")
 include("bit-manipulation/bit-manipulation.jl")
 
 # The generic, abstract BioSequence type
@@ -216,6 +217,8 @@ include("longsequences/longsequence.jl")
 include("longsequences/hash.jl")
 include("longsequences/randseq.jl")
 
+include("counting.jl")
+
 include("geneticcode.jl")
 
 

diff --git a/src/biosequence/biosequence.jl b/src/biosequence/biosequence.jl
@@ -235,5 +235,4 @@ include("predicates.jl")
 include("find.jl")
 include("printing.jl")
 include("transformations.jl")
-include("counting.jl")
 include("copying.jl")
diff --git a/src/biosequence/counting.jl b/src/biosequence/counting.jl
@@ -1,164 +0,0 @@
-###
-### Counting
-###
-### Counting operations on biological sequence types.
-###
-### This file is a part of BioJulia.
-### License is MIT: https://github.com/BioJulia/BioSequences.jl/blob/master/LICENSE.md
-
-###
-### Naive counting
-###
-
-function count_naive(pred, seq::BioSequence)
-    n = 0
-    @inbounds for i in eachindex(seq)
-        n += pred(seq[i])::Bool
-    end
-    return n
-end
-
-function count_naive(pred, seqa::BioSequence, seqb::BioSequence)
-    n = 0
-    @inbounds for i in 1:min(length(seqa), length(seqb))
-        n += pred(seqa[i], seqb[i])::Bool
-    end
-    return n
-end
-
-"""
-Count how many positions in a sequence satisfy a condition (i.e. f(seq[i]) -> true).
-
-The first argument should be a function which accepts an element of the sequence
-as its first parameter, additional arguments may be passed with `args...`.
-"""
-Base.count(pred, seq::BioSequence) = count_naive(pred, seq)
-Base.count(pred, seqa::BioSequence, seqb::BioSequence) = count_naive(pred, seqa, seqb)
-
-# These functions are BioSequences-specific because they take two arguments
-isambiguous_or(x::T, y::T) where {T<:NucleicAcid} = isambiguous(x) | isambiguous(y)
-isgap_or(x::T, y::T) where {T<:NucleicAcid} = isgap(x) | isgap(y)
-iscertain_and(x::T, y::T) where {T<:NucleicAcid} = iscertain(x) & iscertain(y)
-
-#BioSymbols.isambiguous(x::T, y::T) where {T<:NucleicAcid} = isambiguous(x) | isambiguous(y)
-#BioSymbols.isgap(x::T, y::T) where {T<:NucleicAcid} = isgap(x) | isgap(y)
-#BioSymbols.iscertain(x::T, y::T) where {T<:NucleicAcid} = iscertain(x) & iscertain(y)
-
-Base.count(::typeof(isambiguous), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = 0
-Base.count(::typeof(isgap), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = 0
-Base.count(::typeof(iscertain), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = min(length(seqa), length(seqb))
-
-###
-### Aliases for various uses of `count`.
-###
-
-"""
-    gc_content(seq::BioSequence) -> Float64
-
-Calculate GC content of `seq`, i.e. the number of symbols that is `DNA_C`, `DNA_G`,
-`DNA_C` or `DNA_G` divided by the length of the sequence.
-
-# Examples
-```jldoctest
-julia> gc_content(dna"AGCTA")
-0.4
-
-julia> gc_content(rna"UAGCGA")
-0.5
-```
-"""
-gc_content(seq::NucleotideSeq) = isempty(seq) ? 0.0 : count(isGC, seq) / length(seq)
-
-"""
-    n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int
-
-Count the number of positions where `a` (or `b`, if present) have ambigious symbols.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-
-# Examples
-```jldoctest
-julia> n_ambiguous(dna"--TAC-WN-ACY")
-3
-
-julia> n_ambiguous(rna"UAYWW", rna"UAW")
-1
-```
-"""
-n_ambiguous(seq::BioSequence) = count(isambiguous, seq)
-n_ambiguous(seqa::BioSequence, seqb::BioSequence) = count(isambiguous_or, seqa, seqb)
-
-"""
-    n_certain(a::BioSequence, [b::BioSequence]) -> Int
-
-Count the number of positions where `a` (and `b`, if present) have certain (i.e. non-ambigous
-and non-gap) symbols.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-
-# Examples
-```jldoctest
-julia> n_certain(dna"--TAC-WN-ACY")
-5
-
-julia> n_certain(rna"UAYWW", rna"UAW")
-2
-```
-"""
-n_certain(seq::BioSequence) = count(iscertain, seq)
-n_certain(seqa::BioSequence, seqb::BioSequence) = count(iscertain_and, seqa, seqb)
-
-"""
-    n_gaps(a::BioSequence, [b::BioSequence]) -> Int
-
-Count the number of positions where `a` (or `b`, if present) have gaps.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-
-# Examples
-```jldoctest
-julia> n_gaps(dna"--TAC-WN-ACY")
-4
-
-julia> n_gaps(dna"TC-AC-", dna"-CACG")
-2
-```
-"""
-n_gaps(seq::BioSequence) = count(isgap, seq)
-n_gaps(seqa::BioSequence, seqb::BioSequence) = count(isgap_or, seqa, seqb)
-
-"""
-    mismatches(a::BioSequence, b::BioSequences) -> Int
-
-Count the number of positions in where `a` and `b` differ.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-
-# Examples
-```jldoctest
-julia> mismatches(dna"TAGCTA", dna"TACCTA")
-1
-
-julia> mismatches(dna"AACA", dna"AAG")
-1
-```
-"""
-mismatches(seqa::BioSequence, seqb::BioSequence) = count(!=, seqa, seqb)
-
-"""
-    mismatches(a::BioSequence, b::BioSequences) -> Int
-
-Count the number of positions in where `a` and `b` are equal.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-
-# Examples
-```jldoctest
-julia> matches(dna"TAGCTA", dna"TACCTA")
-5
-
-julia> matches(dna"AACA", dna"AAG")
-2
-```
-"""
-matches(seqa::BioSequence, seqb::BioSequence) = count(==, seqa, seqb)

diff --git a/src/bit-manipulation/bit-manipulation.jl b/src/bit-manipulation/bit-manipulation.jl
@@ -1,7 +1,3 @@
-
-include("bitindex.jl")
-include("bitpar-compiler.jl")
-
 @inline function reversebits(x::T, ::BitsPerSymbol{2}) where T <: Base.BitUnsigned
      mask = 0x33333333333333333333333333333333 % T
      x = ((x >> 2) & mask) | ((x & mask) << 2)
@@ -67,6 +63,10 @@ end
     return count_nonzero_nibbles(enumerate_nibbles(x) & 0xEEEEEEEEEEEEEEEE)
 end
 
+@inline function ambiguous_bitcount(x::UInt64, ::AminoAcidAlphabet)
+    return count_compared_bytes(i -> (i > 0x15) & (i < 0x1a), x)
+end
+
 @inline function ambiguous_bitcount(a::UInt64, b::UInt64, ::T) where {T<:NucleicAcidAlphabet{4}}
     return count_nonzero_nibbles((enumerate_nibbles(a) | enumerate_nibbles(b)) & 0xEEEEEEEEEEEEEEEE)
 end
@@ -80,6 +80,11 @@ end
     return count_0000_nibbles(a) + count_0000_nibbles(b) - count_0000_nibbles(a | b)
 end
 
+@inline function count_compared_bytes(f::F, x::UInt64) where F
+    z = @inline reinterpret(NTuple{8, VecElement{UInt8}}, x)
+    @inline sum(map(i -> f(i.value), z))
+end
+
 @inline function certain_bitcount(x::UInt64, ::T) where {T<:NucleicAcidAlphabet{4}}
     x = enumerate_nibbles(x)
     x = x ⊻ 0x1111111111111111