Skip to content

Commit

Permalink
Rewrite counting of BioSequence
Browse files Browse the repository at this point in the history
This is a large commit with multiple changes:

1. Remove bitparallel compiler
The previous bitparallel compiler was not great. It made use of metaprogramming,
which
    - Is hard to reason about
    - Does not play nice with Revise
    - Cannot be debugged using a debugger or similar tooling
    - Confuses linters and IDEs
    - May explode codegen, impacting compile times

This might have been necessary in earlier versions of Julia, but the modern
Julia compiler is so good at constant evaluation and inlining that this is
almost never a good idea.

2. Add internal chunk iterator types
This is an internal abstraction, an iterator that yields the encoded data chunks
of LongSequence or LongSubSeq. For subsequences, the yielded chunks are aligned,
such that the iterator returns the exact same elements as a chunk iterator
over the correponding LongSequence. These iterators might be generally useful in
future BioSequences work, and is something I've thought about for some time.

3. Deprecate the functions `n_ambigous`, `n_gaps` and `n_certain`
These should instead be accessed via their corresponding `count` methods, as
described in the documentation.
The rationale is that these functions provide little value as standalone
functions, as they should work exactly like `count` with specific predicates.
As we might want to add more in the future, this simply balloons the number
of exported symbols.

4. Deprecate some meaningless two-sequence counting methods
The method `n_ambiguous(::BioSequence ::BioSequence)`, and the equivalent for
n_gaps, n_certain and the `count` methods corresponding to these are strange
and obscure. Why are they there?
They significantly complicate the implementation of counting, while providing
no meaningful biological operations. At least I have no idea what the biological
significance of these methods are

5. Deprecate calling `matches` or `mismatches` with differing seq lengths
It is not clear what the correct thing to do here is, and it may be confusing
for users, who may expect a pairwise alignment or similar.
Instead of guessing what the user wants, force them to take a view of the longer
sequence if these differ in length.

6. Add new optimised methods for counting occurrences of symbol in sequence
E.g. `count(==(DNA_A), dna"TAG")` and a similar method for `isequal`.
  • Loading branch information
jakobnissen committed Oct 22, 2024
1 parent 761f053 commit 5f40116
Show file tree
Hide file tree
Showing 15 changed files with 736 additions and 769 deletions.
2 changes: 1 addition & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
name = "BioSequences"
uuid = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
authors = ["Sabrina Jaye Ward <[email protected]>", "Jakob Nissen <[email protected]>"]
version = "3.2.0"
version = "3.3.0"

[deps]
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
Expand Down
97 changes: 52 additions & 45 deletions docs/src/counting.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,66 +6,73 @@ end
```

# Counting
`BioSequences` contain functionality to efficiently count biosymbols in a biosequence
that satisfies some predicate.

Consider a naive counting function like this:
```julia
function count_Ns(seq::BioSequence{<:DNAAlphabet})
ns = 0
for i in seq
ns += (i == DNA_N)::Bool
end
ns
end
```

BioSequences extends the `Base.count` method to provide some useful utilities for
counting the number of sites in biological sequences.

Most generically you can count the number of sites that satisfy some condition
i.e. cause some function to return `true`:
This function can be more efficient implemented by exploiting the internal
data encoding of certain biosequences.
Therefore, Julia provides optimised methods for `Base.count`, such that `count_Ns`
above can be more efficiently expressed `count(==(DNA_N), seq)`.

```jldoctest
julia> count(isambiguous, dna"ATCGM")
1
!!! note
It is important to understand that this speed is achieved with custom methods of
`Base.count`, and not by a generic mechanism that improves the speed of counting
symbols in `BioSequence `in general.
Hence, while `count(==(DNA_N), seq)` may be optimised,
`count(i -> i == DNA_N, seq)` is not, as this is a different method.

```
## Currently optimised methods
By default, only `LongSequence` and `LongSubSeq` have optimised methods. Downstream
implementers of new `BioSequence`s may not be optimised.

You can also use two sequences, for example to compute the number of matching
or mismatching symbols:
* `count(isGC, seq)`
* `count(isambiguous, seq)`
* `count(iscertain, seq)`
* `count(isgap, seq)`
* `count(==(biosymbol), seq)` and `count(isequal(biosymbol), seq)`

```jldoctest
julia> count(!=, dna"ATCGM", dna"GCCGM")
2
## Matches and mismatches
The methods `matches` and `mismatches` take two
sequences and count the number of positions where the sequences are unequal or equal, respectively.

julia> count(==, dna"ATCGM", dna"GCCGM")
3
They are equivalent to `matches(a, b) = count(splat(==), zip(a, b))`
(and with `!=`, respectively).

```@docs
matches
mismatches
```

## Alias functions

A number of functions which are aliases for various invocations of `Base.count`
are provided.

| Alias function | Base.count call(s) |
| :------------- | :---------------------------------------------------------- |
| `n_ambiguous` | `count(isambiguous, seq)`, `count(isambiguous, seqa, seqb)` |
| `n_certain` | `count(iscertain, seq)`, `count(iscertain, seqa, seqb)` |
| `n_gap` | `count(isgap, seq)`, `count(isgap, seqa, seqb)` |
| `matches` | `count(==, seqa, seqb)` |
| `mismatches` | `count(!=, seqa, seqb)` |
## GC content
The convenience function `gc_content(seq)` is equivalent to `count(isGC, seq) / length(seq)`:

## Bit-parallel optimisations
```@docs
gc_content
```

For the vast majority of `Base.count(f, seq)` and `Base.count(f, seqa, seqb)`
methods, a naive counting is done: the internal `count_naive` function is called,
which simply loops over each position, applies `f`, and accumulates the result.
## Deprecated aliases
Several of the optimised `count` methods have function names, which are deprecated:

However, for some functions, it is possible to implement highly efficient methods
that use bit-parallelism to check many elements at one time.
This is made possible by the succinct encoding of BioSequences.
Usually `f` is one of the functions provided by BioSymbols.jl or by BioSequences.jl
| Deprecated function | Instead use |
| :------------------- | :------------------------ |
| `n_gaps` | `count(isgap, seq)` |
| `n_certain` | `count(iscertain, seq)` |
| `n_ambiguous` | `count(isambiguous, seq)` |

For such sequence and function combinations, `Base.count(f, seq)` is overloaded
to call an internal `BioSequences.count_*_bitpar` function, which is passed the
sequence(s). If you want to force BioSequences to use naive counting for the
purposes of testing or debugging for example, then you can call
`BioSequences.count_naive` directly.

```@docs
gc_content
n_gaps
n_certain
n_ambiguous
matches
mismatches
```
```
3 changes: 3 additions & 0 deletions src/BioSequences.jl
Original file line number Diff line number Diff line change
Expand Up @@ -205,6 +205,7 @@ using Random
include("alphabet.jl")

# Load the bit-twiddling internals that optimised BioSequences methods depend on.
include("bit-manipulation/bitindex.jl")
include("bit-manipulation/bit-manipulation.jl")

# The generic, abstract BioSequence type
Expand All @@ -216,6 +217,8 @@ include("longsequences/longsequence.jl")
include("longsequences/hash.jl")
include("longsequences/randseq.jl")

include("counting.jl")

include("geneticcode.jl")


Expand Down
1 change: 0 additions & 1 deletion src/biosequence/biosequence.jl
Original file line number Diff line number Diff line change
Expand Up @@ -235,5 +235,4 @@ include("predicates.jl")
include("find.jl")
include("printing.jl")
include("transformations.jl")
include("counting.jl")
include("copying.jl")
164 changes: 0 additions & 164 deletions src/biosequence/counting.jl
Original file line number Diff line number Diff line change
@@ -1,164 +0,0 @@
###
### Counting
###
### Counting operations on biological sequence types.
###
### This file is a part of BioJulia.
### License is MIT: https://github.com/BioJulia/BioSequences.jl/blob/master/LICENSE.md

###
### Naive counting
###

function count_naive(pred, seq::BioSequence)
n = 0
@inbounds for i in eachindex(seq)
n += pred(seq[i])::Bool
end
return n
end

function count_naive(pred, seqa::BioSequence, seqb::BioSequence)
n = 0
@inbounds for i in 1:min(length(seqa), length(seqb))
n += pred(seqa[i], seqb[i])::Bool
end
return n
end

"""
Count how many positions in a sequence satisfy a condition (i.e. f(seq[i]) -> true).
The first argument should be a function which accepts an element of the sequence
as its first parameter, additional arguments may be passed with `args...`.
"""
Base.count(pred, seq::BioSequence) = count_naive(pred, seq)
Base.count(pred, seqa::BioSequence, seqb::BioSequence) = count_naive(pred, seqa, seqb)

# These functions are BioSequences-specific because they take two arguments
isambiguous_or(x::T, y::T) where {T<:NucleicAcid} = isambiguous(x) | isambiguous(y)
isgap_or(x::T, y::T) where {T<:NucleicAcid} = isgap(x) | isgap(y)
iscertain_and(x::T, y::T) where {T<:NucleicAcid} = iscertain(x) & iscertain(y)

#BioSymbols.isambiguous(x::T, y::T) where {T<:NucleicAcid} = isambiguous(x) | isambiguous(y)
#BioSymbols.isgap(x::T, y::T) where {T<:NucleicAcid} = isgap(x) | isgap(y)
#BioSymbols.iscertain(x::T, y::T) where {T<:NucleicAcid} = iscertain(x) & iscertain(y)

Base.count(::typeof(isambiguous), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = 0
Base.count(::typeof(isgap), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = 0
Base.count(::typeof(iscertain), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = min(length(seqa), length(seqb))

###
### Aliases for various uses of `count`.
###

"""
gc_content(seq::BioSequence) -> Float64
Calculate GC content of `seq`, i.e. the number of symbols that is `DNA_C`, `DNA_G`,
`DNA_C` or `DNA_G` divided by the length of the sequence.
# Examples
```jldoctest
julia> gc_content(dna"AGCTA")
0.4
julia> gc_content(rna"UAGCGA")
0.5
```
"""
gc_content(seq::NucleotideSeq) = isempty(seq) ? 0.0 : count(isGC, seq) / length(seq)

"""
n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int
Count the number of positions where `a` (or `b`, if present) have ambigious symbols.
If `b` is given, and the length of `a` and `b` differ, look only at the indices
of the shorter sequence.
# Examples
```jldoctest
julia> n_ambiguous(dna"--TAC-WN-ACY")
3
julia> n_ambiguous(rna"UAYWW", rna"UAW")
1
```
"""
n_ambiguous(seq::BioSequence) = count(isambiguous, seq)
n_ambiguous(seqa::BioSequence, seqb::BioSequence) = count(isambiguous_or, seqa, seqb)

"""
n_certain(a::BioSequence, [b::BioSequence]) -> Int
Count the number of positions where `a` (and `b`, if present) have certain (i.e. non-ambigous
and non-gap) symbols.
If `b` is given, and the length of `a` and `b` differ, look only at the indices
of the shorter sequence.
# Examples
```jldoctest
julia> n_certain(dna"--TAC-WN-ACY")
5
julia> n_certain(rna"UAYWW", rna"UAW")
2
```
"""
n_certain(seq::BioSequence) = count(iscertain, seq)
n_certain(seqa::BioSequence, seqb::BioSequence) = count(iscertain_and, seqa, seqb)

"""
n_gaps(a::BioSequence, [b::BioSequence]) -> Int
Count the number of positions where `a` (or `b`, if present) have gaps.
If `b` is given, and the length of `a` and `b` differ, look only at the indices
of the shorter sequence.
# Examples
```jldoctest
julia> n_gaps(dna"--TAC-WN-ACY")
4
julia> n_gaps(dna"TC-AC-", dna"-CACG")
2
```
"""
n_gaps(seq::BioSequence) = count(isgap, seq)
n_gaps(seqa::BioSequence, seqb::BioSequence) = count(isgap_or, seqa, seqb)

"""
mismatches(a::BioSequence, b::BioSequences) -> Int
Count the number of positions in where `a` and `b` differ.
If `b` is given, and the length of `a` and `b` differ, look only at the indices
of the shorter sequence.
# Examples
```jldoctest
julia> mismatches(dna"TAGCTA", dna"TACCTA")
1
julia> mismatches(dna"AACA", dna"AAG")
1
```
"""
mismatches(seqa::BioSequence, seqb::BioSequence) = count(!=, seqa, seqb)

"""
mismatches(a::BioSequence, b::BioSequences) -> Int
Count the number of positions in where `a` and `b` are equal.
If `b` is given, and the length of `a` and `b` differ, look only at the indices
of the shorter sequence.
# Examples
```jldoctest
julia> matches(dna"TAGCTA", dna"TACCTA")
5
julia> matches(dna"AACA", dna"AAG")
2
```
"""
matches(seqa::BioSequence, seqb::BioSequence) = count(==, seqa, seqb)
13 changes: 9 additions & 4 deletions src/bit-manipulation/bit-manipulation.jl
Original file line number Diff line number Diff line change
@@ -1,7 +1,3 @@

include("bitindex.jl")
include("bitpar-compiler.jl")

@inline function reversebits(x::T, ::BitsPerSymbol{2}) where T <: Base.BitUnsigned
mask = 0x33333333333333333333333333333333 % T
x = ((x >> 2) & mask) | ((x & mask) << 2)
Expand Down Expand Up @@ -67,6 +63,10 @@ end
return count_nonzero_nibbles(enumerate_nibbles(x) & 0xEEEEEEEEEEEEEEEE)
end

@inline function ambiguous_bitcount(x::UInt64, ::AminoAcidAlphabet)
return count_compared_bytes(i -> (i > 0x15) & (i < 0x1a), x)
end

@inline function ambiguous_bitcount(a::UInt64, b::UInt64, ::T) where {T<:NucleicAcidAlphabet{4}}
return count_nonzero_nibbles((enumerate_nibbles(a) | enumerate_nibbles(b)) & 0xEEEEEEEEEEEEEEEE)
end
Expand All @@ -80,6 +80,11 @@ end
return count_0000_nibbles(a) + count_0000_nibbles(b) - count_0000_nibbles(a | b)
end

@inline function count_compared_bytes(f::F, x::UInt64) where F
z = @inline reinterpret(NTuple{8, VecElement{UInt8}}, x)
@inline sum(map(i -> f(i.value), z))
end

@inline function certain_bitcount(x::UInt64, ::T) where {T<:NucleicAcidAlphabet{4}}
x = enumerate_nibbles(x)
x = x 0x1111111111111111
Expand Down
Loading

0 comments on commit 5f40116

Please sign in to comment.