Commit
This is a large commit with multiple changes:

1. Remove bitparallel compiler
The previous bitparallel compiler was not great. It made use of metaprogramming, which
- Is hard to reason about
- Does not play nice with Revise
- Cannot be debugged using a debugger or similar tooling
- Confuses linters and IDEs
- May explode codegen, impacting compile times
This might have been necessary in earlier versions of Julia, but the modern Julia compiler is so good at constant evaluation and inlining that metaprogramming of this kind is almost never a good idea.

2. Add internal chunk iterator types
This is an internal abstraction: an iterator that yields the encoded data chunks of a LongSequence or LongSubSeq. For subsequences, the yielded chunks are aligned, such that the iterator returns the exact same elements as a chunk iterator over the corresponding LongSequence. These iterators might be generally useful in future BioSequences work, and are something I've thought about for some time.

3. Deprecate the functions `n_ambiguous`, `n_gaps` and `n_certain`
These should instead be accessed via their corresponding `count` methods, as described in the documentation. The rationale is that these functions provide little value as standalone functions, as they should work exactly like `count` with specific predicates. As we might want to add more predicates in the future, standalone functions like these simply balloon the number of exported symbols.

4. Deprecate some meaningless two-sequence counting methods
The method `n_ambiguous(::BioSequence, ::BioSequence)`, the equivalents for `n_gaps` and `n_certain`, and the `count` methods corresponding to these are strange and obscure. Why are they there? They significantly complicate the implementation of counting, while providing no meaningful biological operations. At least I have no idea what the biological significance of these methods is.

5. Deprecate calling `matches` or `mismatches` with differing seq lengths
It is not clear what the correct thing to do here is, and it may be confusing for users, who may expect a pairwise alignment or similar. Instead of guessing what the user wants, force them to take a view of the longer sequence if the two differ in length.

6. Add new optimised methods for counting occurrences of a symbol in a sequence
E.g. `count(==(DNA_A), dna"TAG")` and a similar method for `isequal`.
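For readers migrating existing code, items 3, 5 and 6 might translate into something like the sketch below. It is an illustration only, not taken from the commit's diff. It assumes BioSequences is installed, and it assumes `view` on a `LongSequence` (which yields a `LongSubSeq`) is an acceptable way to trim the longer sequence, as item 5 suggests. The variable names are made up for the example.

```julia
using BioSequences

seq = dna"--TAC-WN-ACY"

# Item 3: use `count` with a predicate instead of the deprecated
# n_ambiguous / n_gaps / n_certain helpers.
n_amb  = count(isambiguous, seq)
n_gap  = count(isgap, seq)
n_cert = count(iscertain, seq)

# Item 6: count occurrences of one specific symbol.
n_a = count(==(DNA_A), dna"TAG")

# Item 5: when lengths differ, take an explicit view of the longer
# sequence rather than relying on silent truncation.
a = dna"AACA"
b = dna"AAG"
n_mis = mismatches(view(a, 1:length(b)), b)
```

The explicit `view` makes the caller state which part of the longer sequence is compared, which is the point of item 5.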
1 parent 761f053, commit 5f40116
Showing 15 changed files with 736 additions and 769 deletions.
@@ -1,7 +1,7 @@
 name = "BioSequences"
 uuid = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
 authors = ["Sabrina Jaye Ward <[email protected]>", "Jakob Nissen <[email protected]>"]
-version = "3.2.0"
+version = "3.3.0"

 [deps]
 BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
@@ -1,164 +0,0 @@
-###
-### Counting
-###
-### Counting operations on biological sequence types.
-###
-### This file is a part of BioJulia.
-### License is MIT: https://github.com/BioJulia/BioSequences.jl/blob/master/LICENSE.md
-
-###
-### Naive counting
-###
-
-function count_naive(pred, seq::BioSequence)
-    n = 0
-    @inbounds for i in eachindex(seq)
-        n += pred(seq[i])::Bool
-    end
-    return n
-end
-
-function count_naive(pred, seqa::BioSequence, seqb::BioSequence)
-    n = 0
-    @inbounds for i in 1:min(length(seqa), length(seqb))
-        n += pred(seqa[i], seqb[i])::Bool
-    end
-    return n
-end
-
-"""
-Count how many positions in a sequence satisfy a condition (i.e. f(seq[i]) -> true).
-The first argument should be a function which accepts an element of the sequence
-as its first parameter, additional arguments may be passed with `args...`.
-"""
-Base.count(pred, seq::BioSequence) = count_naive(pred, seq)
-Base.count(pred, seqa::BioSequence, seqb::BioSequence) = count_naive(pred, seqa, seqb)
-
-# These functions are BioSequences-specific because they take two arguments
-isambiguous_or(x::T, y::T) where {T<:NucleicAcid} = isambiguous(x) | isambiguous(y)
-isgap_or(x::T, y::T) where {T<:NucleicAcid} = isgap(x) | isgap(y)
-iscertain_and(x::T, y::T) where {T<:NucleicAcid} = iscertain(x) & iscertain(y)
-
-#BioSymbols.isambiguous(x::T, y::T) where {T<:NucleicAcid} = isambiguous(x) | isambiguous(y)
-#BioSymbols.isgap(x::T, y::T) where {T<:NucleicAcid} = isgap(x) | isgap(y)
-#BioSymbols.iscertain(x::T, y::T) where {T<:NucleicAcid} = iscertain(x) & iscertain(y)
-
-Base.count(::typeof(isambiguous), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = 0
-Base.count(::typeof(isgap), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = 0
-Base.count(::typeof(iscertain), seqa::S, seqb::S) where {S<:BioSequence{<:NucleicAcidAlphabet{2}}} = min(length(seqa), length(seqb))
-
-###
-### Aliases for various uses of `count`.
-###
-
-"""
-    gc_content(seq::BioSequence) -> Float64
-Calculate GC content of `seq`, i.e. the number of symbols that is `DNA_C`, `DNA_G`,
-`DNA_C` or `DNA_G` divided by the length of the sequence.
-# Examples
-```jldoctest
-julia> gc_content(dna"AGCTA")
-0.4
-julia> gc_content(rna"UAGCGA")
-0.5
-```
-"""
-gc_content(seq::NucleotideSeq) = isempty(seq) ? 0.0 : count(isGC, seq) / length(seq)
-
-"""
-    n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int
-Count the number of positions where `a` (or `b`, if present) have ambigious symbols.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-# Examples
-```jldoctest
-julia> n_ambiguous(dna"--TAC-WN-ACY")
-3
-julia> n_ambiguous(rna"UAYWW", rna"UAW")
-1
-```
-"""
-n_ambiguous(seq::BioSequence) = count(isambiguous, seq)
-n_ambiguous(seqa::BioSequence, seqb::BioSequence) = count(isambiguous_or, seqa, seqb)
-
-"""
-    n_certain(a::BioSequence, [b::BioSequence]) -> Int
-Count the number of positions where `a` (and `b`, if present) have certain (i.e. non-ambigous
-and non-gap) symbols.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-# Examples
-```jldoctest
-julia> n_certain(dna"--TAC-WN-ACY")
-5
-julia> n_certain(rna"UAYWW", rna"UAW")
-2
-```
-"""
-n_certain(seq::BioSequence) = count(iscertain, seq)
-n_certain(seqa::BioSequence, seqb::BioSequence) = count(iscertain_and, seqa, seqb)
-
-"""
-    n_gaps(a::BioSequence, [b::BioSequence]) -> Int
-Count the number of positions where `a` (or `b`, if present) have gaps.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-# Examples
-```jldoctest
-julia> n_gaps(dna"--TAC-WN-ACY")
-4
-julia> n_gaps(dna"TC-AC-", dna"-CACG")
-2
-```
-"""
-n_gaps(seq::BioSequence) = count(isgap, seq)
-n_gaps(seqa::BioSequence, seqb::BioSequence) = count(isgap_or, seqa, seqb)
-
-"""
-    mismatches(a::BioSequence, b::BioSequences) -> Int
-Count the number of positions in where `a` and `b` differ.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-# Examples
-```jldoctest
-julia> mismatches(dna"TAGCTA", dna"TACCTA")
-1
-julia> mismatches(dna"AACA", dna"AAG")
-1
-```
-"""
-mismatches(seqa::BioSequence, seqb::BioSequence) = count(!=, seqa, seqb)
-
-"""
-    mismatches(a::BioSequence, b::BioSequences) -> Int
-Count the number of positions in where `a` and `b` are equal.
-If `b` is given, and the length of `a` and `b` differ, look only at the indices
-of the shorter sequence.
-# Examples
-```jldoctest
-julia> matches(dna"TAGCTA", dna"TACCTA")
-5
-julia> matches(dna"AACA", dna"AAG")
-2
-```
-"""
-matches(seqa::BioSequence, seqb::BioSequence) = count(==, seqa, seqb)
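The removed two-sequence `count_naive` silently compared only the first `min(length(seqa), length(seqb))` positions. As a rough illustration of the stricter contract the commit message argues for, the sketch below uses plain `Char` vectors and a hypothetical helper, `count_pairs_strict`, which is not part of BioSequences. Note that the commit itself deprecates the mismatched-length call rather than throwing; the error here is an assumption made to keep the example short.

```julia
# Hypothetical helper, not BioSequences code: count positions where
# pred(a[i], b[i]) holds, refusing inputs of different lengths.
function count_pairs_strict(pred, a, b)
    length(a) == length(b) ||
        throw(DimensionMismatch("lengths differ: $(length(a)) vs $(length(b))"))
    n = 0
    for (x, y) in zip(a, b)
        n += pred(x, y)::Bool
    end
    return n
end

a = collect("AACA")
b = collect("AAG")

# The caller states explicitly which part of the longer input to compare.
n_mismatch = count_pairs_strict(!=, view(a, 1:length(b)), b)
n_match    = count_pairs_strict(==, view(a, 1:length(b)), b)
```

Taking the view at the call site keeps the truncation visible in user code instead of hiding it inside the counting routine.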