From 13fb5ce13f08a9341eb576e5e76bf493b477dd6c Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Mon, 24 Jun 2024 14:48:47 +0000 Subject: [PATCH] build based on cbb5397 --- dev/.documenter-siteinfo.json | 2 +- dev/construction/index.html | 8 ++++---- dev/counting/index.html | 2 +- dev/index.html | 2 +- dev/interfaces/index.html | 2 +- dev/io/index.html | 2 +- dev/predicates/index.html | 6 +++--- dev/random/index.html | 4 ++-- dev/recipes/index.html | 2 +- dev/sequence_search/index.html | 6 +++--- dev/symbols/index.html | 2 +- dev/transforms/index.html | 12 ++++++------ dev/types/index.html | 2 +- 13 files changed, 26 insertions(+), 26 deletions(-) diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index c4e67e4c..0c06de1d 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-06-24T12:35:17","documenter_version":"1.4.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-06-24T14:48:43","documenter_version":"1.4.1"}} \ No newline at end of file diff --git a/dev/construction/index.html b/dev/construction/index.html index 7171d960..428032f9 100644 --- a/dev/construction/index.html +++ b/dev/construction/index.html @@ -135,14 +135,14 @@ "TAGA" julia> string(push!(f(), DNA_A)) -"TAGA"source
BioSequences.@rna_strMacro

The LongRNA{4} equivalent to @dna_str

See also: @dna_str, @aa_str

Examples

julia> rna"UCGUGAUGC"
+"TAGA"
source
BioSequences.@rna_strMacro

The LongRNA{4} equivalent to @dna_str

See also: @dna_str, @aa_str

Examples

julia> rna"UCGUGAUGC"
 9nt RNA Sequence:
-UCGUGAUGC
source
BioSequences.@aa_strMacro

The AminoAcidAlphabet equivalent to @dna_str

See also: @dna_str, @rna_str

Examples

julia> aa"PKLEQC"
+UCGUGAUGC
source
BioSequences.@aa_strMacro

The AminoAcidAlphabet equivalent to @dna_str

See also: @dna_str, @rna_str

Examples

julia> aa"PKLEQC"
 6aa Amino Acid Sequence:
-PKLEQC
source

Comparison to other sequence types

Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. For example, to compare a BioSequence with a vector of DNA, compare the elements themselves:

julia> seq = dna"GAGCTGA"; vec = collect(seq);
+PKLEQC
source

Comparison to other sequence types

Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. For example, to compare a BioSequence with a vector of DNA, compare the elements themselves:

julia> seq = dna"GAGCTGA"; vec = collect(seq);
 
 julia> seq == vec, isequal(seq, vec)
 (false, false)
 
 julia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))
-true 
+true diff --git a/dev/counting/index.html b/dev/counting/index.html index d482f3c6..e5376ac0 100644 --- a/dev/counting/index.html +++ b/dev/counting/index.html @@ -6,4 +6,4 @@ julia> count(==, dna"ATCGM", dna"GCCGM") 3 -

Alias functions

A number of functions which are aliases for various invocations of Base.count are provided.

Alias functionBase.count call(s)
n_ambiguouscount(isambiguous, seq), count(isambiguous, seqa, seqb)
n_certaincount(iscertain, seq), count(iscertain, seqa, seqb)
n_gapcount(isgap, seq), count(isgap, seqa, seqb)
matchescount(==, seqa, seqb)
mismatchescount(!=, seqa, seqb)

Bit-parallel optimisations

For the vast majority of Base.count(f, seq) and Base.count(f, seqa, seqb) methods, a naive counting is done: the internal count_naive function is called, which simply loops over each position, applies f, and accumulates the result.

However, for some functions, it is possible to implement highly efficient methods that use bit-parallelism to check many elements at one time. This is made possible by the succinct encoding of BioSequences. Usually f is one of the functions provided by BioSymbols.jl or by BioSequences.jl.

For such sequence and function combinations, Base.count(f, seq) is overloaded to call an internal BioSequences.count_*_bitpar function, which is passed the sequence(s). If you want to force BioSequences to use naive counting for the purposes of testing or debugging for example, then you can call BioSequences.count_naive directly.

+

Alias functions

A number of functions which are aliases for various invocations of Base.count are provided.

Alias functionBase.count call(s)
n_ambiguouscount(isambiguous, seq), count(isambiguous, seqa, seqb)
n_certaincount(iscertain, seq), count(iscertain, seqa, seqb)
n_gapcount(isgap, seq), count(isgap, seqa, seqb)
matchescount(==, seqa, seqb)
mismatchescount(!=, seqa, seqb)

Bit-parallel optimisations

For the vast majority of Base.count(f, seq) and Base.count(f, seqa, seqb) methods, a naive counting is done: the internal count_naive function is called, which simply loops over each position, applies f, and accumulates the result.

However, for some functions, it is possible to implement highly efficient methods that use bit-parallelism to check many elements at one time. This is made possible by the succinct encoding of BioSequences. Usually f is one of the functions provided by BioSymbols.jl or by BioSequences.jl.

For such sequence and function combinations, Base.count(f, seq) is overloaded to call an internal BioSequences.count_*_bitpar function, which is passed the sequence(s). If you want to force BioSequences to use naive counting for the purposes of testing or debugging for example, then you can call BioSequences.count_naive directly.
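As a rough illustration of the naive strategy described above, the following sketch loops over paired positions, applies f, and accumulates the result. It uses plain Julia strings rather than BioSequences types, and `count_naive_sketch` is a hypothetical name chosen for this example; it is not the package's internal function.

```julia
# Hypothetical helper: a plain-Julia sketch of naive pairwise counting.
# It is NOT BioSequences.count_naive; it only mirrors the idea.
function count_naive_sketch(f, a, b)
    n = 0
    for (x, y) in zip(a, b)   # visit the two inputs position by position
        n += f(x, y)          # a Bool adds as 0 or 1
    end
    return n
end

count_naive_sketch(==, "ATCGM", "GCCGM")  # 3 positions agree (C, G, M)
```

A bit-parallel method would instead compare many packed symbols per machine word; that optimisation is not shown here.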

diff --git a/dev/index.html b/dev/index.html index 9dec9768..73a4ef80 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · BioSequences.jl

BioSequences

Latest Release MIT license Documentation Pkg Status

Description

BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.

Installation

You can install BioSequences from the Julia REPL. Press ] to enter pkg mode, then enter the following:

add BioSequences

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia Discourse site.

+Home · BioSequences.jl

BioSequences

Latest Release MIT license Documentation Pkg Status

Description

BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.

Installation

You can install BioSequences from the Julia REPL. Press ] to enter pkg mode, then enter the following:

add BioSequences

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia Discourse site.

diff --git a/dev/interfaces/index.html b/dev/interfaces/index.html index 2621023f..8e863403 100644 --- a/dev/interfaces/index.html +++ b/dev/interfaces/index.html @@ -59,4 +59,4 @@ julia> Base.copy(seq::Codon) = Codon(seq.x) julia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false) -true

Interface checking functions

BioSequences.has_interfaceFunction
function has_interface(::Type{Alphabet}, A::Alphabet)

Returns whether A conforms to the Alphabet interface.

source
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)

Check if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. If the compat flag is set, check for compatibility with existing alphabets.

source
+true

Interface checking functions

BioSequences.has_interfaceFunction
function has_interface(::Type{Alphabet}, A::Alphabet)

Returns whether A conforms to the Alphabet interface.

source
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)

Check if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. If the compat flag is set, check for compatibility with existing alphabets.

source
diff --git a/dev/io/index.html b/dev/io/index.html index 1f4b728d..76d0ef26 100644 --- a/dev/io/index.html +++ b/dev/io/index.html @@ -1,2 +1,2 @@ -I/O · BioSequences.jl

I/O for sequencing file formats

Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.

After version v2.0, in order to neatly separate concerns, these submodules were removed.

Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.

A list of all of the different formats and packages is provided below to help you find them quickly.

FormatPackage
FASTAFASTX.jl
FASTQFASTX.jl
2BitTwoBit.jl
+I/O · BioSequences.jl

I/O for sequencing file formats

Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.

After version v2.0, in order to neatly separate concerns, these submodules were removed.

Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.

A list of all of the different formats and packages is provided below to help you find them quickly.

FormatPackage
FASTAFASTX.jl
FASTQFASTX.jl
2BitTwoBit.jl
diff --git a/dev/predicates/index.html b/dev/predicates/index.html index 6a07c135..9658fa1b 100644 --- a/dev/predicates/index.html +++ b/dev/predicates/index.html @@ -1,12 +1,12 @@ -Predicates · BioSequences.jl

Predicates

A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.

BioSequences.isrepetitiveFunction
isrepetitive(seq::BioSequence, n::Integer = length(seq))

Return true if and only if seq contains a repetitive subsequence of length ≥ n.

source
BioSequences.ispalindromicFunction
ispalindromic(seq::NucSeq) -> Bool

Check if seq is palindromic. A palindromic sequence is identical to its reverse-complement, so this should be equivalent to checking if seq == reverse_complement(seq).

Examples

julia> ispalindromic(dna"TGCA")
+Predicates · BioSequences.jl

Predicates

A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.

BioSequences.isrepetitiveFunction
isrepetitive(seq::BioSequence, n::Integer = length(seq))

Return true if and only if seq contains a repetitive subsequence of length ≥ n.

source
BioSequences.ispalindromicFunction
ispalindromic(seq::NucSeq) -> Bool

Check if seq is palindromic. A palindromic sequence is identical to its reverse-complement, so this should be equivalent to checking if seq == reverse_complement(seq).

Examples

julia> ispalindromic(dna"TGCA")
 true
 
 julia> ispalindromic(dna"TCCT")
 false
 
 julia> ispalindromic(rna"ACGGU")
-false

Return true if seq is a palindromic sequence; otherwise return false.

source
BioSequences.iscanonicalFunction
iscanonical(seq::NucleotideSeq)

Returns true if seq is canonical.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
+false

Return true if seq is a palindromic sequence; otherwise return false.

source
BioSequences.iscanonicalFunction
iscanonical(seq::NucleotideSeq)

Returns true if seq is canonical.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
 ATCGATCG
 CGATCGAT
-<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

source
+<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

source
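The palindromic and canonical definitions above can be sketched with plain Julia strings. This is only an illustration under simplifying assumptions: it handles the four unambiguous DNA letters only, relies on ordinary string ordering, and the names `revcomp_sketch`, `ispalindromic_sketch`, and `canonical_sketch` are hypothetical, not part of BioSequences.

```julia
# Complement table for unambiguous DNA letters only (an assumption of this sketch).
const COMPLEMENT_SKETCH = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Reverse the string, then complement every letter.
revcomp_sketch(s::AbstractString) = join(COMPLEMENT_SKETCH[c] for c in reverse(s))

# Palindromic: identical to its own reverse complement.
ispalindromic_sketch(s::AbstractString) = s == revcomp_sketch(s)

# Canonical: the lesser of the sequence and its reverse complement.
canonical_sketch(s::AbstractString) = min(s, revcomp_sketch(s))

revcomp_sketch("ATCGATCG")        # "CGATCGAT"
ispalindromic_sketch("TGCA")      # true
canonical_sketch("CGATCGAT")      # "ATCGATCG"
```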
diff --git a/dev/random/index.html b/dev/random/index.html index 84d15d1a..0491766d 100644 --- a/dev/random/index.html +++ b/dev/random/index.html @@ -1,8 +1,8 @@ Random sequences · BioSequences.jl

Generating random sequences

Long sequences

You can generate random long sequences using the randseq function and the Samplers implemented in BioSequences:

BioSequences.randseqFunction
randseq([rng::AbstractRNG], A::Alphabet, len::Integer)

Generate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. User-defined alphabets should implement this method to implement random LongSequence generation.

For RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. For a user-defined alphabet A, default is uniform across all elements of symbols(A).

Example:

julia> seq = randseq(AminoAcidAlphabet(), 50)
 50aa Amino Acid Sequence:
-VFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM
source
randseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)

Generate a LongSequence{A} of length len with elements drawn from the given sampler.

Example:

# Generate a 50-nt RNA sequence with a 4% chance of N and 24% each for A, C, G, and U
+VFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM
source
randseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)

Generate a LongSequence{A} of length len with elements drawn from the given sampler.

Example:

# Generate a 50-nt RNA sequence with a 4% chance of N and 24% each for A, C, G, and U
 julia> sp = SamplerWeighted(rna"ACGUN", fill(0.24, 4))
 julia> seq = randseq(RNAAlphabet{4}(), sp, 50)
 50nt RNA Sequence:
-CUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU
source
BioSequences.randdnaseqFunction
randdnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]

source
BioSequences.randrnaseqFunction
randrnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]

source
BioSequences.randaaseqFunction
randaaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.

source
BioSequences.SamplerUniformType
SamplerUniform{T}

Uniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.

Examples

julia> sp = SamplerUniform(rna"ACGU");
source
BioSequences.SamplerWeightedType
SamplerWeighted{T}

Weighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.

Examples

julia> sp = SamplerWeighted(rna"ACGUN", fill(0.2475, 4));
source
+CUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNUsource
BioSequences.randdnaseqFunction
randdnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]

source
BioSequences.randrnaseqFunction
randrnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]

source
BioSequences.randaaseqFunction
randaaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.

source
BioSequences.SamplerUniformType
SamplerUniform{T}

Uniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.

Examples

julia> sp = SamplerUniform(rna"ACGU");
source
BioSequences.SamplerWeightedType
SamplerWeighted{T}

Weighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.

Examples

julia> sp = SamplerWeighted(rna"ACGUN", fill(0.2475, 4));
source
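The "all probabilities except the last" convention described for SamplerWeighted can be illustrated with a small plain-Julia sketch. This is not the package's implementation; `sample_weighted_sketch` is a hypothetical name, and the cumulative-sum approach is simply one straightforward way to realise the convention.

```julia
using Random

# Hypothetical helper: draw one element, given probabilities for every
# element except the last. The last element receives the remaining mass.
function sample_weighted_sketch(rng::AbstractRNG, elems, probs)
    length(probs) == length(elems) - 1 ||
        throw(ArgumentError("need one probability per element except the last"))
    r = rand(rng)                 # uniform in [0, 1)
    acc = 0.0
    for (e, p) in zip(elems, probs)
        acc += p                  # running cumulative probability
        r < acc && return e
    end
    return last(elems)            # reached only with the leftover probability
end

# With four weights of 0.2475, the implied weight of 'N' is about 0.01.
sample_weighted_sketch(Random.default_rng(), "ACGUN", fill(0.2475, 4))
```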
diff --git a/dev/recipes/index.html b/dev/recipes/index.html index c37f8304..d94df995 100644 --- a/dev/recipes/index.html +++ b/dev/recipes/index.html @@ -29,4 +29,4 @@ 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 - 1 0 1 1 1 0 1 1 0 1 + 1 0 1 1 1 0 1 1 0 1 diff --git a/dev/sequence_search/index.html b/dev/sequence_search/index.html index 8d300718..004bad20 100644 --- a/dev/sequence_search/index.html +++ b/dev/sequence_search/index.html @@ -50,7 +50,7 @@ julia> occursin(ExactSearchQuery(dna"CNT", iscompatible), dna"ACNT") true -source

Allowing mismatches

BioSequences.ApproximateSearchQueryType
ApproximateSearchQuery{F<:Function,S<:BioSequence}

Query type for approximate sequence search.

These queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.

Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.

In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.

Examples

julia> seq = dna"ACAGCGTAGCT";
+
source

Allowing mismatches

BioSequences.ApproximateSearchQueryType
ApproximateSearchQuery{F<:Function,S<:BioSequence}

Query type for approximate sequence search.

These queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.

Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.

In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.

Examples

julia> seq = dna"ACAGCGTAGCT";
 
 julia> query = ApproximateSearchQuery(dna"AGGG");
 
@@ -69,7 +69,7 @@
 
 julia> findnext(query, 1, dna"AAGNGG", 1) # 1 mismatch permitted (A vs G) & matched N
 1:4
-
Note

This method of searching for motifs was implemented with smaller query motifs in mind.

If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.

source

Searching according to a pattern

Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.

Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}("MV+"). For bioregex literals, it is instead recommended to use the @biore_str macro:

The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna is a regular expression for DNA sequences and biore"A+"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: "d", "r" or "a", respectively.

Here are examples of using the regular expression for BioSequences:

julia> match(biore"A+C*"dna, dna"AAAACC")
+
Note

This method of searching for motifs was implemented with smaller query motifs in mind.

If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.

source
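For readers unfamiliar with the Levenshtein distance mentioned above, the following plain-Julia sketch computes it between two whole strings using the standard dynamic-programming recurrence. It is an illustration only: `levenshtein_sketch` is a hypothetical name, and the approximate search itself looks for a matching subsequence inside a longer target, which this sketch does not do.

```julia
# Hypothetical helper: classic edit distance (substitution, insertion,
# deletion each cost 1) between two whole strings.
function levenshtein_sketch(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))          # distances from the empty prefix of s
    for i in 1:length(s)
        cur = similar(prev)
        cur[1] = i                       # delete i symbols to reach empty t
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1,   # deletion
                             cur[j] + 1,        # insertion
                             prev[j] + cost)    # match or substitution
        end
        prev = cur
    end
    return prev[end]
end

levenshtein_sketch("AGGG", "AGCG")  # 1: a single substitution
```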

Searching according to a pattern

Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.

Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}("MV+"). For bioregex literals, it is instead recommended to use the @biore_str macro:

The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna is a regular expression for DNA sequences and biore"A+"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: "d", "r" or "a", respectively.

Here are examples of using the regular expression for BioSequences:

julia> match(biore"A+C*"dna, dna"AAAACC")
 RegexMatch("AAAACC")
 
 julia> match(biore"A+C*"d, dna"AAAACC")
@@ -159,4 +159,4 @@
 3-element Vector{Int64}:
  3
  5
- 9

[Wasserman2004]: https://doi.org/10.1038/nrg1315

+ 9

[Wasserman2004]: https://doi.org/10.1038/nrg1315

diff --git a/dev/symbols/index.html b/dev/symbols/index.html index 5c4fad7a..07e501e6 100644 --- a/dev/symbols/index.html +++ b/dev/symbols/index.html @@ -70,4 +70,4 @@ julia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C false -source
BioSymbols.isambiguousFunction
isambiguous(nt::NucleicAcid)

Test if nt is an ambiguous nucleotide.

source
isambiguous(aa::AminoAcid)

Test if aa is an ambiguous amino acid.

source
+source
BioSymbols.isambiguousFunction
isambiguous(nt::NucleicAcid)

Test if nt is an ambiguous nucleotide.

source
isambiguous(aa::AminoAcid)

Test if aa is an ambiguous amino acid.

source
diff --git a/dev/transforms/index.html b/dev/transforms/index.html index 92780fd7..39e26640 100644 --- a/dev/transforms/index.html +++ b/dev/transforms/index.html @@ -15,7 +15,7 @@ julia> seq[5] = DNA_A DNA_A -
Note

Some sequence types can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.

Modifying sequences

In addition to setindex!, many other modifying operations are possible for biological sequences, such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push!Method
push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

source
Base.pop!Method
pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.pushfirst!Method
pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

source
Base.popfirst!Method
popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.insert!Method
insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

source
Base.deleteat!Method
deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

source
Base.append!Method
append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

source
Base.resize!Method
resize!(seq, size, [force::Bool])

Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.

source
Base.empty!Method
empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

source

Here are some examples:

julia> seq = dna"ACG"
+
Note

Some sequence types can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.

Modifying sequences

In addition to setindex!, many other modifying operations are possible for biological sequences, such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push!Method
push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

source
Base.pop!Method
pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.pushfirst!Method
pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

source
Base.popfirst!Method
popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.insert!Method
insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

source
Base.deleteat!Method
deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

source
Base.append!Method
append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

source
Base.resize!Method
resize!(seq, size, [force::Bool])

Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.

source
Base.empty!Method
empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

source

Here are some examples:

julia> seq = dna"ACG"
 3nt DNA Sequence:
 ACG
 
@@ -34,7 +34,7 @@
 julia> deleteat!(seq, 2:3)
 3nt DNA Sequence:
 AAT
-

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse!Method
reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

source
Base.reverseMethod
reverse(seq::BioSequence)

Create reversed copy of a biological sequence.

source
reverse(seq::LongSequence)

Create reversed copy of a biological sequence.

source
BioSequences.complement!Function
complement!(seq)

Make a complement sequence of seq in place.

source
BioSymbols.complementFunction
complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
+

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse!Method
reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

source
Base.reverseMethod
reverse(seq::BioSequence)

Create reversed copy of a biological sequence.

source
reverse(seq::LongSequence)

Create reversed copy of a biological sequence.

source
BioSymbols.complementFunction
complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
 DNA_T
 
 julia> complement(DNA_N)
@@ -42,10 +42,10 @@
 
 julia> complement(RNA_U)
 RNA_A
-
source
complement(seq)

Make a complement sequence of seq.

source
BioSequences.canonical!Function
canonical!(seq::NucleotideSeq)

Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
+
source
complement(seq)

Make a complement sequence of seq.

source
BioSequences.canonical!Function
canonical!(seq::NucleotideSeq)

Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
 ATCGATCG
 CGATCGAT
-<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

source

Some examples:

julia> seq = dna"ACGTAT"
+<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

source
BioSequences.canonicalFunction
canonical(seq::NucleotideSeq)

Create the canonical sequence of seq.

source

Some examples:

julia> seq = dna"ACGTAT"
 6nt DNA Sequence:
 ACGTAT
 
@@ -60,7 +60,7 @@
 julia> reverse_complement!(seq)
 6nt DNA Sequence:
 ACGTAT
-

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.

Translation

Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.

The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes and they are registered in ncbi_trans_table.

BioSequences.translateFunction
translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate a LongRNA or a LongDNA to a LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

source
BioSequences.ncbi_trans_tableConstant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the following link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

source
julia> ncbi_trans_table
+

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.

Translation

Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.

The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes and they are registered in ncbi_trans_table.

BioSequences.translateFunction
translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate a LongRNA or a LongDNA to a LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

source
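As a small illustration of the options described above, the sketch below translates a short RNA sequence with the standard genetic code, with alternative_start=true, and with another NCBI table. It assumes the options are passed as keyword arguments and that the package is loaded with using BioSequences; the input sequences are made up for the example.

```julia
using BioSequences

# AUG GCC UAA: start codon, alanine, stop codon.
rna = rna"AUGGCCUAA"
protein = translate(rna)

# GUG normally codes for valine; with alternative_start=true the first
# codon is converted to methionine instead.
alt = rna"GUGGCC"
plain = translate(alt)
forced = translate(alt; alternative_start=true)

# Pick another NCBI table by its number (2 = vertebrate mitochondrial).
mito = translate(rna; code=ncbi_trans_table[2])
```

The translated sequence keeps the stop codon as a terminator symbol, so trimming it is left to the caller.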
BioSequences.ncbi_trans_tableConstant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the following link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

source
julia> ncbi_trans_table
 Translation Tables:
   1. The Standard Code (standard_genetic_code)
   2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
@@ -80,4 +80,4 @@
  23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
  24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
  25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
-

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes

+

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes

diff --git a/dev/types/index.html b/dev/types/index.html index 0850958a..2ac9635a 100644 --- a/dev/types/index.html +++ b/dev/types/index.html @@ -1,2 +1,2 @@ -BioSequences Types · BioSequences.jl

Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allow many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequenceType
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet A, by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

source

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of the methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.

BioSequences.encoded_data_eltypeFunction
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

source
BioSequences.extract_encoded_elementFunction
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

source
BioSequences.encoded_setindex!Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence

source

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. But they can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of a specific type.

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype T<:BioSequence{A<:Alphabet} will get.

BioSequences.AlphabetType
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the decoding from some internal "encoded data" to a BioSymbol of the alphabet's element type, as well as the encoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This gives a tuple of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].

For increased performance, see BioSequences.AsciiAlphabet

source
BioSequences.AsciiAlphabetType
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).

source

Concrete types

Implemented alphabets

BioSequences.DNAAlphabetType

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source
BioSequences.RNAAlphabetType

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source

Long Sequences

BioSequences.LongSequenceType
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type | Symbol type | Type alias
LongSequence{DNAAlphabet{N}} | DNA | LongDNA{N}
LongSequence{RNAAlphabet{N}} | RNA | LongRNA{N}
LongSequence{AminoAcidAlphabet} | AminoAcid | LongAA

The LongDNA{4} and LongRNA{4} aliases use DNAAlphabet{4} and RNAAlphabet{4}, respectively.

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).

Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.

source

Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.

The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.

+BioSequences Types · BioSequences.jl

Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allow many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequenceType
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet A, by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

source

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of the methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.

BioSequences.encoded_data_eltypeFunction
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

source
BioSequences.extract_encoded_elementFunction
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

source
BioSequences.encoded_setindex!Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence

source
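To make the three interface functions above more concrete, here is a minimal sketch that applies them to a LongDNA{4}. It assumes these functions are internal to the package (hence the BioSequences. prefix) and that encode and decode accept an alphabet instance, as the Alphabet interface describes; the sequence is made up for the example.

```julia
using BioSequences

seq = LongDNA{4}("ACGT")
A = DNAAlphabet{4}()

# Read the raw encoded value at position 2 and decode it with the alphabet.
raw = BioSequences.extract_encoded_element(seq, 2)
sym = BioSequences.decode(A, raw)

# Encode a symbol and write the encoded value back at position 1.
BioSequences.encoded_setindex!(seq, BioSequences.encode(A, DNA_T), 1)
```

In everyday code, ordinary indexing (seq[2] and seq[1] = DNA_T) is the intended route; the encoded variants mainly matter when implementing a new sequence type.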

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. But they can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of a specific type.

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype T<:BioSequence{A<:Alphabet} will get.

BioSequences.AlphabetType
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the decoding from some internal "encoded data" to a BioSymbol of the alphabet's element type, as well as the encoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This gives a tuple of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].

For increased performance, see BioSequences.AsciiAlphabet

source
BioSequences.AsciiAlphabetType
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).

source

Concrete types

Implemented alphabets

BioSequences.DNAAlphabetType

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source
BioSequences.RNAAlphabetType

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source

Long Sequences

BioSequences.LongSequenceType
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type | Symbol type | Type alias
LongSequence{DNAAlphabet{N}} | DNA | LongDNA{N}
LongSequence{RNAAlphabet{N}} | RNA | LongRNA{N}
LongSequence{AminoAcidAlphabet} | AminoAcid | LongAA

The LongDNA{4} and LongRNA{4} aliases use DNAAlphabet{4} and RNAAlphabet{4}, respectively.

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).

Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.

source
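The choice between the two alphabet widths can be sketched as follows. This is an illustrative example with made-up sequences; it assumes that a sequence can be constructed from a string and that an unambiguous 4-bit sequence can be converted by calling the 2-bit type on it.

```julia
using BioSequences

# Four bits per base: ambiguous symbols such as N are allowed.
four_bit = LongDNA{4}("ACGTN")

# Two bits per base: only A, C, G and T can be stored.
two_bit = LongDNA{2}("ACGT")

# An unambiguous 4-bit sequence can be converted to the 2-bit alphabet.
converted = LongDNA{2}(LongDNA{4}("TTAGC"))
```

Constructing a two-bit sequence from data that contains an ambiguous symbol is expected to raise an error, so the conversion is only appropriate when the input is known to be unambiguous.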

Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.

The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.
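The sharing of data between a view and its parent can be sketched as below. The example assumes that Base's view function applied to a LongSequence and a range yields a LongSubSeq; the sequence itself is made up.

```julia
using BioSequences

seq = LongDNA{4}("AACCGGTT")
v = view(seq, 3:6)   # a view covering positions 3 to 6 of seq

# A change to the parent shows up in the view...
seq[3] = DNA_T
# ...and a change made through the view shows up in the parent.
v[2] = DNA_A
```

Because the view holds no sequence data of its own, taking a copy (for example with LongSequence(v) or copy) is the way to obtain an independent sequence when the parent may later be resized.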