diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 0cff6d87..d67aef0b 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-10-22T16:18:36","documenter_version":"1.7.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-10-22T16:41:41","documenter_version":"1.7.0"}} \ No newline at end of file diff --git a/dev/construction/index.html b/dev/construction/index.html index 41438b63..ce5d1f8b 100644 --- a/dev/construction/index.html +++ b/dev/construction/index.html @@ -135,11 +135,11 @@ "TAGA" julia> string(push!(f(), DNA_A)) -"TAGA"source
BioSequences.@rna_strMacro

The LongRNA{4} equivalent to @dna_str

See also: @dna_str, @aa_str

Examples

julia> rna"UCGUGAUGC"
+"TAGA"
source
BioSequences.@rna_strMacro

The LongRNA{4} equivalent to @dna_str

See also: @dna_str, @aa_str

Examples

julia> rna"UCGUGAUGC"
 9nt RNA Sequence:
-UCGUGAUGC
source
BioSequences.@aa_strMacro

The AminoAcidAlphabet equivalent to @dna_str

See also: @dna_str, @rna_str

Examples

julia> aa"PKLEQC"
+UCGUGAUGC
source
BioSequences.@aa_strMacro

The AminoAcidAlphabet equivalent to @dna_str

See also: @dna_str, @rna_str

Examples

julia> aa"PKLEQC"
 6aa Amino Acid Sequence:
-PKLEQC
source

Loose parsing

As of version 3.2.0, BioSequences.jl provides the bioseq function, which can be used to build a LongSequence from a string (or an AbstractVector{UInt8}) without knowing the correct Alphabet.

julia> bioseq("ATGTGCTGA")
+PKLEQC
source

Loose parsing

As of version 3.2.0, BioSequences.jl provides the bioseq function, which can be used to build a LongSequence from a string (or an AbstractVector{UInt8}) without knowing the correct Alphabet.

julia> bioseq("ATGTGCTGA")
 9nt DNA Sequence:
 ATGTGCTGA

The function will prioritise 2-bit alphabets over 4-bit alphabets, and prefer smaller alphabets (like DNAAlphabet{4}) over larger (like AminoAcidAlphabet). If the input cannot be encoded by any of the built-in alphabets, an error is thrown:

julia> bioseq("0!(CC!;#&&%")
 ERROR: cannot encode 0x30 in AminoAcidAlphabet
@@ -153,7 +153,7 @@
 
 julia> bioseq("PKMW#3>>0;kL")
 ERROR: cannot encode 0x23 in AminoAcidAlphabet
-[...]
source
BioSequences.guess_alphabetFunction
guess_alphabet(s::Union{AbstractString, AbstractVector{UInt8}}) -> Union{Integer, Alphabet}

Pick an Alphabet that can encode input s. If no Alphabet can, return the index of the first byte of the input which is not encodable in any alphabet. This function only knows about the alphabets listed below. If multiple alphabets are possible, pick the first from the order below (i.e. DNAAlphabet{2}() if possible, otherwise RNAAlphabet{2}() etc).

  1. DNAAlphabet{2}()
  2. RNAAlphabet{2}()
  3. DNAAlphabet{4}()
  4. RNAAlphabet{4}()
  5. AminoAcidAlphabet()
Warning

The functions bioseq and guess_alphabet are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.

Examples

julia> guess_alphabet("AGGCA")
+[...]
source
BioSequences.guess_alphabetFunction
guess_alphabet(s::Union{AbstractString, AbstractVector{UInt8}}) -> Union{Integer, Alphabet}

Pick an Alphabet that can encode input s. If no Alphabet can, return the index of the first byte of the input which is not encodable in any alphabet. This function only knows about the alphabets listed below. If multiple alphabets are possible, pick the first from the order below (i.e. DNAAlphabet{2}() if possible, otherwise RNAAlphabet{2}() etc).

  1. DNAAlphabet{2}()
  2. RNAAlphabet{2}()
  3. DNAAlphabet{4}()
  4. RNAAlphabet{4}()
  5. AminoAcidAlphabet()
Warning

The functions bioseq and guess_alphabet are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.

Examples

julia> guess_alphabet("AGGCA")
 DNAAlphabet{2}()
 
 julia> guess_alphabet("WKLQSTV")
@@ -163,10 +163,10 @@
 5
 
 julia> guess_alphabet("UAGCSKMU")
-RNAAlphabet{4}()
source

Comparison to other sequence types

Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. For example, to compare a BioSequence with a vector of DNA symbols, compare the elements themselves:

julia> seq = dna"GAGCTGA"; vec = collect(seq);
+RNAAlphabet{4}()
source
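
Given the warning above, package code would normally name the alphabet explicitly rather than guess it. The following is a minimal sketch, assuming BioSequences 3.2 or later is installed; `parse_or_nothing` is a hypothetical helper for illustration and is not part of the package.

```julia
using BioSequences

# Type-stable construction: the alphabet is part of the type.
seq = LongDNA{4}("ATGTGCTGA")

# Hypothetical helper: guess_alphabet returns either an Alphabet instance
# or the integer index of the first unencodable byte, so branch on the type.
function parse_or_nothing(s::AbstractString)
    A = guess_alphabet(s)
    A isa Integer && return nothing
    return LongSequence{typeof(A)}(s)
end
```

The helper is still type unstable, since its return type depends on the input data, so it suits interactive use only.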

Comparison to other sequence types

Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. For example, to compare a BioSequence with a vector of DNA symbols, compare the elements themselves:

julia> seq = dna"GAGCTGA"; vec = collect(seq);
 
 julia> seq == vec, isequal(seq, vec)
 (false, false)
 
 julia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))
-true 
+true diff --git a/dev/counting/index.html b/dev/counting/index.html index 4af1c025..297ce7e0 100644 --- a/dev/counting/index.html +++ b/dev/counting/index.html @@ -9,24 +9,24 @@ 3 julia> matches(dna"AACA", dna"AAG") -2source
BioSequences.mismatchesFunction
mismatches(a::BioSequence, b::BioSequence) -> Int

Count the number of positions where a and b differ. If the lengths of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.

Warning

Passing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.

Examples

julia> mismatches(dna"TAGCTA", dna"TACNTA")
+2
source
BioSequences.mismatchesFunction
mismatches(a::BioSequence, b::BioSequence) -> Int

Count the number of positions where a and b differ. If the lengths of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.

Warning

Passing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.

Examples

julia> mismatches(dna"TAGCTA", dna"TACNTA")
 2
 
 julia> mismatches(dna"AACA", dna"AAG")
-1
source

GC content

The convenience function gc_content(seq) is equivalent to count(isGC, seq) / length(seq):

BioSequences.gc_contentFunction
gc_content(seq::BioSequence) -> Float64

Calculate GC content of seq, i.e. the number of symbols that are DNA_C, DNA_G, RNA_C or RNA_G divided by the length of the sequence.

Examples

julia> gc_content(dna"AGCTA")
+1
source
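
Because passing sequences of different lengths is deprecated, callers may want to check lengths up front. This is a sketch under that assumption; `checked_mismatches` is a hypothetical wrapper, not a BioSequences function.

```julia
using BioSequences

# Hypothetical wrapper: refuse unequal lengths instead of silently
# comparing only the indices of the shorter sequence.
function checked_mismatches(a::BioSequence, b::BioSequence)
    length(a) == length(b) ||
        throw(DimensionMismatch("sequences differ in length"))
    return mismatches(a, b)
end

n = checked_mismatches(dna"TAGCTA", dna"TACNTA")
```

For equal-length input the result agrees with an element-wise comparison over `zip(a, b)`.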

GC content

The convenience function gc_content(seq) is equivalent to count(isGC, seq) / length(seq):

BioSequences.gc_contentFunction
gc_content(seq::BioSequence) -> Float64

Calculate GC content of seq, i.e. the number of symbols that are DNA_C, DNA_G, RNA_C or RNA_G divided by the length of the sequence.

Examples

julia> gc_content(dna"AGCTA")
 0.4
 
 julia> gc_content(rna"UAGCGA")
-0.5
source

Deprecated aliases

Several of the optimised count methods also have named aliases, which are deprecated:

Deprecated function | Instead use
n_gaps | count(isgap, seq)
n_certain | count(iscertain, seq)
n_ambiguous | count(isambiguous, seq)
BioSequences.n_gapsFunction
n_gaps(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) have gaps. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_gaps(dna"--TAC-WN-ACY")
+0.5
source
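
The equivalence stated for gc_content can be checked by hand. A small sketch, reusing the sequence from the docstring example:

```julia
using BioSequences

seq = dna"AGCTA"

# Two of the five symbols (G and C) satisfy isGC.
manual = count(isGC, seq) / length(seq)
```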

Deprecated aliases

Several of the optimised count methods also have named aliases, which are deprecated:

Deprecated function | Instead use
n_gaps | count(isgap, seq)
n_certain | count(iscertain, seq)
n_ambiguous | count(isambiguous, seq)
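
As a short illustration of the replacements listed above, here is the sequence from the docstring examples counted with the predicate form. The variable names are only illustrative.

```julia
using BioSequences

seq = dna"--TAC-WN-ACY"

gaps      = count(isgap, seq)        # replaces n_gaps(seq)
certain   = count(iscertain, seq)    # replaces n_certain(seq)
ambiguous = count(isambiguous, seq)  # replaces n_ambiguous(seq)
```

Every symbol in this sequence is a gap, a certain symbol, or an ambiguous symbol, so the three counts sum to `length(seq)`.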
BioSequences.n_gapsFunction
n_gaps(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) have gaps. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_gaps(dna"--TAC-WN-ACY")
 4
 
 julia> n_gaps(dna"TC-AC-", dna"-CACG")
-2
source
BioSequences.n_certainFunction
n_certain(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (and b, if present) have certain (i.e. non-ambiguous and non-gap) symbols. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. Gaps are not certain.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_certain(dna"--TAC-WN-ACY")
+2
source
BioSequences.n_certainFunction
n_certain(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (and b, if present) have certain (i.e. non-ambiguous and non-gap) symbols. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. Gaps are not certain.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_certain(dna"--TAC-WN-ACY")
 5
 
 julia> n_certain(rna"UAYWW", rna"UAW")
-2
source
BioSequences.n_ambiguousFunction
n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) have ambiguous symbols. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. Gaps are not ambiguous.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_ambiguous(dna"--TAC-WN-ACY")
+2
source
BioSequences.n_ambiguousFunction
n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) have ambiguous symbols. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. Gaps are not ambiguous.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_ambiguous(dna"--TAC-WN-ACY")
 3
 
 julia> n_ambiguous(rna"UAYWW", rna"UAW")
-1
source
+1source diff --git a/dev/index.html b/dev/index.html index 8f512020..963ca2b6 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · BioSequences.jl

BioSequences

Latest Release MIT license Documentation Pkg Status

Description

BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.

Installation

You can install BioSequences from the Julia REPL. Press ] to enter pkg mode, and enter the following:

add BioSequences

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia Discourse site.

+Home · BioSequences.jl

BioSequences

Latest Release MIT license Documentation Pkg Status

Description

BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.

Installation

You can install BioSequences from the Julia REPL. Press ] to enter pkg mode, and enter the following:

add BioSequences

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia Discourse site.

diff --git a/dev/interfaces/index.html b/dev/interfaces/index.html index 4183481a..3ff98823 100644 --- a/dev/interfaces/index.html +++ b/dev/interfaces/index.html @@ -59,4 +59,4 @@ julia> Base.copy(seq::Codon) = Codon(seq.x) julia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false) -true

Interface checking functions

BioSequences.has_interfaceFunction
function has_interface(::Type{Alphabet}, A::Alphabet)

Returns whether A conforms to the Alphabet interface.

source
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)

Check if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. If the compat flag is set, check for compatibility with existing alphabets.

source
+true

Interface checking functions

BioSequences.has_interfaceFunction
function has_interface(::Type{Alphabet}, A::Alphabet)

Returns whether A conforms to the Alphabet interface.

source
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)

Check if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. If the compat flag is set, check for compatibility with existing alphabets.

source
diff --git a/dev/io/index.html b/dev/io/index.html index 403210ca..eb09643d 100644 --- a/dev/io/index.html +++ b/dev/io/index.html @@ -1,2 +1,2 @@ -I/O · BioSequences.jl

I/O for sequencing file formats

Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.

After version v2.0, in order to neatly separate concerns, these submodules were removed.

Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.

A list of all of the different formats and packages is provided below to help you find them quickly.

Format | Package
FASTA | FASTX.jl
FASTQ | FASTX.jl
2Bit | TwoBit.jl
+I/O · BioSequences.jl

I/O for sequencing file formats

Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.

After version v2.0, in order to neatly separate concerns, these submodules were removed.

Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.

A list of all of the different formats and packages is provided below to help you find them quickly.

Format | Package
FASTA | FASTX.jl
FASTQ | FASTX.jl
2Bit | TwoBit.jl
diff --git a/dev/objects.inv b/dev/objects.inv index cf8510dd..0b3b1a63 100644 Binary files a/dev/objects.inv and b/dev/objects.inv differ diff --git a/dev/predicates/index.html b/dev/predicates/index.html index 32cf684b..f91a9541 100644 --- a/dev/predicates/index.html +++ b/dev/predicates/index.html @@ -1,12 +1,12 @@ -Predicates · BioSequences.jl

Predicates

A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.

BioSequences.isrepetitiveFunction
isrepetitive(seq::BioSequence, n::Integer = length(seq))

Return true if and only if seq contains a repetitive subsequence of length ≥ n.

source
BioSequences.ispalindromicFunction
ispalindromic(seq::NucSeq) -> Bool

Check if seq is palindromic. A palindromic sequence is identical to its reverse-complement, so this should be equivalent to checking if seq == reverse_complement(seq).

Examples

julia> ispalindromic(dna"TGCA")
+Predicates · BioSequences.jl

Predicates

A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.

BioSequences.isrepetitiveFunction
isrepetitive(seq::BioSequence, n::Integer = length(seq))

Return true if and only if seq contains a repetitive subsequence of length ≥ n.

source
BioSequences.ispalindromicFunction
ispalindromic(seq::NucSeq) -> Bool

Check if seq is palindromic. A palindromic sequence is identical to its reverse-complement, so this should be equivalent to checking if seq == reverse_complement(seq).

Examples

julia> ispalindromic(dna"TGCA")
 true
 
 julia> ispalindromic(dna"TCCT")
 false
 
 julia> ispalindromic(rna"ACGGU")
-false

Return true if seq is a palindromic sequence; otherwise return false.

source
BioSequences.iscanonicalFunction
iscanonical(seq::NucleotideSeq)

Returns true if seq is canonical.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
+false

Return true if seq is a palindromic sequence; otherwise return false.

source
BioSequences.iscanonicalFunction
iscanonical(seq::NucleotideSeq)

Returns true if seq is canonical.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
 ATCGATCG
 CGATCGAT
-<-------
Note

Calling reverse_complement on a DNA sequence returns this reverse complement.

The canonical sequence is the lesser of the two, i.e. canonical_seq < other_seq.

source
+<-------
Note

Calling reverse_complement on a DNA sequence returns this reverse complement.

The canonical sequence is the lesser of the two, i.e. canonical_seq < other_seq.

source
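
The definition of canonical form can be sketched by hand using the two strands from the diagram above. This assumes `canonical` (the non-mutating counterpart of `canonical!`) is available; the variable names are illustrative.

```julia
using BioSequences

seq = dna"CGATCGAT"
rc  = reverse_complement(seq)  # the other strand, read in its own direction

# Pick the lesser of the two strands by hand.
manual = seq < rc ? seq : rc
```

Here `manual` is the strand that `iscanonical` accepts, and it matches what `canonical(seq)` returns.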
diff --git a/dev/random/index.html b/dev/random/index.html index 02b8296a..8cc972a9 100644 --- a/dev/random/index.html +++ b/dev/random/index.html @@ -1,8 +1,8 @@ Random sequences · BioSequences.jl

Generating random sequences

Long sequences

You can generate random long sequences using the randseq function and the samplers implemented in BioSequences:

BioSequences.randseqFunction
randseq([rng::AbstractRNG], A::Alphabet, len::Integer)

Generate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. User-defined alphabets should implement this method to implement random LongSequence generation.

For RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. For a user-defined alphabet A, default is uniform across all elements of symbols(A).

Example:

julia> seq = randseq(AminoAcidAlphabet(), 50)
 50aa Amino Acid Sequence:
-VFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM
source
randseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)

Generate a LongSequence{A} of length len with elements drawn from the given sampler.

Example:

# Generate 50-length RNA with 4% chance of N, 24% for A, C, G, or U
+VFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM
source
randseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)

Generate a LongSequence{A} of length len with elements drawn from the given sampler.

Example:

# Generate 50-length RNA with 4% chance of N, 24% for A, C, G, or U
 julia> sp = SamplerWeighted(rna"ACGUN", fill(0.24, 4))
 julia> seq = randseq(RNAAlphabet{4}(), sp, 50)
 50nt RNA Sequence:
-CUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU
source
BioSequences.randdnaseqFunction
randdnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]

source
BioSequences.randrnaseqFunction
randrnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]

source
BioSequences.randaaseqFunction
randaaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.

source
BioSequences.SamplerUniformType
SamplerUniform{T}

Uniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.

Examples

julia> sp = SamplerUniform(rna"ACGU");
source
BioSequences.SamplerWeightedType
SamplerWeighted{T}

Weighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.

Examples

julia> sp = SamplerWeighted(rna"ACGUN", fill(0.2475, 4));
source
+CUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNUsource
BioSequences.randdnaseqFunction
randdnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]

source
BioSequences.randrnaseqFunction
randrnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]

source
BioSequences.randaaseqFunction
randaaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.

source
BioSequences.SamplerUniformType
SamplerUniform{T}

Uniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.

Examples

julia> sp = SamplerUniform(rna"ACGU");
source
BioSequences.SamplerWeightedType
SamplerWeighted{T}

Weighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.

Examples

julia> sp = SamplerWeighted(rna"ACGUN", fill(0.2475, 4));
source
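
The optional rng argument in the signatures above makes random sequences reproducible. A minimal sketch, assuming the Random standard library; the seed 42 is arbitrary:

```julia
using BioSequences, Random

rng = MersenneTwister(42)                         # seeded for reproducibility
sp  = SamplerWeighted(rna"ACGUN", fill(0.24, 4))  # N gets the remaining 0.04
seq = randseq(rng, RNAAlphabet{4}(), sp, 50)
```

Calling randseq again with a freshly seeded `MersenneTwister(42)` and the same sampler should produce the same sequence.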
diff --git a/dev/recipes/index.html b/dev/recipes/index.html index 1582d907..196b7e93 100644 --- a/dev/recipes/index.html +++ b/dev/recipes/index.html @@ -29,4 +29,4 @@ 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 - 1 0 1 1 1 0 1 1 0 1 + 1 0 1 1 1 0 1 1 0 1 diff --git a/dev/sequence_search/index.html b/dev/sequence_search/index.html index c189e5cd..c42e9fcc 100644 --- a/dev/sequence_search/index.html +++ b/dev/sequence_search/index.html @@ -50,7 +50,7 @@ julia> occursin(ExactSearchQuery(dna"CNT", iscompatible), dna"ACNT") true -source

Allowing mismatches

BioSequences.ApproximateSearchQueryType
ApproximateSearchQuery{F<:Function,S<:BioSequence}

Query type for approximate sequence search.

These queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.

Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.

In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.

Examples

julia> seq = dna"ACAGCGTAGCT";
+
source

Allowing mismatches

BioSequences.ApproximateSearchQueryType
ApproximateSearchQuery{F<:Function,S<:BioSequence}

Query type for approximate sequence search.

These queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.

Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.

In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.

Examples

julia> seq = dna"ACAGCGTAGCT";
 
 julia> query = ApproximateSearchQuery(dna"AGGG");
 
@@ -69,7 +69,7 @@
 
 julia> findnext(query, 1, dna"AAGNGG", 1) # 1 mismatch permitted (A vs G) & matched N
 1:4
-
Note

This method of searching for motifs was implemented with smaller query motifs in mind.

If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.

source

Searching according to a pattern

Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.

Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}("MV+"). For bioregex literals, it is instead recommended to use the @biore_str macro:

The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna is a regular expression for DNA sequences and biore"A+"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: "d", "r" or "a", respectively.

Here are examples of using the regular expression for BioSequences:

julia> match(biore"A+C*"dna, dna"AAAACC")
+
Note

This method of searching for motifs was implemented with smaller query motifs in mind.

If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.

source
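
The effect of the error budget can be seen with the query from the docstring example. In this sketch the second argument is the number of errors permitted; the variable names `exact` and `fuzzy` are illustrative.

```julia
using BioSequences

seq   = dna"ACAGCGTAGCT"
query = ApproximateSearchQuery(dna"AGGG")

exact = findfirst(query, 0, seq)  # no errors allowed
fuzzy = findfirst(query, 1, seq)  # at most one error allowed
```

Since AGGG does not occur verbatim in seq, `exact` is `nothing`, while `fuzzy` is the range of a match within one error.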

Searching according to a pattern

Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.

Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}("MV+"). For bioregex literals, it is instead recommended to use the @biore_str macro:

The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna is a regular expression for DNA sequences and biore"A+"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: "d", "r" or "a", respectively.

Here are examples of using the regular expression for BioSequences:

julia> match(biore"A+C*"dna, dna"AAAACC")
 RegexMatch("AAAACC")
 
 julia> match(biore"A+C*"d, dna"AAAACC")
@@ -159,4 +159,4 @@
 3-element Vector{Int64}:
  3
  5
- 9

[Wasserman2004]: https://doi.org/10.1038/nrg1315

+ 9

[Wasserman2004]: https://doi.org/10.1038/nrg1315

diff --git a/dev/symbols/index.html b/dev/symbols/index.html index a1e52601..bce1db47 100644 --- a/dev/symbols/index.html +++ b/dev/symbols/index.html @@ -70,4 +70,4 @@ julia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C false -source
BioSymbols.isambiguousFunction
isambiguous(nt::NucleicAcid)

Test if nt is an ambiguous nucleotide.

source
isambiguous(aa::AminoAcid)

Test if aa is an ambiguous amino acid.

source
+source
BioSymbols.isambiguousFunction
isambiguous(nt::NucleicAcid)

Test if nt is an ambiguous nucleotide.

source
isambiguous(aa::AminoAcid)

Test if aa is an ambiguous amino acid.

source
diff --git a/dev/transforms/index.html b/dev/transforms/index.html index 6a220004..3681cd6e 100644 --- a/dev/transforms/index.html +++ b/dev/transforms/index.html @@ -15,7 +15,7 @@ julia> seq[5] = DNA_A DNA_A -
Note

Some types can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.

Modifying sequences

In addition to setindex!, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push!Method
push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

source
Base.pop!Method
pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.pushfirst!Method
pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

source
Base.popfirst!Method
popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.insert!Method
insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

source
Base.deleteat!Method
deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

source
Base.append!Method
append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

source
Base.resize!Method
resize!(seq, size, [force::Bool])

Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.

source
Base.empty!Method
empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

source

Here are some examples:

julia> seq = dna"ACG"
+
Note

Some types can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.
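
The difference between a range-indexed copy and a view can be sketched as follows. This assumes `view` is supported for LongSequence, as the note suggests; the variable names are illustrative.

```julia
using BioSequences

seq = LongDNA{4}("ACGTAC")
sub = seq[2:4]        # range indexing allocates an independent copy
v   = view(seq, 2:4)  # a view shares memory with seq

seq[2] = DNA_T        # mutate the parent sequence
```

After the mutation `sub` still holds the old symbols, while `v` reflects the change made to `seq`.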

Modifying sequences

In addition to setindex!, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push!Method
push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

source
Base.pop!Method
pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.pushfirst!Method
pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

source
Base.popfirst!Method
popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.insert!Method
insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

source
Base.deleteat!Method
deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

source
Base.append!Method
append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

source
Base.resize!Method
resize!(seq, size, [force::Bool])

Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.

source
Base.empty!Method
empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

source

Here are some examples:

julia> seq = dna"ACG"
 3nt DNA Sequence:
 ACG
 
@@ -34,7 +34,7 @@
 julia> deleteat!(seq, 2:3)
 3nt DNA Sequence:
 AAT
-

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse!Method
reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

source
Base.reverseMethod
reverse(seq::BioSequence)

Create reversed copy of a biological sequence.

source
reverse(seq::LongSequence)

Create reversed copy of a biological sequence.

source
BioSequences.complement!Function
complement!(seq)

Make a complement sequence of seq in place.

source
BioSymbols.complementFunction
complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
+

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse!Method
reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

source
Base.reverseMethod
reverse(seq::BioSequence)

Create reversed copy of a biological sequence.

source
reverse(seq::LongSequence)

Create reversed copy of a biological sequence.

source
BioSymbols.complementFunction
complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
 DNA_T
 
 julia> complement(DNA_N)
@@ -42,10 +42,10 @@
 
 julia> complement(RNA_U)
 RNA_A
-
source
complement(seq)

Make a complement sequence of seq.

source
BioSequences.canonical!Function
canonical!(seq::NucleotideSeq)

Transforms seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
+
source
complement(seq)

Make a complement sequence of seq.

source
BioSequences.canonical!Function
canonical!(seq::NucleotideSeq)

Transforms seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
 ATCGATCG
 CGATCGAT
-<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

The canonical sequence is the lesser of the two, i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

source

Some examples:

julia> seq = dna"ACGTAT"
+<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

The canonical sequence is the lesser of the two, i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

source
BioSequences.canonicalFunction
canonical(seq::NucleotideSeq)

Create the canonical sequence of seq.

source

Some examples:

julia> seq = dna"ACGTAT"
 6nt DNA Sequence:
 ACGTAT
 
@@ -60,7 +60,7 @@
 julia> reverse_complement!(seq)
 6nt DNA Sequence:
 ACGTAT
-

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.

Translation

Translation is a slightly more complex transformation for RNA sequences, so we describe it here in more detail.

The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes, and they are registered in ncbi_trans_table.

BioSequences.translateFunction
translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate a LongRNA or a LongDNA to a LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

source
BioSequences.ncbi_trans_tableConstant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the following link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

source
julia> ncbi_trans_table
+

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.
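
The difference between the copying and the in-place versions can be sketched as follows. This assumes BioSequences.jl v3; the input sequence and the variable names are made up for illustration.

```julia
using BioSequences

seq = LongDNA{4}("TTAG")

# Non-mutating versions return a new sequence and leave `seq` untouched.
rev   = reverse(seq)             # GATT
comp  = complement(seq)          # AATC
rc    = reverse_complement(seq)  # CTAA
canon = canonical(seq)           # CTAA, the lesser of `seq` and `rc`

# Mutating versions (note the `!`) change `seq` itself.
reverse_complement!(seq)         # `seq` is now CTAA
```

Prefer the copying versions when the original sequence is still needed afterwards. The in-place versions avoid allocating a second sequence.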

Translation

Translation is a slightly more complex transformation for RNA sequences, so we describe it here in more detail.

The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes, and they are registered in ncbi_trans_table.

BioSequences.translateFunction
translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate a LongRNA or a LongDNA to a LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

source
BioSequences.ncbi_trans_tableConstant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the following link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

source
julia> ncbi_trans_table
 Translation Tables:
   1. The Standard Code (standard_genetic_code)
   2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
@@ -81,4 +81,4 @@
  23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
  24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
  25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
-

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes

+

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes
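
A minimal sketch of translate with the default code and with an alternative table. It assumes BioSequences.jl v3, where code is passed as a keyword argument; the example sequences are invented for illustration.

```julia
using BioSequences

rna = LongRNA{4}("AUGGCCUAA")

# Standard genetic code (the default): AUG -> M, GCC -> A, UAA -> stop (*).
protein = translate(rna)

# The same codons can mean different things under another table.
# Index 2 of ncbi_trans_table is the vertebrate mitochondrial code.
mito_code = BioSequences.ncbi_trans_table[2]
short = LongRNA{4}("AUAUGA")
std   = translate(short)                    # AUA -> I, UGA -> stop
mito  = translate(short; code = mito_code)  # AUA -> M, UGA -> W
```

Selecting the table by index keeps the example tied to the numbered list printed by ncbi_trans_table.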

diff --git a/dev/types/index.html b/dev/types/index.html index b7eefc02..de6dea13 100644 --- a/dev/types/index.html +++ b/dev/types/index.html @@ -1,2 +1,2 @@ -BioSequences Types · BioSequences.jl

Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allow many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequenceType
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet, A by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

source

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of the methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.

BioSequences.encoded_data_eltypeFunction
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

source
BioSequences.extract_encoded_elementFunction
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

source
BioSequences.encoded_setindex!Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence

source

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of that type.

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.

BioSequences.AlphabetType
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the encoding of a BioSymbol of the alphabet's element type to some internal "encoded data", as well as the decoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This gives a tuple of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].

For increased performance, see BioSequences.AsciiAlphabet

source
BioSequences.AsciiAlphabetType
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).

source

Concrete types

Implemented alphabets

BioSequences.DNAAlphabetType

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source
BioSequences.RNAAlphabetType

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source

Long Sequences

BioSequences.LongSequenceType
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type | Symbol type | Type alias
LongSequence{DNAAlphabet{N}} | DNA | LongDNA{N}
LongSequence{RNAAlphabet{N}} | RNA | LongRNA{N}
LongSequence{AminoAcidAlphabet} | AminoAcid | LongAA

The LongDNA and LongRNA aliases use a DNAAlphabet{4}.

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).

Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.

source

Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.

The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.

+BioSequences Types · BioSequences.jl

Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allow many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequenceType
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet, A by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

source

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of the methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.

BioSequences.encoded_data_eltypeFunction
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

source
BioSequences.extract_encoded_elementFunction
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

source
BioSequences.encoded_setindex!Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence

source
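
To make the relationship between these interface functions concrete, here is a hedged sketch that decodes a LongSequence by hand. It assumes BioSequences.jl v3 on a 64-bit system, and that the unexported functions can be reached through the BioSequences module prefix as shown; ordinary code would simply index the sequence.

```julia
using BioSequences

seq = LongDNA{4}("ACGT")

# Element type of the raw encoded data (an unsigned integer type).
E = BioSequences.encoded_data_eltype(typeof(seq))

# Pull out the raw encoded value at each position and decode it with the
# sequence's alphabet. This should match ordinary indexing, seq[i].
alphabet = DNAAlphabet{4}()
decoded = [BioSequences.decode(alphabet, BioSequences.extract_encoded_element(seq, i))
           for i in 1:length(seq)]
```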

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of that type.

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.

BioSequences.AlphabetType
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the encoding of a BioSymbol of the alphabet's element type to some internal "encoded data", as well as the decoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This gives a tuple of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].

For increased performance, see BioSequences.AsciiAlphabet

source
BioSequences.AsciiAlphabetType
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).

source
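
The Alphabet interface described above can be explored on a built-in alphabet. The sketch below assumes BioSequences.jl v3 and calls the interface functions through the BioSequences module prefix, on the assumption that not all of them are exported.

```julia
using BioSequences

alphabet = DNAAlphabet{2}()

S    = eltype(typeof(alphabet))        # symbol type of the alphabet (DNA)
syms = BioSequences.symbols(alphabet)  # tuple of every symbol the alphabet holds

# encode maps a symbol to its internal unsigned representation,
# and decode is the inverse, so a round trip should return the symbol.
roundtrip = all(syms) do s
    BioSequences.decode(alphabet, BioSequences.encode(alphabet, s)) == s
end

# Trait giving the number of bits used per symbol (2 for this alphabet).
bits = BioSequences.BitsPerSymbol(alphabet)
```

Note that every function except eltype takes an alphabet instance, not the type, matching the convention stated in the list above.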

Concrete types

Implemented alphabets

BioSequences.DNAAlphabetType

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source
BioSequences.RNAAlphabetType

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source

Long Sequences

BioSequences.LongSequenceType
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type | Symbol type | Type alias
LongSequence{DNAAlphabet{N}} | DNA | LongDNA{N}
LongSequence{RNAAlphabet{N}} | RNA | LongRNA{N}
LongSequence{AminoAcidAlphabet} | AminoAcid | LongAA

The LongDNA and LongRNA aliases use a DNAAlphabet{4}.

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).

Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.

source
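
A small sketch of switching between the 4-bit and 2-bit alphabets. It assumes BioSequences.jl v3; the sequences are invented, and the exact error type thrown on a failed conversion is not relied upon.

```julia
using BioSequences

# 4-bit alphabet: ambiguity codes such as N are allowed.
seq4 = LongDNA{4}("ACGTNACG")

# 2-bit alphabet: only A, C, G and T can be stored.
seq2 = LongDNA{2}("ACGTTACG")

# Re-encoding works as long as every symbol fits in the target alphabet.
widened = LongDNA{4}(seq2)

# Narrowing a sequence that contains N is expected to fail.
narrowing_failed = try
    LongDNA{2}(seq4)
    false
catch
    true
end
```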

Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.

The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.
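
The sharing of data between a view and its parent can be sketched like this, assuming BioSequences.jl v3, where view on a LongSequence with a unit range returns a LongSubSeq.

```julia
using BioSequences

seq = LongDNA{4}("ACGTAC")

# A view over positions 2:4; no sequence data is copied.
v = view(seq, 2:4)   # CGT

# Writing through the view changes the parent...
v[1] = DNA_A         # seq is now AAGTAC

# ...and writing to the parent is visible through the view.
seq[3] = DNA_C       # v is now ACT
```

Because the view aliases the parent's data, take a copy first if the two need to evolve independently.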