From 76b3e8f5f38bf26c1bc35092aa4fe3734805395d Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Fri, 25 Oct 2024 10:22:00 +0000 Subject: [PATCH] build based on 295ba89 --- dev/.documenter-siteinfo.json | 2 +- dev/construction/index.html | 12 ++++++------ dev/counting/index.html | 12 ++++++------ dev/index.html | 2 +- dev/interfaces/index.html | 2 +- dev/io/index.html | 2 +- dev/predicates/index.html | 6 +++--- dev/random/index.html | 4 ++-- dev/recipes/index.html | 2 +- dev/search_index.js | 2 +- dev/sequence_search/index.html | 6 +++--- dev/symbols/index.html | 2 +- dev/transforms/index.html | 12 ++++++------ dev/types/index.html | 2 +- 14 files changed, 34 insertions(+), 34 deletions(-) diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 364b959a..19fb67a3 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-10-24T17:54:15","documenter_version":"1.7.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-10-25T10:21:53","documenter_version":"1.7.0"}} \ No newline at end of file diff --git a/dev/construction/index.html b/dev/construction/index.html index 62ecc191..2003f044 100644 --- a/dev/construction/index.html +++ b/dev/construction/index.html @@ -135,11 +135,11 @@ "TAGA" julia> string(push!(f(), DNA_A)) -"TAGA"source
BioSequences.@rna_strMacro

The LongRNA{4} equivalent to @dna_str

See also: @dna_str, @aa_str

Examples

julia> rna"UCGUGAUGC"
+"TAGA"
source
BioSequences.@rna_strMacro

The LongRNA{4} equivalent to @dna_str

See also: @dna_str, @aa_str

Examples

julia> rna"UCGUGAUGC"
 9nt RNA Sequence:
-UCGUGAUGC
source
BioSequences.@aa_strMacro

The AminoAcidAlphabet equivalent to @dna_str

See also: @dna_str, @rna_str

Examples

julia> aa"PKLEQC"
+UCGUGAUGC
source
BioSequences.@aa_strMacro

The AminoAcidAlphabet equivalent to @dna_str

See also: @dna_str, @rna_str

Examples

julia> aa"PKLEQC"
 6aa Amino Acid Sequence:
-PKLEQC
source

Loose parsing

As of version 3.2.0, BioSequences.jl provides the bioseq function, which can be used to build a LongSequence from a string (or an AbstractVector{UInt8}) without knowing the correct Alphabet.

julia> bioseq("ATGTGCTGA")
+PKLEQC
source

Loose parsing

As of version 3.2.0, BioSequences.jl provides the bioseq function, which can be used to build a LongSequence from a string (or an AbstractVector{UInt8}) without knowing the correct Alphabet.

julia> bioseq("ATGTGCTGA")
 9nt DNA Sequence:
 ATGTGCTGA

The function will prioritise 2-bit alphabets over 4-bit alphabets, and prefer smaller alphabets (like DNAAlphabet{4}) over larger (like AminoAcidAlphabet). If the input cannot be encoded by any of the built-in alphabets, an error is thrown:

julia> bioseq("0!(CC!;#&&%")
 ERROR: cannot encode 0x30 in AminoAcidAlphabet
@@ -153,7 +153,7 @@
 
 julia> bioseq("PKMW#3>>0;kL")
 ERROR: cannot encode 0x23 in AminoAcidAlphabet
-[...]
source
BioSequences.guess_alphabetFunction
guess_alphabet(s::Union{AbstractString, AbstractVector{UInt8}}) -> Union{Integer, Alphabet}

Pick an Alphabet that can encode input s. If no Alphabet can, return the index of the first byte of the input which is not encodable in any alphabet. This function only knows about the alphabets listed below. If multiple alphabets are possible, pick the first from the order below (i.e. DNAAlphabet{2}() if possible, otherwise RNAAlphabet{2}() etc).

  1. DNAAlphabet{2}()
  2. RNAAlphabet{2}()
  3. DNAAlphabet{4}()
  4. RNAAlphabet{4}()
  5. AminoAcidAlphabet()
Warning

The functions bioseq and guess_alphabet are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.

Examples

julia> guess_alphabet("AGGCA")
+[...]
source
BioSequences.guess_alphabetFunction
guess_alphabet(s::Union{AbstractString, AbstractVector{UInt8}}) -> Union{Integer, Alphabet}

Pick an Alphabet that can encode input s. If no Alphabet can, return the index of the first byte of the input which is not encodable in any alphabet. This function only knows about the alphabets listed below. If multiple alphabets are possible, pick the first from the order below (i.e. DNAAlphabet{2}() if possible, otherwise RNAAlphabet{2}() etc).

  1. DNAAlphabet{2}()
  2. RNAAlphabet{2}()
  3. DNAAlphabet{4}()
  4. RNAAlphabet{4}()
  5. AminoAcidAlphabet()
Warning

The functions bioseq and guess_alphabet are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.

Examples

julia> guess_alphabet("AGGCA")
 DNAAlphabet{2}()
 
 julia> guess_alphabet("WKLQSTV")
@@ -163,10 +163,10 @@
 5
 
 julia> guess_alphabet("UAGCSKMU")
-RNAAlphabet{4}()
source

Comparison to other sequence types

Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. compare a BioSequence with a vector of DNA, compare the elements themselves:

julia> seq = dna"GAGCTGA"; vec = collect(seq);
+RNAAlphabet{4}()
source
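
Because guess_alphabet returns either an Alphabet or an integer index, callers have to branch on the result. The following is a minimal sketch of that pattern, assuming a session with BioSequences 3.2 or later loaded. The helper name parse_loosely is hypothetical and not part of the package.

```julia
using BioSequences

# Hypothetical helper (not part of BioSequences): build a sequence with the
# guessed alphabet, or report the index of the offending byte.
function parse_loosely(s::AbstractString)
    A = guess_alphabet(s)
    if A isa Alphabet
        return LongSequence{typeof(A)}(s)
    else
        # A is the index of the first byte no built-in alphabet can encode
        error("byte at index $A cannot be encoded by any built-in alphabet")
    end
end

seq = parse_loosely("AGGCA")   # alphabet is chosen at run time (type unstable)
```

In package code, an explicit constructor such as LongDNA{4}(s) is likely the better choice, in line with the warning above.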

Comparison to other sequence types

Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. compare a BioSequence with a vector of DNA, compare the elements themselves:

julia> seq = dna"GAGCTGA"; vec = collect(seq);
 
 julia> seq == vec, isequal(seq, vec)
 (false, false)
 
 julia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))
-true 
+true diff --git a/dev/counting/index.html b/dev/counting/index.html index 9920cebf..c6402be4 100644 --- a/dev/counting/index.html +++ b/dev/counting/index.html @@ -9,24 +9,24 @@ 3 julia> matches(dna"AACA", dna"AAG") -2source
BioSequences.mismatchesFunction
mismatches(a::BioSequence, b::BioSequence) -> Int

Count the number of positions where a and b differ. If the lengths of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.

Warning

Passing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.

Examples

julia> mismatches(dna"TAGCTA", dna"TACNTA")
+2
source
BioSequences.mismatchesFunction
mismatches(a::BioSequence, b::BioSequence) -> Int

Count the number of positions where a and b differ. If the lengths of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.

Warning

Passing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.

Examples

julia> mismatches(dna"TAGCTA", dna"TACNTA")
 2
 
 julia> mismatches(dna"AACA", dna"AAG")
-1
source

GC content

The convenience function gc_content(seq) is equivalent to count(isGC, seq) / length(seq):

BioSequences.gc_contentFunction
gc_content(seq::BioSequence) -> Float64

Calculate GC content of seq, i.e. the number of symbols that are DNA_C, DNA_G, RNA_C or RNA_G divided by the length of the sequence.

Examples

julia> gc_content(dna"AGCTA")
+1
source
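
The counting page describes mismatches as equivalent to an element-wise comparison over zip(a, b). A short sketch of that equivalence, assuming Julia 1.9 or later (where splat is exported from Base) and two sequences of equal length:

```julia
using BioSequences

a = dna"TAGCTA"
b = dna"TACNTA"   # same length as a; differing lengths are deprecated

# Element-wise comparison: DNA_C vs DNA_N counts as a mismatch, since
# ambiguous symbols get no special handling.
manual = count(splat(!=), zip(a, b))

manual == mismatches(a, b)   # both count 2 differing positions
```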

GC content

The convenience function gc_content(seq) is equivalent to count(isGC, seq) / length(seq):

BioSequences.gc_contentFunction
gc_content(seq::BioSequence) -> Float64

Calculate GC content of seq, i.e. the number of symbols that are DNA_C, DNA_G, RNA_C or RNA_G divided by the length of the sequence.

Examples

julia> gc_content(dna"AGCTA")
 0.4
 
 julia> gc_content(rna"UAGCGA")
-0.5
source

Deprecated aliases

Several of the optimised count methods also have named aliases, which are deprecated:

Deprecated function | Instead use
n_gaps | count(isgap, seq)
n_certain | count(iscertain, seq)
n_ambiguous | count(isambiguous, seq)
BioSequences.n_gapsFunction
n_gaps(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) has a gap. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_gaps(dna"--TAC-WN-ACY")
+0.5
source
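
The equivalence stated at the top of this section can be checked directly. A minimal sketch, assuming BioSequences is loaded:

```julia
using BioSequences

seq = dna"AGCTA"

# Two of the five symbols (G and C) satisfy isGC.
manual = count(isGC, seq) / length(seq)

manual ≈ gc_content(seq)   # both are 0.4 for this sequence
```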

Deprecated aliases

Several of the optimised count methods also have named aliases, which are deprecated:

Deprecated function | Instead use
n_gaps | count(isgap, seq)
n_certain | count(iscertain, seq)
n_ambiguous | count(isambiguous, seq)
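
As a sketch of the migration, the single-sequence calls can be rewritten with count and the matching predicate. The variable names below are illustrative only; the example sequence is the one used in the docstrings that follow.

```julia
using BioSequences

seq = dna"--TAC-WN-ACY"

n_gap   = count(isgap, seq)        # replaces n_gaps(seq)
n_cert  = count(iscertain, seq)    # replaces n_certain(seq)
n_ambig = count(isambiguous, seq)  # replaces n_ambiguous(seq)

# Every position is exactly one of: gap, certain, or ambiguous.
n_gap + n_cert + n_ambig == length(seq)
```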
BioSequences.n_gapsFunction
n_gaps(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) has a gap. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_gaps(dna"--TAC-WN-ACY")
 4
 
 julia> n_gaps(dna"TC-AC-", dna"-CACG")
-2
source
BioSequences.n_certainFunction
n_certain(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (and b, if present) have certain (i.e. non-ambiguous and non-gap) symbols. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence. Gaps are not certain.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_certain(dna"--TAC-WN-ACY")
+2
source
BioSequences.n_certainFunction
n_certain(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (and b, if present) have certain (i.e. non-ambiguous and non-gap) symbols. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence. Gaps are not certain.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_certain(dna"--TAC-WN-ACY")
 5
 
 julia> n_certain(rna"UAYWW", rna"UAW")
-2
source
BioSequences.n_ambiguousFunction
n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) has an ambiguous symbol. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence. Gaps are not ambiguous.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_ambiguous(dna"--TAC-WN-ACY")
+2
source
BioSequences.n_ambiguousFunction
n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) has an ambiguous symbol. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence. Gaps are not ambiguous.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_ambiguous(dna"--TAC-WN-ACY")
 3
 
 julia> n_ambiguous(rna"UAYWW", rna"UAW")
-1
source
+1source diff --git a/dev/index.html b/dev/index.html index c7a42d23..08c77045 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · BioSequences.jl

BioSequences

Latest Release MIT license Documentation Pkg Status

Description

BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.

Installation

You can install BioSequences from the Julia REPL. Press ] to enter pkg mode, then enter the following:

add BioSequences

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia Discourse site.

+Home · BioSequences.jl

BioSequences

Latest Release MIT license Documentation Pkg Status

Description

BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.

Installation

You can install BioSequences from the Julia REPL. Press ] to enter pkg mode, then enter the following:

add BioSequences

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia Discourse site.

diff --git a/dev/interfaces/index.html b/dev/interfaces/index.html index 98a77444..b4d9ed5e 100644 --- a/dev/interfaces/index.html +++ b/dev/interfaces/index.html @@ -59,4 +59,4 @@ julia> Base.copy(seq::Codon) = Codon(seq.x) julia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false) -true

Interface checking functions

BioSequences.has_interfaceFunction
function has_interface(::Type{Alphabet}, A::Alphabet)

Returns whether A conforms to the Alphabet interface.

source
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)

Check if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. If the compat flag is set, check for compatibility with existing alphabets.

source
+true

Interface checking functions

BioSequences.has_interfaceFunction
function has_interface(::Type{Alphabet}, A::Alphabet)

Returns whether A conforms to the Alphabet interface.

source
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)

Check if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. If the compat flag is set, check for compatibility with existing alphabets.

source
diff --git a/dev/io/index.html b/dev/io/index.html index 5d14b662..9a77cc6f 100644 --- a/dev/io/index.html +++ b/dev/io/index.html @@ -1,2 +1,2 @@ -I/O · BioSequences.jl

I/O for sequencing file formats

Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.

After version v2.0, in order to neatly separate concerns, these submodules were removed.

Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.

A list of all of the different formats and packages is provided below to help you find them quickly.

Format | Package
FASTA | FASTX.jl
FASTQ | FASTX.jl
2Bit | TwoBit.jl
+I/O · BioSequences.jl

I/O for sequencing file formats

Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.

After version v2.0, in order to neatly separate concerns, these submodules were removed.

Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.

A list of all of the different formats and packages is provided below to help you find them quickly.

Format | Package
FASTA | FASTX.jl
FASTQ | FASTX.jl
2Bit | TwoBit.jl
diff --git a/dev/predicates/index.html b/dev/predicates/index.html index a76f67db..f7a79c3a 100644 --- a/dev/predicates/index.html +++ b/dev/predicates/index.html @@ -1,12 +1,12 @@ -Predicates · BioSequences.jl

Predicates

A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.

BioSequences.isrepetitiveFunction
isrepetitive(seq::BioSequence, n::Integer = length(seq))

Return true if and only if seq contains a repetitive subsequence of length ≥ n.

source
BioSequences.ispalindromicFunction
ispalindromic(seq::NucSeq) -> Bool

Check if seq is palindromic. A palindromic sequence is identical to its reverse-complement, so this should be equivalent to checking if seq == reverse_complement(seq).

Examples

julia> ispalindromic(dna"TGCA")
+Predicates · BioSequences.jl

Predicates

A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.

BioSequences.isrepetitiveFunction
isrepetitive(seq::BioSequence, n::Integer = length(seq))

Return true if and only if seq contains a repetitive subsequence of length ≥ n.

source
BioSequences.ispalindromicFunction
ispalindromic(seq::NucSeq) -> Bool

Check if seq is palindromic. A palindromic sequence is identical to its reverse-complement, so this should be equivalent to checking if seq == reverse_complement(seq).

Examples

julia> ispalindromic(dna"TGCA")
 true
 
 julia> ispalindromic(dna"TCCT")
 false
 
 julia> ispalindromic(rna"ACGGU")
-false

Return true if seq is a palindromic sequence; otherwise return false.

source
BioSequences.iscanonicalFunction
iscanonical(seq::NucleotideSeq)

Returns true if seq is canonical.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
+false

Return true if seq is a palindromic sequence; otherwise return false.

source
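
The equivalence mentioned in the ispalindromic docstring can be spelled out as a short check. This is only a sketch, using the two DNA sequences from the docstring examples:

```julia
using BioSequences

for seq in (dna"TGCA", dna"TCCT")
    # A sequence is palindromic when it equals its own reverse complement.
    @assert ispalindromic(seq) == (seq == reverse_complement(seq))
end
```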
BioSequences.iscanonicalFunction
iscanonical(seq::NucleotideSeq)

Returns true if seq is canonical.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
 ATCGATCG
 CGATCGAT
-<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser, i.e. canonical_seq < other_seq.

source
+<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser, i.e. canonical_seq < other_seq.

source
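
A small sketch of the rule described above, using the sequence from the strand diagram. It assumes that the ordinary comparison operators on sequences reflect the ordering the docstring refers to:

```julia
using BioSequences

seq = dna"ATCGATCG"
rc  = reverse_complement(seq)   # the other strand: CGATCGAT

# The canonical form is whichever of the two sorts first.
iscanonical(seq) == (seq <= rc)
```

Here seq begins with A while its reverse complement begins with C, so seq is the canonical one of the pair.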
diff --git a/dev/random/index.html b/dev/random/index.html index 8648e87f..302b8737 100644 --- a/dev/random/index.html +++ b/dev/random/index.html @@ -1,8 +1,8 @@ Random sequences · BioSequences.jl

Generating random sequences

Long sequences

You can generate random long sequences using the randdnaseq function and the Samplers implemented in BioSequences:

BioSequences.randseqFunction
randseq([rng::AbstractRNG], A::Alphabet, len::Integer)

Generate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. User-defined alphabets should implement this method to implement random LongSequence generation.

For RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. For a user-defined alphabet A, default is uniform across all elements of symbols(A).

Example:

julia> seq = randseq(AminoAcidAlphabet(), 50)
 50aa Amino Acid Sequence:
-VFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM
source
randseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)

Generate a LongSequence{A} of length len with elements drawn from the given sampler.

Example:

# Generate 50-length RNA with 4% chance of N, 24% for A, C, G, or U
+VFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM
source
randseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)

Generate a LongSequence{A} of length len with elements drawn from the given sampler.

Example:

# Generate 50-length RNA with 4% chance of N, 24% for A, C, G, or U
 julia> sp = SamplerWeighted(rna"ACGUN", fill(0.24, 4))
 julia> seq = randseq(RNAAlphabet{4}(), sp, 50)
 50nt RNA Sequence:
-CUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU
source
BioSequences.randdnaseqFunction
randdnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]

source
BioSequences.randrnaseqFunction
randrnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]

source
BioSequences.randaaseqFunction
randaaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.

source
BioSequences.SamplerUniformType
SamplerUniform{T}

Uniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.

Examples

julia> sp = SamplerUniform(rna"ACGU");
source
BioSequences.SamplerWeightedType
SamplerWeighted{T}

Weighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.

Examples

julia> sp = SamplerWeighted(rna"ACGUN", fill(0.2475, 4));
source
+CUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNUsource
BioSequences.randdnaseqFunction
randdnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]

source
BioSequences.randrnaseqFunction
randrnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]

source
BioSequences.randaaseqFunction
randaaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.

source
BioSequences.SamplerUniformType
SamplerUniform{T}

Uniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.

Examples

julia> sp = SamplerUniform(rna"ACGU");
source
BioSequences.SamplerWeightedType
SamplerWeighted{T}

Weighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.

Examples

julia> sp = SamplerWeighted(rna"ACGUN", fill(0.2475, 4));
source
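
To illustrate how the weights line up with the symbols, here is a sketch that combines SamplerWeighted with randseq, mirroring the randseq docstring example. The output is random, so none is shown.

```julia
using BioSequences

# Four explicit weights for A, C, G and U; the fifth symbol (N) receives
# the remaining probability, 1 - 4 * 0.24 = 0.04.
sp = SamplerWeighted(rna"ACGUN", fill(0.24, 4))

seq = randseq(RNAAlphabet{4}(), sp, 50)   # random 50nt RNA sequence
```

Passing an rng as the first argument to randseq should make the draw reproducible, as the signature randseq([rng::AbstractRNG], ...) suggests.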
diff --git a/dev/recipes/index.html b/dev/recipes/index.html index c0d3aa03..a4cf1ae7 100644 --- a/dev/recipes/index.html +++ b/dev/recipes/index.html @@ -29,4 +29,4 @@ 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 - 1 0 1 1 1 0 1 1 0 1 + 1 0 1 1 1 0 1 1 0 1 diff --git a/dev/search_index.js b/dev/search_index.js index 40c54a97..d5e66b43 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"symbols/#Biological-symbols","page":"Biological Symbols","title":"Biological symbols","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"The BioSequences module reexports the biological symbol (character) types that are provided by BioSymbols.jl:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Type Meaning\nDNA DNA nucleotide\nRNA RNA nucleotide\nAminoAcid Amino acid","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"These symbols are elements of biological sequence types, just as characters are elements of strings.","category":"page"},{"location":"symbols/#DNA-and-RNA-nucleotides","page":"Biological Symbols","title":"DNA and RNA nucleotides","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Set of nucleotide symbols in BioSequences covers IUPAC nucleotide base plus a gap symbol:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbol Constant Meaning\n'A' DNA_A / RNA_A A; Adenine\n'C' DNA_C / RNA_C C; Cytosine\n'G' DNA_G / RNA_G G; Guanine\n'T' DNA_T T; Thymine (DNA only)\n'U' RNA_U U; Uracil (RNA 
only)\n'M' DNA_M / RNA_M A or C\n'R' DNA_R / RNA_R A or G\n'W' DNA_W / RNA_W A or T/U\n'S' DNA_S / RNA_S C or G\n'Y' DNA_Y / RNA_Y C or T/U\n'K' DNA_K / RNA_K G or T/U\n'V' DNA_V / RNA_V A or C or G; not T/U\n'H' DNA_H / RNA_H A or C or T; not G\n'D' DNA_D / RNA_D A or G or T/U; not C\n'B' DNA_B / RNA_B C or G or T/U; not A\n'N' DNA_N / RNA_N A or C or G or T/U\n'-' DNA_Gap / RNA_Gap Gap (none of the above)","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"https://www.bioinformatics.org/sms/iupac.html","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols are accessible as constants with DNA_ or RNA_ prefix:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> DNA_A\nDNA_A\n\njulia> DNA_T\nDNA_T\n\njulia> RNA_U\nRNA_U\n\njulia> DNA_Gap\nDNA_Gap\n\njulia> typeof(DNA_A)\nDNA\n\njulia> typeof(RNA_A)\nRNA\n","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols can be constructed by converting regular characters:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> convert(DNA, 'C')\nDNA_C\n\njulia> convert(DNA, 'C') === DNA_C\ntrue\n","category":"page"},{"location":"symbols/#Amino-acids","page":"Biological Symbols","title":"Amino acids","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Set of amino acid symbols also covers IUPAC amino acid symbols plus a gap symbol:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbol Constant Meaning\n'A' AA_A Alanine\n'R' AA_R Arginine\n'N' AA_N Asparagine\n'D' AA_D Aspartic acid (Aspartate)\n'C' AA_C Cysteine\n'Q' AA_Q Glutamine\n'E' AA_E Glutamic acid (Glutamate)\n'G' AA_G Glycine\n'H' AA_H 
Histidine\n'I' AA_I Isoleucine\n'L' AA_L Leucine\n'K' AA_K Lysine\n'M' AA_M Methionine\n'F' AA_F Phenylalanine\n'P' AA_P Proline\n'S' AA_S Serine\n'T' AA_T Threonine\n'W' AA_W Tryptophan\n'Y' AA_Y Tyrosine\n'V' AA_V Valine\n'O' AA_O Pyrrolysine\n'U' AA_U Selenocysteine\n'B' AA_B Aspartic acid or Asparagine\n'J' AA_J Leucine or Isoleucine\n'Z' AA_Z Glutamine or Glutamic acid\n'X' AA_X Any amino acid\n'*' AA_Term Termination codon\n'-' AA_Gap Gap (none of the above)","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"https://www.bioinformatics.org/sms/iupac.html","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols are accessible as constants with AA_ prefix:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> AA_A\nAA_A\n\njulia> AA_Q\nAA_Q\n\njulia> AA_Term\nAA_Term\n\njulia> typeof(AA_A)\nAminoAcid\n","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols can be constructed by converting regular characters:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> convert(AminoAcid, 'A')\nAA_A\n\njulia> convert(AminoAcid, 'P') === AA_P\ntrue\n","category":"page"},{"location":"symbols/#Other-functions","page":"Biological Symbols","title":"Other functions","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"alphabet\ngap\niscompatible\nisambiguous","category":"page"},{"location":"symbols/#BioSymbols.alphabet","page":"Biological Symbols","title":"BioSymbols.alphabet","text":"alphabet(DNA)\n\nGet all symbols of DNA in sorted order.\n\nExamples\n\njulia> alphabet(DNA)\n(DNA_Gap, DNA_A, DNA_C, DNA_M, DNA_G, DNA_R, DNA_S, DNA_V, DNA_T, DNA_W, DNA_Y, DNA_H, DNA_K, DNA_D, DNA_B, DNA_N)\n\njulia> 
issorted(alphabet(DNA))\ntrue\n\n\n\n\n\n\nalphabet(RNA)\n\nGet all symbols of RNA in sorted order.\n\nExamples\n\njulia> alphabet(RNA)\n(RNA_Gap, RNA_A, RNA_C, RNA_M, RNA_G, RNA_R, RNA_S, RNA_V, RNA_U, RNA_W, RNA_Y, RNA_H, RNA_K, RNA_D, RNA_B, RNA_N)\n\njulia> issorted(alphabet(RNA))\ntrue\n\n\n\n\n\n\nalphabet(AminoAcid)\n\nGet all symbols of AminoAcid in sorted order.\n\nExamples\n\njulia> alphabet(AminoAcid)\n(AA_A, AA_R, AA_N, AA_D, AA_C, AA_Q, AA_E, AA_G, AA_H, AA_I, AA_L, AA_K, AA_M, AA_F, AA_P, AA_S, AA_T, AA_W, AA_Y, AA_V, AA_O, AA_U, AA_B, AA_J, AA_Z, AA_X, AA_Term, AA_Gap)\n\njulia> issorted(alphabet(AminoAcid))\ntrue\n\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.gap","page":"Biological Symbols","title":"BioSymbols.gap","text":"gap(::Type{T})::T\n\nReturn the gap (indel) representation of T. By default, gap is defined for DNA, RNA, AminoAcid and Char.\n\nExamples\n\njulia> gap(RNA)\nRNA_Gap\n\njulia> gap(Char)\n'-': ASCII/Unicode U+002D (category Pd: Punctuation, dash)\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.iscompatible","page":"Biological Symbols","title":"BioSymbols.iscompatible","text":"iscompatible(x::S, y::S) where S <: BioSymbol\n\nTest if x and y are compatible with each other.\n\nExamples\n\njulia> iscompatible(AA_A, AA_R)\nfalse\n\njulia> iscompatible(AA_A, AA_X)\ntrue\n\njulia> iscompatible(DNA_A, DNA_A)\ntrue\n\njulia> iscompatible(DNA_C, DNA_N) # DNA_N can be DNA_C\ntrue\n\njulia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C\nfalse\n\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.isambiguous","page":"Biological Symbols","title":"BioSymbols.isambiguous","text":"isambiguous(nt::NucleicAcid)\n\nTest if nt is an ambiguous nucleotide.\n\n\n\n\n\nisambiguous(aa::AminoAcid)\n\nTest if aa is an ambiguous amino acid.\n\n\n\n\n\n","category":"function"},{"location":"io/#I/O-for-sequencing-file-formats","page":"I/O","title":"I/O for sequencing file 
formats","text":"","category":"section"},{"location":"io/","page":"I/O","title":"I/O","text":"Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"After version v2.0, in order to neatly separate concerns, these submodules were removed.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"A list of all of the different formats and packages is provided below to help you find them quickly.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"Format Package\nFASTA FASTX.jl\nFASTQ FASTX.jl\n2Bit TwoBit.jl","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"counting/#Counting","page":"Counting","title":"Counting","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"BioSequences contains functionality to efficiently count biosymbols in a biosequence that satisfies some predicate.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"Consider a naive counting function like this:","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"function count_Ns(seq::BioSequence{<:DNAAlphabet})\n ns = 0\n for i in seq\n ns += (i == DNA_N)::Bool\n end\n ns\nend ","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"This function can be more efficiently implemented by exploiting the internal data layout of certain biosequences. 
Therefore, BioSequences provides optimised methods for Base.count, such that count_Ns above can be more efficiently expressed as count(==(DNA_N), seq).","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"note: Note\nIt is important to understand that this speed is achieved with custom methods of Base.count, and not by a generic mechanism that improves the speed of counting symbols in a BioSequence in general. Hence, while count(==(DNA_N), seq) may be optimised, count(i -> i == DNA_N, seq) is not, as this is a different method.","category":"page"},{"location":"counting/#Currently-optimised-methods","page":"Counting","title":"Currently optimised methods","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"By default, only the BioSequence and Alphabet types found in BioSequences.jl have optimised methods.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"count(isGC, seq)\ncount(isambiguous, seq)\ncount(iscertain, seq)\ncount(isgap, seq)\ncount(==(biosymbol), seq) and count(isequal(biosymbol), seq)","category":"page"},{"location":"counting/#Matches-and-mismatches","page":"Counting","title":"Matches and mismatches","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"The methods matches and mismatches take two sequences and count the number of positions where the sequences are equal or unequal, respectively.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"They are equivalent to matches(a, b) = count(splat(==), zip(a, b)) (and likewise with != for mismatches).","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"matches\nmismatches","category":"page"},{"location":"counting/#BioSequences.matches","page":"Counting","title":"BioSequences.matches","text":"matches(a::BioSequence, b::BioSequence) -> Int\n\nCount the number of positions where a and b 
are equal. If the lengths of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.\n\nwarning: Warning\nPassing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.\n\nExamples\n\njulia> matches(dna\"TAWNNA\", dna\"TACCTA\")\n3\n\njulia> matches(dna\"AACA\", dna\"AAG\")\n2\n\n\n\n\n\n","category":"function"},{"location":"counting/#BioSequences.mismatches","page":"Counting","title":"BioSequences.mismatches","text":"mismatches(a::BioSequence, b::BioSequence) -> Int\n\nCount the number of positions where a and b differ. If the lengths of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.\n\nwarning: Warning\nPassing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.\n\nExamples\n\njulia> mismatches(dna\"TAGCTA\", dna\"TACNTA\")\n2\n\njulia> mismatches(dna\"AACA\", dna\"AAG\")\n1\n\n\n\n\n\n","category":"function"},{"location":"counting/#GC-content","page":"Counting","title":"GC content","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"The convenience function gc_content(seq) is equivalent to count(isGC, seq) / length(seq):","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"gc_content","category":"page"},{"location":"counting/#BioSequences.gc_content","page":"Counting","title":"BioSequences.gc_content","text":"gc_content(seq::BioSequence) -> Float64\n\nCalculate GC content of seq, i.e. 
the number of symbols that are DNA_C, DNA_G, RNA_C or RNA_G divided by the length of the sequence.\n\nExamples\n\njulia> gc_content(dna\"AGCTA\")\n0.4\n\njulia> gc_content(rna\"UAGCGA\")\n0.5\n\n\n\n\n\n","category":"function"},{"location":"counting/#Deprecated-aliases","page":"Counting","title":"Deprecated aliases","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"Several of the optimised count methods also have named function aliases, which are deprecated:","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"Deprecated function Instead use\nn_gaps count(isgap, seq)\nn_certain count(iscertain, seq)\nn_ambiguous count(isambiguous, seq)","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"n_gaps\nn_certain\nn_ambiguous","category":"page"},{"location":"counting/#BioSequences.n_gaps","page":"Counting","title":"BioSequences.n_gaps","text":"n_gaps(a::BioSequence, [b::BioSequence]) -> Int\n\nCount the number of positions where a (or b, if present) have gaps. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence.\n\nwarning: Warning\nPassing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError.\n\nExamples\n\njulia> n_gaps(dna\"--TAC-WN-ACY\")\n4\n\njulia> n_gaps(dna\"TC-AC-\", dna\"-CACG\")\n2\n\n\n\n\n\n","category":"function"},{"location":"counting/#BioSequences.n_certain","page":"Counting","title":"BioSequences.n_certain","text":"n_certain(a::BioSequence, [b::BioSequence]) -> Int\n\nCount the number of positions where a (and b, if present) have certain (i.e. non-ambiguous and non-gap) symbols. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence. Gaps are not certain.\n\nwarning: Warning\nPassing in two sequences is deprecated. 
In a future, breaking release of BioSequences, this will throw a MethodError.\n\nExamples\n\njulia> n_certain(dna\"--TAC-WN-ACY\")\n5\n\njulia> n_certain(rna\"UAYWW\", rna\"UAW\")\n2\n\n\n\n\n\n","category":"function"},{"location":"counting/#BioSequences.n_ambiguous","page":"Counting","title":"BioSequences.n_ambiguous","text":"n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int\n\nCount the number of positions where a (or b, if present) have ambiguous symbols. If b is given, and the lengths of a and b differ, look only at the indices of the shorter sequence. Gaps are not ambiguous.\n\nwarning: Warning\nPassing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError.\n\nExamples\n\njulia> n_ambiguous(dna\"--TAC-WN-ACY\")\n3\n\njulia> n_ambiguous(rna\"UAYWW\", rna\"UAW\")\n1\n\n\n\n\n\n","category":"function"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"interfaces/#Custom-BioSequences-types","page":"Implementing custom types","title":"Custom BioSequences types","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"If you're developing your own bioinformatics package or method, you may find that the reference implementations of concrete LongSequence types provided in this package are not optimal for your purposes.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"This page describes the interfaces for BioSequences' core types for developers or other packages implementing their own sequence types or extending BioSequences functionality.","category":"page"},{"location":"interfaces/#Implementing-custom-Alphabets","page":"Implementing custom types","title":"Implementing custom 
Alphabets","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Recall the required methods that define the Alphabet interface. ","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"To create an example custom alphabet, we need to create a singleton type, that implements a few methods in order to conform to the interface as described in the Alphabet documentation.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Let's do that for a restricted Amino Acid alphabet. We can test that it conforms to the interface with the BioSequences.has_interface function.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"julia> struct ReducedAAAlphabet <: Alphabet end\n\njulia> Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid\n\njulia> BioSequences.BitsPerSymbol(::ReducedAAAlphabet) = BioSequences.BitsPerSymbol{4}()\n\njulia> function BioSequences.symbols(::ReducedAAAlphabet)\n (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F,\n AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M)\n end\n\njulia> const (ENC_LUT, DEC_LUT) = let\n enc_lut = fill(0xff, length(alphabet(AminoAcid)))\n dec_lut = fill(AA_A, length(symbols(ReducedAAAlphabet())))\n for (i, aa) in enumerate(symbols(ReducedAAAlphabet()))\n enc_lut[reinterpret(UInt8, aa) + 0x01] = i - 1\n dec_lut[i] = aa\n end\n (Tuple(enc_lut), Tuple(dec_lut))\n end\n((0x02, 0xff, 0x0b, 0x0a, 0x01, 0x0c, 0x09, 0x03, 0x0e, 0xff, 0x00, 0x0d, 0x0f, 0x07, 0x06, 0x04, 0x05, 0x08, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff), (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F, AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M))\n\njulia> function BioSequences.encode(::ReducedAAAlphabet, aa::AminoAcid)\n i = reinterpret(UInt8, aa) + 0x01\n (i ≥ 
length(ENC_LUT) || @inbounds ENC_LUT[i] === 0xff) && throw(DomainError(aa))\n (@inbounds ENC_LUT[i]) % UInt\n end\n\njulia> function BioSequences.decode(::ReducedAAAlphabet, x::UInt)\n x ≥ length(DEC_LUT) && throw(DomainError(x))\n @inbounds DEC_LUT[x + UInt(1)]\n end\n\njulia> BioSequences.has_interface(Alphabet, ReducedAAAlphabet())\ntrue\n","category":"page"},{"location":"interfaces/#Implementing-custom-BioSequences","page":"Implementing custom types","title":"Implementing custom BioSequences","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Recall the required methods that define the BioSequence interface. ","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"To create an example custom sequence type, we need to define a type that implements a few methods in order to conform to the interface as described in the BioSequence documentation.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Let's do that for a custom sequence type that is optimised to represent a small sequence: a Codon. 
We can test that it conforms to the interface with the BioSequences.has_interface function.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"julia> struct Codon <: BioSequence{RNAAlphabet{2}}\n x::UInt8\n end\n\njulia> function Codon(iterable)\n length(iterable) == 3 || error(\"Must have length 3\")\n x = zero(UInt)\n for (i, nt) in enumerate(iterable)\n x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6-2i)\n end\n Codon(x % UInt8)\n end\nCodon\n\njulia> Base.length(::Codon) = 3\n\njulia> BioSequences.encoded_data_eltype(::Type{Codon}) = UInt\n\njulia> function BioSequences.extract_encoded_element(x::Codon, i::Int)\n ((x.x >>> (6-2i)) & 3) % UInt\n end\n\njulia> Base.copy(seq::Codon) = Codon(seq.x)\n\njulia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false)\ntrue","category":"page"},{"location":"interfaces/#Interface-checking-functions","page":"Implementing custom types","title":"Interface checking functions","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"BioSequences.has_interface","category":"page"},{"location":"interfaces/#BioSequences.has_interface","page":"Implementing custom types","title":"BioSequences.has_interface","text":"function has_interface(::Type{Alphabet}, A::Alphabet)\n\nReturns whether A conforms to the Alphabet interface.\n\n\n\n\n\nhas_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)\n\nCheck if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. 
If the compat flag is set, check for compatibility with existing alphabets.\n\n\n\n\n\n","category":"function"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"random/#Generating-random-sequences","page":"Random sequences","title":"Generating random sequences","text":"","category":"section"},{"location":"random/#Long-sequences","page":"Random sequences","title":"Long sequences","text":"","category":"section"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"You can generate random long sequences using the randseq function and the samplers implemented in BioSequences:","category":"page"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"randseq\nranddnaseq\nrandrnaseq\nrandaaseq\nSamplerUniform\nSamplerWeighted","category":"page"},{"location":"random/#BioSequences.randseq","page":"Random sequences","title":"BioSequences.randseq","text":"randseq([rng::AbstractRNG], A::Alphabet, len::Integer)\n\nGenerate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. User-defined alphabets should implement this method to support random LongSequence generation.\n\nFor RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. 
For a user-defined alphabet A, the default is uniform across all elements of symbols(A).\n\nExample:\n\njulia> seq = randseq(AminoAcidAlphabet(), 50)\n50aa Amino Acid Sequence:\nVFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM\n\n\n\n\n\nrandseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)\n\nGenerate a LongSequence{A} of length len with elements drawn from the given sampler.\n\nExample:\n\n# Generate 50-length RNA with 4% chance of N, 24% each for A, C, G, and U\njulia> sp = SamplerWeighted(rna\"ACGUN\", fill(0.24, 4))\njulia> seq = randseq(RNAAlphabet{4}(), sp, 50)\n50nt RNA Sequence:\nCUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randdnaseq","page":"Random sequences","title":"BioSequences.randdnaseq","text":"randdnaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randrnaseq","page":"Random sequences","title":"BioSequences.randrnaseq","text":"randrnaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randaaseq","page":"Random sequences","title":"BioSequences.randaaseq","text":"randaaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.SamplerUniform","page":"Random sequences","title":"BioSequences.SamplerUniform","text":"SamplerUniform{T}\n\nUniform sampler of type T. 
Instantiate with a collection of eltype T containing the elements to sample.\n\nExamples\n\njulia> sp = SamplerUniform(rna\"ACGU\");\n\n\n\n\n\n","category":"type"},{"location":"random/#BioSequences.SamplerWeighted","page":"Random sequences","title":"BioSequences.SamplerWeighted","text":"SamplerWeighted{T}\n\nWeighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.\n\nExamples\n\njulia> sp = SamplerWeighted(rna\"ACGUN\", fill(0.2475, 4));\n\n\n\n\n\n","category":"type"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"transforms/#Indexing-and-modifying-sequences","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"","category":"section"},{"location":"transforms/#Indexing","page":"Indexing & modifying sequences","title":"Indexing","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Most concrete BioSequence subtypes behave like other vector or string types. 
They can be indexed using integers or ranges:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"For example, with LongSequences:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTTTANAGTNNAGTACC\"\n19nt DNA Sequence:\nACGTTTANAGTNNAGTACC\n\njulia> seq[5]\nDNA_T\n\njulia> seq[6:end]\n14nt DNA Sequence:\nTANAGTNNAGTACC\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"The biological symbol at a given locus in a biological sequence can be set using setindex!:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTTTANAGTNNAGTACC\"\n19nt DNA Sequence:\nACGTTTANAGTNNAGTACC\n\njulia> seq[5] = DNA_A\nDNA_A\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"note: Note\nSome types can be indexed using integers but not using ranges.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. 
If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.","category":"page"},{"location":"transforms/#Modifying-sequences","page":"Indexing & modifying sequences","title":"Modifying sequences","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"In addition to setindex!, many other modifying operations are possible for biological sequences, such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"push!(::BioSequences.BioSequence, ::Any)\npop!(::BioSequences.BioSequence)\npushfirst!(::BioSequences.BioSequence, ::Any)\npopfirst!(::BioSequences.BioSequence)\ninsert!(::BioSequences.BioSequence, ::Integer, ::Any)\ndeleteat!(::BioSequences.BioSequence, ::Integer)\nappend!(::BioSequences.BioSequence, ::BioSequences.BioSequence)\nresize!(::BioSequences.LongSequence, ::Integer)\nempty!(::BioSequences.BioSequence)","category":"page"},{"location":"transforms/#Base.push!-Tuple{BioSequence, Any}","page":"Indexing & modifying sequences","title":"Base.push!","text":"push!(seq::BioSequence, x)\n\nAppend a biological symbol x to a biological sequence seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.pop!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.pop!","text":"pop!(seq::BioSequence)\n\nRemove the symbol from the end of a biological sequence seq and return it. 
Returns a variable of eltype(seq).\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.pushfirst!-Tuple{BioSequence, Any}","page":"Indexing & modifying sequences","title":"Base.pushfirst!","text":"pushfirst!(seq, x)\n\nInsert a biological symbol x at the beginning of a biological sequence seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.popfirst!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.popfirst!","text":"popfirst!(seq)\n\nRemove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.insert!-Tuple{BioSequence, Integer, Any}","page":"Indexing & modifying sequences","title":"Base.insert!","text":"insert!(seq::BioSequence, i, x)\n\nInsert a biological symbol x into a biological sequence seq, at the given index i.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.deleteat!-Tuple{BioSequence, Integer}","page":"Indexing & modifying sequences","title":"Base.deleteat!","text":"deleteat!(seq::BioSequence, i::Integer)\n\nDelete a biological symbol at a single position i in a biological sequence seq.\n\nModifies the input sequence.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.append!-Tuple{BioSequence, BioSequence}","page":"Indexing & modifying sequences","title":"Base.append!","text":"append!(seq, other)\n\nAdd a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.resize!-Tuple{LongSequence, Integer}","page":"Indexing & modifying sequences","title":"Base.resize!","text":"resize!(seq, size, [force::Bool=false])\n\nResize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. 
If force, always resize underlying data array.\n\nNote that resizing to a larger size, and then loading from uninitialized positions is not allowed and may cause undefined behaviour. Make sure to always fill any uninitialized biosymbols after resizing.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.empty!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.empty!","text":"empty!(seq::BioSequence)\n\nCompletely empty a biological sequence seq of nucleotides.\n\n\n\n\n\n","category":"method"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Here are some examples:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACG\"\n3nt DNA Sequence:\nACG\n\njulia> push!(seq, DNA_T)\n4nt DNA Sequence:\nACGT\n\njulia> append!(seq, dna\"AT\")\n6nt DNA Sequence:\nACGTAT\n\njulia> deleteat!(seq, 2)\n5nt DNA Sequence:\nAGTAT\n\njulia> deleteat!(seq, 2:3)\n3nt DNA Sequence:\nAAT\n","category":"page"},{"location":"transforms/#Additional-transformations","page":"Indexing & modifying sequences","title":"Additional transformations","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"reverse!(::BioSequences.LongSequence)\nreverse(::BioSequences.LongSequence{<:NucleicAcidAlphabet})\ncomplement!\ncomplement\nreverse_complement!\nreverse_complement\nungap!\nungap\ncanonical!\ncanonical","category":"page"},{"location":"transforms/#Base.reverse!-Tuple{LongSequence}","page":"Indexing & modifying 
sequences","title":"Base.reverse!","text":"reverse!(seq::LongSequence)\n\nReverse a biological sequence seq in place.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.reverse-Tuple{LongSequence{<:NucleicAcidAlphabet}}","page":"Indexing & modifying sequences","title":"Base.reverse","text":"reverse(seq::BioSequence)\n\nCreate reversed copy of a biological sequence.\n\n\n\n\n\nreverse(seq::LongSequence)\n\nCreate reversed copy of a biological sequence.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#BioSequences.complement!","page":"Indexing & modifying sequences","title":"BioSequences.complement!","text":"complement!(seq)\n\nMake a complement sequence of seq in place.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSymbols.complement","page":"Indexing & modifying sequences","title":"BioSymbols.complement","text":"complement(nt::NucleicAcid)\n\nReturn the complementary nucleotide of nt.\n\nThis function returns the union of all possible complementary nucleotides.\n\nExamples\n\njulia> complement(DNA_A)\nDNA_T\n\njulia> complement(DNA_N)\nDNA_N\n\njulia> complement(RNA_U)\nRNA_A\n\n\n\n\n\n\ncomplement(seq)\n\nMake a complement sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.reverse_complement!","page":"Indexing & modifying sequences","title":"BioSequences.reverse_complement!","text":"reverse_complement!(seq)\n\nMake a reversed complement sequence of seq in place.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.reverse_complement","page":"Indexing & modifying sequences","title":"BioSequences.reverse_complement","text":"reverse_complement(seq)\n\nMake a reversed complement sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ungap!","page":"Indexing & modifying sequences","title":"BioSequences.ungap!","text":"Remove gap characters from an input 
sequence.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ungap","page":"Indexing & modifying sequences","title":"BioSequences.ungap","text":"Create a copy of a sequence with gap characters removed.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.canonical!","page":"Indexing & modifying sequences","title":"BioSequences.canonical!","text":"canonical!(seq::NucleotideSeq)\n\nTransforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.\n\nFor any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:\n\n------->\nATCGATCG\nCGATCGAT\n<-------\n\nnote: Note\nUsing the reverse_complement of a DNA sequence will give this reverse complement.\n\nOf the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.\n\nUsing this function on a seq will ensure it is the canonical version.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.canonical","page":"Indexing & modifying sequences","title":"BioSequences.canonical","text":"canonical(seq::NucleotideSeq)\n\nCreate the canonical sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Some examples:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTAT\"\n6nt DNA Sequence:\nACGTAT\n\njulia> reverse!(seq)\n6nt DNA Sequence:\nTATGCA\n\njulia> complement!(seq)\n6nt DNA Sequence:\nATACGT\n\njulia> reverse_complement!(seq)\n6nt DNA Sequence:\nACGTAT\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Many of these methods also have a version which makes a copy of the input sequence, so you get a modified 
copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark, e.g. reverse instead of reverse!, and ungap instead of ungap!. ","category":"page"},{"location":"transforms/#Translation","page":"Indexing & modifying sequences","title":"Translation","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Translation is a slightly more complex transformation for RNA sequences, so we describe it here in more detail.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes and they are registered in ncbi_trans_table.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"translate\nncbi_trans_table","category":"page"},{"location":"transforms/#BioSequences.translate","page":"Indexing & modifying sequences","title":"BioSequences.translate","text":"translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)\n\nTranslate a LongRNA or a LongDNA to a LongAA.\n\nTranslation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. 
For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ncbi_trans_table","page":"Indexing & modifying sequences","title":"BioSequences.ncbi_trans_table","text":"Genetic code list of NCBI.\n\nThe standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.\n\n\n\n\n\n","category":"constant"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> ncbi_trans_table\nTranslation Tables:\n 1. The Standard Code (standard_genetic_code)\n 2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)\n 3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)\n 4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)\n 5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)\n 6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)\n 9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)\n 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)\n 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)\n 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)\n 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)\n 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)\n 15. Blepharisma Macronuclear Code (blepharisma_macronuclear_genetic_code)\n 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)\n 21. 
Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)\n 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)\n 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)\n 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)\n 25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/#Construction-and-conversion","page":"Constructing sequences","title":"Construction & conversion","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Here we will showcase the various ways you can construct the various sequence types in BioSequences.","category":"page"},{"location":"construction/#Constructing-sequences","page":"Constructing sequences","title":"Constructing sequences","text":"","category":"section"},{"location":"construction/#From-strings","page":"Constructing sequences","title":"From strings","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequences can be constructed from strings using their constructors:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}(\"TTANC\")\n5nt DNA Sequence:\nTTANC\n\njulia> LongSequence{DNAAlphabet{2}}(\"TTAGC\")\n5nt DNA Sequence:\nTTAGC\n\njulia> LongRNA{4}(\"UUANC\")\n5nt RNA 
Sequence:\nUUANC\n\njulia> LongSequence{RNAAlphabet{2}}(\"UUAGC\")\n5nt RNA Sequence:\nUUAGC\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Type aliases can also be used for brevity.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}(\"TTANC\")\n5nt DNA Sequence:\nTTANC\n\njulia> LongDNA{2}(\"TTAGC\")\n5nt DNA Sequence:\nTTAGC\n\njulia> LongRNA{4}(\"UUANC\")\n5nt RNA Sequence:\nUUANC\n\njulia> LongRNA{2}(\"UUAGC\")\n5nt RNA Sequence:\nUUAGC","category":"page"},{"location":"construction/#Constructing-sequences-from-arrays-of-BioSymbols","page":"Constructing sequences","title":"Constructing sequences from arrays of BioSymbols","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequences can be constructed using vectors or arrays of a BioSymbol type:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])\n5nt DNA Sequence:\nTTANC\n\njulia> LongSequence{DNAAlphabet{2}}([DNA_T, DNA_T, DNA_A, DNA_G, DNA_C])\n5nt DNA Sequence:\nTTAGC\n","category":"page"},{"location":"construction/#Constructing-sequences-from-other-sequences","page":"Constructing sequences","title":"Constructing sequences from other sequences","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You can create sequences by concatenating other sequences together:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{2}(\"ACGT\") * LongDNA{2}(\"TGCA\")\n8nt DNA Sequence:\nACGTTGCA\n\njulia> repeat(LongDNA{4}(\"TA\"), 10)\n20nt DNA Sequence:\nTATATATATATATATATATA\n\njulia> 
LongDNA{4}(\"TA\") ^ 10\n20nt DNA Sequence:\nTATATATATATATATATATA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequence views (LongSubSeqs) are special, in that they do not own their own data, and must be constructed from a LongSequence or another LongSubSeq:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> seq = LongDNA{4}(\"TACGGACATTA\")\n11nt DNA Sequence:\nTACGGACATTA\n\njulia> seqview = LongSubSeq(seq, 3:7)\n5nt DNA Sequence:\nCGGAC\n\njulia> seqview2 = @view seq[1:3]\n3nt DNA Sequence:\nTAC\n\njulia> typeof(seqview) == typeof(seqview2) && typeof(seqview) <: LongSubSeq\ntrue\n","category":"page"},{"location":"construction/#Conversion-of-sequence-types","page":"Constructing sequences","title":"Conversion of sequence types","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You can convert between sequence types, if the sequences are compatible - that is, if the source sequence does not contain symbols that are un-encodable by the destination type.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> dna = dna\"TTACGTAGACCG\"\n12nt DNA Sequence:\nTTACGTAGACCG\n\njulia> dna2 = convert(LongDNA{2}, dna)\n12nt DNA Sequence:\nTTACGTAGACCG","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DNA/RNA are special in that they can be converted to each other, despite containing distinct symbols. 
When doing so, DNA_T is converted to RNA_U and vice versa.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> convert(LongRNA{2}, dna\"TAGCTAGG\")\n8nt RNA Sequence:\nUAGCUAGG","category":"page"},{"location":"construction/#String-literals","page":"Constructing sequences","title":"String literals","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"BioSequences provides several string literal macros for creating sequences.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"note: Note\nWhen you use literals you may mix the case of characters.","category":"page"},{"location":"construction/#Long-sequence-literals","page":"Constructing sequences","title":"Long sequence literals","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> dna\"TACGTANNATC\"\n11nt DNA Sequence:\nTACGTANNATC\n\njulia> rna\"AUUUGNCCANU\"\n11nt RNA Sequence:\nAUUUGNCCANU\n\njulia> aa\"ARNDCQEGHILKMFPSTWYVX\"\n21aa Amino Acid Sequence:\nARNDCQEGHILKMFPSTWYVX","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"However, it should be noted that by default these sequence literals allocate the LongSequence object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect. 
For example consider the following code:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> function foo()\n s = dna\"CTT\"\n push!(s, DNA_A)\n end\nfoo (generic function with 1 method)\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"d\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You might expect that every time you call foo, that a DNA sequence CTTA would be returned. You might expect that this is because every time foo is called, a new DNA sequence variable CTT is created, and the A nucleotide is pushed to it, and the result, CTTA is returned. In other words you might expect the following output:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"However, this is not what happens, instead the following happens:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"s\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n5nt DNA Sequence:\nCTTAA\n\njulia> foo()\n6nt DNA Sequence:\nCTTAAA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"The reason for this is because the 
sequence literal is allocated only once before the first time the function foo is called and run. Therefore, s in foo is always a reference to that one sequence that was allocated. So one sequence is created before foo is called, and then it is pushed to every time foo is called. Thus, that one allocated sequence grows with every call of foo.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"If you want foo to create a new sequence each time it is called, you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However, providing the 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated whilst the code is running, and not before. So to change foo so that it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> function foo()\n s = dna\"CTT\"d # 'd' flag appended to the string literal.\n push!(s, DNA_A)\n end\nfoo (generic function with 1 method)\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Now every time foo is called, a new sequence CTT is created, and an A nucleotide is pushed to it:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"d\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing 
sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"So the take-home message of sequence literals is this:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"@dna_str\n@rna_str\n@aa_str","category":"page"},{"location":"construction/#BioSequences.@dna_str","page":"Constructing sequences","title":"BioSequences.@dna_str","text":"@dna_str(seq, flag=\"s\") -> LongDNA{4}\n\nCreate a LongDNA{4} sequence at parse time from string seq. If flag is \"s\" ('static', the default), the sequence is created at parse time, and inserted directly into the returned expression. A static string ought not to be mutated. Alternatively, if flag is \"d\" (dynamic), a new sequence is parsed and created whenever the code where the macro is placed is run.\n\nSee also: @aa_str, @rna_str\n\nExamples\n\nIn the example below, the static sequence is created once, at parse time, NOT when the function f is run. 
This means it is the same sequence that is pushed to repeatedly.\n\njulia> f() = dna\"TAG\";\n\njulia> string(push!(f(), DNA_A)) # NB: Mutates static string!\n\"TAGA\"\n\njulia> string(push!(f(), DNA_A))\n\"TAGAA\"\n\njulia> f() = dna\"TAG\"d; # dynamically make seq\n\njulia> string(push!(f(), DNA_A))\n\"TAGA\"\n\njulia> string(push!(f(), DNA_A))\n\"TAGA\"\n\n\n\n\n\n","category":"macro"},{"location":"construction/#BioSequences.@rna_str","page":"Constructing sequences","title":"BioSequences.@rna_str","text":"The LongRNA{4} equivalent to @dna_str\n\nSee also: @dna_str, @aa_str\n\nExamples\n\njulia> rna\"UCGUGAUGC\"\n9nt RNA Sequence:\nUCGUGAUGC\n\n\n\n\n\n","category":"macro"},{"location":"construction/#BioSequences.@aa_str","page":"Constructing sequences","title":"BioSequences.@aa_str","text":"The AminoAcidAlphabet equivalent to @dna_str\n\nSee also: @dna_str, @rna_str\n\nExamples\n\njulia> aa\"PKLEQC\"\n6aa Amino Acid Sequence:\nPKLEQC\n\n\n\n\n\n","category":"macro"},{"location":"construction/#Loose-parsing","page":"Constructing sequences","title":"Loose parsing","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"As of version 3.2.0, BioSequences.jl provides the bioseq function, which can be used to build a LongSequence from a string (or an AbstractVector{UInt8}) without knowing the correct Alphabet.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> bioseq(\"ATGTGCTGA\")\n9nt DNA Sequence:\nATGTGCTGA","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"The function will prioritise 2-bit alphabets over 4-bit alphabets, and prefer smaller alphabets (like DNAAlphabet{4}) over larger (like AminoAcidAlphabet). 
If the input cannot be encoded by any of the built-in alphabets, an error is thrown:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> bioseq(\"0!(CC!;#&&%\")\nERROR: cannot encode 0x30 in AminoAcidAlphabet\n[...]","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Note that this function is only intended to be used for interactive, ephemeral work. The function is necessarily type unstable, and the precise returned alphabet for a given input is a heuristic which is subject to change.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"bioseq\nguess_alphabet","category":"page"},{"location":"construction/#BioSequences.bioseq","page":"Constructing sequences","title":"BioSequences.bioseq","text":"bioseq(s::Union{AbstractString, AbstractVector{UInt8}}) -> LongSequence\n\nParse s into a LongSequence with an appropriate Alphabet, or throw an exception if no alphabet matches. See guess_alphabet for the available alphabets and the alphabet priority.\n\nwarning: Warning\nThe functions bioseq and guess_alphabet are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.\n\nExamples\n\njulia> bioseq(\"QMKLPEEFW\")\n9aa Amino Acid Sequence:\nQMKLPEEFW\n\njulia> bioseq(\"UAUGCUGUAGG\")\n11nt RNA Sequence:\nUAUGCUGUAGG\n\njulia> bioseq(\"PKMW#3>>0;kL\")\nERROR: cannot encode 0x23 in AminoAcidAlphabet\n[...]\n\n\n\n\n\n","category":"function"},{"location":"construction/#BioSequences.guess_alphabet","page":"Constructing sequences","title":"BioSequences.guess_alphabet","text":"guess_alphabet(s::Union{AbstractString, AbstractVector{UInt8}}) -> Union{Integer, Alphabet}\n\nPick an Alphabet that can encode input s. 
If no Alphabet can, return the index of the first byte of the input which is not encodable in any alphabet. This function only knows about the alphabets listed below. If multiple alphabets are possible, pick the first from the order below (i.e. DNAAlphabet{2}() if possible, otherwise RNAAlphabet{2}() etc).\n\nDNAAlphabet{2}()\nRNAAlphabet{2}()\nDNAAlphabet{4}()\nRNAAlphabet{4}()\nAminoAcidAlphabet()\n\nwarning: Warning\nThe functions bioseq and guess_alphabet are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.\n\nExamples\n\njulia> guess_alphabet(\"AGGCA\")\nDNAAlphabet{2}()\n\njulia> guess_alphabet(\"WKLQSTV\")\nAminoAcidAlphabet()\n\njulia> guess_alphabet(\"QAWT+!\")\n5\n\njulia> guess_alphabet(\"UAGCSKMU\")\nRNAAlphabet{4}()\n\n\n\n\n\n","category":"function"},{"location":"construction/#Comparison-to-other-sequence-types","page":"Constructing sequences","title":"Comparison to other sequence types","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. 
compare a BioSequence with a vector of DNA, compare the elements themselves:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> seq = dna\"GAGCTGA\"; vec = collect(seq);\n\njulia> seq == vec, isequal(seq, vec)\n(false, false)\n\njulia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))\ntrue ","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"sequence_search/#Searching-for-sequence-motifs","page":"Pattern matching and searching","title":"Searching for sequence motifs","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"There are many ways to search for particular motifs in biological sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Exact searches, where you are looking for exact matches of a particular character or substring.\nApproximate searches, where you are looking for sequences that are sufficiently similar to a given sequence or family of sequences.\nSearches where you are looking for sequences that conform to some sort of pattern.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Like other Julia sequences such as Vector, you can search a BioSequence with the findfirst(predicate, collection) method pattern.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"All these kinds of searches are provided in BioSequences.jl, and they all conform to the findnext, findprev, and occursin patterns 
established in Base for String and collections like Vector.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The exception is searching using the specialised regex provided in this package, which, as you shall see, conforms to the match pattern established in Base for pcre and Strings.","category":"page"},{"location":"sequence_search/#Symbol-search","page":"Pattern matching and searching","title":"Symbol search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> seq = dna\"ACAGCGTAGCT\";\n\njulia> findfirst(DNA_A, seq)\n1\n\njulia> findlast(DNA_A, seq)\n8\n\njulia> findnext(DNA_A, seq, 2)\n3\n\njulia> findprev(DNA_A, seq, 7)\n3\n\njulia> findall(DNA_A, seq)\n3-element Vector{Int64}:\n 1\n 3\n 8","category":"page"},{"location":"sequence_search/#Exact-search","page":"Pattern matching and searching","title":"Exact search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"ExactSearchQuery","category":"page"},{"location":"sequence_search/#BioSequences.ExactSearchQuery","page":"Pattern matching and searching","title":"BioSequences.ExactSearchQuery","text":"ExactSearchQuery{F<:Function,S<:BioSequence}\n\nQuery type for exact sequence search.\n\nAn exact search is one where you are looking in some given sequence for exact instances of some given substring.\n\nThese queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.\n\nExamples\n\njulia> seq = dna\"ACAGCGTAGCT\";\n\njulia> query = ExactSearchQuery(dna\"AGC\");\n\njulia> findfirst(query, seq)\n3:5\n\njulia> findlast(query, seq)\n8:10\n\njulia> findnext(query, seq, 6)\n8:10\n\njulia> findprev(query, seq, 7)\n3:5\n\njulia> findall(query, 
seq)\n2-element Vector{UnitRange{Int64}}:\n 3:5\n 8:10\n\njulia> occursin(query, seq)\ntrue\n\n\nYou can pass a comparator function such as isequal or iscompatible to its constructor to modify the search behaviour.\n\nThe default is isequal, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.\n\njulia> query = ExactSearchQuery(dna\"CGT\", iscompatible);\n\njulia> findfirst(query, dna\"ACNT\") # 'N' matches 'G'\n2:4\n\njulia> findfirst(query, dna\"ACGT\") # 'G' matches 'N'\n2:4\n\njulia> occursin(ExactSearchQuery(dna\"CNT\", iscompatible), dna\"ACNT\")\ntrue\n\n\n\n\n\n\n","category":"type"},{"location":"sequence_search/#Allowing-mismatches","page":"Pattern matching and searching","title":"Allowing mismatches","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"ApproximateSearchQuery","category":"page"},{"location":"sequence_search/#BioSequences.ApproximateSearchQuery","page":"Pattern matching and searching","title":"BioSequences.ApproximateSearchQuery","text":"ApproximateSearchQuery{F<:Function,S<:BioSequence}\n\nQuery type for approximate sequence search.\n\nThese queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.\n\nUsing these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.\n\nIn other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.\n\nExamples\n\njulia> seq = dna\"ACAGCGTAGCT\";\n\njulia> query = ApproximateSearchQuery(dna\"AGGG\");\n\njulia> findfirst(query, 0, seq) == nothing # nothing matches with no errors\ntrue\n\njulia> findfirst(query, 1, seq) # seq[3:6] matches with one error\n3:6\n\njulia> findfirst(query, 2, seq) # seq[1:4] matches with two errors\n1:4\n\n\nYou can pass 
a comparator function such as isequal or iscompatible to its constructor to modify the search behaviour.\n\nThe default is isequal; however, in biology we sometimes want a more flexible comparison to find subsequences of compatible symbols.\n\njulia> query = ApproximateSearchQuery(dna\"AGGG\", iscompatible);\n\njulia> occursin(query, 1, dna\"AAGNGG\") # 1 mismatch permitted (A vs G) & matched N\ntrue\n\njulia> findnext(query, 1, dna\"AAGNGG\", 1) # 1 mismatch permitted (A vs G) & matched N\n1:4\n\n\nnote: Note\nThis method of searching for motifs was implemented with smaller query motifs in mind. If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.\n\n\n\n\n\n","category":"type"},{"location":"sequence_search/#Searching-according-to-a-pattern","page":"Pattern matching and searching","title":"Searching according to a pattern","text":"","category":"section"},{"location":"sequence_search/#Regular-expression-search","page":"Pattern matching and searching","title":"Regular expression search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}(\"MV+\"). 
For bioregex literals, it is instead recommended to use the @biore_str macro:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: \"dna\", \"rna\" or \"aa\". For example, biore\"A+\"dna is a regular expression for DNA sequences and biore\"A+\"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: \"d\", \"r\" or \"a\", respectively.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Here are examples of using the regular expression for BioSequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> match(biore\"A+C*\"dna, dna\"AAAACC\")\nRegexMatch(\"AAAACC\")\n\njulia> match(biore\"A+C*\"d, dna\"AAAACC\")\nRegexMatch(\"AAAACC\")\n\njulia> occursin(biore\"A+C*\"dna, dna\"AAC\")\ntrue\n\njulia> occursin(biore\"A+C*\"dna, dna\"C\")\nfalse\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"match will return a RegexMatch if a match is found; otherwise it will return nothing.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The table below summarizes available syntax elements.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Syntax Description Example\n| alternation \"A|T\" matches \"A\" and \"T\"\n* zero or more times repeat \"TA*\" matches \"T\", \"TA\" and \"TAA\"\n+ one or more times repeat \"TA+\" matches \"TA\" and \"TAA\"\n? 
zero or one time \"TA?\" matches \"T\" and \"TA\"\n{n,} n or more times repeat \"A{3,}\" matches \"AAA\" and \"AAAA\"\n{n,m} n-m times repeat \"A{3,5}\" matches \"AAA\", \"AAAA\" and \"AAAAA\"\n^ the start of the sequence \"^TAN*\" matches \"TATGT\"\n$ the end of the sequence \"N*TA$\" matches \"GCTA\"\n(...) pattern grouping \"(TA)+\" matches \"TA\" and \"TATA\"\n[...] one of symbols \"[ACG]+\" matches \"AGGC\"","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"eachmatch and findfirst are also defined, just like usual regex and strings found in Base.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> collect(matched(x) for x in eachmatch(biore\"TATA*?\"d, dna\"TATTATAATTA\")) # overlap\n4-element Vector{LongSequence{DNAAlphabet{4}}}:\n TAT \n TAT\n TATA\n TATAA\n\njulia> collect(matched(x) for x in eachmatch(biore\"TATA*\"d, dna\"TATTATAATTA\", false)) # no overlap\n2-element Vector{LongSequence{DNAAlphabet{4}}}:\n TAT \n TATAA\n\njulia> findfirst(biore\"TATA*\"d, dna\"TATTATAATTA\")\n1:3\n\njulia> findfirst(biore\"TATA*\"d, dna\"TATTATAATTA\", 2)\n4:8\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Noteworthy differences from strings are:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Ambiguous characters match any compatible characters (e.g. biore\"N\"d is equivalent to biore\"[ACGT]\"d).\nWhitespaces are ignored (e.g. biore\"A C G\"d is equivalent to biore\"ACG\"d).","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The PROSITE notation is described in ScanProsite - user manual. 
The syntax supports almost all notations including the extended syntax. The PROSITE notation starts with prosite prefix and no symbol option is needed because it always describes patterns of amino acid sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> match(prosite\"[AC]-x-V-x(4)-{ED}\", aa\"CPVPQARG\")\nRegexMatch(\"CPVPQARG\")\n\njulia> match(prosite\"[AC]xVx(4){ED}\", aa\"CPVPQARG\")\nRegexMatch(\"CPVPQARG\")\n","category":"page"},{"location":"sequence_search/#Position-weight-matrix-search","page":"Pattern matching and searching","title":"Position weight matrix search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"A motif can be specified using position weight matrix (PWM) in a probabilistic way. This method searches for the first position in the sequence where a score calculated using a PWM is greater than or equal to a threshold. More formally, denoting the sequence as S and the PWM value of symbol s at position j as M_sj, the score starting from a position p is defined as","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"operatornamescore(S p) = sum_i=1^L M_Sp+i-1i","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"and the search returns the smallest p that satisfies operatornamescore(S p) ge t.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"There are two kinds of matrices in this package: PFM and PWM. The PFM type is a position frequency matrix and stores symbol frequencies for each position. The PWM is a position weight matrix and stores symbol scores for each position. 
You can create a PFM from a set of sequences with the same length and then create a PWM from the PFM object.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> motifs = [dna\"TTA\", dna\"CTA\", dna\"ACA\", dna\"TCA\", dna\"GTA\"]\n5-element Vector{LongSequence{DNAAlphabet{4}}}:\n TTA\n CTA\n ACA\n TCA\n GTA\n\njulia> pfm = PFM(motifs) # sequence set => PFM\n4×3 PFM{DNA, Int64}:\n A 1 0 5\n C 1 2 0\n G 1 0 0\n T 2 3 0\n\njulia> pwm = PWM(pfm) # PFM => PWM\n4×3 PWM{DNA, Float64}:\n A -0.321928 -Inf 2.0\n C -0.321928 0.678072 -Inf\n G -0.321928 -Inf -Inf\n T 0.678072 1.26303 -Inf\n\njulia> pwm = PWM(pfm .+ 0.01) # add pseudo counts to avoid infinite values\n4×3 PWM{DNA, Float64}:\n A -0.319068 -6.97728 1.99139\n C -0.319068 0.673772 -6.97728\n G -0.319068 -6.97728 -6.97728\n T 0.673772 1.25634 -6.97728\n\njulia> pwm = PWM(pfm .+ 0.01, prior=[0.2, 0.3, 0.3, 0.2]) # GC-rich prior\n4×3 PWM{DNA, Float64}:\n A 0.00285965 -6.65535 2.31331\n C -0.582103 0.410737 -7.24031\n G -0.582103 -7.24031 -7.24031\n T 0.9957 1.57827 -6.65535\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The PWM_sj matrix is computed from PFM_sj and the prior probability p(s) as follows ([Wasserman2004]):","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"beginalign\n PWM_sj = log_2 fracp(sj)p(s) \n p(sj) = fracPFM_sjsum_s PFM_sj\nendalign","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"However, if you just want to quickly conduct a search, constructing the PFM and PWM is done for you as a convenience if you build a PWMSearchQuery, using a collection of sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching 
and searching","title":"Pattern matching and searching","text":"julia> motifs = [dna\"TTA\", dna\"CTA\", dna\"ACA\", dna\"TCA\", dna\"GTA\"]\n5-element Vector{LongSequence{DNAAlphabet{4}}}:\n TTA\n CTA\n ACA\n TCA\n GTA\n\njulia> subject = dna\"TATTATAATTA\";\n\njulia> qa = PWMSearchQuery(motifs, 1.0);\n\njulia> findfirst(qa, subject)\n3\n\njulia> findall(qa, subject)\n3-element Vector{Int64}:\n 3\n 5\n 9","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"[Wasserman2004]: https://doi.org/10.1038/nrg1315","category":"page"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"predicates/#Predicates","page":"Predicates","title":"Predicates","text":"","category":"section"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.","category":"page"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"isrepetitive\nispalindromic\nhasambiguity\niscanonical","category":"page"},{"location":"predicates/#BioSequences.isrepetitive","page":"Predicates","title":"BioSequences.isrepetitive","text":"isrepetitive(seq::BioSequence, n::Integer = length(seq))\n\nReturn true if and only if seq contains a repetitive subsequence of length ≥ n.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.ispalindromic","page":"Predicates","title":"BioSequences.ispalindromic","text":"ispalindromic(seq::NucSeq) -> Bool\n\nCheck if seq is palindromic. 
A palindromic sequence is identical to its reverse-complement, so this should be equivalent to checking if seq == reverse_complement(seq).\n\nExamples\n\njulia> ispalindromic(dna\"TGCA\")\ntrue\n\njulia> ispalindromic(dna\"TCCT\")\nfalse\n\njulia> ispalindromic(rna\"ACGGU\")\nfalse\n\nReturn true if seq is a palindromic sequence; otherwise return false.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.hasambiguity","page":"Predicates","title":"BioSequences.hasambiguity","text":"hasambiguity(seq::BioSequence)\n\nReturns true if seq has an ambiguous symbol; otherwise returns false.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.iscanonical","page":"Predicates","title":"BioSequences.iscanonical","text":"iscanonical(seq::NucleotideSeq)\n\nReturns true if seq is canonical.\n\nFor any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:\n\n------->\nATCGATCG\nCGATCGAT\n<-------\n\nnote: Note\nUsing the reverse_complement of a DNA sequence will give this reverse complement.\n\nOf the two sequences, the canonical one is the lesser of the two, i.e.
canonical_seq < other_seq.\n\n\n\n\n\n","category":"function"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\n using BioSymbols\nend","category":"page"},{"location":"recipes/#Recipes","page":"Recipes","title":"Recipes","text":"","category":"section"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"This page provides tested example code to solve various common problems using BioSequences.","category":"page"},{"location":"recipes/#One-hot-encoding-biosequences","page":"Recipes","title":"One-hot encoding biosequences","text":"","category":"section"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"The types DNA, RNA and AminoAcid expose a binary representation through the exported function BioSymbols.compatbits, which is a one-hot encoding of:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"julia> using BioSymbols\n\njulia> compatbits(DNA_W)\n0x09\n\njulia> compatbits(AA_J)\n0x00000600","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"Each set bit in the encoding corresponds to a compatible unambiguous symbol. For example, for RNA, the four lower bits encode A, C, G, and U, in order. 
Hence, the symbol D, which is short for A, G or U, is encoded as 0x01 | 0x04 | 0x08 == 0x0d:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"julia> compatbits(RNA_D)\n0x0d\n\njulia> compatbits(RNA_A) | compatbits(DNA_G) | compatbits(RNA_U)\n0x0d","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"Using this, we can construct a function to one-hot encode sequences - in this example, nucleic acid sequences:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"function one_hot(s::NucSeq)\n M = falses(4, length(s))\n for (i, s) in enumerate(s)\n bits = compatbits(s)\n while !iszero(bits)\n M[trailing_zeros(bits) + 1, i] = true\n bits &= bits - one(bits) # clear lowest bit\n end\n end\n M\nend\n\none_hot(dna\"TGNTKCTW-T\")\n\n# output\n\n4×10 BitMatrix:\n 0 0 1 0 0 0 0 1 0 0\n 0 0 1 0 0 1 0 0 0 0\n 0 1 1 0 1 0 0 0 0 0\n 1 0 1 1 1 0 1 1 0 1","category":"page"},{"location":"#BioSequences","page":"Home","title":"BioSequences","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"(Image: Latest Release) (Image: MIT license) (Image: Documentation) (Image: Pkg Status)","category":"page"},{"location":"#Description","page":"Home","title":"Description","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"You can install BioSequences from the julia REPL. 
Press ] to enter pkg mode, and enter the following:","category":"page"},{"location":"","page":"Home","title":"Home","text":"add BioSequences","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.","category":"page"},{"location":"#Testing","page":"Home","title":"Testing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.","category":"page"},{"location":"","page":"Home","title":"Home","text":"(Image: Unit tests) (Image: Documentation) (Image: )","category":"page"},{"location":"#Contributing","page":"Home","title":"Contributing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.","category":"page"},{"location":"","page":"Home","title":"Home","text":"Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.","category":"page"},{"location":"#Questions?","page":"Home","title":"Questions?","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia discourse site.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"types/#Abstract-Types","page":"BioSequences Types","title":"Abstract Types","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences exports an abstract
BioSequence type, and several concrete sequence types which inherit from it.","category":"page"},{"location":"types/#The-abstract-BioSequence","page":"BioSequences Types","title":"The abstract BioSequence","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences provides an abstract type called a BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allow many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequence","category":"page"},{"location":"types/#BioSequences.BioSequence","page":"BioSequences Types","title":"BioSequences.BioSequence","text":"BioSequence{A <: Alphabet}\n\nBioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.\n\nExtended help\n\nIts subtypes are characterized by:\n\nBeing a linear container type with random access and indices Base.OneTo(length(x)).\nContaining zero or more internal data elements of type encoded_data_eltype(typeof(x)).\nBeing associated with an Alphabet, A by being a subtype of BioSequence{A}.\n\nA BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence.
Hence, the element type and container type of a BioSequence are separated.\n\nSubtypes T of BioSequence must implement the following, with E being an encoded data type:\n\nBase.length(::T)::Int\nencoded_data_eltype(::Type{T})::Type{E}\nextract_encoded_element(::T, ::Integer)::E\ncopy(::T)\nT must be able to be constructed from any iterable with length defined and with a known, compatible element type.\n\nFurthermore, mutable sequences should implement\n\nencoded_setindex!(::T, ::E, ::Integer)\nT(undef, ::Int)\nresize!(::T, ::Int)\n\nFor compatibility with existing Alphabets, the encoded data eltype must be UInt.\n\n\n\n\n\n","category":"type"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Some aliases for BioSequence are also provided for your convenience:","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"NucSeq\nAASeq","category":"page"},{"location":"types/#BioSequences.NucSeq","page":"BioSequences Types","title":"BioSequences.NucSeq","text":"An alias for BioSequence{<:NucleicAcidAlphabet}\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AASeq","page":"BioSequences Types","title":"BioSequences.AASeq","text":"An alias for BioSequence{AminoAcidAlphabet}\n\n\n\n\n\n","category":"type"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Let's have a closer look at some of those methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"encoded_data_eltype\nextract_encoded_element\nencoded_setindex!","category":"page"},{"location":"types/#BioSequences.encoded_data_eltype","page":"BioSequences Types","title":"BioSequences.encoded_data_eltype","text":"encoded_data_eltype(::Type{<:BioSequence})\n\nReturns the element type of the encoded data of the BioSequence.
This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/#BioSequences.extract_encoded_element","page":"BioSequences Types","title":"BioSequences.extract_encoded_element","text":"extract_encoded_element(::BioSequence{A}, i::Integer)\n\nReturns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/#BioSequences.encoded_setindex!","page":"BioSequences Types","title":"BioSequences.encoded_setindex!","text":"encoded_setindex!(seq::BioSequence, x::E, i::Integer)\n\nGiven encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of a specific type.","category":"page"},{"location":"types/#The-abstract-Alphabet","page":"BioSequences Types","title":"The abstract Alphabet","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Alphabets control how biological symbols are encoded and decoded.
They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences.Alphabet\nBioSequences.AsciiAlphabet","category":"page"},{"location":"types/#BioSequences.Alphabet","page":"BioSequences Types","title":"BioSequences.Alphabet","text":"Alphabet\n\nAlphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.\n\nExtended help\n\nSubtypes of Alphabet are singleton structs that may or may not be parameterized.\nAlphabets span over a finite set of biological symbols.\nThe alphabet controls the encoding from some internal \"encoded data\" to a BioSymbol of the alphabet's element type, as well as the decoding, the inverse process.\nAn Alphabet's encode method must not produce invalid data. \n\nEvery subtype A of Alphabet must implement:\n\nBase.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.\nsymbols(::A)::Tuple{Vararg{S}}. 
This gives tuples of all symbols in the set of A.\nencode(::A, ::S)::E encodes a symbol to an internal data eltype E.\ndecode(::A, ::E)::S decodes an internal data eltype E to a symbol S.\nExcept for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.\n\nIf you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:\n\nBitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].\n\nFor increased performance, see BioSequences.AsciiAlphabet\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AsciiAlphabet","page":"BioSequences Types","title":"BioSequences.AsciiAlphabet","text":"AsciiAlphabet\n\nTrait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).\n\n\n\n\n\n","category":"type"},{"location":"types/#Concrete-types","page":"BioSequences Types","title":"Concrete types","text":"","category":"section"},{"location":"types/#Implemented-alphabets","page":"BioSequences Types","title":"Implemented alphabets","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"DNAAlphabet\nRNAAlphabet\nAminoAcidAlphabet","category":"page"},{"location":"types/#BioSequences.DNAAlphabet","page":"BioSequences Types","title":"BioSequences.DNAAlphabet","text":"DNA nucleotide alphabet.\n\nDNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. 
Currently supported values of N are 2 and 4.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.RNAAlphabet","page":"BioSequences Types","title":"BioSequences.RNAAlphabet","text":"RNA nucleotide alphabet.\n\nRNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AminoAcidAlphabet","page":"BioSequences Types","title":"BioSequences.AminoAcidAlphabet","text":"Amino acid alphabet.\n\n\n\n\n\n","category":"type"},{"location":"types/#Long-Sequences","page":"BioSequences Types","title":"Long Sequences","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"LongSequence","category":"page"},{"location":"types/#BioSequences.LongSequence","page":"BioSequences Types","title":"BioSequences.LongSequence","text":"LongSequence{A <: Alphabet}\n\nGeneral-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.\n\nExtended help\n\nLongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.\n\nAs the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. 
As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.\n\nFor example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.\n\nSymbols from multiple alphabets can't be intermixed in one sequence type.\n\nThe following table summarizes common LongSequence types that have been given aliases for convenience.\n\nType Symbol type Type alias\nLongSequence{DNAAlphabet{N}} DNA LongDNA{N}\nLongSequence{RNAAlphabet{N}} RNA LongRNA{N}\nLongSequence{AminoAcidAlphabet} AminoAcid LongAA\n\nThe LongDNA and LongRNA aliases use a DNAAlphabet{4}.\n\nDNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).\n\nIf you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.\n\nDNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).\n\nChanging this single parameter, is all you need to do in order to benefit from memory savings. 
Some computations that use bitwise operations will also be dramatically faster.\n\nThe same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.\n\n\n\n\n\n","category":"type"},{"location":"types/#Sequence-views","page":"BioSequences Types","title":"Sequence views","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences - the LongSubSeq{A<:Alphabet}.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"The purpose of LongSubSeq is that, since it only contains a pointer to the underlying array, an offset and a length, it is much lighter than a LongSequence, and will be stack allocated on Julia 1.5 and newer.
Thus, the user may construct millions of views without major performance implications.","category":"page"}] +[{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"symbols/#Biological-symbols","page":"Biological Symbols","title":"Biological symbols","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"The BioSequences module reexports the biological symbol (character) types that are provided by BioSymbols.jl:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Type Meaning\nDNA DNA nucleotide\nRNA RNA nucleotide\nAminoAcid Amino acid","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"These symbols are elements of biological sequence types, just as characters are elements of strings.","category":"page"},{"location":"symbols/#DNA-and-RNA-nucleotides","page":"Biological Symbols","title":"DNA and RNA nucleotides","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Set of nucleotide symbols in BioSequences covers IUPAC nucleotide base plus a gap symbol:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbol Constant Meaning\n'A' DNA_A / RNA_A A; Adenine\n'C' DNA_C / RNA_C C; Cytosine\n'G' DNA_G / RNA_G G; Guanine\n'T' DNA_T T; Thymine (DNA only)\n'U' RNA_U U; Uracil (RNA only)\n'M' DNA_M / RNA_M A or C\n'R' DNA_R / RNA_R A or G\n'W' DNA_W / RNA_W A or T/U\n'S' DNA_S / RNA_S C or G\n'Y' DNA_Y / RNA_Y C or T/U\n'K' DNA_K / RNA_K G or T/U\n'V' DNA_V / RNA_V A or C or G; not T/U\n'H' DNA_H / RNA_H A or C or T; not G\n'D' DNA_D / RNA_D A or G or T/U; not C\n'B' DNA_B / RNA_B C or G or T/U; not A\n'N' DNA_N / RNA_N A or C or G or 
T/U\n'-' DNA_Gap / RNA_Gap Gap (none of the above)","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"https://www.bioinformatics.org/sms/iupac.html","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols are accessible as constants with DNA_ or RNA_ prefix:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> DNA_A\nDNA_A\n\njulia> DNA_T\nDNA_T\n\njulia> RNA_U\nRNA_U\n\njulia> DNA_Gap\nDNA_Gap\n\njulia> typeof(DNA_A)\nDNA\n\njulia> typeof(RNA_A)\nRNA\n","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols can be constructed by converting regular characters:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> convert(DNA, 'C')\nDNA_C\n\njulia> convert(DNA, 'C') === DNA_C\ntrue\n","category":"page"},{"location":"symbols/#Amino-acids","page":"Biological Symbols","title":"Amino acids","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Set of amino acid symbols also covers IUPAC amino acid symbols plus a gap symbol:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbol Constant Meaning\n'A' AA_A Alanine\n'R' AA_R Arginine\n'N' AA_N Asparagine\n'D' AA_D Aspartic acid (Aspartate)\n'C' AA_C Cysteine\n'Q' AA_Q Glutamine\n'E' AA_E Glutamic acid (Glutamate)\n'G' AA_G Glycine\n'H' AA_H Histidine\n'I' AA_I Isoleucine\n'L' AA_L Leucine\n'K' AA_K Lysine\n'M' AA_M Methionine\n'F' AA_F Phenylalanine\n'P' AA_P Proline\n'S' AA_S Serine\n'T' AA_T Threonine\n'W' AA_W Tryptophan\n'Y' AA_Y Tyrosine\n'V' AA_V Valine\n'O' AA_O Pyrrolysine\n'U' AA_U Selenocysteine\n'B' AA_B Aspartic acid or Asparagine\n'J' AA_J Leucine or Isoleucine\n'Z' AA_Z Glutamine or Glutamic 
acid\n'X' AA_X Any amino acid\n'*' AA_Term Termination codon\n'-' AA_Gap Gap (none of the above)","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"https://www.bioinformatics.org/sms/iupac.html","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols are accessible as constants with AA_ prefix:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> AA_A\nAA_A\n\njulia> AA_Q\nAA_Q\n\njulia> AA_Term\nAA_Term\n\njulia> typeof(AA_A)\nAminoAcid\n","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols can be constructed by converting regular characters:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> convert(AminoAcid, 'A')\nAA_A\n\njulia> convert(AminoAcid, 'P') === AA_P\ntrue\n","category":"page"},{"location":"symbols/#Other-functions","page":"Biological Symbols","title":"Other functions","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"alphabet\ngap\niscompatible\nisambiguous","category":"page"},{"location":"symbols/#BioSymbols.alphabet","page":"Biological Symbols","title":"BioSymbols.alphabet","text":"alphabet(DNA)\n\nGet all symbols of DNA in sorted order.\n\nExamples\n\njulia> alphabet(DNA)\n(DNA_Gap, DNA_A, DNA_C, DNA_M, DNA_G, DNA_R, DNA_S, DNA_V, DNA_T, DNA_W, DNA_Y, DNA_H, DNA_K, DNA_D, DNA_B, DNA_N)\n\njulia> issorted(alphabet(DNA))\ntrue\n\n\n\n\n\n\nalphabet(RNA)\n\nGet all symbols of RNA in sorted order.\n\nExamples\n\njulia> alphabet(RNA)\n(RNA_Gap, RNA_A, RNA_C, RNA_M, RNA_G, RNA_R, RNA_S, RNA_V, RNA_U, RNA_W, RNA_Y, RNA_H, RNA_K, RNA_D, RNA_B, RNA_N)\n\njulia> issorted(alphabet(RNA))\ntrue\n\n\n\n\n\n\nalphabet(AminoAcid)\n\nGet all symbols of AminoAcid in sorted 
order.\n\nExamples\n\njulia> alphabet(AminoAcid)\n(AA_A, AA_R, AA_N, AA_D, AA_C, AA_Q, AA_E, AA_G, AA_H, AA_I, AA_L, AA_K, AA_M, AA_F, AA_P, AA_S, AA_T, AA_W, AA_Y, AA_V, AA_O, AA_U, AA_B, AA_J, AA_Z, AA_X, AA_Term, AA_Gap)\n\njulia> issorted(alphabet(AminoAcid))\ntrue\n\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.gap","page":"Biological Symbols","title":"BioSymbols.gap","text":"gap(::Type{T})::T\n\nReturn the gap (indel) representation of T. By default, gap is defined for DNA, RNA, AminoAcid and Char.\n\nExamples\n\njulia> gap(RNA)\nRNA_Gap\n\njulia> gap(Char)\n'-': ASCII/Unicode U+002D (category Pd: Punctuation, dash)\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.iscompatible","page":"Biological Symbols","title":"BioSymbols.iscompatible","text":"iscompatible(x::S, y::S) where S <: BioSymbol\n\nTest if x and y are compatible with each other.\n\nExamples\n\njulia> iscompatible(AA_A, AA_R)\nfalse\n\njulia> iscompatible(AA_A, AA_X)\ntrue\n\njulia> iscompatible(DNA_A, DNA_A)\ntrue\n\njulia> iscompatible(DNA_C, DNA_N) # DNA_N can be DNA_C\ntrue\n\njulia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C\nfalse\n\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.isambiguous","page":"Biological Symbols","title":"BioSymbols.isambiguous","text":"isambiguous(nt::NucleicAcid)\n\nTest if nt is an ambiguous nucleotide.\n\n\n\n\n\nisambiguous(aa::AminoAcid)\n\nTest if aa is an ambiguous amino acid.\n\n\n\n\n\n","category":"function"},{"location":"io/#I/O-for-sequencing-file-formats","page":"I/O","title":"I/O for sequencing file formats","text":"","category":"section"},{"location":"io/","page":"I/O","title":"I/O","text":"Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"After version v2.0, in order to neatly separate concerns, these submodules were 
removed.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"A list of all of the different formats and packages is provided below to help you find them quickly.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"Format Package\nFASTA FASTX.jl\nFASTQ FASTX.jl\n2Bit TwoBit.jl","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"counting/#Counting","page":"Counting","title":"Counting","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"BioSequences contains functionality to efficiently count biosymbols in a biosequence that satisfy some predicate.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"Consider a naive counting function like this:","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"function count_Ns(seq::BioSequence{<:DNAAlphabet})\n ns = 0\n for i in seq\n ns += (i == DNA_N)::Bool\n end\n ns\nend ","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"This function can be more efficiently implemented by exploiting the internal data layout of certain biosequences. Therefore, BioSequences provides optimised methods for Base.count, such that count_Ns above can be more efficiently expressed as count(==(DNA_N), seq).","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"note: Note\nIt is important to understand that this speed is achieved with custom methods of Base.count, and not by a generic mechanism that improves the speed of counting symbols in BioSequences in general.
Hence, while count(==(DNA_N), seq) may be optimised, count(i -> i == DNA_N, seq) is not, as this is a different method.","category":"page"},{"location":"counting/#Currently-optimised-methods","page":"Counting","title":"Currently optimised methods","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"By default, only the BioSequence and Alphabet types found in BioSequences.jl have optimised methods.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"count(isGC, seq)\ncount(isambiguous, seq)\ncount(iscertain, seq)\ncount(isgap, seq)\ncount(==(biosymbol), seq) and count(isequal(biosymbol), seq)","category":"page"},{"location":"counting/#Matches-and-mismatches","page":"Counting","title":"Matches and mismatches","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"The methods matches and mismatches take two sequences and count the number of positions where the sequences are equal or unequal, respectively.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"matches is equivalent to matches(a, b) = count(splat(==), zip(a, b)), and mismatches is the same with != in place of ==.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"matches\nmismatches","category":"page"},{"location":"counting/#BioSequences.matches","page":"Counting","title":"BioSequences.matches","text":"matches(a::BioSequence, b::BioSequence) -> Int\n\nCount the number of positions where a and b are equal. If the lengths of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.\n\nwarning: Warning\nPassing in two sequences with differing lengths is deprecated.
In a future, breaking release of BioSequences, this will throw an error.\n\nExamples\n\njulia> matches(dna\"TAWNNA\", dna\"TACCTA\")\n3\n\njulia> matches(dna\"AACA\", dna\"AAG\")\n2\n\n\n\n\n\n","category":"function"},{"location":"counting/#BioSequences.mismatches","page":"Counting","title":"BioSequences.mismatches","text":"mismatches(a::BioSequence, b::BioSequence) -> Int\n\nCount the number of positions where a and b differ. If the lengths of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.\n\nwarning: Warning\nPassing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will throw an error.\n\nExamples\n\njulia> mismatches(dna\"TAGCTA\", dna\"TACNTA\")\n2\n\njulia> mismatches(dna\"AACA\", dna\"AAG\")\n1\n\n\n\n\n\n","category":"function"},{"location":"counting/#GC-content","page":"Counting","title":"GC content","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"The convenience function gc_content(seq) is equivalent to count(isGC, seq) / length(seq):","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"gc_content","category":"page"},{"location":"counting/#BioSequences.gc_content","page":"Counting","title":"BioSequences.gc_content","text":"gc_content(seq::BioSequence) -> Float64\n\nCalculate GC content of seq, i.e.
the number of symbols that are DNA_C, DNA_G, RNA_C or RNA_G divided by the length of the sequence.\n\nExamples\n\njulia> gc_content(dna\"AGCTA\")\n0.4\n\njulia> gc_content(rna\"UAGCGA\")\n0.5\n\n\n\n\n\n","category":"function"},{"location":"counting/#Deprecated-aliases","page":"Counting","title":"Deprecated aliases","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"Several of the optimised count methods also have named function aliases, which are deprecated:","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"Deprecated function Instead use\nn_gaps count(isgap, seq)\nn_certain count(iscertain, seq)\nn_ambiguous count(isambiguous, seq)","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"n_gaps\nn_certain\nn_ambiguous","category":"page"},{"location":"counting/#BioSequences.n_gaps","page":"Counting","title":"BioSequences.n_gaps","text":"n_gaps(a::BioSequence, [b::BioSequence]) -> Int\n\nCount the number of positions where a (or b, if present) have gaps. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence.\n\nwarning: Warning\nPassing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError\n\nExamples\n\njulia> n_gaps(dna\"--TAC-WN-ACY\")\n4\n\njulia> n_gaps(dna\"TC-AC-\", dna\"-CACG\")\n2\n\n\n\n\n\n","category":"function"},{"location":"counting/#BioSequences.n_certain","page":"Counting","title":"BioSequences.n_certain","text":"n_certain(a::BioSequence, [b::BioSequence]) -> Int\n\nCount the number of positions where a (and b, if present) have certain (i.e. non-ambiguous and non-gap) symbols. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. Gaps are not certain.\n\nwarning: Warning\nPassing in two sequences is deprecated. 
In a future, breaking release of BioSequences, this will throw a MethodError\n\nExamples\n\njulia> n_certain(dna\"--TAC-WN-ACY\")\n5\n\njulia> n_certain(rna\"UAYWW\", rna\"UAW\")\n2\n\n\n\n\n\n","category":"function"},{"location":"counting/#BioSequences.n_ambiguous","page":"Counting","title":"BioSequences.n_ambiguous","text":"n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int\n\nCount the number of positions where a (or b, if present) have ambiguous symbols. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. Gaps are not ambiguous.\n\nwarning: Warning\nPassing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError\n\nExamples\n\njulia> n_ambiguous(dna\"--TAC-WN-ACY\")\n3\n\njulia> n_ambiguous(rna\"UAYWW\", rna\"UAW\")\n1\n\n\n\n\n\n","category":"function"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"interfaces/#Custom-BioSequences-types","page":"Implementing custom types","title":"Custom BioSequences types","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"If you're developing your own bioinformatics package or method, you may find that the reference implementation of concrete LongSequence types provided in this package is not optimal for your purposes.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"This page describes the interfaces for BioSequences' core types for developers or other packages implementing their own sequence types or extending BioSequences functionality.","category":"page"},{"location":"interfaces/#Implementing-custom-Alphabets","page":"Implementing custom types","title":"Implementing custom 
Alphabets","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Recall the required methods that define the Alphabet interface. ","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"To create an example custom alphabet, we need to create a singleton type, that implements a few methods in order to conform to the interface as described in the Alphabet documentation.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Let's do that for a restricted Amino Acid alphabet. We can test that it conforms to the interface with the BioSequences.has_interface function.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"julia> struct ReducedAAAlphabet <: Alphabet end\n\njulia> Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid\n\njulia> BioSequences.BitsPerSymbol(::ReducedAAAlphabet) = BioSequences.BitsPerSymbol{4}()\n\njulia> function BioSequences.symbols(::ReducedAAAlphabet)\n (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F,\n AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M)\n end\n\njulia> const (ENC_LUT, DEC_LUT) = let\n enc_lut = fill(0xff, length(alphabet(AminoAcid)))\n dec_lut = fill(AA_A, length(symbols(ReducedAAAlphabet())))\n for (i, aa) in enumerate(symbols(ReducedAAAlphabet()))\n enc_lut[reinterpret(UInt8, aa) + 0x01] = i - 1\n dec_lut[i] = aa\n end\n (Tuple(enc_lut), Tuple(dec_lut))\n end\n((0x02, 0xff, 0x0b, 0x0a, 0x01, 0x0c, 0x09, 0x03, 0x0e, 0xff, 0x00, 0x0d, 0x0f, 0x07, 0x06, 0x04, 0x05, 0x08, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff), (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F, AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M))\n\njulia> function BioSequences.encode(::ReducedAAAlphabet, aa::AminoAcid)\n i = reinterpret(UInt8, aa) + 0x01\n (i ≥ 
length(ENC_LUT) || @inbounds ENC_LUT[i] === 0xff) && throw(DomainError(aa))\n (@inbounds ENC_LUT[i]) % UInt\n end\n\njulia> function BioSequences.decode(::ReducedAAAlphabet, x::UInt)\n x ≥ length(DEC_LUT) && throw(DomainError(x))\n @inbounds DEC_LUT[x + UInt(1)]\n end\n\njulia> BioSequences.has_interface(Alphabet, ReducedAAAlphabet())\ntrue\n","category":"page"},{"location":"interfaces/#Implementing-custom-BioSequences","page":"Implementing custom types","title":"Implementing custom BioSequences","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Recall the required methods that define the BioSequence interface. ","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"To create an example custom sequence type, we need to define a type that implements a few methods in order to conform to the interface as described in the BioSequence documentation.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Let's do that for a custom sequence type that is optimised to represent a small sequence: a Codon. 
We can test that it conforms to the interface with the BioSequences.has_interface function.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"julia> struct Codon <: BioSequence{RNAAlphabet{2}}\n x::UInt8\n end\n\njulia> function Codon(iterable)\n length(iterable) == 3 || error(\"Must have length 3\")\n x = zero(UInt)\n for (i, nt) in enumerate(iterable)\n x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6-2i)\n end\n Codon(x % UInt8)\n end\nCodon\n\njulia> Base.length(::Codon) = 3\n\njulia> BioSequences.encoded_data_eltype(::Type{Codon}) = UInt\n\njulia> function BioSequences.extract_encoded_element(x::Codon, i::Int)\n ((x.x >>> (6-2i)) & 3) % UInt\n end\n\njulia> Base.copy(seq::Codon) = Codon(seq.x)\n\njulia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false)\ntrue","category":"page"},{"location":"interfaces/#Interface-checking-functions","page":"Implementing custom types","title":"Interface checking functions","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"BioSequences.has_interface","category":"page"},{"location":"interfaces/#BioSequences.has_interface","page":"Implementing custom types","title":"BioSequences.has_interface","text":"function has_interface(::Type{Alphabet}, A::Alphabet)\n\nReturns whether A conforms to the Alphabet interface.\n\n\n\n\n\nhas_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)\n\nCheck if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. 
If the compat flag is set, check for compatibility with existing alphabets.\n\n\n\n\n\n","category":"function"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"random/#Generating-random-sequences","page":"Random sequences","title":"Generating random sequences","text":"","category":"section"},{"location":"random/#Long-sequences","page":"Random sequences","title":"Long sequences","text":"","category":"section"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"You can generate random long sequences using the randseq family of functions and the samplers implemented in BioSequences:","category":"page"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"randseq\nranddnaseq\nrandrnaseq\nrandaaseq\nSamplerUniform\nSamplerWeighted","category":"page"},{"location":"random/#BioSequences.randseq","page":"Random sequences","title":"BioSequences.randseq","text":"randseq([rng::AbstractRNG], A::Alphabet, len::Integer)\n\nGenerate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. User-defined alphabets should implement this method to support random LongSequence generation.\n\nFor RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. 
For a user-defined alphabet A, the default is uniform across all elements of symbols(A).\n\nExample:\n\njulia> seq = randseq(AminoAcidAlphabet(), 50)\n50aa Amino Acid Sequence:\nVFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM\n\n\n\n\n\nrandseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)\n\nGenerate a LongSequence{A} of length len with elements drawn from the given sampler.\n\nExample:\n\n# Generate 50-length RNA with 4% chance of N, 24% each for A, C, G, and U\njulia> sp = SamplerWeighted(rna\"ACGUN\", fill(0.24, 4))\njulia> seq = randseq(RNAAlphabet{4}(), sp, 50)\n50nt RNA Sequence:\nCUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randdnaseq","page":"Random sequences","title":"BioSequences.randdnaseq","text":"randdnaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randrnaseq","page":"Random sequences","title":"BioSequences.randrnaseq","text":"randrnaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randaaseq","page":"Random sequences","title":"BioSequences.randaaseq","text":"randaaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.SamplerUniform","page":"Random sequences","title":"BioSequences.SamplerUniform","text":"SamplerUniform{T}\n\nUniform sampler of type T. 
Instantiate with a collection of eltype T containing the elements to sample.\n\nExamples\n\njulia> sp = SamplerUniform(rna\"ACGU\");\n\n\n\n\n\n","category":"type"},{"location":"random/#BioSequences.SamplerWeighted","page":"Random sequences","title":"BioSequences.SamplerWeighted","text":"SamplerWeighted{T}\n\nWeighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.\n\nExamples\n\njulia> sp = SamplerWeighted(rna\"ACGUN\", fill(0.2475, 4));\n\n\n\n\n\n","category":"type"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"transforms/#Indexing-and-modifying-sequences","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"","category":"section"},{"location":"transforms/#Indexing","page":"Indexing & modifying sequences","title":"Indexing","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Most concrete BioSequence subtypes behave like other vector or string types. 
They can be indexed using integers or ranges:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"For example, with LongSequences:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTTTANAGTNNAGTACC\"\n19nt DNA Sequence:\nACGTTTANAGTNNAGTACC\n\njulia> seq[5]\nDNA_T\n\njulia> seq[6:end]\n14nt DNA Sequence:\nTANAGTNNAGTACC\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"The biological symbol at a given locus in a biological sequence can be set using setindex!:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTTTANAGTNNAGTACC\"\n19nt DNA Sequence:\nACGTTTANAGTNNAGTACC\n\njulia> seq[5] = DNA_A\nDNA_A\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"note: Note\nSome types can be indexed using integers but not using ranges.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"For LongSequence types, indexing a sequence by range creates a copy of the selected subsequence, similar to Array in Julia's Base library. 
If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.","category":"page"},{"location":"transforms/#Modifying-sequences","page":"Indexing & modifying sequences","title":"Modifying sequences","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"In addition to setindex!, many other modifying operations are possible for biological sequences, such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"push!(::BioSequences.BioSequence, ::Any)\npop!(::BioSequences.BioSequence)\npushfirst!(::BioSequences.BioSequence, ::Any)\npopfirst!(::BioSequences.BioSequence)\ninsert!(::BioSequences.BioSequence, ::Integer, ::Any)\ndeleteat!(::BioSequences.BioSequence, ::Integer)\nappend!(::BioSequences.BioSequence, ::BioSequences.BioSequence)\nresize!(::BioSequences.LongSequence, ::Integer)\nempty!(::BioSequences.BioSequence)","category":"page"},{"location":"transforms/#Base.push!-Tuple{BioSequence, Any}","page":"Indexing & modifying sequences","title":"Base.push!","text":"push!(seq::BioSequence, x)\n\nAppend a biological symbol x to a biological sequence seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.pop!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.pop!","text":"pop!(seq::BioSequence)\n\nRemove the symbol from the end of a biological sequence seq and return it. 
Returns a variable of eltype(seq).\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.pushfirst!-Tuple{BioSequence, Any}","page":"Indexing & modifying sequences","title":"Base.pushfirst!","text":"pushfirst!(seq, x)\n\nInsert a biological symbol x at the beginning of a biological sequence seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.popfirst!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.popfirst!","text":"popfirst!(seq)\n\nRemove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.insert!-Tuple{BioSequence, Integer, Any}","page":"Indexing & modifying sequences","title":"Base.insert!","text":"insert!(seq::BioSequence, i, x)\n\nInsert a biological symbol x into a biological sequence seq, at the given index i.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.deleteat!-Tuple{BioSequence, Integer}","page":"Indexing & modifying sequences","title":"Base.deleteat!","text":"deleteat!(seq::BioSequence, i::Integer)\n\nDelete a biological symbol at a single position i in a biological sequence seq.\n\nModifies the input sequence.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.append!-Tuple{BioSequence, BioSequence}","page":"Indexing & modifying sequences","title":"Base.append!","text":"append!(seq, other)\n\nAdd a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.resize!-Tuple{LongSequence, Integer}","page":"Indexing & modifying sequences","title":"Base.resize!","text":"resize!(seq, size, [force::Bool=false])\n\nResize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. 
If force, always resize underlying data array.\n\nNote that resizing to a larger size, and then loading from uninitialized positions is not allowed and may cause undefined behaviour. Make sure to always fill any uninitialized biosymbols after resizing.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.empty!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.empty!","text":"empty!(seq::BioSequence)\n\nCompletely empty a biological sequence seq of nucleotides.\n\n\n\n\n\n","category":"method"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Here are some examples:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACG\"\n3nt DNA Sequence:\nACG\n\njulia> push!(seq, DNA_T)\n4nt DNA Sequence:\nACGT\n\njulia> append!(seq, dna\"AT\")\n6nt DNA Sequence:\nACGTAT\n\njulia> deleteat!(seq, 2)\n5nt DNA Sequence:\nAGTAT\n\njulia> deleteat!(seq, 2:3)\n3nt DNA Sequence:\nAAT\n","category":"page"},{"location":"transforms/#Additional-transformations","page":"Indexing & modifying sequences","title":"Additional transformations","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"reverse!(::BioSequences.LongSequence)\nreverse(::BioSequences.LongSequence{<:NucleicAcidAlphabet})\ncomplement!\ncomplement\nreverse_complement!\nreverse_complement\nungap!\nungap\ncanonical!\ncanonical","category":"page"},{"location":"transforms/#Base.reverse!-Tuple{LongSequence}","page":"Indexing & modifying 
sequences","title":"Base.reverse!","text":"reverse!(seq::LongSequence)\n\nReverse a biological sequence seq in place.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.reverse-Tuple{LongSequence{<:NucleicAcidAlphabet}}","page":"Indexing & modifying sequences","title":"Base.reverse","text":"reverse(seq::BioSequence)\n\nCreate reversed copy of a biological sequence.\n\n\n\n\n\nreverse(seq::LongSequence)\n\nCreate reversed copy of a biological sequence.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#BioSequences.complement!","page":"Indexing & modifying sequences","title":"BioSequences.complement!","text":"complement!(seq)\n\nMake a complement sequence of seq in place.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSymbols.complement","page":"Indexing & modifying sequences","title":"BioSymbols.complement","text":"complement(nt::NucleicAcid)\n\nReturn the complementary nucleotide of nt.\n\nThis function returns the union of all possible complementary nucleotides.\n\nExamples\n\njulia> complement(DNA_A)\nDNA_T\n\njulia> complement(DNA_N)\nDNA_N\n\njulia> complement(RNA_U)\nRNA_A\n\n\n\n\n\n\ncomplement(seq)\n\nMake a complement sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.reverse_complement!","page":"Indexing & modifying sequences","title":"BioSequences.reverse_complement!","text":"reverse_complement!(seq)\n\nMake a reversed complement sequence of seq in place.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.reverse_complement","page":"Indexing & modifying sequences","title":"BioSequences.reverse_complement","text":"reverse_complement(seq)\n\nMake a reversed complement sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ungap!","page":"Indexing & modifying sequences","title":"BioSequences.ungap!","text":"Remove gap characters from an input 
sequence.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ungap","page":"Indexing & modifying sequences","title":"BioSequences.ungap","text":"Create a copy of a sequence with gap characters removed.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.canonical!","page":"Indexing & modifying sequences","title":"BioSequences.canonical!","text":"canonical!(seq::NucleotideSeq)\n\nTransforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.\n\nFor any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:\n\n------->\nATCGATCG\nCGATCGAT\n<-------\n\nnote: Note\nCalling reverse_complement on a DNA sequence will give this reverse complement.\n\nOf the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.\n\nUsing this function on a seq will ensure it is the canonical version.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.canonical","page":"Indexing & modifying sequences","title":"BioSequences.canonical","text":"canonical(seq::NucleotideSeq)\n\nCreate the canonical sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Some examples:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTAT\"\n6nt DNA Sequence:\nACGTAT\n\njulia> reverse!(seq)\n6nt DNA Sequence:\nTATGCA\n\njulia> complement!(seq)\n6nt DNA Sequence:\nATACGT\n\njulia> reverse_complement!(seq)\n6nt DNA Sequence:\nACGTAT\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Many of these methods also have a version which makes a copy of the input sequence, so you get a modified 
copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!. ","category":"page"},{"location":"transforms/#Translation","page":"Indexing & modifying sequences","title":"Translation","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Translation is a slightly more complex transformation for RNA sequences, so we describe it here in more detail.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes and they are registered in ncbi_trans_table.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"translate\nncbi_trans_table","category":"page"},{"location":"transforms/#BioSequences.translate","page":"Indexing & modifying sequences","title":"BioSequences.translate","text":"translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)\n\nTranslate a LongRNA or a LongDNA to a LongAA.\n\nTranslation uses the genetic code given by code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. 
For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ncbi_trans_table","page":"Indexing & modifying sequences","title":"BioSequences.ncbi_trans_table","text":"Genetic code list of NCBI.\n\nThe standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.\n\n\n\n\n\n","category":"constant"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> ncbi_trans_table\nTranslation Tables:\n 1. The Standard Code (standard_genetic_code)\n 2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)\n 3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)\n 4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)\n 5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)\n 6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)\n 9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)\n 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)\n 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)\n 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)\n 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)\n 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)\n 15. Blepharisma Macronuclear Code (blepharisma_macronuclear_genetic_code)\n 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)\n 21. 
Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)\n 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)\n 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)\n 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)\n 25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/#Construction-and-conversion","page":"Constructing sequences","title":"Construction & conversion","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Here we will showcase the various ways you can construct the various sequence types in BioSequences.","category":"page"},{"location":"construction/#Constructing-sequences","page":"Constructing sequences","title":"Constructing sequences","text":"","category":"section"},{"location":"construction/#From-strings","page":"Constructing sequences","title":"From strings","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequences can be constructed from strings using their constructors:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}(\"TTANC\")\n5nt DNA Sequence:\nTTANC\n\njulia> LongSequence{DNAAlphabet{2}}(\"TTAGC\")\n5nt DNA Sequence:\nTTAGC\n\njulia> LongRNA{4}(\"UUANC\")\n5nt RNA 
Sequence:\nUUANC\n\njulia> LongSequence{RNAAlphabet{2}}(\"UUAGC\")\n5nt RNA Sequence:\nUUAGC\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Type aliases can also be used for brevity.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}(\"TTANC\")\n5nt DNA Sequence:\nTTANC\n\njulia> LongDNA{2}(\"TTAGC\")\n5nt DNA Sequence:\nTTAGC\n\njulia> LongRNA{4}(\"UUANC\")\n5nt RNA Sequence:\nUUANC\n\njulia> LongRNA{2}(\"UUAGC\")\n5nt RNA Sequence:\nUUAGC","category":"page"},{"location":"construction/#Constructing-sequences-from-arrays-of-BioSymbols","page":"Constructing sequences","title":"Constructing sequences from arrays of BioSymbols","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequences can be constructed using vectors or arrays of a BioSymbol type:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])\n5nt DNA Sequence:\nTTANC\n\njulia> LongSequence{DNAAlphabet{2}}([DNA_T, DNA_T, DNA_A, DNA_G, DNA_C])\n5nt DNA Sequence:\nTTAGC\n","category":"page"},{"location":"construction/#Constructing-sequences-from-other-sequences","page":"Constructing sequences","title":"Constructing sequences from other sequences","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You can create sequences by concatenating other sequences together:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{2}(\"ACGT\") * LongDNA{2}(\"TGCA\")\n8nt DNA Sequence:\nACGTTGCA\n\njulia> repeat(LongDNA{4}(\"TA\"), 10)\n20nt DNA Sequence:\nTATATATATATATATATATA\n\njulia> 
LongDNA{4}(\"TA\") ^ 10\n20nt DNA Sequence:\nTATATATATATATATATATA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequence views (LongSubSeqs) are special, in that they do not own their own data, and must be constructed from a LongSequence or another LongSubSeq:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> seq = LongDNA{4}(\"TACGGACATTA\")\n11nt DNA Sequence:\nTACGGACATTA\n\njulia> seqview = LongSubSeq(seq, 3:7)\n5nt DNA Sequence:\nCGGAC\n\njulia> seqview2 = @view seq[1:3]\n3nt DNA Sequence:\nTAC\n\njulia> typeof(seqview) == typeof(seqview2) && typeof(seqview) <: LongSubSeq\ntrue\n","category":"page"},{"location":"construction/#Conversion-of-sequence-types","page":"Constructing sequences","title":"Conversion of sequence types","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You can convert between sequence types, if the sequences are compatible - that is, if the source sequence does not contain symbols that are un-encodable by the destination type.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> dna = dna\"TTACGTAGACCG\"\n12nt DNA Sequence:\nTTACGTAGACCG\n\njulia> dna2 = convert(LongDNA{2}, dna)\n12nt DNA Sequence:\nTTACGTAGACCG","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DNA/RNA are special in that they can be converted to each other, despite containing distinct symbols. 
When doing so, DNA_T is converted to RNA_U and vice versa.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> convert(LongRNA{2}, dna\"TAGCTAGG\")\n8nt RNA Sequence:\nUAGCUAGG","category":"page"},{"location":"construction/#String-literals","page":"Constructing sequences","title":"String literals","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"BioSequences provides several string literal macros for creating sequences.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"note: Note\nWhen you use literals you may mix the case of characters.","category":"page"},{"location":"construction/#Long-sequence-literals","page":"Constructing sequences","title":"Long sequence literals","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> dna\"TACGTANNATC\"\n11nt DNA Sequence:\nTACGTANNATC\n\njulia> rna\"AUUUGNCCANU\"\n11nt RNA Sequence:\nAUUUGNCCANU\n\njulia> aa\"ARNDCQEGHILKMFPSTWYVX\"\n21aa Amino Acid Sequence:\nARNDCQEGHILKMFPSTWYVX","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"However, it should be noted that by default these sequence literals allocate the LongSequence object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect. 
For example consider the following code:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> function foo()\n s = dna\"CTT\"\n push!(s, DNA_A)\n end\nfoo (generic function with 1 method)\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"d\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You might expect that every time you call foo, that a DNA sequence CTTA would be returned. You might expect that this is because every time foo is called, a new DNA sequence variable CTT is created, and the A nucleotide is pushed to it, and the result, CTTA is returned. In other words you might expect the following output:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"However, this is not what happens, instead the following happens:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"s\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n5nt DNA Sequence:\nCTTAA\n\njulia> foo()\n6nt DNA Sequence:\nCTTAAA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"The reason for this is because the 
sequence literal is allocated only once before the first time the function foo is called and run. Therefore, s in foo is always a reference to that one sequence that was allocated. So one sequence is created before foo is called, and then it is pushed to every time foo is called. Thus, that one allocated sequence grows with every call of foo.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"If you wanted foo to create a new sequence each time it is called, then you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However providing 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated whilst the code is running, and not before. So to change foo so as it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> function foo()\n s = dna\"CTT\"d # 'd' flag appended to the string literal.\n push!(s, DNA_A)\n end\nfoo (generic function with 1 method)\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Now every time foo is called, a new sequence CTT is created, and an A nucleotide is pushed to it:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"d\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing 
sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"So the take home message of sequence literals is this:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"@dna_str\n@rna_str\n@aa_str","category":"page"},{"location":"construction/#BioSequences.@dna_str","page":"Constructing sequences","title":"BioSequences.@dna_str","text":"@dna_str(seq, flag=\"s\") -> LongDNA{4}\n\nCreate a LongDNA{4} sequence at parse time from string seq. If flag is \"s\" ('static', the default), the sequence is created at parse time, and inserted directly into the returned expression. A static string ought not to be mutated. Alternatively, if flag is \"d\" (dynamic), a new sequence is parsed and created whenever the code where the macro is placed is run.\n\nSee also: @aa_str, @rna_str\n\nExamples\n\nIn the example below, the static sequence is created once, at parse time, NOT when the function f is run. 
This means it is the same sequence that is pushed to repeatedly.\n\njulia> f() = dna\"TAG\";\n\njulia> string(push!(f(), DNA_A)) # NB: Mutates static string!\n\"TAGA\"\n\njulia> string(push!(f(), DNA_A))\n\"TAGAA\"\n\njulia> f() = dna\"TAG\"d; # dynamically make seq\n\njulia> string(push!(f(), DNA_A))\n\"TAGA\"\n\njulia> string(push!(f(), DNA_A))\n\"TAGA\"\n\n\n\n\n\n","category":"macro"},{"location":"construction/#BioSequences.@rna_str","page":"Constructing sequences","title":"BioSequences.@rna_str","text":"The LongRNA{4} equivalent to @dna_str\n\nSee also: @dna_str, @aa_str\n\nExamples\n\njulia> rna\"UCGUGAUGC\"\n9nt RNA Sequence:\nUCGUGAUGC\n\n\n\n\n\n","category":"macro"},{"location":"construction/#BioSequences.@aa_str","page":"Constructing sequences","title":"BioSequences.@aa_str","text":"The AminoAcidAlphabet equivalent to @dna_str\n\nSee also: @dna_str, @rna_str\n\nExamples\n\njulia> aa\"PKLEQC\"\n6aa Amino Acid Sequence:\nPKLEQC\n\n\n\n\n\n","category":"macro"},{"location":"construction/#Loose-parsing","page":"Constructing sequences","title":"Loose parsing","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"As of version 3.2.0, BioSequences.jl provides the bioseq function, which can be used to build a LongSequence from a string (or an AbstractVector{UInt8}) without knowing the correct Alphabet.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> bioseq(\"ATGTGCTGA\")\n9nt DNA Sequence:\nATGTGCTGA","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"The function will prioritise 2-bit alphabets over 4-bit alphabets, and prefer smaller alphabets (like DNAAlphabet{4}) over larger (like AminoAcidAlphabet). 
If the input cannot be encoded by any of the built-in alphabets, an error is thrown:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> bioseq(\"0!(CC!;#&&%\")\nERROR: cannot encode 0x30 in AminoAcidAlphabet\n[...]","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Note that this function is only intended to be used for interactive, ephemeral work. The function is necessarily type unstable, and the precise returned alphabet for a given input is a heuristic which is subject to change.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"bioseq\nguess_alphabet","category":"page"},{"location":"construction/#BioSequences.bioseq","page":"Constructing sequences","title":"BioSequences.bioseq","text":"bioseq(s::Union{AbstractString, AbstractVector{UInt8}}) -> LongSequence\n\nParse s into a LongSequence with an appropriate Alphabet, or throw an exception if no alphabet matches. See guess_alphabet for the available alphabets and the alphabet priority.\n\nwarning: Warning\nThe functions bioseq and guess_alphabet are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.\n\nExamples\n\njulia> bioseq(\"QMKLPEEFW\")\n9aa Amino Acid Sequence:\nQMKLPEEFW\n\njulia> bioseq(\"UAUGCUGUAGG\")\n11nt RNA Sequence:\nUAUGCUGUAGG\n\njulia> bioseq(\"PKMW#3>>0;kL\")\nERROR: cannot encode 0x23 in AminoAcidAlphabet\n[...]\n\n\n\n\n\n","category":"function"},{"location":"construction/#BioSequences.guess_alphabet","page":"Constructing sequences","title":"BioSequences.guess_alphabet","text":"guess_alphabet(s::Union{AbstractString, AbstractVector{UInt8}}) -> Union{Integer, Alphabet}\n\nPick an Alphabet that can encode input s. 
If no Alphabet can, return the index of the first byte of the input which is not encodable in any alphabet. This function only knows about the alphabets listed below. If multiple alphabets are possible, pick the first from the order below (i.e. DNAAlphabet{2}() if possible, otherwise RNAAlphabet{2}() etc).\n\nDNAAlphabet{2}()\nRNAAlphabet{2}()\nDNAAlphabet{4}()\nRNAAlphabet{4}()\nAminoAcidAlphabet()\n\nwarning: Warning\nThe functions bioseq and guess_alphabet are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.\n\nExamples\n\njulia> guess_alphabet(\"AGGCA\")\nDNAAlphabet{2}()\n\njulia> guess_alphabet(\"WKLQSTV\")\nAminoAcidAlphabet()\n\njulia> guess_alphabet(\"QAWT+!\")\n5\n\njulia> guess_alphabet(\"UAGCSKMU\")\nRNAAlphabet{4}()\n\n\n\n\n\n","category":"function"},{"location":"construction/#Comparison-to-other-sequence-types","page":"Constructing sequences","title":"Comparison to other sequence types","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. 
compare a BioSequence with a vector of DNA, compare the elements themselves:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> seq = dna\"GAGCTGA\"; vec = collect(seq);\n\njulia> seq == vec, isequal(seq, vec)\n(false, false)\n\njulia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))\ntrue ","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"sequence_search/#Searching-for-sequence-motifs","page":"Pattern matching and searching","title":"Searching for sequence motifs","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"There are many ways to search for particular motifs in biological sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Exact searches, where you are looking for exact matches of a particular character or substring.\nApproximate searches, where you are looking for sequences that are sufficiently similar to a given sequence or family of sequences.\nSearches where you are looking for sequences that conform to some sort of pattern.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Like other Julia sequences such as Vector, you can search a BioSequence with the findfirst(predicate, collection) method pattern.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"All these kinds of searches are provided in BioSequences.jl, and they all conform to the findnext, findprev, and occursin patterns 
established in Base for String and collections like Vector.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The exception is searching using the specialised regex provided in this package, which as you shall see, conforms to the match pattern established in Base for pcre and Strings.","category":"page"},{"location":"sequence_search/#Symbol-search","page":"Pattern matching and searching","title":"Symbol search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> seq = dna\"ACAGCGTAGCT\";\n\njulia> findfirst(DNA_A, seq)\n1\n\njulia> findlast(DNA_A, seq)\n8\n\njulia> findnext(DNA_A, seq, 2)\n3\n\njulia> findprev(DNA_A, seq, 7)\n3\n\njulia> findall(DNA_A, seq)\n3-element Vector{Int64}:\n 1\n 3\n 8","category":"page"},{"location":"sequence_search/#Exact-search","page":"Pattern matching and searching","title":"Exact search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"ExactSearchQuery","category":"page"},{"location":"sequence_search/#BioSequences.ExactSearchQuery","page":"Pattern matching and searching","title":"BioSequences.ExactSearchQuery","text":"ExactSearchQuery{F<:Function,S<:BioSequence}\n\nQuery type for exact sequence search.\n\nAn exact search is one where you are looking in some given sequence for exact instances of some given substring.\n\nThese queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.\n\nExamples\n\njulia> seq = dna\"ACAGCGTAGCT\";\n\njulia> query = ExactSearchQuery(dna\"AGC\");\n\njulia> findfirst(query, seq)\n3:5\n\njulia> findlast(query, seq)\n8:10\n\njulia> findnext(query, seq, 6)\n8:10\n\njulia> findprev(query, seq, 7)\n3:5\n\njulia> findall(query, 
seq)\n2-element Vector{UnitRange{Int64}}:\n 3:5\n 8:10\n\njulia> occursin(query, seq)\ntrue\n\n\nYou can pass a comparator function such as isequal or iscompatible to its constructor to modify the search behaviour.\n\nThe default is isequal, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.\n\njulia> query = ExactSearchQuery(dna\"CGT\", iscompatible);\n\njulia> findfirst(query, dna\"ACNT\") # 'N' matches 'G'\n2:4\n\njulia> findfirst(query, dna\"ACGT\") # 'G' matches 'N'\n2:4\n\njulia> occursin(ExactSearchQuery(dna\"CNT\", iscompatible), dna\"ACNT\")\ntrue\n\n\n\n\n\n\n","category":"type"},{"location":"sequence_search/#Allowing-mismatches","page":"Pattern matching and searching","title":"Allowing mismatches","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"ApproximateSearchQuery","category":"page"},{"location":"sequence_search/#BioSequences.ApproximateSearchQuery","page":"Pattern matching and searching","title":"BioSequences.ApproximateSearchQuery","text":"ApproximateSearchQuery{F<:Function,S<:BioSequence}\n\nQuery type for approximate sequence search.\n\nThese queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.\n\nUsing these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.\n\nIn other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.\n\nExamples\n\njulia> seq = dna\"ACAGCGTAGCT\";\n\njulia> query = ApproximateSearchQuery(dna\"AGGG\");\n\njulia> findfirst(query, 0, seq) == nothing # nothing matches with no errors\ntrue\n\njulia> findfirst(query, 1, seq) # seq[3:6] matches with one error\n3:6\n\njulia> findfirst(query, 2, seq) # seq[1:4] matches with two errors\n1:4\n\n\nYou can pass 
a comparator function such as isequal or iscompatible to its constructor to modify the search behaviour.\n\nThe default is isequal, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.\n\njulia> query = ApproximateSearchQuery(dna\"AGGG\", iscompatible);\n\njulia> occursin(query, 1, dna\"AAGNGG\") # 1 mismatch permitted (A vs G) & matched N\ntrue\n\njulia> findnext(query, 1, dna\"AAGNGG\", 1) # 1 mismatch permitted (A vs G) & matched N\n1:4\n\n\nnote: Note\nThis method of searching for motifs was implemented with smaller query motifs in mind. If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.\n\n\n\n\n\n","category":"type"},{"location":"sequence_search/#Searching-according-to-a-pattern","page":"Pattern matching and searching","title":"Searching according to a pattern","text":"","category":"section"},{"location":"sequence_search/#Regular-expression-search","page":"Pattern matching and searching","title":"Regular expression search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}(\"MV+\"). 
For bioregex literals, it is instead recommended using the @biore_str macro:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: \"dna\", \"rna\" or \"aa\". For example, biore\"A+\"dna is a regular expression for DNA sequences and biore\"A+\"aa is for amino acid sequences. The symbol options can be abbreviated to its first character: \"d\", \"r\" or \"a\", respectively.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Here are examples of using the regular expression for BioSequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> match(biore\"A+C*\"dna, dna\"AAAACC\")\nRegexMatch(\"AAAACC\")\n\njulia> match(biore\"A+C*\"d, dna\"AAAACC\")\nRegexMatch(\"AAAACC\")\n\njulia> occursin(biore\"A+C*\"dna, dna\"AAC\")\ntrue\n\njulia> occursin(biore\"A+C*\"dna, dna\"C\")\nfalse\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"match will return a RegexMatch if a match is found, otherwise it will return nothing if no match is found.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The table below summarizes available syntax elements.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Syntax Description Example\n| alternation \"A|T\" matches \"A\" and \"T\"\n* zero or more times repeat \"TA*\" matches \"T\", \"TA\" and \"TAA\"\n+ one or more times repeat \"TA+\" matches \"TA\" and \"TAA\"\n? 
zero or one time \"TA?\" matches \"T\" and \"TA\"\n{n,} n or more times repeat \"A{3,}\" matches \"AAA\" and \"AAAA\"\n{n,m} n-m times repeat \"A{3,5}\" matches \"AAA\", \"AAAA\" and \"AAAAA\"\n^ the start of the sequence \"^TAN*\" matches \"TATGT\"\n$ the end of the sequence \"N*TA$\" matches \"GCTA\"\n(...) pattern grouping \"(TA)+\" matches \"TA\" and \"TATA\"\n[...] one of symbols \"[ACG]+\" matches \"AGGC\"","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"eachmatch and findfirst are also defined, just like usual regex and strings found in Base.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> collect(matched(x) for x in eachmatch(biore\"TATA*?\"d, dna\"TATTATAATTA\")) # overlap\n4-element Vector{LongSequence{DNAAlphabet{4}}}:\n TAT \n TAT\n TATA\n TATAA\n\njulia> collect(matched(x) for x in eachmatch(biore\"TATA*\"d, dna\"TATTATAATTA\", false)) # no overlap\n2-element Vector{LongSequence{DNAAlphabet{4}}}:\n TAT \n TATAA\n\njulia> findfirst(biore\"TATA*\"d, dna\"TATTATAATTA\")\n1:3\n\njulia> findfirst(biore\"TATA*\"d, dna\"TATTATAATTA\", 2)\n4:8\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Noteworthy differences from strings are:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Ambiguous characters match any compatible characters (e.g. biore\"N\"d is equivalent to biore\"[ACGT]\"d).\nWhitespaces are ignored (e.g. biore\"A C G\"d is equivalent to biore\"ACG\"d).","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The PROSITE notation is described in ScanProsite - user manual. 
The syntax supports almost all notations including the extended syntax. The PROSITE notation starts with the prosite prefix and no symbol option is needed because it always describes patterns of amino acid sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> match(prosite\"[AC]-x-V-x(4)-{ED}\", aa\"CPVPQARG\")\nRegexMatch(\"CPVPQARG\")\n\njulia> match(prosite\"[AC]xVx(4){ED}\", aa\"CPVPQARG\")\nRegexMatch(\"CPVPQARG\")\n","category":"page"},{"location":"sequence_search/#Position-weight-matrix-search","page":"Pattern matching and searching","title":"Position weight matrix search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"A motif can be specified using a position weight matrix (PWM) in a probabilistic way. This method searches for the first position in the sequence where a score calculated using a PWM is greater than or equal to a threshold. More formally, denoting the sequence as S and the PWM value of symbol s at position j as M_{s,j}, the score starting from a position p is defined as","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"\\operatorname{score}(S, p) = \\sum_{i=1}^{L} M_{S[p+i-1],i}","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"and the search returns the smallest p that satisfies \\operatorname{score}(S, p) \\ge t.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"There are two kinds of matrices in this package: PFM and PWM. The PFM type is a position frequency matrix and stores symbol frequencies for each position. The PWM is a position weight matrix and stores symbol scores for each position. 
You can create a PFM from a set of sequences with the same length and then create a PWM from the PFM object.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> motifs = [dna\"TTA\", dna\"CTA\", dna\"ACA\", dna\"TCA\", dna\"GTA\"]\n5-element Vector{LongSequence{DNAAlphabet{4}}}:\n TTA\n CTA\n ACA\n TCA\n GTA\n\njulia> pfm = PFM(motifs) # sequence set => PFM\n4×3 PFM{DNA, Int64}:\n A 1 0 5\n C 1 2 0\n G 1 0 0\n T 2 3 0\n\njulia> pwm = PWM(pfm) # PFM => PWM\n4×3 PWM{DNA, Float64}:\n A -0.321928 -Inf 2.0\n C -0.321928 0.678072 -Inf\n G -0.321928 -Inf -Inf\n T 0.678072 1.26303 -Inf\n\njulia> pwm = PWM(pfm .+ 0.01) # add pseudo counts to avoid infinite values\n4×3 PWM{DNA, Float64}:\n A -0.319068 -6.97728 1.99139\n C -0.319068 0.673772 -6.97728\n G -0.319068 -6.97728 -6.97728\n T 0.673772 1.25634 -6.97728\n\njulia> pwm = PWM(pfm .+ 0.01, prior=[0.2, 0.3, 0.3, 0.2]) # GC-rich prior\n4×3 PWM{DNA, Float64}:\n A 0.00285965 -6.65535 2.31331\n C -0.582103 0.410737 -7.24031\n G -0.582103 -7.24031 -7.24031\n T 0.9957 1.57827 -6.65535\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The PWM_{s,j} matrix is computed from PFM_{s,j} and the prior probability p(s) as follows ([Wasserman2004]):","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"\\begin{align}\n PWM_{s,j} &= \\log_2 \\frac{p(s,j)}{p(s)} \\\\\n p(s,j) &= \\frac{PFM_{s,j}}{\\sum_s PFM_{s,j}}\n\\end{align}","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"However, if you just want to quickly conduct a search, constructing the PFM and PWM is done for you as a convenience if you build a PWMSearchQuery, using a collection of sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching 
and searching","title":"Pattern matching and searching","text":"julia> motifs = [dna\"TTA\", dna\"CTA\", dna\"ACA\", dna\"TCA\", dna\"GTA\"]\n5-element Vector{LongSequence{DNAAlphabet{4}}}:\n TTA\n CTA\n ACA\n TCA\n GTA\n\njulia> subject = dna\"TATTATAATTA\";\n\njulia> qa = PWMSearchQuery(motifs, 1.0);\n\njulia> findfirst(qa, subject)\n3\n\njulia> findall(qa, subject)\n3-element Vector{Int64}:\n 3\n 5\n 9","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"[Wasserman2004]: https://doi.org/10.1038/nrg1315","category":"page"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"predicates/#Predicates","page":"Predicates","title":"Predicates","text":"","category":"section"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.","category":"page"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"isrepetitive\nispalindromic\nhasambiguity\niscanonical","category":"page"},{"location":"predicates/#BioSequences.isrepetitive","page":"Predicates","title":"BioSequences.isrepetitive","text":"isrepetitive(seq::BioSequence, n::Integer = length(seq))\n\nReturn true if and only if seq contains a repetitive subsequence of length ≥ n.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.ispalindromic","page":"Predicates","title":"BioSequences.ispalindromic","text":"ispalindromic(seq::NucSeq) -> Bool\n\nCheck if seq is palindromic. 
A palindromic sequence is identical to its reverse-complement, so this should be equivalent to checking if seq == reverse_complement(seq).\n\nExamples\n\njulia> ispalindromic(dna\"TGCA\")\ntrue\n\njulia> ispalindromic(dna\"TCCT\")\nfalse\n\njulia> ispalindromic(rna\"ACGGU\")\nfalse\n\nReturn true if seq is a palindromic sequence; otherwise return false.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.hasambiguity","page":"Predicates","title":"BioSequences.hasambiguity","text":"hasambiguity(seq::BioSequence)\n\nReturns true if seq has an ambiguous symbol; otherwise return false.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.iscanonical","page":"Predicates","title":"BioSequences.iscanonical","text":"iscanonical(seq::NucleotideSeq)\n\nReturns true if seq is canonical.\n\nFor any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:\n\n------->\nATCGATCG\nCGATCGAT\n<-------\n\nnote: Note\nUsing the reverse_complement of a DNA sequence will give this reverse complement.\n\nOf the two sequences, the canonical one is the lesser of the two, i.e. 
canonical_seq < other_seq.\n\n\n\n\n\n","category":"function"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\n using BioSymbols\nend","category":"page"},{"location":"recipes/#Recipes","page":"Recipes","title":"Recipes","text":"","category":"section"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"This page provides tested example code to solve various common problems using BioSequences.","category":"page"},{"location":"recipes/#One-hot-encoding-biosequences","page":"Recipes","title":"One-hot encoding biosequences","text":"","category":"section"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"The types DNA, RNA and AminoAcid expose a binary representation through the exported function BioSymbols.compatbits, which is a one-hot encoding of:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"julia> using BioSymbols\n\njulia> compatbits(DNA_W)\n0x09\n\njulia> compatbits(AA_J)\n0x00000600","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"Each set bit in the encoding corresponds to a compatible unambiguous symbol. For example, for RNA, the four lower bits encode A, C, G, and U, in order. 
Hence, the symbol D, which is short for A, G or U, is encoded as 0x01 | 0x04 | 0x08 == 0x0d:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"julia> compatbits(RNA_D)\n0x0d\n\njulia> compatbits(RNA_A) | compatbits(DNA_G) | compatbits(RNA_U)\n0x0d","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"Using this, we can construct a function to one-hot encode sequences - in this example, nucleic acid sequences:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"function one_hot(s::NucSeq)\n M = falses(4, length(s))\n for (i, s) in enumerate(s)\n bits = compatbits(s)\n while !iszero(bits)\n M[trailing_zeros(bits) + 1, i] = true\n bits &= bits - one(bits) # clear lowest bit\n end\n end\n M\nend\n\none_hot(dna\"TGNTKCTW-T\")\n\n# output\n\n4×10 BitMatrix:\n 0 0 1 0 0 0 0 1 0 0\n 0 0 1 0 0 1 0 0 0 0\n 0 1 1 0 1 0 0 0 0 0\n 1 0 1 1 1 0 1 1 0 1","category":"page"},{"location":"#BioSequences","page":"Home","title":"BioSequences","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"(Image: Latest Release) (Image: MIT license) (Image: Documentation) (Image: Pkg Status)","category":"page"},{"location":"#Description","page":"Home","title":"Description","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"You can install BioSequences from the julia REPL. 
Press ] to enter pkg mode again, and enter the following:","category":"page"},{"location":"","page":"Home","title":"Home","text":"add BioSequences","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.","category":"page"},{"location":"#Testing","page":"Home","title":"Testing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.","category":"page"},{"location":"","page":"Home","title":"Home","text":"(Image: Unit tests) (Image: Documentation) (Image: )","category":"page"},{"location":"#Contributing","page":"Home","title":"Contributing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.","category":"page"},{"location":"","page":"Home","title":"Home","text":"Take a look at the contributing files detailed contributor and maintainer guidelines, and code of conduct.","category":"page"},{"location":"#Questions?","page":"Home","title":"Questions?","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia SLack, or you can try the Bio category of the Julia discourse site.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"types/#Abstract-Types","page":"BioSequences Types","title":"Abstract Types","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences exports an abstract 
BioSequence type, and several concrete sequence types which inherit from it.","category":"page"},{"location":"types/#The-abstract-BioSequence","page":"BioSequences Types","title":"The abstract BioSequence","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences provides an abstract type called a BioSequence{A<:Alphabet}. This abstract type, and the methods and traits is supports, allows for many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, providing that key methods or traits are defined).","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequence","category":"page"},{"location":"types/#BioSequences.BioSequence","page":"BioSequences Types","title":"BioSequences.BioSequence","text":"BioSequence{A <: Alphabet}\n\nBioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.\n\nExtended help\n\nIts subtypes are characterized by:\n\nBeing a linear container type with random access and indices Base.OneTo(length(x)).\nContaining zero or more internal data elements of type encoded_data_eltype(typeof(x)).\nBeing associated with an Alphabet, A by being a subtype of BioSequence{A}.\n\nA BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. 
Hence, the element type and container type of a BioSequence are separated.\n\nSubtypes T of BioSequence must implement the following, with E begin an encoded data type:\n\nBase.length(::T)::Int\nencoded_data_eltype(::Type{T})::Type{E}\nextract_encoded_element(::T, ::Integer)::E\ncopy(::T)\nT must be able to be constructed from any iterable with length defined and with a known, compatible element type.\n\nFurthermore, mutable sequences should implement\n\nencoded_setindex!(::T, ::E, ::Integer)\nT(undef, ::Int)\nresize!(::T, ::Int)\n\nFor compatibility with existing Alphabets, the encoded data eltype must be UInt.\n\n\n\n\n\n","category":"type"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Some aliases for BioSequence are also provided for your convenience:","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"NucSeq\nAASeq","category":"page"},{"location":"types/#BioSequences.NucSeq","page":"BioSequences Types","title":"BioSequences.NucSeq","text":"An alias for BioSequence{<:NucleicAcidAlphabet}\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AASeq","page":"BioSequences Types","title":"BioSequences.AASeq","text":"An alias for BioSequence{AminoAcidAlphabet}\n\n\n\n\n\n","category":"type"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Let's have a closer look at some of those methods that a subtype of BioSequence must implement. Check out julia base library docs for length, copy and resize!.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"encoded_data_eltype\nextract_encoded_element\nencoded_setindex!","category":"page"},{"location":"types/#BioSequences.encoded_data_eltype","page":"BioSequences Types","title":"BioSequences.encoded_data_eltype","text":"encoded_data_eltype(::Type{<:BioSequence})\n\nReturns the element type of the encoded data of the BioSequence. 
This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/#BioSequences.extract_encoded_element","page":"BioSequences Types","title":"BioSequences.extract_encoded_element","text":"extract_encoded_element(::BioSequence{A}, i::Integer)\n\nReturns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/#BioSequences.encoded_setindex!","page":"BioSequences Types","title":"BioSequences.encoded_setindex!","text":"encoded_setindex!(seq::BioSequence, x::E, i::Integer)\n\nGiven encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"A correctly defined subtype of BioSequence that satisfies the interface, will find the vast majority of methods described in the rest of this manual should work out of the box for that type. But they can always be overloaded if needed. Indeed the LongSequence type overloads Indeed some of the generic BioSequence methods, are overloaded for LongSequence, for example for transformation and counting operations where efficiency gains can be made due to the specific internal representation of a specific type.","category":"page"},{"location":"types/#The-abstract-Alphabet","page":"BioSequences Types","title":"The abstract Alphabet","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Alphabets control how biological symbols are encoded and decoded. 
They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences.Alphabet\nBioSequences.AsciiAlphabet","category":"page"},{"location":"types/#BioSequences.Alphabet","page":"BioSequences Types","title":"BioSequences.Alphabet","text":"Alphabet\n\nAlphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.\n\nExtended help\n\nSubtypes of Alphabet are singleton structs that may or may not be parameterized.\nAlphabets span over a finite set of biological symbols.\nThe alphabet controls the encoding from some internal \"encoded data\" to a BioSymbol of the alphabet's element type, as well as the decoding, the inverse process.\nAn Alphabet's encode method must not produce invalid data. \n\nRequired methods\n\nEvery subtype A of Alphabet must implement:\n\nBase.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.\nsymbols(::A)::Tuple{Vararg{S}}. 
This gives tuples of all symbols in the set of A.\nencode(::A, ::S)::E encodes a symbol to an internal data eltype E.\ndecode(::A, ::E)::S decodes an internal data eltype E to a symbol S.\nExcept for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.\n\nIf you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:\n\nBitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].\n\nOptional methods\n\nBitsPerSymbol for compatibility with existing BioSequences\nAsciiAlphabet for increased printing/writing efficiency\ntryencode for fallible encoding.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AsciiAlphabet","page":"BioSequences Types","title":"BioSequences.AsciiAlphabet","text":"AsciiAlphabet\n\nTrait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).\n\n\n\n\n\n","category":"type"},{"location":"types/#Concrete-types","page":"BioSequences Types","title":"Concrete types","text":"","category":"section"},{"location":"types/#Implemented-alphabets","page":"BioSequences Types","title":"Implemented alphabets","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"DNAAlphabet\nRNAAlphabet\nAminoAcidAlphabet","category":"page"},{"location":"types/#BioSequences.DNAAlphabet","page":"BioSequences Types","title":"BioSequences.DNAAlphabet","text":"DNA nucleotide alphabet.\n\nDNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. 
Currently supported values of N are 2 and 4.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.RNAAlphabet","page":"BioSequences Types","title":"BioSequences.RNAAlphabet","text":"RNA nucleotide alphabet.\n\nRNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AminoAcidAlphabet","page":"BioSequences Types","title":"BioSequences.AminoAcidAlphabet","text":"Amino acid alphabet.\n\n\n\n\n\n","category":"type"},{"location":"types/#Long-Sequences","page":"BioSequences Types","title":"Long Sequences","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"LongSequence","category":"page"},{"location":"types/#BioSequences.LongSequence","page":"BioSequences Types","title":"BioSequences.LongSequence","text":"LongSequence{A <: Alphabet}\n\nGeneral-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.\n\nExtended help\n\nLongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.\n\nAs the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. 
As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.\n\nFor example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.\n\nSymbols from multiple alphabets can't be intermixed in one sequence type.\n\nThe following table summarizes common LongSequence types that have been given aliases for convenience.\n\nType Symbol type Type alias\nLongSequence{DNAAlphabet{N}} DNA LongDNA{N}\nLongSequence{RNAAlphabet{N}} RNA LongRNA{N}\nLongSequence{AminoAcidAlphabet} AminoAcid LongAA\n\nThe LongDNA and LongRNA aliases use a DNAAlphabet{4}.\n\nDNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).\n\nIf you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.\n\nDNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).\n\nChanging this single parameter, is all you need to do in order to benefit from memory savings. 
Some computations that use bitwise operations will also be dramatically faster.\n\nThe same applies with LongSequence{RNAAlphabet{4}}, simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.\n\n\n\n\n\n","category":"type"},{"location":"types/#Sequence-views","page":"BioSequences Types","title":"Sequence views","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Similar to how Base Julia offers views of array objects, BioSequences offers view of LongSequences - the LongSubSeq{A<:Alphabet}.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing their own data, they refer to the data of a LongSequence. Modiying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.","category":"page"}] } diff --git a/dev/sequence_search/index.html b/dev/sequence_search/index.html index 3830b361..08ff77c2 100644 --- a/dev/sequence_search/index.html +++ b/dev/sequence_search/index.html @@ -50,7 +50,7 @@ julia> occursin(ExactSearchQuery(dna"CNT", iscompatible), dna"ACNT") true -source

Allowing mismatches

BioSequences.ApproximateSearchQueryType
ApproximateSearchQuery{F<:Function,S<:BioSequence}

Query type for approximate sequence search.

These queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.

Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.

In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.

Examples

julia> seq = dna"ACAGCGTAGCT";
+
source

Allowing mismatches

BioSequences.ApproximateSearchQueryType
ApproximateSearchQuery{F<:Function,S<:BioSequence}

Query type for approximate sequence search.

These queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.

Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.

In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.

Examples

julia> seq = dna"ACAGCGTAGCT";
 
 julia> query = ApproximateSearchQuery(dna"AGGG");
 
@@ -69,7 +69,7 @@
 
 julia> findnext(query, 1, dna"AAGNGG", 1) # 1 mismatch permitted (A vs G) & matched N
 1:4
-
Note

This method of searching for motifs was implemented with smaller query motifs in mind.

If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.

source

Searching according to a pattern

Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.

Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}("MV+"). For bioregex literals, it is recommended to use the @biore_str macro instead:

The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna is a regular expression for DNA sequences and biore"A+"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: "d", "r" or "a", respectively.

Here are examples of using the regular expression for BioSequences:

julia> match(biore"A+C*"dna, dna"AAAACC")
+
Note

This method of searching for motifs was implemented with smaller query motifs in mind.

If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.

source

Searching according to a pattern

Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.

Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}("MV+"). For bioregex literals, it is recommended to use the @biore_str macro instead:

The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna is a regular expression for DNA sequences and biore"A+"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: "d", "r" or "a", respectively.

Here are examples of using the regular expression for BioSequences:

julia> match(biore"A+C*"dna, dna"AAAACC")
 RegexMatch("AAAACC")
 
 julia> match(biore"A+C*"d, dna"AAAACC")
@@ -159,4 +159,4 @@
 3-element Vector{Int64}:
  3
  5
- 9

[Wasserman2004]: https://doi.org/10.1038/nrg1315

+ 9

[Wasserman2004]: https://doi.org/10.1038/nrg1315

diff --git a/dev/symbols/index.html b/dev/symbols/index.html index 09f067d6..94736969 100644 --- a/dev/symbols/index.html +++ b/dev/symbols/index.html @@ -70,4 +70,4 @@ julia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C false -source
BioSymbols.isambiguousFunction
isambiguous(nt::NucleicAcid)

Test if nt is an ambiguous nucleotide.

source
isambiguous(aa::AminoAcid)

Test if aa is an ambiguous amino acid.

source
+source
BioSymbols.isambiguousFunction
isambiguous(nt::NucleicAcid)

Test if nt is an ambiguous nucleotide.

source
isambiguous(aa::AminoAcid)

Test if aa is an ambiguous amino acid.

source
diff --git a/dev/transforms/index.html b/dev/transforms/index.html index 780eee1f..42e4376e 100644 --- a/dev/transforms/index.html +++ b/dev/transforms/index.html @@ -15,7 +15,7 @@ julia> seq[5] = DNA_A DNA_A -
Note

Some types can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.

Modifying sequences

In addition to setindex, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push!Method
push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

source
Base.pop!Method
pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.pushfirst!Method
pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

source
Base.popfirst!Method
popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.insert!Method
insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

source
Base.deleteat!Method
deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

source
Base.append!Method
append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

source
Base.resize!Method
resize!(seq, size, [force::Bool=false])

Resize a biological sequence seq to a given size. Does not resize the underlying data array unless the new size does not fit. If force is true, always resize the underlying data array.

Note that resizing to a larger size, and then loading from uninitialized positions is not allowed and may cause undefined behaviour. Make sure to always fill any uninitialized biosymbols after resizing.

source
Base.empty!Method
empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

source

Here are some examples:

julia> seq = dna"ACG"
+
Note

Some types can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.
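
The difference between copying and viewing can be sketched as follows. This is a minimal illustration, assuming BioSequences v3, where calling view on a LongSequence returns a LongSubSeq; the variable names are hypothetical.

```julia
using BioSequences

# Construct a fresh mutable sequence (avoids mutating a string-macro literal).
seq = LongDNA{4}("ACGTACGT")

# Indexing with a range allocates a new, independent sequence.
copied = seq[2:4]
copied[1] = DNA_T      # changes only the copy

# A view shares its data with the parent sequence.
v = view(seq, 2:4)
v[1] = DNA_G           # writes through to `seq`

println(copied)        # TGT
println(seq)           # AGGTACGT
```

Because a view refers to the parent's data, it should not be used after the parent has been truncated.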

Modifying sequences

In addition to setindex, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push!Method
push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

source
Base.pop!Method
pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.pushfirst!Method
pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

source
Base.popfirst!Method
popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.insert!Method
insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

source
Base.deleteat!Method
deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

source
Base.append!Method
append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

source
Base.resize!Method
resize!(seq, size, [force::Bool=false])

Resize a biological sequence seq to a given size. Does not resize the underlying data array unless the new size does not fit. If force is true, always resize the underlying data array.

Note that resizing to a larger size, and then loading from uninitialized positions is not allowed and may cause undefined behaviour. Make sure to always fill any uninitialized biosymbols after resizing.

source
Base.empty!Method
empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

source

Here are some examples:

julia> seq = dna"ACG"
 3nt DNA Sequence:
 ACG
 
@@ -34,7 +34,7 @@
 julia> deleteat!(seq, 2:3)
 3nt DNA Sequence:
 AAT
-

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse!Method
reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

source
Base.reverseMethod
reverse(seq::BioSequence)

Create a reversed copy of a biological sequence.

source
reverse(seq::LongSequence)

Create a reversed copy of a biological sequence.

source
BioSequences.complement!Function
complement!(seq)

Make a complement sequence of seq in place.

source
BioSymbols.complementFunction
complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
+

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse!Method
reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

source
Base.reverseMethod
reverse(seq::BioSequence)

Create a reversed copy of a biological sequence.

source
reverse(seq::LongSequence)

Create a reversed copy of a biological sequence.

source
BioSymbols.complementFunction
complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
 DNA_T
 
 julia> complement(DNA_N)
@@ -42,10 +42,10 @@
 
 julia> complement(RNA_U)
 RNA_A
-
source
complement(seq)

Make a complement sequence of seq.

source
BioSequences.canonical!Function
canonical!(seq::NucleotideSeq)

Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
+
source
complement(seq)

Make a complement sequence of seq.

source
BioSequences.canonical!Function
canonical!(seq::NucleotideSeq)

Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
 ATCGATCG
 CGATCGAT
-<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

source

Some examples:

julia> seq = dna"ACGTAT"
+<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

source
BioSequences.canonicalFunction
canonical(seq::NucleotideSeq)

Create the canonical sequence of seq.

source

Some examples:

julia> seq = dna"ACGTAT"
 6nt DNA Sequence:
 ACGTAT
 
@@ -60,7 +60,7 @@
 julia> reverse_complement!(seq)
 6nt DNA Sequence:
 ACGTAT
-

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.

Translation

Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.

The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes, and they are registered in ncbi_trans_table.

BioSequences.translateFunction
translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate a LongRNA or a LongDNA to a LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

source
BioSequences.ncbi_trans_tableConstant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the following link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

source
julia> ncbi_trans_table
+

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.
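The difference between the copying and in-place variants can be sketched as follows (a small illustration; the sequences are made up for the example):

```julia
using BioSequences

seq = LongDNA{4}("ACGTTT")

# Copying variant: returns a new sequence and leaves seq untouched.
rc = reverse_complement(seq)
unchanged = (seq == LongDNA{4}("ACGTTT"))

# In-place variant: overwrites seq itself.
reverse_complement!(seq)

# The same naming pattern applies to ungap / ungap!.
gapped   = LongDNA{4}("AC-GT")
ungapped = ungap(gapped)        # gapped still contains the gap symbol
```

The in-place versions avoid an allocation, which matters mostly in tight loops; the copying versions are the safer default when the original sequence is still needed.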

Translation

Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.

The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes, which are registered in ncbi_trans_table.

BioSequences.translateFunction
translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate a LongRNA or a LongDNA to a LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.
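A short sketch of these options, assuming the options are passed as keyword arguments and that ncbi_trans_table[2] is the vertebrate mitochondrial code as listed in the table:

```julia
using BioSequences

rna = rna"AUGGCCUGA"

# Standard genetic code: UGA is a stop codon.
standard = translate(rna)

# Table 2 (vertebrate mitochondrial): UGA is read as tryptophan instead.
mito = translate(rna; code = ncbi_trans_table[2])

# With alternative_start = true the first codon is always read as methionine,
# even though GUG normally codes for valine.
alt = translate(rna"GUGGCC"; alternative_start = true)
```

Comparing `standard` and `mito` shows how the choice of genetic code changes the final residue of the same input.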

source
BioSequences.ncbi_trans_tableConstant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

source
julia> ncbi_trans_table
 Translation Tables:
   1. The Standard Code (standard_genetic_code)
   2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
@@ -81,4 +81,4 @@
  23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
  24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
  25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
-

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes

+

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes

diff --git a/dev/types/index.html b/dev/types/index.html index f18782e3..17dc824f 100644 --- a/dev/types/index.html +++ b/dev/types/index.html @@ -1,2 +1,2 @@ -BioSequences Types · BioSequences.jl

Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called a BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allows many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequenceType
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet A, by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

source

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of the methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.

BioSequences.encoded_data_eltypeFunction
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

source
BioSequences.extract_encoded_elementFunction
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

source
BioSequences.encoded_setindex!Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence

source

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations where efficiency gains can be made due to the specific internal representation of that type.

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.

BioSequences.AlphabetType
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the encoding of a BioSymbol of the alphabet's element type to some internal "encoded data", as well as the decoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This gives a tuple of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].

For increased performance, see BioSequences.AsciiAlphabet

source
BioSequences.AsciiAlphabetType
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).

source

Concrete types

Implemented alphabets

BioSequences.DNAAlphabetType

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source
BioSequences.RNAAlphabetType

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source

Long Sequences

BioSequences.LongSequenceType
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type                               Symbol type    Type alias
LongSequence{DNAAlphabet{N}}       DNA            LongDNA{N}
LongSequence{RNAAlphabet{N}}       RNA            LongRNA{N}
LongSequence{AminoAcidAlphabet}    AminoAcid      LongAA

The LongDNA and LongRNA aliases use a DNAAlphabet{4}.

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).

Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.

source

Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.

The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.

+BioSequences Types · BioSequences.jl

Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called a BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allows many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequenceType
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet A, by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

source

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of the methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.

BioSequences.encoded_data_eltypeFunction
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

source
BioSequences.extract_encoded_elementFunction
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

source
BioSequences.encoded_setindex!Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence

source

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations where efficiency gains can be made due to the specific internal representation of that type.
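The interface functions above can be exercised directly on an existing type. The following is a rough sketch using LongDNA{4}; it assumes these functions are reached through the BioSequences module prefix (they are internal-facing and may not be exported), and the variable names are only illustrative:

```julia
using BioSequences

seq = LongDNA{4}("ACGT")
A   = DNAAlphabet{4}()

# The type used to hold one encoded symbol.
E = BioSequences.encoded_data_eltype(typeof(seq))

# Pull out the raw encoded element at position 2 and decode it via the alphabet.
raw     = BioSequences.extract_encoded_element(seq, 2)
decoded = BioSequences.decode(A, raw)

# Write an encoded symbol back into position 1.
BioSequences.encoded_setindex!(seq, BioSequences.encode(A, DNA_T), 1)
```

Ordinary indexing (`seq[2]`, `seq[1] = DNA_T`) is built on this kind of extract/decode and encode/set round trip, which is why the element type and the container type can be kept separate.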

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.

BioSequences.AlphabetType
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the encoding of a BioSymbol of the alphabet's element type to some internal "encoded data", as well as the decoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Required methods

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This gives a tuple of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].
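To make the required methods concrete, here is a hedged sketch of a hypothetical two-symbol alphabet. PurineAlphabet is invented for illustration and is not part of BioSequences; throwing an ArgumentError for unsupported symbols is a simplification chosen for this sketch, not necessarily what the package's own alphabets do.

```julia
using BioSequences

# Hypothetical alphabet containing only the purines A and G.
struct PurineAlphabet <: Alphabet end

Base.eltype(::Type{PurineAlphabet}) = DNA
BioSequences.symbols(::PurineAlphabet) = (DNA_A, DNA_G)

# One bit is enough to distinguish two symbols.
BioSequences.BitsPerSymbol(::PurineAlphabet) = BioSequences.BitsPerSymbol{1}()

# encode must never return invalid data, so unsupported symbols raise an error.
function BioSequences.encode(::PurineAlphabet, x::DNA)
    x === DNA_A && return UInt(0)
    x === DNA_G && return UInt(1)
    throw(ArgumentError("symbol not in PurineAlphabet"))
end

# decode is the inverse of encode for valid encoded data.
BioSequences.decode(::PurineAlphabet, x::UInt) = x == UInt(0) ? DNA_A : DNA_G

# Round-trip every symbol of the alphabet through encode and decode.
alphabet  = PurineAlphabet()
roundtrip = map(s -> BioSequences.decode(alphabet, BioSequences.encode(alphabet, s)),
                BioSequences.symbols(alphabet))
```

Note that all functions except eltype take an instance of the alphabet, in line with the convention stated above.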

Optional methods

  • BitsPerSymbol for compatibility with existing BioSequences
  • AsciiAlphabet for increased printing/writing efficiency
  • tryencode for fallible encoding.
source
BioSequences.AsciiAlphabetType
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).

source

Concrete types

Implemented alphabets

BioSequences.DNAAlphabetType

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source
BioSequences.RNAAlphabetType

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source

Long Sequences

BioSequences.LongSequenceType
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type                               Symbol type    Type alias
LongSequence{DNAAlphabet{N}}       DNA            LongDNA{N}
LongSequence{RNAAlphabet{N}}       RNA            LongRNA{N}
LongSequence{AminoAcidAlphabet}    AminoAcid      LongAA

The LongDNA and LongRNA aliases use a DNAAlphabet{4}.

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).

Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.
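As a small sketch of switching the alphabet parameter (the sequences are arbitrary examples):

```julia
using BioSequences

# 4-bit encoding: ambiguous symbols such as N are representable.
seq4 = LongDNA{4}("ACGTACGT")

# 2-bit encoding of the same bases; only A, C, G and T are allowed here,
# so converting a sequence that contains ambiguous symbols would fail.
seq2 = LongDNA{2}(seq4)

same_text = string(seq2) == string(seq4)
```

Both sequences have element type DNA; only the internal encoding, and hence the memory used per base, differs.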

source

Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.

The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.
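The shared-data behaviour described above can be sketched like this (a minimal illustration, assuming view on a LongSequence returns a LongSubSeq):

```julia
using BioSequences

seq = LongDNA{4}("TAGCATCG")

# A view of positions 2 to 5; no sequence data is copied.
v = view(seq, 2:5)

# Writing through the view changes the parent sequence...
v[1] = DNA_T

# ...and writing to the parent is visible through the view.
seq[3] = DNA_A
```

After these two assignments, `seq[2]` is T and `v[2]` is A, since both names refer to the same underlying data.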