Commit
In this mega-commit, Kmers.jl is thoroughly overhauled, with a new API, new docs, and new types.
1 parent 0bdf3a8, commit 8fd37eb
Showing 64 changed files with 4,097 additions and 2,890 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
always_for_in = true | ||
whitespace_typedefs = true | ||
whitespace_ops_in_indices = true | ||
remove_extra_newlines = true | ||
import_to_using = true | ||
normalize_line_endings = "unix" | ||
separate_kwargs_with_semicolon = true | ||
whitespace_in_kwargs = false |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,4 +2,6 @@ | |
*.jl.*.cov | ||
*.jl.mem | ||
.DS_Store | ||
Manifest.toml | ||
Manifest.toml | ||
TODO.md | ||
docs/build |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,17 +1,33 @@ | ||
name = "Kmers" | ||
uuid = "445028e4-d31f-4f27-89ad-17affd83fc22" | ||
authors = ["Sabrina Jaye Ward <[email protected]>"] | ||
version = "0.1.0" | ||
authors = [ | ||
"Jakob Nybo Nissen <[email protected]>", | ||
"Sabrina Jaye Ward <[email protected]>" | ||
] | ||
version = "1.0.0" | ||
|
||
[weakdeps] | ||
StringViews = "354b36f9-a18e-4713-926e-db85100087ba" | ||
|
||
[deps] | ||
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59" | ||
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9" | ||
|
||
[extensions] | ||
StringViewsExt = "StringViews" | ||
|
||
# Note: We intentionally have strict compat on BioSequences because Kmers | ||
# reaches into the internals of BioSequences. | ||
[compat] | ||
BioSequences = "3.1.3" | ||
julia = "1.5" | ||
BioSequences = "~3.4.1" | ||
Random = "1.10" | ||
julia = "1.10" | ||
StringViews = "1" | ||
|
||
[extras] | ||
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" | ||
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" | ||
StringViews = "354b36f9-a18e-4713-926e-db85100087ba" | ||
|
||
[targets] | ||
test = ["Test"] | ||
test = ["Test", "Random", "StringViews"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,9 @@ | ||
[deps] | ||
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59" | ||
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" | ||
FASTX = "c2308a5c-f048-11e8-3e8a-31650f418d12" | ||
Kmers = "445028e4-d31f-4f27-89ad-17affd83fc22" | ||
MinHash = "4b3c9753-2685-44e9-8a29-365b96c023ed" | ||
|
||
[compat] | ||
Documenter = "0.24" | ||
Documenter = "1" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,29 +1,33 @@ | ||
using Documenter, Kmers | ||
|
||
makedocs( | ||
format = Documenter.HTML(), | ||
sitename = "Kmers.jl", | ||
pages = [ | ||
"Home" => "index.md", | ||
"Kmer types" => "kmer_types.md", | ||
"Constructing kmers" => "construction.md", | ||
"Indexing & modifying kmers" => "transforms.md", | ||
"Predicates" => "predicates.md", | ||
"Random kmers" => "random.md", | ||
"Iterating over Kmers" => "iteration.md", | ||
"Translation" => "translate.md", | ||
#"Pattern matching and searching" => "sequence_search.md", | ||
#"Iteration" => "iteration.md", | ||
#"Counting" => "counting.md", | ||
#"I/O" => "io.md", | ||
#"Interfaces" => "interfaces.md" | ||
DocMeta.setdocmeta!( | ||
Kmers, | ||
:DocTestSetup, | ||
:(using BioSequences, Kmers, Test); | ||
recursive=true, | ||
) | ||
|
||
makedocs(; | ||
modules=[Kmers], | ||
format=Documenter.HTML(; prettyurls=get(ENV, "CI", nothing) == "true"), | ||
sitename="Kmers.jl", | ||
pages=[ | ||
"Home" => "index.md", | ||
"The Kmer type" => "kmers.md", | ||
"Iteration" => "iteration.md", | ||
"Translation" => "translation.md", | ||
"Hashing" => "hashing.md", | ||
"K-mer replacements" => "replacements.md", | ||
"FAQ" => "faq.md", | ||
"Cookbook" => ["MinHash" => "minhash.md", "Kmer composition" => "composition.md"], | ||
], | ||
authors = "Ben J. Ward, The BioJulia Organisation and other contributors." | ||
authors="Jakob Nybo Nissen, Sabrina J. Ward, The BioJulia Organisation and other contributors.", | ||
checkdocs=:exports, | ||
) | ||
|
||
deploydocs( | ||
repo = "github.com/BioJulia/Kmers.jl.git", | ||
push_preview = true, | ||
deps = nothing, | ||
make = nothing | ||
deploydocs(; | ||
repo="github.com/BioJulia/Kmers.jl.git", | ||
push_preview=true, | ||
deps=nothing, | ||
make=nothing, | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
```@meta | ||
CurrentModule = Kmers | ||
DocTestSetup = quote | ||
using BioSequences | ||
using Test | ||
using Kmers | ||
end | ||
``` | ||
## Kmer composition | ||
In metagenomics, sequences are often summarized by counting the occurrence of | ||
all k-mers of a given length in a sequence. | ||
For example, for K=4, there are 4^4 = 256 possible DNA 4-mers. | ||
If these counts are ordered, the composition can be represented by a length 256 | ||
vector. | ||
|
||
Vector similarity operations (e.g. cosine distance) can then be used as an | ||
approximate proxy for phylogenetic distance. | ||
|
||
In the example below, we exploit that: | ||
* A `DNAKmer{4}`'s data is a single-element tuple, which | ||
stores the sequence in the 8 lower bits. | ||
* The `encoded_data` function will return this tuple. | ||
|
||
```jldoctest; output=false | ||
using BioSequences, FASTX, Kmers | ||
using BioSequences: encoded_data | ||
function composition(record::FASTARecord) | ||
counts = zeros(UInt32, 256) | ||
frequencies = zeros(Float32, 256) | ||
for kmer in FwDNAMers{4}(sequence(record)) | ||
@inbounds counts[only(encoded_data(kmer)) + 1] += 1 | ||
end | ||
factor = 1 / sum(counts; init=zero(eltype(counts))) | ||
for i in eachindex(counts, frequencies) | ||
frequencies[i] = counts[i] * factor | ||
end | ||
frequencies | ||
end | ||
# Make two FASTA records - could be from an assembly | ||
recs = [FASTARecord(string(i), randdnaseq(10000)) for i in "AB"] | ||
# Compute the squared 2-norm difference and verify it's in [0, 2]. | ||
(comp_a, comp_b) = map(composition, recs) | ||
comp_distance = sum((comp_a .- comp_b).^2) | ||
println(0.0 ≤ comp_distance ≤ 2.0) | ||
# output | ||
true | ||
``` |
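The indexing trick in the example above can be illustrated without any packages. The following is a minimal sketch in plain Julia, assuming a 2-bit encoding of A=0, C=1, G=2, T=3 with the first symbol in the highest bits. The helpers `twobit`, `pack4` and `composition_counts` are hypothetical names for illustration and are not part of Kmers.jl or BioSequences.

```julia
# Hypothetical helper: 2-bit code for one unambiguous DNA symbol
# (assumed encoding: A=0, C=1, G=2, T=3).
function twobit(c::Char)
    c == 'A' && return 0x00
    c == 'C' && return 0x01
    c == 'G' && return 0x02
    c == 'T' && return 0x03
    throw(ArgumentError("not an unambiguous DNA symbol: $c"))
end

# Pack a 4-mer into the 8 lowest bits of an unsigned integer,
# with the first symbol ending up in the highest of those bits.
function pack4(s::AbstractString)
    length(s) == 4 || throw(ArgumentError("expected a 4-mer"))
    u = UInt(0)
    for c in s
        u = (u << 2) | twobit(c)
    end
    return u
end

# Count all overlapping 4-mers of an ASCII string; the packed value plus one
# is used directly as an index into a length-256 vector.
function composition_counts(seq::AbstractString)
    counts = zeros(UInt32, 256)
    for i in 1:(length(seq) - 3)
        counts[pack4(SubString(seq, i, i + 3)) + 1] += 1
    end
    return counts
end
```

Under this assumed encoding, `pack4("AAAA")` is 0 and `pack4("TTTT")` is 255, so every 4-mer maps to a distinct slot in `1:256` after adding one.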
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
```@meta | ||
CurrentModule = Kmers | ||
DocTestSetup = quote | ||
using BioSequences | ||
using Test | ||
using Kmers | ||
end | ||
``` | ||
## FAQ | ||
### Why can kmers not be compared to biosequences? | ||
It may be surprising that kmers cannot be compared to other biosequences: | ||
|
||
```jldoctest | ||
julia> dna"TAG" == mer"TAG"d | ||
ERROR: MethodError | ||
[...] | ||
``` | ||
|
||
In fact, this error comes from a manually thrown `MethodError`, because the generic method `Base.:==(::BioSequence, ::BioSequence)` is defined. | ||
|
||
This is a consequence of the following constraints: | ||
* `isequal(x, y)` implies `hash(x) == hash(y)` | ||
* `isequal(x, y)` and `x == y` ought to be identical for well-defined elements (i.e. in the absence of `missing`s and `NaN`s etc.) | ||
* `hash(::Kmer)` must be absolutely maximally efficient | ||
|
||
If kmers were comparable to `BioSequence`, then the hashing of `BioSequence` would have to follow that of `Kmer`, which in practice would mean that all biosequences would need to be recoded to `Kmer`s before hashing. | ||
|
||
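The conflict between these constraints can be sketched with two toy types. This is only an illustration: `PackedMer`, `TextSeq`, `unpack3` and `breaks_hash_rule` are hypothetical names and are not the Kmers.jl or BioSequences implementations. The sketch assumes a 2-bit packing of A=0, C=1, G=2, T=3.

```julia
# Hypothetical stand-ins; these are NOT the Kmers.jl or BioSequences types.
struct PackedMer          # kmer-like: hashes its packed bits directly (cheap)
    bits::UInt64
end
struct TextSeq            # general-sequence-like: hashes its contents
    s::String
end

Base.hash(x::PackedMer, h::UInt) = hash(x.bits, h)
Base.hash(x::TextSeq, h::UInt) = hash(x.s, h)
Base.:(==)(a::TextSeq, b::TextSeq) = a.s == b.s

# Decode a packed 3-mer back to text (assumed packing: A=0, C=1, G=2, T=3).
unpack3(m::PackedMer) = join("ACGT"[Int((m.bits >> s) & 0x03) + 1] for s in (4, 2, 0))

# Suppose cross-type equality were allowed:
Base.:(==)(a::PackedMer, b::TextSeq) = unpack3(a) == b.s
Base.:(==)(a::TextSeq, b::PackedMer) = b == a

m = PackedMer(0b110010)   # encodes "TAG" under the assumed packing
t = TextSeq("TAG")

# True whenever a pair breaks the rule that isequal implies equal hashes.
breaks_hash_rule(a, b) = isequal(a, b) && hash(a) != hash(b)
```

Here `isequal(m, t)` holds, while the two hashes are computed from unrelated data, so `breaks_hash_rule(m, t)` would be expected to return `true`, and `Set` or `Dict` lookups that mix the two types would be unreliable. Repairing this would mean hashing every general sequence the way kmers are hashed, which is the cost described above.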
### Why isn't there an iterator of unambiguous, canonical kmers or spaced, canonical kmers? | ||
Any iterator of nucleotide kmers can be made into a canonical kmer iterator by simply calling `canonical` on its output kmers. | ||
|
||
The `CanonicalKmers` iterator is special-cased, because with a step size of 1, it is generally faster to store both the forward and the reverse kmer, then create the next kmer by prepending/appending the next symbol. | ||
|
||
However, with a larger step size, it becomes more efficient to build the forward kmer, then reverse-complement the whole kmer. | ||
|
||
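The two strategies can be sketched on plain 2-bit integer codes. This is a minimal illustration, not the `CanonicalKmers` implementation: it assumes the encoding A=0, C=1, G=2, T=3 (so the complement of code `x` is `3 - x`) and takes the canonical kmer to be the smaller of the forward and reverse-complement codes. `nuc2bit`, `canonical_codes` and `revcomp_code` are hypothetical names.

```julia
# Hypothetical helper: 2-bit code of an unambiguous DNA symbol.
function nuc2bit(c::Char)
    i = findfirst(==(c), "ACGT")
    i === nothing && throw(ArgumentError("not an unambiguous DNA symbol: $c"))
    return UInt64(i - 1)
end

# Step size 1: keep both the forward and the reverse-complement code, and
# update each with a single shift per new symbol.
function canonical_codes(seq::AbstractString, k::Integer)
    1 <= k <= 32 || throw(ArgumentError("k must be in 1:32"))
    mask = k == 32 ? typemax(UInt64) : (UInt64(1) << (2k)) - UInt64(1)
    shift = 2 * (k - 1)
    fw = rc = UInt64(0)
    out = UInt64[]
    for (i, c) in enumerate(seq)
        x = nuc2bit(c)
        fw = ((fw << 2) | x) & mask                  # append symbol to forward kmer
        rc = (rc >> 2) | ((UInt64(3) - x) << shift)  # prepend complement to reverse kmer
        i >= k && push!(out, min(fw, rc))
    end
    return out
end

# Larger step sizes: build only the forward code, then reverse-complement
# the whole kmer in one pass over its k symbols.
function revcomp_code(fw::UInt64, k::Integer)
    rc = UInt64(0)
    for _ in 1:k
        rc = (rc << 2) | (UInt64(3) - (fw & UInt64(3)))
        fw >>= 2
    end
    return rc
end
```

For the sequence `"ACGT"` with `k = 2`, the kmers are AC, CG and GT; AC and GT are reverse complements of each other and CG is its own reverse complement, so `canonical_codes("ACGT", 2)` yields the codes of AC, CG and AC.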
### Why isn't there an iterator of skipmers/minimizers/k-min-mers, etc? | ||
The concept of kmers has turned out to be remarkably flexible and useful in bioinformatics, and has spawned a never-ending stream of variations. | ||
We simply can't implement them all. | ||
|
||
However, see the section [Building kmer replacements](@ref replacements) for how to implement them | ||
yourself as a user of Kmers.jl. |
8fd37eb
@JuliaRegistrator register
8fd37eb
Registration pull request created: JuliaRegistries/General/121889
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via: