Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving ORF as an AbstractGenomicInterval{T} #34

Merged
merged 40 commits into from
Jul 1, 2024
Merged

Conversation

camilogarciabotero
Copy link
Owner

@camilogarciabotero camilogarciabotero commented May 23, 2024

Some time ago, I discovered the GenomicFeatures.jl package. It has a generic abstract type AbstractGenomicInterval{T} that has some neat features as it is build on the even more generic data structure IntervalTrees.jl. Back in the beginning of GeneFinder.jl I did not understand several important aspects of Julia and the nature of having a robust data structure for an ORF.

Now, watching the imp. of the more concrete GenomicInterval:

struct GenomicInterval{T} <: IntervalTrees.AbstractInterval{Int64}
    groupname::String
    first::Int64
    last::Int64
    strand::Strand
    metadata::T
end

It got me thinking that it is indeed a somehow generic ORF although simpler with only starting, ending, and strand fields that could be shared, it could expand later on the metadata or with more dedicated fields. However the cool stuff with this AbstractGenomicInterval{T} is that it inherits the nice properties of the IntervalTree{K,V} data structure so that several operations of an interval (which is indeed an ORF location) can be not only conveniently implemented (as they are already in that package) but also fast. For instance, finding intersections of intervals (see details). This is something that can be important for different gene-finder algorithms that rely on filtering based on overlapping ORFs. I can also be missing some other details that this data structure could bring to the ORF to be more robust and generic to build other finder methods upon.

This PR implements the ORF as a subtype of the AbstractGenomicInterval{F} where {F <: GeneFinderMethod}:

struct ORF{F} <: GenomicFeatures.AbstractGenomicInterval{F}
    groupname::String
    first::Int64
    last::Int64
    strand::Strand
    frame::Int
    finder::F
    scheme::Union{Nothing, Function}
    score::Union{Nothing, Float64}
    rbs::Union{Nothing, Vector{UnitRange{Int64}}} # Add RBS field 
end

Current benchmarks do not look very promising compared to the previous simple implemented ORF:

@time using GeneFinder
  0.268006 seconds (163.65 k allocations: 11.280 MiB, 9.64% gc time, 1.43% compilation time) # main
  0.312447 seconds (180.26 k allocations: 12.660 MiB, 8.44% gc time, 1.26% compilation time) # this PR
rseq = randdnaseq(10^6)
@btime findorfs(rseq, NaiveFinder());
 46.192 ms (346558 allocations: 22.79 MiB) # main
 51.964 ms (390313 allocations: 28.42 MiB) # PR -> findorfs(rseq, finder=NaiveFinder);

There are other imp. details on the ORF{F} like the addition of the finder::F field and the other fields below, that are also making more generic the implementations, but currently not necessarily affecting the benchmarks, and those could be added directly to old ORF struct.

@kescobo
Copy link

kescobo commented May 24, 2024

You don't really need the finder::F field if it's parameterized, you can just define finder(::ORF{F}) where F = F.

I'm not very good at performance chasing, but the concept seems like a good one to me. IntervalTrees make a lot of sense for genomic features, especially something like ORFs over a sequence that may overlap.

Hopefully, someone else can pitch in suggestions for chasing down the regression. Or worst case, post on discourse saying that python is faster and get dozens of responses 😅

@camilogarciabotero
Copy link
Owner Author

You don't really need the finder::F field if it's parameterized, you can just define finder(::ORF{F}) where F = F.

I did not get this suggestion, is this completely literal finder(::ORF{F}) where F = F? like should be outside of the struct right?

I'm not very good at performance chasing, but the concept seems like a good one to me. IntervalTrees make a lot of sense for genomic features, especially something like ORFs over a sequence that may overlap.

Yeah, for me it was like a very nice abstraction to create a more specific ORF type. Thanks.

Hopefully, someone else can pitch in suggestions for chasing down the regression. Or worst case, post on discourse saying that python is faster and get dozens of responses 😅

Good idea! I will wai a little bit to get it completely working with the remaining methods, but I agree I will get a storm of suggestions!

@kescobo
Copy link

kescobo commented May 28, 2024

I did not get this suggestion, is this completely literal finder(::ORF{F}) where F = F? like should be outside of the struct right?

Yeah

julia> abstract type AbstractFoo{T} end

julia> struct Foo1{T} <: AbstractFoo{T} end

julia> struct Foo2{T} <: AbstractFoo{T} end

julia> gett(::AbstractFoo{T}) where T = T
gett (generic function with 1 method)

julia> f1 = Foo1{String}()
Foo1{String}()

julia> f2 = Foo2{Vector}()
Foo2{Vector}()

julia> gett(f1)
String

julia> gett(f2)
Vector (alias for Array{T, 1} where T)

Copy link
Owner Author

@camilogarciabotero camilogarciabotero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I have now a more consistent and efficient (benches updated in the first comment) ORF type:

struct ORF{N,F} <: GenomicFeatures.AbstractGenomicInterval{F}
    groupname::String
    first::Int64
    last::Int64
    strand::Strand
    frame::Int
    seq::LongSubSeq{DNAAlphabet{N}}
    features::Features
    scheme::Union{Nothing,Function}
end

And as you suggested @kescobo, I removed the finder::F from the main struct, and make an outer constructor that acually takes a ::Type{F} so that it is now posssible to make new finder such that they will be <: GeneFinderMethod abstract and will populate the ORF{F}.

Another fixed thing I have not seen before is that the ORF{N,F} can handle LongSubSeq{DNAAlphabet{N}} where N=2.

Also, I was trying to make ORF{N,F}, so now it can be loaded with gene features since the Features type is a flexible Features(fts::NamedTuple) = Features{typeof(fts)}(fts). The score and many other thigs we can get from an ORF can be saved in the type.

A couple of demonstrations of how it is now very neat:

phi = dna"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA"

phiorfs = findorfs(phi, finder=NaiveFinder)

204-element Vector{ORF{4, NaiveFinder}}:
 ORF{NaiveFinder}(9:101, '-', 3)
 ORF{NaiveFinder}(94:126, '-', 1)
 ORF{NaiveFinder}(100:627, '+', 1)
 ORF{NaiveFinder}(209:214, '-', 2)
 ORF{NaiveFinder}(223:273, '-', 1)
 ORF{NaiveFinder}(223:447, '-', 1)
 ORF{NaiveFinder}(245:292, '-', 2)
 ORF{NaiveFinder}(248:436, '+', 2)
 ORF{NaiveFinder}(257:436, '+', 2)
 ORF{NaiveFinder}(270:278, '-', 3)
 ORF{NaiveFinder}(283:627, '+', 1)
 ORF{NaiveFinder}(344:436, '+', 2)
 ORF{NaiveFinder}(419:436, '+', 2)
 ORF{NaiveFinder}(422:436, '+', 2)
 ORF{NaiveFinder}(437:481, '+', 2)
 
 ORF{NaiveFinder}(4863:5258, '-', 3)
 ORF{NaiveFinder}(4933:5019, '+', 1)
 ORF{NaiveFinder}(4941:5375, '+', 3)
 ORF{NaiveFinder}(5053:5085, '+', 1)
 ORF{NaiveFinder}(5082:5375, '+', 3)
 ORF{NaiveFinder}(5089:5325, '+', 1)
 ORF{NaiveFinder}(5122:5202, '-', 1)
 ORF{NaiveFinder}(5152:5325, '+', 1)
 ORF{NaiveFinder}(5164:5325, '+', 1)
 ORF{NaiveFinder}(5180:5206, '-', 2)
 ORF{NaiveFinder}(5257:5325, '+', 1)
 ORF{NaiveFinder}(5263:5325, '+', 1)
 ORF{NaiveFinder}(5320:5325, '+', 1)
 ORF{NaiveFinder}(5330:5365, '-', 2)
 ORF{NaiveFinder}(5364:5375, '+', 3)

Since the ORF{N,F} stores the sequence, we can make cool things out of the box:

sequence.(orfs)

204-element Vector{LongSubSeq{DNAAlphabet{4}}}:
 ATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAA
 ATGGCGAGAAATAAAAGTCTGAAACATGATTAA
 ATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGA
 ATGTAG
 ATGAAGAAAACCACCATTACCAGCATTAACCGTCAAACTATCAAAATATAA
 ATGGCGACCATTCAAAGGATAAACATCATAGGCAGTCGGACCATTACCAGCATTAACCGTCAAACTATCAAAATATAA
 ATGTATCCATCTGAATGCAATGAAGAAAACCACCATTACCAGCATTAA
 ATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGA
 ATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGA
 ATGCAATGA
 ATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGA
 ATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGA
 ATGATGTTTATCCTTTGA
 ATGTTTATCCTTTGA
 ATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGA
 
 ATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGATTTATAGGTCTGTTGAACACGACCAGAAAACTGGCCTAA
 ATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGA
 ATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGA
 ATGGCAACTTGCCGCCGCGTGAAATTTCTATGA
 ATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGA
 ATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGAAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGGGAAGCCTTCAAGAAGGTGATAAGCAGGAGAAACATAAGGCGCATAACGATACCACTGACCCTCAGCAATCTTAA
 ATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGAATGGGAAGCCTTCAAGAAGGTGA
 ATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGTGA
 ATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAA
 ATGACTTCGTGA

Exploring the new Features field

We can now try to use a scoring scheme, for instance, the simple log_odds_ratio_score and use minlen keyword argument:

findorfs(phi, scheme=lors, minlen=64) .|> features

134-element Vector{@NamedTuple{score::Float64}}:
 (score = -0.020953839808134023,)
 (score = -0.0008346917390085617,)
 (score = -0.005239249252611444,)
 (score = -0.007200284900850932,)
 (score = -0.007832152171979367,)
 (score = -0.0018172720089990923,)
 (score = -0.006404578260605308,)
 (score = -0.003209054988434599,)
 (score = -0.03100933596797633,)
 (score = -0.0005844727783277565,)
 (score = -0.0005442693515613672,)
 (score = -0.02946967703459084,)
 (score = -0.0006814154969336216,)
 (score = -0.0009388058672044588,)
 (score = -0.000946341251223099,)
 
 (score = -0.0003438480564352011,)
 (score = -0.00853670881288625,)
 (score = -0.00802605474396142,)
 (score = -0.01604090594572921,)
 (score = -0.00037344639883859457,)
 (score = -0.011635555210909566,)
 (score = -0.0006769082407312228,)
 (score = -0.014804098291466597,)
 (score = -0.0006454376565976915,)
 (score = -0.0014012196428411425,)
 (score = -0.0024182335566292662,)
 (score = -0.01466675827643835,)
 (score = -0.00302236305890491,)
 (score = -0.003913378336104148,)
 (score = -0.030410038701528405,)

Very neet interface for filtering

And much more we can now filter with very nice fine grain and can close #33 :

filter(x -> iscoding(sequence(x), η=1e-10) && length(x) > 100, findorfs(phi))

25-element Vector{ORF{4, NaiveFinder}}:
 ORF{NaiveFinder}(1021:1389, '+', 1)
 ORF{NaiveFinder}(1447:1635, '+', 1)
 ORF{NaiveFinder}(1489:1635, '+', 1)
 ORF{NaiveFinder}(1501:1635, '+', 1)
 ORF{NaiveFinder}(1531:1635, '+', 1)
 ORF{NaiveFinder}(2697:3227, '+', 3)
 ORF{NaiveFinder}(2745:3227, '+', 3)
 ORF{NaiveFinder}(2780:3142, '+', 2)
 ORF{NaiveFinder}(2841:3227, '+', 3)
 ORF{NaiveFinder}(2859:3227, '+', 3)
 ORF{NaiveFinder}(2865:3227, '+', 3)
 ORF{NaiveFinder}(2874:3227, '+', 3)
 ORF{NaiveFinder}(2973:3227, '+', 3)
 ORF{NaiveFinder}(3142:3312, '+', 1)
 ORF{NaiveFinder}(3481:3939, '+', 1)
 ORF{NaiveFinder}(3659:3934, '+', 2)
 ORF{NaiveFinder}(3734:3934, '+', 2)
 ORF{NaiveFinder}(3772:3939, '+', 1)
 ORF{NaiveFinder}(3806:3934, '+', 2)
 ORF{NaiveFinder}(4129:4287, '+', 1)
 ORF{NaiveFinder}(4160:4291, '-', 2)
 ORF{NaiveFinder}(4540:4644, '+', 1)
 ORF{NaiveFinder}(4690:4866, '+', 1)
 ORF{NaiveFinder}(4741:4866, '+', 1)
 ORF{NaiveFinder}(4744:4866, '+', 1)

Using GenomeFeature GenomeIntervalCollections

Since the ORF{N,F} is a subtype of a AbstractGenomicInterval we can use some nice methods and explore, for instance, the coverage of the ORF intersections:

orfcolleciton =  GenomicIntervalCollection(phiorfs)

GenomicIntervalCollection{ORF{4, NaiveFinder}} with 134 intervals:
  ORF{NaiveFinder}(9:101, '-', 3)
  ORF{NaiveFinder}(100:627, '+', 1)
  ORF{NaiveFinder}(223:447, '-', 1)
  ORF{NaiveFinder}(248:436, '+', 2)
  ORF{NaiveFinder}(257:436, '+', 2)
  ORF{NaiveFinder}(283:627, '+', 1)
  ORF{NaiveFinder}(344:436, '+', 2)
  ORF{NaiveFinder}(532:627, '+', 1)
  

coverage(orfcollection)

GenomicIntervalCollection{GenomicInterval{UInt32}} with 178 intervals:
  phi:9-99  .  1
  phi:100-101  .  2
  phi:102-222  .  1
  phi:223-247  .  2
  phi:248-256  .  3
  phi:257-282  .  4
  phi:283-343  .  5
  phi:344-436  .  6
  

@camilogarciabotero camilogarciabotero linked an issue Jun 8, 2024 that may be closed by this pull request
@kescobo
Copy link

kescobo commented Jun 10, 2024

This looks fantastic :-)

@camilogarciabotero
Copy link
Owner Author

Thanks for all the suggestions!

I still doubt whether to keep the seq field in the ORF struct. Do you have any thoughts about it? It has the benefit of making the ORF struct more "portable" or robust but maybe it is somehow redundant as it could be enough to have the positions, frame, and strand...

@kescobo
Copy link

kescobo commented Jun 10, 2024

Are SubSeqs implemented as views? To me, that would make the most sense, so they could be carried along with the object, but not be reallocated

@camilogarciabotero
Copy link
Owner Author

Yes, it is seq::LongSubSeq{DNAAlphabet{N}} so yeah, allocations are minimal, but I am in this battle of making this package somehow generic, but also UI rich... I will leave this PR linger open a little bit more to give the general implementation a little more thought. Thanks for taking the time to take a look at it.

@camilogarciabotero camilogarciabotero merged commit 6ba4aa5 into main Jul 1, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using score to filter what getorfs delivers
2 participants