Moving `ORF` as an `AbstractGenomicInterval{T}` #34

camilogarciabotero · 2024-05-23T23:25:25Z

Some time ago, I discovered the GenomicFeatures.jl package. It has a generic abstract type AbstractGenomicInterval{T} that has some neat features as it is build on the even more generic data structure IntervalTrees.jl. Back in the beginning of GeneFinder.jl I did not understand several important aspects of Julia and the nature of having a robust data structure for an ORF.

Now, watching the imp. of the more concrete GenomicInterval:

struct GenomicInterval{T} <: IntervalTrees.AbstractInterval{Int64}
    groupname::String
    first::Int64
    last::Int64
    strand::Strand
    metadata::T
end

It got me thinking that it is indeed a somehow generic ORF although simpler with only starting, ending, and strand fields that could be shared, it could expand later on the metadata or with more dedicated fields. However the cool stuff with this AbstractGenomicInterval{T} is that it inherits the nice properties of the IntervalTree{K,V} data structure so that several operations of an interval (which is indeed an ORF location) can be not only conveniently implemented (as they are already in that package) but also fast. For instance, finding intersections of intervals (see details). This is something that can be important for different gene-finder algorithms that rely on filtering based on overlapping ORFs. I can also be missing some other details that this data structure could bring to the ORF to be more robust and generic to build other finder methods upon.

This PR implements the ORF as a subtype of the AbstractGenomicInterval{F} where {F <: GeneFinderMethod}:

struct ORF{F} <: GenomicFeatures.AbstractGenomicInterval{F}
    groupname::String
    first::Int64
    last::Int64
    strand::Strand
    frame::Int
    finder::F
    scheme::Union{Nothing, Function}
    score::Union{Nothing, Float64}
    rbs::Union{Nothing, Vector{UnitRange{Int64}}} # Add RBS field 
end

Current benchmarks do not look very promising compared to the previous simple implemented ORF:

@time using GeneFinder
  0.268006 seconds (163.65 k allocations: 11.280 MiB, 9.64% gc time, 1.43% compilation time) # main
  0.312447 seconds (180.26 k allocations: 12.660 MiB, 8.44% gc time, 1.26% compilation time) # this PR

rseq = randdnaseq(10^6)
@btime findorfs(rseq, NaiveFinder());
 46.192 ms (346558 allocations: 22.79 MiB) # main
 51.964 ms (390313 allocations: 28.42 MiB) # PR -> findorfs(rseq, finder=NaiveFinder);

There are other imp. details on the ORF{F} like the addition of the finder::F field and the other fields below, that are also making more generic the implementations, but currently not necessarily affecting the benchmarks, and those could be added directly to old ORF struct.

src/algorithms/naivefinder.jl

kescobo · 2024-05-24T01:43:25Z

You don't really need the finder::F field if it's parameterized, you can just define finder(::ORF{F}) where F = F.

I'm not very good at performance chasing, but the concept seems like a good one to me. IntervalTrees make a lot of sense for genomic features, especially something like ORFs over a sequence that may overlap.

Hopefully, someone else can pitch in suggestions for chasing down the regression. Or worst case, post on discourse saying that python is faster and get dozens of responses 😅

camilogarciabotero · 2024-05-25T17:58:11Z

You don't really need the finder::F field if it's parameterized, you can just define finder(::ORF{F}) where F = F.

I did not get this suggestion, is this completely literal finder(::ORF{F}) where F = F? like should be outside of the struct right?

I'm not very good at performance chasing, but the concept seems like a good one to me. IntervalTrees make a lot of sense for genomic features, especially something like ORFs over a sequence that may overlap.

Yeah, for me it was like a very nice abstraction to create a more specific ORF type. Thanks.

Hopefully, someone else can pitch in suggestions for chasing down the regression. Or worst case, post on discourse saying that python is faster and get dozens of responses 😅

Good idea! I will wai a little bit to get it completely working with the remaining methods, but I agree I will get a storm of suggestions!

kescobo · 2024-05-28T14:56:14Z

I did not get this suggestion, is this completely literal finder(::ORF{F}) where F = F? like should be outside of the struct right?

Yeah

julia> abstract type AbstractFoo{T} end

julia> struct Foo1{T} <: AbstractFoo{T} end

julia> struct Foo2{T} <: AbstractFoo{T} end

julia> gett(::AbstractFoo{T}) where T = T
gett (generic function with 1 method)

julia> f1 = Foo1{String}()
Foo1{String}()

julia> f2 = Foo2{Vector}()
Foo2{Vector}()

julia> gett(f1)
String

julia> gett(f2)
Vector (alias for Array{T, 1} where T)

camilogarciabotero

Alright, I have now a more consistent and efficient (benches updated in the first comment) ORF type:

struct ORF{N,F} <: GenomicFeatures.AbstractGenomicInterval{F}
    groupname::String
    first::Int64
    last::Int64
    strand::Strand
    frame::Int
    seq::LongSubSeq{DNAAlphabet{N}}
    features::Features
    scheme::Union{Nothing,Function}
end

And as you suggested @kescobo, I removed the finder::F from the main struct, and make an outer constructor that acually takes a ::Type{F} so that it is now posssible to make new finder such that they will be <: GeneFinderMethod abstract and will populate the ORF{F}.

Another fixed thing I have not seen before is that the ORF{N,F} can handle LongSubSeq{DNAAlphabet{N}} where N=2.

Also, I was trying to make ORF{N,F}, so now it can be loaded with gene features since the Features type is a flexible Features(fts::NamedTuple) = Features{typeof(fts)}(fts). The score and many other thigs we can get from an ORF can be saved in the type.

A couple of demonstrations of how it is now very neat:

phi = dna"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA"

phiorfs = findorfs(phi, finder=NaiveFinder)

204-element Vector{ORF{4, NaiveFinder}}:
 ORF{NaiveFinder}(9:101, '-', 3)
 ORF{NaiveFinder}(94:126, '-', 1)
 ORF{NaiveFinder}(100:627, '+', 1)
 ORF{NaiveFinder}(209:214, '-', 2)
 ORF{NaiveFinder}(223:273, '-', 1)
 ORF{NaiveFinder}(223:447, '-', 1)
 ORF{NaiveFinder}(245:292, '-', 2)
 ORF{NaiveFinder}(248:436, '+', 2)
 ORF{NaiveFinder}(257:436, '+', 2)
 ORF{NaiveFinder}(270:278, '-', 3)
 ORF{NaiveFinder}(283:627, '+', 1)
 ORF{NaiveFinder}(344:436, '+', 2)
 ORF{NaiveFinder}(419:436, '+', 2)
 ORF{NaiveFinder}(422:436, '+', 2)
 ORF{NaiveFinder}(437:481, '+', 2)
 ⋮
 ORF{NaiveFinder}(4863:5258, '-', 3)
 ORF{NaiveFinder}(4933:5019, '+', 1)
 ORF{NaiveFinder}(4941:5375, '+', 3)
 ORF{NaiveFinder}(5053:5085, '+', 1)
 ORF{NaiveFinder}(5082:5375, '+', 3)
 ORF{NaiveFinder}(5089:5325, '+', 1)
 ORF{NaiveFinder}(5122:5202, '-', 1)
 ORF{NaiveFinder}(5152:5325, '+', 1)
 ORF{NaiveFinder}(5164:5325, '+', 1)
 ORF{NaiveFinder}(5180:5206, '-', 2)
 ORF{NaiveFinder}(5257:5325, '+', 1)
 ORF{NaiveFinder}(5263:5325, '+', 1)
 ORF{NaiveFinder}(5320:5325, '+', 1)
 ORF{NaiveFinder}(5330:5365, '-', 2)
 ORF{NaiveFinder}(5364:5375, '+', 3)

Since the ORF{N,F} stores the sequence, we can make cool things out of the box:

sequence.(orfs)

204-element Vector{LongSubSeq{DNAAlphabet{4}}}:
 ATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGC…GCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAA
 ATGGCGAGAAATAAAAGTCTGAAACATGATTAA
 ATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTT…GTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGA
 ATGTAG
 ATGAAGAAAACCACCATTACCAGCATTAACCGTCAAACTATCAAAATATAA
 ATGGCGACCATTCAAAGGATAAACATCATAGGCAGTCGG…ACCATTACCAGCATTAACCGTCAAACTATCAAAATATAA
 ATGTATCCATCTGAATGCAATGAAGAAAACCACCATTACCAGCATTAA
 ATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGG…CGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGA
 ATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTG…CGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGA
 ATGCAATGA
 ATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCT…GTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGA
 ATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTG…CGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGA
 ATGATGTTTATCCTTTGA
 ATGTTTATCCTTTGA
 ATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGA
 ⋮
 ATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGA…TTTATAGGTCTGTTGAACACGACCAGAAAACTGGCCTAA
 ATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGC…CTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGA
 ATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCG…AATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGA
 ATGGCAACTTGCCGCCGCGTGAAATTTCTATGA
 ATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAG…AATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGA
 ATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGA…AGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGGGAAGCCTTCAAGAAGGTGATAAGCAGGAGAAACAT…AAGGCGCATAACGATACCACTGACCCTCAGCAATCTTAA
 ATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTG…AGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCAT…AGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGAATGGGAAGCCTTCAAGAAGGTGA
 ATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGA
 ATGTGA
 ATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAA
 ATGACTTCGTGA

Exploring the new `Features` field

We can now try to use a scoring scheme, for instance, the simple log_odds_ratio_score and use minlen keyword argument:

findorfs(phi, scheme=lors, minlen=64) .|> features

134-element Vector{@NamedTuple{score::Float64}}:
 (score = -0.020953839808134023,)
 (score = -0.0008346917390085617,)
 (score = -0.005239249252611444,)
 (score = -0.007200284900850932,)
 (score = -0.007832152171979367,)
 (score = -0.0018172720089990923,)
 (score = -0.006404578260605308,)
 (score = -0.003209054988434599,)
 (score = -0.03100933596797633,)
 (score = -0.0005844727783277565,)
 (score = -0.0005442693515613672,)
 (score = -0.02946967703459084,)
 (score = -0.0006814154969336216,)
 (score = -0.0009388058672044588,)
 (score = -0.000946341251223099,)
 ⋮
 (score = -0.0003438480564352011,)
 (score = -0.00853670881288625,)
 (score = -0.00802605474396142,)
 (score = -0.01604090594572921,)
 (score = -0.00037344639883859457,)
 (score = -0.011635555210909566,)
 (score = -0.0006769082407312228,)
 (score = -0.014804098291466597,)
 (score = -0.0006454376565976915,)
 (score = -0.0014012196428411425,)
 (score = -0.0024182335566292662,)
 (score = -0.01466675827643835,)
 (score = -0.00302236305890491,)
 (score = -0.003913378336104148,)
 (score = -0.030410038701528405,)

Very neet interface for filtering

And much more we can now filter with very nice fine grain and can close #33 :

filter(x -> iscoding(sequence(x), η=1e-10) && length(x) > 100, findorfs(phi))

25-element Vector{ORF{4, NaiveFinder}}:
 ORF{NaiveFinder}(1021:1389, '+', 1)
 ORF{NaiveFinder}(1447:1635, '+', 1)
 ORF{NaiveFinder}(1489:1635, '+', 1)
 ORF{NaiveFinder}(1501:1635, '+', 1)
 ORF{NaiveFinder}(1531:1635, '+', 1)
 ORF{NaiveFinder}(2697:3227, '+', 3)
 ORF{NaiveFinder}(2745:3227, '+', 3)
 ORF{NaiveFinder}(2780:3142, '+', 2)
 ORF{NaiveFinder}(2841:3227, '+', 3)
 ORF{NaiveFinder}(2859:3227, '+', 3)
 ORF{NaiveFinder}(2865:3227, '+', 3)
 ORF{NaiveFinder}(2874:3227, '+', 3)
 ORF{NaiveFinder}(2973:3227, '+', 3)
 ORF{NaiveFinder}(3142:3312, '+', 1)
 ORF{NaiveFinder}(3481:3939, '+', 1)
 ORF{NaiveFinder}(3659:3934, '+', 2)
 ORF{NaiveFinder}(3734:3934, '+', 2)
 ORF{NaiveFinder}(3772:3939, '+', 1)
 ORF{NaiveFinder}(3806:3934, '+', 2)
 ORF{NaiveFinder}(4129:4287, '+', 1)
 ORF{NaiveFinder}(4160:4291, '-', 2)
 ORF{NaiveFinder}(4540:4644, '+', 1)
 ORF{NaiveFinder}(4690:4866, '+', 1)
 ORF{NaiveFinder}(4741:4866, '+', 1)
 ORF{NaiveFinder}(4744:4866, '+', 1)

Using GenomeFeature `GenomeIntervalCollections`

Since the ORF{N,F} is a subtype of a AbstractGenomicInterval we can use some nice methods and explore, for instance, the coverage of the ORF intersections:

orfcolleciton =  GenomicIntervalCollection(phiorfs)

GenomicIntervalCollection{ORF{4, NaiveFinder}} with 134 intervals:
  ORF{NaiveFinder}(9:101, '-', 3)
  ORF{NaiveFinder}(100:627, '+', 1)
  ORF{NaiveFinder}(223:447, '-', 1)
  ORF{NaiveFinder}(248:436, '+', 2)
  ORF{NaiveFinder}(257:436, '+', 2)
  ORF{NaiveFinder}(283:627, '+', 1)
  ORF{NaiveFinder}(344:436, '+', 2)
  ORF{NaiveFinder}(532:627, '+', 1)
  ⋮

coverage(orfcollection)

GenomicIntervalCollection{GenomicInterval{UInt32}} with 178 intervals:
  phi:9-99  .  1
  phi:100-101  .  2
  phi:102-222  .  1
  phi:223-247  .  2
  phi:248-256  .  3
  phi:257-282  .  4
  phi:283-343  .  5
  phi:344-436  .  6
  ⋮

…uires a new version of BMC.

kescobo · 2024-06-10T19:35:21Z

This looks fantastic :-)

camilogarciabotero · 2024-06-10T20:35:18Z

Thanks for all the suggestions!

I still doubt whether to keep the seq field in the ORF struct. Do you have any thoughts about it? It has the benefit of making the ORF struct more "portable" or robust but maybe it is somehow redundant as it could be enough to have the positions, frame, and strand...

kescobo · 2024-06-10T21:11:20Z

Are SubSeqs implemented as views? To me, that would make the most sense, so they could be carried along with the object, but not be reallocated

camilogarciabotero · 2024-06-11T00:59:13Z

Yes, it is seq::LongSubSeq{DNAAlphabet{N}} so yeah, allocations are minimal, but I am in this battle of making this package somehow generic, but also UI rich... I will leave this PR linger open a little bit more to give the general implementation a little more thought. Thanks for taking the time to take a look at it.

camilogarciabotero added 5 commits May 17, 2024 21:58

First steps towards ORFs as GenomicIntervals

16edd60

Some io fixes, and tests

d9cc396

Some attempts to correct tests

87b49c3

refactor: Remove unused code in runtests.jl

9132e11

Relax some tests

8e138ba

kescobo reviewed May 24, 2024

View reviewed changes

src/algorithms/naivefinder.jl Outdated Show resolved Hide resolved

Move get_variable_name outside loop

b412355

camilogarciabotero added 16 commits June 3, 2024 00:40

Major refactor of the ORF type and also add some groundwork for RBS

0aaf4e2

Update deps

14fff61

Improve getindex with some types

4dc47a6

Up translate and other utils are now more stable

61ac5c7

Update the min_len into minlen

27ca3f8

Remove unused function

4336542

Update deps

0516508

Refactor finders with correct score calculation

f17a1f5

chore: Update Julia version to 1.10.4

41e509b

Update README

140b672

Manifest update

8cbd961

Create Features struct and include it in the ORF as the NamedTuple field

56d3f1c

Refactor code to improve precompile file size and loading speed

fd839c9

Update io methods

d0cf754

Remove getorfs.jl

2ac4ce9

Update finder methods with the Feature struct

ba9e75e

camilogarciabotero commented Jun 8, 2024

View reviewed changes

camilogarciabotero linked an issue Jun 8, 2024 that may be closed by this pull request

Using score to filter what getorfs delivers #33

Closed

camilogarciabotero added 2 commits June 8, 2024 17:44

Update lors from BMC taking only one argument

80a0594

Update naivecolletor.jl to new BMC losr

44c791c

camilogarciabotero added 6 commits June 9, 2024 01:07

Update iscoding method, so that it takes an ORF directly

7663ab6

Update score method to handle directly the score field

b01af90

Update finder methods kwargs to only use the scheme kwargs

784c1d5

After fixing the lors from BMC package the lordr is also updated, req…

01c3e3e

…uires a new version of BMC.

Update to BMC v0.10.1

24b05b5

Update some docstring in lordr

2047656

camilogarciabotero added 10 commits July 1, 2024 14:12

Update deps

f21c4d9

Update getindex of ORFs

91cb0bb

reduce ORF type fields: get rid of seq

5235e69

Add new internal methods for calling var names and symbol

4607f96

Update docs simple coding rule

6419219

Up deps and extras

91e2450

Add YAML extra

f7ae859

Update gitignore and delete Manifest

0ffa402

Update actions cache and cov

929f482

refactor: Simplify NaiveCollector createorfs function

67e046d

camilogarciabotero merged commit 6ba4aa5 into main Jul 1, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Moving `ORF` as an `AbstractGenomicInterval{T}` #34

Moving `ORF` as an `AbstractGenomicInterval{T}` #34

camilogarciabotero commented May 23, 2024 •

edited

Loading

kescobo commented May 24, 2024

camilogarciabotero commented May 25, 2024

kescobo commented May 28, 2024

camilogarciabotero left a comment •

edited

Loading

kescobo commented Jun 10, 2024

camilogarciabotero commented Jun 10, 2024

kescobo commented Jun 10, 2024

camilogarciabotero commented Jun 11, 2024

Moving ORF as an AbstractGenomicInterval{T} #34

Moving ORF as an AbstractGenomicInterval{T} #34

Conversation

camilogarciabotero commented May 23, 2024 • edited Loading

kescobo commented May 24, 2024

camilogarciabotero commented May 25, 2024

kescobo commented May 28, 2024

camilogarciabotero left a comment • edited Loading

Choose a reason for hiding this comment

Exploring the new Features field

Very neet interface for filtering

Using GenomeFeature GenomeIntervalCollections

kescobo commented Jun 10, 2024

camilogarciabotero commented Jun 10, 2024

kescobo commented Jun 10, 2024

camilogarciabotero commented Jun 11, 2024

Moving `ORF` as an `AbstractGenomicInterval{T}` #34

Moving `ORF` as an `AbstractGenomicInterval{T}` #34

camilogarciabotero commented May 23, 2024 •

edited

Loading

camilogarciabotero left a comment •

edited

Loading

Exploring the new `Features` field

Using GenomeFeature `GenomeIntervalCollections`