-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate mappings of common ranges from one path to another #73
Conversation
let tree: IntervalTree<i64, PathBlock> = blocks | ||
.into_iter() | ||
.map(|block| (block.path_start..block.path_end, block)) | ||
.collect(); | ||
tree | ||
} | ||
|
||
pub fn find_block_mappings(&self, conn: &Connection, other_path: &Path) -> Vec<RangeMapping> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we need some kind of sequence extension logic option here. Imagine we have 2 gfas of identical sequences but different node lengths -- what happens here? I think the default should be everything is in the same frame/etc. but we should have a flag that is sequence based
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if we can actually resolve this with contig names and not need to ever fetch sequences for that case too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we import from one GFA, then update from a second, we should generate the sequences from the paths defined in the second and check if there's an existing path that matches any of them, and if so insert appropriate edges etc. so that there's only one path through the block group that generates that sequence.
Is that the scenario you have in mind?
I think you're asking about an issue I brought up in my doc about reference paths: If a GFA is meant to update an existing path in the gen db, how does the user specify that. The naive way would be to force them to redefine the entire path node-for-node as it exists in the database in their input GFA, and then they can define new segments etc off that in their input file. They could of course export what's in the database and then add to the exported file, then update gen from that.
I do think it'll be possible to define an abbreviated format eventually, where they just specify new segments etc and specify where those would go on an existing path.
Regardless, I think your suggestion is out of scope for this PR? This method is only meant to be used for paths that already exist in the database. Resolving identical sequences seems to be a problem for updating with files.
Ideally, we'd catch the duplicate sequence problem before they get stored in the db.
Let me know if I'm not understanding.
mappings.push(mapping); | ||
} | ||
|
||
if their_block.sequence_end < our_block.sequence_end { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we ever have a wrap around case that inadvertently hits the break condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like for a circular contig? The way I'm envisioning it, no. Since this is all node-based, any edits or variants will be recorded with respect to the path, which has the virtual start and end nodes, so wraparound can't occur that way. If the user updated the same contig with a duplicate sequence that was rotated partway, it would get a new set of nodes under the current logic. If we change updates so we look for duplicate sequences under rotation, I imagine we'd rotate any incoming duplicate sequence to match the existing sequence, and add to that with anything new coming in from the update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks good. Just some questions and thoughts on the data return api.
src/range.rs
Outdated
is_circular_contig: bool, | ||
) -> i64 { | ||
if !self.contains(index) { | ||
panic!("Index {} is not contained in range {:?}", index, self); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make this return Result
? like Result<i64, &'static str>
. So this condition would be return Err("Index xxx
and we don't break the api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done
} | ||
} | ||
|
||
mappings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure if we should be returning extra info w/ the range start. As a user receiving back overlapping coordinates, if i put in a heterogenous set of nodes, how would i use the output? For example, an entire blockgroup or 2 different paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be honest, I'm not understanding the question. I'll start a slack thread
49341fd
to
063364f
Compare
No description provided.