Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate mappings of common ranges from one path to another #73

Merged
merged 6 commits into from
Oct 30, 2024

Conversation

dkhofer
Copy link
Collaborator

@dkhofer dkhofer commented Oct 16, 2024

No description provided.

let tree: IntervalTree<i64, PathBlock> = blocks
.into_iter()
.map(|block| (block.path_start..block.path_end, block))
.collect();
tree
}

pub fn find_block_mappings(&self, conn: &Connection, other_path: &Path) -> Vec<RangeMapping> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we need some kind of sequence extension logic option here. Imagine we have 2 gfas of identical sequences but different node lengths -- what happens here? I think the default should be everything is in the same frame/etc. but we should have a flag that is sequence based

Copy link
Collaborator

@Chris7 Chris7 Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if we can actually resolve this with contig names and not need to ever fetch sequences for that case too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we import from one GFA, then update from a second, we should generate the sequences from the paths defined in the second and check if there's an existing path that matches any of them, and if so insert appropriate edges etc. so that there's only one path through the block group that generates that sequence.

Is that the scenario you have in mind?

I think you're asking about an issue I brought up in my doc about reference paths: If a GFA is meant to update an existing path in the gen db, how does the user specify that. The naive way would be to force them to redefine the entire path node-for-node as it exists in the database in their input GFA, and then they can define new segments etc off that in their input file. They could of course export what's in the database and then add to the exported file, then update gen from that.

I do think it'll be possible to define an abbreviated format eventually, where they just specify new segments etc and specify where those would go on an existing path.

Regardless, I think your suggestion is out of scope for this PR? This method is only meant to be used for paths that already exist in the database. Resolving identical sequences seems to be a problem for updating with files.
Ideally, we'd catch the duplicate sequence problem before they get stored in the db.

Let me know if I'm not understanding.

mappings.push(mapping);
}

if their_block.sequence_end < our_block.sequence_end {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we ever have a wrap around case that inadvertently hits the break condition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like for a circular contig? The way I'm envisioning it, no. Since this is all node-based, any edits or variants will be recorded with respect to the path, which has the virtual start and end nodes, so wraparound can't occur that way. If the user updated the same contig with a duplicate sequence that was rotated partway, it would get a new set of nodes under the current logic. If we change updates so we look for duplicate sequences under rotation, I imagine we'd rotate any incoming duplicate sequence to match the existing sequence, and add to that with anything new coming in from the update.

Copy link
Collaborator

@Chris7 Chris7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good. Just some questions and thoughts on the data return api.

src/range.rs Outdated
is_circular_contig: bool,
) -> i64 {
if !self.contains(index) {
panic!("Index {} is not contained in range {:?}", index, self);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make this return Result? like Result<i64, &'static str>. So this condition would be return Err("Index xxx and we don't break the api.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done

}
}

mappings
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure if we should be returning extra info w/ the range start. As a user receiving back overlapping coordinates, if i put in a heterogenous set of nodes, how would i use the output? For example, an entire blockgroup or 2 different paths.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be honest, I'm not understanding the question. I'll start a slack thread

@dkhofer dkhofer merged commit 8731c09 into main Oct 30, 2024
1 check passed
@dkhofer dkhofer deleted the path-mappings branch October 30, 2024 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants