-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative to blocks #16
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending back for initial take. Need some clarity on the border stuff.
migrations/01-initial/up.sql
Outdated
CREATE UNIQUE INDEX block_group_edges_uidx ON block_group_edges(block_group_id, edge_id); | ||
|
||
INSERT INTO sequence (hash, sequence_type, sequence, "length") values ("yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy", "OTHER", "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy", 64), ("zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", "OTHER", "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we call these start
and end
or something we can guarantee to not collide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we want length matching for some optimizations if they exist can pad with - as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looked to me like the hashes are hex, so I chose y's and z's because they wouldn't collide. I might be misunderstanding the suggestion though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You got it right. I didn't realize sha was only hexadecimal, but generally we're saying the same thing. I meant start/end just to make it abundantly obvious what this represents when looking at the db. Or edge-start edge-termination or whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed them to "start-node-yyyy..." and "end-node-zzzz..." to make it a bit clearer
src/main.rs
Outdated
let new_sequence_hash = Sequence::create(conn, "DNA", alt_seq, true); | ||
let sequences_by_hash = Sequence::sequences_by_hash( | ||
conn, | ||
vec![format!("\"{}\"", new_sequence_hash)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use the rusqlite Value to do this, like
vec![SQLValue::from(new_sequence_hash)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I actually got mad at myself for how I wrote the sequences_by_hash method, I'm going to clean it up in a later PR. sequences_by_hash should format the strings, not the caller. I'll use SQLValue::from there
} | ||
|
||
let mut boundary_edges_by_hash = HashMap::<String, Vec<NewEdge>>::new(); | ||
for edge in edges { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ca nyou comment what this is representing? sort of confused by an essential identity edge (source == target)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each represents a split point on an existing sequence, where it's been split into two virtual blocks.
The naive intuition (which I also had) is that if a sequence is split after the bp at index 4, the resulting edge that marks that split should be (4, 5). But it turns out that if we make it (5, 5), it plays better with the [x, y) notation (0 based end-exclusive) that we're using for blocks, and also the python/rust slice notation.
In fact in general, having the source coordinate of an edge be one greater than the bp index it's referencing works out better, because it fits better with the range indices for blocks.
Let's assume we're using the naive version. Say we have a pair of edges that define a block:
Edge 1: target sequence S1, target coordinate 5
Edge 2: source sequence S1, source coordinate 10
Then in rust slice terms, the subsequence is S1[5..11]. To get the correct slice (or block) end coordinate, you need to add one to the source coordinate.
Then, in insert_change, we have some blocks and we need to create edges between them. To create an edge that goes from the end of a range [x, y) to the beginning of a range [z, a), the source coordinate would be y-1.
There are at least 6 lines of code where if we use the naive version, we need to add 1 or subtract 1 to convert from a block/slice coordinate to an edge coordinate or vice versa. I finally gave up and went with the convention that the edge source coordinate is 1 greater than the bp index it's referencing.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It took me a while to get to that concise description. All I knew when I was trying to get the unit tests passing was that I kept screwing up one coordinate or another, until I realized the edge coordinates were using a different range definition than the block/slice coordinates.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tricky thing to wrap the brain around is that edges are the complement of blocks, so the beginning of an edge is the end of a block, and vice versa, so it's nice to have the coordinate systems agree.
for (hash, sequence) in sequences_by_hash.into_iter() { | ||
let sequence_edges = boundary_edges_by_hash.get(&hash); | ||
if sequence_edges.is_some() { | ||
let sorted_sequence_edges: Vec<NewEdge> = sequence_edges |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't really get why this would make the edges sorted in a way we want. They're being sorted by the order they slice a stored sequence, not the order they appear in a path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if my sequence was:
AAAAAACCCCCTTTTTT
and i had a path of TTTTT -> AAAA, what would happen here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think i just need more intuition on boundary edges to comment here. will wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comments
let edge1 = NewEdge::create( | ||
conn, | ||
NewEdge::PATH_START_HASH.to_string(), | ||
-123, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why -123 as a coord?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically anything less than 0 works, this is the input edge from the virtual "start node". I'm not 100% what we want to do for this, and I don't remember but I think I put -123 because it showed the arbitrariness of how it works.
It turns out that it's useful to have virtual start and end blocks in the interval tree of blocks for a path, so I think we want to keep this around in some form. Unless this seems seriously bogus to you, I'd like to keep it in this PR and then write up the pros and cons of it and discuss, and if we want to change anything do that next week
src/models/path.rs
Outdated
let block_sequence; | ||
let block_sequence_length; | ||
|
||
if end >= start { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty opinionated to not use this convention. I've seen it cause horrors in some dbs that attempted this. We should reserve this for circles where something wraps the origin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. OK, I can add a strand field to the new edges
use rusqlite::{params_from_iter, Connection}; | ||
use std::collections::HashMap; | ||
|
||
#[derive(Clone, Debug)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, removes the issue of making a graph and finding nodes.
use std::hash::RandomState; | ||
|
||
#[derive(Clone, Debug)] | ||
pub struct NewEdge { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have strand here now, effectively what does the edge point to on the sequence (top side or bottom side)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get strand for a "virtual block" by looking at the pair of edges that defines the block. If the first edge's target coordinate is greater than the second edge's source coordinate, the strand is reverse. See edge_pairs_to_block in path.rs
I know all this is nonintuitive, it took me a while to figure it all out too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forget it, saw your later comment, I'll put strand on the edge
let mut blocks = vec![]; | ||
|
||
let mut block_index = 0; | ||
for (hash, sequence) in sequences_by_hash.into_iter() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it matter that the blocks aren't going to be in the order of the edges? i don't see anywhere that is keeping order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is only used for get_all_sequences, so we don't care about ordering the blocks. We just want to get blocks back that have a way to connect them to the edges, so we can feed them into the graph. (The way to connect them is, the blocks have a sequence hash and start/end coordinates, and we get the edges that match on target hash/coordinate or source hash/coordinate to know which ones are going into or coming out of a block.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a related method in path.rs that does do the ordering like you're expecting, because that is needed for a path
|
||
// NOTE: Add edges marking the existing part of the sequence that is being substituted out, | ||
// so we can retrieve it as one node of the overall graph | ||
if start < start_block.path_end { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like start == path_start and end == path_end. should htis code just go in the non-deletion block above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These also seem to be what the boundry edges are made from, but i still don't understand them and why we have source == target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, see above. This is the part of the code that adds the boundary edges. Intuitively, the conditional is checking if the virtual blocks are already split at this location. If start == start_block.path_end, that's the case, because the path_end field is the location of the split in the path.
Changes today:
|
.cloned() | ||
.collect(); | ||
let first_edge = sorted_sequence_edges[0].clone(); | ||
let start = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confused why start would always be 0, but i think we're going to be dropping this code anyways to get rid of boundary edges/branch points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or in other words, start/end always refer to the position in the sequence string, not the coordinates so much. The coordinates are obtained by a collection of blocks w/ lengths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal with this part of the code is to iterate over all sequences, and for each sequence, to get its blocks, which will then be fed into the graph (in order to get all sequences). The first block of a sequence will always start with 0, so I don't think this will change. If it's still not making sense, we can discuss more
blocks.push(last_block); | ||
block_index += 1; | ||
} else { | ||
blocks.push(GroupBlock { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future note -- on the builder/default pattern, i think we need to go w/ builder and this makes me realize a great reason why -- shallow sequences will be a real pain to deal with at this level everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i still need to work out some things, but i think this is in a place it should be mergd so we can explore it more.
Switch to noodle fasta parser
No description provided.