Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative to blocks #16

Merged
merged 20 commits into from
Aug 26, 2024
Merged

Alternative to blocks #16

merged 20 commits into from
Aug 26, 2024

Conversation

dkhofer
Copy link
Collaborator

@dkhofer dkhofer commented Aug 19, 2024

No description provided.

@dkhofer dkhofer requested a review from Chris7 August 22, 2024 15:50
Copy link
Collaborator

@Chris7 Chris7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sending back for initial take. Need some clarity on the border stuff.

CREATE UNIQUE INDEX block_group_edges_uidx ON block_group_edges(block_group_id, edge_id);

INSERT INTO sequence (hash, sequence_type, sequence, "length") values ("yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy", "OTHER", "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy", 64), ("zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", "OTHER", "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 64);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we call these start and end or something we can guarantee to not collide.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want length matching for some optimizations if they exist can pad with - as well

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looked to me like the hashes are hex, so I chose y's and z's because they wouldn't collide. I might be misunderstanding the suggestion though

Copy link
Collaborator

@Chris7 Chris7 Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got it right. I didn't realize sha was only hexadecimal, but generally we're saying the same thing. I meant start/end just to make it abundantly obvious what this represents when looking at the db. Or edge-start edge-termination or whatever.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed them to "start-node-yyyy..." and "end-node-zzzz..." to make it a bit clearer

src/main.rs Outdated
let new_sequence_hash = Sequence::create(conn, "DNA", alt_seq, true);
let sequences_by_hash = Sequence::sequences_by_hash(
conn,
vec![format!("\"{}\"", new_sequence_hash)],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the rusqlite Value to do this, like

vec![SQLValue::from(new_sequence_hash)]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I actually got mad at myself for how I wrote the sequences_by_hash method, I'm going to clean it up in a later PR. sequences_by_hash should format the strings, not the caller. I'll use SQLValue::from there

}

let mut boundary_edges_by_hash = HashMap::<String, Vec<NewEdge>>::new();
for edge in edges {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ca nyou comment what this is representing? sort of confused by an essential identity edge (source == target)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each represents a split point on an existing sequence, where it's been split into two virtual blocks.

The naive intuition (which I also had) is that if a sequence is split after the bp at index 4, the resulting edge that marks that split should be (4, 5). But it turns out that if we make it (5, 5), it plays better with the [x, y) notation (0 based end-exclusive) that we're using for blocks, and also the python/rust slice notation.

In fact in general, having the source coordinate of an edge be one greater than the bp index it's referencing works out better, because it fits better with the range indices for blocks.

Let's assume we're using the naive version. Say we have a pair of edges that define a block:

Edge 1: target sequence S1, target coordinate 5
Edge 2: source sequence S1, source coordinate 10

Then in rust slice terms, the subsequence is S1[5..11]. To get the correct slice (or block) end coordinate, you need to add one to the source coordinate.

Then, in insert_change, we have some blocks and we need to create edges between them. To create an edge that goes from the end of a range [x, y) to the beginning of a range [z, a), the source coordinate would be y-1.

There are at least 6 lines of code where if we use the naive version, we need to add 1 or subtract 1 to convert from a block/slice coordinate to an edge coordinate or vice versa. I finally gave up and went with the convention that the edge source coordinate is 1 greater than the bp index it's referencing.

Does that make sense?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(It took me a while to get to that concise description. All I knew when I was trying to get the unit tests passing was that I kept screwing up one coordinate or another, until I realized the edge coordinates were using a different range definition than the block/slice coordinates.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tricky thing to wrap the brain around is that edges are the complement of blocks, so the beginning of an edge is the end of a block, and vice versa, so it's nice to have the coordinate systems agree.

for (hash, sequence) in sequences_by_hash.into_iter() {
let sequence_edges = boundary_edges_by_hash.get(&hash);
if sequence_edges.is_some() {
let sorted_sequence_edges: Vec<NewEdge> = sequence_edges
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't really get why this would make the edges sorted in a way we want. They're being sorted by the order they slice a stored sequence, not the order they appear in a path

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if my sequence was:
AAAAAACCCCCTTTTTT and i had a path of TTTTT -> AAAA, what would happen here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i just need more intuition on boundary edges to comment here. will wait

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comments

let edge1 = NewEdge::create(
conn,
NewEdge::PATH_START_HASH.to_string(),
-123,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why -123 as a coord?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically anything less than 0 works, this is the input edge from the virtual "start node". I'm not 100% what we want to do for this, and I don't remember but I think I put -123 because it showed the arbitrariness of how it works.

It turns out that it's useful to have virtual start and end blocks in the interval tree of blocks for a path, so I think we want to keep this around in some form. Unless this seems seriously bogus to you, I'd like to keep it in this PR and then write up the pros and cons of it and discuss, and if we want to change anything do that next week

let block_sequence;
let block_sequence_length;

if end >= start {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty opinionated to not use this convention. I've seen it cause horrors in some dbs that attempted this. We should reserve this for circles where something wraps the origin.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. OK, I can add a strand field to the new edges

use rusqlite::{params_from_iter, Connection};
use std::collections::HashMap;

#[derive(Clone, Debug)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, removes the issue of making a graph and finding nodes.

use std::hash::RandomState;

#[derive(Clone, Debug)]
pub struct NewEdge {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have strand here now, effectively what does the edge point to on the sequence (top side or bottom side)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get strand for a "virtual block" by looking at the pair of edges that defines the block. If the first edge's target coordinate is greater than the second edge's source coordinate, the strand is reverse. See edge_pairs_to_block in path.rs

I know all this is nonintuitive, it took me a while to figure it all out too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forget it, saw your later comment, I'll put strand on the edge

let mut blocks = vec![];

let mut block_index = 0;
for (hash, sequence) in sequences_by_hash.into_iter() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter that the blocks aren't going to be in the order of the edges? i don't see anywhere that is keeping order.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is only used for get_all_sequences, so we don't care about ordering the blocks. We just want to get blocks back that have a way to connect them to the edges, so we can feed them into the graph. (The way to connect them is, the blocks have a sequence hash and start/end coordinates, and we get the edges that match on target hash/coordinate or source hash/coordinate to know which ones are going into or coming out of a block.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a related method in path.rs that does do the ordering like you're expecting, because that is needed for a path


// NOTE: Add edges marking the existing part of the sequence that is being substituted out,
// so we can retrieve it as one node of the overall graph
if start < start_block.path_end {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like start == path_start and end == path_end. should htis code just go in the non-deletion block above?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These also seem to be what the boundry edges are made from, but i still don't understand them and why we have source == target.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, see above. This is the part of the code that adds the boundary edges. Intuitively, the conditional is checking if the virtual blocks are already split at this location. If start == start_block.path_end, that's the case, because the path_end field is the location of the split in the path.

@dkhofer
Copy link
Collaborator Author

dkhofer commented Aug 26, 2024

Changes today:

  • Reworked Sequence::sequences_by_hash() to do the formatting itself. I couldn't get SQLValue working with a vec of strings and the sql syntax, but the formatting is just in that one method now so callers don't have to worry about it
  • Added source_strand and target_strand to new edges, and changed the relevant logic for RC sequence parts

@dkhofer dkhofer assigned Chris7 and unassigned dkhofer Aug 26, 2024
.cloned()
.collect();
let first_edge = sorted_sequence_edges[0].clone();
let start = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confused why start would always be 0, but i think we're going to be dropping this code anyways to get rid of boundary edges/branch points.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or in other words, start/end always refer to the position in the sequence string, not the coordinates so much. The coordinates are obtained by a collection of blocks w/ lengths.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal with this part of the code is to iterate over all sequences, and for each sequence, to get its blocks, which will then be fed into the graph (in order to get all sequences). The first block of a sequence will always start with 0, so I don't think this will change. If it's still not making sense, we can discuss more

blocks.push(last_block);
block_index += 1;
} else {
blocks.push(GroupBlock {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future note -- on the builder/default pattern, i think we need to go w/ builder and this makes me realize a great reason why -- shallow sequences will be a real pain to deal with at this level everywhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator

@Chris7 Chris7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still need to work out some things, but i think this is in a place it should be mergd so we can explore it more.

@dkhofer dkhofer merged commit 26bccf6 into main Aug 26, 2024
1 check passed
@Chris7 Chris7 deleted the get-rid-of-blocks branch September 24, 2024 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants