Cache block group, path, and interval tree for path, and batch inserts #23
LGTM. One comment on a place where I think we could get a good speedup.
src/main.rs
Outdated
    name: String,
) -> i32 {
    let block_group_key = BlockGroupData {
        collection_name: collection_name.to_string(),
This is already a String. I think this block can be simplified by changing the String arguments to &str.
I converted as many as I could for the block group code, but ran into reference lifetime issues with one variable, so I'm declaring success.
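A minimal sketch of the `&str` suggestion, for illustration only. The two-field `BlockGroupData` and the `cached_block_group_id` helper are assumptions, not the PR's actual definitions; the real struct and function may carry more fields and talk to the database.

```rust
use std::collections::HashMap;

/// Hypothetical cache key; the real `BlockGroupData` in the PR may differ.
#[derive(Hash, PartialEq, Eq, Clone, Debug)]
struct BlockGroupData {
    collection_name: String,
    name: String,
}

/// Takes borrowed strings, so callers holding a `String` pass `&owned`
/// and callers holding a literal pass it directly. The owned key is
/// built in exactly one place.
fn cached_block_group_id(
    cache: &mut HashMap<BlockGroupData, i32>,
    next_id: &mut i32,
    collection_name: &str,
    name: &str,
) -> i32 {
    let key = BlockGroupData {
        collection_name: collection_name.to_string(),
        name: name.to_string(),
    };
    // Stand-in for the database insert: hand out the next id on a miss.
    *cache.entry(key).or_insert_with(|| {
        let id = *next_id;
        *next_id += 1;
        id
    })
}
```

Because the key owns its strings, the map has no borrowed data tied to the caller's variables, which sidesteps the kind of lifetime trouble that a key holding `&str` fields can cause.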
@@ -224,46 +299,17 @@ fn update_with_vcf(
    let allele = allele.unwrap();
    if allele != 0 {
        let alt_seq = alt_alleles[allele - 1];
        // TODO: new sequence may not be real and be <DEL> or some sort. Handle these.
thanks for consolidating this
        chromosome_index as i32,
        phased,
    );
    changes.push(change);
We should think about checkpointing this into batched inserts of around 10k or 50k changes to reduce memory pressure. It could get messy with multi-GB VCFs as well.
Definitely, but I'm not going to do it in this PR. I'm adding it to the list, though.
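For reference, the checkpointing idea could look roughly like the sketch below. `BatchWriter` is a hypothetical helper, not something in this PR: it buffers items and hands them to a caller-supplied closure every `batch_size` pushes, so the full list of changes never has to sit in memory at once. In the real code the closure would presumably wrap the bulk insert.

```rust
/// Hypothetical helper: buffers items and flushes them in fixed-size batches.
struct BatchWriter<T, F: FnMut(&[T])> {
    buffer: Vec<T>,
    batch_size: usize,
    flush: F,
}

impl<T, F: FnMut(&[T])> BatchWriter<T, F> {
    fn new(batch_size: usize, flush: F) -> Self {
        assert!(batch_size > 0, "batch_size must be non-zero");
        BatchWriter {
            buffer: Vec::with_capacity(batch_size),
            batch_size,
            flush,
        }
    }

    /// Adds one item and flushes as soon as the buffer is full.
    fn push(&mut self, item: T) {
        self.buffer.push(item);
        if self.buffer.len() >= self.batch_size {
            self.flush_now();
        }
    }

    /// Flushes whatever is buffered; call once at the end for the remainder.
    fn flush_now(&mut self) {
        if !self.buffer.is_empty() {
            (self.flush)(&self.buffer);
            self.buffer.clear();
        }
    }
}
```

With something like this, `changes.push(change)` would become `writer.push(change)` with a batch size of, say, 10_000, followed by a final `writer.flush_now()` after the loop.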
src/main.rs
Outdated
    .sequence_type("DNA")
    .sequence(alt_seq)
    .save(conn);
let sequence =
This could be a good place for a small cache. It has big benefits in two cases:
- For many samples, we iterate by sample and then by genotype, so we can end up looking up the same 3-4 alt alleles many times.
- For trivial changes, every simple SNP is a lookup.
Maybe a cache size of around 200?
Agreed. I'm going to do that in my next PR. I'd like to merge this one.
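A rough sketch of what such a cache might look like, using only the standard library. `SmallLruCache`, the `i64` id type, and the `load` closure are all assumptions for illustration; in the real code `load` would stand in for the sequence lookup or save against the database. Eviction scans all entries for the least recently used one, which is acceptable at a capacity of about 200 but would not scale to large caches.

```rust
use std::collections::HashMap;

/// Hypothetical bounded cache: sequence string -> (id, last-used tick).
struct SmallLruCache {
    capacity: usize,
    tick: u64,
    entries: HashMap<String, (i64, u64)>,
}

impl SmallLruCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        SmallLruCache {
            capacity,
            tick: 0,
            entries: HashMap::with_capacity(capacity),
        }
    }

    /// Returns the cached id for `sequence`, calling `load` only on a miss.
    fn get_or_insert_with(&mut self, sequence: &str, load: impl FnOnce(&str) -> i64) -> i64 {
        self.tick += 1;
        let tick = self.tick;
        if let Some(entry) = self.entries.get_mut(sequence) {
            entry.1 = tick; // refresh recency on a hit
            return entry.0;
        }
        let id = load(sequence);
        if self.entries.len() >= self.capacity {
            // Linear scan for the least recently used key.
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, v)| v.1)
                .map(|(k, _)| k.clone());
            if let Some(key) = oldest {
                self.entries.remove(&key);
            }
        }
        self.entries.insert(sequence.to_string(), (id, tick));
        id
    }
}
```

A single-base SNP such as "A" or "T" would then hit the database once and be served from memory afterwards, as long as it stays among the most recently used entries.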
No description provided.