Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache block group, path, and interval tree for path, and batch inserts #23

Merged
merged 4 commits into from
Aug 28, 2024

Conversation

dkhofer
Copy link
Collaborator

@dkhofer dkhofer commented Aug 27, 2024

No description provided.

@dkhofer dkhofer requested a review from Chris7 August 28, 2024 13:39
@dkhofer dkhofer changed the title Cache block group, path, and interval tree for path Cache block group, path, and interval tree for path, and batch inserts Aug 28, 2024
Copy link
Collaborator

@Chris7 Chris7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One comment on a place i think we can potentially get a good speedup.

src/main.rs Outdated
name: String,
) -> i32 {
let block_group_key = BlockGroupData {
collection_name: collection_name.to_string(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already a string. I think this block can be made simpler by making the String arguments into &str

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I converted as many as I could for the block group stuff, but ran in to some reference lifetime issues with one variable, so I'm declaring success

@@ -224,46 +299,17 @@ fn update_with_vcf(
let allele = allele.unwrap();
if allele != 0 {
let alt_seq = alt_alleles[allele - 1];
// TODO: new sequence may not be real and be <DEL> or some sort. Handle these.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for consolidating this

chromosome_index as i32,
phased,
);
changes.push(change);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should think about checkpointing this to make it inserts of like 10k or 50k to reduce memory pressure. Could get messy with the many gb vcfs as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, but not going to do it for this PR. Adding it to the list though

src/main.rs Outdated
.sequence_type("DNA")
.sequence(alt_seq)
.save(conn);
let sequence =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a good place to have a small cache. For two cases this has great benefits:

  • For many samples, we iterate by samples then by genotype, meaning we can be looking up the sample 3-4 alt alleles many times.
  • For trivial changes, every simple SNP is a lookup

Maybe a cache size of like 200?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I'm going to do that in my next PR. I'd like to merge this one

@dkhofer dkhofer merged commit 0ebe206 into main Aug 28, 2024
1 check passed
@Chris7 Chris7 deleted the batch-insert-changes branch September 24, 2024 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants