Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --clip argument to bedgraphtobigwig #58

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

nleroy917
Copy link

Hey @jackh726!

As mentioned in #56, this PR adds a new option to the bedgraphtobigwig binary that will "skip" any chromosomes in the .bedGraph file that are not represented in the given chrom.sizes file. This functionality is inspired by the kent utils -clip option in wigToBigWig.

I appreciate you letting me take a stab at this. Currently... it's unfinished. It is not working properly, but I thought I'd open it and get your immediate feedback. I think I'm getting stuck in an infinite loop and will continue working.

Anyways, after getting my bearings, I was able to quickly add --clip to BBIWriteOptions. I had some troubles deciding how to implement the skip logic, however.

I believe that I was able to find the source of the error... The do_read that gets passed as the start_processing argument in process_to_bbi will return the ProcessDataError::InvalidChromosome error if the chromosome is not found in chrom.sizes. I had three ideas to go forward:

  1. Add a fourth argument called clip to the process_to_bbi function in the BBIDataSource trait. Then, inside match the error and skip if clip is true. I didn't like this since it felt too specific for the trait.
  2. Add a new error called ProcessDataError::ClippedChromosome. If clip is true, return this instead of the current ProcessDataError::ClippedChromosome. Likewise, this can be matched inside process_to_bbi and enable a "skip"-like feature. This felt weird since its not really an error.
  3. Add a boolean field to ProcessDataError::InvalidChromosome. Again, match on this error and that field being true, skip the chromosome.

I went with option three since it seemed the least hacky and most idiomatic/rusty way. I'm curious to your thoughts on this and it seems the real trick is just getting those BBIWriteOptions into process_to_bbi.

Will continue working on the infinite loop, so stay tuned. Thank you!

@jackh726
Copy link
Owner

Hi @nleroy917, nice to see a PR :)

Of your three options, 1 definitely does seem like not the right option. 2 and 3 are similar, in that you're modifying BBIProcessError to capture the "don't process this chromosome" state. I would offer one other option: make StartProcessing be FnMut(String) -> Result<Option<P>, ProcessDataError>, where Ok(None) means "no error, but don't process this chromosome.

One important note: in the BedParserStreamingIterator impl, you will still have to read through the bed file past the skipped chromosome. The BedParserParallelStreamingIterator impl can just be continued since each chrom gets a separate File open+seek.

I have a couple other thoughts:

First, the naive thing to do would be just "pretend" everything is fine to process_to_bbi. This would really only require faking the length by setting to u32::MAX. Then, later when actually writing the data, skip the chromosomes. This has the benefit that proces_to_bbi stays agnostic to clipping. But, has a few major downsides. Particularly that a lot of extra work is done for no reason, and that the chrom will get an id assigned to it. However, it seems appealing because...

Second, it's not a big deal, but so far process_to_bbi hasn't actually cared about the chromosome length at all. With this option, it doesn't need to, since this check can (and should) just be done in process_val (btw, I'm noticing that bigbedwrite::process_val has the wrong check: current_val.start >= chrom_length instead of current_val.end >= chrom_length - can you fix that when you touch that bit of code?). But, it does split the handling of clip into two places. An alternative is to move those checks into process_val, but I don't like that option particularly much, since now implementations of BBIDataProcessor have to duplicate value validation. And, I think just having a way to tell process_to_bbi to skip a chromsome is fine if the actual reason behind that is agnostic.

Thoughts?

@nleroy917
Copy link
Author

Wow has it been a month? @jackh726 apologies for how long it took to get back to this! I promise I haven't forgotten!

I've read through, and I really like the idea of making the StartProcessing arg be a closure to return Result<Option<P>, ProcessDataError>. I think I've got that implemented. I'm working my way through actual skip logic now, which is less clear to me. I think I need to spend a tad bit more time with the codebase...

Stay tuned I am going to keep going on this.

@jackh726
Copy link
Owner

No wories, let me know if you get stuck and need some thoughts!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants