Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions) #397

Open
larrybabb opened this issue Apr 11, 2024 · 5 comments

Comments

@larrybabb
Contributor

When trying to normalize the variant NC_000015.9:g.7211_7214del, the normalizer goes into a seemingly endless loop while trying to compute the normalized result for the Allele.state.

Without a full analysis, there is evidence that this is likely caused by the fact that the first 17 million bases in chromosome 15 are all Ns. As the normalizer rolls right/left to reach a unique sequence region, it runs for an impractical amount of time.

I suggest we put a limit on how large the sequence can grow when normalizing the Allele. But we should discuss how best to handle this.
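As a rough illustration of the limit idea, here is a minimal sketch of a right-roll for a deletion that gives up after a fixed number of steps. It works on a plain in-memory string and is not the bioutils or vrs-python implementation; `roll_deletion_right`, `RollLimitExceeded` and `max_steps` are hypothetical names made up for this example.

```python
class RollLimitExceeded(Exception):
    """Hypothetical error: the roll went further than the caller allows."""


def roll_deletion_right(sequence, start, end, max_steps):
    """Return how far the deleted interval [start, end) can shift right.

    Simplified stand-in for a real roll: the deletion can move one base to
    the right as long as the base entering the window equals the base
    leaving it. Raises RollLimitExceeded once the shift passes max_steps.
    """
    shift = 0
    while (end + shift < len(sequence)
           and sequence[start + shift] == sequence[end + shift]):
        shift += 1
        if shift > max_steps:
            raise RollLimitExceeded(
                f"deletion rolled more than {max_steps} bases; giving up"
            )
    return shift


# A short homopolymer run rolls a few bases and stops.
print(roll_deletion_right("ACGTTTTTGA", 3, 4, max_steps=100))

# A long run of Ns trips the limit instead of rolling to the end.
try:
    roll_deletion_right("N" * 1000, 10, 14, max_steps=100)
except RollLimitExceeded as exc:
    print("skipped:", exc)
```

What value the limit should take, and whether hitting it should raise or return the expression unnormalized, is exactly the part that still needs discussion.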

@toneillbroad just suggested that maybe we simply disallow any normalization that includes ambiguity-coded bases (anything other than A, C, T or G). I sort of like that as a general rule of thumb, since it is very difficult to assess the true normalized form of a sequence that includes any of the ambiguity codes. We can make this a vrs-python rule so that our normalizer doesn't go off and never return in these portions of the reference sequences.

@larrybabb
Contributor Author

@ahwagner we would like you to weigh in on this so we can put a stopgap solution into vrs-python ASAP, even if we have to revisit a more formal decision later.

@ahwagner
Member

I agree with the solution proposed by @toneillbroad.

@theferrit32 changed the title from "Normalization needs to throw exception in exceptional situations" to "Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions)" on Apr 30, 2024
@theferrit32
Contributor

Based on discussion with @larrybabb, it sounds like this is only a problem for genomic sequences, not transcripts.

(By N below I really mean anything that is not A, C, T or G.)

cases:

  • substitution: check if the ref or alt includes an N
  • insertion: check if the sequence being inserted includes an N
  • deletion: check if the sequence being deleted includes an N
  • dup: check if the sequence being duplicated includes an N

others?
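The cases listed above could be sketched as a single pre-check run before normalization. This is only an illustration under assumptions: the variant-type strings, the `ref`/`alt` arguments and the function names are hypothetical and do not correspond to existing vrs-python or bioutils APIs. For a dup, `ref` is taken to mean the sequence being duplicated.

```python
UNAMBIGUOUS_BASES = set("ACGT")

# Which sequences to inspect for each variant type (hypothetical labels).
SEQUENCES_TO_CHECK = {
    "substitution": ("ref", "alt"),
    "insertion": ("alt",),
    "deletion": ("ref",),
    "dup": ("ref",),
}


def is_unambiguous(seq):
    """True if seq contains only A, C, G or T (case-insensitive)."""
    return set(seq.upper()) <= UNAMBIGUOUS_BASES


def has_ambiguous_base(variant_type, ref="", alt=""):
    """True if any sequence relevant to this variant type contains an
    ambiguity code such as N, in which case normalization is skipped."""
    sequences = {"ref": ref, "alt": alt}
    return any(
        not is_unambiguous(sequences[name])
        for name in SEQUENCES_TO_CHECK[variant_type]
    )


print(has_ambiguous_base("deletion", ref="NNNN"))
print(has_ambiguous_base("insertion", alt="ACG"))
```

Note that this only inspects the sequences named in the expression; it would not catch a clean insertion placed next to a long run of Ns, which may be one of the "others".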

@theferrit32
Contributor

theferrit32 commented Apr 30, 2024

After much frustration with multithreading in Python, I have found a way to work around this issue in client code, at a higher level, without adding much overhead. Using a background task queue, a return value queue, a background process which runs the tasks and can be interrupted, and a timeout on return values, I can terminate any call into a Translator that takes longer than, say, 1 minute, and add an error message to the output file indicating that the variant was skipped.

It may still be nice to implement something in vrs-python which checks the sequence beforehand, or in bioutils during roll left/right, because this would make this available to other codebases which use vrs-python. Or I could look at adding something like a translator wrapper which has the timeout logic built in.

@ahwagner
Member

@theferrit32 have we proposed implementing this over in Biocommons? I agree that it makes sense for us to implement the solution there.
