Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions) #397

larrybabb · 2024-04-11T14:02:33Z

When trying to normalize the variant NC_000015.9:g.7211_7214del, the routine goes into a seemingly endless loop while trying to compute the normalized result for the Allele.state.

Without a full analysis, there is evidence that this is likely caused by the fact that the first 17 million bases in chromosome 15 are all Ns. As the normalizer rolls right/left to reach a unique sequence region, it runs for an impractical amount of time.

I suggest we put a limit on how large the sequence can grow when normalizing the Allele. But we should discuss how best to handle this.
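A minimal sketch of what such a limit could look like, assuming a simplified right-roll of a deletion over an in-memory string. The names `roll_right_bounded`, `RollLimitExceeded`, and `max_steps` are hypothetical and are not part of vrs-python or bioutils; the real normalizer fetches sequence in chunks and also rolls left.

```python
class RollLimitExceeded(Exception):
    """Raised when rolling would exceed the configured step limit (hypothetical)."""


def roll_right_bounded(sequence: str, start: int, end: int, max_steps: int) -> tuple[int, int]:
    """Shift the deleted interval [start, end) rightward, at most max_steps times.

    A deletion can move one base to the right when the base leaving the
    interval (sequence[start]) equals the base entering it (sequence[end]).
    """
    steps = 0
    while end < len(sequence) and sequence[start] == sequence[end]:
        if steps >= max_steps:
            # Give up instead of walking through an arbitrarily long repeat or N run.
            raise RollLimitExceeded(f"rolled {steps} bases without reaching a boundary")
        start += 1
        end += 1
        steps += 1
    return start, end
```

In a long run of Ns the loop condition stays true at every position, so the limit is what ends the roll.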

@toneillbroad just suggested that maybe we simply disallow any normalization that includes ambiguity-coded bases, i.e. anything other than A, C, T, or G. I sort of like that as a general rule of thumb, since it is very difficult to assess the true normalized form of a sequence that includes any of the ambiguity codes. We can make this a vrs-python rule so that our normalizer doesn't go off and never return in these portions of the reference sequences.


larrybabb · 2024-04-11T14:03:20Z

@ahwagner we would like you to weigh in on this so we can put a stopgap solution into vrs-python ASAP, even if we have to revisit a more formal decision later.

ahwagner · 2024-04-11T14:14:29Z

I agree with the solution proposed by @toneillbroad.

theferrit32 · 2024-04-30T17:43:40Z

Based on discussion with @larrybabb, it sounds like this is only a problem for genomic sequences, not transcripts.

(So by N below I really mean anything other than A, C, T, or G.)

cases:

substitution: check if the ref or alt includes an N
insertion: check if the sequence being inserted includes an N
deletion: check if the sequence being deleted includes an N
dup: check if the sequence being duplicated includes an N

others?
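The cases above could be sketched as a single pre-check run before normalization. This is only an illustration: `has_ambiguous_bases` and `should_skip_normalization` are hypothetical names, not vrs-python or bioutils functions, and the caller is assumed to already have the reference and alternate sequences as strings.

```python
import re

# Matches any character other than the four unambiguous bases.
_NON_ACGT = re.compile(r"[^ACGT]")


def has_ambiguous_bases(seq: str) -> bool:
    """Return True if seq contains anything other than A, C, G, or T (case-insensitive)."""
    return bool(_NON_ACGT.search(seq.upper()))


def should_skip_normalization(ref_seq: str, alt_seq: str) -> bool:
    """Hypothetical pre-check covering the cases listed above.

    substitution: ref_seq and alt_seq are both non-empty
    insertion:    ref_seq is empty, alt_seq is the inserted sequence
    deletion:     ref_seq is the deleted sequence, alt_seq is empty
    dup:          ref_seq is the duplicated sequence
    """
    return has_ambiguous_bases(ref_seq) or has_ambiguous_bases(alt_seq)
```

An empty string never matches, so insertions and deletions need no special handling.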

theferrit32 · 2024-04-30T17:46:32Z

With great frustration with multithreading in Python, I have found a way to work around this issue in client code at a higher level that doesn't add that much overhead. Using a background task queue, a return value queue, a background process which runs the tasks and can be interrupted, and a timeout on return values, I can terminate any call into a Translator that takes, say, longer than 1 minute, and add an error message to the output file that indicates that variant was skipped.
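A reduced sketch of the timeout idea, not the actual client code: it runs one call per background process rather than a long-lived worker with a task queue. `call_with_timeout` and `_worker` are hypothetical names, and the sketch assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import queue

# Assumption: POSIX "fork" start method, so the child inherits func without pickling it.
_CTX = mp.get_context("fork")


def _worker(func, args, result_queue):
    """Run func(*args) in the child and report either the value or the error text."""
    try:
        result_queue.put(("ok", func(*args)))
    except Exception as exc:
        result_queue.put(("error", repr(exc)))


def call_with_timeout(func, args=(), timeout_s=60.0):
    """Return ("ok", value), ("error", message), or ("timeout", None)."""
    result_queue = _CTX.Queue()
    proc = _CTX.Process(target=_worker, args=(func, args, result_queue))
    proc.start()
    try:
        # Read the result before joining so a large return value cannot block the child.
        outcome = result_queue.get(timeout=timeout_s)
    except queue.Empty:
        outcome = ("timeout", None)
        proc.terminate()  # interrupt the stuck call
    proc.join()
    return outcome
```

A caller could wrap each translation as `call_with_timeout(translate_one, (variant,), timeout_s=60)`, where `translate_one` stands in for whatever function invokes the Translator, and write a "skipped" message to the output when the status is `"timeout"`.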

It may still be nice to implement something in vrs-python which checks the sequence beforehand, or in bioutils during roll left/right, because this would make this available to other codebases which use vrs-python. Or I could look at adding something like a translator wrapper which has the timeout logic built in.

ahwagner · 2024-05-16T11:33:56Z

@theferrit32 have we proposed implementing this over in Biocommons? I agree that it makes sense for us to implement the solution there.

theferrit32 changed the title ~~Normalization needs to throw exception in exceptional situations~~ Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions) Apr 30, 2024

theferrit32 mentioned this issue Apr 30, 2024

Multiprocessing and timeouts clingen-data-model/clinvar-gk-python#8

Merged
