Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail if the phenopacket has variant in mismatching genome build #84

Closed
ielis opened this issue Oct 19, 2023 · 3 comments · Fixed by #128
Closed

Fail if the phenopacket has variant in mismatching genome build #84

ielis opened this issue Oct 19, 2023 · 3 comments · Fixed by #128
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@ielis
Copy link
Member

ielis commented Oct 19, 2023

We need to handle the bug #83 where ingest fails due to a variant on hg19 genome build while the app uses hg38.

Background

The PhenopacketVariantCoordinateFinder is responsible for turning GenomicInterpretation from Phenopacket Schema into VariantCoordinates. The vcf_record field of the GenomicInterpretation has a genome_assembly subfield that should contain the build of the variant in a usable format.

In case of CNVs that use VRS elements, we can use the sequence_id to test if we're on the right build. The example in phenopacket docs lists an allele with a sequence_id==NC_000010.11. The RefSeq identifier corresponds to chr10 in GRCh38.p13 build. We know this based on the assembly report tables that are in our code base. Upon inspection of both tables, we can only find the corresponding contig in GRCh38.p13 (chr10 in GRCh37.p13 corresponds to NC_000010.10, note the difference in version).

PhenopacketVariantCoordinateFinder, the parsing code, knows about GenomeBuild (field self._build) which has an identifier property. The property has the following values {'GRCh37.p13', 'GRCh38.p13'}. Therefore, we can match the identifier with variant's build to check that the variant uses the right build.

Definition of done

  • When parsing VCF record, PhenopacketVariantCoordinateFinder matches the genome_assembly field to genome build's identifier and raises an exception if the assemblies don't match. We can be permissive in value matching:
GenomeBuild.identifier Phenopacket
GRCh37.p13 grch37, GRCh37, GRCh37.p13, hg19, HG19, ...
GRCh38.p13 grch38, GRCh38, GRCh38.p13, hg38, HG38, ...
  • When parsing a CNV, PhenopacketVariantCoordinateFinder raises an exception if the sequence_id is not among contigs of the build. Here, the match must be exact.
  • The issues raise a ValueError with a helpful error message. Note, that in the PhenopacketVariantCoordinateFinder code we may not have enough context to, e.g. report the id of the offending phenopacket. Therefore, we must catch any errors in an upstream class and re-raise with a better message.
@ielis ielis added the enhancement New feature or request label Oct 19, 2023
@ielis ielis changed the title Warn if the phenopacket has variant in mismatching genome build Fail if the phenopacket has variant in mismatching genome build Oct 19, 2023
@pnrobinson
Copy link
Member

There are phenopackets with structural variants for which we only have the label, and not the contig. This is going to be the case for everything that was identified before GS. I do not think that the CNVs should be required to have the contig.

@ielis
Copy link
Member Author

ielis commented Oct 20, 2023

Can you please include an example of these cases?

We may run into issues with such variants. I am not sure how to perform functional annotation without contig info. I think VEP won't talk to us..

@ielis ielis added this to the Manuscript milestone Nov 16, 2023
@ielis
Copy link
Member Author

ielis commented Dec 19, 2023

Related to #120

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants