You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We need to handle the bug #83 where ingest fails due to a variant on hg19 genome build while the app uses hg38.
Background
The PhenopacketVariantCoordinateFinder is responsible for turning GenomicInterpretation from Phenopacket Schema into VariantCoordinates. The vcf_record field of the GenomicInterpretation has a genome_assembly subfield that should contain the build of the variant in a usable format.
In case of CNVs that use VRS elements, we can use the sequence_id to test if we're on the right build. The example in phenopacket docs lists an allele with a sequence_id==NC_000010.11. The RefSeq identifier corresponds to chr10 in GRCh38.p13 build. We know this based on the assembly report tables that are in our code base. Upon inspection of both tables, we can only find the corresponding contig in GRCh38.p13 (chr10 in GRCh37.p13 corresponds to NC_000010.10, note the difference in version).
PhenopacketVariantCoordinateFinder, the parsing code, knows about GenomeBuild (field self._build) which has an identifier property. The property has the following values {'GRCh37.p13', 'GRCh38.p13'}. Therefore, we can match the identifier with variant's build to check that the variant uses the right build.
Definition of done
When parsing VCF record, PhenopacketVariantCoordinateFinder matches the genome_assembly field to genome build's identifier and raises an exception if the assemblies don't match. We can be permissive in value matching:
GenomeBuild.identifier
Phenopacket
GRCh37.p13
grch37, GRCh37, GRCh37.p13, hg19, HG19, ...
GRCh38.p13
grch38, GRCh38, GRCh38.p13, hg38, HG38, ...
When parsing a CNV, PhenopacketVariantCoordinateFinder raises an exception if the sequence_id is not among contigs of the build. Here, the match must be exact.
The issues raise a ValueError with a helpful error message. Note, that in the PhenopacketVariantCoordinateFinder code we may not have enough context to, e.g. report the id of the offending phenopacket. Therefore, we must catch any errors in an upstream class and re-raise with a better message.
The text was updated successfully, but these errors were encountered:
ielis
changed the title
Warn if the phenopacket has variant in mismatching genome build
Fail if the phenopacket has variant in mismatching genome build
Oct 19, 2023
There are phenopackets with structural variants for which we only have the label, and not the contig. This is going to be the case for everything that was identified before GS. I do not think that the CNVs should be required to have the contig.
We need to handle the bug #83 where ingest fails due to a variant on hg19 genome build while the app uses hg38.
Background
The
PhenopacketVariantCoordinateFinder
is responsible for turningGenomicInterpretation
from Phenopacket Schema intoVariantCoordinates
. Thevcf_record
field of theGenomicInterpretation
has agenome_assembly
subfield that should contain the build of the variant in a usable format.In case of CNVs that use VRS elements, we can use the
sequence_id
to test if we're on the right build. The example in phenopacket docs lists an allele with asequence_id==NC_000010.11
. The RefSeq identifier corresponds tochr10
in GRCh38.p13 build. We know this based on the assembly report tables that are in our code base. Upon inspection of both tables, we can only find the corresponding contig in GRCh38.p13 (chr10
in GRCh37.p13 corresponds toNC_000010.10
, note the difference in version).PhenopacketVariantCoordinateFinder
, the parsing code, knows aboutGenomeBuild
(fieldself._build
) which has anidentifier
property. The property has the following values{'GRCh37.p13', 'GRCh38.p13'}
. Therefore, we can match theidentifier
with variant's build to check that the variant uses the right build.Definition of done
PhenopacketVariantCoordinateFinder
matches thegenome_assembly
field to genome build'sidentifier
and raises an exception if the assemblies don't match. We can be permissive in value matching:GenomeBuild.identifier
GRCh37.p13
grch37
,GRCh37
,GRCh37.p13
,hg19
,HG19
, ...GRCh38.p13
grch38
,GRCh38
,GRCh38.p13
,hg38
,HG38
, ...PhenopacketVariantCoordinateFinder
raises an exception if thesequence_id
is not among contigs of the build. Here, the match must be exact.ValueError
with a helpful error message. Note, that in thePhenopacketVariantCoordinateFinder
code we may not have enough context to, e.g. report the id of the offending phenopacket. Therefore, we must catch any errors in an upstream class and re-raise with a better message.The text was updated successfully, but these errors were encountered: