Import accessions from dbSNP
Since dbSNP and EVA have different architectures, it is necessary to transform dbSNP data in order to be able to ingest it into the EVA.
- Preparing input
- Replacing sequence names
- Checking against the reference sequence
- Building the data model
- Declustering and merging
- Storing the results
The EVA only stores a subset of the information held in dbSNP. To make reading from the database easier and more efficient, a table has been created per species, build and assembly. If a species requires multiple builds to be imported to reach 100% coverage, then multiple tables are created. If a species has been mapped against multiple assemblies, there will also be multiple tables. A table name contains the MD5 of the assembly name it refers to.
Each row in these tables represents one SubSNP (SS), as well as the RefSNP (RS) it is linked to. As a result, an RS supported by multiple submissions will be listed in multiple rows. An SS could also be split across multiple rows, depending on how dbSNP decided to represent it, but this is less frequent.
dbSNP submissions are not required to be in any specific strand orientation (forward or reverse), but EVA stores everything in the forward strand.
The 3 different orientations stored in dbSNP [reference] are:
- Contig orientation compared to the assembly
- RS orientation compared to the contig
- SS orientation compared to the RS
To put the reference allele on the forward strand, only the first orientation is needed (the reference allele is stored as the "contig allele"). For the alternate alleles, all 3 orientations have to be composed to know whether the alleles should be reversed or not.
Some species lack some of the orientation information. When one of these flags is null, it is arbitrarily assumed to be forward; any inconsistencies will be flagged while checking against the reference sequence.
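The composition of the three orientations can be sketched as follows. This is a minimal illustration, not the importer's code: the encoding of orientations as 1 (forward) and -1 (reverse), the function names, and the reading of "reversed" as "reverse-complemented" are all assumptions.

```python
from typing import List, Optional, Tuple

# Assumed encoding (not taken from dbSNP): 1 = forward, -1 = reverse.
FORWARD, REVERSE = 1, -1
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def normalise(orientation: Optional[int]) -> int:
    """Treat a missing (null) orientation flag as forward."""
    return FORWARD if orientation is None else orientation


def reverse_complement(allele: str) -> str:
    """Complement every base, then reverse the string."""
    return allele.translate(_COMPLEMENT)[::-1]


def to_forward(
    reference: str,
    alternates: List[str],
    contig_orientation: Optional[int],
    rs_orientation: Optional[int],
    ss_orientation: Optional[int],
) -> Tuple[str, List[str]]:
    """Return the reference and alternate alleles on the forward strand."""
    contig = normalise(contig_orientation)
    # Multiplying the flags composes them: an even number of reversals
    # cancels out, an odd number leaves the allele on the reverse strand.
    composed = contig * normalise(rs_orientation) * normalise(ss_orientation)
    ref = reverse_complement(reference) if contig == REVERSE else reference
    alts = [reverse_complement(a) if composed == REVERSE else a for a in alternates]
    return ref, alts
```

For instance, with a forward contig, a reverse RS and a missing SS flag, only the alternate alleles are flipped.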
0-base means that the position of the first nucleotide is position number 0. Likewise, in 1-base positions start with 1.
At EVA we display variants in 1-base, for instance in the Variant Browser. On the dbSNP website, variants are shown in 1-base as well, but internally they are stored in 0-base [reference], so the first transformation we have to do is change the positions so that we store them in 1-base.
This means we need to add 1 to every position of each imported variant. This change has been applied in the input database table so the code doesn’t need to care about it.
In dbSNP the variants may have no coordinates, contig coordinates only, or both chromosome and contig coordinates. The contig coordinates are stored as RefSeq contig accessions.
In EVA we may need to store variants that use any coordinate system, and we use the convention of storing in GenBank contig accessions. Therefore, for those variants in dbSNP that at least have contig coordinates, we use assembly reports like this one to translate RefSeq accessions to GenBank, only if the sequence is identical.
However, some contigs are no longer in use, such as NT_455924.1. Those contigs appear neither in assembly reports nor in any updated reference sequence (see the next section "Checking against the reference sequence"). In this case we have to use the chromosome coordinates and replace them with the appropriate GenBank coordinates.
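A minimal sketch of the RefSeq to GenBank translation follows. It assumes the assembly report uses the usual tab-separated NCBI layout, where comment lines start with "#" and a Relationship value of "=" marks identical sequences. The function names and the sample accessions (GENBANK_A.1, REFSEQ_A.1 and so on) are placeholders, not real data.

```python
import csv
import io

# Assumed column order of an NCBI assembly report.
COLUMNS = [
    "Sequence-Name", "Sequence-Role", "Assigned-Molecule",
    "Assigned-Molecule-Location/Type", "GenBank-Accn", "Relationship",
    "RefSeq-Accn", "Assembly-Unit", "Sequence-Length", "UCSC-style-name",
]


def refseq_to_genbank(report_text):
    """Map RefSeq accessions to GenBank ones, only for identical sequences."""
    rows = (line for line in io.StringIO(report_text) if not line.startswith("#"))
    mapping = {}
    for row in csv.DictReader(rows, fieldnames=COLUMNS, delimiter="\t"):
        if row["Relationship"] == "=" and row["GenBank-Accn"] != "na":
            mapping[row["RefSeq-Accn"]] = row["GenBank-Accn"]
    return mapping


# Placeholder report: one identical pair and one non-identical pair.
SAMPLE_REPORT = "\n".join([
    "# " + "\t".join(COLUMNS),
    "\t".join(["1", "assembled-molecule", "1", "Chromosome",
               "GENBANK_A.1", "=", "REFSEQ_A.1", "Primary Assembly", "1000", "chr1"]),
    "\t".join(["unplaced_1", "unplaced-scaffold", "na", "na",
               "GENBANK_B.1", "<>", "REFSEQ_B.1", "Primary Assembly", "500", "na"]),
])

mapping = refseq_to_genbank(SAMPLE_REPORT)
# A missing key means the contig cannot be translated directly, so the
# caller would fall back to chromosome coordinates.
translated = mapping.get("REFSEQ_A.1")
```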
To make sure the imported variants actually match the reference sequence they were supposedly mapped against, we perform an assembly check on every variant.
This process takes the chromosome (or contig), position and reference allele from a variant and tries to match it against the reference sequence. If the sequences don’t match, the variant is still imported, but flagged as an "assembly mismatch".
An assembly report like this one makes it possible to query the FASTA file containing the reference sequence using multiple synonyms, such as the sequence name, GenBank and RefSeq accessions.
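The assembly check can be sketched like this, assuming the reference sequences are already loaded in memory and that a synonym table derived from the assembly report maps every known name to the name used in the FASTA file. The function name, the dictionaries and the placeholder accessions are hypothetical.

```python
def matches_reference(sequences, synonyms, contig, start, reference_allele):
    """Return True if the reference allele is found at the 1-based position."""
    name = synonyms.get(contig, contig)
    sequence = sequences.get(name)
    if sequence is None or start < 1:
        # Unknown contig or impossible position: the check cannot succeed.
        return False
    begin = start - 1  # 1-based inclusive position to 0-based string index
    observed = sequence[begin:begin + len(reference_allele)]
    return observed.upper() == reference_allele.upper()


# Placeholder data: one short sequence reachable through two synonyms.
sequences = {"chrA": "ATGTTCTTCTTCTC"}
synonyms = {"GENBANK_A.1": "chrA", "REFSEQ_A.1": "chrA"}

assembly_match = matches_reference(sequences, synonyms, "REFSEQ_A.1", 4, "TTC")
assembly_mismatch = matches_reference(sequences, synonyms, "chrA", 4, "GGG")
```

A variant for which the function returns False would still be imported, with its "assembly mismatch" flag set.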
Different conventions are followed in dbSNP and EVA, so there are general points that have to be taken into account when we transform the data into the EVA model.
dbSNP and EVA use different variant classifications: EVA uses the Sequence Ontology, whereas dbSNP has its own classification. However, they are similar enough that one can be mapped to the other without losing any information.
In this page of the EVA documentation there is a table with the mapping of variant classes.
In dbSNP, insertions are handled differently from the other variant types. For most types (such as deletions, SNPs, MNVs, etc.) the coordinate intervals are inclusive, but insertions are stored with exclusive intervals.
This means adding an extra 1 to the position of imported insertions in order to make them always inclusive at EVA.
dbSNP does not split multiallelic variants and simply takes the submitted raw data, but every allele must be of the same type. This means that if a submission sends a SNP A > T,C, that SNP will get one SS ID only. But if the SNP was submitted as 2 records A > T and A > C, that SNP would receive 2 SS IDs.
In EVA we narrow the concept of a variant to a single change from one reference allele to one alternate allele. This means that those multiallelic variants will be split during this import, although they will keep sharing the same SS ID.
Internally, the alleles are stored this way: one field for the reference allele and another field with a list containing all the alternate alleles and the reference allele together. Order is not relevant [reference].
As stated above, for multiallelic variants we split the alternate alleles into separate variants. The problem is that sometimes a missing orientation or some other error causes the reference allele to be absent from the alleles list. This makes it impossible to know which of the alleles in the list should not be taken as an alternate allele. When this happens, a flag is set to mark an "alleles mismatch".
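The split can be sketched as follows. The function name and the dictionary layout are hypothetical, and the alleles list is assumed to be a "/"-separated string such as "A/C". The source does not say which alleles are kept on a mismatch, so keeping all of them as alternates is an assumption of this sketch.

```python
def split_submitted_variant(ss_id, reference_allele, alleles_field):
    """Create one record per alternate allele, all sharing the same SS ID."""
    alleles = alleles_field.split("/")
    # If the reference allele is missing from the list we cannot tell
    # which allele is not an alternate, so the records are flagged.
    alleles_match = reference_allele in alleles
    # Assumption: on a mismatch nothing is filtered out, so every listed
    # allele becomes an alternate.
    alternates = [allele for allele in alleles if allele != reference_allele]
    return [
        {
            "accession": ss_id,
            "referenceAllele": reference_allele,
            "alternateAllele": alternate,
            "allelesMatch": alleles_match,
        }
        for alternate in alternates
    ]


consistent = split_submitted_variant(1, "A", "A/C/T")
inconsistent = split_submitted_variant(2, "G", "A/C")
```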
In dbSNP, a normalization process removes the context nucleotide in INDELs, giving priority to removing the leftmost nucleotide.
Until the year 2017 this was done even if the context nucleotide did not match between the reference and alternate alleles, meaning that it was actually a complex INDEL, but this behaviour was changed in 2017 to remove the context nucleotide only if it matches.
In contrast, in EVA there’s a priority to remove the rightmost bases. We prefer this because it behaves better if the variants were called doing left-alignment. This difference in priorities can yield incorrect imported data with ambiguous variants, so an adjustment in the variant positions is done to avoid most errors.
Unfortunately, this renormalization process can’t possibly fix all the ambiguities without a complete remapping, especially if the same logical variant is present twice, aligned at both ends of a repetitive section. An example of this follows:
rs385284696 and rs714068841 are the same logical variant. Starting at position 2568353, there’s a sequence of ATGTTCTTCTTCTC. After applying any of rs385284696 (deletion of "TTC" in position 2568356) or rs714068841 (deletion of "TCT" in position 2568363), the sequence ATGTTCTTCTC is left:
ATGTTCTTCTTCTC: reference sequence
ATG---TTCTTCTC: rs385284696
ATGTTCTTCT---C: rs714068841
ATGTTCTTCTC: resulting sequence in both cases
The decision is to leave these duplicates undetected until a more thorough, complete remapping is done, both here in the accessioning service and in the main EVA warehouse.
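The equivalence of the two deletions in the example can be checked with a few lines of code. The helper below is only an illustration; its name and signature are not part of the import pipeline.

```python
REGION_START = 2568353      # 1-based position of the first base of the region
REGION = "ATGTTCTTCTTCTC"   # reference sequence from the example


def apply_deletion(region, region_start, position, deleted):
    """Remove `deleted` from `region` at the given 1-based position."""
    offset = position - region_start
    if region[offset:offset + len(deleted)] != deleted:
        raise ValueError("deleted bases do not match the reference")
    return region[:offset] + region[offset + len(deleted):]


after_rs385284696 = apply_deletion(REGION, REGION_START, 2568356, "TTC")
after_rs714068841 = apply_deletion(REGION, REGION_START, 2568363, "TCT")
```

Both calls yield ATGTTCTTCTC, which is why the two RS IDs cannot be told apart from the resulting sequence alone.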
From each row in the input database, and in accordance with the above considerations, one clustered variant and a list of submitted variants are generated.
In the EVA website the main attributes of Submitted Variants and Clustered Variants are listed: https://www.ebi.ac.uk/eva/?Help#variant-accession-administred-by-eva
However, some extra fields are stored internally. This is the meaning of every field:
- accession: integer (64 bits). Numeric identifier for a submitted variant. This field stores the ss IDs.
- clusteredVariantAccession: integer (64 bits). If present, states the accession of the Clustered Variant in which this Submitted Variant is clustered. In other words, this is the rs ID where this ss ID is clustered, if any.
- taxonomyAccession: integer. Non-authoritative taxonomy database: https://www.ncbi.nlm.nih.gov/taxonomy.
- referenceSequenceAccession: string. This will usually be the reference assembly, but can be a reference sequence or a reference transcriptome.
- projectAccession: string. EVA study ID if available (starts with "PRJ"). Otherwise dbSNP batch (batch handle and batch ID joined with an underscore ‘_’).
- contig: string. Chromosome name if available. Otherwise contig accession.
- start: integer (64 bits). 1-base position of the variant.
- referenceAllele: string. Reference allele. If assemblyMatch is true, it matches the reference sequence.
- alternateAllele: string. Single alternate allele related to this Submitted Variant Accession. Other alternate alleles could appear in the same position, but will be linked to a different accession.
- supportedByEvidence: boolean. Will be ‘true’ if this submitted variant is supported by genotypes or frequencies.
- assemblyMatch: boolean. Will be ‘true’ if there is no doubt the reference allele matches the reference sequence. ‘false’ might suggest a mismatch with the reference sequence, or that it couldn’t be found.
- allelesMatch: boolean. Will be ‘true’ if there is no doubt the alleles are correct. ‘false’ if the strand orientation might be wrong or if the allele might be an artifact and thus the variant is not reliable.
- validated: boolean. Will be ‘true’ if the variant was curated manually, and not only detected by computational methods. See https://www.ncbi.nlm.nih.gov/books/NBK21088/table/ch5.ch5_t4/?report=objectonly .
- date: string. Date when the ss was created.
- version: integer. Internal field to track updates, merges or deprecations.
Periodically, all submitted variants are compared. For all those that share taxonomy, reference sequence, contig or chromosome, start position and variant class, a clustered variant is created to relate those submitted variants.
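The grouping step can be sketched as below. The record layout follows the submitted variant fields listed above, but the `classify` function is a deliberately simplified assumption about how a variant class could be deduced from the alleles. It is not the classification used by dbSNP or EVA, and all values in the example are placeholders.

```python
from collections import defaultdict


def classify(reference_allele, alternate_allele):
    """Simplified, hypothetical deduction of the variant class."""
    if reference_allele == alternate_allele:
        return "NO_SEQUENCE_ALTERATION"
    if not alternate_allele:
        return "DEL"
    if not reference_allele:
        return "INS"
    if len(reference_allele) == len(alternate_allele):
        return "SNV" if len(reference_allele) == 1 else "MNV"
    return "INDEL"


def cluster(submitted_variants):
    """Group SS accessions by the properties a clustered variant shares."""
    clusters = defaultdict(list)
    for variant in submitted_variants:
        key = (
            variant["taxonomyAccession"],
            variant["referenceSequenceAccession"],
            variant["contig"],
            variant["start"],
            classify(variant["referenceAllele"], variant["alternateAllele"]),
        )
        clusters[key].append(variant["accession"])
    return dict(clusters)


# Placeholder records that differ only in their alleles.
base = {"taxonomyAccession": 1000, "referenceSequenceAccession": "ASSEMBLY_X",
        "contig": "1", "start": 100}
variants = [
    dict(base, accession=1, referenceAllele="A", alternateAllele="T"),
    dict(base, accession=2, referenceAllele="A", alternateAllele="C"),
    dict(base, accession=3, referenceAllele="A", alternateAllele=""),
]
clusters = cluster(variants)
```

Here the two SNVs end up in one group and the deletion in another, because the variant class is part of the key.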
Below there is a list of the fields of a clustered variant:
- accession: integer (64 bits). Numeric identifier for a clustered variant. This field stores the rs IDs.
- taxonomyAccession: integer. Non-authoritative taxonomy database: https://www.ncbi.nlm.nih.gov/taxonomy.
- assemblyAccession: string. The clustered variant is mapped against this reference assembly.
- contig: string. Chromosome name if available. Otherwise contig accession.
- start: integer (64 bits). 1-base position of the variant.
- type: VariantType. One of SNV, DEL, INS, INDEL, TANDEM_REPEAT, SEQUENCE_ALTERATION, NO_SEQUENCE_ALTERATION, MNV, SV, CNV.
- validated: boolean. Will be ‘true’ if the variant was curated manually, and not only detected by computational methods. See https://www.ncbi.nlm.nih.gov/books/NBK21088/table/ch5.ch5_t4/?report=objectonly .
- date: string. Date when the ss was created.
If a submitted variant was flagged with "alleles mismatch" or "type mismatch" (the type stated by the RS [reference] does not match the type deduced from the alleles in the SS), it is declustered from its clustered variant. This means that the relationship between an RS and an SS is no longer valid. This operation is recorded in a separate collection so that the history is not lost.
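Declustering could look roughly like the sketch below. The event layout and its field names ("eventType", "clusteredVariantAccession" inside the event, and so on) are hypothetical; only the submitted variant fields come from the list above.

```python
def decluster(submitted_variant, type_matches):
    """Return the (possibly updated) variant and a history event, if any."""
    rs_id = submitted_variant.get("clusteredVariantAccession")
    if rs_id is None:
        return submitted_variant, None  # nothing to decluster
    if submitted_variant["allelesMatch"] and type_matches:
        return submitted_variant, None  # the RS/SS link is still trusted
    # Hypothetical event layout; it keeps enough data to trace the change.
    event = {
        "eventType": "DECLUSTERED",
        "accession": submitted_variant["accession"],
        "clusteredVariantAccession": rs_id,
    }
    # Copy the record so the caller's original dictionary is left untouched.
    updated = dict(submitted_variant, clusteredVariantAccession=None)
    return updated, event


flagged = {"accession": 10, "clusteredVariantAccession": 99, "allelesMatch": False}
updated, event = decluster(flagged, type_matches=True)
```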
It can happen that 2 or more submitted variants whose information initially seemed different end up having exactly the same properties after all the transformations above. In this case, only the first variant is kept. The subsequent ones are moved to the history collection as "merged into an existent variant" with enough information to reconstruct what happened.
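A minimal sketch of this merge follows. The set of fields compared (`IDENTITY_FIELDS`) and the shape of the history event are assumptions made for illustration; the text above only states that the variants have "exactly the same properties".

```python
# Assumed set of properties that make two submitted variants identical.
IDENTITY_FIELDS = (
    "taxonomyAccession", "referenceSequenceAccession", "projectAccession",
    "contig", "start", "referenceAllele", "alternateAllele",
)


def merge_duplicates(submitted_variants):
    """Keep the first variant of each identical group, log the others."""
    kept = {}
    history = []
    for variant in submitted_variants:
        key = tuple(variant[field] for field in IDENTITY_FIELDS)
        if key in kept:
            # Hypothetical event layout: the full record is stored so the
            # merge can be reconstructed later.
            history.append({
                "eventType": "MERGED",
                "accession": variant["accession"],
                "mergedInto": kept[key]["accession"],
                "data": variant,
            })
        else:
            kept[key] = variant
    return list(kept.values()), history


# Placeholder records: the first two are identical except for their SS ID.
common = {"taxonomyAccession": 1000, "referenceSequenceAccession": "ASSEMBLY_X",
          "projectAccession": "HANDLE_1", "contig": "1", "start": 100,
          "referenceAllele": "A"}
records = [
    dict(common, accession=1, alternateAllele="T"),
    dict(common, accession=2, alternateAllele="T"),
    dict(common, accession=3, alternateAllele="C"),
]
kept_variants, history_events = merge_duplicates(records)
```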
The objects extracted are clustered variants, submitted variants, and history events, which are stored in a MongoDB database, each type in its own collection.
Alleles order in the alleles list (e.g. "A/C") is alphabetical. In other words, the first allele is NOT necessarily the reference allele: [reference].
How many orientations should be taken into account while reading dbSNP data: [reference].
dbSNP stores in 0-base, but the website displays variants in 1-base: [reference].
"snp class" definition: [reference].