Import accessions from dbSNP

dbSNP and EVA represent data differently, so it is necessary to transform dbSNP data in order to be able to ingest it into the EVA.

Main import stages

stages image

Preparing input
Replacing sequence names
Checking against the reference sequence
Building the data model
Declustering
Storing the results

Preparing input

The EVA only stores a subset of the information held in dbSNP. To make reading from the database easier and more efficient, a table has been created per species, build and assembly. If a species requires multiple builds to be imported in order to have a 100% coverage, then multiple tables are created. If a species has been mapped against multiple assemblies, there will also be multiple tables. A table name contains the MD5 of the assembly name it refers to.

Each row in these tables represents one SubSNP (SS), as well as the RefSNP (RS) it is linked to. As a result, an RS supported by multiple submissions will be listed in multiple rows. An SS could also be split in multiple rows, depending on how dbSNP decided to represent it, but this is less frequent.

Strand orientations

dbSNP submissions are not required to be in any specific strand orientation (forward or reverse), but EVA stores everything in the forward strand.

The 3 different orientations stored in dbSNP [reference] are:

Contig orientation compared to the assembly
RS orientation compared to the contig
SS orientation compared to the RS

To put the reference allele in forward, only the first orientation is needed (the reference allele is stored as "contig allele"), but for the alternate alleles the 3 orientations have to be combined to know if the alternate alleles should be reversed or not.

Missing orientations

Some species lack some of the orientation information. When one of these flags is null, it is arbitrarily assumed to be forward; any inconsistencies will be flagged while checking against the reference sequence.

Coordinates system

dbSNP data is stored in 0-base, which means that the position of the first nucleotide is position number 0. The same data is displayed in in 1-base, which means the positions start at 1.

At EVA we display variants in 1-base, for instance, in the Variant Browser. In the dbSNP website, variants are shown in 1-base as well, but internally they are stored in 0-base [reference], so the first transformation we have to do is change the positions so that we store them in 1-base.

This means we need to add 1 to every position of each imported variant. This change has been applied in the input database table so the code doesn’t need to care about it.

Insertions’ special coordinate system

In dbSNP, insertions are handled in a different way from the rest of variant types. For most types (such as deletions, SNPs, MNVs, etc) the coordinate intervals are inclusive, but insertions are stored with exclusive intervals.

This means adding an extra 1 to the position of imported insertions in order to make them always inclusive at EVA.

Replacing sequence names

In dbSNP variants may have no coordinates, contig coordinates only, or both chromosome and contig coordinates. Contig coordinates are stored in relation to RefSeq contig accessions.

In EVA we may need to store variants that use any coordinate system, and we favor GenBank contig accessions. Therefore, for those variants in dbSNP that at least have contig coordinates, we use assembly reports like this one to translate RefSeq accessions to GenBank, only if the sequence is identical.

However, some contigs are no longer in use, such as NT_455924.1. Those contigs won’t appear in assembly reports nor in any updated reference sequence (see the next section "Checking against the reference sequence"). In this case we have to use the chromosome coordinates, linking them with the appropriate GenBank contig accession.

Checking against the reference sequence

To make sure the imported variants actually match the reference sequence they were supposedly mapped against, we perform an assembly check on every variant.

This process takes the chromosome (or contig), position and reference allele from a variant and tries to match it against the reference sequence. If the sequences don’t match, the variant is still imported, but flagged as an "assembly mismatch".

An assembly report like this one allows to query the FASTA file containing the reference sequence using multiple synonyms, such as the sequence name, GenBank and RefSeq accessions.

Building the data model

For each row from the input database, the following are generated:

One clustered variant (RS)
A list of submitted variants (SS) providing evidence for the clustered variant

Different conventions are followed in dbSNP and EVA, which have to be taken into account when we transform the data into the EVA model.

Clustered Variant (RefSnp or RS)

Periodically, all submitted variants are compared, and for all those that share reference sequence, contig or chromosome, start position and variant class; a clustered variant is created to relate those submitted variants.

Below there is a list of the fields of a clustered variant:

Accession

Integer (64 bits). Numeric identifier for a clustered variant. This field stores the rs IDs.

Taxonomy accession

Integer. Non-authoritative taxonomy database: https://www.ncbi.nlm.nih.gov/taxonomy.

Assembly accession

String. The clustered variant is mapped against this reference assembly.

Contig

String. GenBank contig accession if available, otherwise RefSeq contig accession.

Start

Integer (64 bits). 1-base position of the variant.

Type

VariantType. One of SNV, DEL, INS, INDEL, TANDEM_REPEAT, SEQUENCE_ALTERATION, NO_SEQUENCE_ALTERATION, MNV, SV, CNV.

dbSNP and EVA use different variant classifications: EVA uses the Sequence Ontology, whereas dbSNP has its own classification. However, they are similar enough so that we can do a mapping and not lose any information.

This page of the EVA documentation contains a table with the mapping between variant classes.

The mapping process also takes into account the alleles to make sure we are mapping to the right variant class.

Validated

Boolean. Will be ‘true’ if the variant was curated manually, and not only detected by computational methods. see https://www.ncbi.nlm.nih.gov/books/NBK21088/table/ch5.ch5_t4/?report=objectonly .

Created Date

String. Date when the RS was created.

Mapping weight

The number of loci that the RS maps to (only populated when more than 1). See Database Tables definition of mapping weight for details on what the mapping weight values mean.

Submitted Variant (SubSnp or SS)

Looking forward to EVA issuing SS IDs for submitted variants, only one SS ID will be assigned to all the variants that share the same project, reference sequence, contig or chromosome, start position, reference allele and alternate allele.

Accession

Integer (64 bits). Numeric identifier for a submitted variant. This field stores the ss IDs.

Clustered variant accession

Integer (64 bits). If present, states the accession of the Clustered Variant in which this Submitted Variant is clustered. In other words, this is the rs ID where this ss ID is clustered, if any.

Taxonomy accession

Integer. Non-authoritative taxonomy database: https://www.ncbi.nlm.nih.gov/taxonomy.

Reference sequence accession

String. Any reference sequence, such as a genome assembly, gene sequence or transcriptome.

Project accession

String. EVA study ID if available (starts with "PRJ"). Otherwise dbSNP batch (batch handle and batch ID joined with an underscore ‘_’).

Contig

String. Contig accession. Generally GenBank accessions, but may be RefSeq for special cases.

Start

Integer (64 bits). 1-base position of the variant.

Reference allele

String. Reference allele. If assemblyMatch is true, it matches the reference sequence.

Alternate allele

String. Single alternate allele related to this Submitted Variant Accession. Other alternate alleles could appear in the same position, linked to the same or to a different Submitted Variant Accession (read the multiallelic section for more information).

Supported by evidence

Boolean. Will be 'true' if this submitted variant is supported by genotypes or frequencies.

Assembly match

Boolean. Will be ‘true’ if there is no doubt the reference allele matches the reference sequence. ‘false’ might suggest a mismatch with the reference sequence, or that it couldn’t be found.

Alleles match

Boolean. Will be ‘true’ if there is no doubt the alleles are correct. ‘false’ if the strand orientation might be wrong or if the allele might be an artifact and thus the variant is not reliable.

Internally, dbSNP store one field for the reference allele and other field with a list containing all the alternate alleles and reference allele together. Order is not relevant [reference].

As stated above, for multiallelic variants we split the alternate alleles into separate variants. Sometimes it is not possible to differentiate which allele is reference or alternate. The most common examples would be:

The reference allele is not present in the alleles list (example: the alleles in ss275060734 "A/T" don't contain the reference allele "G", no matter what their orientation is).
A missing orientation prevents from being 100% sure about the orientation of the alleles listed compared to the reference (example: orientations shown as "NA"). The flag "assembly match" can be helpful in understanding these cases.
The orientation is wrong (example: ss266911375 has reverse "ss to rs" orientation and also reverse "snp to contig" orientation, which overall should result in forward orientation, and thus the contig allele "A" should be explicitly present in the alleles list "C/T". Also, the orientation flags and alleles in the other subsnps (ss266602754 and ss266911374) are incompatible with those of ss266911375, no matter what orientation convention is used).

These situations are flagged as an "alleles mismatch".

Validated

Boolean. Will be ‘true’ if the variant was curated manually, and not only detected by computational methods. see https://www.ncbi.nlm.nih.gov/books/NBK21088/table/ch5.ch5_t4/?report=objectonly .

Created date

String. Date when the ss was created, or if it is not available, the date when the rs was created, or if it is not available, the date when the variant was imported to EVA.

Version

Integer. Internal field to track updates, merges or deprecations.

Mapping weight

The number of loci that the RS, that clusters this SS, maps to (only populated when more than 1). See Database Tables definition of mapping weight for details on what the mapping weight values mean.

Other model considerations

Multiallelic

dbSNP do not split multiallelic variants and simply take the submitted raw data, but every allele must be the same type. This means that if a submission sends a SNP A > T,C, that SNP will get one SS ID only. But if the SNP was submitted as 2 records A > T and A > C, that SNP would receive 2 SS IDs.

In EVA we narrow the concept of a variant to a single change from one reference allele to one alternate allele. This means that those multiallelic variants will be split during this import, although they will keep sharing the same SS ID.

Renormalization

During their normalization process, dbSNP remove the context nucleotide before a DIV. Until the year 2017 this was done even if the context nucleotide did not match between the reference and alternate alleles, meaning that it was actually a complex INDEL. This transformed a variant like AC > GGT into C > GT. This behaviour was changed in 2017 to remove the context nucleotide only when it matches.

In contrast, the EVA first remove the rightmost bases, because it behaves better if the variants were left-aligned during the calling, and it preserves the position whenever possible. In the case of ambiguous variants such as T > TCT, the coordinates are adjusted accordingly.

Limits of renormalization

Unfortunately, this renormalization process can’t possibly fix all the ambiguities without a complete remapping, specially if the same logical variant is present twice, aligned at both ends of a repetitive section. An example of this follows:

rs385284696 and rs714068841 are the same logical variant. Starting at position 2568353, there is a sequence of ATGTTCTTCTTCTC. After applying any of rs385284696 (deletion of "TTC" in position 2568356) or rs714068841 (deletion of "TCT" in position 2568363), the sequence ATGTTCTTCTC is left:

ATGTTCTTCTTCTC: reference sequence
ATG---TTCTTCTC: rs385284696
ATGTTCTTCT---C: rs714068841
ATGTTCTTCTC: resulting sequence in both cases

We have decided not to report these duplicates until a more thorough remapping is implemented, both here in the accessioning service and the main EVA warehouse.

Declustering

Declustering is the process by which the relationship between a submitted variant and a clustered variant is made no longer valid. This will happen when one of the following flags is set:

Type mismatch: The type stated by the RS [reference] does not match the type deduced from the alleles in the SS
Alleles mismatch: The reference allele is not among the alleles associated with an RS, or some of the orientations are missing

Declustering operations are recorded in a separate collection so that the history is not lost.

Storing the results

Clustered variants, submitted variants, and history events are finally stored in a MongoDB database, each type in its own collection. Due to different uniqueness constraints, dbSNP and EVA variants are stored in separate collections, but this is transparent to the webservices users.

Declustered RS

Every time an SS is declustered from an RS, the latter is written to a separate collection too. This allows to compare these declustered RS against the main clustered variants collection; if one is in the declustered collection only, it means that to all effects it must be deprecated because no SS is linked to it any more. This operation is out of scope of the pipeline here described.

New merging operations

2 or more variants that initially seemed different could end up having exactly the same properties after all the aforementioned transformations. In this case, only the first one is preserved. The subsequent ones are moved to the history collection as "merged into another variant", with enough information to reconstruct what happened.

This merging process applies both to submitted and clustered variants.

References

Alleles order in the alleles list (e.g. "A/C") is alphabetical. In other words, the first allele is NOT necessarily the reference allele: [reference].

How many orientations should be taken into account while reading dbSNP data: [reference].

dbSNP stores in 0-base, but the website displays variants in 1-base: [reference].

"snp class" definition: [reference].

Import accessions from dbSNP

Main import stages

Preparing input

Strand orientations

Missing orientations

Coordinates system

Insertions’ special coordinate system

Replacing sequence names

Checking against the reference sequence

Building the data model

Clustered Variant (RefSnp or RS)

Accession

Taxonomy accession

Assembly accession

Contig

Start

Type

Validated

Created Date

Mapping weight

Submitted Variant (SubSnp or SS)

Accession

Clustered variant accession

Taxonomy accession

Reference sequence accession

Project accession

Contig

Start

Reference allele

Alternate allele

Supported by evidence

Assembly match

Alleles match

Validated

Created date

Version

Mapping weight

Other model considerations

Multiallelic

Renormalization

Limits of renormalization

Declustering

Storing the results

Declustered RS

New merging operations

References

Clone this wiki locally