Svart is a small library for representing genomic variants and regions. It attempts to solve several common issues:
- Coordinate system off-by-one errors
- Representation of VCF small, structural and breakend variants with a consistent API
- Different variant trimming strategies
The library provides a consistent API for creating, manipulating and comparing variation of different types by providing default implementations and extensible base classes and interfaces along with immutable default implementations which developers can utilise to gain maximum utility from a minimum amount of code without having to address issues in bioinformatics which are a common source of duplicated code and frequently errors.
The code is completely free of external dependencies.
This is intended to be used as a standard library for various Monarch and Monarch-related projects involving genomic variation such as Jannovar, Exomiser, Squirls, PBGA.
These projects are inter-dependent to some extent and all require some level of ability to represent and manipulate genomic variation.
Given the inter-dependency there is substantial copying/re-implementation in the code-bases of these projects of various core concepts surrounding genomic variation. This library aims to provide a common set of interfaces and implementations which can be used to fulfill their collective use-cases and so reduce code duplication, bugs, and GC pressure due to object conversion between common types.
There is a (mostly) compositional hierarchy of data types:
Contig
-has_length-> int
Contig
-has_identifier-> int
Contig
-has_name-> String
GenomicAssembly
-has_some-> Contig
ONE_BASED
or ZERO_BASED
Coordinates
-has_a-> CoordinateSystem
Coordinates
-has_start-> int
Coordinates
-has_end-> int
Region
-has_some-> Coordinates
POSITIVE
or NEGATIVE
GenomicRegion
-is_a-> Region
GenomicRegion
-has_a-> Contig
GenomicRegion
-has_a-> Strand
GenomicVariant
-is_a-> GenomicRegion
GenomicVariant
-has_a-> VariantType
GenomicBreakendVariant
-is_a-> GenomicVariant
GenomicBreakendVariant
-has_left-> GenomicBreakend
GenomicBreakendVariant
-has_right-> GenomicBreakend
GenomicBreakend
-is_a-> GenomicRegion
With this model, it is possible to express variation, using zero or one-based coordinates on a strand of a chromosome, for a
given genome assembly in any coordinate system and easily manipulate and compare them without having to worry about what
strand or which coordinate system another GenomicRegion
uses. Svart will do all this for you eliminating off-by-one
errors and providing an extensible model to plug into your code, so you can focus on the task at hand. It will enforce
Modelling a gene as a region on a chromosome. Here we're using the gene FGFR2.
Chromosome 10 (CM000672.1): 123,237,848-123,353,481 reverse strand. (1-based, positive strand coordinates for the FGFR2 gene on GRCh37)
Chromosome 10 (CM000672.2): 121,478,332-121,598,458 reverse strand. (1-based, positive strand coordinates for the FGFR2 gene on GRCh38)
The code below shows how to create a GenomicAssembly
, create a GenomicRegion
to represent the region of chromosome
10 to which the longest transcript expressing the FGFR2 can be aligned. We then create a Variant
object representing a
SNV and check to see if this is located within the gene. FGFR2 is located on the negative strand, whereas variants from
VCF are reported on the positive strand, so at first sight it might not appear that the variant is contained within the
gene.
class SnpAndGeneTest {
public void checkGeneContainsVariant() {
// Load the Human GRCh37.13 assembly from a NCBI assembly report
GenomicAssembly b37 = GenomicAssembly.readAssembly(Path.of("src/test/resources/GCF_000001405.25_GRCh37.p13_assembly_report.txt"));
// alternatively you can use svart to download one using the GenBank or RefSeq accession:
// GenomicAssembly grch37p13 = GenomicAssemblies.downloadAssembly("GCF_000001405.25");
// or use a default assembly:
// GenomicAssembly grch37p13 = GenomicAssemblies.GRCh37p13();
// FGFR2 gene is located on chromosome 10 (CM000672.1): 123,237,848-123_357_972 reverse strand. (1-based, positive strand coordinates)
Contig chr10b37 = b37.contigByName("10");
GenomicRegion fgfr2Gene = GenomicRegion.of(chr10b37, Strand.POSITIVE, CoordinateSystem.oneBased(), 123_237_848, 123_357_972);
// 10 123256215 . T G - a pathogenic missense variant (GRCh37 VCF coordinates - 1-based, positive strand)
GenomicVariant snv = GenomicVariant.of(chr10b37, "", Strand.POSITIVE, CoordinateSystem.oneBased(), 123_256_215, "T", "G");
// Because svart knows about coordinate systems and strands it is possible to...
// keep the gene on the positive strand:
// GenomicRegion{contig=10, strand=+, coordinateSystem=ONE_BASED, startPosition=123237848, endPosition=123357972}
assertThat(fgfr2Gene.contains(snv), equalTo(true));
// or use it on the negative strand:
// GenomicRegion{contig=10, strand=-, coordinateSystem=ONE_BASED, startPosition=12176776, endPosition=12296900}
assertThat(fgfr2Gene.toNegativeStrand().contains(snv), equalTo(true));
}
}
In the above example the default implementations provided by the library were used. These are immutable classes which
can be used compositionally. Alternatively, for users requiring extra functionality on top of the default implementations
there are several Base
classes which can be extended - BaseGenomicRegion
, BaseVariant
and BaseBreakendVariant
.
Svart models variants in a similar manner to VCF, however it attempts to provide a unified and consistent model using
the Variant
interface. In VCF precise variants of known sequence (sequence variants) described using CHR, ID, POS,
REF, ALT and also the symbolic variants where the ALT field denotes the type in angle brackets e.g. <INS>
for an
insertion along with a change length and end (the SVLEN and END fields in the INFO column).
Internally Svart treats both the sequence and symbolic types in the same way modelling them as GenomicRegion
so
that all variants have a start and end (like BED), but also have a VariantType
a refLength
and a changeLength
derived
from the REF/ALT sequences and END if available. The Variant
interface contains static factory methods to help users
easily create variants, requiring them to consider important, but often implicit values for the Strand
and CoordinateSystem
with which a variant is described.
Sequence and symbolic variants are always intra-chromosomal changes (i.e. on the same chromosome/contig) whereas
breakend variants are somewhat anomalous as they describe the way one or two different chromosomes are broken and
re-arranged. These are handled by the BreakendVariant
which is composed of a left and right Breakend
derived from a
single VCF record (line). The full re-arrangement requires assembling the Breakend
s together by matching their id
with another Breakend.mateId
. The VCF representation of breakends is quite esoteric and relies on a pattern in the
ALT
column of a record to indicate the strand, chromosome, location and orientation of a mate end relative to the REF
position of the record e.g. C[ctg2:5[
indicates ctg2
position 5
(fully-closed coordinates) on the positive strand
is to the right of the reference allele C
also on the positive strand. Svart provides a VcfBreakendResolver
to handle the parsing and conversion of
these into BreakendVariant
instances.
While svart is heavily influenced by VCF it does not provide a parser for VCF as this is better handled by the HTSJDK,
however much of the terminology and data is very close, for example the ConfidenceInterval
class is used in conjunction
with the Position
class to encapsulate the POS/CIPOS and END/CIEND fields from VCF. Svart does, however provide a
VcfConverter
class which provides methods to easily convert VCF small, symbolic and breakend records into svart Variant
instances from, for example an HTSJDK VariantContext
. The VcfConverter
needs to be provided with the correct
GenomicAssembly
for the VCF file and a VariantTrimmer
to allow the user to specify how they wish any multi-allelic
sites to be trimmed. For example, this is how the VCF fields for a SNV called against GRCh37 would be converted:
class ConvertVcfTest {
@Test
public void convert() {
VcfConverter vcfConverter = new VcfConverter(b37, VariantTrimmer.leftShiftingTrimmer(VariantTrimmer.retainingCommonBase()));
// non-symbolic alleles
// CHR POS ID REF ALT
// chr1 12345 rs123456 C T 6 PASS .
Variant snv = vcfConverter.convert(vcfConverter.parseContig("chr1"), "rs123456", 12345, "C", "T");
assertThat(snv, equalTo(Variant.of(chr1, "rs123456", Strand.POSITIVE, CoordinateSystem.ONE_BASED, 12345, "C", "T")));
// symbolic alleles
// chr1 12345 . C <INS> 6 PASS SVTYPE=INS;END=12345;SVLEN=200
Variant ins = vcfConverter.convertSymbolic(vcfConverter.parseContig("chr1"), "", 12345, 12345, "C", "<INS>", 200);
assertThat(ins, equalTo(Variant.of(chr1, "rs123456", Strand.POSITIVE, CoordinateSystem.ONE_BASED, 12345, 12345, "C", "<INS>", 200)));
// breakend variants
// 1 12345 bnd_U C C[2:321682[ 6 PASS SVTYPE=BND;MATEID=bnd_V;EVENT=tra2
Variant bnd = vcfConverter.convertBreakend(vcfConverter.parseContig("chr1"), "bnd_U", 12345, "C", "C[2:321682[", ConfidenceInterval.precise(), "bnd_V", "tra2");
Breakend left = Breakend.of(chr1, "bnd_U", Strand.POSITIVE, CoordinateSystem.ONE_BASED, 12346, 12345);
Breakend right = Breakend.of(chr2, "bnd_V", Strand.POSITIVE, CoordinateSystem.ONE_BASED, 321682, 321681);
assertThat(bnd, equalTo(Variant.of("tra2", left, right, "C", "")));
}
}
Another source of errors and confusion for the unprepared are VCF multi-allelic sites where a single VCF record lists two or more ALT alleles at a position. In order that the variants can be correctly represented they require 'trimming' to their shortest form which, depending on the untrimmed variant can have a different start to what is displayed in the VCF record.
Again there are several, occasionally opposing, schools of thought on how best to trim a variant left, then right ('right shift')
or right, then left ('left shift'). The HGVS recommend right shifting, whereas VCF requires left shifting. This is further
complicated by requiring a 'padding' base, as VCF does for the reference allele, or removing this base. The svart API
provides the VariantTrimmer
class to enable users to perform this in a strand-safe way.
class TrimMultipleAlleleTests {
@Test
public void trimMultiAllelicSite() {
// Given the VCF record:
// chr1 225725424 . CTT C,CT 258.06 FreqFilter AC=1,1;AF=0.500,0.500;AN=2;DP=7;ExcessHet=3.0103;FS=0.000;MAX_FREQ=3.1360424;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=29.21;SOR=0.941 GT:AD:DP:GQ:PL 1/2:0,4,3:7:58:275,76,58,106,0,85
Contig chr1 = TestContig.of(1, 249_250_621);
VariantTrimmer leftShiftingTrimmer = VariantTrimmer.leftShiftingTrimmer(VariantTrimmer.retainingCommonBase());
GenomicVariant firstAllele = trim(leftShiftingTrimmer, chr1, "", Strand.POSITIVE, CoordinateSystem.oneBased(), 225_725_424, "CTT", "C");
assertThat(firstAllele, equalTo(GenomicVariant.of(chr1, "", Strand.POSITIVE, CoordinateSystem.oneBased(), 225_725_424, "CTT", "C")));
GenomicVariant secondAllele = trim(leftShiftingTrimmer, chr1, "", Strand.POSITIVE, CoordinateSystem.oneBased(), 225_725_424, "CTT", "CT");
assertThat(secondAllele, equalTo(GenomicVariant.of(chr1, "", Strand.POSITIVE, CoordinateSystem.oneBased(), 225_725_424, "CT", "C")));
}
private GenomicVariant trim(VariantTrimmer variantTrimmer, Contig contig, String id, Strand strand, CoordinateSystem coordinateSystem, int start, String ref, String alt) {
VariantPosition trimmed = variantTrimmer.trim(strand, start, ref, alt);
return GenomicVariant.of(contig, id, strand, coordinateSystem, trimmed.start(), trimmed.ref(), trimmed.alt());
}
}
Svart does not however perform full variant normalisation as this should have been performed as part of the VCF creation.
What about BED? BED provides coordinates of a genomic region in
standard genomic coordinates i.e. left open intervals (a.k.a. zero-based) on the positive strand, but can optionally
indicate the strand on which the feature is located in column 6 with a +
(forward/positive) or -
(reverse/negative) character. The code
below shows how these records can be parsed into GenomicRegion
instances.
class BedTests {
@Test
public void parseGenomicRegionFromBedFile() {
// Load the Human GRCh37.13 assembly
GenomicAssembly b37 = GenomicAssembly.GRCh37p13();
// BED uses left-open coordinates, with positions in standard genomic coordinates (i.e. positive strand), with
// the 6th column indicating the strand. Using the example from - https://grch37.ensembl.org/info/website/upload/bed.html
GenomicRegion pos1 = parseBedRecord(b37, "chr7\t127471196\t127472363\tPos1\t0\t+");
GenomicRegion neg1 = parseBedRecord(b37, "chr7\t127475864\t127477031\tNeg1\t0\t-");
assertThat(pos1.contigName(), equalTo("7"));
assertThat(pos1.startOnStrand(Strand.POSITIVE), equalTo(127471196));
assertThat(pos1.endOnStrand(Strand.POSITIVE), equalTo(127472363));
assertThat(pos1.strand(), equalTo(Strand.POSITIVE));
assertThat(neg1.contigName(), equalTo("7"));
// a new instance on the positive strand can be created using neg1.toPositiveStrand(), however this will create
// a new object which may not be wanted in the long term, so to avoid this you can use the start/endOnStrand method
// which will efficiently calculate the result of neg1.withStrand(POSITIVE).start()/.end()
assertThat(neg1.startOnStrand(Strand.POSITIVE), equalTo(127475864));
assertThat(neg1.endOnStrand(Strand.POSITIVE), equalTo(127477031));
assertThat(neg1.strand(), equalTo(Strand.NEGATIVE));
}
private GenomicRegion parseBedRecord(GenomicAssembly genomicAssembly, String bedRecord) {
String[] fields = bedRecord.split("\t");
Contig contig = genomicAssembly.contigByName(fields[0]);
Strand strand = Strand.parseStrand(fields[5]);
int start = strand == Strand.POSITIVE ? Integer.parseInt(fields[1]) : Coordinates.invertCoordinate(CoordinateSystem.zeroBased(), contig, Integer.parseInt(fields[2]));
int end = strand == Strand.POSITIVE ? Integer.parseInt(fields[2]) : Coordinates.invertCoordinate(CoordinateSystem.zeroBased(), contig, Integer.parseInt(fields[1]));
return GenomicRegion.of(contig, strand, CoordinateSystem.zeroBased(), start, end);
}
}