The Variant Library Annotation Tool (VaLiAnT) is an oligonucleotide library design and annotation tool for Saturation Genome Editing and other Deep Mutational Scanning experiments.
A selection of libraries is included in the `examples` directory, including all necessary inputs and instructions to generate them.
Please also see the VaLiAnT Wiki for more information on use cases.
Please cite this paper when using VaLiAnT for your publications:
Variant Library Annotation Tool (VaLiAnT): an oligonucleotide library design and annotation tool for saturation genome editing and other deep mutational scanning experiments.
Barbon L, Offord V, Radford EJ, Butler AP, Gerety SS, Adams DJ, Tan HK, Waters AJ.
Bioinformatics. 2022 Jan 27;38(4):892-899.
DOI: 10.1093/bioinformatics/btab776. PMID: 34791067; PMCID: PMC8796380.
See the command line interface section for the full list and a more detailed description of the parameters.
Input parameters:
- species
- assembly
- 5' adaptor (optional)
- 3' adaptor (optional)
Main command input files:
- configuration file (JSON)
SGE input files:
- SGE targeton file (TSV)
- reference genome sequence (FASTA)
- reference genome index (FAI)
- PAM protection file (VCF, optional)
- VCF manifest file (CSV, optional)
- custom variant files (VCF, optional)
- features file (GTF/GFF2, optional)
- codon table with frequencies (CSV, optional)
- background variant file (VCF, optional)
- background variant mask file (BED, optional)
cDNA input files:
- cDNA targeton file (TSV)
- cDNA sequences (single multi-FASTA)
- cDNA annotation file (TSV, optional)
- codon table with frequencies (CSV, optional)
Output files:
- reference sequence retrieval quality check file (CSV, SGE-only)
- oligonucleotide metadata file (CSV)
- variant file (VCF, SGE-only)
- unique oligonucleotides file (CSV)
- configuration file (JSON)
The reference directory should contain both the FASTA file and its index; e.g., if the FASTA file is named `genome.fa`, a `genome.fa.fai` file should also be present in the same directory. When running the tool in a container, the directory containing both files should therefore be mounted.
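The requirement above can be checked before launching the tool. The following is a minimal sketch, not part of VaLiAnT; the function name `missing_reference_files` is hypothetical, and it only assumes the `<fasta>.fai` naming convention described above.

```python
from pathlib import Path


def missing_reference_files(fasta_path: str) -> list[Path]:
    """Return the reference files (FASTA and its .fai index) that are absent."""
    fasta = Path(fasta_path)
    # The index is expected next to the FASTA file, e.g. genome.fa -> genome.fa.fai
    index = fasta.with_name(fasta.name + ".fai")
    return [path for path in (fasta, index) if not path.is_file()]
```

An empty list means both files were found in the same directory, which is also the directory to mount when running in a container.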
The features file (SGE-only `gff` option) is required to detect exonic regions in the targeton, and should therefore be provided in most circumstances. The files should only contain features for one transcript per gene (the `gene_id` and `transcript_id` attributes are required to perform this check). The features file should match the assembly of the target reference genome. Any features of type other than `CDS` and `UTR` are ignored.
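The one-transcript-per-gene constraint can be verified on a features file before running the tool. This is an illustrative sketch rather than VaLiAnT's own check: the function names are hypothetical, and it assumes tab-separated GTF rows whose ninth column holds `key "value";` attribute pairs.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "ENSG00000012048.23";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def transcripts_per_gene(gtf_lines):
    """Map each gene_id to the transcript_id values seen on CDS and UTR rows."""
    genes = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Feature types other than CDS and UTR are skipped
        if len(fields) < 9 or fields[2] not in {"CDS", "UTR"}:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        # A KeyError here means a required attribute is missing
        genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return dict(genes)


def genes_with_multiple_transcripts(gtf_lines):
    """List the genes that would violate the one-transcript-per-gene rule."""
    per_gene = transcripts_per_gene(gtf_lines)
    return sorted(gene for gene, ids in per_gene.items() if len(ids) > 1)
```

With a file on disk, `genes_with_multiple_transcripts(open(path))` should return an empty list for a valid input.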
If the `codon-table` option is not set, the default codon table will be used.
Ambiguous nucleotides are not allowed in the reference sequence. Soft-masking is ignored.
Oligonucleotides exceeding a given length (`max-length` option) will not be included in the unique oligonucleotide files and their metadata will be stored in separate files marked as 'excluded'.
Background variants, if provided, are applied before PAM protection variants. When the CDS features of a transcript are provided (via a GTF/GFF2 file), such variants are applied over a single range, to keep the annotation consistent across targetons. This range is the minimal range of positions that spans at least the entire CDS, extended to the boundaries of any targeton overlapping the CDS, and finally to the start position of the first and the end position of the last background variant intersecting the resulting range.
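The range computation can be illustrated with a short sketch. It reflects one reading of the description above and is not taken from the VaLiAnT code base: the function name is hypothetical, and ranges are assumed to be one-based, end-inclusive `(start, end)` tuples on a single chromosome.

```python
def background_variant_range(cds, targetons, variants):
    """Compute the range in which background variants would be applied.

    cds:       (start, end) spanning the whole CDS
    targetons: list of (start, end) targeton ranges
    variants:  list of (start, end) background variant ranges
    """
    cds_start, cds_end = cds

    def intersects(rng, start, end):
        return rng[0] <= end and rng[1] >= start

    # Step 1: extend the CDS range to the targetons overlapping the CDS
    overlapping = [t for t in targetons if intersects(t, cds_start, cds_end)]
    start = min([cds_start] + [t[0] for t in overlapping])
    end = max([cds_end] + [t[1] for t in overlapping])

    # Step 2: extend to the background variants intersecting that range
    hits = [v for v in variants if intersects(v, start, end)]
    return (
        min([start] + [v[0] for v in hits]),
        max([end] + [v[1] for v in hits]),
    )
```

For example, with a CDS at `(100, 200)`, targetons at `(80, 120)` and `(300, 350)`, and variants at `(78, 82)`, `(199, 205)`, and `(400, 400)`, the second targeton and the third variant are ignored and the resulting range is `(78, 205)`.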
Background variants may be filtered out by position by providing a set of genomic ranges to be excluded via a BED file (`bg-mask` option). Excluding frame-shifting variants may affect the annotation.
By default, errors are raised when background variants are not synonymous or shift the reading frame; the `force-bg-ns` and `force-bg-indels` flags may be passed to allow them.
Mutations that overlap background variants are discarded; warnings are always raised to identify them, with their positions expressed in absolute genomic coordinates for custom variants, and targeton-relative coordinates for pattern variants.
After installing the package in an appropriate virtual environment:
valiant sge \
"${TARGETONS_FILE}" \
"${REFERENCE_FILE}" \
"${OUTPUT_DIR}" \
"${SPECIES}" \
"${ASSEMBLY}" \
--gff "${GTF_FILE}" \
--adaptor-5 "${ADAPTOR_5}" \
--adaptor-3 "${ADAPTOR_3}"
Alternatively, a configuration file can be provided to the main command:
valiant -c config.json
After building or pulling the Docker image (`quay.io/wtsicgp/valiant:X.X.X`, where `X.X.X` is a version tag):
docker run \
-v "${HOST_INPUTS}":"${INPUT_DIR}":ro \
-v "${HOST_REF}":"${REF_DIR}":ro \
-v "${HOST_OUTPUT}":"${OUTPUT_DIR}" \
valiant \
valiant sge \
"${INPUT_DIR}/${TARGETONS_FILE}" \
"${REF_DIR}/${REFERENCE_FILE}" \
"${OUTPUT_DIR}" \
"${SPECIES}" \
"${ASSEMBLY}" \
--gff "${INPUT_DIR}/${GTF_FILE}" \
--adaptor-5 "${ADAPTOR_5}" \
--adaptor-3 "${ADAPTOR_3}"
The `HOST_*` environment variables represent local paths to be mounted by the Docker container.
After pulling the Docker image with Singularity:
singularity exec \
--cleanenv \
-B "${HOST_INPUTS}":"${INPUT_DIR}":ro \
-B "${HOST_REF}":"${REF_DIR}":ro \
-B "${HOST_OUTPUT}":"${OUTPUT_DIR}" \
"${SINGULARITY_IMAGE}" \
valiant sge \
"${INPUT_DIR}/${TARGETONS_FILE}" \
"${REF_DIR}/${REFERENCE_FILE}" \
"${OUTPUT_DIR}" \
"${SPECIES}" \
"${ASSEMBLY}" \
--gff "${INPUT_DIR}/${GTF_FILE}" \
--adaptor-5 "${ADAPTOR_5}" \
--adaptor-3 "${ADAPTOR_3}"
Separate subcommands are provided depending on the target sequence origin:
- SGE (`sge`): sequences from reference, genomic coordinates, three target regions;
- cDNA DMS (`cdna`): user-provided sequences, relative coordinates, single target region.
The arguments and a few options are the same for both subcommands (see the arguments and options shared by both subcommands below), but the file formats may vary.
Main command.
Option | Format | Default | Description |
---|---|---|---|
`version` | flag | `false` | Show the version of the tool and quit. |
`config` | file path | - | Path to the configuration file. |
Arguments and options required or supported by both subcommands. The format of the input files may be different for SGE and cDNA targets (see the argument descriptions).
Argument | Format | Description |
---|---|---|
`OLIGO_INFO` | file path | Path to the targeton file (SGE or cDNA format). |
`REF_FASTA` | file path | Path to the FASTA file of the target reference genome (`sge`) or cDNA sequences (`cdna`). |
`OUTPUT` | file path | Output path (should exist already). |
`SPECIES` | species name | Target species, to be reported in the oligonucleotide metadata. |
`ASSEMBLY` | assembly name | Target assembly, to be reported in the oligonucleotide metadata. |
Option | Format | Default | Description |
---|---|---|---|
`codon-table` | file path | - | Path to a codon table with frequencies. |
`max-length` | integer | 300 | Maximum oligonucleotide length. |
`adaptor-5` | DNA sequence | - | DNA sequence to be added at the 5' end of the oligonucleotide. |
`adaptor-3` | DNA sequence | - | DNA sequence to be added at the 3' end of the oligonucleotide. |
`log` | log level | `WARNING` | Name of the preferred log level (see the official documentation of the `logging` module). |
Options specific to SGE.
The `REF_FASTA` path is expected to point to a reference genome in FASTA format.
Option | Format | Default | Description |
---|---|---|---|
`gff` | file path | - | Path to GTF/GFF2 file containing CDS and UTR features; one transcript per gene only. |
`bg` | file path | - | Path to a background variant VCF file. |
`pam` | file path | - | Path to a PAM protection file. |
`vcf` | file path | - | Path to a VCF manifest file. |
`revcomp-minus-strand` | flag | `false` | For minus strand targets, include the reverse complement of the mutated reference sequence in the oligonucleotide. |
`sequences-only` | flag | `false` | Generate the reference sequence retrieval quality check file and quit. |
`mask_bg_fp` | file path | - | Path to a BED file to exclude background variants from being applied to the specified genomic intervals. |
`force-bg-ns` | flag | `false` | Allow non-synonymous background variants. |
`force-bg-indels` | flag | `false` | Allow frame-shifting background variants. |
Options specific to cDNA DMS.
The `REF_FASTA` path is expected to point to a multi-FASTA containing cDNA sequences.
Option | Format | Default | Description |
---|---|---|---|
`annot` | file path | - | Path to a cDNA annotation file. |
Types of mutation that apply to any target (label):
- parametric deletion (e.g.: `1del`, `2del0`, `2del1`)
- single-nucleotide variant (`snv`)
Types of mutation that apply to CDS targets only (label):
- in-frame deletion (`inframe`)
- alanine codon substitution (`ala`)
- stop codon substitution (`stop`)
- all amino acid codon substitution (`aa`)
- SNVRE (`snvre`)
Variants imported from VCF files are labelled as `custom`.
Non-overlapping stretches of nucleotides of a given length are deleted starting from a given offset. No partial deletions are performed at the end of the target regions. Format: `<SPAN>del[<OFFSET>]` (the offset is assumed to be zero if not set).
For backwards compatibility, in the metadata table, `1del0` is reported as `1del`.
Given the target `ACGTAAA`, span two, and start offset zero (`2del0`), e.g.:
GTAAA
ACAAA
ACGTA
With start offset one (`2del1`), e.g.:
ATAAA
ACGAA
ACGTA
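The rule can be reproduced in a few lines. This is a minimal sketch for illustration, not the tool's implementation; the function names are hypothetical, and the label pattern follows the `<SPAN>del[<OFFSET>]` format described above.

```python
import re

LABEL_RE = re.compile(r"^(\d+)del(\d*)$")


def parse_deletion_label(label: str) -> tuple[int, int]:
    """Split a label such as '2del1' into (span, offset); offset defaults to 0."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"Invalid parametric deletion label: {label}")
    span, offset = match.groups()
    return int(span), int(offset or 0)


def parametric_deletions(target: str, span: int, offset: int = 0) -> list[str]:
    """Delete non-overlapping stretches of `span` nucleotides from `offset`.

    Stretches that would run past the end of the target are skipped,
    so no partial deletions are produced.
    """
    return [
        target[:i] + target[i + span:]
        for i in range(offset, len(target) - span + 1, span)
    ]
```

Calling `parametric_deletions("ACGTAAA", *parse_deletion_label("2del1"))` yields the three sequences of the second example.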
Each nucleotide is replaced with all the alternatives.
Given the target `AA`, e.g.:
CA
GA
TA
AC
AG
AT
For CDS targets, the resulting amino acid change is reported.
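As a sketch of the substitution rule only (the amino acid annotation is left out, and the function name is hypothetical), the mutated sequences can be enumerated as follows:

```python
def single_nucleotide_variants(target: str) -> list[str]:
    """Replace each nucleotide, in turn, with each of the three alternatives."""
    return [
        target[:i] + alt + target[i + 1:]
        for i, ref in enumerate(target)
        for alt in "ACGT"
        if alt != ref
    ]
```

A target of length `n` always yields `3 * n` sequences.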
Only for CDS targets.
Delete each triplet so that the reading frame is preserved.
Given the target `GAAATTTGG` with frame 2, e.g.:
GTTTGG
GAAAGG
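A minimal sketch of the deletion rule follows. The interpretation of the frame is an assumption inferred from the example above (frame 2 leaving one leading nucleotide before the first complete codon), and the function name is hypothetical.

```python
def inframe_deletions(target: str, frame: int = 0) -> list[str]:
    """Delete each complete triplet of the target, preserving the reading frame.

    Assumption: the first complete codon starts (3 - frame) % 3 nucleotides
    into the target, which matches the frame-2 example in the text.
    """
    first_codon = (3 - frame) % 3
    return [
        target[:i] + target[i + 3:]
        for i in range(first_codon, len(target) - 2, 3)
    ]
```

Incomplete codons at either end of the target are left untouched.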
Only for CDS targets.
Replace each codon with the top-ranking alanine codon.
Given the target `GCAAAATTT`, with `GCC` being the top-ranking alanine codon, e.g.:
GCCAAATTT
GCAGCCTTT
GCAAAAGCC
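The substitution can be sketched as below. The function name is hypothetical, the target is assumed to start in frame, and skipping codons that already equal the replacement is an assumption of this sketch rather than documented behaviour.

```python
def substitute_codons(target: str, codon: str) -> list[str]:
    """Replace each codon of an in-frame target with the given codon.

    Assumption: codons already identical to the replacement are skipped,
    since the result would be the unmodified target.
    """
    return [
        target[:i] + codon + target[i + 3:]
        for i in range(0, len(target) - 2, 3)
        if target[i:i + 3] != codon
    ]
```

The replacement codon is a parameter because the top-ranking codon depends on the codon table in use.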
Only for CDS targets.
Replace each codon with the top-ranking stop codon.
Given the target `TAACCCGGG`, with `TGA` being the top-ranking stop codon, e.g.:
TGACCCGGG
TAATGAGGG
TAACCCTGA
Only for CDS targets.
Replace each codon with the top-ranking codon of all amino acids. Given the default codon table, this results in 19 mutated sequences for each codon mapping to an amino acid (the reference amino acid being excluded) and 20 for each stop codon.
Given the target `AAATGA` on the plus strand, e.g. (each column representing the sequences generated from one codon):
ATCTGA AAAATC
ATGTGA AAAATG
ACCTGA AAAACC
AACTGA AAAAAC
AAAAAG
AGCTGA AAAAGC
CGGTGA AAACGG
CTGTGA AAACTG
CCCTGA AAACCC
CACTGA AAACAC
CAGTGA AAACAG
GTGTGA AAAGTG
GCCTGA AAAGCC
GACTGA AAAGAC
GAGTGA AAAGAG
GGCTGA AAAGGC
TTCTGA AAATTC
TACTGA AAATAC
TGCTGA AAATGC
TGGTGA AAATGG
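The counts of 19 and 20 can be checked with a short sketch. The list of top-ranking codons is copied from the example above; the standard genetic code table and the function name are additions made for illustration, not part of the tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

# Top-ranking codon of each amino acid, as listed in the example above
TOP_CODONS = [
    "ATC", "ATG", "ACC", "AAC", "AAG", "AGC", "CGG", "CTG", "CCC", "CAC",
    "CAG", "GTG", "GCC", "GAC", "GAG", "GGC", "TTC", "TAC", "TGC", "TGG",
]


def aa_substitutions(target: str) -> list[list[str]]:
    """For each codon, substitute the top codon of every other amino acid."""
    per_codon = []
    for i in range(0, len(target) - 2, 3):
        ref_aa = GENETIC_CODE[target[i:i + 3]]
        per_codon.append([
            target[:i] + codon + target[i + 3:]
            for codon in TOP_CODONS
            if GENETIC_CODE[codon] != ref_aa  # the reference amino acid is excluded
        ])
    return per_codon
```

For `AAATGA`, the first list has 19 entries (lysine is excluded) and the second has 20, since no amino acid is excluded for a stop codon.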
Only for CDS targets.
Given a set of SNVs, replace triplets according to the following rules:
- if the SNV results in a synonymous mutation, replace the triplet with all the synonymous triplets of the variant
- if the SNV results in a missense mutation, replace the triplet with the top-ranking synonymous triplet of the variant
- if the SNV results in a nonsense mutation, replace the triplet with the top-ranking stop codon
For a given missense or nonsense SNV mutation, if the resulting triplet is already the top-ranking one, the second highest ranking triplet is used to generate the SNVRE mutation instead.
Given the following SNVs for sequence `AAAAGT`, e.g.:
mseq ref alt
CAAAGT K Q
GAAAGT K E
TAAAGT K STOP
ACAAGT K T
AGAAGT K R
ATAAGT K I
AACAGT K N
AAGAGT K K
AATAGT K N
...
AAAAGC S S
...
There is only one synonymous mutation for the first triplet (`AAGAGT`), but since lysine maps to only two codons and one of them is the reference, no SNVRE variants are generated from it. The one for the second triplet (`AAAAGC`), though, results in the top-ranking codon for serine, which maps to six codons, and therefore the following four SNVREs are generated:
mseq ref alt
AAATCA S S
AAATCC S S
AAATCG S S
AAATCT S S
For missense mutations, the top-ranking codon (the current being excluded) for each alternative amino acid replaces the reference sequence:
mseq ref alt snv snvre
CAAAGT K Q CAA CAG
GAAAGT K E GAA GAG
ACAAGT K T ACA ACC
AGAAGT K R AGA CGG
ATAAGT K I ATA ATC
AACAGT K N AAC AAT
AATAGT K N AAT AAC
The resulting SNVRE variants would be:
mseq ref alt
CAGAGT K Q
GAGAGT K E
ACCAGT K T
CGGAGT K R
ATCAGT K I
AATAGT K N
AACAGT K N
For the nonsense mutation:
mseq ref alt snv snvre
TAAAGT K STOP TAA TGA
The resulting SNVRE variant would be:
mseq ref alt
TGAAGT K STOP
Unique codons do not generate SNVRE variants.
Applied to the targeton reference sequence as a whole. Only simple variants such as the following are supported:
- substitutions
- insertions (see below)
- deletions (see below)
- indels
The classification of the variants is based exclusively on the `POS`, `REF`, and `ALT` fields to be agnostic with respect to the VCF source.
In the VCF format, insertion and deletion positions refer to the base preceding the event, and the reference and alternative sequences both include the preceding (or following, if the variant starts at position one) base. For consistency with the conventions adopted for generated mutations, in the metadata table such variants are reported as shifted right by one, with the preceding (or following) base omitted from the reference and alternative sequences.
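The conversion between the two conventions can be sketched for the common case in which the padding base precedes the event. The function name is hypothetical, and variants starting at position one (where the padding base follows the event) are deliberately not handled here.

```python
def vcf_to_metadata(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Convert a simple VCF insertion or deletion to the metadata convention.

    Only pure insertions and deletions padded with the preceding base are
    converted; substitutions and indels are returned unchanged.
    """
    is_insertion = len(ref) == 1 and len(alt) > 1
    is_deletion = len(alt) == 1 and len(ref) > 1
    if (is_insertion or is_deletion) and ref[0] == alt[0]:
        # Shift right by one and drop the shared preceding base
        return pos + 1, ref[1:], alt[1:]
    return pos, ref, alt
```

For instance, the VCF deletion `POS=100, REF=AT, ALT=A` becomes position 101 with reference `T` and an empty alternative.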
Some of the dependencies are unsupported on Windows, and the tool cannot therefore be installed natively on it. The following options are available:
- installing the Windows Subsystem for Linux (WSL) and creating a Python virtual environment
- installing Docker (requires the WSL or Windows 10 Pro) and building or pulling the Docker image
- installing Singularity (requires a virtualisation solution) and building a Singularity image from the Docker image
The instructions that follow apply to Linux and macOS.
Please take care to read errors during the dependency installation step carefully. HTSlib (pysam) has system dependencies and will highlight the packages that need to be installed.
Requirements:
- Python 3.11 or above
To install in a virtual environment:
# Initialise the virtual environment
python3.11 -m venv .env
# Activate the virtual environment
source .env/bin/activate
# Install the valiant package
pip install .
To build the Docker container:
docker build -t valiant .
JSON file collecting the execution parameters. It is always generated as an output (`config.json`) and can optionally be used as input by the main command, e.g.:
valiant -c config.json
Property | Format | Description |
---|---|---|
`appName` | `valiant` | Name of the application (constant). |
`appVersion` | `x.y.z` | Version of the application. |
`mode` | `sge` or `cdna` | Execution mode. |
`params` | object | Execution parameters. |
An application version mismatch will result in a warning.
The execution parameters depend on the execution mode, and each corresponds to one of the command line arguments or options.
CLI argument | JSON property |
---|---|
`oligo_info_fp` | `oligoInfoFilePath` |
`ref_fasta_fp` | `refFASTAFilePath` |
`output_dir` | `outputDirPath` |
`species` | `species` |
`assembly` | `assembly` |
CLI option | JSON property |
---|---|
`adaptor-5` | `adaptor5` |
`adaptor-3` | `adaptor3` |
`min-length` | `minOligoLength` |
`max-length` | `maxOligoLength` |
`codon-table` | `codonTableFilePath` |
CLI option | JSON property |
---|---|
`revcomp-minus-strand` | `reverseComplementOnMinusStrand` |
`gff` | `GFFFilePath` |
`bg` | `backgroundVCFFilePath` |
`pam` | `PAMProtectionVCFFilePath` |
`vcf` | `customVCFManifestFilePath` |
`mask_bg_fp` | `maskBackgroundFilePath` |
`force-bg-ns` | `forceBackgroundNonSynonymous` |
`force-bg-indels` | `forceBackgroundFrameShifting` |
`include-no-op-oligo` | `includeNoOpOligo` |
Example:
{
"appName": "valiant",
"appVersion": "4.0.0",
"mode": "sge",
"params": {
"species": "homo sapiens",
"assembly": "GRCh38",
"adaptor5": "AATGATACGGCGACCACCGA",
"adaptor3": "TCGTATGCCGTCTTCTGCTTG",
"minOligoLength": 1,
"maxOligoLength": 300,
"codonTableFilePath": null,
"backgroundVCFFilePath": null,
"oligoInfoFilePath": "parameter_input_files/brca1_nuc_targeton_input.txt",
"refFASTAFilePath": "reference_input_files/chr17.fa",
"outputDirPath": "brca1_nuc_output",
"reverseComplementOnMinusStrand": true,
"includeNoOpOligo": false,
"GFFFilePath": "reference_input_files/ENST00000357654.9.gtf",
"PAMProtectionVCFFilePath": "parameter_input_files/brca1_protection_edits.vcf",
"customVCFManifestFilePath": "reference_input_files/brca1_custom_variants_manifest.csv",
"maskBackgroundFilePath": null,
"forceBackgroundNonSynonymous": false,
"forceBackgroundFrameShifting": false
}
}
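A configuration file of this shape can be inspected before it is passed to the main command. The sketch below is illustrative and independent of the tool's own validation: the function name is hypothetical, and rejecting an unexpected `appName` or `mode` is an assumption (only the version mismatch warning is described above).

```python
import json
import warnings


def load_config(text: str, current_version: str) -> tuple[str, dict]:
    """Parse a configuration document and return (mode, params)."""
    config = json.loads(text)
    if config["appName"] != "valiant":
        raise ValueError(f"Unexpected application name: {config['appName']}")
    if config["mode"] not in {"sge", "cdna"}:
        raise ValueError(f"Unexpected execution mode: {config['mode']}")
    if config["appVersion"] != current_version:
        # A version mismatch is reported but does not stop the run
        warnings.warn(
            f"Configuration written by version {config['appVersion']}, "
            f"running version {current_version}"
        )
    return config["mode"], config["params"]
```

Reading the file itself is left to the caller, e.g. `load_config(open("config.json").read(), "4.0.0")`.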
CLI option | JSON property |
---|---|
`annot` | `annotationFilePath` |
Example:
{
"appName": "valiant",
"appVersion": "4.0.0",
"mode": "cdna",
"params": {
"species": "human",
"assembly": "pCW57.1",
"adaptor5": "AATGATACGGCGACCACCGA",
"adaptor3": "TCGTATGCCGTCTTCTGCTTG",
"minOligoLength": 1,
"maxOligoLength": 300,
"codonTableFilePath": null,
"oligoInfoFilePath": "examples/cdna/input/cdna_targeton.tsv",
"refFASTAFilePath": "examples/cdna/input/BRCA1_NP_009225_1_pCW57_1.fa",
"outputDirPath": "examples/cdna/output",
"annotationFilePath": "examples/cdna/input/cdna_annot.tsv"
}
}
Tab-separated values (TSV) file describing the reference sequence coordinates and the types of mutation to be applied to the three target regions therein contained (collectively referred to as targeton). Multiple types of mutations can be applied to each target region. The coordinates of the target regions are derived from the genomic range of the second target region and an extension vector describing the lengths of the preceding and following regions.
Duplicate mutation types in any given group within the action vector are ignored.
Spacing is ignored when parsing the extension and action vectors.
The chromosome name needs to match the naming in the GTF/GFF2 file and in the reference genome.
Field | Format | Description |
---|---|---|
`ref_chr` | string | Chromosome name. |
`ref_strand` | `+` or `-` | DNA strand. |
`ref_start` | integer | Start position of the reference sequence. |
`ref_end` | integer | End position of the reference sequence. |
`r2_start` | integer | Start position of the second target region. |
`r2_end` | integer | End position of the second target region. |
`ext_vector` | `<int>, <int>` | Lengths of the first and third target regions. |
`action_vector` | `(<str>, ...), (<str>, ...), (<str>, ...)` | Type of mutation labels grouped by target region. |
`sgrna_vector` | `<str>, ...` | sgRNA identifiers matching with `SGRNA` tags in the PAM protection VCF file. |
Example:
ref_chr ref_strand ref_start ref_end r2_start r2_end ext_vector action_vector sgrna_vector
chrX + 41334132 41334320 41334253 41334297 25, 15 (1del), (1del, snv), (1del) sgrna_1, sgrna_2
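The derivation of the region coordinates from the second target region and the extension vector can be sketched as below. The function name and dictionary keys are hypothetical, and the sketch assumes one-based, end-inclusive coordinates laid out in genomic order (5' constant region, target regions 1 to 3, 3' constant region), which is consistent with the quality check example elsewhere in this document.

```python
def targeton_regions(ref_start, ref_end, r2_start, r2_end, ext_vector):
    """Derive the region ranges of a targeton as (start, end) tuples."""
    # Spacing in the extension vector is ignored: int() strips whitespace
    ext_1, ext_3 = (int(length) for length in ext_vector.split(","))
    region_1 = (r2_start - ext_1, r2_start - 1)
    region_3 = (r2_end + 1, r2_end + ext_3)
    if region_1[0] < ref_start or region_3[1] > ref_end:
        raise ValueError("Target regions exceed the reference sequence range")
    return {
        "const_5": (ref_start, region_1[0] - 1),
        "region_1": region_1,
        "region_2": (r2_start, r2_end),
        "region_3": region_3,
        "const_3": (region_3[1] + 1, ref_end),
    }
```

For the example row above, `targeton_regions(41334132, 41334320, 41334253, 41334297, "25, 15")` places target region 1 at 41334228-41334252 and target region 3 at 41334298-41334312. A zero-length region is represented by a range whose end precedes its start.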
Tab-separated values (TSV) file describing the target cDNA and the types of mutation to be applied to the target region therein contained (expressed in relative coordinates). Multiple types of mutations can be applied to the target region.
The cDNA identifier (`seq_id`) has to correspond to an entry in the multi-FASTA and (optionally) annotation files.
Field | Format | Description |
---|---|---|
`seq_id` | string | cDNA identifier. |
`targeton_start` | integer | Targeton start position. |
`targeton_end` | integer | Targeton stop position. |
`r2_start` | integer | Target region start position. |
`r2_end` | integer | Target region stop position. |
`action_vector` | `<str>, ...` | Type of mutation labels. |
Example:
seq_id targeton_start targeton_end r2_start r2_end action_vector
ENST00000357654.9 114 121 114 121 snv,1del,snvre
ENST00000357654.9 114 150 120 130 1del,2del0
TSV file describing the CDS region of each cDNA in relative coordinates (one-based and end-inclusive). Gene and transcript identifiers can also be provided.
Field | Format | Description |
---|---|---|
`seq_id` | string | cDNA identifier. |
`gene_id` | string | Gene ID. |
`transcript_id` | string | Transcript ID. |
`cds_start` | string | cDNA CDS relative start position. |
`cds_end` | string | cDNA CDS relative end position. |
Example:
seq_id gene_id transcript_id cds_start cds_end
brca1_357654.9 ENSG00000012048.23 ENST00000357654.9 114 5705
CSV file listing the VCF files from which to import variants. Each VCF file is given an alias. If a tag is specified (`vcf_id_tag`), the VCF `INFO` field will be expected to contain it and its values will be used as variant identifiers; if no tag is specified, the `ID` field will be used instead.
Field | Format | Description |
---|---|---|
`vcf_alias` | string | VCF file alias. |
`vcf_id_tag` | VCF tag | (Optional) Variant ID tag. |
`vcf_path` | file path | VCF file path. |
Example:
vcf_alias,vcf_id_tag,vcf_path
clinvar_1,ALLELEID,clinvar_abc.vcf
clinvar_2,ALLELEID,clinvar_xyz.vcf
gnomad,,gnomad_abc.vcf
VCF file containing single-nucleotide substitution variants linked to sgRNA identifiers via the `SGRNA` tag.
Example:
##fileformat=VCFv4.3
##INFO=<ID=SGRNA,Number=1,Type=String,Description="sgRNA identifier">
#CHROM POS ID REF ALT QUAL FILTER INFO
chrX 41334252 . G C . . SGRNA=sgRNA_1
chrX 41337416 . C T . . SGRNA=sgRNA_2
chrX 41339064 . G A . . SGRNA=sgRNA_3
chrX 41341504 . T C . . SGRNA=sgRNA_4
chrX 41341509 . G A . . SGRNA=sgRNA_4
Comma-separated values (CSV) file containing name, label, and all metadata of the oligonucleotides generated for any given targeton.
For cDNA targets, the reference chromosome (`ref_chr`) and strand (`ref_strand`) will be missing and all positions will be relative to the cDNA sequence. All fields related to PAM protection (`pam_seq`) and custom VCF variants (`vcf_alias`, `vcf_var_id`, and `vcf_var_in_const`), features unavailable for this target type, will also be empty (except for `vcf_var_in_const`, which will be set to zero).
The MAVE-HGVS strings are all linear genomic (relative to the start of the targeton) and do not include the reference. Because in HGVS insertion positions are described by the flanking nucleotides, those occurring at either end of the reference sequence should be treated differently (see the 3' rule in the relevant HGVS documentation); for consistency between SGE and cDNA mode, simplicity, and given the limited usefulness of liminal insertions, this is not the case in the current implementation, and therefore the invalid position zero might be found in insertion names.
Array fields use the semicolon as separator.
Index | Field | Format | Description |
---|---|---|---|
1 | `oligo_name` | string | Name of the oligonucleotide. |
2 | `species` | species name | Species. |
3 | `assembly` | assembly name | Assembly. |
4 | `gene_id` | string | Gene ID. |
5 | `transcript_id` | string | Transcript ID. |
6 | `src_type` | `ref` or `cdna` | Sequence source type (reference genome or cDNA). |
7 | `ref_chr` | string | Chromosome name. |
8 | `ref_strand` | `+` or `-` | DNA strand. |
9 | `ref_start` | integer | Start position of the reference sequence. |
10 | `ref_end` | integer | End position of the reference sequence. |
11 | `revc` | `0` or `1` | Whether the oligonucleotide contains the reverse complement of the reference sequence (minus strand transcripts only). |
12 | `ref_seq` | DNA sequence | Reference sequence. |
13 | `pam_seq` | DNA sequence | PAM-protected reference sequence. |
14 | `vcf_alias` | string | VCF file alias (custom mutations only). |
15 | `vcf_var_id` | string | Variant ID (custom mutations only). |
16 | `mut_position` | integer | Start position of the mutation. |
17 | `ref` | DNA sequence | Reference nucleotide or triplet. |
18 | `new` | DNA sequence | Mutated nucleotide or triplet. Not set for deletions. |
19 | `ref_aa` | amino acid | Reference amino acid. |
20 | `alt_aa` | amino acid | Alternative amino acid. |
21 | `mut_type` | `syn`, `mis`, or `non` | Mutation type. |
22 | `mutator` | type of mutator | Label of the type of mutator that generated the oligonucleotide. |
23 | `oligo_length` | integer | Oligonucleotide length. |
24 | `mseq` | DNA sequence | Full oligonucleotide sequence (with adaptors, if any). |
25 | `mseq_no_adapt` | DNA sequence | Oligonucleotide sequence excluding adaptors. |
26 | `pam_mut_annot` | Array of `syn`, `mis`, `non`, or `ncd` | Applied PAM protection variant mutation types (or `ncd` if affecting a noncoding region). |
27 | `pam_mut_sgrna_id` | Array of sgRNA IDs | sgRNA IDs bound to the PAM protection variants spanned by the mutation or affecting the same codons as the mutation, if any. |
28 | `mave_nt` | MAVE-HGVS string | MAVE-HGVS string corresponding to the mutation. |
29 | `mave_nt_ref` | MAVE-HGVS string | MAVE-HGVS string corresponding to the mutation, where `REF` does not include PAM protection. |
30 | `vcf_var_in_const` | `0` or `1` | Whether the variant is in a region defined as constant (custom mutations only). |
31 | `background_variants` | MAVE-HGVS strings | MAVE-HGVS strings corresponding to the background variants overlapping the targeton range, semicolon-separated. |
32 | `background_seq` | DNA sequence | Reference sequence altered by the background variants. |
Example:
oligo_name,species,assembly,gene_id,transcript_id,src_type,ref_chr,ref_strand,ref_start,ref_end,revc,ref_seq,pam_seq,vcf_alias,vcf_var_id,mut_position,ref,new,ref_aa,alt_aa,mut_type,mutator,oligo_length,mseq,mseq_no_adapt,pam_mut_annot,pam_mut_sgrna_id,mave_nt,mave_nt_ref,vcf_var_in_const,background_variants,background_seq
ENST00000357654.9.ENSG00000012048.23_chr17:43104102_1del_rc,homo sapiens,GRCh38,ENSG00000012048.23,ENST00000357654.9,ref,chr17,-,43104080,43104330,1,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCACGGTTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCGCGATTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC,,,43104102,A,,,,,1del,291,AATGATACGGCGACCACCGAGTTTCTTTGATTATAATTCATACATTTTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAGATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAATCGCGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTCTTCTTCTTCTTCTTCTTTTCTTCGTATGCCGTCTTCTGCTTG,GTTTCTTTGATTATAATTCATACATTTTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAGATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAATCGCGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTCTTCTTCTTCTTCTTCTTTTCT,syn;syn,,g.23del,g.23del,0,,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCACGGTTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC
VCF files containing a subset of the metadata in VCF format. The metadata are stored in the `INFO` field. The `REF` field reports the reference sequence including (`*_pam.vcf`) or excluding (`*_ref.vcf`) PAM protection edits.
The variants can be linked to the corresponding oligonucleotides via the `SGE_OLIGO` tag, and, for custom variants, to the original VCF files via the `SGE_VCF_ALIAS` and `SGE_VCF_VAR_ID` tags.
`INFO` tags:
Tag | Metadata field | Description |
---|---|---|
`SGE_OLIGO` | `oligo_name` | Corresponding oligonucleotide name. |
`SGE_SRC` | `mutator` | Variant source. |
`SGE_REF` | `ref` | (Optional) Reference sequence, if different from the PAM-protected reference sequence (PAM VCF only). |
`SGE_VCF_ALIAS` | `vcf_alias` | (Optional) VCF variant source file alias, only for custom variants. |
`SGE_VCF_VAR_ID` | `vcf_var_id` | (Optional) VCF variant identifier, only for custom variants. |
Comma-separated values (CSV) file containing only the label and the sequence of the oligonucleotides generated for any given targeton, where the sequences are unique. This is a subset of the oligonucleotide metadata file fields (`oligo_name` and `mseq`) and rows. When multiple oligonucleotides have the same sequence, the first name in lexicographic order is chosen.
Example:
oligo_name,mseq
ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>A_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAAGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>C_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATACGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>T_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATATGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146474_A_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCCCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146475_C_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146477_G_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
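The deduplication rule for this file can be sketched as follows. The function name is hypothetical, and returning the rows sorted by name is a choice made for this sketch; the row order of the actual output file is not specified here.

```python
def unique_oligos(rows):
    """Keep one (oligo_name, mseq) pair per distinct sequence.

    When several oligonucleotides share a sequence, the first name in
    lexicographic order is kept.
    """
    name_by_sequence = {}
    for name, sequence in rows:
        current = name_by_sequence.get(sequence)
        if current is None or name < current:
            name_by_sequence[sequence] = name
    # Sorted by name for a deterministic result (a choice of this sketch)
    return sorted((name, seq) for seq, name in name_by_sequence.items())
```

This situation arises, for example, when deleting either of two adjacent identical nucleotides produces the same sequence.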
Comma-separated values (CSV) file with no header reporting the reference sequences as retrieved based on the genomic coordinates and extension vector provided in the SGE targeton file.
The targeton name is derived from the genomic coordinates of the reference sequence.
Field | Format | Description |
---|---|---|
(Targeton name) | `<CHR>_<START>_<END>_<STRAND>` | Name of the targeton. |
(Reference genomic range) | `<CHR>:<START>-<END>` | Reference sequence region. |
(5' constant region start) | integer | Start position of the 5' constant region. |
(5' constant region sequence) | DNA sequence | Sequence of the 5' constant region. |
(Target region 1 start) | integer | Start position of target region 1. |
(Target region 1 sequence) | DNA sequence | Sequence of target region 1. |
(Target region 2 start) | integer | Start position of target region 2. |
(Target region 2 sequence) | DNA sequence | Sequence of target region 2. |
(Target region 3 start) | integer | Start position of target region 3. |
(Target region 3 sequence) | DNA sequence | Sequence of target region 3. |
(3' constant region start) | integer | Start position of the 3' constant region. |
(3' constant region sequence) | DNA sequence | Sequence of the 3' constant region. |
Example:
chr3_10146443_10146687_plus,chr3:10146443-10146687,10146443,GGATTACAGGTGTGGGCCACCGTGCCCAGCC,10146474,ACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAG,10146514,GTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAG,10146637,GTACTGACGTTTTACTTTTTAAAAAGATAAGGTTG,10146672,TTGTGGTAAGTACAGG
To run the unit tests, install the extra requirements first:
pip install -r test-requirements.txt
./run_tests.sh
VaLiAnT
Copyright (C) 2020, 2021, 2022, 2023, 2024 Genome Research Ltd
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.