Seq2C is Copy-Number Variations (CNV) variant caller. Seq2C normalizes the coverage from targeted sequencing to CNV log2ratio, and detects CNVs in regions that have abnormally high or low read depths compared to the rest.
Requirements:
- Program 'samtools' need to be in your PATH.
- Perl version 5.8 and higher.
- You need to add PATH variables to Seq2C scripts, description inthis section.
The Seq2C source code is located at https://github.com/AstraZeneca-NGS/Seq2C.
To load the project, execute the following command:
git clone https://github.com/AstraZeneca-NGS/Seq2C.git
Add short paths to the scripts seq2cov.pl, cov2lr.pl, lr2gene.pl, bam2reads.pl and others to your OS PATH.
In Linux it can be done by adding this line to your .bashrc
file:
export PATH=path_to_Seq2C_scripts_folder:$PATH
For using plotting script, also add the plotting folder:
export PATH=path_to_Seq2C_scripts_folder/plotting:$PATH
Usage:
seq2c.sh sample2bam.txt bed [control_sample1:[control_sample2]]
Parameters:
sample2bam.txt
- required, file must contains name of sample and paths to bam files with TAB delimiter (\t
). Must contain one empty line in end of file. The example of the file:
NA12878.unmapped /samples/NA12878.unmapped.bam
NA12889.mapped /samples/NA12889.mapped.bam
bed
- required, path to BED file with TAB delimiter. Can be simple or amplicon. For simple mode it must contain 4 columns:
2 45904950 45904999 gene
For amplicon mode it must contain 8 columns. 7 column must contain amplicon start (must be more than start of region), 8 column must contain amplicon end (must be less than end of region).
2 45904950 45904999 gene . . 45904958 45904994
control_samples
- optional, for multiple controls, separate them using colon (:
).
seq2c.sh script starts scripts from this repository in this order:
- seq2cov.pl - Calculate candidate variance for a given region(s) in an indexed BAM file.
- bam2reads.pl - Reads each read from sample and BAM file with samtools, creates statistics in
read_stats.txt
file. - cov2lr.pl - Normalizes the coverage from targeted sequencing to CNV log2 ratio.
- lr2gene.pl - Converts a coverage file with CNV log2 ratio to copy number profile.
Output result (seq2c_results.txt
file) contains columns:
- Sample - sample name from original sample2bam file
- Gene - gene name from BED file/region info
- Chr - chromosome name from BED file/region info
- Start - start of the segment/gene from BED file/region info
- End - end of the segment/gene from BED file/region info
- Length - length of the segment/gene
- Log2ratio - CNV log2 ratio of normalized median depth by sample
- Sig - significance (0 if AMP or DEL)
- BP_Whole - type of CNV: Whole if AMP or DEL, else empty
- Amp_Del - AMP (amplified) if log2 ratio more then -A option, DEL (deleted) if log2 ratio less then -D option
- Ab_Seg - affected segments on gene
- Total_Seg - total count of segments on gene
- Ab_log2ratio - log2 ratio of normalized median depth by sample
- Log2r_Diff - difference between CNV log2 ratio
- Ab_Seg_Loc - segment location: ALL if AMP or DEL, else empty
- Ab_Samples - abberation samples count for segments. Output only for BP calls that are “good”, i.e. wasn’t filtered by thresholds.
- Ab_Samples_Pcnt - abberation samples fraction from total number of samples.
You can run scripts separately if you want transfer additional command (for example, gender files or tumor purities). Parameters for scripts are listed in this chapter.
Usage:
seq2cov.pl [-hz] [-n name_reg] [-b bam] [-c chr] [-S start] [-E end] [-s seg_starts] [-e seg_ends] [-x #_nu] [-g gene] [-o ori] [-d depth] region_info
The program will calculate candidate variance for a given region(s) in an indexed BAM file. The default input is IGV's one or more entries in refGene.txt, but can be regions (BED file).
Arguments are:
region_info
Required. IGV's one or more entries in refGene.txt, but can be regions (BED file).
Options are:
-h
Print this help-a int:float
Indicate that it's PCR amplicon based calling. Each line in region_info represents a PCR amplicon (including primers). Two numbers in option are parameter to decide whether a particular read or pairs belongs to the amplicon.
First is the number of base pairs.
The second is the fraction of overlapped portion to the length of read or pairs.
If the edges of reads (paired for Illumina) are within defined base pairs to the edges of amplicons and overlapped portion greater then the fraction, it's considered belonging to the amplicon.
Suggested values are: 10:0.95. When given a 8 column amplicon format BED files, it'll be set to 10:0.95 automatically, but can still be overwritten by -a option.-n name_reg
The regular expression to extract sample name from bam filename-N name
Mutual exclusive to -n. Set the sample name to name-b bam
The indexed BAM file-c chr
The column for chr-S start
The column for region start, e.g. gene start-E end
The column for region end, e.g. gene end-s seg_starts
The column for segment starts in the region, e.g. exon starts-e seg_ends
The column for segment ends in the region, e.g. exon ends-g gene
The column for gene name-x num
The number of nucleotide to extend for each segment, default: 0-z
Indicate whether it's zero based numbering, default is 1-based
The program produces a file containing number of mapped or sequenced reads for samples.
Usage:
bam2reads.pl sample2bam.txt
Arguments are:
sample2bam.txt
file must contains name of sample and paths to bam files with TAB delimiter (\t
).
Usage:
cov2lr.pl [-aH] [-c control] mapping_reads.txt coverage.txt
The program will convert a coverage file to copy number profile.
Arguments are:
mapping_reads.txt
Required. A file containing # of mapped or sequenced reads for samples. At least two columns. Can be created bybam2reads.pl
script. First is the sample name, 2nd is the number of mapped or sequenced reads.coverage.txt
The coverage output file from seq2cov.pl script. Can also take from standard in or more than one file.
Options are:
-a
Indicate this is amplicon or exon based calling. By default, it'll aggregate at gene level.-M
Indicate to adjust the MAD when transforming the distribution. Default: no, or just simple linear function. If not sure, don't use this option. It might have better performance when cohort size is over 30.-c sample(s)
Specify the control sample(s), if aplicable. Multiple controls are allowed, which are separated by ":"-F double
The failed factor for individual amplicons. If (the 80th percentile of an amplicon depth)/(the global median depth) is less than the argument, the amplicon is considered failed and won't be used in calculation. Default: 0.2.-G Gender
Take a file of gender information. Two columns, first is sample name, second is either M or F. If not provided, the program will try to guess.-Y chrYratio
For gender testing, if chrY is designed. Default: 0.15. If chrY is carefully designed, such as Foundation's assay, default is good. For exome, the number should be higher, such as 0.3.-Z
Indicate to output the frozen_file and all parameters into file Seq2C.frozen.txt
The program will convert a coverage file to copy number profile. The default parameters are designed for
detecting such aberrations for high tumor purity samples, such as cancer cell lines.
For clinical samples, many parameters need to be adjusted.
Usage:
lr2gene.pl [-aPH] [-cy] [-F float] [-s min_amplicon_#] [-A float] [-D float] cov2lr.txt
Arguments are:
cov2lr.txt
The coverage output file from cov2lr.pl script. Can also take from standard in or more than one file.
Options are:
-c
Indicate that control sample is used for normalization-y
Debugging mode-s int
The minimum consecutive amplicons to look for deletions and amplifications. Default: 1. Use with caution when it's less than 3. There might be more false positives. Though it has been successfully applied with option "-s 1" and identified one-exon deletion of PTEN and TP53 that were confirmed by RNA-seq.
For whole gene:
-A float
Minimum log2 ratio for a whole gene to be considered amplified. Default: 1.50-D float
Minimum log2 ratio for a whole gene to be considered deleted. Default: -2.00
For < 3 exons:
-E float
Minimum mean log2 ratio difference for <3 exon deletion/amplification to be called. Default: 1.25-M float
When considering partial deletions less than 3 exons/amplicons, the minimum MAD value, in addition to -E, before considering it to be amplified or deleted. Default: 10-d float
When considering >=3 exons deletion/amplification within a gene, the minimum differences between the log2 of two segments. Default: 0.5-p float (0-1)
The p-value for t-test when consecutive exons/amplicons are >= 3. Default: 0.00001
For break point in the middle of the gene:
-e float
When considering breakpoint in the middle of a gene, the minimum number of exons. Default: 5-t float
When considering breakpoint in the middle of a gene, the minimum differences between the log2 of two segments. Default: 0.4-P float (0-1)
The p-value for t-test when the breakpoint is in the middle with min exons/amplicons >= [-e]. Default: 0.000001
For cohort level aberrations:
-R float (0-1)
If a breakpoint has been detected more than "float" fraction of samples, it's considered false positive and removed. Default: 0.03, or 3%. Use in combination with -N-N int
If a breakpoint has been detected more than "int" samples, it's considered false positives and removed. Default: 5. Use in combination with -R.
Here is the information about scripts that can be used to generate gender files or convert the results of Seq2C.
The program will parse seq2c output and make calls for each gene and output in the format compatible with OncoPrint.
It has the option to provide the purity so that log2ratio thresholds will be adjusted accordingly.
By default, it calls genes that are homozygously deleted or amplified with >= 6 copies.
Usage:
seq2c2fm.pl [-g] [-e exons] [-n reg] [-N num] [-A amp] [-a amp] [-D del] [-d del] [-p purity_file] [-P purity] lr2gene_output
Arguments are:
lr2gene_output
The output file from seq2c.sh script (or lr2gene.pl). Can also take from standard in or more than one file.
Options:
-k
Print header-g
Whether to output copy gains [4-5] copies. Default: no.-p file
A file contains the tumor purity for all samples. Two columns, first is sample name, second is the purity in % or fraction [0-1].-P double
The purity. Default: 1 or 100%, as is for cell lines. If set, all samples will assume to have the same purity.-n regex
The regular expression to extract sample names. Default: none.-N num
If an breakpoint has been called in >= num of samples, it's deemed false positive. Default: 5-e exons
Minimum number of exons/amplicon. Default: 1
For whole gene:
-D log2ratio
The log2ratio to determine that a gene is homozygously deleted. Default: -2.0-A log2ratio
The log2ratio to determine that a gene is amplified. Default: 1.45.
For < 3 exons:
-d log2ratio
The minimum log2ratio to determine that 1-2 exons are deleted. Should be lower than [-d] to reduce false positives. Default: -2.5-a log2ratio
The minimum log2ratio to determine that 1-2 exons are amplified. Should be larger than [-a] to reduce false positives. Default: 1.75
For gains:
-G Genes
List of genes, seperated by ":", for which gain will be captured. Default: MYC
Output:
Some columns are filled with blanks, "NA" or "-" for compatibility with OncoPrint format.
- Sample
- Always ""
- Variant_type: always "copy_number_alteration"
- Gene
- Always "NA"
- Always "-"
- Always "-"
- Segment "chr:start"
- Always "-"
- Always "-"
- Transf_LogRatio: transformed log_ratio as (2^LogRatio)*2
- Segments: for BP - affected segments of total segments, for Whole - both numbers are total segments.
- LogRatio
- Alteration in format "amplification/gain/loss"
- Always "-"
- Always "-"
- Always "-"
- Always "-"
- Always "-"
- Always "-"
- Always "-"
- Alteration in format "Amplification/Gain/Deletion"
The program will calculate the chrY coverage and make prediction of genders. The BAM file need to be
targeted with chrY genes, exome or WGS. Otherwise, it might make wrong prediction.
Output can be used in cov2lr.pl script with -G option.
Usage:
testGender.pl [-hHy] [-B gender_BED] [-d dir] [-x cov] [-L len] [-b bam] [-n regex] [-N name] sample2bam.txt
Output: The output contains 7 columns:
- The name of the sample
- Predicted gender. Possible values are: 2.1 Male It's likely a male 2.2 Female It's likely a female 2.3 X,-Y It's likely only one X chr, but without Y chr. 2.4 Unknown The information is not enough to determine the gender
- ChrY median depth
- ChrX median depth
- ChrA (autosome) median depth
- p_value of t-test between chrX and chrA
- The ratio between chrA/chrX (0.75-1.25 for Female, 1.65-2.35 for Male, roughly)
Limitations:
- The bed file used for gender test should be targeted for targeted panels. No issue for WXS or WGS.
- The chrY in bed file should be unique to chrY
- For pure samples (e.g. cancer cell lines), males lost chrY or females lost one chrX can't not be differentiated for type "X,-Y". It should be less an issue for clinical sequencing, as there're almost always have normal cells in the sample.
- For pure samples, males who lose chrY and but with amplified chrX can be mistaken as female.
Options:
-H
Print this help usage.-h
Print the header.-y
Print intermediate results (debuging purpose only).-B chrY_BED
The BED file with CDS of chrY specific genes. Default to hg19_gender.bed in seq2c base directory.-d dir
The installed seq2c base directory, if it's not in the path-x coverage
The expected depth of coverage. Specify it unless the depth is > 10x. Default: 10.-L read_length
The read length. Default: 100-b bam_file (optional)
The bam file. For single sample prediction only, and no need to have sample2bam.txt file.-n regex
Used with -b option. The regular expression to extract sample name-N string
The sample name
EXAMPLES
- Given a BAM file aligned to hg19, sample.bam, determine the gender:
testGender.pl -b sample.bam -B hg19_gender.bed -N sample_name
- If you have a file (samples.txt) with list of BAM files (tab delimited), then run the following command,
assuming aligned to hg19:
cat samples.txt | testGender.pl -B hg19_gender.bed
The program will create matrix consists needed number of lines containg 5 random elements in each line. Sum of elements in each line will be the requested.
Usage:
testrand.pl sum_of_elements lines
Arguments are:
sum_of_elements
Sum of random elements in each line.lines
Number of lines in constructed matrix.
Written by Zhongwu Lai, AstraZeneca, Boston, USA