SV-HotSpot: detection of hotspots targeted by recurrent structural variants associated with gene expression
SV-HotSpot is developed at Christopher Maher Lab at Washington University in St. Louis.
Clone the SV-Hotspot repository (with PATH_TO_SV_HOTSPOT is the local directory where you want to install SV-HotSpot):
cd PATH_TO_SV_HOTSPOT
git clone https://github.com/ChrisMaherLab/SV-Hotspot.git
Install required R packages by running the following command inside an R session:
install.packages(peakPick)
install.packages(ggplot2)
install.packages(plyr)
install.packages(reshape2)
install.packages(grid)
install.packages(gridExtra)
install.packages(gtable)
install.packages(data.table)
install.packages(RCircos)
install.packages(stringr)
install.packages(ggsignif)
SV-HotSpot uses bedTools (https://github.com/arq5x/bedtools2) for various overlapping and counting procedures, which should also be installed and made available as command lines.
The tool requires as an input the following:
- Genome assembly name (e.g. hg38, hg19, mm9, mm10, etc.) which is used to extract chromosomes sizes file (a tab-delimited file with two columns, chrom and size). Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). SV-HotSpot built-in genomes are: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
- Structural variants file in BEDPE format (Example of how to prepare this file is available in test_data folder). In case only VCF files are available, please try to convert them to BEDPE format using existing tools such vcftobedpe form svtools. Instructions on how to install it and use it can be found here.
Optional files:
-
Expression data in a matrix format where the first column represents the feature (e.g. gene) and columns represents samples. See the example provided at test_data folder. Please note that features in this file have to be unique.
-
Copy number segments in BED format. See the example at test_data on how to prepare this file.
-
An annotation file in BED format. If the user didn't provide this file, a built-in annotation file based on the genome name will be used. Please note that the feature in the expression file has to match the one in this file.
-
Region of interest file(s) (e.g. enhancers, transcription binding sites, etc) in BED format. An example of this file is given in test_data folder. Multiple files can be provided but have to be separated by comma (make sure no space between the file names).
-
ChIP-Seq coverage in BED format. See the example at test_data on how to prepare this file.
All other parameters are optional and a default value was assigned to each (run sv-hotspot.pl --help
for more details).
IMPORTANT NOTES:
- To prepare all your files, please look at the examples provided in test_data folder and do accordingly.
- Structural variant types must be in the format of THREE letters and should only include the following types:
BND, DEL, DUP, INS, INV
. - All files headers should be the same as the ones in the examples files.
- The chromosome names in all files should be consistent and in the format of chr#.
- The feature name in the annotation file and the expression file should be the same.
- The "name" column in the SVs file should be in the format of sample/type (e.g. Sample1/INV)
Assume PATH_TO_SV_HOTSPOT is the local directory where SV-HotSpot was installed, the following command runs SV-HotSpot on the mCRPC test data provided by SV-HotSpot. The test data are specifically for identifying SV hotspots affecting androgen receptor (AR) gene. To read more about this study, please refer to this Cell paper.
perl PATH_TO_SV_HOTSPOT/src/sv-hotspot.pl -o OUTPUT \
-g hg38 -C chrX \
--sv PATH_TO_SV_HOTSPOT/test_data/sv.bedpe \
-e PATH_TO_SV_HOTSPOT/test_data/exp.tsv \
-c PATH_TO_SV_HOTSPOT/test_data/cna.tsv \
-r PATH_TO_SV_HOTSPOT/test_data/enhancers.bed \
--chip-cov PATH_TO_SV_HOTSPOT/test_data/H3K27ac.bg \
--chip-cov-lbl H3K27ac \
-w 100000 -s 30000 -d 50000 \
--t-amp 2.99 --t-del 1.35 \
--stat-test wilcox.test --pval 0.05 \
--plot-top-peaks 10 --left-ext 100000 --right-ext 100000
More SV-HotSpot command line paramters:
USAGE:
sv-hotspot.pl [OPTIONS] -g/--genome <genomeName> --sv <structuralVariants>
NOTE:
(1) Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1).
- Built-in Genomes: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
- Please refer to the documentation in case your genome is not listed above.
(2) Structutal variants file should be in "BEDPE" format.
(3) Gene expression data and copy number segments are required to visualize hotspot regions.
(4) Region of interest file(s) (e.g. promoters, enhancers, chip-seq, etc.) should be in "BED" format
OPTIONS:
-w/--sliding-win-size sliding window size <int> [ sliding window size. default: 100kb ]
-s/--sliding-win-step sliding window step <int> [ sliding window step. default: 1kb ]
-a/--annot annotation file <filename> [ an annotation file in "BED" format ]
-e/--exp expression file <filename> [ expression file in a "matrix" format ]
-c/--cn Copy number file <filename> [ copy number segments in BED format ]
-W/--peakPick-window-size peak calling window <int> [ peakPick window size. default: 100bp ]
-m/--peakPick-min-sd peak calling min. SD <int> [ peakPick standard deviation cutoff. default: 5 ]
-t/--pct-samples percentage of samples <int> [ percentage of samples cutoff to call peaks. default: 10 ]
-o/--output output directory <string> [ results output directory. default: ./ ]
-p/--pval pvalue cuttoff <float> [ pvalue threshold used for significance. default: 0.05 ]
-G/--genes-of-int list of genes <filename> [ list of genes of interest to be used for visualization ]
-r/--region-of-int region(s) of interest <filename> [ region of interest file(s) in "BED" format separated by comma ]
-C/--chrom chromosome name <string> [ chromosome name used to detect hotspots. default: ALL ]
-S/--sv-type structural variant type <string> [ SV type used to detect hotspots. default: ALL ]
-d/--merge-dist-size distance size <int> [ distance cutoff used to merge adjacent peaks. default: 10kb ]
--merge-pct-samples percentage of samples <int> [ percentage of samples cutoff to merge similar peaks. default: 5 ]
--stop-merge-num-peaks number of peaks <int> [ number of peaks cutoff to stop merging adjacent peaks. default: 0 ]
-k/--num-nearby-genes Number nearby genes <int> [ number of up/downstream genes to the peak. default: 1 ]
--t-amp amplification threshold <float/int> [ amplification cutoff. default: 2.8 ]
--t-del deletion threshold <float/int> [ deletion cutoff. default: 0.5 ]
--stat-test statistical test <string> [ statistical test used for comparison (wilcox.test or t.test). default: wilcox.test ]
--chip-cov chip-seq coverage <filename> [ chip-seq coverage file in "BED" foramt ]
--chip-cov-lbl chip-seq coverage label <string> [ label used for chip-seq coverage ]
--plot-top-peaks plot top # peaks <int> [ number of top peaks to plot. default: top 10 ]
--left-ext size of left extension <int> [ size of the left extension of the peak. default: 0bp ]
--right-ext size of right extension <int> [ size of the right extension of the peak. default: 0bp ]
--genes-to-plot genes names <string> [ names of genes to show on the plot. default: none.
If no genes were provided, all genes in the peak will be plotted ]
--plot-layout plot orientation <string> [ orientation of the peak plot (wide or narrow). default: narrow ]
Results will be written to a subdirectory named sv-hotspot-output inside OUTPUT directory specified in the command.
There are two main output files:
-
annotated_peaks_summary.tsv
: this file has all information about identified peaks. -
genes.associated.with.SVs.tsv
: this file contains statisitcal information for all genes affected by SVs. -
In addition, SV-HotSpot provides various visualizations composed of overlaying tracks representing copy number aggregation, SV aggregation, and gene/regulatory/region of interest annotation tracks.
Additional files:
- Peaks files for each chromosome and their corresponding figures. These files are located in the
peaks
folder. - UCSC custom track files. These files are located in
peaks/ucsc_custom_track_files
. These files can be viewed on the UCSC Genome Browser.
In some cases when the number of detected peaks is high, it is impractical to plot all peaks as this process takes long time. Thus, we set the tool to plot only the top # of peaks (default is 10). In case you need to increase/decrease this number, you need to provide this parameter --plot-top-peaks=#
with the number of peaks required. Set this parameter to 0 in case you do not want to plot any peaks.
To plot peak(s), we provided a script for this process. You just need to provide the peak name(s), SV file, the results directory, the expression, and copy number data with the remaining parameters shown above. Peak names must be separated by comma. To show the usage page of this command type the following command:
plot-peak.pl
USAGE:
plot-peak.pl [OPTIONS] -p <peakName1,peakName2,...> --sv <structuralVariants> --res-dir <resultsDirectory> -e/--expr <expression> -c/--cn <copynumber>
NOTE:
(1) Results directory should be the same as the output directory used with sv-hotspot.pl
OPTIONS:
-a/--annot Annotation file <filename> [ an annotation file in BED format ]
-o/--output output directory <string> [ default: ./ ]
--t-amp amplification threshold <float/int> [ threshold for copy number amplifications. default: 2.8 ]
--t-del deletion threshold <float/int> [ threshold for copy number deletions. default: 0.5 ]
--chip-cov chip-seq coverage <filename> [ If ChIP-Seq coverage file is provided, peaks will be overlapped with this file ]
--chip-cov-lbl chip-seq coverage label <string> [ the chip-seq coverage label used in the plot (e.g. histone name) ]
--left-ext size of left extension <int> [ number of extended bases from the left side of the peak. default: 0 ]
--right-ext size of right extension <int> [ number of extended bases from the right side of the peak. default: 0 ]
As an example, the following command plots peak "pX.55.1" from the above test.
PATH_TO_SV_HOTSPOT/src/plot-peak.pl -p pX.55.1 \
--sv PATH_TO_SV_HOTSPOT/test_data/sv.bedpe \
--res-dir /RESULTS/PATH \
-e PATH_TO_SV_HOTSPOT/test_data/exp.tsv \
-c PATH_TO_SV_HOTSPOT/test_data/cna.tsv \
--chip-cov PATH_TO_SV_HOTSPOT/test_data/H3K27ac.bg \
-r PATH_TO_SV_HOTSPOT/test_data/enhancers.bed \
-o /SOME/PATH --t-amp 2.99 --t-del 1.35 --chip-cov-lbl H3K27ac \
--left-ext 100000 --rigth-ext 100000
To use SV-HotSpot, a docker image has been created and tested on Linux and Mac. To run SV-HotSpot, you need to have Docker installed on your machine.
- Ubuntu: follow the instructions to get Docker CE for Ubuntu.
- Mac: follow the instructions to install the Stable verion of Docker CE on Mac.
To obtain the latest docker image, run the following in your command line:
docker pull chrismaherlab/sv-hotspot
To test the image, run the following command which shows the usage of this tool:
docker run chrismaherlab/sv-hotspot sv-hotspot
We have provided an example data available in "test_data" folder specifically for identifying SV hotspots affecting androgen receptor (AR) gene. To read more about this study, please refer to this Cell paper.
To run SV-HotSpot on this test data, use the following command:
docker run -v /local/folder:/data chrismaherlab/sv-hotspot sv-hotspot -g hg38 -C chrX \
--sv /data/test_data/sv.bedpe -e /data/test_data/exp.tsv \
-c /data/test_data/cna.tsv --chip-cov /data/test_data/H3K27ac.bg \
-r /data/test_data/enhancers.bed -o /data/SV-HOTSPOT-TEST \
-w 100000 -s 30000 -d 50000 --merge-pct-samples 5 \
-t-amp 2.99 --t-del 1.35 --stat-test wilcox.test --pval 0.05 \
--chip-cov-lbl H3K27ac --left-ext 100000 --right-ext 100000 \
--plot-top-peaks 1 --genes-to-plot AR
- Note, the -v flags map your local filesystem locations to a “location” within the Docker image. Therefore, you need to change
/local/folder
to your local folder on your machine. This folder must contain the "test_data" folder. The final output will be sent to this folder as well. Use"$PWD"
in case you want to use the current directroy. - Please note that you need to provide the absolute path for
/local/folder
. /data
is a folder on the image container used to receive the input data that was mapped by -v flag. The SV-HotSpot pipeline is configured with this folder and it has to be provided the same way as in the command above.
Once hotspots of SVs are identified, users should perform post-processing and filtering. SV-HotSpot provides a script to perform post-filtering of the results based on multiple criteria including filtering of genes with low expression and/or weak expression association (p-value, logFC, mean group expression), removal of wide peaks not associated with known cancer genes (eg. using COSMIC census genes), removal of peaks with low number of samples harboring hotspot SVs. To performed filtering, run the filter.r
command as follows:
Usage: Rscript filter.r
<SV-HotSpot_result_dir>
<max.p-value-to-infer-expression-association>
<min.logfc>
<min.group.expr>
<max.number.of.associated.gene.per.peak, peak w/ more genes associated are ignored>
<max.peak.length, peak wider than this w/o assoc. w/ known genes are ignored>
<min.peak.percent.samples, 0-100, peaks with percent of samples smaller than this are ignored>
<tsv_file_of_cosmic_census_genes>
<csv_other_known_genes_to_keep>
Example:
Rscript filter.r /path/to/sv-hotspot-output 0.05 0.22 10 9 500000 15 \
data/cosmic_census_genes.tsv AR,ERG,PTEN,TP53
Output will be written to SV-HotSpot result directory which includes the following two files:
1) annotated_peaks_summary.filtered.tsv
2) genes.associated.with.SVs.filtered.tsv
The top hotspots are automatically visualized when running SV-Hotspot detection (parameter --plot-top-peaks=#
when running SV-HotSpot detection). Additionally, users can plot hotspots identified by SV-HotSpot independently of SV-Hotspot detection. To plot hotspot sites (or peaks), run the following command:
To show the usage page of this script, run the following command:
docker run chrismaherlab/sv-hotspot plot-peak
To plot peaks, use the following command which plots pX.172 and pX.173 peaks (peak name(s) taken from "annotated_peaks_summary.tsv" output file). You will also need to provide the SV calls file used to run SV-Hotspot, path to SV-HotSpot results directory, expression and copy number data, and the remaining parameters shown in SV-HotSpot detection command above. Peak names must be separated by comma and no space between them.
docker run -v "$PWD":/data chrismaherlab/sv-hotspot plot-peak \
-p pX.55.1 --genes-to-plot AR \
--res-dir /data/SV-HOTSPOT-TEST/sv-hotspot-output \
-o /data/SV-HOTSPOT-TEST/sv-hotspot-output \
--sv /data/test_data/sv.bedpe -e /data/test_data/exp.tsv \
-c /data/test_data/cna.tsv --chip-cov /data/test_data/H3K27ac.bg \
-r /data/test_data/enhancers.bed --t-amp 2.99 --t-del 1.35 \
--chip-cov-lbl H3K27ac --left-ext 100000 --right-ext 100000
res-dir
must refer to the folder containing the results generated from running sv-hospot command with-o
option. Please note that you need to include "sv-hotspot-output" at the end ofres-dir
path since SV-HotSpot always creates this folder which is used to write all output results.
SV-HotSpot also provides whole genome level visualization of hotspot sites. The script plot_whole_genome.r
can be used to generate the histogram aggregation of the counts of samples harboring structural variants at whole genome level (circos plot) or individual chromosome level. This script can be run as follows.
Usage: Rscript plot_whole_genome.r
<SV-Hotspot_result_dir>
<annotation_dir>
<genome_assembly_version, eg. hg38>
<plot_circos, eg. TRUE, FALSE>
<chromosomes_to_plot, eg. "ALL", "chr1,chrX">
<genes_to_show, eg. "ERG,PTEN,ETV1">
<color_genes_by_association_direction_with_SV, eg. "TRUE", "FALSE">
<output_figure_format, eg. "png", "pdf">
<output_figure_dir>
Example:
Rscript plot_whole_genome.r /path/to/sv-hotspot-output /path/to/annotations \
hg38 TRUE chr1,chrX,chr21 ERG,AR,PTEN TRUE png out
- Quigley, D. A., Dang, H. X., Zhao, S. G., Lloyd, P., Aggarwal, R., Alumkal, J. J., ... & Playdle, D. (2018). Genomic hallmarks and structural variation in metastatic prostate cancer. Cell, 174(3), 758-769.
Eteleeb, A.M., Quigley, D.A., Zhao, S.G. et al. SV-HotSpot: detection and visualization of hotspots targeted by structural variants associated with gene expression. Sci Rep 10, 15890 (2020). https://doi.org/10.1038/s41598-020-71168-7
Abdallah Eteleeb: [email protected]