Summit is a re-interpretation of GoPeaks written in Python to call narrow and broad peaks in epigenetic CUT&Tag/CUT&RUN/ATAC-Seq datasets. The major changes in this refactor is a signficant reduction of computational demands (please see below). If you use GoPeaks or Summit in your publication, please cite the original paper:
Yashar, W.M., Kong, G., VanCampen, J. et al. GoPeaks: histone modification peak calling for CUT&Tag. Genome Biol 23, 144 (2022). https://doi.org/10.1186/s13059-022-02707-w
$ python summit.py -h
usage: summit.py [-h] -b BAM -o OUTPUT [-c CONTROL] [-m MINREADS] [-d MINDIST] [-w MINWIDTH] [-t STEP] [-l SLIDE] [-p PVAL] [-s CHROMSIZE] [--broad] [--chromosomes CHROMOSOMES] [--canonical [CANONICAL]] [-v] [--version]
Summit is a peak caller designed for CUT&TAG/CUT&RUN sequencing data.
Summit by default works best with narrow peaks such as H3K4me3 and
transcription factors. Summit can be used with the "--broad" flag to
call broad peaks like H3K27Ac/H3K4me1. We encourage users to explore
the parameters of Summit to analyze their data.
options:
-h, --help show this help message and exit
Required Arguments:
-b BAM, --bam BAM Input BAM file with histone mark or protein-of-interest
-o OUTPUT, --output OUTPUT
Output prefix e.g. data/peaks/sample-1
Summit Parameters:
-c CONTROL, --control CONTROL
Control BAM file
-m MINREADS, --minreads MINREADS
Analyze bins greater than mindreads. Default: 15
-d MINDIST, --mindist MINDIST
Merge bins within <mindist> base pairs. Default: 1000
-w MINWIDTH, --minwidth MINWIDTH
Remove peaks less than <mindwidth> base pairs wide. Default: 150
-t STEP, --step STEP Bin size for genome bins. Default: 100
-l SLIDE, --slide SLIDE
Slide size for genome bins. Default: 50
-p PVAL, --pval PVAL Filter away bins above <pval> significance level. Default: 0.05
-s CHROMSIZE, --chromsize CHROMSIZE
Tab-delimited file containing chromosome name and sizes.
Optional Arguments:
--broad Call broad peaks with step=5000, slide=1000, mindist=3000.
--chromosomes CHROMOSOMES
Use a regex string to subset chromosomes.
--canonical [CANONICAL]
Subset canonically-named chromosomes. Default: 'chr[0-9XY]+$'
-v, --verbose Run the program in verbose mode
--version Display Summit version
Sequencing Modality | Protein of Interest | Recommended Parameters |
---|---|---|
CUT&Tag or CUT&RUN | Transcription factor | Default parameters |
CUT&Tag or CUT&RUN | Narrow histone mark (H3K4me3) | Default parameters |
CUT&Tag or CUT&RUN | Broad histone mark (H3K4me1, H3K27Ac, H3K27me3) | --broad |
ATAC-Seq | Open chromatin | Default parameters |
Summit is meant to be an ultrafast and light adaptation to the original GoPeaks algorithm, while keeping true to the previous findings. The following benchmarks come from the same samples used in the original paper found in GoPeaks. Broadly, these data contained previously published data (K562 samples from Kaya-Okur et. al. 2019) and in-house cell-line data (Kasumi samples from GSE190793). The following benchmarks were obtained from Oregon Health and Science Univeristy's high-performance clusters (HPCs) Exacloud.
Max USS tracks the maximal memory used in a process. This metric is commonly employed in HPCs to kill any process that takes up much RAM. While CUT&Tag data frequently use 3-8 million paired-end reads (<200MB), GoPeaks requires a prohibitively high amount of memory (20-30GBs) to process the data. By comparison, Summit typically uses less than 2GB of memory with very little deviation between runs (individual points). This makes Summit between 90-95% more efficient than GoPeaks!
Not only is Summit more memory-efficient than GoPeaks, Summit commonly takes 1/2 to 1/3 the time to process the same dataset regardless of input sequencing depth.
In addition to reduced overall processing time, Summit requires less CPU time. This is the time that the CPU is actually executing a command, as opposed to waiting for data input/output.
Memory Usage | Time Usage |
---|---|
If we take an example dataset and downsampled the depth from 90-20%, we can see the effects of memory requirement and program duration over many sequencing depths for a specific (but representative) sample. In the plots above, we can observe the following:
- Maximal memory consumption is invariant to sequencing depth for Summit.
- Summit's program duration increases linearly with sequencing depth. GoPeaks also increases linearly, but with much higher deviation in its trajectory.
- While the time differences are not significantly different (e.g. 6 minutes vs. 3 minutes), the increased efficiency of Summit over GoPeaks allows users to call peaks on samples with very high sequencing depth, or process hundreds of samples simultaneously without requesting prohibitively high computational resources.
Treated sample read depths ranged between 3M to 30M as show below:
K562_1_H3K27Ac.sorted.markd.bam 4_278_700
K562_1_H3K4me1.sorted.markd.bam 6_867_838
K562_1_H3K4me2.sorted.markd.bam 6_231_210
K562_1_H3K4me3.sorted.markd.bam 7_478_036
K562_1_IgG.sorted.markd.bam 3_073_808
K562_2_H3K27Ac.sorted.markd.bam 5_770_016
K562_2_H3K4me1.sorted.markd.bam 8_550_358
K562_2_H3K4me2.sorted.markd.bam 6_256_098
K562_2_H3K4me3.sorted.markd.bam 7_490_544
K562_2_IgG.sorted.markd.bam 1_891_652
Kasumi_1_H3K27Ac.sorted.markd.bam 11_793_316
Kasumi_1_H3K4me3.sorted.markd.bam 30_640_304
Kasumi_1_IgG.sorted.markd.bam 3_213_274
Kasumi_2_H3K27Ac.sorted.markd.bam 10_169_058
Kasumi_3_H3K27Ac.sorted.markd.bam 5_157_920
Kasumi_3_H3K4me3.sorted.markd.bam 13_668_580