This is a pipeline to run basic Chromatin immunoprecipitation sequencing analysis for single-end or paired-end data.
This is a package of bash scripts that enable reading, processing and analysis of ChIP-seq Omics' datasets. This package implements the Snakemake management workflow system and is currently implemented to work with the cluster management and job scheduling system SLURM. This snakemake workflow utilizes conda installations to download and use packages for further analysis, so please ensure that you have installed miniconda prior to use.
Please add an issue to the ChIP-seq repository. We would appreciate if your issue included sample code/files (as appropriate) so that we can reproduce your bug/issue.
We welcome contributors! For your pull requests, please include the following:
Sample code/file that reproducibly causes the bug/issue Documented code providing fix Unit tests evaluating added/modified methods.
- Alignment output
- QC metrics = for each sample, total reads, alignment rate, unique reads, duplicate read rate
- Bam files with high-quality, mapped, and de-duplicated reads per sample.
- Peak calling + consensus peaks output
- QC metrics = for each sample, total peaks, peaks in consensus catalog (CC), % reads in CC
- Narrow or broad peak files per sample.
- Consensus peaks = for each factor, a list of genomic intervals where a peak is in n replicates in at least one condition.
- Raw counts table generated by
bedtools multicov
per factor.
- Differential peaks output
- QC metrics = for each factor and contrast combo, the no. of significant peaks, split by up and down regulation
- DESeq2 output for each unique combination of contrasts
- PCA plot for each factor
- Essential report
- This report aggregates all QC metrics at the above steps and allows the user to add custom content via the
config.yaml
file.
- This report aggregates all QC metrics at the above steps and allows the user to add custom content via the
The data should be in the shape of a square or rectangle where all antibodies are treated to all conditions with equal replicates. Something like this:
Cond | Replicate | Transcription Factor / Histone Mark |
---|---|---|
A | 1 | X |
A | 2 | X |
B | 1 | X |
B | 2 | X |
C | 1 | X |
C | 2 | X |
D | 1 | Y |
D | 2 | Y |
E | 1 | Y |
E | 2 | Y |
F | 1 | Y |
F | 2 | Y |
This is not so important for read alignment and peak calling, but will be important for merging counts tables across factors followed by differential analysis.
Locate raw files.
$ cd /path/to/raw/data
$ ls -alh
Check md5sum.
$ md5sum –c md5sum.txt > md5sum_out.txt
Move your files into the archive to be stored.
$ mv /path/to/raw/data /path/to/archive
Check md5sum again in the archive to ensure your sequencing files are not corrupted.
$ md5sum –c md5sum.txt > md5sum_out.txt
Clone the ChIP-seq Pipeline into your working directory.
$ git clone https://github.com/ohsu-cedar-comp-hub/ChIP-seq-pipeline.git
Create a samples/cases
, a samples/controls
directory, and a logs
directory in your wdir()
.
$ mkdir logs
$ mkdir samples
$ mkdir samples/cases
$ mkdir samples/controls
Symbollically link the fastq files of your samples to the wdir/samples/cases
directory using a bash script loop in your terminal.
- For PE data:
$ cd samples/cases
$ ls -1 /path/to/data/LIB*R1*gz | while read gz; do
R1=$( basename $gz | cut -d _ -f 2 | awk '{print $1"_R1.fastq.gz"}' )
R2=$( basename $gz | cut -d _ -f 2 | awk '{print $1"_R2.fastq.gz"}' )
echo $R1 : $R2
ln -s $gz ./$R1
ln -s ${gz%R1_001.fastq.gz}R2_001.fastq.gz ./$R2
done
- For SE data:
$ cd samples/cases
$ ls -1 /path/to/data/LIB*gz | while read gz; do
R=$( basename $gz | cut -d '_' -f 2 | awk '{print $1".fastq.gz"}' )
echo $R
ln -s $gz ./$R
done
Then move your controls that you have defined in the metadata.txt
file to samples/controls
.
$ mv samples/cases/{FileName1,FileNameN}.fastq.gz samples/controls
Lastly your filenames should be formatted like so:
{condition}{replicate}_{factor}_[R1|R2].fastq.gz
In the case of MOLM24D1_MYC_R1.fastq.gz
, MOLM24D is condition, 1 is replicate, and MYC factor.
Upload your metadata file to the data
directory, with the correct columns:
- This file should be formatted as such:
SampleID Control Peak_call
H3K4Me3-1 samples/bed/IgG.bed narrow
- Each row should be a
CASE
(i.e. not an Input/IgG), with theControl
for that sample listed (written assamples/bed/{ControlID}.bed
) and the type of peak calling you would like (eitherbroad
ornarrow
) - All elements in this file should be tab-separated
- An example file is located in this repository here:
data/metadata.txt
Edit the omic_config.yaml
in your wdir()
:
- Write the full path towards your
metadata.txt
file under thesamples
category - Specify the genome
assembly
you would like to use- Current options for this are: hg19 and hg38
- Specify your
seq_type
- Either
SE
orPE
- Either
Do a dry-run of snakemake to ensure proper execution before submitting it to the cluster (in your wdir).
$ snakemake -np --verbose
To configure your snakemake profile for a SLURM system (required), copy the slurm
folder to ~/.config/snakemake
, and delete the local copy. Once your files are symbolically linked, you can submit the job to exacloud via your terminal window.
$ snakemake -j 99 --use-conda --rerun-incomplete --latency-wait 60 --keep-going --profile slurm --cluster-config cluster.yaml
To see how the job is running, look at your queue.
$ squeue -u your_username
Once you have finished the pipeline, you can generate a report via the config.yaml
file by ticking the gen_report
flag yes, filling out subsequent fields, and re-running SnakeMake again.