This is a pipeline to run basic Chromatin immunoprecipitation sequencing analysis for single-end or paired-end data.
This is a package of bash scripts that enable reading, processing and analysis of ChIP-seq Omics' datasets. This package implements the Snakemake management workflow system and is currently implemented to work with the cluster management and job scheduling system SLURM. This snakemake workflow utilizes conda installations to download and use packages for further analysis, so please ensure that you have installed miniconda prior to use.
Please add an issue to the ChIP-seq repository. We would appreciate if your issue included sample code/files (as appropriate) so that we can reproduce your bug/issue.
We welcome contributors! For your pull requests, please include the following:
Sample code/file that reproducibly causes the bug/issue Documented code providing fix Unit tests evaluating added/modified methods.
Locate raw files.
$ cd /path/to/raw/data
$ ls -alh
Check md5sum.
$ md5sum –c md5sum.txt > md5sum_out.txt
Move your files into the archive to be stored.
$ mv /path/to/raw/data /path/to/archive
Check md5sum again in the archive to ensure your sequencing files are not corrupted.
$ md5sum –c md5sum.txt > md5sum_out.txt
Clone the ChIP-seq Pipeline into your working directory.
$ git clone https://github.com/ohsu-cedar-comp-hub/ChIP-seq-pipeline.git
Create a samples/cases
, a samples/controls
directory, and a logs
directory in your wdir()
.
$ mkdir logs
$ mkdir samples
$ mkdir samples/cases
$ mkdir samples/controls
Symbollically link the fastq files of your samples to the wdir/samples/cases
directory using a bash script loop in your terminal.
- For PE data:
$ cd samples/cases
$ ls -1 /path/to/data/LIB*R1*gz | while read gz; do
R1=$( basename $gz | cut -d _ -f 2 | awk '{print $1"_R1.fastq.gz"}' )
R2=$( basename $gz | cut -d _ -f 2 | awk '{print $1"_R2.fastq.gz"}' )
echo $R1 : $R2
ln -s $gz ./$R1
ln -s ${gz%R1_001.fastq.gz}R2_001.fastq.gz ./$R2
done
- For SE data:
$ cd samples/cases
$ ls -1 /path/to/data/LIB*gz | while read gz; do
R=$( basename $gz | cut -d '_' -f 2 | awk '{print $1".fastq.gz"}' )
echo $R
ln -s $gz ./$R
done
Then move your controls that you have defined in the metadata.txt
file to samples/controls
.
$ mv samples/cases/{FileName1,FileNameN}.fastq.gz samples/controls
Upload your metadata file to the data
directory, with the correct columns:
- This file should be formatted as such:
SampleID Control Peak_call
H3K4Me3-1 samples/bed/IgG.bed narrow
- Each row should be a
CASE
(i.e. not an Input/IgG), with theControl
for that sample listed (written assamples/bed/{ControlID}.bed
) and the type of peak calling you would like (eitherbroad
ornarrow
) - All elements in this file should be tab-separated
- An example file is located in this repository here:
data/metadata.txt
Edit the omic_config.yaml
in your wdir()
:
- Write the full path towards your
metadata.txt
file under thesamples
category - Specify the genome
assembly
you would like to use- Current options for this are: hg19 and hg38
- Specify your
seq_type
- Either
SE
orPE
- Either
Do a dry-run of snakemake to ensure proper execution before submitting it to the cluster (in your wdir).
$ snakemake -np --verbose
Once your files are symbolically linked, you can submit the job to exacloud via your terminal window.
$ sbatch submit_snakemake.sh
To see how the job is running, look at your queue.
$ squeue -u your_username