ChIP-seq-pipeline

This is a pipeline to run basic Chromatin immunoprecipitation sequencing analysis for single-end or paired-end data.

This is a package of bash scripts that enable reading, processing and analysis of ChIP-seq Omics' datasets. This package implements the Snakemake management workflow system and is currently implemented to work with the cluster management and job scheduling system SLURM. This snakemake workflow utilizes conda installations to download and use packages for further analysis, so please ensure that you have installed miniconda prior to use.

Questions/issues

Please add an issue to the ChIP-seq repository. We would appreciate if your issue included sample code/files (as appropriate) so that we can reproduce your bug/issue.

Contributing

We welcome contributors! For your pull requests, please include the following:

Sample code/file that reproducibly causes the bug/issue Documented code providing fix Unit tests evaluating added/modified methods.

Features exclusive to Maxson-Braun Lab fork

Alignment output
- QC metrics = for each sample, total reads, alignment rate, unique reads, duplicate read rate
- Bam files with high-quality, mapped, and de-duplicated reads per sample.
Peak calling + consensus peaks output
- QC metrics = for each sample, total peaks, peaks in consensus catalog (CC), % reads in CC
- Narrow or broad peak files per sample.
- Consensus peaks = for each factor, a list of genomic intervals where a peak is in n replicates in at least one condition.
- Raw counts table generated by bedtools multicov per factor.
Differential peaks output
- QC metrics = for each factor and contrast combo, the no. of significant peaks, split by up and down regulation
- DESeq2 output for each unique combination of contrasts
- PCA plot for each factor
Essential report
- This report aggregates all QC metrics at the above steps and allows the user to add custom content via the config.yaml file.

Assumptions of the data

The data should be in the shape of a square or rectangle where all antibodies are treated to all conditions with equal replicates. Something like this:

Cond	Replicate	Transcription Factor / Histone Mark
A	1	X
A	2	X
B	1	X
B	2	X
C	1	X
C	2	X
D	1	Y
D	2	Y
E	1	Y
E	2	Y
F	1	Y
F	2	Y

This is not so important for read alignment and peak calling, but will be important for merging counts tables across factors followed by differential analysis.

Usage:

Locate raw files.

$ cd /path/to/raw/data
$ ls -alh

Check md5sum.

$ md5sum –c md5sum.txt > md5sum_out.txt

Move your files into the archive to be stored.

$ mv /path/to/raw/data /path/to/archive

Check md5sum again in the archive to ensure your sequencing files are not corrupted.

$ md5sum –c md5sum.txt > md5sum_out.txt

Clone the ChIP-seq Pipeline into your working directory.

$ git clone https://github.com/ohsu-cedar-comp-hub/ChIP-seq-pipeline.git

Create a samples/cases, a samples/controls directory, and a logs directory in your wdir().

$ mkdir logs
$ mkdir samples
$ mkdir samples/cases
$ mkdir samples/controls

Symbollically link the fastq files of your samples to the wdir/samples/cases directory using a bash script loop in your terminal.

For PE data:

$ cd samples/cases
$ ls -1 /path/to/data/LIB*R1*gz | while read gz; do
    R1=$( basename $gz | cut -d _ -f 2 | awk '{print $1"_R1.fastq.gz"}' )
    R2=$( basename $gz | cut -d _ -f 2 | awk '{print $1"_R2.fastq.gz"}' )
    echo $R1 : $R2
    ln -s $gz ./$R1
    ln -s ${gz%R1_001.fastq.gz}R2_001.fastq.gz ./$R2
done

For SE data:

$ cd samples/cases
$ ls -1 /path/to/data/LIB*gz | while read gz; do 
  R=$( basename $gz | cut -d '_' -f 2 | awk '{print $1".fastq.gz"}' )
  echo $R
  ln -s $gz ./$R
done

Then move your controls that you have defined in the metadata.txt file to samples/controls.

$ mv samples/cases/{FileName1,FileNameN}.fastq.gz samples/controls

Lastly your filenames should be formatted like so:

{condition}{replicate}_{factor}_[R1|R2].fastq.gz

In the case of MOLM24D1_MYC_R1.fastq.gz, MOLM24D is condition, 1 is replicate, and MYC factor.

Upload your metadata file to the data directory, with the correct columns:

This file should be formatted as such:

SampleID  Control   Peak_call
H3K4Me3-1  samples/bed/IgG.bed  narrow

Each row should be a CASE (i.e. not an Input/IgG), with the Control for that sample listed (written as samples/bed/{ControlID}.bed) and the type of peak calling you would like (either broad or narrow)
All elements in this file should be tab-separated
An example file is located in this repository here: data/metadata.txt

Edit the omic_config.yaml in your wdir():

Write the full path towards your metadata.txt file under the samples category
Specify the genome assembly you would like to use
- Current options for this are: hg19 and hg38
Specify your seq_type
- Either SE or PE

Do a dry-run of snakemake to ensure proper execution before submitting it to the cluster (in your wdir).

$ snakemake -np --verbose

To configure your snakemake profile for a SLURM system (required), copy the slurm folder to ~/.config/snakemake, and delete the local copy. Once your files are symbolically linked, you can submit the job to exacloud via your terminal window.

$ snakemake -j 99 --use-conda --rerun-incomplete --latency-wait 60 --keep-going --profile slurm --cluster-config cluster.yaml

To see how the job is running, look at your queue.

$ squeue -u your_username

Once you have finished the pipeline, you can generate a report via the config.yaml file by ticking the gen_report flag yes, filling out subsequent fields, and re-running SnakeMake again.

Name		Name	Last commit message	Last commit date
Latest commit History 110 Commits
config		config
envs		envs
rules		rules
scripts		scripts
slurm		slurm
.gitignore		.gitignore
README.md		README.md
Snakefile		Snakefile
cluster.yaml		cluster.yaml
config.yaml		config.yaml
rulegraph.svg		rulegraph.svg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ChIP-seq-pipeline

Questions/issues

Contributing

Features exclusive to Maxson-Braun Lab fork

Assumptions of the data

Usage:

Methods

About

Releases

Packages

Languages

maxsonBraunLab/ChIP-seq-pipeline-----old

Folders and files

Latest commit

History

Repository files navigation

ChIP-seq-pipeline

Questions/issues

Contributing

Features exclusive to Maxson-Braun Lab fork

Assumptions of the data

Usage:

Methods

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages