ChIP-seq-pipeline

This is a pipeline to run basic chromatin immunoprecipitation sequencing (ChIP-seq) analysis for single-end or paired-end data.

This is a package of bash scripts that enables reading, processing, and analysis of ChIP-seq datasets. The package is built on the Snakemake workflow management system and is currently set up to work with the SLURM cluster management and job scheduling system. The workflow uses conda environments to download and manage the packages it needs, so please ensure that you have installed miniconda prior to use.
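
If Snakemake is not already available on your system, a minimal setup sketch (assuming miniconda is already installed; the environment name chip-seq and the unpinned Snakemake version are placeholders, not requirements of this repository) would be:

$ conda create -n chip-seq -c conda-forge -c bioconda snakemake
$ conda activate chip-seq
$ snakemake --version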

Questions/issues

Please add an issue to the ChIP-seq repository. We would appreciate it if your issue included sample code/files (as appropriate) so that we can reproduce your bug/issue.

Contributing

We welcome contributors! For your pull requests, please include the following:

  • Sample code/file that reproducibly causes the bug/issue
  • Documented code providing the fix
  • Unit tests evaluating added/modified methods

Features exclusive to Maxson-Braun Lab fork

  • Alignment output
    • QC metrics = for each sample, total reads, alignment rate, unique reads, duplicate read rate
    • Bam files with high-quality, mapped, and de-duplicated reads per sample.
  • Peak calling + consensus peaks output
    • QC metrics = for each sample, total peaks, peaks in consensus catalog (CC), % reads in CC
    • Narrow or broad peak files per sample.
    • Consensus peaks = for each factor, a list of genomic intervals where a peak is in n replicates in at least one condition.
    • Raw counts table generated by bedtools multicov per factor.
  • Differential peaks output
    • QC metrics = for each factor and contrast combination, the number of significant peaks, split by up- and down-regulation
    • DESeq2 output for each unique combination of contrasts
    • PCA plot for each factor
  • Essential report
    • This report aggregates all QC metrics at the above steps and allows the user to add custom content via the config.yaml file.

Assumptions of the data

The data should be in the shape of a square or rectangle, with all antibodies treated across all conditions with an equal number of replicates. Something like this:

Cond  Replicate  Transcription Factor / Histone Mark
A     1          X
A     2          X
B     1          X
B     2          X
C     1          X
C     2          X
D     1          Y
D     2          Y
E     1          Y
E     2          Y
F     1          Y
F     2          Y

This is not so important for read alignment and peak calling, but will be important for merging counts tables across factors followed by differential analysis.

Usage:

Locate raw files.

$ cd /path/to/raw/data
$ ls -alh

Check md5sum.

$ md5sum -c md5sum.txt > md5sum_out.txt

Move your files into the archive to be stored.

$ mv /path/to/raw/data /path/to/archive

Check md5sum again in the archive to ensure your sequencing files are not corrupted.

$ md5sum -c md5sum.txt > md5sum_out.txt
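
md5sum -c writes one line per file ending in OK or FAILED; one way to surface only the failures from the saved output (assuming GNU coreutils md5sum) is:

$ grep -v ': OK$' md5sum_out.txt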

Clone the ChIP-seq Pipeline into your working directory.

$ git clone https://github.com/ohsu-cedar-comp-hub/ChIP-seq-pipeline.git

Create samples/cases, samples/controls, and logs directories in your working directory.

$ mkdir logs
$ mkdir samples
$ mkdir samples/cases
$ mkdir samples/controls

Symbolically link the fastq files of your samples into the samples/cases directory using a bash loop in your terminal.

  • For PE data:
$ cd samples/cases
$ ls -1 /path/to/data/LIB*R1*gz | while read gz; do
    # use the second underscore-delimited field of the file name as the sample name
    R1=$( basename $gz | cut -d _ -f 2 | awk '{print $1"_R1.fastq.gz"}' )
    R2=$( basename $gz | cut -d _ -f 2 | awk '{print $1"_R2.fastq.gz"}' )
    echo $R1 : $R2
    # link the original R1/R2 files under the simplified names
    ln -s $gz ./$R1
    ln -s ${gz%R1_001.fastq.gz}R2_001.fastq.gz ./$R2
done
  • For SE data:
$ cd samples/cases
$ ls -1 /path/to/data/LIB*gz | while read gz; do
  # use the second underscore-delimited field of the file name as the sample name
  R=$( basename $gz | cut -d '_' -f 2 | awk '{print $1".fastq.gz"}' )
  echo $R
  ln -s $gz ./$R
done

Then move your controls that you have defined in the metadata.txt file to samples/controls.

$ mv samples/cases/{FileName1,FileNameN}.fastq.gz samples/controls
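
Since the downstream rules read these symlinks, it can be worth confirming that none of them are broken before continuing. One way to list broken links from your working directory (assuming GNU find; no output means all links resolve) is:

$ find samples -xtype l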

Lastly your filenames should be formatted like so:

{condition}{replicate}_{factor}_[R1|R2].fastq.gz

In the case of MOLM24D1_MYC_R1.fastq.gz, MOLM24D is the condition, 1 is the replicate, and MYC is the factor.

Upload your metadata file to the data directory, with the correct columns:

  • This file should be formatted as such:
SampleID   Control              Peak_call
H3K4Me3-1  samples/bed/IgG.bed  narrow
  • Each row should be a CASE (i.e. not an Input/IgG), with the Control for that sample listed (written as samples/bed/{ControlID}.bed) and the type of peak calling you would like (either broad or narrow)
  • All elements in this file should be tab-separated
  • An example file is located in this repository at data/metadata.txt (a fuller sketch follows this list)
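
As a sketch, a metadata.txt covering two replicates of a narrow mark and one broad mark (sample and control names are placeholders; the columns must be separated by tabs even though they are shown aligned here) could look like:

SampleID    Control              Peak_call
H3K4Me3-1   samples/bed/IgG.bed  narrow
H3K4Me3-2   samples/bed/IgG.bed  narrow
H3K27Me3-1  samples/bed/IgG.bed  broad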

Edit the omic_config.yaml in your working directory (a sketch of the relevant fields follows this list):

  • Write the full path towards your metadata.txt file under the samples category
  • Specify the genome assembly you would like to use
    • Current options for this are: hg19 and hg38
  • Specify your seq_type
    • Either SE or PE
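
A sketch of the relevant omic_config.yaml fields (the key names follow the options described above and may differ slightly from the shipped config; the metadata path is a placeholder):

samples: /path/to/wdir/data/metadata.txt
assembly: hg38
seq_type: PE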

Do a dry run of Snakemake in your working directory to ensure proper execution before submitting it to the cluster.

$ snakemake -np --verbose

To configure your Snakemake profile for a SLURM system (required), copy the slurm folder to ~/.config/snakemake and delete the local copy (a sketch of this step follows). Once your files are symbolically linked, you can submit the job to exacloud via your terminal window with the command shown after the sketch.
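
Assuming the profile directory in the cloned repository is named slurm, the copy step might look like:

$ mkdir -p ~/.config/snakemake
$ cp -r slurm ~/.config/snakemake/
$ rm -r slurm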

$ snakemake -j 99 --use-conda --rerun-incomplete --latency-wait 60 --keep-going --profile slurm --cluster-config cluster.yaml
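
The command above runs the Snakemake scheduler process in your login shell. On many SLURM clusters it is common, though not described in this repository, to wrap that scheduler in its own batch job so it survives a dropped connection; a sketch (job name and time limit are arbitrary) is:

$ sbatch --job-name=chipseq --time=24:00:00 --wrap="snakemake -j 99 --use-conda --rerun-incomplete --latency-wait 60 --keep-going --profile slurm --cluster-config cluster.yaml"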

To see how the job is running, look at your queue.

$ squeue -u your_username

Once the pipeline has finished, you can generate a report via the config.yaml file by setting the gen_report flag to yes, filling out the subsequent fields, and re-running Snakemake.
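
As a sketch, the relevant part of config.yaml before the re-run might look like the following (the additional report fields are defined in the shipped config and are not reproduced here):

gen_report: yes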

Methods

Example Workflow
