This pipeline collects ChIP-Seq data for human (Homo sapiens) and fruit fly (Drosophila melanogaster) from various sources and experiments in order to update the Motif Assessment and Ranking Suite (MARS) benchmark data.
In this pipeline we collated ChIP-Seq data from the following sources:
- Humans
  - ENCODE Consortium
  - GEO datasets
  - Sequence Read Archive (SRA)
- Drosophila
  - ModERN
  - GEO datasets
  - modENCODE
ChIP-Seq data collected from a source outside the ENCODE Consortium are processed with the ENCODE ChIP-Seq pipeline.
Set up the ENCODE ChIP-Seq pipeline following this guide.
Activate the pipeline's conda environment:
conda activate encode-chip-seq-pipeline
Install the additional tools into the pipeline environment:
conda install --file requirements.txt
GEO experiments were searched for and downloaded in SOFT file format using a custom script, geo_search.sh.
Edit the input.json file to set the organism you are working with, i.e. Drosophila melanogaster or Homo sapiens:
"organism" : "Drosophila melanogaster"
This step excludes experiments that do not contain ChIP-Seq data and cleans experiments with mixed data so that only ChIP-Seq data remain.
The downloaded experiments were cleaned using a custom script, filesort.py, which separates experiments that do not contain ChIP-Seq data from those that do. For GEO SOFT files with mixed data (e.g. RNA-Seq and ChIP-Seq), the script retains only the ChIP-Seq data.
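The filtering idea can be sketched as follows. This is a minimal illustration and not the code of filesort.py: it assumes the SOFT file marks each sample with a `^SAMPLE = GSM...` line and records the assay type in a `!Sample_library_strategy` line, and the function names and example accessions are hypothetical.

```python
def split_soft_samples(soft_text):
    """Return {gsm_accession: [attribute lines]} for every ^SAMPLE block."""
    samples = {}
    current = None
    for line in soft_text.splitlines():
        if line.startswith("^"):
            # Any entity line (^SERIES, ^PLATFORM, ^SAMPLE) closes the previous block.
            if line.startswith("^SAMPLE"):
                current = line.partition("=")[2].strip()
                samples[current] = []
            else:
                current = None
        elif current is not None:
            samples[current].append(line)
    return samples


def chip_seq_samples(soft_text):
    """Return the GSM accessions whose library strategy is ChIP-Seq."""
    keep = []
    for gsm, lines in split_soft_samples(soft_text).items():
        for line in lines:
            if line.startswith("!Sample_library_strategy"):
                if line.partition("=")[2].strip().lower() == "chip-seq":
                    keep.append(gsm)
                break  # only the first library strategy line is inspected
    return keep


# Hypothetical mixed experiment with one ChIP-Seq and one RNA-Seq sample.
example_soft = """^SERIES = GSE0000
!Series_title = hypothetical mixed experiment
^SAMPLE = GSM0001
!Sample_library_strategy = ChIP-Seq
^SAMPLE = GSM0002
!Sample_library_strategy = RNA-Seq
"""

print(chip_seq_samples(example_soft))  # ['GSM0001']
```

An experiment for which `chip_seq_samples` returns an empty list would be excluded entirely; a mixed experiment keeps only the listed samples.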
For each GEO experiment SOFT file, the experimental data (TF antibody targets, experiment description and GSM accessions) were extracted and used to create one JSON file per antibody target, listing the GSM sample accessions associated with that target.
For each antibody target in the experiment, the metadata (GEO accession, antibody target, cell line, tissue, cell type and the JSON file name) were recorded in a metadata.tsv file for later processing.
The SRA accessions associated with each GSM accession were also extracted so that the raw reads can later be downloaded in fastq format when the data are processed uniformly with the ENCODE ChIP-Seq pipeline.
All of the processes in this step are performed by a custom module, geosoft_extractor.py, which uses support modules at different steps.
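As a rough sketch of the extraction described in this step (not the actual geosoft_extractor.py), the snippet below groups GSM samples by antibody target and collects the SRA accessions found in `!Sample_relation` lines. It assumes the target is stored in a `!Sample_characteristics_ch1` entry tagged `antibody` or `chip antibody`; real GEO records use many different tags, which is one reason manual curation is needed later. All function names and accessions are hypothetical.

```python
import json
import re
from collections import defaultdict


def sample_fields(lines):
    """Parse '!Sample_<key> = <value>' lines into {key: [values]}."""
    fields = defaultdict(list)
    for line in lines:
        if line.startswith("!Sample_") and "=" in line:
            key, _, value = line.partition("=")
            fields[key.strip()[len("!Sample_"):]].append(value.strip())
    return fields


def antibody_target(fields):
    """Return the antibody target, or None if no recognised tag is present."""
    for entry in fields.get("characteristics_ch1", []):
        tag, _, value = entry.partition(":")
        # Assumption: only these two tags are checked in this sketch.
        if tag.strip().lower() in {"antibody", "chip antibody"}:
            return value.strip()
    return None


def sra_accessions(fields):
    """Collect SRA experiment/run accessions from 'SRA:' relation entries."""
    found = []
    for entry in fields.get("relation", []):
        if entry.startswith("SRA:"):
            found.extend(re.findall(r"[SED]R[XR]\d+", entry))
    return found


def group_by_target(samples):
    """Map {gsm: [lines]} to {target: {'gsm': [...], 'sra': [...]}}."""
    groups = {}
    for gsm, lines in samples.items():
        fields = sample_fields(lines)
        target = antibody_target(fields)
        if target is None:
            continue  # left for manual curation
        entry = groups.setdefault(target, {"gsm": [], "sra": []})
        entry["gsm"].append(gsm)
        entry["sra"].extend(sra_accessions(fields))
    return groups


example_samples = {
    "GSM0001": [
        "!Sample_characteristics_ch1 = antibody: CTCF",
        "!Sample_relation = SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX000001",
    ],
    "GSM0002": [
        "!Sample_characteristics_ch1 = cell line: S2",
        "!Sample_characteristics_ch1 = chip antibody: CTCF",
        "!Sample_relation = SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX000002",
    ],
}

# One JSON document per antibody target, as described above.
for target, record in group_by_target(example_samples).items():
    print(target, json.dumps(record))
```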
In our study we are interested only in TF experiments and not in other types of experiments. Some of the downloaded experiments contain epigenetic targets and cell signalling pathway targets, which were not of interest to us. We developed custom scripts to filter out non-TF targets.
We used the metadata file to find antibody targets that are not transcription factors and removed them, while also dropping the JSON files associated with them.
This step involved manual curation of each record in the generated metadata file to assign the correct antibody target to each record. After this curation step, the epigenetic targets, cell signalling targets, RNA targets and other non-TF targets were removed from the metadata file using a custom script, cleanjson.sh.
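The removal of non-TF records can be illustrated with the sketch below. It is not the logic of cleanjson.sh: the column names `ab_target` and `json_file` and the example rows are hypothetical, and the list of non-TF targets is assumed to come from the manual curation.

```python
import csv
import io


def split_tf_records(tsv_text, non_tf_targets):
    """Return (rows to keep, JSON file names to drop) from a metadata TSV."""
    blocked = {target.lower() for target in non_tf_targets}
    kept, dropped_json = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["ab_target"].strip().lower() in blocked:
            dropped_json.append(row["json_file"])
        else:
            kept.append(row)
    return kept, dropped_json


# Hypothetical metadata with one TF target and one histone mark.
example_tsv = (
    "geo_acc\tab_target\tjson_file\n"
    "GSE0001\tCTCF\tGSE0001_CTCF.json\n"
    "GSE0001\tH3K27ac\tGSE0001_H3K27ac.json\n"
)

kept_rows, json_to_drop = split_tf_records(example_tsv, {"H3K27ac"})
print([row["ab_target"] for row in kept_rows])  # ['CTCF']
print(json_to_drop)  # ['GSE0001_H3K27ac.json']
```

Returning the JSON file names separately keeps the deletion of files an explicit, reviewable step.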
The raw reads of each curated experiment were downloaded in .sra file format and then dumped into fastq.gz format using NCBI's SRA Toolkit.
We developed a custom script, links_download.sh, that downloads the raw reads from SRA. It downloads the SRA accessions given in a list.
Once downloaded, the reads are dumped using a custom script, dumper.sh, which converts them from .sra to .fastq.
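The download and dump steps can be sketched as below, assuming the SRA Toolkit commands `prefetch` and `fastq-dump` are on the PATH. This is not the content of links_download.sh or dumper.sh; the directory names, the accession, the `.sra` location, and the choice of `--split-files` are assumptions made for illustration. Commands are only printed unless `dry_run=False` is passed.

```python
import subprocess
from pathlib import Path


def read_accessions(list_text):
    """Return accessions from a text list, skipping blanks and '#' comments."""
    accessions = []
    for line in list_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def prefetch_command(accession, sra_dir):
    """Build the command that downloads one accession in .sra format."""
    return ["prefetch", "--output-directory", str(sra_dir), accession]


def dump_command(sra_file, fastq_dir):
    """Build the command that converts one .sra file to gzipped fastq."""
    return ["fastq-dump", "--gzip", "--split-files",
            "--outdir", str(fastq_dir), str(sra_file)]


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


# Hypothetical accession list and directory layout.
accessions = read_accessions("# curated runs\nSRR0000001\n\n")
commands = []
for acc in accessions:
    commands.append(prefetch_command(acc, "sra"))
    # Assumed location of the downloaded file; adjust to where prefetch wrote it.
    commands.append(dump_command(Path("sra") / acc / f"{acc}.sra", "fastq"))

run(commands)  # dry run: prints the commands without executing them
```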
The analysis pipeline was run using a custom script, chip_analysis.sh. This script invokes the ChIP-Seq analysis and writes the results to the results directory. The peak files are also collected in the peak-files directory for easy retrieval.
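Collecting the peak files into one directory could look like the sketch below. It is not taken from chip_analysis.sh: the `*.narrowPeak.gz` pattern, the per-experiment sub-directories and the renaming scheme are assumptions, so adjust them to the actual layout of the results directory.

```python
import shutil
import tempfile
from pathlib import Path


def collect_peaks(results_dir, peaks_dir, pattern="*.narrowPeak.gz"):
    """Copy every peak file under results_dir into peaks_dir."""
    peaks_dir = Path(peaks_dir)
    peaks_dir.mkdir(parents=True, exist_ok=True)
    results_dir, peaks_dir = Path(results_dir).resolve(), peaks_dir.resolve()
    copied = []
    for peak in sorted(results_dir.rglob(pattern)):
        if peaks_dir in peak.parents:
            continue  # skip files that are already in the collection directory
        # Prefix with the parent directory name to avoid name collisions.
        target = peaks_dir / f"{peak.parent.name}_{peak.name}"
        shutil.copy2(peak, target)
        copied.append(target)
    return copied


# Demonstration on a throwaway directory with two hypothetical experiments.
with tempfile.TemporaryDirectory() as tmp:
    for experiment in ("expA", "expB"):
        out = Path(tmp) / "results" / experiment
        out.mkdir(parents=True)
        (out / "rep1.narrowPeak.gz").write_bytes(b"")
    collected = collect_peaks(Path(tmp) / "results", Path(tmp) / "peak-files")
    print([path.name for path in collected])
```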
All the steps above are wrapped in a reproducible Snakemake workflow that runs from the experiment search through to the analysis of the data.
The steps have also been tied together in a job submission script, analyze_chip.pbs, for working in an HPC environment.