ruchirkakkad/elprep-demo

Overview

This repository contains the elPrep demo files:

  • [run-large.sh]: a bash script that runs a preparation pipeline using elPrep on "NA12878-chr22", a subset of NA12878 that maps to chromosome 22
  • [run-large-gatk.sh]: a bash script that runs the GATK best practices preparation pipeline on NA12878-chr22 using elPrep
  • [run-small.sh]: a bash script that runs a preparation pipeline using elPrep on 10% of the reads of NA12878-chr22
  • [run-small-gatk.sh]: a bash script that runs the GATK best practices preparation pipeline using elPrep on 10% of the reads of NA12878-chr22
  • [clean.sh]: a bash script for deleting the output files generated by executing the pipeline scripts
  • [ucsc.hg19.dict]: a .sam/.bam header file compatible with the GATK best practices pipeline

Upon executing one of the run scripts for the first time, the following input .bam files are downloaded:

  • [NA12878-chr22.bam]: a subset of NA12878 that maps to chromosome 22, created using BWA (1.2GB)
  • [NA12878-chr22-10pct.bam]: the first 10% of the entries of NA12878-chr22.bam (120MB)

Alternatively, you can also download these files manually here.

System requirements

Operating system

elPrep is developed for Linux and has been tested with the following distributions:

  • Ubuntu 12.04.3 LTS
  • Manjaro Linux
  • Red Hat Enterprise Linux 6.4 and 6.5

Workloads

Our demo comes with a larger .bam file (NA12878-chr22) and a smaller one (NA12878-chr22-10pct). The large file is meant for running tests on a server, while the small file is suitable for a desktop machine.

We recommend using the small input file for quick testing, and running with the large .bam file for benchmarking.

NA12878-chr22

The minimal system requirements are:

* For demo 1 (the simple preparation pipeline):

	* RAM: 11.0 GB
	* Disk space: 2.4 GB

* For demo 2 (the GATK best practices preparation pipeline):

	* RAM: 44.0 GB
	* Disk space: 2.4 GB

On our test machine, a server with two 6-core Intel Xeon processors clocked at 2.8 GHz, the observed runtimes are:

* For demo 1:
	* with 24 threads: 1m 19s

* For demo 2:
	* with 24 threads: 2m 06s

NA12878-chr22-10pct

The minimal system requirements are:

* For demo 1:

	* RAM: 5.0 GB
	* Disk space: 241.0 MB

* For demo 2:

	* RAM: 7.0 GB
	* Disk space: 241.0 MB

On our test machine, a server with two 6-core Intel Xeon processors clocked at 2.8 GHz, the observed runtimes are:

* For demo 1:
	* with 24 threads: 9s

* For demo 2:
	* with 24 threads: 13s
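
The RAM figures listed for both input files can be checked against a Linux machine before running a demo. The following is a minimal sketch, not part of the demo scripts: it assumes /proc/meminfo is readable and treats 1 GB as 1024 * 1024 kB, and the function names total_ram_kb and ram_ok are made up for illustration.

```shell
#!/bin/sh
# Total physical memory in kB as reported by the kernel (Linux only).
# Prints 0 when /proc/meminfo is not available.
total_ram_kb() {
    if [ -r /proc/meminfo ]; then
        awk '/^MemTotal:/ { print $2 }' /proc/meminfo
    else
        echo 0
    fi
}

# Succeeds when available_kb covers required_gb.
# Assumption: 1 GB is counted as 1024 * 1024 kB.
ram_ok() {
    required_gb=$1
    available_kb=$2
    [ "$available_kb" -ge $((required_gb * 1024 * 1024)) ]
}

ram_kb=$(total_ram_kb)

# RAM requirements in GB, taken from the lists above.
for entry in "NA12878-chr22-10pct demo 1:5" "NA12878-chr22-10pct demo 2:7" \
             "NA12878-chr22 demo 1:11" "NA12878-chr22 demo 2:44"; do
    name=${entry%%:*}
    need=${entry##*:}
    if ram_ok "$need" "$ram_kb"; then
        echo "$name: OK ($need GB required)"
    else
        echo "$name: insufficient RAM ($need GB required)"
    fi
done
```

MemTotal is usually slightly lower than the installed memory because the kernel reserves part of it, so a machine that sits exactly at a limit may be reported as insufficient.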

Running the demos

Path setup

Install SAMtools and elPrep and add them to your PATH. For example, replace username with your own username and execute:

export PATH=$PATH:/home/username/tools/samtools-0.1.19:/home/username/tools/elprep-1.0
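
To confirm that the PATH change took effect, you could check that both tools resolve from the shell. This is a small sketch; the helper name check_tool is hypothetical, and it relies only on the POSIX `command -v` builtin.

```shell
#!/bin/sh
# Report whether a tool can be found on the current PATH.
# `command -v` prints the resolved location when the tool exists.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: NOT found on PATH"
        return 1
    fi
}

# Count how many of the two required tools are missing.
missing=0
for tool in samtools elprep; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"
```

If either tool is reported as not found, re-check the directories in the export line before running the demos.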

Demo 1: a simple preparation pipeline

The scripts run-large.sh and run-small.sh execute a preparation pipeline that consists of removing the unmapped reads, replacing the reference sequence dictionary, and adding read groups, for the large (NA12878-chr22) and small (NA12878-chr22-10pct) input files respectively.

  1. By default, the scripts use the maximum number of available threads, based on your processor's capabilities. If you want to use a different number of threads, edit the scripts to do so (cf. 2nd line).

  2. Run the scripts by executing:

    sh run-small.sh

for the small .bam file, or

    sh run-large.sh

for the large .bam file.

Executing these scripts will print the following feedback for the small .bam file:

elPrep version 1.0. See http://github.com/exascience/elprep for more information.
Executing command:
elprep NA12878-chr22-10pct.bam NA12878-chr22-10pct.only_mapped.reordered-contigs.read-group.bam --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --sorting-order unsorted --gc-on 2 --nr-of-threads 24

or the following feedback for the large .bam file:

elPrep version 1.0. See http://github.com/exascience/elprep for more information.
Executing command:
elprep NA12878-chr22.bam NA12878-chr22.only_mapped.reordered-contigs.read-group.bam --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --sorting-order unsorted --gc-on 2 --nr-of-threads 24

The elPrep commands that are printed in the feedback are the actual elPrep commands that are executed by those scripts. Hence you can also copy-paste these commands directly into your terminal instead of running the bash scripts.
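
If you prefer to run the command by hand with a different thread count, the command line can be assembled from the input name and the number of threads. The sketch below only prints the command and does not run elPrep. The flags are copied from the feedback shown above; the function names and the thread detection via `nproc` or `getconf` are assumptions, and the demo scripts may determine the thread count differently.

```shell
#!/bin/sh
# Number of hardware threads: try nproc (GNU coreutils), then getconf,
# and fall back to 1 if neither is available.
detect_threads() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Print the demo 1 elPrep command for an input name (without the .bam
# extension) and a thread count.
demo1_command() {
    input=$1
    threads=$2
    printf 'elprep %s.bam %s.only_mapped.reordered-contigs.read-group.bam' "$input" "$input"
    printf ' --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict'
    printf ' --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1"'
    printf ' --sorting-order unsorted --gc-on 2 --nr-of-threads %s\n' "$threads"
}

# Example: the small input file with all detected threads.
demo1_command NA12878-chr22-10pct "$(detect_threads)"
```

Passing NA12878-chr22 and 24 reproduces the command printed for the large .bam file.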

Demo 2: the GATK best practices preparation pipeline

The scripts run-large-gatk.sh and run-small-gatk.sh execute the GATK best practices preparation pipeline that consists of removing the unmapped reads, replacing the reference sequence dictionary, adding read groups, marking duplicates, and sorting by coordinate order.

  1. By default, the scripts use the maximum number of available threads, based on your processor's capabilities. If you want to use a different number of threads, edit the scripts to do so (cf. 2nd line).

  2. Run the scripts by executing:

    sh run-small-gatk.sh

for the small .bam file, or

    sh run-large-gatk.sh

for the large .bam file.

Executing these scripts will print the following feedback for the small .bam file:

elPrep version 1.0. See http://github.com/exascience/elprep for more information.
Executing command:
  elprep NA12878-chr22-10pct.sam NA12878-chr22-10pct.only_mapped.reordered-contigs.sorted.deduplicated.read-group.sam --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --mark-duplicates --sorting-order coordinate --gc-on 0 --nr-of-threads 24

or the following for the large .bam file:

elPrep version 1.0. See http://github.com/exascience/elprep for more information.
Executing command:
  elprep NA12878-chr22.sam NA12878-chr22.only_mapped.reordered-contigs.sorted.deduplicated.read-group.sam --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --mark-duplicates --sorting-order coordinate --gc-on 0 --nr-of-threads 24

The elPrep commands that are printed in the feedback are the actual elPrep commands that are executed by those scripts. Hence you can also copy-paste these commands directly into your terminal instead of running the bash scripts.
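
After demo 2 finishes, a quick sanity check of the .sam output is to look at the header and the duplicate flags. The sketch below assumes that the output records coordinate sorting as `SO:coordinate` on the `@HD` header line and that marked duplicates carry the 0x400 bit in the FLAG column, as defined by the SAM format. The function names are hypothetical and not part of elPrep.

```shell
#!/bin/sh
# Succeeds when the @HD header line on stdin declares coordinate sorting.
# Stops reading at the first alignment record, so large files are not scanned.
is_coordinate_sorted() {
    awk '/^@HD/ && /SO:coordinate/ { ok = 1 } !/^@/ { exit } END { exit !ok }'
}

# Count alignment records on stdin whose FLAG (column 2) has the
# 0x400 (1024) duplicate bit set. Header lines start with "@" and are skipped.
count_duplicates() {
    awk -F '\t' '!/^@/ && int($2 / 1024) % 2 == 1 { n++ } END { print n + 0 }'
}

# Output file name taken from the feedback of run-small-gatk.sh.
out=NA12878-chr22-10pct.only_mapped.reordered-contigs.sorted.deduplicated.read-group.sam
if [ -f "$out" ]; then
    if is_coordinate_sorted < "$out"; then
        echo "header: coordinate sorted"
    else
        echo "header: not marked as coordinate sorted"
    fi
    echo "reads flagged as duplicates: $(count_duplicates < "$out")"
else
    echo "$out not found; run the demo first"
fi
```

The same two functions work for the large output file by changing the value of out.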

Resetting the demos

To remove the output files generated by executing the demos, execute the clean script:

sh clean.sh
