This repository contains the elPrep demo files:
- [run-large.sh]: a bash script that runs a preparation pipeline using elPrep on "NA12878-chr22", a subset of NA12878 that maps to chromosome 22
- [run-large-gatk.sh]: a bash script that runs the GATK best practices preparation pipeline on NA12878-chr22 using elPrep
- [run-small.sh]: a bash script that runs a preparation pipeline using elPrep on 10% of the reads of NA12878-chr22
- [run-small-gatk.sh]: a bash script that runs the GATK best practices preparation pipeline using elPrep on 10% of the reads of NA12878-chr22
- [clean.sh]: a bash script for deleting the output files generated by executing the pipeline scripts
- [ucsc.hg19.dict]: a .sam/.bam header file compatible with the GATK best practices pipeline
Upon executing one of the run scripts for the first time, the following input .bam files are downloaded:
- [NA12878-chr22.bam]: a subset of NA12878 that maps to chromosome 22, created using BWA (1.2GB)
- [NA12878-chr22-10pct.bam]: the first 10% of the entries of NA12878-chr22.bam (120MB)
Alternatively, you can download these files manually.
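Once the input files are present, you can verify them with SAMtools (see the installation step further below). This is only a quick sanity check, not part of the demo scripts themselves:

```
# Print the .sam/.bam header of the large input file to check that the
# download completed and the file is a valid BAM.
samtools view -H NA12878-chr22.bam

# Count the number of alignment records in the small input file.
samtools view -c NA12878-chr22-10pct.bam
```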
elPrep is developed for Linux and has been tested with the following distributions:
- Ubuntu 12.04.3 LTS
- Manjaro Linux
- Red Hat Enterprise Linux 6.4 and 6.5
Our demo comes with a larger .bam file (NA12878-chr22) and a smaller one (NA12878-chr22-10pct). The large file is meant for running tests on a server, while the small file is suitable for a desktop machine.
We recommend using the small input file for quick testing, and running with the large .bam file for benchmarking.
For the large input file (NA12878-chr22), the minimal system requirements are:
* For demo 1 (the preparation pipeline, run-large.sh):
  * RAM: 11.0 GB
  * Disk space: 2.4 GB
* For demo 2 (the GATK best practices pipeline, run-large-gatk.sh):
  * RAM: 44.0 GB
  * Disk space: 2.4 GB

On our test machine, a server with two 6-core Intel Xeon processors clocked at 2.8 GHz, the observed runtimes are:
* For demo 1:
  * with 24 threads: 1m 19s
* For demo 2:
  * with 24 threads: 2m 06s
For the small input file (NA12878-chr22-10pct), the minimal system requirements are:
* For demo 1 (the preparation pipeline, run-small.sh):
  * RAM: 5.0 GB
  * Disk space: 241.0 MB
* For demo 2 (the GATK best practices pipeline, run-small-gatk.sh):
  * RAM: 7.0 GB
  * Disk space: 241.0 MB

On our test machine, a server with two 6-core Intel Xeon processors clocked at 2.8 GHz, the observed runtimes are:
* For demo 1:
  * with 24 threads: 9s
* For demo 2:
  * with 24 threads: 13s
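To check whether your machine meets these requirements before starting a run, you can query the available memory and disk space with standard Linux tools; a minimal sketch:

```
# Total and available RAM, in gigabytes.
free -g

# Free disk space in the current directory, where the demo writes its output.
df -h .
```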
Install SAMtools and elPrep and add them to your PATH. For example, fill in your username and execute:
export PATH=$PATH:/home/username/tools/samtools-0.1.19:/home/username/tools/elprep-1.0
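To make this setting persist across shell sessions and to verify that both tools are found, you can do something like the following sketch (the installation directories above are examples; adjust the paths to wherever you unpacked SAMtools and elPrep):

```
# Optionally append the same export line to your shell startup file so the
# PATH change survives new terminal sessions.
echo 'export PATH=$PATH:/home/username/tools/samtools-0.1.19:/home/username/tools/elprep-1.0' >> ~/.bashrc

# Verify that both executables are found on the PATH.
command -v samtools
command -v elprep
```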
The scripts run-large.sh and run-small.sh execute a preparation pipeline that consists of removing the unmapped reads, replacing the reference dictionary, and adding read groups, respectively for the large (NA12878-chr22) and small (NA12878-chr22-10pct) input files.
By default, the scripts use the maximum number of threads available on your machine. If you want to use a different number of threads, edit the second line of the scripts accordingly.
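To see how many hardware threads your machine offers before editing the scripts, you can use a standard command such as:

```
# Print the number of processing units available on this machine;
# use this value (or a lower one) as the thread count in the scripts.
nproc
```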
Run the scripts by executing:
sh run-small.sh
for the small .bam file
or
sh run-large.sh
for the large .bam file.
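If you want to compare against the runtimes listed above, you can wrap the script invocation in the standard time command:

```
# Measure the wall-clock time of the small demo.
time sh run-small.sh
```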
Executing these scripts will print the following feedback for the small .bam file:
elPrep version 1.0. See http://github.com/exascience/elprep for more information.
Executing command:
elprep NA12878-chr22-10pct.bam NA12878-chr22-10pct.only_mapped.reordered-contigs.read-group.bam --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --sorting-order unsorted --gc-on 2 --nr-of-threads 24
or the following feedback for the large .bam file:
elPrep version 1.0. See http://github.com/exascience/elprep for more information.
Executing command:
elprep NA12878-chr22.bam NA12878-chr22.only_mapped.reordered-contigs.read-group.bam --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --sorting-order unsorted --gc-on 2 --nr-of-threads 24
The elPrep commands that are printed in the feedback are the actual elPrep commands that are executed by those scripts. Hence you can also copy-paste these commands directly into your terminal instead of running the bash scripts.
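After the run finishes, you can inspect the header of the output .bam file with SAMtools to confirm that the reference dictionary was replaced and the read group was added. For example, for the small demo (output filename as printed in the feedback above):

```
# Show the header of the output file produced by run-small.sh: the @SQ lines
# should match ucsc.hg19.dict, and an @RG line with ID:group1 should be present.
samtools view -H NA12878-chr22-10pct.only_mapped.reordered-contigs.read-group.bam
```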
The scripts run-large-gatk.sh and run-small-gatk.sh execute the GATK best practices preparation pipeline that consists of removing the unmapped reads, replacing the reference sequence dictionary, adding read groups, marking duplicates, and sorting by coordinate order.
By default, the scripts use the maximum number of threads available on your machine. If you want to use a different number of threads, edit the second line of the scripts accordingly.
Run the scripts by executing:
sh run-small-gatk.sh
for the small .bam file
or
sh run-large-gatk.sh
for the large .bam file.
Executing these scripts will print the following feedback for the small .bam file:
elPrep version 1.0. See http://github.com/exascience/elprep for more information.
Executing command:
elprep NA12878-chr22-10pct.sam NA12878-chr22-10pct.only_mapped.reordered-contigs.sorted.deduplicated.read-group.sam --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --mark-duplicates --sorting-order coordinate --gc-on 0 --nr-of-threads 24
or the following for the large .bam file:
elPrep version 1.0. See http://github.com/exascience/elprep for more information.
Executing command:
elprep NA12878-chr22.sam NA12878-chr22.only_mapped.reordered-contigs.sorted.deduplicated.read-group.sam --filter-unmapped-reads --replace-reference-sequences ucsc.hg19.dict --replace-read-group "ID:group1 LB:lib1 PL:illumina PU:unit1 SM:sample1" --mark-duplicates --sorting-order coordinate --gc-on 0 --nr-of-threads 24
The elPrep commands that are printed in the feedback are the actual elPrep commands that are executed by those scripts. Hence you can also copy-paste these commands directly into your terminal instead of running the bash scripts.
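Similarly, you can check the output of the GATK best practices pipeline with SAMtools, for example by counting the reads that were flagged as duplicates (the output here is a .sam file, hence the -S flag for SAM input):

```
# Count the alignments marked as duplicates (flag 0x400) in the output of the
# small GATK demo.
samtools view -S -c -f 1024 NA12878-chr22-10pct.only_mapped.reordered-contigs.sorted.deduplicated.read-group.sam
```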
To remove the output files generated by executing the demos, execute the clean script:
sh clean.sh
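If you prefer to clean up manually, removing the generated output files comes down to something like the following sketch, based on the output filenames printed in the feedback above (the actual clean.sh may differ):

```
# Remove the output files generated by the demo scripts, keeping the
# downloaded input files (NA12878-chr22.bam and NA12878-chr22-10pct.bam).
rm -f NA12878-chr22.only_mapped.*
rm -f NA12878-chr22-10pct.only_mapped.*
```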