Skip to content
This repository has been archived by the owner on Jan 3, 2018. It is now read-only.

Commit

Permalink
Merge pull request #1 from UCSC-Treehouse/v3-local-fasta
Browse files Browse the repository at this point in the history
V3 local fasta
  • Loading branch information
hbeale authored Sep 25, 2017
2 parents dcb8a96 + d8bbd7a commit bb31022
Show file tree
Hide file tree
Showing 12 changed files with 492 additions and 198 deletions.
File renamed without changes.
File renamed without changes.
84 changes: 84 additions & 0 deletions GATK/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# GATK RNAseq Variant Calling Analysis
Docker image of GATK best practices approach for variant calling on RNAseq data.

For full details please visit https://software.broadinstitute.org/gatk/guide/article?id=3891.

The steps for the workflow are:
* Add read groups, sort, mark duplicates, and create index using PICARD (https://broadinstitute.github.io/picard/)
* Split’N’Trim and reassign mapping qualities
* GATK IndelRealigner
* snpEff (http://snpeff.sourceforge.net/SnpEff.html) 5) GATK HaplotypeCaller
* GATK SelectVariants

## Usage
docker pull linhvoyo/gatk_rna_variant_v2

docker run -it -v /path/to/ref_folder:/data/ref \ -v /path/to/input_folder/:/data/work \
-e refgenome=input_reference_genome.fa \
-e input=sample.sorted.bam \
linhvoyo/gatk_rna_variant_v2 | tee >>/path/to/input_folder/out

## Details
docker run
Command to initiate docker.

-v /path/to/ref folder:/data/ref
State the full path of directory containing the reference genome. The -v parameter will mount the listed directory into the docker container.

-v /path/to/input folder/:.data/work
State the full path of working directory containing the input bam file.

-e refgenome=input reference genome.fa
Provide the name of reference genome.

-e input=sample.sorted.bam
Provide the name of the input file. The docker takes in a sorted bam file that was aligned using STAR aligner (https://github.com/alexdobin/STAR).

## Example run command and the expected output

### Run command
docker run -it -v /data/ref:/data/ref \
-v $(pwd):/data/work \
-e refgenome=GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
-e input=test.bam \
linhvoyo/gatk_rna_variant_v2 | tee >> test.bam.out

### Output files
All output files will be in the mounted directory containing test.bam.

test.split.realigned.bam
test.split.realigned.bai
test.split.realigned.HaplotypeCaller.filtered.ann.vcf
test.split.realigned.HaplotypeCaller.filtered.ann.vcf.idx

Variant calls filtered by the list of canonical variants obtained from Max H. The coordinates were converted from hg19 to hg38 using UCSC liftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver).

test.split.realigned.HaplotypeCaller.filtered.ann.vl.vcf
test.split.realigned.HaplotypeCaller.filtered.ann.vl.vcf.idx

Variant calls filtered by cancer genes. The list of cancer genes was downloaded from Cancer Census (https://cancer.sanger.ac.uk). The coordiantes are in hg38.

test.split.realigned.HaplotypeCaller.filtered.ann.cg.vcf
test.split.realigned.HaplotypeCaller.filtered.ann.cg.vcf.idx

The generated ”log” shows a brief overview of the variant calling process. The file will show ”Done (linhvoy- o/gatk rna variant v2) - test.bam” once sample has finished processing.

test.bam.log

### Example Output
Wed Mar 15 17:06:18 UTC 2017 Add Read Groups - test.bam
Wed Mar 15 17:06:27 UTC 2017 Add Read Groups - test.bam - Done
Wed Mar 15 17:06:27 UTC 2017 SplitNCigarReads - test.bam
Wed Mar 15 17:06:38 UTC 2017 SplitNCigarReads - test.bam - Done
Wed Mar 15 17:06:38 UTC 2017 RealignerTargetCreator - test.bam
Wed Mar 15 17:11:47 UTC 2017 RealignerTargetCreator - test.bam - Done Wed Mar 15 17:11:47 UTC 2017 IndelRealigner - test.bam
Wed Mar 15 17:12:01 UTC 2017 IndelRealigner - test.bam - Done
Wed Mar 15 17:12:01 UTC 2017 HaplotypeCaller -
Wed Mar 15 17:47:56 UTC 2017 HaplotypeCaller -
Wed Mar 15 17:47:56 UTC 2017 VariantFiltration
Wed Mar 15 17:48:00 UTC 2017 VariantFiltration
Wed Mar 15 17:48:00 UTC 2017 snpEff - test.bam
Wed Mar 15 17:49:34 UTC 2017 snpEff - test.bam
Wed Mar 15 17:49:34 UTC 2017 SelectVariants - variant list - test.bam
Wed Mar 15 17:49:38 UTC 2017 SelectVariants - variant list - test.bam - Done Wed Mar 15 17:49:38 UTC 2017 SelectVariants - gene list - test.bam
Wed Mar 15 17:49:42 UTC 2017 SelectVariants - gene list - test.bam - Done Wed Mar 15 17:49:42 UTC 2017 Done (linhvoyo/gatk_rna_variant_v2) - test.bam
File renamed without changes.
File renamed without changes.
140 changes: 140 additions & 0 deletions GATK/run.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
#!/bin/bash
#v2 add tempdir option.

#if [[ $tempdir = *[!\ ]* ]]; then tmpdir="-Djava.io.tmpdir="$tempdir; else tmpdir=" "; fi
set -eu -o pipefail


i=$input
r=$refgenome
tmpdir="-Djava.io.tmpdir="/data/work/temp_$input
gl=cc_genelist.bed
vl=hglft_genome_24bb_d53250_variants.bed




if [ ! -f "${i/.bam}.readgroups.bam.ok" ]; then
echo `date` "Add Read Groups - $input">>/data/work/${i/.bam}.log;
java -jar $tmpdir /root/picard.jar AddOrReplaceReadGroups I=/data/work/$i O=/data/work/${i/.bam}.readgroups.bam SORT_ORDER=coordinate RGLB=${i/.bam} RGPL=illumina RGPU=illumina RGSM=${i/.bam} VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true ;
echo `date` "Add Read Groups - $input - Done">>/data/work/${i/.bam}.log;
>${i/.bam}.readgroups.bam.ok;
fi

if [ ! -f "${i/.bam}.split.bam.ok" ]; then
echo `date` "SplitNCigarReads - $input">>/data/work/${i/.bam}.log;
java -Xmx50000m -jar $tmpdir /root/GenomeAnalysisTK.jar -T SplitNCigarReads -R /data/ref/$r -I /data/work/${i/.bam}.readgroups.bam -o /data/work/${i/.bam}.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS;
echo `date` "SplitNCigarReads - $input - Done">>/data/work/${i/.bam}.log;
>${i/.bam}.split.bam.ok;
fi

if [ ! -f "${i/.bam}.split.intervals.ok" ]; then
echo `date` "RealignerTargetCreator - $input">>/data/work/${i/.bam}.log;
java -Xmx5000m -jar $tmpdir /root/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R /data/ref/$r \
-I /data/work/${i/.bam}.split.bam \
-o /data/work/${i/.bam}.split.intervals \
-nt 24 ;
echo `date` "RealignerTargetCreator - $input - Done">>/data/work/${i/.bam}.log;
>${i/.bam}.split.intervals.ok;
fi

if [ ! -f "${i/.bam}.split.realigned.bam.ok" ]; then
echo `date` "IndelRealigner - $input">>/data/work/${i/.bam}.log;
java -Xmx50000m -jar $tmpdir /root/GenomeAnalysisTK.jar \
-T IndelRealigner \
-R /data/ref/$r \
-I /data/work/${i/.bam}.split.bam \
-targetIntervals /data/work/${i/.bam}.split.intervals \
-o /data/work/${i/.bam}.split.realigned.bam;
echo `date` "IndelRealigner - $input - Done">>/data/work/${i/.bam}.log;
>${i/.bam}.split.realigned.bam.ok;
fi

if [ ! -f "${i/.bam}.split.realigned.HaplotypeCaller.vcf.ok" ]; then
echo `date` "HaplotypeCaller - $input">>/data/work/${i/.bam}.log;
java -Xmx20000m -jar $tmpdir /root/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R /data/ref/$r \
-I /data/work/${i/.bam}.split.realigned.bam \
-dontUseSoftClippedBases \
-stand_call_conf 20.0 \
-stand_emit_conf 20.0 \
-o /data/work/${i/.bam}.split.realigned.HaplotypeCaller.vcf \
-nct 4 ;
echo `date` "HaplotypeCaller - $input - Done">>/data/work/${i/.bam}.log;
>${i/.bam}.split.realigned.HaplotypeCaller.vcf.ok;
fi

if [ ! -f "${i/.bam}.split.realigned.HaplotypeCaller.filtered.vcf.ok" ]; then
echo `date` "VariantFiltration - $input">>/data/work/${i/.bam}.log;
java -Xmx20000m -jar $tmpdir /root/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R /data/ref/$r \
-V /data/work/${i/.bam}.split.realigned.HaplotypeCaller.vcf \
-window 35 \
-cluster 3 \
-filterName FS \
-filter "FS > 30.0" \
-filterName QD \
-filter "QD < 2.0" \
-o /data/work/${i/.bam}.split.realigned.HaplotypeCaller.filtered.vcf;
echo `date` "VariantFiltration - $input - Done">>/data/work/${i/.bam}.log;
>${i/.bam}.split.realigned.HaplotypeCaller.filtered.vcf.ok;
fi

if [ ! -f "${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vcf.ok" ]; then
echo `date` "snpEff - $input">>/data/work/${i/.bam}.log;
java -Xmx10000m -jar $tmpdir /root/snpEff/snpEff.jar -v -classic GRCh38.86 /data/work/${i/.bam}.split.realigned.HaplotypeCaller.filtered.vcf > /data/work/${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vcf;
rm snpEff_genes.txt;
rm snpEff_summary.html;
echo `date` "snpEff - $input - Done">>/data/work/${i/.bam}.log;
>${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vcf.ok;
fi

if [ ! -f "${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vl.vcf.ok" ]; then
echo `date` "SelectVariants - variant list - $input">>/data/work/${i/.bam}.log;
java -Xmx20000m -jar $tmpdir /root/GenomeAnalysisTK.jar \
-T SelectVariants \
-R /data/ref/$refgenome \
-V /data/work/${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vcf \
-o /data/work/${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vl.vcf \
-L /root/$vl;
echo `date` "SelectVariants - variant list - $input - Done">>/data/work/${i/.bam}.log;
>${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vl.vcf.ok;
fi

if [ ! -f "${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.cg.vcf.ok" ]; then
echo `date` "SelectVariants - gene list - $input ">>/data/work/${i/.bam}.log;
java -Xmx50000m -jar $tmpdir /root/GenomeAnalysisTK.jar \
-T SelectVariants \
-R /data/ref/$refgenome \
-V /data/work/${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vcf \
-o /data/work/${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.cg.vcf \
-L /root/$gl;
echo `date` "SelectVariants - gene list - $input - Done ">>/data/work/${i/.bam}.log;
>${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.cg.vcf.ok;
fi

#chown output files
finish() {
# Fix ownership of output files
uid=$(stat -c '%u:%g' /data/work)
chown $uid ${i/.bam}.log
chown $uid ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.cg.vcf
chown $uid ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.cg.vcf.idx
chown $uid ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vcf
chown $uid ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vcf.idx
chown $uid ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vl.vcf
chown $uid ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vl.vcf.idx
chown $uid ${i/.bam}.split.realigned.bai
chown $uid ${i/.bam}.split.realigned.bam
}

trap finish EXIT


echo `date` "Done (linhvoyo/gatk_variant_v2) - $input">>/data/work/${i/.bam}.log;

rm ${i/.bam}.readgroups.bai ${i/.bam}.readgroups.bam ${i/.bam}.split.bai ${i/.bam}.split.bam ${i/.bam}.split.intervals ${i/.bam}.split.realigned.HaplotypeCaller.vcf ${i/.bam}.split.realigned.HaplotypeCaller.vcf.idx ${i/.bam}.readgroups.bam.ok ${i/.bam}.split.bam.ok ${i/.bam}.split.intervals.ok ${i/.bam}.split.realigned.bam.ok ${i/.bam}.split.realigned.HaplotypeCaller.vcf.ok ${i/.bam}.split.realigned.HaplotypeCaller.filtered.vcf.ok ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vl.vcf.ok ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.cg.vcf.ok ${i/.bam}.split.realigned.HaplotypeCaller.filtered.ann.vcf.ok ${i/.bam}.split.realigned.HaplotypeCaller.filtered.vcf ${i/.bam}.split.realigned.HaplotypeCaller.filtered.vcf.idx -r $(pwd)/temp_$input
File renamed without changes.
Loading

0 comments on commit bb31022

Please sign in to comment.