Clone the git repository with the command:
git clone https://github.com/durrantmm/rnaseq_variant_calling_workflow.git
This workflow uses a conda environment to satisfy all the necessary dependencies.
Make sure you have anaconda installed. Download it here.
You should be able to install the workflow simply by entering:
bash install.sh
In your console. Now it's time to move on to configuration.
This is a very important step in running the workflow properly.
Open the provided config.yaml
file to get started
The first two parameters of the config.yaml
file are
gatk_path: "java -jar /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar /path/to/Picard.jar"
These are two strings that allow the workflow to execute GATK and Picard. You can download these from https://broadinstitute.github.io/picard/ and https://software.broadinstitute.org/gatk/download/
GATK requires you to make an account to download.
Once they are both installed, you can enter their execution paths in the config.yaml
file.
You may want to add an addition flag, -Xmx50G
, to make sure that Java allocates enough memory. The final
config.yaml
file would look something like this:
gatk_path: "java -jar -Xmx50G /path/to/GenomeAnalysisTK.jar"
picard_path: "java -jar -Xmx50G /path/to/Picard.jar"
The next parameter of the config.yaml
file is wd: test
. This is the default working directory, and it contains
some test data to try out. You can give this a try by typing
snakemake --cores 12
It should take a few minutes to run, but once it runs correctly you know you are all set.
It is essential that you set up your workflow properly before you run it. Follow these directions exactly.
- When naming files, periods must only be used where specified. They cannot be used as a general name delimiter.
myfile.R1.fq.gz
- CORRECT - this is the right way to name a fastq filemyfile.Run1.Replicate2.R1.fq.gz
- INCORRECT - this is the WRONG way to name a fastq file.myfile_Run1_Replicate2.R1.fq.gz
- CORRECT - Replacing the improper periods with underscores is appropriate.
- Files can be be symbolic links to save disk space.
Navigate to your working directory. Make sure you have chosen a file system that has lots of free disk space available.
From within your working directory, create 5 directories as follows:
mkdir 0.fastq;
mkdir 0.reference_genome_fasta;
mkdir 0.gencode;
mkdir 0.dbsnp;
mkdir 0.adar_sites;
Or it may be easier to download a template working directory by executing
wget https://s3-us-west-1.amazonaws.com/mdurrant/biodb/bundles/rnaseq_variant_calling_workflow/wd_hg19.tar.gz;
tar -zxvf wd_hg19.tar.gz;
And then setting your wd
parameter in the config file to this directory.
Now we'll go through each of the directories and describe how to populate them.
This is the directory that contains all of the paired-end FASTQ files to be analyzed. Single-end analysis is possible, but hasn't been built into this workflow automatically. You can figure that one out on your own.
These fastq files must follow this precise format.
<sample1>.R1.fq.gz
<sample1>.R2.fq.gz
<sample2>.R1.fq.gz
<sample2>.R2.fq.gz
...
This folder contains as many samples as you want. Make sure that the only periods are found in the .R#.fq.gz
file extension.
This is the reference genome fasta file to be used in this analysis. You can have as many reference genomes as you want.
Each genome must also have matching files in 0.gencode
, 0.dbsnp
, and 0.adar_sites
. The identifying name must
be the same between these directories.
They must be named as follows:
<genome>.fa
They cannot be gzipped.
This folder contains the gencode GTF file to be used by the STAR aligner.
It must follow the naming convention:
<genome>.gtf
It must have the same <genome>
name as the corresponding reference genome in
0.reference_genome_fasta
A version corresponding to hg19 can be downloaded from http://www.gencodegenes.org/releases/19.html
This folder contains the dbsnp VCF file.
It must follow the naming convention
<genome>.vcf
It must have the same <genome>
name as the corresponding reference genome in
0.reference_genome_fasta
and 0.gencode
.
This folder contains known ADAR A-to-I RNA editing sites. They are excluded outright from the resulting RNA-seq variant calls.
It must follow the naming convention
<genome>.txt
It must have the same <genome>
name as the corresponding reference genome in
0.reference_genome_fasta
, 0.gencode
, and 0.dbsnp
.
You should now be able to run the snakemake workflow.
You can run this with the command
snakemake
From the rnaseq_variant_calling_workflow
directory.
You can set the number of cores used with
snakemake --cores 12
If you are using the SGE computing cluster system, you can submit a job using the provided script
qsub submission.sh
You may need to play with the parameters in cluster.json
to make sure that each step has enough memory and wall time
allocated to it.