A collection of scripts commonly used while running bioinformatics analyses on the Tulane University Cypress HPC

erikenbody/Bioinformatics_Scripts

Bioinformatics Scripts Overview

And other related first steps, including software installation

This repository is just a place to deposit scripts used while analyzing Whole-genome Resequencing and RNAseq data related to my dissertation on White-shouldered Fairywrens (more info: erikenbody.github.io).

There are few custom scripts in here; mostly these are sbatch submissions to the Tulane HPC cluster (Cypress). Cypress uses SLURM-based submissions, so the .sh files in this repository use SLURM syntax. However, the rest of the commands should be general purpose, and I try to annotate the code as much as possible throughout.

In some places I used custom Python and R scripts written by Allison Shultz, from her detailed WGS repository here: https://github.com/ajshultz/whole-genome-reseq which follows her recommended workflow here: https://github.com/ajshultz/pop-gen-pipeline

Below, you will find a running list of notes I made while running miscellaneous tasks related to this project. These include simple unix commands, helpful bash scripts, tips for transferring NGS data between clusters, and notes on the installation of various software.

Useful unix commands

cd .. 	#goes up one level to the parent directory (cd - returns to the previous directory)
mv 	#renames or moves a file: give the current name followed by the new name
rm -r [folder] 	#deletes a folder and its contents, MUST GIVE A FOLDER, be careful! There is no undo, and it can delete important folders, like root!
zcat 		#decompresses a .gz file and writes the contents to stdout, like cat
grep 	#searches with regular expressions; use ^ to match at the beginning of a line and $ to match at the end. Very powerful: you can use this to find all the reads that didn't work
zgrep   #grep for compressed files, for example to look up a barcode in a fastq.gz file
ctrl-z, then bg 	#suspends the running command and resumes it in the background; bash prints a job number, and a Done message when the command finishes
du #checks size of files; use -b for apparent size in bytes (GNU du, better for comparing copies across systems) and -h for human readable
ln -s #makes a soft link to a file or folder. I did this in my cypress home folder to point to my lustre working directory
chmod +x #makes a .sh file executable so you have permission to run it
column -t #aligns delimited output (e.g. a tsv file) into neatly separated columns
du -sch .[!.]* * |sort -h #outputs the size of everything in the current directory, including hidden files, sorted by size (was nice for when I hit quota on home dir)
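
As an illustration of the grep anchors and the barcode lookup mentioned in the list, here is a minimal sketch that counts reads whose header ends in a given barcode. It assumes an Illumina-style FASTQ header in which the index sequence is the last colon-separated field, and four lines per read; the function name count_barcode_reads and the file name in the usage line are hypothetical.

```bash
# count_barcode_reads BARCODE FILE.fastq.gz
# Hypothetical helper: counts reads whose header line ends in ":BARCODE".
# Assumes 4-line FASTQ records with the index at the end of the header.
count_barcode_reads() {
    local barcode=$1 fastq_gz=$2
    gzip -dc "$fastq_gz" |            # decompress to stdout (same idea as zcat)
        awk 'NR % 4 == 1' |           # keep only the header line of each record
        grep -c ":${barcode}\$"       # $ anchors the match to the end of the line
}
```

Usage would look like `count_barcode_reads ACGTAC sample.R1.fastq.gz`. Restricting the search to header lines avoids counting reads where the barcode happens to occur inside the sequence itself.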

To run something in the background, start it, press ctrl-z, then type bg. To check active jobs, type jobs.
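
If you already know a command should run in the background, a trailing & does the same thing in one step. The following is a small sketch using sleep as a stand-in for a long-running command; the variable names are only for illustration.

```bash
sleep 2 &          # trailing & starts the command in the background straight away
bg_pid=$!          # $! holds the PID of the most recent background command
jobs               # lists the active background jobs of this shell
wait "$bg_pid"     # blocks until that job finishes and returns its exit status
bg_status=$?
echo "background job exited with status $bg_status"
```

Note that background jobs started this way still belong to the login session, so on a cluster a long task is usually better submitted through the scheduler.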

Copy only a certain type of file using rsync

https://stackoverflow.com/questions/11111562/rsync-copy-over-only-certain-types-of-files-using-include-option

cd /Users/erikenbody/Google_Drive/Tulane/WSFW_Data/Genomics_DNA_RNA/DNA/gatk_cluster_output
rsync -zarvm --include="*/" --include="*.pdf" --exclude="*" cyp:/home/eenbody/reseq_WD/GATK_Haplotype_Caller/fst_vcftools .

#or
cd /Users/erikenbody/Google_Drive/Tulane/WSFW_Data/Genomics_DNA_RNA/DNA/angsd_cluster_output/fst_angsd
rsync -zarvm --include="*/" --include="*.pdf" --exclude="*" cyp:/home/eenbody/reseq_WD/angsd/fst_angsd/fst_actual_analysis .
rsync -zarvm --include="*/" --include="*.fst.txt" --exclude="*" cyp:/home/eenbody/reseq_WD/angsd/fst_angsd/fst_actual_analysis .
rsync -zarvm --include="*/" --include="*.csv" --exclude="*" cyp:/home/eenbody/reseq_WD/angsd/fst_angsd/fst_actual_analysis .

cd /Users/erikenbody/Google_Drive/Tulane/WSFW_Data/Genomics_DNA_RNA/DNA/phylotree_cluster_output/
rsync -zarvm --include="*/" --include="*.pdf" --exclude="*" cyp:/home/eenbody/reseq_WD/phylotree/for_plink/Results_treemix .
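
The three filters in these commands work together: --include="*/" lets rsync descend into every directory, the second --include keeps the wanted file type, --exclude="*" drops everything else, and -m prunes directories that end up empty. A possible wrapper for pulling several file types in one pass is sketched below; pull_by_ext is a hypothetical name, and the sketch assumes extensions are given without the leading dot.

```bash
# pull_by_ext SRC DEST EXT [EXT...]
# Hypothetical helper: copy only files with the given extensions,
# keeping the directory layout and skipping directories left empty.
pull_by_ext() {
    local src=$1 dest=$2 n_ext ext
    shift 2
    # Append one --include rule per extension. "$@" is expanded once,
    # before the loop starts, so appending inside the loop is safe.
    n_ext=$#
    for ext in "$@"; do
        set -- "$@" "--include=*.${ext}"
    done
    shift "$n_ext"      # drop the bare extensions, keep the rules
    rsync -azvm --include='*/' "$@" --exclude='*' "$src" "$dest"
}
```

With this, the three angsd commands above could be collapsed into `pull_by_ext cyp:/home/eenbody/reseq_WD/angsd/fst_angsd/fst_actual_analysis . pdf fst.txt csv`.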

Change file extensions

I kinda messed up the file extensions at one point, so it was helpful to rename them in bulk to something simpler.

for file in *.clp.fq.1_fastqc.zip
do
mv "$file" "${file%.clp.fq.1_fastqc.zip}.R1_fastqc.zip"
done

for file in *.clp.fq.2_fastqc.zip
do
mv "$file" "${file%.clp.fq.2_fastqc.zip}.R2_fastqc.zip"
done

for file in *.clp.fq.1_fastqc.html
do
mv "$file" "${file%.clp.fq.1_fastqc.html}.R1_fastqc.html"
done

for file in *.clp.fq.2_fastqc.html
do
mv "$file" "${file%.clp.fq.2_fastqc.html}.R2_fastqc.html"
done
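
The four loops above differ only in the old and new suffix, so they could be folded into one function. This is a sketch rather than what was actually run; rename_suffix is a hypothetical name, and it relies on the same ${file%suffix} expansion, which strips the old suffix from the end of the file name.

```bash
# rename_suffix OLD_SUFFIX NEW_SUFFIX
# Hypothetical helper: in the current directory, rename every file ending
# in OLD_SUFFIX so that it ends in NEW_SUFFIX instead.
rename_suffix() {
    local old=$1 new=$2 file
    for file in *"$old"; do
        [ -e "$file" ] || continue               # the glob matched nothing
        mv -n -- "$file" "${file%"$old"}$new"    # -n: never overwrite a file
    done
}
```

For example, `rename_suffix .clp.fq.1_fastqc.zip .R1_fastqc.zip` reproduces the first loop.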

Copying files from Odyssey

I was not able to transfer files using BBCP from Harvard Odyssey to Tulane Cypress. This may be because Harvard is not set up to run it, although it is installed there. When I used plain scp, it timed out after 40 min. So I set up an interactive session using:

idev -N 1 -t 10:00:00

This requested ten hours on one node. I had to use idev because I couldn't figure out how to log in to Odyssey from a job submission file for transferring files. This way, I can enter my ssh key info and run the transfer interactively. I used rsync at this point, because I had already transferred about half the files.

rsync -a --ignore-existing [email protected]:/n/ngsdata/170728_NB502063_0023_AH7HFWBGX3/ .

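Because scp had been cut off earlier, I noticed that it had stopped midway through transferring a file (21 R1). rsync with --ignore-existing skips any file that already exists at the destination, so it did not repair the partial file. I had to replace the partial file with the full file using the command below.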

scp [email protected]:/n/ngsdata/170728_NB502063_0023_AH7HFWBGX3/Lane1.indexlength_6/Fastq/21-47745-moretoni-Chest_S20.R1.fastq.gz .

I was still worried that the file sizes reported by du -s or du -b seemed off between the original files and the copies on Cypress. So, I decided to check against the md5sum.txt that is provided with the sequencing data. Unfortunately, the columns were switched (md5sum -c expects checksum, two spaces, filename), so I downloaded md5sum.txt, opened it in Excel, swapped the columns, and saved it as .txt (space delimited?). I then opened it in TextWrangler, clicked the 'Text' dropdown, then Detab, and entered 2 for lines, saved it in Unix format, and uploaded it to the cluster using scp. There was probably an easier way to do this, and it took several file format attempts. Once the file was correct and on Cypress, I ran:

md5sum -c updated3_md5sum.txt
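
One candidate for the easier way is to swap the columns directly on the cluster with awk. The sketch below assumes each line of the provided file holds exactly two whitespace-separated fields (filename, then checksum) and that file names contain no spaces; fix_md5_columns and fixed_md5sum.txt are hypothetical names.

```bash
# fix_md5_columns FILE
# Hypothetical helper: rewrite "filename checksum" lines as
# "checksum  filename" (two spaces), the layout md5sum -c expects.
fix_md5_columns() {
    tr -d '\r' < "$1" |                       # drop Windows carriage returns, if any
        awk 'NF == 2 { print $2 "  " $1 }'    # swap the two columns
}
```

Usage would be `fix_md5_columns md5sum.txt > fixed_md5sum.txt` followed by `md5sum -c fixed_md5sum.txt`, run from the directory that holds the fastq files.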

When the second lane was available, I used this to copy it:

scp -rp [email protected]:/n/ngsdata/170824_NS500422_0539_AHWGYJBGX2/ .

Running R on Cypress

Pretty straightforward. The one issue I had: do not run R while an Anaconda environment is active. I could probably troubleshoot this, but it seems to mess up the paths somehow.

To install packages:

module load R/3.4.1-intel
export R_LIBS_USER=/home/eenbody/BI_software/R/Library:$R_LIBS_USER
R
install.packages("ggplot2")
library(ggplot2)
#sometimes had trouble with CRAN; if so, try installing from GitHub instead
library(devtools)
install_github("tidyverse/tidyverse")

q()

Installing Software

FastQC

I downloaded the Linux zip folder and used scp to copy it to the cluster. I can run it from the FastQC folder in my WD using ./fastqc. I was also able to make it available from anywhere on Cypress by adding it to my PATH in .bash_profile (which is sourced at login).

helpful related link: https://www.ccs.uky.edu/docs/cluster/env.html

nano .bash_profile
#add the following line
PATH=$PATH:$HOME/bin:/home/eenbody/Enbody_WD/FastQC
source ~/.bash_profile #only necessary this time, because I hadn't restarted my session
fastqc --help #to check
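
Because .bash_profile can be sourced more than once in a session, a plain PATH=$PATH:... line appends the same directory again each time. A small guard avoids that; this is a sketch, and add_to_path is a hypothetical function name.

```bash
# add_to_path DIR
# Hypothetical helper: append DIR to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;        # append to the end of the search path
    esac
    export PATH
}
```

In .bash_profile this would be used as `add_to_path /home/eenbody/Enbody_WD/FastQC`, and the same call works for any other tool directory.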

Trinity

Note: Trinity is available as a module on Cypress, but I added this path to access the utility scripts that ship with Trinity.

PATH=$PATH:$HOME/bin:/home/eenbody/BI_software/trinityrnaseq-Trinity-v2.4.0/util

multiqc

This is a helpful little software package that compiles all of the FastQC files into a readable report. I installed it locally using pip (it's Python-based) and copied the files from Cypress to my local machine. Harvard generated the FastQC files during processing.

To run on the cluster:

cd /path/to/fastqc/files
module load anaconda
source activate ede_py
multiqc .

rcorrector

I downloaded the repo from GitHub locally, used scp to copy it to /BI_software, and ran make within that directory. This produced an error relating to kmercode, but I was still able to run the software. After some digging, the problem could be jellyfish, but I'm not even sure of this. It was a bit unclear.

Maybe fixed September 19. First, I had to successfully install jellyfish independently of the rcorrector install. I followed the GitHub directions and installed from here: https://github.com/gmarcais/Jellyfish/releases

./configure --prefix=$HOME
make -j 4
make install

But I got an error at make -j, and found this Stack Overflow solution: https://stackoverflow.com/questions/33278928/how-to-overcome-aclocal-1-15-is-missing-on-your-system-warning-when-compilin

touch aclocal.m4 configure
touch Makefile.am
touch Makefile.in

Then I was able to run the two make commands. I added jellyfish/bin to my PATH in .bash_profile. Now I can run jellyfish --help from anywhere.

Next, I had to delete the original version of rcorrector that I installed (because it had tried and failed to install jellyfish, it was now looking for it in the rcorrector folder). This time, it did not download and try to install the jellyfish files, because it found jellyfish in the PATH.

module load git
git clone https://github.com/mourisl/rcorrector.git
cd rcorrector #make has to be run inside the cloned folder
make

This seems to have worked, but I can't run it from my PATH for some reason. Here is the path: /home/eenbody/Enbody_WD/BI_software/rcorrector/run_rcorrector.pl

So to run:

perl /home/eenbody/Enbody_WD/BI_software/rcorrector/run_rcorrector.pl
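
To avoid typing the full path each time, a small shell function in .bash_profile can wrap the call. This sidesteps the PATH problem rather than diagnosing it; rcorrector is a hypothetical function name (not something the package provides) and RCORRECTOR_PL is a hypothetical variable.

```bash
# Script location taken from the notes above; set RCORRECTOR_PL beforehand
# to override it if the install lives somewhere else.
RCORRECTOR_PL=${RCORRECTOR_PL:-/home/eenbody/Enbody_WD/BI_software/rcorrector/run_rcorrector.pl}

# Hypothetical wrapper: pass every argument straight through to the script.
rcorrector() {
    perl "$RCORRECTOR_PL" "$@"
}
```

After sourcing this, `rcorrector` followed by the usual arguments behaves the same as the full perl command.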

cutadapt

Details here: https://wiki.hpc.tulane.edu/trac/wiki/cypress/AnacondaInstallPackage

module load anaconda
conda install -c bioconda cutadapt

This could not finish installing, so I then ran (as suggested):

conda create -n ede_py --clone=/share/apps/anaconda/2/2.5.0

This clones the root environment into my own environment, ede_py. Then I tried again:

conda install -c bioconda cutadapt

But this gave me an error, so I ran the commands below:

conda remove conda-build
conda remove conda-env

Finally:

conda install -c bioconda cutadapt

In a script, I will have to include:

module load anaconda
source activate ede_py

fastx toolkit

Simon suggested using this to make an rRNA reference, but I could not get it to compile on Cypress. I used it on Odyssey instead.

BUSCO

This is used to evaluate the quality of assembled transcripts. It is Python-based software, so I used Anaconda to install it. It should now run when I load anaconda and activate ede_py.

module load anaconda
source activate ede_py
conda install -c bioconda busco

Maintaining scripts

I write scripts locally, then use rsync to keep them synced with my cypress directory.

rsync -aP ~/Google_Drive/Tulane/WSFW_Data/Genomics_DNA_RNA/Bioinformatics_Scripts cyp:/home/eenbody/Enbody_WD
