
Methods: Dry Lab


Computational / Bioinformatics Resources



General


Reproducible Research


Data Management

Below are the rules and principles for organizing and managing datasets.

Data Transfer

Resources for securely transferring large datasets.

  • SendFiles - UGA source for securely sharing sensitive documents and large files online.
  • UGA Google Drive - UGA offers eligible students, faculty and staff access to several Google collaborative tools within the uga.edu domain.
  • The File Transfer nodes (xfer.gacrc.uga.edu) - https://wiki.gacrc.uga.edu/wiki/Transferring_Files
  • Rclone - command-line (CLI) software for file transfer and management (see the example below).
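For example, a quick sketch of both approaches (MyID, the paths, and the rclone remote name are placeholders you would replace with your own):

# Copy a local folder to your scratch space through the dedicated transfer nodes.
scp -r ./my_project_data MyID@xfer.gacrc.uga.edu:/lustre2/scratch/MyID/

# Rclone: copy a folder to a cloud remote you have already set up with 'rclone config'.
rclone copy /lustre2/scratch/MyID/my_project_data ugadrive:my_project_backup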

Naming Files


Git and Version Control


Environment Management

Statistics


Unix/Bash


R


Python


LaTeX


Workflow Languages

Digital notebooks

Storytelling

R Markdown

Jupyter

Reviews

Snakemake


Machine Learning

Machine Learning for Omics data


Data Visualization


High Performance Computing (Sapelo2)

SAPELO2 HIGH PERFORMANCE COMPUTING BASICS

If you’re planning on doing bioinformatic analyses, I STRONGLY recommend getting an account on the HPC. If used correctly, it can really speed up your analyses, and it is WELL WORTH YOUR TIME to learn how to use it efficiently. GACRC provides free initial training to get you started and explain the HPC structure and how to use the system. Plus, it’s free for students!

TOOLS

Bash: You need to have some base knowledge of bash to interact with the HPC. It is entirely command line. If you need a primer, Dr. Wallace recommends the Software Carpentry module “The Unix Shell”. If you have a Unix-based machine (Mac or Linux), you can simply connect with ssh!
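For example, from a Mac or Linux terminal (substitute your own MyID; sapelo2.gacrc.uga.edu is the standard GACRC login host, but check the GACRC wiki if it does not resolve for you):

ssh MyID@sapelo2.gacrc.uga.edu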

SLURM: This is a set of commands that allow you to submit HPC jobs and check on the status of what you are running on the HPC. The key commands are sbatch to submit a script, squeue -u <MyID> to check on your jobs, and scancel <JOB#> to cancel a job.
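In practice the cycle looks something like this (JOB.slurm and the job number are placeholders):

sbatch JOB.slurm        # submit a job script; SLURM prints the assigned job ID
squeue -u MyID          # list your queued and running jobs
scancel 1234567         # cancel a job using the ID shown by squeue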

VSCODE: An integrated development environment similar to RStudio that allows you to connect to the HPC and your file system, transfer files, and write code all in one place. It has a bit of a learning curve, but it is currently the most common way to interact with the HPC. Most folks in the lab use this method to interact with the HPC, and I would recommend it if you are just starting out. I personally don’t use this method because I think it is bloated and I don’t use a lot of the features.

PUTTY: A truly barebones means by which to interact with the HPC. I like the simplicity of this method and I don’t really use a lot of the features of VSCODE so I actually use this method. I appreciate simple things!

WinSCP: This is a Windows tool for transferring files between the HPC and your local machine. It has PuTTY integrated into it. There are some advanced features, but really I appreciate the simple drag-and-drop utility that lets you quickly transfer files. It also has simple built-in text editors that are convenient for adjusting small scripts or parameters. BE WARNED! This tool is inefficient with large files, so if transferring anything larger than a couple of gigabytes I recommend using a different tool like Globus or Cyberduck. NOTE: When connecting WinSCP (or VSCODE) to the HPC for the purpose of transferring files, be sure to connect to the data transfer (xfer) nodes that are designed for this purpose.

RStudio: This is an integrated development environment similar to VSCODE but used specifically for interacting with R. Learn it. Dr. Wallace and I recommend the Software Carpentry module “R for Reproducible Scientific Analysis”. It is powerful for both analysis and data wrangling (basically anything numeric). HOWEVER, if you’re messing with TEXT, do not use R; use Python or, if you’re feeling archaic, Perl, which has solid text-processing features.

HPC QUICK START

  1. Log in to the HPC
    Use Bash, PuTTY & WinSCP, RStudio, or VSCode. You’ll need:
  • MyID and password.
  • Duo (2-factor) authentication.
    Passcode or option (1-3):
  1. Duo Push to XXX-XXX-4304
  2. Phone call to XXX-XXX-4304
  3. SMS passcodes to XXX-XXX-4304

Note: A VPN connection is required for off-campus login.

  2. Navigate to the scratch directory. This is just a simple cd command to get to scratch:
cd /lustre2/scratch/USERNAME
  3. Start an interactive session. When you first log in, you will be on the “login node” in your home directory. You DO NOT want to run programs on the login node. You will be reprimanded, as the login node is for, you guessed it, logging in. Draining the resources of the login node by running programs will slow things down for everyone trying to work on the HPC, which is not courteous.

To get around this we need to start an “interactive session” which is your own little environment where you can work and run programs without impacting other users. To make one run the command below…

interact --mem 4gb

This sets up an interactive session with 4 GB of RAM, which is plenty for most applications unless you are working with large files.

  4. Run your programs. There are many preinstalled programs on the HPC called modules. They are a simple and efficient way to run commonly used software. To see what modules are installed on the HPC, run...

ml av

To load modules use the simplified "ml" command...

ml <MODULE>
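For example (the module name below is an assumption; use whatever ml av or ml spider actually lists on Sapelo2):

ml spider samtools      # search the available modules for anything matching "samtools"
ml SAMtools             # load the default version (use the exact name from the search)
ml                      # with no arguments, lists the modules you currently have loaded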

The HPC is separated into partitions that are used for different things:

  • batch - the standard partition. MAX RESOURCES: 120 GB RAM, 64 cores.
  • highmem_p - if you need more memory for a particular task. MAX RESOURCES: 990 GB RAM, 64 cores.
  • hugemem_p - MAX RESOURCES: 2000 GB RAM, 32 cores.

If a job is taking a long time to run, first check that things are running correctly and consider more efficient ways of solving the problem. If you still need long runtimes, use the extended-time (30-day) partitions below. These may be necessary for things like training AI models and such.

  • batch_30d - the standard extended partition. MAX RESOURCES: 120 GB RAM, 64 cores.
  • highmem_30d_p - MAX RESOURCES: 990 GB RAM, 64 cores.
  • hugemem_30d_p - MAX RESOURCES: 2000 GB RAM, 32 cores.

You submit jobs to the HPC using a special shell script I call a SLURM script. This script has a special header with information about the job you are trying to submit. Below is an example of a basic SLURM script to get you started…

#!/bin/bash
# The shebang line specifies the shell to be used for the script execution.

# SLURM directives (all start with #SBATCH) to specify resource requests and other settings.

# Job name
#SBATCH --job-name=example_job    # Replace 'example_job' with a descriptive name for your job.

# Output and error files
#SBATCH --output=job_%j.out       # File to store standard output (%j is the job ID).
#SBATCH --error=job_%j.err        # File to store standard error (%j is the job ID).

# Partition (queue) to submit the job to
#SBATCH --partition=batch         # Partition (queue) to use, e.g. batch, highmem_p, hugemem_p.

# Number of nodes and tasks
#SBATCH --nodes=1                 # Number of nodes to use (adjust as needed).
#SBATCH --ntasks=1                # Number of tasks (processes) to run (adjust for MPI or multi-threading).

# Number of CPU cores per task
#SBATCH --cpus-per-task=4         # Number of CPU cores per task (adjust for multi-threaded applications).

# Memory per node
#SBATCH --mem=16G                 # Total memory requested per node (adjust as needed).

# Walltime (maximum runtime for the job)
#SBATCH --time=02:00:00           # Maximum run time in HH:MM:SS format.

# Email notifications (optional)
#SBATCH --mail-user=[email protected]       # Replace with your email address.
#SBATCH --mail-type=ALL                     # When to send email: NONE, BEGIN, END, FAIL, ALL.

source ~/.bash_profile              # Source your bash profile to set up your environment.

# Load necessary modules (customize based on your job's requirements)
module load python/3.9               # Example: load Python module.
module load gcc/10.2                 # Example: load GCC compiler.

# Execute the job's commands (replace with your actual job commands)
echo "Starting job $SLURM_JOB_ID on $(hostname)"
python my_script.py                  # Replace with the command to run your job.

# End of script

You can submit jobs to the HPC with the simple command:

sbatch JOB.slurm

You can submit many jobs at once, but be careful; see #6 in the Notes on Best Practices section below.
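For example, one common pattern is a loop that submits one job per input file. This is only a sketch: map_reads.slurm (a SLURM script that takes a FASTQ path as its first argument) and the paths are hypothetical.

# Submit one mapping job per FASTQ file in a directory.
for fq in /lustre2/scratch/MyID/reads/*.fastq.gz; do
    sbatch --job-name="map_$(basename "$fq" .fastq.gz)" map_reads.slurm "$fq"
done

If you do something like this, double-check --mail-type in the script first; thousands of jobs each emailing BEGIN/END/FAIL notifications is exactly the failure mode described in #6.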

Notes on Best Practices:

  1. BACK UP YOUR WORK!!!!!! Sapelo2 is not a permanent repository; they WILL delete your work if it is older than 30 days, and they will not tell you. If you want to keep something, BACK! IT! UP! Have AT LEAST 3 copies: one local, one on a local external drive, and one on a remote server in case the building burns down. It is your responsibility as a publicly funded scientist to ensure that your data and analyses are RETRIEVABLE, REPRODUCIBLE, and INTERPRETABLE in perpetuity. ANYONE should be able to look at your work AT ANY TIME and be able to figure out what you did and how you did it. This level of organization will serve you well throughout your graduate studies and in your future career. DO NOT UNDERESTIMATE ITS IMPORTANCE.
  2. Comment your scripts. This is related to backing up your work. Dr. Wallace can provide an example of a well-commented script to give you guidance on his expectations. Well-commented scripts allow others to interpret how you conduct your analyses and are REQUIRED. Don’t make Dr. Wallace do program archaeology on your work. It wastes time and it’s not fun.
  3. Do not be afraid to contact GACRC for help. They have full-time support staff to help you fix problems that you can’t solve. If something that should run isn’t running, is throwing an error you can’t figure out, or you can’t install a piece of software you need, start a help ticket. They will help to the best of their ability. This is an underused resource.
  4. Be conscientious of your resource use. Pay attention to the resources you are requesting in the SLURM header. Remember this is a shared resource, and you don’t want to request what you are not going to use. Having said that, if you need a lot of resources to run a job, and you’ve ensured that the job will run as efficiently as possible, USE THE RESOURCES! The HPC is a computational sledgehammer! You can really get things to run fast if you use it correctly.
  5. Parallelize. The HPC excels at running a lot of things simultaneously. Be cognizant of when you can break a computational task into sub-tasks to speed things up. This will SIGNIFICANTLY reduce the time required to run jobs if you parallelize at every opportunity.
  6. Don’t be afraid to experiment with ways to increase efficiency, but be careful. Storytime… When I was in grad school I was experimenting with ways to parallelize scripts on the HPC by writing a program to make thousands of scripts that would map all the RNAseq datasets in NCBI to the maize genome in parallel. I tested it on 5 SRA accessions and it worked fine, so I ran all of them… Well, it turns out I didn’t disable the email status updates line in my SLURM scripts. I then received well over 10,000 status emails and crashed the email server on the HPC, which cascaded to crashing the entire HPC…. I contacted the HPC admins to apologize, which they appreciated, and we talked about how to avoid this issue in the future. Ultimately, they wrote a special piece of software for the mail server because of my mistake to prevent this from happening again. The moral of the story is that it’s OK to experiment to speed things up, but mistakes can happen, so try to be conscientious of what you are doing. And if you royally screw things up as I did, be sure to own up to it and work with the HPC admins so that they can prevent it from happening again. You never want to be on bad terms with an IT guy. Seriously.

Bioinformatics

QUANTITATIVE GENETICS (Specifically QTL & GWAS Analyses)

These analyses are intended to identify regions of the genome that are associated with a particular phenotype. To run these types of analyses you need to have genotype data for a population and the corresponding phenotype data for that population. Usually, genotype data comes in a variant call format file (.vcf) or a HapMap file (.hmp). Occasionally though you may receive raw reads that you have to convert to genotypes. The phenotype files are almost invariably simple csv files. The core workflow is filtering and preparation of genotypes/phenotypes followed by a simple QTL analysis. Be aware that 90% of the work for these types of analyses is doing quality control and filtering on your data.

TOOLS

Tassel: This is a GUI-based program that contains all the tools required to conduct a GWAS or QTL study and is a great place to start if you have a vcf/hmp file and haven’t done one of these analyses before, or if you have limited command line experience. There are some quirks and a bit of a learning curve, but it is a solid starting point.

Cutadapt: This is a relatively harsh read trimming program. It works well but I find the default parameters to be stringent, which can result in the loss of good data. It’s a good program but check your results with FastQC.

Trimmomatic: This is an alternative read trimming program. It works well but it can be a little clunky to run. It’s a good alternative to Cutadapt but again, check your results with FastQC.

FastQC: This is a critical tool used to do quality checks on your reads. It quantifies all kinds of metrics on your reads including things like read length, GC content, and adapter content which are useful to evaluate how well your initial read processing was conducted.

multiQC: If you have a ton of fastq files that you want to do quality control on you can use multiQC which is essentially a wrapper for FastQC that can accommodate large numbers of FastQC outputs and parse them into something meaningful.
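As a rough sketch of how these three fit together (the adapter sequence, file names, and output directory are placeholders; pick Cutadapt options that suit your libraries):

# Trim an adapter and low-quality tails from single-end reads (placeholder adapter).
cutadapt -a AGATCGGAAGAGC -q 20 -o sample_trimmed.fastq.gz sample_raw.fastq.gz

# Run FastQC on the trimmed reads, then roll all reports into one MultiQC summary.
mkdir -p fastqc_out
fastqc -o fastqc_out sample_trimmed.fastq.gz
multiqc fastqc_out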

Bowtie/Bowtie2: This is a read mapping program used to take sequence reads and align them to a reference genome. As such, this tool requires a fasta file containing the genome, which you can download from any number of online repositories such as NCBI, Gramene, or Phytozome. Bowtie and Bowtie2 are older programs, but I find they work quite well and reliably for everyday short-read mapping. They are also widely used in the literature, though many other mappers exist that may be more appropriate for your reads (for example, minimap2, which is used for long PacBio/Nanopore reads).

BWA: Another simple short read mapper that is commonly used in the literature. It is also an older read alignment program but again it is relatively fast and reliable. It is a good alternative to Bowtie/Bowtie2.
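A minimal paired-end example with either mapper (genome and read file names are placeholders):

# Bowtie2: build the index once, then map paired-end reads with 4 threads.
bowtie2-build genome.fa genome_idx
bowtie2 -x genome_idx -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -S sample.sam -p 4

# BWA-MEM alternative using the same inputs.
bwa index genome.fa
bwa mem -t 4 genome.fa sample_R1.fastq.gz sample_R2.fastq.gz > sample_bwa.sam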

HTSlib:

  • SAMtools: This suite of tools contains various useful functions for the manipulation of SAM/BAM files, which are the output of read mappers like Bowtie and BWA. It is used to sort and index your BAM files, which will allow you to visualize them in something like IGV or Geneious (see the comparative genomics section for details on these programs).

  • BCFtools: BCFtools, specifically the bcftools call function, is used to convert the sorted, indexed BAM files from SAMtools into a VCF-format SNP file you can use in downstream analyses. The parameters you use are heavily dependent on your system, so I recommend reading the manual to make your choices (see the sketch below).
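A minimal sketch of the SAM-to-VCF path (file names are placeholders; as noted above, read the BCFtools manual before settling on calling parameters):

# Sort the alignment, write it as BAM, and index it.
samtools sort -o sample.sorted.bam sample.sam
samtools index sample.sorted.bam

# Pile up reads against the reference and call variants (-m multiallelic caller, -v variants only).
bcftools mpileup -f genome.fa sample.sorted.bam | bcftools call -mv -Oz -o sample.vcf.gz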

R: This will be used to visualize and evaluate your phenotype and genotype data, and it can serve as the backbone of your analysis. There are tons of packages available in R that can be used to filter and evaluate genomic and phenotypic data.

  • lme4: Contains functions for modeling data using linear regression. Useful to generate BLUPS for your phenotype data.
  • rQTL2: A series of functions for running QTL analyses. It has some annoyingly specific file input requirements which are outlined in the manual. However, once these files are assembled and formatted correctly it is extremely simple to identify QTL.
  • ggplot2: Use this to visualize literally anything and everything. Learn it and it will serve you well (see data carpentry module “Data Analysis and Visualisation in R for Ecologists”). You can use generative AI to help you fine-tune things on the plots. GGPLOT2 Basics - Data Carpentry lesson introducing GGPLOT2.

Notes on Best Practices:

  1. You will spend most of your time on these analyses doing quality control and filtering. This is normal and should be expected. QTL and GWAS analyses are dependent on well-filtered, clean data. And as you likely know there is no such thing as “clean” raw data.

  2. Import phenotype data into R and plot each phenotype as a scatter and a histogram. This seems trivial but it is actually CRITICAL. You want to look for strange patterns in the scatter that would indicate data entry issues or data duplication. Also, check the histograms to see if the data are generally normally distributed. These simple things give you a feel for the data and the general range of phenotype values. It will also help you identify and remove obvious outliers.

  3. Look for patterns. Your beautiful brain is REALLY good at spotting patterns in data. Patterns can be good or bad depending on your data but bottom line if you observe a pattern, figure out WHY you’re observing the pattern. That is the only way to know how to proceed with the analysis.

  4. Use an R notebook or similar markdown when doing these analyses. It is a convenient way to share analyses and make them interpretable and reproducible. People in the lab have used GitHub for this. I have used R markdown. It’s really up to you which you choose but use something with markdown so you can make clean, sharable, and reproducible reports for us and our collaborators.

  5. Check the minor allele frequency. This is a useful quality control check for your genotypes to ensure that the population is behaving as expected. For example, a BC1S3 population in maize should have a MAF of ~0.25. However, it’s a major red flag if your supposed BC1S3 pop has an estimated MAF of 0.5. It’s a simple quality control check but gives you confidence in the quality of your data (see the sketch after this list).

  6. Run PCAs on your genotypes. This is yet another simple quality control measure you can employ to evaluate whether or not your genotypes can be trusted. The PCA can be run in R or Tassel and is a means by which you can cluster your data. Essentially the closer two points are in a PCA the more similar they are. It is an effective means to look for relationships between your genotypes. Again, you are looking for outstanding patterns. Are there sub-clusters? Do the sub-clusters make sense with how the population was made? If not, you may have genetic contamination in your population and need to toss out some individuals.

  7. Visually inspect your data. This cannot be stressed enough LOOK AT YOUR DATA. It is very easy to just roll through each step of an analysis and assume everything worked correctly but it is vital to your success to investigate the data at each step to ensure things are operating as you expect.
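For item 5, one command-line way to get per-site minor allele frequencies straight from a VCF is the BCFtools fill-tags plugin (a sketch with placeholder file names; you can just as easily compute MAF in R or Tassel):

# Annotate each site with a MAF tag, then print chromosome, position, and MAF for a quick look.
bcftools +fill-tags genotypes.vcf.gz -Oz -o genotypes.maf.vcf.gz -- -t MAF
bcftools query -f '%CHROM\t%POS\t%INFO/MAF\n' genotypes.maf.vcf.gz | head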

COMPARATIVE GENOMICS (Comparisons Between Genomes)

These types of analyses are useful for evaluating the relationships between genomes both within the same species and in different species. When looking at crops, wild ancestors are often sequenced and compared to modern genotypes to identify changes related to crop domestication. From the perspective of basic science, these analyses can be used to evaluate evolutionary changes/relatedness and can provide context into the evolutionary and genetic history of an organism(s). From a practical perspective, these studies can give insight into the mechanism and genetic changes that take place during domestication.

TOOLS

Geneious: This is the “Do Everything” program of bioinformatics which is its greatest strength and biggest weakness. It can essentially do every analysis but it’s a beast to learn. It excels at annotation which is critical for comparing genomes. However, it is limited by the power of your local machine. You can probably connect it to the HPC but I don’t know how. It requires a paid subscription, but UGA has an enterprise version of the software you can use for free with no restrictions. There is a learning curve but it can be really useful, especially if you want to organize everything in one place. It’s also useful in the wet lab for designing primers or annotating/designing constructs.

IGV: Short for Integrated Genomics Viewer. This tool is used to visualize indexed sorted bam files and annotation files (specifically gff3 files). This program is awkward but lets you easily visualize SNPs and coverage of alignment files overlaid onto an annotated genome. I often use this for quality control checks for alignments generated with Bowtie/Bowtie2, and bwa. This can also be valuable when trying to look at the positions of SNPs within genes or evaluating haplotypes for a particular gene among multiple datasets. You can also use it to help identify small structural variations such as small (<10kb) insertions, deletions and/or transposable elements.

NCBI CDD: The conserved domain database is a useful tool for investigating functional regions of genes or retrotransposons. It’s as simple as copying your sequence to the text box and hitting run. It provides a visualization of functional domains in the sequence, their reading frame, and orientation.

BLAST: This is a powerhouse tool for comparative genomics. It is a quick means to identify the positions of similar DNA sequences on a physical map. The tool is available as a web app and as a command-line application. I strongly urge you to use the command-line application. It will produce a tabular output with the coordinates of all the sequences similar to the query in the subject genome, with the “best” hit listed first. You can create Python scripts to parse this output to do things like identify orthologs and paralogs, find duplications, search for gene families, identify TEs, and a host of other things.
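A minimal command-line example (file names are placeholders; -outfmt 6 produces the tabular output described above):

# Build a nucleotide database from the subject genome, then search it with your query sequences.
makeblastdb -in subject_genome.fa -dbtype nucl
blastn -query my_genes.fa -db subject_genome.fa -outfmt 6 -evalue 1e-10 -out hits.tsv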

Mauve: This is an alignment program with both GUI and command line interfaces. The command line is VERY clunky to use but gives you much more control over the alignment parameters so I prefer it. This program was originally designed to align whole bacterial genomes but it works surprisingly well to align large segments of DNA (up to ~200kb). I have used this tool to align 26 whole maize genomes. The key is to make sure both ends of the sequences you want to align share an “anchor” gene and ensure there is no gap extension penalty (this allows for large insertions and deletions caused by TE insertions). You can write a Python script to stitch the subalignments together to make truly massive whole genome alignments efficiently this way. I also used this method to compare whole centromeres at the nucleotide level!
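The basic command-line invocation is just progressiveMauve with an output file and the sequences to align (a sketch with placeholder file names; the scoring and gap-penalty options mentioned above are additional flags described in the Mauve manual):

progressiveMauve --output=region_alignment.xmfa regionA.fasta regionB.fasta regionC.fasta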

NCBI: While this is a data repository rather than a program, it is still a “tool” per se and worth listing. There is a mind-boggling amount of free data available to help you supplement your analyses if you look for it. NCBI is a wonderful resource to find genomic, transcriptomic, epigenetic, proteomic, and SNP data on all manner of organisms. I urge you to investigate what is available for your system and think about what you can use to supplement or improve your analyses. Just be sure to take note of what you used (i.e., specific accession numbers) so you can credit the original study that produced the data. There are also tons of databases that you can investigate to answer all kinds of biological questions. I encourage you to peruse what’s available at your leisure.

MUSCLE: This is another alignment program, but MUSCLE excels at aligning two or more shorter lengths of DNA (<10kb). The command line interface and parameters are easy to use and interpret. It’s also quick and reliable, making it a solid aligner for everyday use. You can use this to do nucleotide-level comparisons over short DNA lengths.

MAFFT: If you need to align THOUSANDS of short (<10kb) sequences (such as 16S sequences) it’s usually better to use MAFFT, which is way more efficient at the cost of slightly reduced alignment accuracy. The command line interface and parameters are easy to set up, but I only use this when MUSCLE is ineffective.
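Minimal examples of both (placeholder file names; note that MUSCLE changed its command-line syntax at v5, so check which version you have loaded):

# MUSCLE v5 syntax; older v3 releases use: muscle -in seqs.fasta -out aln.fasta
muscle -align seqs.fasta -output aln.fasta

# MAFFT with automatic strategy selection; the alignment is written to stdout.
mafft --auto many_seqs.fasta > many_seqs_aln.fasta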

Python: Because most of what you are working with will be text-based tab-delimited files I recommend using Python as your core scripting language. It’s highly flexible and widely used in both industry and academia and has a ton of packages that you can use to work with all kinds of bioinformatics data. It also integrates fairly well with Bash and R.

Notes on Best Practices:

  1. The key to good comparative genomics is coordinates and context, specifically the nucleotide coordinates to the base and the genome that the coordinates come from (the context). Attention to detail is everything.

  2. Check your alignments. Looking at the alignments you make, in either MEGA or Geneious, is necessary to ensure that things are in fact aligned. Aligners will run and provide an output even if two sequences share no similarity, but the resulting alignment will be garbage. You MUST LOOK at the alignments to ensure that the sequences are actually aligned.

  3. Check your orientations. If you’re trying to align two sequences that should align (i.e. gene paralogs) and they are producing a poor alignment, check the orientation of your sequences. It’s possible some are forward and some are reverse which will result in a garbage alignment.

  4. Comparative genomics can be tedious. Sometimes the work is just a slog especially if you have to inspect different regions of an annotated genome for epigenetic, transcriptomic and variant data over a bunch of genes and make interpretations on that data. If you can make a script to do something MAKE THE SCRIPT. It is far more reproducible which makes up for any extra time you take developing it.

  5. Regularly check on available data. Many comparative genomics projects use publicly available data which is periodically released to repositories like NCBI and Gramene. Regularly check on these repos to see if there are new datasets that you can add to your analyses.

  6. Annotation is a very powerful tool. Most genomes come with annotations but you can add your own annotations to sequences with tools like Geneious. This is EXTREMELY useful for comparing sequences. When generating alignments or looking for specific features in a genome, create annotations for those features. They will greatly simplify your interpretation of the comparisons.

Microbiome & Metagenomics

  • Kraken2 - Classify sequences as belonging to likely species based on k-mers
    • Pro tip: Loading the database can take a long time. If doing a lot of different runs, it can be much faster to copy your database to /dev/shm (a virtual RAM disk; basically preloading it) and use the --memory-mapping option.
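A sketch of that pattern (database and read paths are placeholders; --memory-mapping, --report, and --output are standard Kraken2 options):

# Stage the database onto the RAM disk once, then reuse it across many runs.
cp -r /path/to/kraken2_db /dev/shm/kraken2_db

kraken2 --db /dev/shm/kraken2_db --memory-mapping --threads 8 \
    --report sample.kreport --output sample.kraken sample_reads.fastq.gz

# Clean up when finished so the shared RAM disk is freed for other users.
rm -rf /dev/shm/kraken2_db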