-
Notifications
You must be signed in to change notification settings - Fork 7
Bacterial RNASeq Analysis
Welcome to the EDAMAME RNASeq wiki!
Learning objectives:
Analyze a RNASeq dataset- from raw data > quality filtering > mapping > counting reads > differential expression.
What this tutorial will teach you: Referenced-based bacterial RNASeq
The dataset:
Lactobacillus reuteri JCM 1112 is considered to be a natural inhabitant of the human gut that confers health benefits to its host (Morita H. et al. 2008. Comparative genome analysis of Lactobacillus renteri and Lactobacillus fermentum reveal a genomic Island for reuterin and cobalamin production. DNA Res). Recent studies reveal a trend between a reduced abundance of endogenous L. reuteri and inflammatory disease (Mu Q. et al. 2018. Role of Lactobacillus reuteri in human health and diseases. Front Microbiol 9:1–17). Thus, probiotic supplements are a promising way to re-inhabit the the gut with L. reuteri. But, the efficacy of any orally supplemented probiotic depends on the ability to survive transient environmental conditions before reaching its site of colonization. Undergraduate students at a Michigan State microbiology lab exposed L. reuteri JCM 1112 to pH 3.0 (stomach pH) and pH 6.2 (~pH at site of colonization in the proximal digestive tract). RNA was extracted and sequenced to perform differential gene expression analysis to understand how L. reuteri JCM 1112 alters its gene expression to survive at low pH.
The analysis platform: Galaxy
We will use the galaxy platform to organize the workflow and analyze the data.
Access the galaxy platform and register here
Projects in Galaxy are saved in the "History" section on the right side of the Galaxy interface. You can organize one history for one project or, you create multiple histories for one project by outputting files from the workflow to new histories. I recommend the latter so that sharing the raw files is much easier between users if they intend to perform a different analysis with the files.
To share a history with other users, click on the options/settings icon under the History tab and go to Share or Publish. This will provide you with a link to share your dataset with other users.
Access the L. reuteri JCM 1112 transcriptomic data here
After clicking the link, click on "Import history" to make a copy of the dataset to your history.
This should now become your default history aka the datafiles that will be used to run a workflow.
You should see the following files in this history:
- .fa file - complete genome sequence for reference organism
- .gtf file - gene annotation file.
- 6 fastqc files
You do not need to perform the following steps for obtaining the .fa and .gtf files. These files have been provided for you. Regardless, I'll briefly show you how to obtain and prepare the .fa and .gtf files so that this process is applicable to any organism.
Go to the NCBI database and search the "Assembly" database for your organism of interest.
You can obtain these files from GenBank, RefSeq, USCS, Ensembl Bacteria, and others. I recommend obtaining the .fa and .gff files from the same source because there may be slight differences in formatting/naming between different databases.
We'll go ahead and grab the required files from GenBank because they contain the locus tag names of interest (e.g. LAR_0001 for GenBank vs. LAR_RS00005 for RefSeq). You'll be prompted to sign in to the FTP server. Choose to continue as a guest. Download the following files:
Now, we need to upload these files to your Galaxy workspace. Remember, files will be uploaded to your current history. Return the galaxy website and upload your files:
Find your files of interest on your local disk:
To proceed, you would need to make sure that you have uploaded all your files of interest (fastq, fa, gff/gtf). Once uploaded, we need to finish preparing the gff file. The gff file needs to be converted to a gtf file. Under tools, search for "gffread":
Click on the tool and adjust the parameters so that the the Input is the .gff file and the Feature File Output is GTF:
All other parameters can remain as default. Scroll to the bottom and click on execute:
Confirm that the process completed successfully. In your history you should see a new gtf file and, this should be green. When green, it means that the process has completed and can be used for downstream analysis. If red, there was an error during execution. Click on the file and read the error report for guidance on troubleshooting the issue.
This is how you would prepare the .fa and .gtf files for your organism of interest. Continue with analysis.
The 6 fastqc files contain 3 independent replicates for L. reuteri JCM 1112 in pH 3.0 and pH 6.2.
File Name (..._L005_R1_001.fastq.gz) | pH Condition | Treatment(T)/Control(C) |
---|---|---|
2B_S20 | 3.0 | T |
3B_S9 | 3.0 | T |
4B_S21 | 3.0 | T |
2A_S17 | 6.2 | C |
3A_S6 | 6.2 | C |
4A_S18 | 6.2 | C |
Now, let's start the analysis.
The following steps include:
- Prepare workflow up to counting reads.
- Run workflow.
- Export output from workflow to remove unnecessary sequences. Re-upload data to Galaxy.
- Prepare workflow for differential gene expression analysis.
- Run differential gene expression analysis and analyze output.
Let's start making a workflow to analyze our RNASeq data!
Click on the Workflow tab and add a new workflow. Name it and save it. You'll be directed to a workflow canvas.
Let's begin to add to the workflow canvas.
On the left, you'll see a tool tab consisting of various commands and packages we can add to the workflow. First step is to input the data. Click on the "Inputs" link and click on the "Input dataset collection". You'll notice this has now been added to your workflow canvas.
Let's place FastQC into the workflow. Use the search bar in the Tools tab to locate FastQC and add it to the workflow.
We now need to direct the steps of the analysis. Locate the ">" symbol on the right side of Input dataset collection. Click on it and drag it to the first tab on FastQC ("Short read data from your current history").
Additionally, click on the symbol to the right of html_file under FastQC. This is a command to create output files. Only user-selected files will be provided as output
Click on the the FastQC tool and notice how a details page appears on the right side.
The details window allows you to perform operations such as switching versions of the tool and adjusting parameters.
In addition, the details window provides information on the tool and relevant citations:
If you ever wanted to create a workflow using command line, most of these tools provide the command line options within the details window, too.
Let's finish making the workflow.
Use the tools search bar to find the remaining tools and recapitulate the workflow as shown below. Make sure you also select the appropriate output files.
Recall, the steps we'll take for this first part of RNASeq analysis include:
FastQc- Quality checking data RNASeq reads
Trimmomatic- removing low quality sequences
Bowtie- mapping reads to a reference genome
HTseq- counting reads throughout the genome for each sample
Notice:
- The two input files for Bowtie and HTSeq are files provided to you. Both "Input Reference Genome" and "Input Annotation File" were created from the "Input dataset" command (not "Input dataset collection") under "Inputs" in the toolbar. They were renamed under "Label" in the Details window on the right:
- Your default Bowtie2 may not have the option for "mapping_stats" output. If that's the case, click on the Bowtie2 program and select the latest version under Details.
In addition, change the following parameters under the details section for Bowtie2
- make sure the select Single-end for library type under details for Bowtie2.
- Select "Use a genome from the history and build index" so that you can input your own reference genome:
Almost ready to run! Let's save our workflow.
Return to the Workflow tab. Before we start, we need to create a collection of input files. On your history tab on the right, locate the 6 fastq files. Click on the icon that looks like a checkmarked box. Select the 6 files and then click on For all selected > Build Dataset List. Provide a name and create the list.
A new data file should appear in your history which is now a collection of all six data files.
Return to your Workflow tab and let's begin analysis!
Click on your workflow icon and click Run
You'll now be able to make sure the correct files are selected and review parameters before running the analysis.
- I suggest writing the output to a new history. This keeps each step organized and allows you to share raw data with others without the additional output files from your analysis.
- Make sure the input dataset collection file is the collection of 6 files you just created.
- Make sure the reference genome has a .fa/.fna extension and the annotation genome has a .gtf extension
Review parameters and then Run Workflow.
The status of the analysis can be viewed by looking under the history tab.
The Galaxy Server may be running slow. Import my output from the workflow to continue with downstream analysis
Output files from the workflow executed in step 2 are saved to a new history. You need to navigate to this new history.
Then, switch your history.
You'll notice a collection of outputs from your workflow.
You can view outputs from each command but for now, we are most interested in the HTSeq output. Click on the HTSeq output and download the collection of files
/
This should be a zipped file.
Before we proceed, we'll need to use terminal.
If you are a Mac User, follow from here: If you are a Windows user, scroll down to find the Windows-specific section.
Mac users have a terminal app. This is located within Finder: Applications > Utilities > Terminal.app
Drag the Terminal app to your dock so it's easier to access.
Open the terminal.
Also open the GUI (Finder) to look at your directories (aka folders).
Using the GUI, navigate to the "Download" directory and within it, make a new directory called "Galaxy".
Move the zipped file into the Galaxy directory.
Double click the file and unzip the file.
You should now have a new directory. Move into that directory and you should now have 6 files with the .tabular extension.
Right click on one of these files and click on "Get Info".
When a new window opens, locate the complete path to the htseq directory (highlighted in blue). Copy this complete path.
Now, switch back to using terminal.
You'll need to navigate to the htseq directory. This is where you'll need the copied path from before. Perform the following command in terminal:
cd /path/to/the/htseq/directory
#using my path
cd /Users/JohnChod2130/Downloads/Galaxy/htseq-count on collection 149
###This won't work if a directory has spaces. Put quotes around complete path
cd "/Users/JohnChod2130/Downloads/Galaxy/htseq-count on collection 149"
To make sure you are in the correct directory. Type
ls
This lists all files/directories in your current directory. Also try
pwd
This prints working directory (pwd). Make sure the path ends in the htseq directory.
Within this htseq directory within the Galaxy directory, type the following command
nano Galaxy_script.sh
You're now editing a file using a text editor. We'll create a script to automate removal of sequences we don't want for differential gene expression analysis. Copy and paste the following into the text editor:
###For mac users
for filename in *.tabular; do
printf "%s\n" "$filename"
sed -i '' "/EBG0000/d" ./$(printf "%s\n" "$filename")
sed -i '' "/LAR_16SrRNA/d" ./$(printf "%s\n" "$filename")
sed -i '' "/LAR_23SrRNA/d" ./$(printf "%s\n" "$filename")
sed -i '' "/LAR_5SrRNA/d" ./$(printf "%s\n" "$filename")
sed -i '' "/LAR_tRNA/d" ./$(printf "%s\n" "$filename")
done
Then, to exit and save, hit the following keys in sequence
control + X
Y
You should have returned to the terminal.
Now, copy and paste the following command:
chmod 755 Galaxy_script.sh
This command provides you permission to run the .sh script you just made with the text editor.
Finally, let's run the script
./Galaxy_script.sh
End of Mac-specific section
You need to download a terminal program. I recommend MobaXterm: https://mobaxterm.mobatek.net/download.html
Download the "installer edition" for MobaXterm. Open MobaXterm and "Start local terminal"
You may be prompted to download the CygUtils plugin. Download it and then move the plugin into the same directory where your MobaXterm executable it located.
Restart and open MobaXterm. Then, start local terminal.
Type on the command line:
###Note that these need to be backwards slashes
cd C:\\
Also open the GUI (File Explorer) to look at your directories.
Using the GUI, navigate to the "Downloads" directory and within it, make a new directory called "Galaxy". Move the zipped HTSeq data set collection you downloaded from Galaxy into the new directory you just created.
If you want, download an unzipping application (http://www.7-zip.org/ to unzip the file or, we can unzip using MobaXterm (see below).
Right click on zipped file and click on "Properties".
When a new window opens, locate the complete path to the Galaxy directory (highlighted in blue). Copy this complete path.
Now, switch back to using MobaXterm.
You'll need to navigate to the new directory you created containing the HTSeq data set using terminal. This is where you'll need the copied path from above. Perform the following command in terminal
cd /path/to/the/Galaxy/directory
#using my path as an example
cd C:\Users\EDAMAME\Downloads\Galaxy
This initially won't work. You'll receive an error:
"No such file or directory"
Press the up arrow to reveal the previous command. Now, change all the backward slashes to forward slashes and hit enter.
cd C:/Users/JohnChod2130/Downloads/Galaxy
This should work.
To make sure you are in the correct directory. Type
ls
This lists all files/directories in your current directory. Also try
pwd
This prints working directory. Make sure the path ends in the Galaxy directory.
Now, let's unzip the file.
Perform the following commands:
gzip -d "name of tgz zipped file"
###Using the example above:
gzip -d 26_htseq-count on collection 17.tgz
###Output should be 26_htseq-count on collection 17.tar. Now, perform the next command:
tar xvf "name of tar file"
###For example, using output from the example above
tar xvf 26_htseq-count on collection 17.tar
This has now created a new directory within the Galaxy directory. Navigate into that directory as such:
cd "name_of_directory"
###Using the example above
cd 26_htseq-count on collection 17
Note: We're about to use the vim as a text editor. Feel free to use any text editor you prefer.
Once you have navigated to this new directory, type the following command:
vim Galaxy_script.sh
You're now editing a file using a text editor. We'll create a script to automate removal of sequences we don't want for differential gene expression analysis. Copy and paste the following into the text editor:
###First type
:set paste
###Hit "enter". This allows you to copy and paste from other sources without auto-indentation
###Now type
i
###This allows you to begin inserting text
###Then copy and paste the code below
for filename in *.tabular; do
printf "%s\n" "$filename"
sed -i "/EBG0000/d" ./$(printf "%s\n" "$filename")
sed -i "/LAR_16SrRNA/d" ./$(printf "%s\n" "$filename")
sed -i "/LAR_23SrRNA/d" ./$(printf "%s\n" "$filename")
sed -i "/LAR_5SrRNA/d" ./$(printf "%s\n" "$filename")
sed -i "/LAR_tRNA/d" ./$(printf "%s\n" "$filename")
done
###Then, to save and exit, press
esc
###And finally
:x
You should have returned to the terminal.
Now, copy and paste the following command:
chmod 755 Galaxy_script.sh
This command provides you permission to run the .sh script you just made with the text editor.
Finally, let's run the script
./Galaxy_script.sh
End of Windows-specific section
What is this script doing? This script goes through each file and removes 16s, 23s, 5s, and EGB (illumina internal control) sequences. During library prep, mRNA is enriched my removing rRNA sequences but removal of rRNA is not 100% efficient. That is why we are removing these sequences.
Now, we need to upload these files back to Galaxy.
Return to the Galaxy interface and in the History tab, click on History options > Create new
A new screen will pop up and "Choose local file"
Navigate to the directory containing your .tabular files and select all six. Make sure all six files have been selected and then hit start.
You can now close that screen and look at the new history tab. When all files have been uploaded successfully, they will be green.
Import from my history if the prior steps did not work.
As we did before, we need to make two data set lists- one for control and one for treatment (look earlier in the workflow if you need a refresher on creating data set lists).
Recall the treatment/control files specified before:
File Name (..._L005_R1_001.fastq.gz) | pH Condition | Treatment(T)/Control(C) |
---|---|---|
2B_S20 | 3.0 | T |
3B_S9 | 3.0 | T |
4B_S21 | 3.0 | T |
2A_S17 | 6.2 | C |
3A_S6 | 6.2 | C |
4A_S18 | 6.2 | C |
Once completed, your history should look as follows:
Go back to workflows and create a new workflow for DESeq analysis.
For this workflow, add two input dataset collections (These are renamed for convenience) and DESeq2 from the toolbar.
Make sure factor 1 is the treatment and factor 2 is the control. The first factor is considered the primary factor that affect gene expression (fold changes will reflect up and down regulation in the factor 1 with respect to factor 2).
Under Details for DESeq2 on the right side, you need to fill in the following information (Factor Name, Factor 1 label , and Factor 2 label):
Additionally, output the normalized counts table.
Save the workflow and let's run it.
Return to your workflows and run the new workflow you just made.
Make sure Input 1 is the treatment and Input 2 is the control
Once completed, we can look at the output. You should see three output files.
Click on DESeq2 Plots and visualize the data (eye icon).
You should see plots of the data. The first plot is a PCA plot:
Samples closer together are more similar in gene expression. It is apparent that the treatment has resulted in a distinct transcriptional response. Additional plots are provided including hierarchal clustering based on sample distances, dispersion estimate plot (used for estimation of parameters for GLM fitting), p-value frequency plot and, a MA plot that plots log2 fold changes (treatment vs control) vs read counts. Red points indicate genes with an adjusted p value < 0.1.
Now, take a look at the DESeq2 result output:
You'll see a table of genes that include, most importantly, the log fold change and p-value.
Negative and positive log fold changes indicate genes in the treatment that are down-regulated and up-regulated, respectively.
We can download this file if we click on the history item:
You can order the output by log fold change or p-value to focus on genes of interest.
Lastly, we can download the normalized counts file:
Although beyond the scope of this tutorial, this file can be used for downstream analyses that focus on gene sets rather than individual genes (e.g. KEGG pathway analysis or gene set enrichment analysis).
The data files containing the DESeq2 output and normalized counts table only provide the locus_tag, not the gene annotations.
Gene annotations were obtained from NCBI and placed alongside its locus_tag. That file can be accessed here
Download the file.
If you obtained genes of interest (e.g. priority determined by log-fold change/p-value), you can now obtain the annotations for these genes, too. Match the locus_tag for the highest fold-changes with the annotation file to look at the annotations of some of these genes.
End of the tutorial
Suggested readings:
- Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A. 2016. A survey of best practices for RNA-seq data analysis. Genome Biol 17:13.
- Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol.