Skip to content

Day 1: Genome Assembly Practical

evelien edited this page Dec 10, 2021 · 31 revisions

De novo assembly of a phage genome

🎥 Quick introduction to the command line, YouTube

🎥 This whole practical done, YouTube

Following the instructions we shared via e-mail, log in into the training server. If you use PuTTY you will see a terminal like this:

terminal emulator

1️⃣ Our tool shelf

A very common nightmare for a bioinformatician is the need of installing a new tool. Thankfully, we have a great friend to help us in doing so: the Miniconda package manager, that greatly simplifies the installation of packages and even the installation of multiple versions of the same tool, thanks to the ability (and more often, the necessity) of confining one or more tools in a dedicated space, called environment.

To install Miniconda in our server follow these commands:

bash /data/tools/Miniconda3-latest-Linux-x86_64.sh -b
source ~/.bashrc
conda --version

If you see something like conda 4.9.2 you got it! If you receive the command not found error you can try with source ~/.bashrc, that will refresh your settings.

If you end your practical now, you already learnt how to install 99% of the tools you might ever need, and we think it's fantastic.

💡 Further readings (if you want to install Miniconda on your machine):

Installing mamba

The first package we will install is a faster version of "conda", called "mamba":

# This command will instal the package "mamba" from the "conda-forge" channel
conda install -c conda-forge mamba

💡 In our server two channels are pre-enabled: conda-forge and bioconda, so we don't need to specify them with the -c CHANNEL parameter.

[Optional] Getting the workshop files

The workshop server is loaded with some scripts that are also publicly available from this repository.

You can download it in your home by:

# type "cd" only to go to the home directory
cd
# Then download the repository (clone it)
git clone https://github.com/quadram-institute-bioscience/phage-annotation-workshop.git

This will create a phage-annotation-workshop directory in your home, and we will some file it contains later.

2️⃣ Getting your raw datasets

From our kitchen

Dr. Teagan Brown (from Evelien's lab) prepared DNA for sequencing from both phage T4 and phage T7. The raw data is available on Zenodo.

💡 These datasets are in our server in a shared directory: /data/reads. You can list the files with the following command:

find /data/reads/

From public datasets (optional)

It can be convenient to download datasets to test our methods on publicly available repositories. In our case we might want to fish some reads for phage T4.

Using a tool like SRA Explorer we can identify datasets of interests in the public repositories. For example, looking for T4 Phages we might end up with the accession number SRR1714822.

Let's start creating a directory in the appropriate location where to store our data:

# First: cd {the place you want to go}
mkdir raw_reads

Then we can use a program called fastq-dump to download data from the SRA. First, we need to install it from the BioConda channel as:

conda install -c bioconda -c conda-forge sra-tools

Finally, we can download the raw reads providing the accession number. The --split-3 switch will tell the program to save the forward and the reverse reads in two separate files (if they are paired-end or mate-paired reads), and --gzip will produce a compressed output, that is a good way not to waste space in our system:

fastq-dump --split-3 --gzip SRR1714822 -O ./raw_reads

3️⃣ The tools we'll use

Sometimes it's necessary to create a "workbench" with the tools needed for a task. Conda allows creating environments. One of the advantages of conda environments is that they can confine conflicting software in isolated spaces, so we can use tools that otherwise would not happily coexist in the same machine.

In our machine, we have a template file /data/tools/env.yaml to create an environment with the required packages with the following command:

# Create a "ws" using a file with the exact composition (env.yaml)
mamba env create -n ws --file /data/tools/env.yaml

💡 Let's try to create a "ws" environment with the tools we are going to use:

# Do not run this during the workshop :)
# This will create the environment only specifying a set of packages and the resulting versions might vary
mamba create -n ws skesa multiqc fastp unicycler spades prokka blast=2.9 eggnog-mapper biopython

4️⃣ Preliminary checks

Read stats

First, let's count the number of reads in our Illumina dataset with seqfu stats, which will also print some statistics on the read lengths (-n will enable a nice user-friendly output, otherwise we can use --csv to copy and paste to a spreadsheet):

seqfu stats -n /data/reads/illumina/*.fastq.gz

We can similarly proceed with the long reads:

seqfu stats -n /data/reads/ont/*.fastq.gz

Quality profiles

The quality profile of reads can be also source of some information. There are several aspects that can be considered but an easy start is to check the average reads quality scores.

A well-known package to profile the quality scores of FASTQ datasets is FastQC.

# Let's create a directory where to store the output files
mkdir ~/qc

# Let's run the fastQC on Illumina reads
fastqc -o ~/qc /data/reads/illumina/*.fastq.gz

# [OPTIONAL] Profiling Nanopore reads requires more threads
fastqc -o ~/qc /data/reads/ont/*.fastq.gz --threads 4

🔍 To check how the reports look like, check this web directory.

Quality filtering

If we want to either discard very noisy reads, or trim the reads where the quality degrades, we can use tools like fastp.

💡 Fastp also produces nice reports of the quality before and after the filtering, making FastQC not necessary.

Our datasets are fine, but we can try a bit of trimming to our T4. For long commands remember that we can add a \ to prevent the linebreak to be interpreted as "Run the command".

# first create the directory in which the filtered reads will be saved
mkdir ~/filtered

fastp -i /data/reads/illumina/T4_R1.fastq.gz -I /data/reads/illumina/T4_R2.fastq.gz \
  -o filtered/T4_R1.fq.gz -O filtered/T4_R2.fq.gz \
  --length_required 100  --detect_adapter_for_pe  \
  --cut_tail -j qc/T4.json -h qc/T4.html

🔍 You can check the HTML report from this page.

5️⃣ Assembly

There are several tools for the de novo assembly of short reads. A powerful pipeline that allows the hybrid assembly of long and short reads is called "Unicycler", and it's particularly well suited to assemble bacterial isolates. Some tools are more commonly used with short reads, like "Spades", while others are designed to only work with long reads (like "flye").

Here we will use Skesa from NCBI to assemble the Illumina reads:

# First we will create an output directory
mkdir ~/assembly

# Then we can run skesa using 4 cores on our "filtered" T4 reads
skesa --cores 4 --reads ~/filtered/T4_R1.fq.gz ~/filtered/T4_R2.fq.gz > ~/assembly/T4.fasta

# and the original T7 reads
skesa --cores 4 --reads /data/reads/illumina/T7_R{1,2}.fastq.gz > ~/assembly/T7.fasta 

How many contigs did we get? How long?

seqfu stats -n ~/assembly/*.fasta

6️⃣ PhageTerm analysis

PhageTerm will analyse the coverage of the raw reads across the assembly to evaluate the phage genome structure.

PhageTerm -f /data/reads/illumina/T7_R1.fastq.gz \
   -p  /data/reads/illumina/T7_R2.fastq.gz \
   -r ~/assembly/T7.fasta \
   -c 4 -n T7

PhageTerm -f /data/reads/illumina/T4_R1.fastq.gz \
   -p  /data/reads/illumina/T4_R2.fastq.gz \
   -r  ~/assembly/T4.fasta \
   -c 4 -n T4

Where the parameters are:

  • The forward (-f) and reverse (-p) Illumina raw reads
  • The final assembly (-r)
  • And optionally a name for the output files (-n) and the number of processing cores (-c)

🔍 You can check the results from our server page.

The report on phage T4 only flags it as permuted (which is correct).

PhageTerm T4

The report on phage T7 is more interesting, as it detected a high coverage region that is due to the overlap of identical terminal repeats, allowing us to "rotate" the genome selecting the start where the DTR starts.

PhageTerm T7

📖 We have a stand-alone tutorial on PhageTerm

7️⃣ Reorienting the genome based on PhageTerm

You can use seqfu to reverse complement, restart or both.

# to make the reverse complement, if it is necessary
seqfu rc T7.fasta > T7_rc.fasta

# to change the start
seqfu rotate -i {coordinate_of_new_start} T7.fasta > T7_restart.fasta

⭐ One more thing...