Day 1: Genome Assembly Practical
🎥 Quick introduction to the command line, YouTube
🎥 This whole practical done, YouTube
Following the instructions we shared via e-mail, log in to the training server. If you use PuTTY, you will see a terminal like this:
A very common nightmare for a bioinformatician is having to install a new tool. Thankfully, we have a great friend to help us: the Miniconda package manager, which greatly simplifies the installation of packages, and even of multiple versions of the same tool, thanks to the ability (and, more often, the necessity) of confining one or more tools in a dedicated space called an environment.
To install Miniconda on our server, run these commands:
bash /data/tools/Miniconda3-latest-Linux-x86_64.sh -b
source ~/.bashrc
conda --version
If you see something like conda 4.9.2, you got it! If you receive a "command not found" error, you can try source ~/.bashrc, which will refresh your settings.
If you ended your practical now, you would already have learnt how to install 99% of the tools you might ever need, and we think that's fantastic.
💡 Further readings (if you want to install Miniconda on your machine):
The first package we will install is a faster version of "conda", called "mamba":
# This command will install the package "mamba" from the "conda-forge" channel
conda install -c conda-forge mamba
💡 On our server two channels are pre-enabled: conda-forge and bioconda, so we don't need to specify them with the -c CHANNEL parameter.
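If you are setting up your own machine, channels can be pre-enabled once with conda config, so that -c is no longer needed for every install. This is a sketch of the usual Bioconda setup, not necessarily exactly how our server was configured:

```shell
# Register the channels once (order matters: the last added has highest priority)
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# Check the resulting channel list
conda config --show channels
```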
The workshop server is loaded with some scripts that are also publicly available from this repository.
You can download it into your home directory with:
# type "cd" only to go to the home directory
cd
# Then download the repository (clone it)
git clone https://github.com/quadram-institute-bioscience/phage-annotation-workshop.git
This will create a phage-annotation-workshop directory in your home, and we will use some of the files it contains later.
Dr. Teagan Brown (from Evelien's lab) prepared DNA for sequencing from both phage T4 and phage T7. The raw data is available on Zenodo.
💡 These datasets are available on our server in a shared directory: /data/reads.
You can list the files with the following command:
find /data/reads/
It can be convenient to download datasets from publicly available repositories to test our methods on. In our case we might want to fish out some reads for phage T4.
Using a tool like SRA Explorer we can identify datasets of interest in the public repositories. For example, searching for T4 phages we might end up with the accession number SRR1714822.
Let's start by creating a directory in the appropriate location to store our data:
# First: cd {the place you want to go}
mkdir raw_reads
Then we can use a program called fastq-dump to download data from the SRA. First, we need to install it from the BioConda channel:
conda install -c bioconda -c conda-forge sra-tools
Finally, we can download the raw reads by providing the accession number. The --split-3 switch tells the program to save the forward and reverse reads in two separate files (if they are paired-end or mate-paired reads), and --gzip produces compressed output, which is a good way not to waste space on our system:
fastq-dump --split-3 --gzip SRR1714822 -O ./raw_reads
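A quick sanity check after a download: a FASTQ record spans exactly four lines (header, sequence, separator, qualities), so the number of reads is the line count divided by four. The snippet below builds a tiny gzipped FASTQ just to illustrate; the /tmp path and file name are made up for the example:

```shell
# A FASTQ record is 4 lines; write a 2-read demo file and count its reads
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > /tmp/demo.fastq.gz
echo $(( $(zcat /tmp/demo.fastq.gz | wc -l) / 4 ))
```

On the real data you would point the same command at the downloaded files, e.g. raw_reads/SRR1714822_1.fastq.gz.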
Sometimes it's necessary to create a "workbench" with the tools needed for a task. Conda allows creating environments. One of the advantages of conda environments is that they can confine conflicting software in isolated spaces, so we can use tools that otherwise would not happily coexist in the same machine.
On our server, we have a template file /data/tools/env.yaml to create an environment with the required packages, using the following command:
# Create a "ws" using a file with the exact composition (env.yaml)
mamba env create -n ws --file /data/tools/env.yaml
💡 Alternatively, a "ws" environment with the tools we are going to use could be created by just listing the packages:
# Do not run this during the workshop :)
# This creates the environment by only specifying a set of packages, so the resulting versions might vary
mamba create -n ws skesa multiqc fastp unicycler spades prokka blast=2.9 eggnog-mapper biopython
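Either way, a newly created environment must be activated before its tools appear in your PATH (and can be deactivated when you are done). This is standard conda usage, shown here as session setup:

```shell
# Make the tools inside "ws" available in the current shell session
conda activate ws
# ... run the workshop tools ...
# Return to the previous environment when finished
conda deactivate
```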
First, let's count the number of reads in our Illumina dataset with seqfu stats, which will also print some statistics on the read lengths (-n enables a nice, user-friendly output; otherwise we can use --csv to copy and paste into a spreadsheet):
seqfu stats -n /data/reads/illumina/*.fastq.gz
We can similarly proceed with the long reads:
seqfu stats -n /data/reads/ont/*.fastq.gz
The quality profile of the reads can also be a source of information. There are several aspects to consider, but an easy start is to check the average read quality scores.
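As a reminder of what those scores are: FASTQ encodes each base's Phred quality as a single ASCII character, offset by 33 in the modern Sanger/Illumina 1.8+ encoding. A one-liner to decode a single quality character:

```shell
# Decode the Phred quality of the character 'I': ASCII 73 - 33 = Q40
printf 'I' | od -An -tu1 | awk '{print $1 - 33}'
```

Q40 corresponds to an error probability of 10^(-40/10), i.e. 1 error in 10,000 bases.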
A well-known package to profile the quality scores of FASTQ datasets is FastQC.
# Let's create a directory where to store the output files
mkdir ~/qc
# Let's run the fastQC on Illumina reads
fastqc -o ~/qc /data/reads/illumina/*.fastq.gz
# [OPTIONAL] Profiling Nanopore reads requires more threads
fastqc -o ~/qc /data/reads/ont/*.fastq.gz --threads 4
🔍 To see what the reports look like, check this web directory.
If we want to either discard very noisy reads, or trim the reads where the quality degrades, we can use tools like fastp.
💡 Fastp also produces nice reports of the quality before and after filtering, making FastQC unnecessary.
Our datasets are fine, but we can try a bit of trimming on our T4 reads. For long commands, remember that we can add a \ to prevent the line break from being interpreted as "run the command".
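A minimal illustration of line continuation, unrelated to the workshop data:

```shell
# The trailing backslash joins the two lines into a single command
echo one \
  two three
```

The shell runs this exactly as if we had typed `echo one two three` on one line.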
# first create the directory in which the filtered reads will be saved
mkdir ~/filtered
fastp -i /data/reads/illumina/T4_R1.fastq.gz -I /data/reads/illumina/T4_R2.fastq.gz \
-o filtered/T4_R1.fq.gz -O filtered/T4_R2.fq.gz \
--length_required 100 --detect_adapter_for_pe \
--cut_tail -j qc/T4.json -h qc/T4.html
🔍 You can check the HTML report from this page.
There are several tools for the de novo assembly of short reads. A powerful pipeline that allows the hybrid assembly of long and short reads is called "Unicycler", and it is particularly well suited to assembling bacterial isolates. Some tools are more commonly used with short reads, like "SPAdes", while others are designed to work only with long reads (like "Flye").
Here we will use Skesa from NCBI to assemble the Illumina reads:
# First we will create an output directory
mkdir ~/assembly
# Then we can run skesa using 4 cores on our "filtered" T4 reads
skesa --cores 4 --reads ~/filtered/T4_R1.fq.gz ~/filtered/T4_R2.fq.gz > ~/assembly/T4.fasta
# and the original T7 reads
skesa --cores 4 --reads /data/reads/illumina/T7_R{1,2}.fastq.gz > ~/assembly/T7.fasta
How many contigs did we get? How long?
seqfu stats -n ~/assembly/*.fasta
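If seqfu is not at hand, a contig count can also be obtained with grep, since every FASTA record starts with a ">" header line. A self-contained demo (the /tmp file is made up for the example):

```shell
# Write a 2-contig demo FASTA and count its header lines
printf '>contig1\nACGT\n>contig2\nGGCC\n' > /tmp/demo.fasta
grep -c '^>' /tmp/demo.fasta
```

On the real assemblies this would be, for example, `grep -c '^>' ~/assembly/T4.fasta`.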
PhageTerm will analyse the coverage of the raw reads across the assembly to evaluate the phage genome structure.
PhageTerm -f /data/reads/illumina/T7_R1.fastq.gz \
-p /data/reads/illumina/T7_R2.fastq.gz \
-r ~/assembly/T7.fasta \
-c 4 -n T7
PhageTerm -f /data/reads/illumina/T4_R1.fastq.gz \
-p /data/reads/illumina/T4_R2.fastq.gz \
-r ~/assembly/T4.fasta \
-c 4 -n T4
Where the parameters are:
- The forward (-f) and reverse (-p) Illumina raw reads
- The final assembly (-r)
- And, optionally, a name for the output files (-n) and the number of processing cores (-c)
🔍 You can check the results from our server page.
The report on phage T4 only flags it as permuted (which is correct).
The report on phage T7 is more interesting: it detected a high-coverage region caused by the overlap of identical terminal repeats, allowing us to "rotate" the genome, placing the new start where the DTR begins.
📖 We have a stand-alone tutorial on PhageTerm
You can use seqfu to reverse complement the sequence, change its start, or both.
# to make the reverse complement, if it is necessary
seqfu rc T7.fasta > T7_rc.fasta
# to change the start
seqfu rotate -i {coordinate_of_new_start} T7.fasta > T7_restart.fasta
Phage Annotation Workshop 2021