Extra hybrid assembly

Andrea Telatin edited this page Nov 18, 2021 · 3 revisions

Trying a hybrid assembly

Combining long reads from Oxford Nanopore Technologies (ONT) with the shorter but more accurate reads from Illumina can give very good results. This is a well-established procedure for bacterial genomes, and now, with the latest ONT flow cells, even "Nanopore-only" assemblies are not too expensive.

With phages, hybrid assembly is relatively less explored, for several reasons. Their genomes are very small, so even a modest amount of sequencing gives enough coverage, and it can be easier to stick to short reads: as we saw on Day 1, they can assemble the whole genome perfectly.
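To make the coverage argument concrete, depth can be estimated as the total number of sequenced bases divided by the genome size. The following is a minimal sketch, not part of the original workflow: the function name `estimate_coverage` is made up for illustration, and it assumes a standard four-line-per-record FASTQ file (optionally gzipped).

```shell
# Estimate sequencing depth as: total sequenced bases / genome size.
# Usage: estimate_coverage READS.fastq[.gz] GENOME_SIZE_BP
# Assumption: every FASTQ record spans exactly four lines.
estimate_coverage() {
  # gzip -cdf decompresses gzipped input and passes plain text through
  gzip -cdf "$1" |
    awk -v g="$2" '
      NR % 4 == 2 { bases += length($0) }   # line 2 of each record is the sequence
      END         { printf "%.1f\n", bases / g }'
}
```

For example, `estimate_coverage /data/reads/ont/T4-ONT.fastq.gz 170000` would report the approximate depth of the long reads, assuming a genome of roughly 170 kb (an approximate figure for T4, used here only for illustration).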

Unicycler

A pipeline that can combine long and short reads is Unicycler.

Let's try it:

# Remember to activate the appropriate conda environment!

unicycler -1 /data/reads/illumina/T4_R1.fastq.gz -2 /data/reads/illumina/T4_R2.fastq.gz \
  -l /data/reads/ont/T4-ONT.fastq.gz -o T4-hybrid -t 8 

We supply the two paired-end Illumina files (with -1 and -2, respectively) and the long reads (with -l). The output is saved in a new directory, specified with -o, and -t sets the number of threads to use.

The output files

ls -l T4-hybrid/
total 1164
-rw-rw-r-- 1 ubuntu ubuntu 166770 Nov 18 11:27 001_best_spades_graph.gfa
-rw-rw-r-- 1 ubuntu ubuntu 166069 Nov 18 11:27 002_overlaps_removed.gfa
-rw-rw-r-- 1 ubuntu ubuntu 166051 Nov 18 11:27 003_bridges_applied.gfa
-rw-rw-r-- 1 ubuntu ubuntu 165862 Nov 18 11:27 004_final_clean.gfa
-rw-rw-r-- 1 ubuntu ubuntu 165862 Nov 18 11:30 005_polished.gfa
-rw-rw-r-- 1 ubuntu ubuntu 168235 Nov 18 11:30 assembly.fasta
-rw-rw-r-- 1 ubuntu ubuntu 165862 Nov 18 11:30 assembly.gfa
-rw-rw-r-- 1 ubuntu ubuntu  11567 Nov 18 11:30 unicycler.lo

Did we get a single contig?

Let's check:

seqfu stats -n -b T4-hybrid/assembly.fasta 
┌──────────┬──────┬──────────┬──────────┬────────┬────────┬────────┬────────────┬────────┬────────┐
│ File     │ #Seq │ Total bp │ Avg      │ N50    │ N75    │ N90    │ auN        │ Min    │ Max    │
├──────────┼──────┼──────────┼──────────┼────────┼────────┼────────┼────────────┼────────┼────────┤
│ assembly │ 1    │ 165823   │ 165823.0 │ 165823 │ 165823 │ 165823 │ 165823.000 │ 165823 │ 165823 │
└──────────┴──────┴──────────┴──────────┴────────┴────────┴────────┴────────────┴────────┴────────┘
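The same question can be answered without seqfu by looking at the FASTA headers directly. The sketch below uses two small helper functions whose names (`count_contigs`, `list_contigs`) are invented for this example. Unicycler normally writes the length and depth of each contig in its header and flags circularised contigs with `circular=true`, but treat the exact header layout as an assumption and check your own output.

```shell
# Count the sequences in a FASTA file (one header line per contig).
count_contigs() {
  grep -c '^>' "$1"
}

# Print the contig headers without the leading ">", so that any
# length / depth / circular annotations can be inspected.
list_contigs() {
  grep '^>' "$1" | sed 's/^>//'
}
```

Running `count_contigs T4-hybrid/assembly.fasta` should agree with the #Seq column above, and `list_contigs T4-hybrid/assembly.fasta` shows whether the contig was reported as circular.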

Are the two assemblies similar?

We can compare the hybrid assembly with the short-read assembly from Day 1 by pasting both sequences into BLAST, using the "Blast two sequences" option.

Bl2seq

And yes, the result is a perfect match (except for the starting point, of course).
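When two assemblies are identical apart from the starting point, one sequence is a rotation of the other, and this can be checked on the command line: B is a rotation of A exactly when the two have the same length and B occurs inside A concatenated with itself. The following is a rough sketch under stated assumptions: the function name `is_rotation` is hypothetical, only the first record of each FASTA file is compared, and both sequences must be on the same strand (a reverse-complemented assembly would be reported as different).

```shell
# Succeed (exit status 0) if the first sequence of FASTA file B is a
# rotation of the first sequence of FASTA file A (same strand only).
# Usage: is_rotation A.fasta B.fasta
is_rotation() {
  awk '
    FNR == 1 { file++; rec = 0 }                      # a new input file starts
    /^>/     { rec++; next }                          # header line: next record
    rec == 1 { seq[file] = seq[file] toupper($0) }    # join wrapped sequence lines
    END {
      a = seq[1]; b = seq[2]
      # b is a rotation of a when lengths match and b occurs within a+a
      ok = (length(a) > 0 && length(a) == length(b) && index(a a, b) > 0)
      exit (ok ? 0 : 1)
    }' "$1" "$2"
}
```

A possible use, where `day1-assembly.fasta` is a placeholder for wherever the short-read assembly was saved:

is_rotation day1-assembly.fasta T4-hybrid/assembly.fasta && echo "Same sequence, different start"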