Genome Assembly and Annotation Workflow

These notes are shared for open discussion. Feel free to leave comments (in Issues?) or contact me with questions or suggestions. Most of the tools can be found with `conda search tool_name` and installed with `conda create --name tool_name tool_name`. Refer here to install conda.

Genome assembly from PacBio HiFi reads (high accuracy long reads):

  1. Assemble PacBio HiFi reads with hifiasm or HiCanu.
  2. Filter contamination with BlobTools2.
  3. Scaffold the contigs if long-range data are available, such as Hi-C, Omni-C, or Bionano Genomics optical maps.
  4. Sort and rename scaffolds/contigs (GenomeTools, bedtools).
  5. Evaluation: assembly-stats, QUAST, BUSCO, Bandage.
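
The sort-and-rename step can also be sketched in plain Python when GenomeTools or bedtools are not at hand. This is a minimal illustration, not how either tool works: the function names `read_fasta` and `sort_and_rename` and the `scaffold_` prefix are hypothetical, and sequences are assumed to fit in memory.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sort_and_rename(
    records: Iterable[Tuple[str, str]], prefix: str = "scaffold_"
) -> List[Tuple[str, str, str]]:
    """Sort sequences longest first and number them from 1.

    Returns (new_name, old_name, sequence) so the old names can be
    written to a lookup table alongside the renamed FASTA.
    """
    ordered = sorted(records, key=lambda rec: len(rec[1]), reverse=True)
    return [
        (f"{prefix}{i}", old_name, seq)
        for i, (old_name, seq) in enumerate(ordered, start=1)
    ]
```

Keeping the old name next to the new one makes it easy to trace a renamed scaffold back to the assembler's original contig ID.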

Genome assembly from PacBio CLR reads (long but low accuracy reads):

  1. Correct and assemble the CLR reads (Canu).
  2. Polish the assembly with Illumina PE reads (Pilon).
  3. Scaffold with Illumina PE reads (SSPACE).
  4. Evaluation: assembly-stats, QUAST, BUSCO.

Tips for genome assembly evaluation:

After the initial assembly and before contamination filtering, the common values to check for genome assembly quality are listed below. After contamination filtering, these numbers may change.

  1. Genome size (whether close to expectation or consistent among different assembly methods/options);
  2. N50 (a length-weighted median: half of the assembled bases are in contigs/scaffolds of at least this length; the longer the better).
  3. Number of contigs/scaffolds (fewer is better).
  4. Also the maximum and minimum contig/scaffold lengths. All of the values above can be calculated quickly with assembly-stats (the -t option gives tab-delimited output) listed above.
  5. BUSCO also helps to show whether the core orthologs are covered.
  6. Most assemblers provide assembly graphs, e.g. GFA files, which can be viewed in Bandage.
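
To make the size, N50, and count values above concrete, here is a small sketch that computes them from a list of contig lengths. It is not the assembly-stats implementation; the function name `assembly_stats` and the dictionary keys are made up for illustration.

```python
from typing import Dict, Sequence


def assembly_stats(lengths: Sequence[int]) -> Dict[str, int]:
    """Compute basic contiguity statistics from contig/scaffold lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # N50: walk from the longest sequence down and stop at the first
    # sequence where the running total reaches half of the assembly.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "total_length": total,
        "num_sequences": len(ordered),
        "n50": n50,
        "longest": ordered[0],
        "shortest": ordered[-1],
    }
```

For example, lengths of 80, 70, 50, 40, 30, 20, and 10 sum to 300. The two longest sequences together reach 150, which is half of the total, so the N50 is 70.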

Genome annotation

  1. Build repeat models from the genome assembly with RepeatModeler (easy to use in the Dfam-TETools container).
  2. Filter out repeat models that may actually be protein-coding genes (BLASTx against closely related species or InterProScan domain prediction).
  3. Mask the genome with the cleaned repeat models using RepeatMasker (easy to use in the Dfam-TETools container).
  4. Annotate the genome using RNA-Seq data + closely related proteins (BRAKER2).
  5. Annotate the genome using closely related coding sequences (PASA).
  6. Merge the predicted gene models from BRAKER2, BUSCO, and PASA above with EVidenceModeler.
  7. Host a local website for manual checking (Apollo).
  8. Functional annotation for predicted genes:
    1. BLASTn against the NCBI non-redundant nucleotide database (NT); see NCBI BLAST+ tools and databases.
    2. BLASTp against the NCBI non-redundant protein database (NR; for database download see NT above) and the high-quality, manually curated protein database UniProt.
    3. Functional domain prediction for all predicted proteins (InterProScan).
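
For the BLAST-based functional annotation above, a common follow-up is to keep one best hit per predicted gene. The sketch below assumes the searches were run with BLAST+ tabular output (`-outfmt 6`) and its default 12 columns. The function name `best_hits` and the `max_evalue` cutoff of 1e-5 are illustrative choices, not values from these notes.

```python
from typing import Dict, Iterable


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, dict]:
    """Keep the highest-bitscore hit per query from BLAST -outfmt 6 lines.

    Default column order: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    best: Dict[str, dict] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            # Not a default 12-column row; skip it rather than guess.
            continue
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        current = best.get(query)
        if current is None or bitscore > current["bitscore"]:
            best[query] = {
                "subject": subject,
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best
```

In practice the lines would come from an open file handle, for example `best_hits(open("proteins_vs_uniprot.tsv"))`, where the file name is only a placeholder.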

Check out GenSAS, a free online genome annotation server.

The GenSAS server is free to use without exposing the genome assembly to the public, and it covers everything from uploading RNA-Seq or protein evidence to the final Apollo interface for manual correction and sharing. NCBI also offers a free genome annotation service, the NCBI Eukaryotic Genome Annotation Pipeline, but the genome assembly has to be made public before annotation. You may also need to write a short statement (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/policy/) explaining why the genome is important to annotate before it is put into the annotation queue.