methods for orphan gene prediction paper optimization

eswlab/orphan-prediction

See overview and documentation: Documentation Status publication

Enhanced prediction of orphan genes in assembled genomes

Table of Contents

Gene prediction and optimization using BIND and MIND workflows:

MIND: ab initio gene predictions by MAKER combined with gene predictions INferred Directly from alignment of RNA-Seq evidence to the genome.

BIND: ab initio gene predictions by BRAKER combined with gene predictions INferred Directly from alignment of RNA-Seq evidence to the genome.

1. Find an Orphan-Enriched RNA-Seq dataset from NCBI-SRA (See details here):

  • Search RNA-Seq datasets for your organism on NCBI, filter Runs (SRR) for Illumina, paired-end, HiSeq 2500 or newer.
  • Download Runs from NCBI (SRA-toolkit)
  • If an existing annotation is available, quantify the expression of every gene in every SRR with Kallisto.
  • Run phylostratr on the current gene models to infer the phylostratum of each gene model.
  • Rank the SRRs by the number of expressed orphan genes and select an amount of data that is feasible to work with.
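
The ranking step can be sketched in Python as follows. This is a minimal illustration, not the repository's own script: the function names, the `min_tpm` cutoff of 1.0, and the assumption that each Kallisto target corresponds to one gene are all hypothetical. The only external fact relied on is that Kallisto's `abundance.tsv` has `target_id` and `tpm` columns.

```python
import csv
from typing import Dict, List, Set, Tuple


def read_kallisto_tpm(abundance_tsv: str) -> Dict[str, float]:
    """Read target_id -> TPM from a Kallisto abundance.tsv file."""
    with open(abundance_tsv, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["target_id"]: float(row["tpm"]) for row in reader}


def rank_runs_by_expressed_orphans(
    tpm_by_run: Dict[str, Dict[str, float]],
    orphan_ids: Set[str],
    min_tpm: float = 1.0,  # hypothetical expression cutoff
) -> List[Tuple[str, int]]:
    """Return (run id, number of expressed orphans), best run first."""
    counts = []
    for run_id, tpm in tpm_by_run.items():
        expressed = sum(1 for gene in orphan_ids if tpm.get(gene, 0.0) >= min_tpm)
        counts.append((run_id, expressed))
    # Most expressed orphans first; ties are broken by run id.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts
```

Here `orphan_ids` would come from the phylostratr results, and the top entries of the returned list are the candidate runs to download in full.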

Note: If NCBI-SRA has no samples for your organism, and you are relying solely on RNA-Seq that you generate yourself, best practice is to maximize representation of all genes by including conditions like reproductive tissues and stresses in which orphan gene expression is high.

2. Ab initio gene prediction:

Pick one of the 2 ab initio predictions below:

  1. Run BRAKER (See details here):

    • Align RNA-Seq with a splice-aware aligner (STAR or HiSat2 preferred, HiSat2 used here)
    • Generate BAM file for each SRA-SRR id, merge them to generate a single sorted BAM file
    • Run BRAKER
  2. Run MAKER (See details here):

    • Align RNA-Seq with a splice-aware aligner (STAR or HiSat2 preferred, HiSat2 used here)
    • Generate BAM file for each SRA-SRR id, merge them to generate a single sorted BAM file
    • Run Trinity to generate transcriptome assembly using the BAM file
    • Run TransDecoder on Trinity transcripts to predict ORFs and translate them to protein
    • Run MAKER in homology-only mode with transcripts (Trinity) and proteins (TransDecoder and SwissProt)
    • Use the MAKER predictions to train SNAP and AUGUSTUS. Self-train GeneMark.
    • Run a second round of MAKER with the SNAP, AUGUSTUS, and GeneMark ab initio predictions plus the results from the previous MAKER round.
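
Both ab initio routes start from a single sorted BAM file. The sketch below only builds the command lines for that shared step, without running them. The helper names, the `<run>_1.fastq.gz` / `<run>_2.fastq.gz` file naming, and the output file names are assumptions made for illustration; the HISAT2 and Samtools options used (`-x`, `-1`, `-2`, `-S`, `-p`, `sort -o`, `merge`, `-@`) are standard ones, but check them against the versions you have installed.

```python
from typing import List


def hisat2_align_cmd(index_prefix: str, run_id: str, threads: int = 4) -> List[str]:
    """HISAT2 paired-end alignment for one run, written as SAM."""
    return [
        "hisat2", "-p", str(threads), "-x", index_prefix,
        "-1", f"{run_id}_1.fastq.gz",  # assumed read file naming
        "-2", f"{run_id}_2.fastq.gz",
        "-S", f"{run_id}.sam",
    ]


def samtools_sort_cmd(run_id: str, threads: int = 4) -> List[str]:
    """Coordinate-sort one run's alignments into a BAM file."""
    return ["samtools", "sort", "-@", str(threads),
            "-o", f"{run_id}.sorted.bam", f"{run_id}.sam"]


def plan_merged_bam(index_prefix: str, run_ids: List[str],
                    merged_bam: str = "merged.bam") -> List[List[str]]:
    """Commands to align and sort every run, then merge into one BAM."""
    commands: List[List[str]] = []
    for run_id in run_ids:
        commands.append(hisat2_align_cmd(index_prefix, run_id))
        commands.append(samtools_sort_cmd(run_id))
    sorted_bams = [f"{run_id}.sorted.bam" for run_id in run_ids]
    # Merging coordinate-sorted inputs yields a coordinate-sorted output.
    commands.append(["samtools", "merge", "-@", "4", merged_bam, *sorted_bams])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the tools are on `PATH`. The merged BAM is then the RNA-Seq evidence for BRAKER, or the input to Trinity on the MAKER route.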

3. Direct Inference evidence-based predictions (See details here):

We provide an automated pipeline for evidence-based predictions (See details here)

  • Align RNA-Seq with a splice-aware aligner (STAR or HiSat2 preferred, HiSat2 used here)
  • Generate BAM file for each SRA-SRR id
  • For each BAM file, use multiple transcript assemblers for genome-guided transcript assembly:
    • CLASS2
    • StringTie
    • Cufflinks
  • Run Portcullis to remove invalid splice junctions
  • Consolidate transcripts and generate a non-redundant set of transcripts using Mikado.
  • Predict ORFs on these consolidated transcripts using TransDecoder
  • Pick the best transcripts using all the above information with Mikado Pick.
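
The per-BAM assembly step can be planned as below. This sketch covers only StringTie and Cufflinks, whose basic invocations are well established; CLASS2 is left out because its exact command line is not given here. The function name and the output layout are hypothetical, and the commands are built but not executed.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def assembler_commands(bam: str, out_dir: str,
                       threads: int = 4) -> Dict[str, Tuple[List[str], str]]:
    """Map assembler name -> (command, path of the GTF it should produce)."""
    sample = Path(bam).stem
    out = Path(out_dir)
    stringtie_gtf = str(out / f"{sample}.stringtie.gtf")
    cufflinks_dir = out / f"{sample}.cufflinks"
    return {
        "stringtie": (
            ["stringtie", bam, "-p", str(threads), "-o", stringtie_gtf],
            stringtie_gtf,
        ),
        "cufflinks": (
            ["cufflinks", "-p", str(threads), "-o", str(cufflinks_dir), bam],
            # Cufflinks writes its assembly as transcripts.gtf in the output directory.
            str(cufflinks_dir / "transcripts.gtf"),
        ),
    }
```

Collecting the GTF paths for every BAM gives the set of assemblies that Mikado later consolidates into a non-redundant transcript set.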

4. Combine ab initio and Direct Inference evidence-based predictions:

If you ran BRAKER in step 2, run 4.1

  1. Merge BRAKER with Direct Inference (BIND) (See details here):
  • Use Mikado to combine BRAKER-generated predictions with Direct Inference evidence-based predictions.

If you ran MAKER in step 2, run 4.2

  2. Merge MAKER with Direct Inference (MIND) (See details here):
  • Use Mikado to combine MAKER-generated predictions with Direct Inference evidence-based predictions.
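
Mikado takes its inputs as a tab-separated list file whose first columns are the file path, a unique label, whether the input is stranded, and an optional score bonus. A minimal sketch of writing such a list for the merge step is shown below. The file names, labels, and score values are placeholders, not values from this workflow, and the column layout should be checked against the documentation for your Mikado version.

```python
from typing import List, Tuple

# (path, label, stranded, score bonus) for each prediction set
MikadoInput = Tuple[str, str, bool, float]


def mikado_list_lines(inputs: List[MikadoInput]) -> List[str]:
    """Format prediction sets as tab-separated lines for a Mikado list file."""
    labels = [label for _, label, _, _ in inputs]
    if len(set(labels)) != len(labels):
        raise ValueError("each input needs a unique label")
    return [
        "\t".join([path, label, str(stranded), str(score)])
        for path, label, stranded, score in inputs
    ]


# Hypothetical inputs: ab initio models plus Direct Inference models.
example = [
    ("braker.gff3", "braker", True, 0.0),
    ("direct_inference.gff3", "di", True, 1.0),
]
list_text = "\n".join(mikado_list_lines(example)) + "\n"
```

For MIND, the first entry would point at the MAKER predictions instead of the BRAKER ones.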

5. Evaluate your predictions (See details here):

  • Run BUSCO to see how well the conserved genes are represented in your final predictions
  • Run OrthoFinder to find and annotate orthologs present in your predictions
  • Run phylostratr to find orphan genes in your predictions
  • Add functional annotation to your genes using homology and InterProScan
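
When comparing several prediction sets, it helps to pull the BUSCO scores into a table. The sketch below parses the one-line summary that BUSCO prints in its short summary file, in the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`. The function name is hypothetical, and the numbers in the usage note are illustrative, not results from this workflow.

```python
import re
from typing import Dict

# Matches the compact BUSCO summary line, e.g.
# C:98.6%[S:97.1%,D:1.5%],F:0.5%,M:0.9%,n:1614
_BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,"
    r"n:(?P<total>\d+)"
)


def parse_busco_summary(text: str) -> Dict[str, float]:
    """Extract BUSCO percentages and the total BUSCO count from summary text."""
    match = _BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    return {name: float(value) for name, value in match.groupdict().items()}
```

For example, `parse_busco_summary("C:98.6%[S:97.1%,D:1.5%],F:0.5%,M:0.9%,n:1614")["complete"]` gives `98.6`. A high complete score shows that conserved genes are well represented; it does not measure orphan recovery, which is what the phylostratr step covers.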

Prediction tools include:

| Tool | Purpose |
| --- | --- |
| SRA Tools (v. 2.9.6) | SRA access |
| Hisat2 (v. 2.2.0) | Alignment |
| STAR (v. 2.7.7a) | Alignment |
| Kallisto (v. 0.46.2) | Quantification |
| Samtools (v. 1.10) | Tools |
| CLASS2 (v. 2.1.7) | Transcript Assembly |
| Stringtie (v. 1.3.3) | Transcript Assembly |
| Cufflinks (v. 2.2.1) | Transcript Assembly |
| Trinity (v. 2.6.6) | Transcript Assembly |
| Portcullis (v. 1.2.2) | Tools |
| Transdecoder (v. 3.0.1) | CDS prediction |
| Mikado (v. 2.0) | Direct Inference prediction |
| Phylostratr (v. 0.2.0) | Phylostratigraphy |
| BLAST (v. 3.11.0) | Tools |
| Braker (v. 2.1.2) | Ab initio prediction |
| Maker (v. 2.31.10) | Ab initio prediction |
| GMAP-GSNAP (v. 2019-05-12) | Alignment |
| GeneMark (v. 4.83) | Ab initio prediction |
