chimeraBuster

chimeraBuster is a tool for correcting chimeric gene annotations.

Installation and running

chimeraBuster was tested on Linux, but should also run on Windows/Mac.

Requirements

Git
Python (any version)
conda (Anaconda/Miniconda, compatible with the Python version)

Installation

$ git clone git@github.com:MayroseLab/chimeraBuster.git
$ cd chimeraBuster/
$ conda env create -f env.yml
$ conda activate chimeraBuster
$ python chimeraBuster.py -h
usage: python chimeraBuster.py -h (display full usage doc)

ChimeraBuster: break chimeric gene models based on transcript mapping

optional arguments:
  -h, --help            show this help message and exit
  -g GFF, --gff GFF     Input GFF
  -f GENOME_FASTA, --genome_fasta GENOME_FASTA
                        Input genome in FASTA format
  -t TRANSCRIPTS, --transcripts TRANSCRIPTS
                        Transcripts to be mapped to the genome, in FASTA format
  -m MAPPING, --mapping MAPPING
                        Transcripts mapping to the genome, in minimap2 PAF format
  -d GFF_DB, --gff_db GFF_DB
                        GFF DB from a previous run
  -o OUTPUT, --output OUTPUT
                        Output path
  -r, --do_not_refine   Skip mapping region refining step for a quick-and-dirty analysis
  -n MIN_TRANSCRIPTS, --min_transcripts MIN_TRANSCRIPTS
                        Minimum number of transcripts in mapping region
  -q MIN_QUERY_COV, --min_query_cov MIN_QUERY_COV
                        Minimum transcript query coverage (0-1)
  -c CPUS, --cpus CPUS  Number of CPUs to use
  -v, --verbose         Increase verbosity

Running a test data set

$ python chimeraBuster.py -g test/genes.gff -f test/genome.fasta -t test/transcripts.fasta -o test_result

A new directory called test_result should be created. The list of genes detected as chimeric is written to test_result/chimeric_genes.list (it should include three genes), and the GFF with corrected genes is written to test_result/genes.gff.corrected.gff.

How does it work?

chimeraBuster detects and corrects chimeric gene annotations based on transcript (cDNA) alignments. The main inputs are:

The genome sequence
The annotation to be corrected
Transcript sequences

The tool applies a series of steps to achieve its goal:

Map transcript sequences to the genome (using Minimap2).
Filter transcript mappings by removing ones with low query coverage.
Determine transcript "mapping regions" - these are genomic regions where multiple transcripts are mapped. Overlapping regions are merged into longer regions.
Refine mapping regions - the purpose of this step is to remove "outlier" mappings. Usually these are mappings bridging over two (or more) otherwise non-overlapping mapping regions. Outliers are detected using the DBSCAN clustering algorithm, and removed. Mapping regions are re-computed.
Detect chimeric genes - a gene model spanning more than one mapping region is considered a putative chimeric gene.
Correct (break) chimeric genes - chimeras are broken according to the mapping regions, and the starts/ends of the new genes are corrected by searching for start/stop codons.
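
The region-building and detection steps above can be illustrated with a minimal Python sketch. It is not chimeraBuster's actual code: the function names (mapping_regions, find_chimeric_genes), the tuple-based data layout, and the half-open coordinates on a single sequence are assumptions made for illustration. "Spanning" a region is approximated here as overlapping it, and the DBSCAN refinement and start/stop codon search are omitted.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, on one sequence


def mapping_regions(mappings: List[Interval], min_transcripts: int = 2) -> List[Interval]:
    """Merge overlapping transcript mappings into regions.

    Regions supported by fewer than `min_transcripts` mappings are dropped.
    """
    regions: List[Interval] = []
    counts: List[int] = []
    for start, end in sorted(mappings):
        if regions and start < regions[-1][1]:
            # Overlaps the current region: extend it and count the mapping.
            prev_start, prev_end = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end))
            counts[-1] += 1
        else:
            # No overlap: open a new region.
            regions.append((start, end))
            counts.append(1)
    return [r for r, n in zip(regions, counts) if n >= min_transcripts]


def find_chimeric_genes(genes: Dict[str, Interval], regions: List[Interval]) -> List[str]:
    """Return IDs of gene models that overlap more than one mapping region."""
    chimeric = []
    for gene_id, (g_start, g_end) in genes.items():
        hits = [r for r in regions if r[0] < g_end and g_start < r[1]]
        if len(hits) > 1:
            chimeric.append(gene_id)
    return chimeric


# Hypothetical toy data: two clusters of transcripts, one lone transcript,
# and two gene models.
mappings = [(100, 500), (120, 480), (900, 1400), (950, 1350), (3000, 3200)]
genes = {"geneA": (90, 1420), "geneB": (2950, 3250)}

regions = mapping_regions(mappings)
print(regions)                              # [(100, 500), (900, 1400)]
print(find_chimeric_genes(genes, regions))  # ['geneA']
```

In this toy example the lone transcript at 3000-3200 does not form a region because it falls below the minimum of two transcripts, so geneB is not evaluated as a chimera.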

Inputs and parameters

Parameter	Description
-g GFF, --gff GFF	Input annotation, in GFF3 format. All features must have an ID, or Name, or Parent attribute.
-f GENOME_FASTA, --genome_fasta GENOME_FASTA	Input genome in FASTA format. Record names must match the ones in the GFF3 file.
-t TRANSCRIPTS, --transcripts TRANSCRIPTS	Transcripts to be mapped to the genome, in FASTA format. Preferably high quality cDNA or transcriptome assembly.
-m MAPPING, --mapping MAPPING	Instead of providing transcript sequences, provide transcript mappings to the genome, in minimap2 PAF format.
-d GFF_DB, --gff_db GFF_DB	Instead of providing a GFF file, use a gffutils GFF DB from a previous run.
-o OUTPUT, --output OUTPUT	A path where all intermediate and result files will be written. The directory will be created if it does not already exist. Results from previous runs will be overwritten.
-r, --do_not_refine	Skip mapping region refining step for a quick-and-dirty analysis (not recommended)
-n MIN_TRANSCRIPTS, --min_transcripts MIN_TRANSCRIPTS	Minimum number of transcripts in a mapping region (2 by default). Mapping regions with fewer transcripts will be ignored and no chimeras will be detected based on these regions.
-q MIN_QUERY_COV, --min_query_cov MIN_QUERY_COV	Minimum transcript query coverage (0-1, 0.95 by default). This cutoff will be used when filtering transcript mappings. Query coverage is calculated as the number of matches divided by the query length.
-c CPUS, --cpus CPUS	Number of CPUs to use for running Minimap2 and for refining mapping regions.
-v, --verbose	Increase verbosity. Mainly useful for debugging.
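
To make the -q cutoff concrete, the sketch below computes query coverage from a single PAF record as described in the table (number of matches divided by query length). The column positions follow the PAF format (column 2 is the query length, column 10 the number of matching residues). The function names and the example record are hypothetical, and chimeraBuster's own filtering code may differ.

```python
def query_coverage(paf_line: str) -> float:
    """Query coverage of one PAF record: residue matches / query length."""
    fields = paf_line.rstrip("\n").split("\t")
    query_length = int(fields[1])  # PAF column 2
    n_matches = int(fields[9])     # PAF column 10
    return n_matches / query_length


def passes_filter(paf_line: str, min_query_cov: float = 0.95) -> bool:
    """True if the mapping meets the minimum query coverage cutoff."""
    return query_coverage(paf_line) >= min_query_cov


# Hypothetical PAF record: a 1000 bp transcript with 970 matching residues.
example = "\t".join([
    "tx1", "1000", "0", "1000", "+",
    "chr1", "500000", "12000", "13000",
    "970", "1000", "60",
])
print(query_coverage(example))  # 0.97
print(passes_filter(example))   # True
```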

Outputs

The final outputs are written to the output directory:

chimeric_genes.list - a list of gene IDs detected as putative chimeras.
<input GFF base name>.corrected.gff - a GFF3 file with chimeric genes broken into multiple genes. The source of these genes is set to 'chimeraBuster' and indicative IDs are assigned.

Note that in cases where breaking chimeric genes cannot be performed reliably, genes listed as putatively chimeric will be kept unchanged.
In addition, chimeraBuster ignores overlapping genes since they cannot be reliably analyzed based on transcript data.
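
As a sketch of how the corrected annotation might be inspected, the following Python snippet lists gene features whose source column (GFF3 column 2) is 'chimeraBuster'. The function name, the example records, and the IDs in them are hypothetical; the actual IDs assigned by the tool are not reproduced here.

```python
from typing import Iterable, List


def new_gene_ids(gff_lines: Iterable[str], source: str = "chimeraBuster") -> List[str]:
    """Collect ID attributes of gene features with the given source."""
    ids = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != source or cols[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            ids.append(attrs["ID"])
    return ids


# Hypothetical records; real IDs produced by chimeraBuster will differ.
example = [
    "##gff-version 3",
    "chr1\tchimeraBuster\tgene\t100\t500\t.\t+\t.\tID=geneA_part1",
    "chr1\tchimeraBuster\tgene\t900\t1400\t.\t+\t.\tID=geneA_part2",
    "chr1\tmaker\tgene\t3000\t3200\t.\t-\t.\tID=geneB",
]
print(new_gene_ids(example))  # ['geneA_part1', 'geneA_part2']
```

For a real run, the same function can be given an open file handle for the corrected GFF, for example test_result/genes.gff.corrected.gff from the test data set.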
