NewTargets

Description

Bioinformatic sequence recovery for universal target-capture bait kits can be substantially improved by tailoring target files to the group under study. To enable the best possible locus recovery from Angiosperms353 capture data, we have developed an expanded target file (mega353.fasta) incorporating sequences from over 550 transcriptomes from the 1KP project. To maximise computational efficiency, we provide the script filter_mega353.py, which subsamples the mega353.fasta file based on user-selected taxa or taxon groups. These groups can be defined using unique 1KP transcriptome codes, species, families, orders, or broader groups (e.g. Basal Eudicots, Monocots). In addition, we provide the script BYO_transcriptome.py, which adds sequences from any transcriptome to any protein-coding nucleotide target file. These tailored and customised target files can be used directly in target-capture pipelines such as HybPiper.

Data files

  • mega353.fasta. A target file for use with target enrichment datasets captured using the Angiosperms353 bait kit.
  • filtering_options.csv. A comma-separated values file listing the options available for filtering the mega353.fasta file. This reference file can also be produced by the filter_mega353.py script (see below).

Scripts

  • filter_mega353.py. A script to filter the mega353.fasta target file.
  • BYO_transcriptome.py. A script to add sequences from any transcriptome dataset to any target file containing protein-coding sequences.

Manuscript

Dependencies

Dependencies for filter_mega353.py

Dependencies for BYO_transcriptome.py

Please see the Wiki page Installing dependencies for further details.

Installation

Assuming all dependencies are installed, either:

  1. Download the NewTargets package directly from the repository home page and unzip it. Note that the mega353.fasta file in the unzipped package is itself provided as a .zip file (mega353.zip) and must be unzipped separately.
  2. Clone the repository using the command git clone https://github.com/chrisjackson-pellicle/NewTargets.git, then unzip the mega353.zip file.
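
If no command-line unzip tool is to hand, the archive can also be extracted with Python's standard library. The following is a minimal sketch, assuming the archive is named mega353.zip and sits in the current directory; the helper name extract_target_file is hypothetical and is not part of NewTargets.

```python
import zipfile
from pathlib import Path


def extract_target_file(zip_path="mega353.zip", dest_dir="."):
    """Extract every member of the archive and return the extracted paths.

    Hypothetical helper: the archive name and destination are assumptions.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


# Example (run from the NewTargets directory):
#     for path in extract_target_file():
#         print(f"Extracted {path}")
```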

Scripts

filter_mega353.py

Input:

  • mega353.fasta. The expanded Angiosperms353 target file.
  • select_file. A text file containing a list of IDs; sequences from these samples will be retained in the filtered output target file.
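
A select_file can be written by hand or generated from a list. The sketch below assumes the file holds one ID per line, which this README does not state explicitly; check the Wiki page and filtering_options.csv for the exact format and spellings. The helper name write_select_file is hypothetical.

```python
from pathlib import Path


def write_select_file(ids, path="select_asparagales.txt"):
    """Write one ID per line (assumed format), dropping blanks and duplicates.

    Returns the cleaned list of IDs in the order they were written.
    """
    cleaned = []
    for raw in ids:
        item = raw.strip()
        if item and item not in cleaned:
            cleaned.append(item)
    Path(path).write_text("\n".join(cleaned) + "\n")
    return cleaned


# Example: select an order by name. The exact strings accepted by
# filter_mega353.py are listed in filtering_options.csv.
#     write_select_file(["Asparagales"])
```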

Output:

  • filtered_target_file. The filtered target file containing the default Angiosperms353 sequences and any additional sequences corresponding to IDs in the select_file.
  • report_file. A report file in .csv format, listing samples with sequences retained in the filtered target file (excluding default Angiosperms353 samples).
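
The retention rule described above can be pictured with a short sketch. This is not the code in filter_mega353.py. It is a minimal illustration that assumes headers follow the `>sample-gene` naming pattern used by HybPiper target files, and that the caller supplies the set of default Angiosperms353 sample IDs (the `default_ids` parameter is hypothetical). Unlike the real script, it does not resolve species, families, orders, or broader groups to sample IDs.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_records(records, default_ids, selected_ids):
    """Keep records whose sample ID is a default or a selected ID."""
    keep = set(default_ids) | set(selected_ids)
    for header, seq in records:
        # Assumed header layout "<sample>-<gene>": take the first word of the
        # header and split on the LAST hyphen to recover the sample ID.
        sample = header.partition(" ")[0].rsplit("-", 1)[0]
        if sample in keep:
            yield header, seq
```

Splitting on the last hyphen keeps sample names that themselves contain hyphens intact, on the assumption that gene names do not contain one.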

Quick usage:

python filter_mega353.py [-h] [-filtered_target_file FILTERED_TARGET_FILE]
                         [-report_filename REPORT_FILENAME]
                         [-list_filtering_options]
                         mega353_file select_file

Example command line:

python filter_mega353.py mega353.fasta select_asparagales.txt -filtered_target_file asparagales_targetfile.fasta -report_filename asparagales_report.csv
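
When filtering is one step in a larger workflow, the same call can be scripted from Python. This is a sketch only: the helper names build_filter_command and run_filter are hypothetical, and the script is assumed to be in the current working directory. The flags are those shown in the quick usage above.

```python
import subprocess
import sys


def build_filter_command(mega353_file, select_file,
                         filtered_target_file=None, report_filename=None,
                         script="filter_mega353.py"):
    """Assemble the argument list for filter_mega353.py."""
    cmd = [sys.executable, script, mega353_file, select_file]
    if filtered_target_file:
        cmd += ["-filtered_target_file", filtered_target_file]
    if report_filename:
        cmd += ["-report_filename", report_filename]
    return cmd


def run_filter(*args, **kwargs):
    """Run the script; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_filter_command(*args, **kwargs), check=True)


# Example, mirroring the command line above:
#     run_filter("mega353.fasta", "select_asparagales.txt",
#                filtered_target_file="asparagales_targetfile.fasta",
#                report_filename="asparagales_report.csv")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.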

To generate the filtering_options.csv file:

python filter_mega353.py -list_filtering_options

Please see the Wiki page filter_mega353 for further details.


BYO_transcriptome.py

Input:

  • target_file. A target file containing protein-coding nucleotide sequences in fasta format.
  • transcriptomes_folder. A directory containing one or more transcriptomes in fasta format.
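
Before a long run it can be worth checking that a target file really looks like in-frame protein-coding nucleotide sequence. The sketch below applies three rough heuristics that are assumptions of this example, not criteria stated by BYO_transcriptome.py: only A, C, G, T, or N characters, a length divisible by three, and no stop codon before the final codon in the first reading frame. The function name coding_problems is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def coding_problems(seq):
    """Return a list of reasons why `seq` does not look like an in-frame CDS."""
    seq = seq.upper()
    problems = []
    if set(seq) - set("ACGTN"):
        problems.append("non-ACGTN characters")
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Skip the final codon, which may legitimately be a stop codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(codon in STOP_CODONS for codon in codons):
        problems.append("internal stop codon in frame 1")
    return problems


# Example: an empty list means none of the heuristics fired.
#     coding_problems("ATGAAATAA")
```

A flagged sequence is not necessarily unusable; the heuristics only point to records worth inspecting by eye.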

Output:

These are the main results and reports folders; see the Wiki page for full output details.

  • 17_mega_target_file. A folder containing the final target file BYO_target.fasta.
  • 18_reports. A folder containing the general report file summary_report.csv, and the file report_per_gene.csv containing a presence/absence matrix of transcriptome hits for each gene/transcriptome.
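
The presence/absence matrix in report_per_gene.csv lends itself to quick summaries. The sketch below counts hits per transcriptome under an assumed layout that should be checked against a real report: a header row with the gene column first and one column per transcriptome, and cells containing 1 for a hit. The function name hits_per_transcriptome is hypothetical.

```python
import csv


def hits_per_transcriptome(rows):
    """Count genes with a hit ("1") in each transcriptome column.

    `rows` is any iterable of CSV lines, e.g. an open file handle.
    Assumed layout: first column = gene, remaining columns = transcriptomes.
    """
    reader = csv.reader(rows)
    header = next(reader, None)
    if not header:
        return {}
    names = header[1:]
    counts = {name: 0 for name in names}
    for row in reader:
        for name, cell in zip(names, row[1:]):
            if cell.strip() == "1":
                counts[name] += 1
    return counts


# Example:
#     with open("18_reports/report_per_gene.csv", newline="") as handle:
#         print(hits_per_transcriptome(handle))
```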

Quick usage:

python BYO_transcriptome.py [-h] [-num_hits_to_recover <integer>]
                            [-python_threads <integer>]
                            [-external_program_threads <integer>]
                            [-length_percentage <float>]
                            [-hmmsearch_evalue <number in scientific notation; default is 1e-50>]
                            [-trim_to_refs <taxon_name>] [-no_n]
                            [-skip_exonerate_frameshift_fix]
                            [-discard_short]
                            target_file transcriptomes_folder

Example command line:

python BYO_transcriptome.py asparagales_targetfile.fasta additional_asparagales_transcriptomes_folder -python_threads 4 -external_program_threads 4

Please see the Wiki page BYO_transcriptome for further details.