This pipeline annotates RNA genes based on Rfam alignments, tRNAscan predictions, and miRBase data. As a prerequisite, it uses a subset of the alignments produced by RNAFeatures_conf (doc) with strict taxonomic filtering.
A registry file with the locations of the core database server(s) and the production database is also required (alternatively, specify the production database with -production_db $PROD_DB_URL).
init_pipeline.pl Bio::EnsEMBL::EGPipeline::PipeConfig::RNAGenes_conf \
$($CMD details script) \
-hive_force_init 1 \
-registry $REG_FILE \
-production_db "$($PROD_SERVER details url)""$PROD_DBNAME" \
-pipeline_tag "_${SPECIES_TAG}" \
-pipeline_dir $OUT_DIR/rna_genes \
-species $SPECIES \
-eg_pipelines_dir $ENS_DIR/ensembl-production-imported \
-all_new_species 1 \
-run_context vb \
${OTHER_OPTIONS} \
2> $OUT_DIR/init.stderr \
1> $OUT_DIR/init.stdout
SYNC_CMD=$(cat $OUT_DIR/init.stdout | grep -- -sync'$' | perl -pe 's/^\s*//; s/"//g')
# should get something like
# beekeeper.pl -url $url -sync
LOOP_CMD=$(cat $OUT_DIR/init.stdout | grep -- -loop | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
# beekeeper.pl -url $url -reg_file $REG_FILE -loop
$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
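The extraction of the sync and loop commands can be tried in isolation. The sketch below applies equivalent filters to a made-up sample of the init output. The sample text, the URL, and the use of sed in place of the perl one-liners are assumptions for illustration; the real init_pipeline.pl output may differ.

```sh
# Made-up stand-in for $OUT_DIR/init.stdout (the real output may differ).
init_stdout=$(mktemp)
cat > "$init_stdout" <<'EOF'
# Synchronize the hive:
    beekeeper.pl -url "mysql://user:pass@host:3306/user_rna_genes" -sync
# Run the pipeline:
    beekeeper.pl -url "mysql://user:pass@host:3306/user_rna_genes" -reg_file reg.pm -loop  # looped mode
EOF

# Line ending in -sync: strip leading whitespace and double quotes.
SYNC_CMD=$(grep -- '-sync$' "$init_stdout" | sed -E 's/^[[:space:]]*//; s/"//g')
# Line containing -loop: additionally drop the trailing "# ..." comment.
LOOP_CMD=$(grep -- '-loop' "$init_stdout" | sed -E 's/^[[:space:]]*//; s/[[:space:]]*#.*$//; s/"//g')
rm -f "$init_stdout"

# Refuse to continue if either command could not be extracted.
if [ -z "$SYNC_CMD" ] || [ -z "$LOOP_CMD" ]; then
  echo "could not extract beekeeper commands" >&2
  exit 1
fi
echo "$SYNC_CMD"
echo "$LOOP_CMD"
```

Checking that both variables are non-empty before running them avoids silently executing an empty command when the init step failed.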
option | default value | meaning |
---|---|---|
-species | | species to process; several -species options are possible |
-pipeline_dir | | directory to store results in |
-registry | | registry file with the locations of the core DBs and production DB |
-production_db | | connection URL for the production DB; provide if no production DB is in the registry |
-old_registry | | registry file with the locations of the core DBs for the previous release; not needed if annotating only new genomes |
-all_new_species | 0 | skip stable ID mapping, assuming that all species are new; 1 -- to disable mapping |
-use_cmscan | 1 | create genes from cmscan (mostly Rfam) alignments; 0 -- to skip |
-use_trnascan | 1 | create genes from tRNAscan alignments; 0 -- to skip |
-use_mirbase | 1 | create genes from miRBase alignments; 0 -- to skip |
-run_context | eg | style of the stable identifiers: eg -- ENSRNA\d{9} like; vb -- use species.stable_id_prefix from the core DB meta table |
-gene_source | species.division from meta if defined; Ensembl -- otherwise | name to use as a gene source |
-mirbase_source_logic_name | mirbase | logic_name of the source alignments (already existing in the DB) |
-mirbase_target_logic_name | mirbase_gene | logic_name for the genes to be created |
-trnascan_source_logic_name | trnascan_align | logic_name of the source alignments (already existing in the DB) |
-trnascan_target_logic_name | trnascan_gene | logic_name for the genes to be created |
-rfam_version | RFAM_VERSION | Rfam version to include in the default -cmscan_source_logic_name and -cmscan_target_logic_name values |
-cmscan_source_logic_name | cmscan_rfam_${rfam_version}_lca | logic_name of the source alignments (the default _lca suffix assumes strict taxonomic filtering was used) |
-cmscan_target_logic_name | rfam_${rfam_version}_gene | logic_name for the genes to be created |
-id_db_host | ENSEMBL_ENA_IDENTIFIERS_HOST | connection details for the stable IDs DB |
-id_db_port | ENSEMBL_ENA_IDENTIFIERS_PORT | connection details for the stable IDs DB |
-id_db_user | ENSEMBL_ENA_IDENTIFIERS_USER | connection details for the stable IDs DB |
-id_db_dbname | ENSEMBL_ENA_IDENTIFIERS_DBNAME | connection details for the stable IDs DB |
-id_db_pass | | connection details for the stable IDs DB |
-pipeline_tag | | Tag to append to the default -pipeline_name |
-pipeline_name | rna_genes_${ENS_VERSION}_<pipeline_tag> | The hive database name will be ${USER}_${pipeline_name} |
-production_lookup | 1 | Fetch analysis display name, description and web data from the production database; 0 -- to disable |
Taxonomic filtering, if enabled in RNAFeatures_conf (doc), already removes many false positives. This pipeline assumes that RNA gene creation should be conservative and expects the taxonomic filtering to be based on shared ancestry (rather than at the level of divisions).
A few filters are applied automatically:
- regulatory elements (everything with "misc_RNA" biotype) are filtered out;
- palindromic RNA sequences can lead to overlapping credible hits (on the same or different strands); only one such alignment, the one with the better E-value, is converted to a gene.
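The overlap rule above can be illustrated with a small sketch. It is not the pipeline's own implementation: the whitespace-separated hit layout (region, start, end, strand, E-value, model name), the sample hits, and the greedy best-first selection are all assumptions made for illustration.

```sh
# Hypothetical hits: seq_region start end strand evalue model_name
hits='chr1 100 180 1 1e-12 mir-10
chr1 105 175 -1 1e-9 mir-10
chr1 500 570 1 1e-7 mir-21
chr2 100 180 -1 1e-8 mir-10'

pick_best_overlapping() {
  # Sort by E-value (best first), then keep a hit only if it does not
  # overlap an already kept hit on the same region, on either strand.
  LC_ALL=C sort -g -k5,5 | awk '
    {
      keep = 1
      for (i = 1; i <= n; i++)
        if (reg[i] == $1 && ($2 + 0) <= e[i] && ($3 + 0) >= s[i]) { keep = 0; break }
      if (keep) { n++; reg[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; print }
    }'
}

kept=$(printf '%s\n' "$hits" | pick_best_overlapping)
printf '%s\n' "$kept"
```

With these sample hits, the reverse-strand hit at chr1:105-175 is dropped in favour of the overlapping hit with the better E-value, while the hit on chr2 is kept because it lies on a different region.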
miRBase alignments are trusted and not filtered.
cmscan filters can be tweaked using these pipeline options (see the Infernal docs for some details):
option | default value | meaning |
---|---|---|
-evalue_threshold | 1e-6 | cmscan E-value threshold |
-truncated | 0 | allow processing of truncated alignments; 1 -- partial genes are allowed |
-nonsignificant | 0 | allow processing of non-significant alignments; 1 -- to allow |
-bias_threshold | 0.3 | maximum degree of allowable GC/AT bias |
-has_structure | 1 | only use features if they have structure; 0 -- allow features lacking structure |
-allow_repeat_overlap | 1 | allow genes which overlap a repeat feature; 0 -- to disallow (not recommended) |
-allow_coding_overlap | 0 | allow genes which overlap a protein-coding exon; 1 -- to allow |
-maximum_per_hit_name | { 'pre_miRNA' => 100, } | limits on usage/sharing of the same hit (model) name |
tRNAscan filters:
- always applied:
  - tRNA genes are not allowed to overlap repeat regions
  - tRNA genes are not allowed to overlap protein-coding exons
- configurable:
option | default value | meaning |
---|---|---|
-score_threshold | 40 | a threshold for the COVE score |
Stable IDs can be preserved between releases and reruns using the -id_db_* database options.
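As a sketch of how these options might be supplied, the snippet below assembles them into the ${OTHER_OPTIONS} variable used by the init command. All values are hypothetical placeholders, and it is an assumption (based on the options table) that a run with stable ID mapping also needs -old_registry.

```sh
# Hypothetical placeholder values; replace with real connection details.
OLD_REG_FILE=/path/to/previous_release_registry.pm
ID_DB_HOST=id-db.example.org
ID_DB_PORT=3306
ID_DB_USER=rw_user
ID_DB_PASS=secret
ID_DB_DBNAME=stable_ids

# Extra options meant to be passed via ${OTHER_OPTIONS} in the init command.
# That expansion is unquoted, so values must not contain whitespace.
OTHER_OPTIONS="-old_registry $OLD_REG_FILE \
-id_db_host $ID_DB_HOST -id_db_port $ID_DB_PORT \
-id_db_user $ID_DB_USER -id_db_pass $ID_DB_PASS \
-id_db_dbname $ID_DB_DBNAME"

echo "$OTHER_OPTIONS"
```

For such a run, the -all_new_species option in the init command would presumably be set to 0 (its default) rather than 1, since 1 disables mapping.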
A few generic modules come from Common::RunnableDB, and a few from RNAFeatures.