The DNA Features pipeline runs three programs to annotate repeat features: RepeatMasker (repeatmask), DustMasker (dust), and TRF (trf).
A registry file with the locations of the core database server(s) and the production database (or -production_db $PROD_DB_URL
specified).
# REP_LIB_OPT= # or whatever
init_pipeline.pl Bio::EnsEMBL::EGPipeline::PipeConfig::DNAFeatures_conf \
$($CMD details script) \
-registry $REG_FILE \
-production_db "$($PROD_SERVER details url)""$PROD_DBNAME" \
-hive_force_init 1\
-pipeline_tag "_${SPECIES_TAG}" \
-pipeline_dir $OUT_DIR \
-species $SPECIES \
-redatrepeatmasker 0 \
-always_use_repbase 1 \
-repeatmasker_timer '10H' \
$REP_LIB_OPT \
-repeatmasker_repbase_species "$REPBASE_SPECIES_NAME" \
-max_seq_length 300000 \
2> $OUT_DIR/init.stderr \
1> $OUT_DIR/init.stdout
SYNC_CMD=$(cat $OUT_DIR/init.stdout | grep -- -sync'$' | perl -pe 's/^\s*//; s/"//g')
# should get something like
# beekeeper.pl -url $url -sync
LOOP_CMD=$(cat $OUT_DIR/init.stdout | grep -- -loop | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
# beekeeper.pl -url $url -reg_file $REG_FILE -loop
$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
option | default value | meaning |
---|---|---|
-division |
division (intersection with registry) to be processed | |
-species |
species to process, several -species options are possible |
|
-pipeline_dir |
directory to store results to | |
-report_dir |
direcory to write reports to | |
-dust |
1 | run DustMasker (dust); 0 -- not to run |
-trf |
1 | run TRF(trf); 0 -- not to run |
-repeatmasker |
1 | run RepeatMasker (repeatmask); 0 -- not to run. |
-redatrepeatmasker |
0 | use plant-specific library of DNA repeats REdat; 1 -- to run |
-redat_repeatmasker_library |
REDAT_REPEATMASKER_LIBRARY_PATH | path to REdat library, to be used with -redatrepeatmasker 1 |
-repeatmasker_library |
custom RepeatMasker libraries; several are allowed, of the form -repeatmasker_library ${SPECIES_NAME_1}=/path/to/lib/file |
|
-repeatmasker_logic_name |
use custom (pre-existing!) logic name instead of repeatmask_customlib when -repeatmasker_library is used (see notes below). |
|
-always_use_repbase |
0 | 1 -- to always run default (RepBase one) repeatmasker , even if custom -repeatmasker_library specifed (see notes below) |
-repeatmasker_repbase_species |
same as -species |
use this name to run default (RepBase-bases) repeatmasker |
-parameters_json |
JSON with species specific parameters (see below) | |
-repeatmasker_sensitivity ${SPECIES}|_all_=<level> |
all=automatic |
controls RepeatMasker trade-off between sensitivity and run-time. Possible levels: automatic , very_low , low , medium , high , very_high . See below. |
-delete_existing |
1 | Delete preexisting repeatmasker analysis data; 0 to disable |
-max_seq_length |
1_000_000 | maximum length for a single scaffold, scaffolds will be splitted on exciding this value |
-max_hive_capacity |
50 | The default hive capacity (i.e. the maximum number of concurrent workers) (no enforced upper limit, but above 150 you might run into problems with locking or database connections). |
-repeatmasker_timer |
16H |
default time limit for the RepeatMasker analysis |
-pipeline_name |
dna_features_${ENS_VERSION}_${pipeline_tag} |
The hive database name will be ${USER}_${pipeline_name} |
-pipeline_tag |
Tag to append to the default -pipeline_name |
|
-repeatmasker_exe |
RepeatMasker | Path to the RepeatMasker executable file. |
-repeatmasker_parameters |
-nolow -s -gccalc |
Default is to exclude low-complexity annotations; use slower (and more sensitive) search; and calculate (rather than estimate) GC content |
-dust_exe |
dustmasker | Path to the dust executable file. |
-dust_parameters_hash |
{} |
DustMasker module takes a hash of parameters, rather than requiring explicit command line options like RepeatMasker. It is generally not necessary to override the default parameters. |
-trf_exe |
trf | Path to the TRF executable file. |
-trf_parameters_hash |
{} |
TRF module (in ensembl-analysis) takes a hash of parameters, rather than requiring explicit command line options like RepeatMasker. It is generally not necessary to override the default parameter. |
-production_lookup |
1 | Fetch analysis display name, description and web data from the production database; 0 -- to disable |
-email_repeat_report |
1 | Send an email with a summary of the repeats when the pipeline finishes; 0 -- to disable |
To add use plants specific nrTEplants library,
which can be obtained at https://github.com/Ensembl/plant-scripts/releases/tag/v0.3,
define $NRPLANTSLIB
path and add these options to the init command:
-repeatmasker_library zea_mays=$NRPLANTSLIB \
-repeatmasker_logic_name zea_mays=repeatmask_nrplants
To run both custom repeatmask and default (RepBase) one
-repeatmasker_library bombyx_mori=/path/to/lib/file \
-repeatmasker_library pediculus_humanus=/path/to/another/lib/file \
-always_use_repbase 1
To use different species name for default (RepBase-based) repeatmasker run
-species dinothrombium_tinctorium \
-repeatmasker_repbase_species Tetranychus_urticae \
-repeatmasker_library dinothrombium_tinctorium=/path/to/another/lib/file \
-always_use_repbase 1
If you need to run the pipeline for a lot of species, it might be preferable to load the species specific parameters from a file. To do so, create a json file as a dictionary with the keys being the parameter names, like the following example.
{
"species" : [
"pediculus_humanus",
]
"repeatmasker_library" : {
"pediculus_humanus" : "/path/to/lib/pediculus_humanus.lib",
},
}
.
And load it with --parameters_json <path/to/json>
when running init command.
There's a trade-off with RepeatMasker between sensitivity and run-time.
Sensitivity is controlled via the alignment engine (crossmatch or a variant of NCBI BLAST+) and some command line flags,
but it's a pain for users of this pipeline to have to think about these sorts of things;
so, left to its own defaults, the pipeline will do its best to use sensible settings.
This default behaviour can be overridden, on either a species-specific or pipeline-wide level,
via the -repeatmasker_sensitivity
parameter, which can be set to one of:
automatic
(default), very_low
, low
, medium
, high
, very_high
.
The rationale and workings for the automatic settings are detailed elsewhere, but briefly,
it takes into account the genome size, the size of each DNA chunk, and the size of the repeat library, and sets sensitivity at low
, medium
or high
.
The very_high
option is only recommended if you absopositively need to wring every ounce of sensitivity out of RepeatMasker; compared to high
, the coverage gains will be marginal (fractions of a percentage point), and the execution time will be exponentially longer. Conversely, there's only a tiny amount of time to be gained by using very_low
as opposed to low
, and there's not likely to be much difference (if any) in sensitivity, so the very_low
option is probably only useful for running against a massive number of genomes (e.g. bacteria).
If running for a custom library with the -always_use_repbase
parameter,
any specificity parameter will apply to both analyses;
the only way to avoid this is to run the pipeline twice, once for RepBase, once for the custom library.
An example set of options to be added to run one species with high sensitivity and let the other be automatically determined:
-species pediculus_humanus \
-repeatmasker_library pediculus_humanus=/path/to/another/lib/file \
-repeatmasker_sensitivity pediculus_humanus=high
To force a particular sensitivity level across all species, use 'all' instead of the species name:
-repeatmasker_sensitivity all=low
A few generic from Common::RunnableDB.
A few from DNAFeatures.