The abinitio training workflow takes an assembly (parameter: genome) and
an evidence file from Maker (parameter: maker_evidence_gff) to filter
gene models and create training and test data sets for ab initio
evidence-driven prediction in Maker.
Run the workflow using the singularity profile:
params.yml:
subworkflow: 'abinitio_training'
genome: '/path/to/genome/assembly.fasta'
maker_evidence_gff: '/path/to/evidence/annotation.gff'
species_label: 'species_name'
codon_table: 1
aed_value:
- 0.2
- 0.3
locus_distance:
- 3000
- 4000
outdir: '/path/to/save/results'
To add the result folder to the Augustus config folder (for use in Maker, for instance), add the following line to params.yml:
maker_species_publishdir: '/PATH/augustus/config/species/'
Command line:
nextflow run NBISweden/pipelines-nextflow \
-profile singularity \
-params-file params.yml
- General:
  - maker_evidence_gff: Path to the GFF annotation.
  - genome: Path to the genome assembly.
  - outdir: Path to the results folder.
  - species_label: A species label for the training data.
  - maker_species_publishdir: A shared directory where a copy of the augustus species_label profile is saved.
  - codon_table: The number of the codon table to use for translation (default: 1).
  - aed_value: A list of model selection values to explore (smaller values mean higher stringency).
  - locus_distance: A list of locus distances (average distance between genes) to explore.
  - flank_region_size: The size of the flank region to include (default: 1000).
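The two list parameters, aed_value and locus_distance, describe a small search space. The sketch below shows one way to enumerate it, assuming every AED value is paired with every locus distance. That pairing is an assumption made for illustration and is not confirmed by this document. The function name parameter_grid is hypothetical and is not part of the pipeline.

```python
from itertools import product


def parameter_grid(aed_values, locus_distances):
    """Return every (aed_value, locus_distance) pairing as a list of dicts.

    Assumption: the search space is the full Cartesian product of the two lists.
    """
    return [
        {"aed_value": aed, "locus_distance": distance}
        for aed, distance in product(aed_values, locus_distances)
    ]


# Values taken from the params.yml example above.
grid = parameter_grid([0.2, 0.3], [3000, 4000])
for combination in grid:
    print(combination)
```

Under that assumption, the example params.yml would describe four combinations, so adding values to either list multiplies the number of training runs.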
In these workflows, the Nextflow process directive ext.args is used to inject command line tool parameters directly into the shell script.
These command line tool parameters can be changed by overriding the ext.args variable for the respective process in a configuration file.
nextflow.config:
process {
    withName: 'MODEL_SELECTION_BY_AED' {
        ext.args = '--value 0.3 -a _AED -t ">"'
    }
}
See Abinitio training modules config for the default tool configuration.
- Separate Maker evidence by record type.
- Select model by AED.
- Keep the longest isoform.
- Remove incomplete gene models.
- Filter by locus distance.
- Extract the protein sequence.
- BLAST sequences against themselves.
- Filter sequences.
- Create a training and test dataset.
- Train Augustus.
- Train SNAP.
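To make two of the filtering steps above concrete, here is a minimal Python sketch of "select model by AED" and "keep the longest isoform". It is illustrative only: the pipeline runs dedicated command line tools for these steps, and the function names below are hypothetical. It assumes mRNA records carry an _AED attribute (as in the ext.args example above) and a Parent attribute naming the gene. It also simplifies "longest" to the genomic span of the mRNA rather than the coding length.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=x;Parent=y;_AED=0.12' into a dict."""
    pairs = (field.split("=", 1) for field in column9.strip().split(";") if "=" in field)
    return {key: value for key, value in pairs}


def select_by_aed(gff_lines, max_aed):
    """Keep mRNA records whose _AED attribute is at most max_aed."""
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript-level records are considered here
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            kept.append(cols)
    return kept


def keep_longest_isoform(mrna_records):
    """Keep one mRNA per parent gene: the one with the largest genomic span (simplification)."""
    best = {}
    for cols in mrna_records:
        gene = parse_attributes(cols[8]).get("Parent")
        length = int(cols[4]) - int(cols[3]) + 1
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, cols)
    return [cols for _, cols in best.values()]


# Made-up records for illustration only.
example = [
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.10",
    "chr1\tmaker\tmRNA\t100\t1200\t.\t+\t.\tID=g1-RB;Parent=g1;_AED=0.25",
    "chr1\tmaker\tmRNA\t5000\t6000\t.\t-\t.\tID=g2-RA;Parent=g2;_AED=0.45",
]
selected = keep_longest_isoform(select_by_aed(example, max_aed=0.3))
selected_ids = [parse_attributes(cols[8])["ID"] for cols in selected]
print(selected_ids)  # ['g1-RB']
```

With a threshold of 0.3, the g2 transcript is discarded for its AED of 0.45, and of the two remaining g1 isoforms the longer one is kept.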