- Sequence motifs
    - 5' UTR
    - 3' UTR
    - Upstream gene's 3' UTR
    - Downstream gene's 5' UTR
    - Upstream intergenic region
    - Downstream intergenic region
    - CDS
- Sequence composition
    - 5' UTR GC/CT composition
    - 3' UTR GC/CT composition
    - CDS GC/CT composition
    - Polypyrimidine tract GC/CT composition
    - K-mer counts
- Sequence lengths
    - 5' UTR length
    - 3' UTR length
    - Polypyrimidine tract length
    - Intergenic region / inter-CDS length
- Other
    - CDS codon adaptation index (CAI)
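
As a rough illustration of the composition features listed above, the sketch below computes GC and CT fractions and overlapping k-mer counts for a single sequence. It is not the pipeline's own code: the function names `base_fraction` and `kmer_counts` and the example sequence are hypothetical, and it assumes that "GC/CT composition" means the fraction of G+C and of C+T bases in the region.

```python
from collections import Counter


def base_fraction(seq, bases):
    """Return the fraction of positions in `seq` whose base is in `bases`."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(seq.count(b) for b in bases) / len(seq)


def kmer_counts(seq, k):
    """Count overlapping k-mers of length `k` in `seq`."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Made-up 8 nt sequence, used only to exercise the helpers
utr = "ACGTTTCC"
gc = base_fraction(utr, "GC")   # assumed meaning of "GC composition"
ct = base_fraction(utr, "CT")   # assumed meaning of "CT composition"
dimers = kmer_counts(utr, 2)
```

The same helpers could be applied to each region type (5' UTR, 3' UTR, CDS, polypyrimidine tract) to produce one feature column per region.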
The Trypanosomatid Regulatory Elements prediction pipeline makes use of a number of R and Python packages, as well as several standalone tools.
Below is a list of the requirements needed to run this pipeline.
- Bioconductor (3.3+)
- Biostrings (2.40.0+)
- caret (6.0-73+)
- tibble (1.2+)
- seqLogo (1.38+)
conda create -n reg-predict --file requirements.txt \
--channel bioconda \
--channel conda-forge \
--channel pytorch
Note: In order to avoid running out of memory during execution, the hierarchical clustering portion of the EXTREME script `run_consensus_clusering_using_wm.pl` may need to be edited to increase the value `Xmx`, e.g.: `-Xmx10000m`.
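
If editing the script by hand is inconvenient, the change can be made programmatically. The following is a minimal sketch that assumes the script contains a Java option of the form `-Xmx<number><unit>`. The function name `set_java_heap` and the example line are hypothetical and are not copied from EXTREME.

```python
import re


def set_java_heap(script_text, heap="10000m"):
    """Replace every -Xmx<size> option in `script_text` with -Xmx<heap>."""
    return re.sub(r"-Xmx\d+[kKmMgG]?", "-Xmx" + heap, script_text)


# Hypothetical line for illustration; the real script's contents may differ
line = 'system("java -Xmx1000m -cp . HClust");'
updated = set_java_heap(line)
```

To apply this to the real file, read the script's text, pass it through `set_java_heap`, and write the result back, keeping a backup copy of the original.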
TODO: describe software for predicting UTR boundaries, etc.
snakemake --configfile settings/config.yml combine_motifs