The Triplexer is the computational pipeline that builds the backend database of
TriplexRNA, a database of cooperative microRNAs
and their mutual targets.
The pipeline is based on the work of Lai et al.
and Schmitz et al., and extended to cover
multiple organisms and prediction algorithms.
The only requirement is Docker, which can be installed in different ways depending on the underlying operating system:
- Unix users should follow the Docker installation for Linux, and install both Docker and Docker Compose
- macOS 10.13+ users should follow the Docker installation for Mac
- Windows 10+ users should follow the Docker installation for Windows
- For legacy systems, users can rely on the Docker Toolbox.
The Triplexer defines three operations: read, filtrate, and annotate.
Each of them operates on a namespace, i.e. a resource (file, database,
etc.) that describes the RNA duplexes of a specific organism.
Namespaces are used to capture the provenance of a predicted RNA duplex, and
subsequently keep the identification of putative RNA triplexes consistent
across different organisms and genome releases.
The read operation parses a file (or queries a database) containing the attributes
of a set of organism-specific RNA duplexes, and stores their attributes in the
underlying Redis cache as a set of hashes.
Since each namespace defines its own data structures, identifiers, and
granularity of data, this operation is likely to be redefined by each
namespace. However, output data structures share a common schema regardless of
their namespace of origin. For instance, each RNA duplex is identified by the
unique string:
<namespace label>:<dataset release>:<organism>:<genome build>:target:<target id>
For more information about a namespace-specific read implementation, please refer to IMPLEMENTATIONS.md.
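As an illustration of the common schema, here is a minimal Python sketch that builds the identifier described above and stores one duplex as a hash. The function name `duplex_key`, the hash fields (`mirna`, `seed_start`), and the target id `target_1` are hypothetical and not taken from the Triplexer code base. A plain dictionary stands in for the Redis cache so that the sketch runs without a server.

```python
def duplex_key(namespace, release, organism, genome, target_id):
    """Build an identifier following the schema
    <namespace label>:<dataset release>:<organism>:<genome build>:target:<target id>
    """
    return f"{namespace}:{release}:{organism}:{genome}:target:{target_id}"


# Stand-in for the Redis cache: maps each key to a hash (field -> value).
cache = {}

# "target_1" and the hash fields below are made-up illustration values.
key = duplex_key("microrna.org", "aug.2010", "hsa", "hg19", "target_1")
cache[key] = {"mirna": "mirna_a", "seed_start": "100"}

print(key)
```

Because the namespace, release, organism, and genome build are all part of the key, duplexes read from different namespaces should not collide in the same cache.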
Experimental findings suggest that RNA triplexes form when two cooperating
miRNAs bind a common target gene with a seed site distance between 13 and 35
nucleotides (Saetrom et al. 2007).
This means that duplex pairs that share a common target must be tested for
compliance with the aforementioned seed site distance constraint.
Filtrate relies on the read operation (see above). It compares all the cached
duplexes that share a common target gene, and keeps those pairs that comply
with the seed site distance constraint. This operation is namespace agnostic.
Its behavior can be summarized by the following pseudo-code:
for each target in the set of targets:
    for each duplex pair in the set of the target's duplexes:
        if the duplex pair has miRNA alignments within the binding range constraint:
            cache the target
            cache the duplex pair
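The pseudo-code above could be sketched in Python roughly as follows. This is not the Triplexer implementation: the input layout (a dictionary of duplex lists keyed by target), the field names `mirna` and `seed_start`, and the choice to measure the distance between seed start positions are all assumptions made for illustration. Only the 13 to 35 nucleotide range comes from the text.

```python
from itertools import combinations

# Seed site distance range in nucleotides, as stated in the text above.
MIN_DISTANCE = 13
MAX_DISTANCE = 35


def filtrate(duplexes_by_target):
    """Return, per target, the miRNA pairs whose seed sites are 13-35 nt apart."""
    candidates = {}
    for target, duplexes in duplexes_by_target.items():
        # Test every unordered pair of duplexes that share this target.
        for first, second in combinations(duplexes, 2):
            # Assumption: distance is measured between seed start positions.
            distance = abs(first["seed_start"] - second["seed_start"])
            if MIN_DISTANCE <= distance <= MAX_DISTANCE:
                candidates.setdefault(target, []).append(
                    (first["mirna"], second["mirna"])
                )
    return candidates


# Made-up example data; identifiers are placeholders.
example = {
    "target_1": [
        {"mirna": "mirna_a", "seed_start": 100},
        {"mirna": "mirna_b", "seed_start": 120},
        {"mirna": "mirna_c", "seed_start": 300},
    ],
    "target_2": [
        {"mirna": "mirna_a", "seed_start": 10},
        {"mirna": "mirna_d", "seed_start": 15},
    ],
}

print(filtrate(example))
```

In this example only the pair (`mirna_a`, `mirna_b`) on `target_1` is kept, since its seed sites are 20 nucleotides apart. Targets with no compliant pair, such as `target_2`, are not cached at all.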
The in-silico testing of a putative RNA triplex's structural stability can only be performed when the nucleotide sequences of both the target gene's transcript and the miRNA pair are given. However, not all datasets provide this information. For this reason, the annotate operation retrieves the genomic sequence of a duplex's target gene from UCSC, and caches the transcript sequence for later stability testing.
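The retrieve-then-cache behavior described above can be sketched as follows. This is only an outline under stated assumptions: `annotate`, `fetch_sequence`, and `fake_fetch` are hypothetical names, the sequence is made up, and the actual retrieval from UCSC is replaced by a stub so that the sketch runs offline.

```python
def annotate(target_ids, fetch_sequence, cache):
    """Cache the sequence of each target, fetching it only when missing."""
    for target_id in target_ids:
        if target_id not in cache:
            cache[target_id] = fetch_sequence(target_id)
    return cache


# Records which targets were actually fetched, to show the effect of caching.
fetched = []


def fake_fetch(target_id):
    """Stub standing in for a remote lookup; the sequence is made up."""
    fetched.append(target_id)
    return {"target_1": "ACGTACGT"}[target_id]


sequences = annotate(["target_1", "target_1"], fake_fetch, {})
print(sequences)
print(fetched)
```

Passing the fetcher in as a parameter keeps the caching logic independent of how sequences are retrieved, so a target shared by many duplexes is looked up only once.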
To run the Triplexer pipeline, you need to start the Triplexer Docker container and all the containers it relies on. This is done via Docker Compose. Type:
docker-compose run triplexer
You can now launch the Triplexer pipeline. Run it with no arguments to get an overview of its command-line options:
$ triplexer
usage: triplexer [-h] [-v] [-c CONF] [-e EXE] [-d DB] [-r] [-f] [-a] [-n NS]

Predict and simulate putative RNA triplexes.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         print the version and exit
  -c CONF, --conf CONF  set CONF as configuration file
  -e EXE, --exe EXE     set EXE as number of parallely executing processes
  -d DB, --db DB        set DB as intermediate results database

operations (require -n):
  -r, --read            read the provided dataset in memory
  -f, --filtrate        filter entries not forming putative triplexes
  -a, --annotate        annotate transcripts with their sequences

namespace:
  -n NS, --ns NS        set NS as model organism namespace

supported NS (default "test"):
+-------+----------------------------------+
| NS | database:version:organism:genome |
+-------+----------------------------------+
| test | microrna.org:aug.2010:hsa:hg19 |
| 1 | microrna.org:aug.2010:hsa:hg19 |
| 2 | microrna.org:aug.2010:mmu:mm9 |
| 3 | microrna.org:aug.2010:rno:rn4 |
| 4 | microrna.org:aug.2010:dme:dm3 |
+-------+----------------------------------+
The read, filtrate, and annotate operations rely on one another. It is therefore good practice to run them in this order, or at least to make sure that the underlying cache already holds the data an operation needs when running it in isolation.
Here are some examples of how to fill the underlying cache with duplexes from the microrna.org namespace.
- Read microrna.org's Human hg19 target site predictions:
triplexer -n 1 -r
- Filtrate all microrna.org's Human hg19 duplexes by keeping those whose miRNA pairs bind a common target gene within the allowed distance range. Do so using 4 parallel processes:
triplexer -e 4 -n 1 -f
- Annotate all microrna.org's Human hg19 duplexes with the transcript sequence of their target genes. Do so using 2 parallel processes:
triplexer -e 2 -n 1 -a
- Perform all aforementioned operations in one run. Do so using 4 parallel processes:
triplexer -e 4 -n 1 -r -f -a