The overall goal of this project is to provide a benchmark dataset for 2.5D RNA graphs. We demonstrate the use of this dataset with baseline results on predicting RNA binding-site nodes. We also use motif fingerprints to compare motif sets generated by automated motif-finding programs. We will rely on interpretable prediction models at first (i.e. decision trees) to determine motif importance. The overall result will be cleaned and curated data sources; tuned parameters for VeRNAl motif extraction; trained classification models for RNA-{protein/RNA/small molecule/function} prediction; and novel functional insights on conserved RNA structural patterns.
The training data for VernaML consists of networkx graphs which are sliced into portions containing RNA interfaces and their respective complement counterparts. Graphs are '2.5D': the tertiary structure is preserved by retaining a discrete set of edge types, one for each possible base-pairing geometry. Here is an example of one such graph overlaid on a PDB structure. Backbones are in white, canonical Watson-Crick bonds are in green and non-canonical bonds are in red.
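As a rough illustration of what a 2.5D graph looks like in networkx, the sketch below builds a five-nucleotide toy graph with typed edges and counts them by kind. The edge attribute name `label`, the Leontis-Westhof style values (`B53`, `CWW`, `TSH`), the `(chain, position)` node identifiers and the use of an undirected graph are all assumptions made for this example; the datasets themselves may use different names and conventions.

```python
from collections import Counter

import networkx as nx

# Hypothetical edge labels in Leontis-Westhof style. The attribute name
# "label" and these values are assumptions for illustration only.
BACKBONE = {"B53"}
CANONICAL = {"CWW"}


def edge_kind(label):
    """Map an edge label to backbone / canonical / non-canonical."""
    if label in BACKBONE:
        return "backbone"
    if label in CANONICAL:
        return "canonical"
    return "non-canonical"


def build_toy_graph():
    """Build a five-nucleotide toy graph with typed edges."""
    g = nx.Graph()
    # Nodes are (chain, position) pairs; this naming is an assumption.
    nodes = [("A", i) for i in range(1, 6)]
    g.add_nodes_from(nodes)
    # Backbone edges connect consecutive nucleotides.
    for u, v in zip(nodes, nodes[1:]):
        g.add_edge(u, v, label="B53")
    # One canonical pair and one non-canonical pair.
    g.add_edge(("A", 1), ("A", 5), label="CWW")
    g.add_edge(("A", 2), ("A", 4), label="TSH")
    return g


def count_edge_kinds(g):
    """Count the edges of a graph by kind."""
    return Counter(edge_kind(d["label"]) for _, _, d in g.edges(data=True))


if __name__ == "__main__":
    print(count_edge_kinds(build_toy_graph()))
```

The three kinds correspond to the white, green and red edges described above.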
To generate this data:
- Retrieve a representative set of RCSB PDB structures.
- Find all interfaces within structures.
- Slice native RNA graphs into interface and complement parts.
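The slicing step can be sketched as follows, assuming the interface residues are already known. This is not the code in prepare_data: it works on a plain list of `(u, v, label)` edges, treats the complement as simply "every node that is not in the interface", and drops edges that cross the boundary. The function name `slice_graph` is hypothetical.

```python
def slice_graph(edges, interface_nodes):
    """Split an edge list into interface and complement parts.

    An edge is kept in a part only if both of its endpoints belong to
    that part (an induced subgraph). Edges crossing the boundary are
    dropped. This is an illustrative assumption, not the project's rule.
    """
    interface_nodes = set(interface_nodes)
    all_nodes = {n for u, v, _ in edges for n in (u, v)}
    complement_nodes = all_nodes - interface_nodes

    interface_edges = [
        e for e in edges
        if e[0] in interface_nodes and e[1] in interface_nodes
    ]
    complement_edges = [
        e for e in edges
        if e[0] in complement_nodes and e[1] in complement_nodes
    ]
    return interface_edges, complement_edges


if __name__ == "__main__":
    # Toy graph: a five-nucleotide backbone closed by one canonical pair.
    toy_edges = [
        (1, 2, "B53"), (2, 3, "B53"), (3, 4, "B53"), (4, 5, "B53"),
        (1, 5, "CWW"),
    ]
    interface, complement = slice_graph(toy_edges, interface_nodes={1, 2})
    print("interface:", interface)
    print("complement:", complement)
```

On a networkx graph `G`, the same induced split can be obtained with `G.subgraph(nodes)`.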
The prepare_data package contains all the scripts for these tasks. The process can take some time, so alternatively the following pre-built datasets can be downloaded from MEGA:
Dataset | Graphs | Nodes | Edges | Avg. Nodes | Avg. Edges | Links |
---|---|---|---|---|---|---|
ALL | 2679 | 447225 | 641968 | 166.9 | 239.6 | link |
ALL complement | 9034 | 195395 | 228261 | 21.6 | 25.3 | |
RNA-Protein | 2750 | 411487 | 587961 | 149.6 | 213.8 | link |
RNA-Protein complement | 8265 | 241611 | 322324 | 29.2 | 39.0 | |
RNA-RNA | 2737 | 59333 | 79116 | 21.7 | 28.9 | link |
RNA-RNA complement | 2483 | 55001 | 70551 | 22.2 | 28.4 | |
RNA-Small_Mol. | 166 | 981 | 1004 | 5.9 | 6.0 | link |
RNA-Small_Mol. complement | 140 | 973 | 1038 | 7.0 | 7.4 | |
RNA-Ion | 572 | 3490 | 3764 | 6.1 | 6.6 | link |
RNA-Ion complement | 493 | 3691 | 3993 | 7.5 | 8.1 | |
To avoid redundancies in the training data, the BGSU representative set of RNAs is used. It can be downloaded from here [1].
Make a directory to store the structures:
mkdir -p data/structures
Then run the following command to retrieve the PDB structures from the RCSB database:
python prepare_data/retrieve_structures.py <BGSU file> data/structures
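For orientation, the sketch below shows one way the PDB identifiers could be pulled out of a BGSU representative-set file and turned into RCSB download URLs. It is not the code in retrieve_structures.py. It assumes the BGSU file is a CSV whose second column holds the representative as `PDB|model|chain`, and the function names and sample rows are made up for illustration. It performs no network access.

```python
import csv

# Download URL pattern for mmCIF files on the RCSB file server.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def pdb_ids_from_bgsu(rows):
    """Return the unique PDB IDs of the class representatives, in order.

    Assumes each CSV row looks like: class name, representative, members,
    with the representative written as "PDB|model|chain".
    """
    ids = []
    for row in csv.reader(rows):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pdb_id = row[1].split("|")[0].strip().upper()
        if pdb_id and pdb_id not in ids:
            ids.append(pdb_id)
    return ids


def download_urls(pdb_ids):
    """Build one download URL per PDB ID."""
    return [RCSB_URL.format(pdb_id=pdb_id) for pdb_id in pdb_ids]


if __name__ == "__main__":
    # Made-up rows in the assumed format; the IDs are placeholders.
    sample = [
        '"NR_4.0_00001.1","1ABC|1|A","1ABC|1|A,2XYZ|1|B"',
        '"NR_4.0_00002.1","3DEF|1|C+3DEF|1|D","3DEF|1|C+3DEF|1|D"',
    ]
    for url in download_urls(pdb_ids_from_bgsu(sample)):
        print(url)
```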
Make a directory for the native graphs and the interface graphs
mkdir data/graphs
mkdir data/graphs/interfaces
mkdir data/graphs/native
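If you prefer to set up the layout from Python, the directories used in this section can be created in one go. This is a convenience sketch; the helper name `make_data_dirs` is hypothetical and not part of prepare_data.

```python
from pathlib import Path


def make_data_dirs(root="data"):
    """Create the structure and graph directories used in this section."""
    root = Path(root)
    dirs = [
        root / "structures",
        root / "graphs" / "interfaces",
        root / "graphs" / "native",
    ]
    for d in dirs:
        # parents=True also creates data/ and data/graphs/ when missing;
        # exist_ok=True makes the call safe to repeat.
        d.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    for d in make_data_dirs():
        print(d)
```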
Download the set of native RNA graphs from here and extract the compressed files into the data/graphs/native directory.
Now run prepare_data/main.py to find all the interfaces and slice the graphs. This process will take a few hours:
python prepare_data/main.py data/graphs/interfaces
- An optional parameter -t can be added to specify the RNA interaction type. The default is all, but it can be any of rna, protein, ion, ligand. Use a quoted string separated by spaces for multiple interaction types.
- Once the PDB interfaces are found, if you would like to run the script again, use the -interface_list_input interface_residues_list.csv option to reuse the interfaces computed in the previous call and speed up execution.
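To make the quoting rule for -t concrete, here is a minimal sketch of how a quoted, space-separated value could be parsed with argparse. It is not the parser in prepare_data/main.py. The names `parse_interaction_types`, `build_parser` and `interaction_types` are hypothetical, and expanding all into the four listed types is an assumption.

```python
import argparse

VALID_TYPES = ("rna", "protein", "ion", "ligand")


def parse_interaction_types(value):
    """Turn a quoted, space-separated string into a list of types."""
    if value == "all":
        # Assumption: "all" stands for every listed interaction type.
        return list(VALID_TYPES)
    types = value.split()
    unknown = [t for t in types if t not in VALID_TYPES]
    if not types or unknown:
        raise argparse.ArgumentTypeError(
            f"expected 'all' or any of {' '.join(VALID_TYPES)}, got {value!r}"
        )
    return types


def build_parser():
    """Build a parser with an output directory and the -t option."""
    parser = argparse.ArgumentParser(description="Find and slice interfaces.")
    parser.add_argument("output_dir")
    # argparse also applies `type` to a string default, so "all" expands.
    parser.add_argument(
        "-t",
        dest="interaction_types",
        type=parse_interaction_types,
        default="all",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["data/graphs/interfaces", "-t", "rna protein"]
    )
    print(args.interaction_types)
```

On the command line, the quotes keep the shell from splitting the value, so the call above corresponds to passing -t "rna protein".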
The code base to prepare the DSSR data is stored in another repository called RNAGlib, which has not been published yet. For now, the most recent version of the data can be downloaded here.
- Leontis, N. B., & Zirbel, C. L. (2012). Nonredundant 3D Structure Datasets for RNA Knowledge Extraction and Benchmarking. In N. Leontis & E. Westhof (Eds.), RNA 3D Structure Analysis and Prediction (Vol. 27, pp. 281–298). Springer Berlin Heidelberg. doi:10.1007/978-3-642-25740-7_13
- Lu, X. J., & Olson, W. K. (2008). 3DNA: A versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nature Protocols, 3, 1213–1227. ISSN 1754-2189. http://3dna.rutgers.edu/