VeRNAml

The goal of this project is to provide a benchmark dataset for 2.5D RNA graphs. We demonstrate its use with baseline results on predicting RNA binding-site nodes, and we use motif fingerprints to compare motif sets generated by automated motif-finding programs. We rely on interpretable prediction models at first (i.e., decision trees) to determine motif importance. The planned outputs are cleaned and curated data sources; tuned parameters for VeRNAl motif extraction; trained classification models for RNA-{protein/RNA/small molecule/function} prediction; and novel functional insights into conserved RNA structural patterns.

Associated Repositories:

VeRNAl RNAMigos

The training data for VeRNAml consists of networkx graphs sliced into portions containing RNA interfaces and their respective complements. The graphs are '2.5D': tertiary structure is retained through a discrete set of edge types corresponding to the possible base-pairing geometries. Here is an example of one overlaid on a PDB structure. Backbones are in white, canonical Watson-Crick bonds are in green and non-canonical bonds are in red.

RNA motif binding to CMC ligand
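To make the graph representation concrete, here is a minimal sketch of a toy graph with typed edges. The edge attribute name `label`, the label strings (`B53`, `CWW`, `TSH`), the node naming scheme, and the use of an undirected graph are all assumptions made for illustration; the actual graphs in the datasets may be encoded differently.

```python
import networkx as nx

# Hypothetical edge labels, assumed for this sketch only:
# "B53" marks a backbone edge, "CWW" a canonical Watson-Crick pair,
# and any other label a non-canonical base-pairing geometry.
BACKBONE = "B53"
CANONICAL = "CWW"


def build_toy_graph():
    """Build a tiny hairpin-like graph whose edges carry a discrete type."""
    g = nx.Graph()
    # Nodes are residues, named here as (chain, position).
    residues = [("A", i) for i in range(1, 7)]
    g.add_nodes_from(residues)
    # Backbone edges connect consecutive residues.
    for u, v in zip(residues, residues[1:]):
        g.add_edge(u, v, label=BACKBONE)
    # One canonical and one non-canonical base pair.
    g.add_edge(("A", 1), ("A", 6), label=CANONICAL)
    g.add_edge(("A", 2), ("A", 5), label="TSH")
    return g


def count_edge_types(g):
    """Count backbone, canonical and non-canonical edges in a graph."""
    counts = {"backbone": 0, "canonical": 0, "non-canonical": 0}
    for _, _, label in g.edges(data="label"):
        if label == BACKBONE:
            counts["backbone"] += 1
        elif label == CANONICAL:
            counts["canonical"] += 1
        else:
            counts["non-canonical"] += 1
    return counts
```

The three counts correspond to the white, green and red edges in the figure.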

1. FR3D Data

To generate this data:

  1. Retrieve a representative set of RCSB PDB structures.
  2. Find all interfaces within structures.
  3. Slice native RNA graphs into interface and complement parts.

The prepare_data package contains all the scripts for these tasks. Because the process can take some time, the following pre-built datasets can be downloaded from MEGA instead:

| Dataset | Graphs | Nodes | Edges | Avg. Nodes | Avg. Edges | Links |
|---|---|---|---|---|---|---|
| ALL | 2679 | 447225 | 641968 | 166.9 | 239.6 | link |
| ALL complement | 9034 | 195395 | 228261 | 21.6 | 25.3 | |
| RNA-Protein | 2750 | 411487 | 587961 | 149.6 | 213.8 | link |
| RNA-Protein complement | 8265 | 241611 | 322324 | 29.2 | 39.0 | |
| RNA-RNA | 2737 | 59333 | 79116 | 21.7 | 28.9 | link |
| RNA-RNA complement | 2483 | 55001 | 70551 | 22.2 | 28.4 | |
| RNA-Small_Mol. | 166 | 981 | 1004 | 5.9 | 6.0 | link |
| RNA-Small_Mol. complement | 140 | 973 | 1038 | 7.0 | 7.4 | |
| RNA-Ion | 572 | 3490 | 3764 | 6.1 | 6.6 | link |
| RNA-Ion complement | 493 | 3691 | 3993 | 7.5 | 8.1 | |
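The averages in the table are totals divided by the number of graphs. The following sketch shows how such summary statistics could be recomputed for a downloaded dataset once its graphs are loaded into a list. The function name `dataset_stats` is hypothetical, and loading the graphs from disk is left out because the on-disk format is not described here.

```python
import networkx as nx


def dataset_stats(graphs):
    """Summarise a list of networkx graphs: totals and per-graph averages."""
    n_graphs = len(graphs)
    n_nodes = sum(g.number_of_nodes() for g in graphs)
    n_edges = sum(g.number_of_edges() for g in graphs)
    return {
        "graphs": n_graphs,
        "nodes": n_nodes,
        "edges": n_edges,
        # Guard against an empty dataset to avoid dividing by zero.
        "avg_nodes": round(n_nodes / n_graphs, 1) if n_graphs else 0.0,
        "avg_edges": round(n_edges / n_graphs, 1) if n_graphs else 0.0,
    }


if __name__ == "__main__":
    # Two small stand-in graphs; real usage would pass the loaded dataset.
    example = [nx.path_graph(4), nx.cycle_graph(6)]
    print(dataset_stats(example))
```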

1.1 Retrieve a Representative Set of PDB Structures

To avoid redundancy in the training data, the BGSU representative set of RNAs is used. It can be downloaded from here [1].

Make a directory to store the structures

mkdir -p data/structures

Then run the following command to retrieve the PDB structures from the RCSB database

python prepare_data/retrieve_structures.py <BGSU file> data/structures
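As a rough illustration of what this step involves, the sketch below extracts unique PDB IDs from a representative-set file and builds a download URL for each. It is not the code in retrieve_structures.py. It assumes the BGSU file is a CSV whose second column holds the representative chain(s) in the form `PDBID|model|chain`, joined by `+` when several chains are grouped together, and it assumes structures are fetched from the `files.rcsb.org/download` endpoint in mmCIF format. Both function names are hypothetical.

```python
import csv

# Assumed download endpoint; the real script may use a different source
# or file format.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def representative_pdb_ids(lines):
    """Return unique PDB IDs from the representative column, in file order."""
    ids = []
    seen = set()
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        # The representative entry may group several chains with "+".
        for chain in row[1].split("+"):
            pdb_id = chain.split("|")[0].strip().upper()
            if pdb_id and pdb_id not in seen:
                seen.add(pdb_id)
                ids.append(pdb_id)
    return ids


def download_url(pdb_id):
    """Build the (assumed) RCSB download URL for one structure."""
    return RCSB_URL.format(pdb_id=pdb_id)
```

Passing an open file handle for the BGSU file to `representative_pdb_ids` yields the list of structures to fetch.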

1.2 Find Interfaces in the PDB Structures and Slice Their RNA Graphs

Make a directory for the native graphs and the interface graphs

mkdir -p data/graphs/interfaces data/graphs/native

Download the set of native RNA graphs from here and extract the compressed files into the native directory.

Now run prepare_data/main.py to find all the interfaces and slice the graphs. This process will take a few hours.

python prepare_data/main.py data/graphs/interfaces
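Conceptually, slicing splits a native graph into the subgraph induced by the interface residues and the subgraph induced by everything else. The sketch below shows that idea with networkx. It is an assumption about the general approach, not the logic in prepare_data/main.py, which may, for example, further split the parts or expand the interface by neighbourhood. The function name `slice_graph` is hypothetical.

```python
import networkx as nx


def slice_graph(native, interface_nodes):
    """Split a native graph into interface and complement subgraphs.

    `interface_nodes` is any iterable of node identifiers; identifiers
    that are not in the graph are ignored.
    """
    interface_set = set(interface_nodes) & set(native.nodes)
    complement_set = set(native.nodes) - interface_set
    # .copy() turns the read-only subgraph views into independent graphs,
    # so they can be saved or modified separately.
    interface = native.subgraph(interface_set).copy()
    complement = native.subgraph(complement_set).copy()
    return interface, complement
```

Edges that cross between the two node sets are dropped by this construction, so each part keeps only its internal edges.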

Note

  • An optional parameter -t can be added to specify the RNA interaction type. The default is all, but it can be any of rna, protein, ion or ligand. For multiple interaction types, pass a quoted string with the types separated by spaces.
  • Once the PDB interfaces have been found, you can run the script again with the -interface_list_input interface_residues_list.csv option to reuse the interfaces computed by the previous call and speed up execution.
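The sketch below illustrates how options like these could be parsed with argparse, including a quoted, space-separated list of interaction types. It is only an illustration of the behaviour described in the notes: the actual argument handling in prepare_data/main.py is not shown in this README, and the names `parse_interaction_types`, `build_parser`, `output_dir` and `interaction_types` are hypothetical.

```python
import argparse

VALID_TYPES = ("rna", "protein", "ion", "ligand")


def parse_interaction_types(value):
    """Turn 'all' or a space-separated string into a list of types."""
    if value == "all":
        return list(VALID_TYPES)
    types = value.split()
    unknown = [t for t in types if t not in VALID_TYPES]
    if unknown or not types:
        raise argparse.ArgumentTypeError(
            f"expected 'all' or any of {VALID_TYPES}, got {value!r}"
        )
    return types


def build_parser():
    """Build a parser mirroring the options described in the notes."""
    parser = argparse.ArgumentParser(description="Find and slice interfaces")
    parser.add_argument("output_dir", help="directory for interface graphs")
    # A string default is passed through `type`, so "all" expands to
    # the full list of interaction types.
    parser.add_argument("-t", dest="interaction_types",
                        type=parse_interaction_types, default="all")
    parser.add_argument("-interface_list_input", dest="interface_list_input",
                        default=None,
                        help="CSV of interfaces from a previous run")
    return parser
```

For example, `build_parser().parse_args(["data/graphs/interfaces", "-t", "rna protein"])` selects the RNA and protein interaction types.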

2. DSSR Data

The code base for preparing the DSSR data lives in a separate repository, RNAGlib, which has not been published yet. For now, the most recent version of the data can be downloaded here.

References

  1. Leontis, N. B., & Zirbel, C. L. (2012). Nonredundant 3D Structure Datasets for RNA Knowledge Extraction and Benchmarking. In N. Leontis & E. Westhof (Eds.), RNA 3D Structure Analysis and Prediction (Vol. 27, pp. 281–298). Springer Berlin Heidelberg. doi:10.1007/978-3-642-25740-7_13
  2. Lu, X. J., & Olson, W. K. (2008). 3DNA: A versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nature Protocols, 3, 1213–1227. ISSN 1754-2189. http://3dna.rutgers.edu/