GraphProt is a tool for modelling binding preferences of RNA-binding proteins from high-throughput experiments such as CLIP-seq and RNAcompete.
When using GraphProt, please cite:
- Maticzka, D., Lange, S. J., Costa, F. & Backofen, R. GraphProt: modeling binding preferences of RNA-binding proteins. Genome Biol. 15, R17 (2014).
GraphProt is available via bioconda. GraphProt and all of its dependencies can be installed via:
conda install graphprot
GraphProt contains a precompiled version of "EDeN", the SVM package used for feature creation and classification. This binary should run on most Linux-based systems. In case it does not run on your system, please call "bash ./recompile_EDeN.sh" from the GraphProt main directory.
GraphProt uses various open-source software packages. Please make sure that the following programs are installed and accessible via the PATH environment variable (i.e. you should be able to call each program by just issuing its name as a command).
- RNAshapes is used for GraphProt secondary structure predictions (required version: 2.1.6 available for download at https://github.com/bgruening/download_store/raw/master/RNAshapes/RNAshapes-2.1.6.tar.gz or via bioconda https://bioconda.github.io)
- libsvm is used for support vector regressions (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
- GNU make is used as the pipeline backend (http://www.gnu.org/software/make/)
- R is used to process nucleotide-wise margins for motif creation (www.r-project.org/)
- The R plyr package is required for calculating motifs (http://plyr.had.co.nz/) and can be installed from within R by issuing the command "install.packages('plyr')".
- The R stats package is required for calculating motifs. It is part of the standard R distribution and normally does not need to be installed separately.
- GraphProt uses WebLogo3 to plot sequence and structure motifs. GraphProt was tested using WebLogo version 3.2 (http://code.google.com/p/weblogo/downloads).
GraphProt will scan for these programs and notify you if something seems amiss. GraphProt contains a copy of fastapl.
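To check the PATH requirement before starting an analysis, a small script along the following lines may help. This is a minimal sketch and not part of GraphProt; the executable names in REQUIRED_TOOLS are assumptions (for example, "svm-train" stands in for libsvm) and may differ on your system.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["RNAshapes", "svm-train", "make", "R", "weblogo"]


def find_missing(tools):
    """Return the tools that cannot be found via the PATH variable."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```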
GraphProt analyses are started by calling "GraphProt.pl". If no options are given, GraphProt.pl will display a help message summarizing all available options. By default, analyses are run in classification setting; to switch to regression setting, use the parameter -mode regression. In general, GraphProt analyses are run by issuing different actions, e.g.
GraphProt.pl --action train -fasta train_positives.fa -negfasta train_negatives.fa
GraphProt supports input sequences in fasta format. The viewpoint mechanism treats all nucleotides in uppercase letters as viewpoints; nucleotides in lowercase letters are only used for RNA structure predictions.
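The uppercase/lowercase convention can be inspected before running GraphProt. The following sketch is an illustration only, not GraphProt code; the helper names and the example sequences are made up. It counts viewpoint (uppercase) and context (lowercase) nucleotides per fasta entry.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_viewpoints(sequence):
    """Return (viewpoint_count, context_count) for one sequence."""
    viewpoints = sum(1 for nt in sequence if nt.isupper())
    context = sum(1 for nt in sequence if nt.islower())
    return viewpoints, context


# Hypothetical input: a binding site in uppercase with lowercase flanks.
example = [">site_1", "acguACGUACGUacgu", ">site_2", "ggACGUcc"]
for name, seq in read_fasta(example):
    print(name, split_viewpoints(seq))
```

A sequence that reports zero viewpoints would contribute structure context only, which is usually a sign that the input was lowercased by accident.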
GraphProt parameters abstraction, R, D, bitsize, c, epsilon, epochs and lambda are set to default values. For best results, optimized parameters should be obtained with the ls parameter optimization setting.
Input files in classification setting are specified with parameters "-fasta" (binding sites) and "-negfasta" (unbound sites). For regressions, input sequences are specified with "-fasta" and sequence scores with "-affinities". The affinity file should contain one value per line, one for each sequence.
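For the regression setting, it can be worth checking that the affinity file really holds one value per sequence. The sketch below is not part of GraphProt; it assumes that affinities are plain numbers listed in the same order as the fasta entries, and the function names are hypothetical.

```python
def count_fasta_records(lines):
    """Count fasta entries by counting header lines."""
    return sum(1 for line in lines if line.startswith(">"))


def read_affinities(lines):
    """Parse one numeric affinity per non-empty line."""
    return [float(line) for line in lines if line.strip()]


def check_regression_input(fasta_lines, affinity_lines):
    """Return the affinities, or raise if the counts do not match."""
    n_sequences = count_fasta_records(fasta_lines)
    affinities = read_affinities(affinity_lines)
    if n_sequences != len(affinities):
        raise ValueError(
            f"{n_sequences} sequences but {len(affinities)} affinities"
        )
    return affinities


# Hypothetical in-memory input; with real files, pass open file handles.
fasta = [">seq_a", "ACGUACGU", ">seq_b", "GGCCGGCC"]
scores = ["0.5\n", "1.25\n"]
print(check_regression_input(fasta, scores))
```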
Output filenames can be specified via a prefix (-prefix); if no prefix is given, the default is "GraphProt".
Determines optimized parameters. Parameters are printed to screen and written to file "GraphProt.param".
Runs a 10-fold crossvalidation. Measures of classification performance are listed in "GraphProt.cv_results". In classification setting, crossvalidation results are written to file "GraphProt.cv_predictions". This file contains three columns:
- sequence id of training instance
- class of training instance
- predicted margin
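A three-column file of this kind can be summarized with a few lines of Python. The sketch below rests on assumptions that the text above does not state: columns are whitespace-separated, classes are encoded as 1 and -1, and a positive margin means a predicted binding site. The example records are invented.

```python
def parse_cv_predictions(lines):
    """Parse (sequence id, class, margin) records, skipping malformed lines."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        seq_id, label, margin = fields
        records.append((seq_id, float(label), float(margin)))
    return records


def sign_accuracy(records):
    """Fraction of records whose margin sign agrees with the class sign."""
    correct = sum(
        1 for _, label, margin in records if (margin > 0) == (label > 0)
    )
    return correct / len(records)


# Hypothetical file content; with a real file, pass an open file handle.
example = ["s1 1 0.8", "s2 -1 -0.3", "s3 1 -0.1", "s4 -1 0.2"]
records = parse_cv_predictions(example)
print(sign_accuracy(records))
```

The performance measures reported in "GraphProt.cv_results" remain the reference; this is only a quick sanity check on the raw predictions.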
Trains a GraphProt model. The model is written to file "GraphProt.model".
Predicts binding of whole sequences, e.g. CLIP sites. Predictions are written to file "GraphProt.predictions". This file contains three columns:
- sequence id from the fasta file
- predicted class
- prediction margin
Predicts binding profiles (nucleotide-wise margins) for sequences. Nucleotide-wise margins are written to file "GraphProt.profile". This file contains three columns:
- number of sequence
- number of nucleotide
- prediction for this nucleotide
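For downstream analysis it is often convenient to regroup the nucleotide-wise margins by sequence. The following sketch assumes whitespace-separated columns and integer sequence and nucleotide numbers; whether numbering starts at 0 or 1 is not specified here, and the code does not depend on it. The example lines are invented.

```python
from collections import defaultdict


def load_profile(lines):
    """Map each sequence number to its margins, ordered by nucleotide number."""
    profiles = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip empty or malformed lines
        seq_no, nt_no, margin = int(fields[0]), int(fields[1]), float(fields[2])
        profiles[seq_no].append((nt_no, margin))
    return {
        seq_no: [margin for _, margin in sorted(entries)]
        for seq_no, entries in profiles.items()
    }


# Hypothetical file content; with a real file, pass an open file handle.
example = ["0 0 0.1", "0 1 0.4", "1 0 -0.2", "0 2 -0.3"]
print(load_profile(example))
```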
Predicts high-affinity target sites as showcased in the GraphProt paper. Selects all regions whose scores, averaged over 12 nt, lie above a given percentile (parameter -percentile, defaults to 99). Average nucleotide-wise margins of high-affinity sites are written to file "GraphProt.has". This file contains three columns:
- number of sequence
- number of nucleotide
- average prediction for this nucleotide
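The idea of averaging margins over 12 nt and keeping only regions above a percentile can be illustrated as follows. This is a simplified sketch under stated assumptions, not GraphProt's implementation: it uses full sliding windows, a nearest-rank percentile, and computes the threshold from the windows of a single profile. All function names are hypothetical.

```python
import math


def window_averages(margins, width=12):
    """Mean margin of every full window of `width` consecutive nucleotides."""
    return [
        sum(margins[start:start + width]) / width
        for start in range(len(margins) - width + 1)
    ]


def percentile_threshold(values, percentile=99):
    """Nearest-rank percentile of `values` (0 < percentile <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def high_scoring_windows(margins, width=12, percentile=99):
    """Start positions of windows whose average lies above the threshold."""
    averages = window_averages(margins, width)
    if not averages:
        return []
    threshold = percentile_threshold(averages, percentile)
    return [start for start, avg in enumerate(averages) if avg > threshold]


# Toy profile with a short window and low percentile to keep it readable.
toy_margins = [0, 0, 0, 3, 3, 3, 0, 0, 0]
print(high_scoring_windows(toy_margins, width=3, percentile=50))
```

With a high percentile and only a few windows, the threshold equals the largest average and no window lies strictly above it, so a sketch like this is only informative on long profiles.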
Create RNA sequence and structure motifs as described in the "GraphProt" paper. Motifs are written to files "GraphProt.sequence_motif.png" and "GraphProt.structure_motif.png".
To create motifs as done in the paper, this should be run using a trained model and the bound training sequences from the CLIP experiment the model was trained on, e.g.
GraphProt.pl --action motif --model CLIP.model --fasta CLIP_bound.fa
In addition to the integrated usage via GraphProt.pl, individual tasks such as creation of RNA structure graphs or calculation of features can be accomplished using the following tools:
- fasta2shrep_gspan.pl: graph creation
- EDeN/EDeN: NSPD kernel and SGD support vector machine
Usage information for these tools can be obtained by specifying the "-h" option.
RNA sequence and structure graphs are created using fasta2shrep_gspan.pl. Structure graphs are created using the following parameters. The user has to choose an appropriate RNAshapes ABSTRACTION_LEVEL.
fasta2shrep_gspan.pl --seq-graph-t --seq-graph-alph -abstr -stdout -M 3 -wins '150,' -shift '25' -fasta PTBv1.train.fa -t __ABSTRACTION_LEVEL__ | gzip > PTBv1.train.gspan.gz
RNA sequence graphs are created using the following parameters:
fasta2shrep_gspan.pl --seq-graph-t -nostr -stdout -fasta PTBv1.train.fa | gzip > PTBv1.train.gspan.gz
For example, 10-fold crossvalidation using EDeN is done via:
EDeN/EDeN -a CROSS_VALIDATION -c 10 -i PTBv1.train.gspan.gz -t PTBv1.train.class -g DIRECTED -b __BIT_SIZE__ -r __RADIUS__ -d __DISTANCE__ -e __EPOCHS__ -l __LAMBDA__
The placeholders BIT_SIZE, RADIUS, DISTANCE, EPOCHS and LAMBDA have to be replaced with appropriate parameter values.