-
Notifications
You must be signed in to change notification settings - Fork 9
Tutorial: Plot results of netDx predictor
Following a complete predictor run, you will want to evaluate the performance of the predictor and examine the nature of feature-selected variables.
Need more detail? Look at sections below.
# requires r2cytoscape, JSONIO, Cytoscape, httr, EasyCyRest installed
suppressWarnings(suppressMessages(require(netDx)))
suppressWarnings(suppressMessages(require(netDx.examples)))
phenoFile <- sprintf("%s/extdata/KIRC_pheno.rda",path.package("netDx.examples"))
lnames <- load(phenoFile)
# group genes into pathways
pathFile <- sprintf("%s/extdata/Human_160124_AllPathways.gmt",
path.package("netDx.examples"))
pathwayList <- readPathways(pathFile)
xpr_genes <- sprintf("%s/extdata/EMap_input/genenames.txt",
path.package("netDx.examples"))
xpr_genes <- read.delim(xpr_genes,h=FALSE,as.is=TRUE)[,1]
pathwayList <- lapply(pathwayList, function(x) x[which(x %in% xpr_genes)])
# plot results
inDir <- sprintf("%s/extdata/KIRC_output",
path.package("netDx.examples"))
out <- plotAllResults(pheno, inDir,outDir=sprintf("%s/plots",getwd()),
fsCutoff=10,fsPctPass=0.7,pathwaySet=pathwayList)
# examine results
dir(sprintf("%s/plots",getwd()),recursive=TRUE)
For this example to work, you will require the following components to be installed. You may have already installed these when you first setup netDx but if not, make sure these are set up now:
- Cytoscape must be installed and running.
- Cytoscape apps EnrichmentMap (v3.0.0+) and AutoAnnotate (v1.2+) must be installed.
- R packages:
httr
,RJSONIO
- R packages
r2cytoscape
andEasycyRest
.
Note: If Cytoscape is not running, this example will not work! Let us see if the required dependencies can be installed and/or loaded:
#httr
tryCatch(expr = { library(httr)},
error = function(e) { install.packages("httr")}, finally = library(httr))
#RJSONIO
tryCatch(expr = { library(RJSONIO)},
error = function(e) { install.packages("RJSONIO")}, finally = library(RJSONIO))
#r2cytoscape
tryCatch(expr = { library(r2cytoscape)},
error = function(e) { devtools::install_github('cytoscape/r2cytoscape')}, finally = library(r2cytoscape))
## Loading required package: XML
# EasycyRest
tryCatch(expr = { library(EasycyRest); detach(package:EascyRest,unload=TRUE)},
error = function(e) { devtools::install_github('BaderLab/Easycyrest/[email protected]')}, finally = {})
##
## Attaching package: 'EasycyRest'
## The following objects are masked from 'package:r2cytoscape':
##
## createNetwork, createStyle
## Skipping install of 'EasycyRest' from a github remote, the SHA1 (3f474b6e) has not changed since last install.
## Use `force = TRUE` to force installation
Now let us test if Cytoscape is open:
# Basic settings
version.url <-"http://localhost:1234/v1/version"
cytoscape.open <- TRUE
tryCatch(expr = { GET(version.url)},
error = function(e) { return (cytoscape.open = FALSE)}, finally =function(r){ return(cytoscape.open = TRUE)})
## Response [http://localhost:1234/v1/version]
## Date: 2017-09-13 15:53
## Status: 200
## Content-Type: application/json
## Size: 46 B
if(!cytoscape.open){
#try and launch cytoscape
stop("Cytoscape is not open. Please launch cytoscape.")
} else{
cytoscape.version <- GET(version.url)
cy.version <- fromJSON(rawToChar(cytoscape.version$content))
print(cy.version)
}
## apiVersion cytoscapeVersion
## "v1" "3.5.1"
For this example, the predictor has been run using a nested cross-validation design. In nested cross-validation, cross-validation is performed repeatedly over multiple random splits of the data into train and blind test partitions. Feature selected networks are those that consistently score highly across the multiple splits. Conceptually, this is what the higher-level logic looks like for a nested cross-validation design with 10-fold CV in the inner loop, and 100 splits in the outer loop: (Note: these aren’t real function calls; this block just serves to illustrate the concept of the nested CV design for our purposes)
outerLoop <- 100 # num times to split data into train/blind test samples
innerLoop <- 10 # num folds for cross-validation, also max score for a network
netScores <- list() # collect <outerLoop> set of netScores
perf <- list() # collect <outerLoop> set of test evaluations
for k in 1:outerLoop
[train, test] <- splitData(80:20) # split data using RNG seed
netScores[[k]] <- runCV(train)
perf[[k]] <- collectPerformance(netScores[[k]], test)
end
netDx expects a nested directory structure with the predictor results. The top level should contain one directory for each train/test split. Within each of these directories are the predictor results for the corresponding cross-validation. Here is the directory structure for a dataset with rootDirectory dataset_yymmdd
.
dataset_yymmdd/
+ rng1/
+ tmp/ # directory created by netDx, containing input data for GeneMANIA database
+ networks/ # PSN created by calls to makePSN_NamedMatrix()
+-- Class1
+ tmp/
+ networks/ # networks for test classification for this split
+ GM_results/ # results of inner loop (10-fold CV)
+ Class1_pathway_CV_score.txt # network scores for inner CV fold
+--- CV_1.query # query for CV fold
+--- CV_1.query-results.report.txt.NRANK # network weights for CV fold
...
+--- CV_10.query
+--- CV_10.query-results.report.txt.NRANK
+-- Class2
+ predictionResults.txt # test predictions for this train/test split
+ rng2/
+ rng3/
+ rng4/
...
+ rng100/
suppressMessages(require(netDx))
suppressMessages(require(netDx.examples))
In this example, we use data from The Cancer Genome Atlas (http://cancergenome.nih.gov/), downloaded from the PanCancer Survival project (https://www.synapse.org/#!Synapse:syn1710282). We use gene expression profiles from renal clear cell carcinoma tumours to predict poor and good survival after Yuan et al. (2014) (Refs 1-2). The data consists of 150 tumours. Here we work only with the gene expression profiles generated.
phenoFile <- sprintf("%s/extdata/KIRC_pheno.rda",path.package("netDx.examples"))
lnames <- load(phenoFile)
head(pheno)
## ID age grade stage STATUS_INT STATUS
## 1 TCGA-AK-3428 62 G2 Stage III 1 SURVIVEYES
## 2 TCGA-AK-3434 72 G2 Stage I 1 SURVIVEYES
## 3 TCGA-B0-4688 46 G4 Stage IV 0 SURVIVENO
## 4 TCGA-B0-4690 65 G3 Stage IV 0 SURVIVENO
## 5 TCGA-B0-4691 55 G3 Stage IV 0 SURVIVENO
## 6 TCGA-B0-4693 72 G4 Stage III 0 SURVIVENO
Create a directory to store output in:
outDir <- paste(getwd(),"plots",sep="/")
if (!file.exists(outDir)) dir.create(outDir)
setwd(outDir)
Now compile the paths to output files. First get the list of all directories corresponding to the outer loops:
inDir <- sprintf("%s/extdata/KIRC_output",
path.package("netDx.examples"))
all_rngs <- list.dirs(inDir, recursive = FALSE)
print(head(basename(all_rngs)))
## [1] "rng1" "rng10" "rng100" "rng11" "rng12" "rng13"
Each rngX/
directory contains the results of a particular train/test split. The two classes in question are SURVIVEYES
(good survival) and SURVIVENO
(poor survival).
dir(all_rngs[1])
## [1] "predictionResults.txt" "SURVIVENO" "SURVIVEYES"
First we evaluate the average performance of the predictor over the 100 train/blind test splits. Performance is defined as the Area under the Receiver Operator Characteristic curve (AUROC) or Area under the Precision-Recall curve (AUPR).
First compile prediction files for all the 100 splits in the example data. These paths are stored in predFiles
. Then call the performance-plotting function, which is plotPerf()
.
predClasses <- c("SURVIVEYES","SURVIVENO")
predFiles <- unlist(lapply(all_rngs, function(x)
paste(x, "predictionResults.txt", sep = "/")))
predPerf <- plotPerf(inDir, predClasses=predClasses)
## Single directory provided, retrieving prediction files
Feature-level scores are one of the outputs of feature-selection. These indicate the predictive strength of a feature, so that networks with higher scores indicate those with greater predictive power for the class in question. Note that in netDx, feature selection occurs once per patient class (i.e. here, once for SURVIVEYES
and once for SURVIVENO
), so each feature gets a predictive score for each class.
This nested design has an outer loop of 100 splits with 10-fold cross-validation in the inner loop. Each network can therefore score between 0 and 10 (max number of cross-validation folds) for a given split. For the 100 splits, feature scores can be viewed as a matrix X which is N-by-100, where N is the number of networks. X[i,j] is the score for network i for split j.
**Note: **In practice, we do not include networks that never score >0, so the number of rows is < N.
featScores <- getFeatureScores(inDir,predClasses=c("SURVIVEYES","SURVIVENO"))
## SURVIVEYES
## Single directory provided, retrieving CV score files
## Got 100 iterations
## * Computing consensus
## SURVIVENO
## Single directory provided, retrieving CV score files
## Got 100 iterations
## * Computing consensus
The size of featScores
should correspond to number of networks that score >0 at least once (rows) and the number of train/test splits (here, 100):
dim(featScores[[1]])
## [1] 25 101
The first column shows feature names:
head(featScores[[1]][,1:10])
## PATHWAY_NAME rng1 rng10 rng100
## 30 ACTIVATION_OF_THE_PRE-REPLICATIVE_COMPLEX.profile 4 10 10
## 64 ANDROGEN_BIOSYNTHESIS.profile 9 9 4
## 237 BIOCARTA_STEM_PATHWAY.profile 10 10 8
## 264 CALNEXIN_CALRETICULIN_CYCLE.profile 10 10 6
## 355 DEFECTS_IN_COBALAMIN__B12__METABOLISM.profile 10 9 10
## 356 DEFECTS_IN_VITAMIN_AND_COFACTOR_METABOLISM.profile 9 9 10
## rng11 rng12 rng13 rng14 rng15 rng16
## 30 8 9 6 6 10 8
## 64 8 5 6 4 9 9
## 237 10 9 7 10 9 4
## 264 7 7 10 9 8 7
## 355 9 10 8 10 8 10
## 356 9 10 9 10 8 9
Let us define feature-selected networks as those that score 10 out of 10 in at least 70% of the train/test splits.
featSelNet <- lapply(featScores, function(x) {
callFeatSel(x, fsCutoff=10, fsPctPass=0.7)
})
Let’s take a look at the top networks for each class:
tmp <- lapply(featSelNet,print)
## [1] "REACTIONS_SPECIFIC_TO_THE_COMPLEX_N-GLYCAN_SYNTHESIS_PATHWAY.profile"
## [2] "THYROXINE_BIOSYNTHESIS.profile"
## [1] "ABACAVIR_TRANSPORT_AND_METABOLISM.profile"
## [2] "BILE_SALT_AND_ORGANIC_ANION_SLC_TRANSPORTERS.profile"
## [3] "METABOLISM_OF_WATER-SOLUBLE_VITAMINS_AND_COFACTORS.profile"
## [4] "PLATELET_ADHESION_TO_EXPOSED_COLLAGEN.profile"
## [5] "REGULATION_OF_IFNA_SIGNALING.profile"
## [6] "THYROXINE_BIOSYNTHESIS.profile"
An EnrichmentMap is a network-based visualization for a group of overlapping sets (Ref 3). Here, we use this visualization to examine which pathways and functional themes are feature-selected for each patient class. So, this is a network where nodes are features (here, pathways), and edges indicate shared members between the features (here, genes common to the pathway).
By clustering and annotating the EnrichmentMap using the AutoAnnotate app (Ref 4), we can identify functional themes that are common among high-scoring features.
Get the set of valid gene sets:
pathFile <- sprintf("%s/extdata/Human_160124_AllPathways.gmt",
path.package("netDx.examples"))
pathwayList <- readPathways(pathFile)
## ---------------------------------------
## File: Human_160124_AllPathways.gmt
##
## Read 2760 pathways in total, internal list has 2712 entries
## FILTER: sets with num genes in [10, 500]
## => 911 pathways excluded
## => 1801 left
Filter for the genes measured in this dataset:
xpr_genes <- sprintf("%s/extdata/EMap_input/genenames.txt",
path.package("netDx.examples"))
xpr_genes <- read.delim(xpr_genes,h=FALSE,as.is=TRUE)[,1]
head(xpr_genes)
## [1] "ZNF121" "OR2J3" "HMOX1" "SYT4" "GPR137C" "AKAP12"
Filter:
pathwayList <- lapply(pathwayList, function(x) x[which(x %in% xpr_genes)])
To create an enrichment map in netDx, you need two files:
- A .gmt file: A file with top-scoring genesets in the GMT format, similar to the example pathway file.
- Node attribute table: A table containing the names of features and the maximum score each achieves across cross-validation. Here we generate the above two files for each patient class. We have two sets of files because each class has its own set of predictive features and therefore, its own enrichment map (EM).
In addition to objects we have seen before, this step requires a table indicating what type of data each network represents. This may be used to assign visual features (e.g. node shapes) to distinguish different data sources in the EM.
netInfoFile <- sprintf("%s/extdata/KIRC_output/inputNets.txt",
path.package("netDx.examples"))
netInfo <- read.delim(netInfoFile,sep="\t",h=FALSE,as.is=TRUE)
head(netInfo)
## V1
## 1 rna
## 2 rna
## 3 rna
## 4 rna
## 5 rna
## 6 rna
## V2
## 1 GUANOSINE_NUCLEOTIDES__I_DE_NOVO__I__BIOSYNTHESIS
## 2 RETINOL_BIOSYNTHESIS
## 3 MUCIN_CORE_1_AND_CORE_2__I_O__I_-GLYCOSYLATION
## 4 SUPERPATHWAY_OF_D-_I_MYO__I_-INOSITOL__1,4,5_-TRISPHOSPHATE_METABOLISM
## 5 D-_I_MYO__I_-INOSITOL__1,4,5_-TRISPHOSPHATE_DEGRADATION
## 6 MRNA_CAPPING
Create the EnrichmentMap input:
EMap_input <- writeEMapInput_many(featScores,pathwayList,
netInfo,outDir=outDir)
Finally, plot the EnrichmentMap:
pngFiles <- list()
for (curGroup in names(EMap_input)[1:2]) {
pngFiles[[curGroup]] <- plotEmap(gmtFile=EMap_input[[curGroup]][1],
nodeAttrFile=EMap_input[[curGroup]][2],
netName=curGroup,outDir=outDir)
}
## apiVersion cytoscapeVersion
## "v1" "3.5.1"
## * Applying AutoAnnotate
## * Importing node attributes
## * Creating or applying style
## * Final cleanup
## [1] "http://localhost:1234/v1/commands/view/export?OutputFile=/Users/ahmadshah/netDx/examples/plots/EnrichmentMap_SURVIVEYES.png"
## apiVersion cytoscapeVersion
## "v1" "3.5.1"
## * Applying AutoAnnotate
## * Importing node attributes
## * Creating or applying style
## * Final cleanup
## [1] "http://localhost:1234/v1/commands/view/export?OutputFile=/Users/ahmadshah/netDx/examples/plots/EnrichmentMap_SURVIVENO.png"
The enrichment map showing the top-scoring networks for SURVIVEYES
should now be active in Cytoscape. It should look like this:
richmentMap for the SURVIVEYES class
Nodes are coloured by the highest consistent score the particular feature achieved; consistency is defined as having achieved the score for >70% of the splits.
Note: Node labels for individual features have been turned off to remove visual clutter. These can be enabled interactively in Cytoscape.
There should also be a second EnrichmentMap for the SURVIVENO
class:
An integrated patient network is constructed by aggregating feature-selected networks for all classes. It represents the overall view of patient similarity that results from identifying predictive networks through the netDx predictor.
For the visualization we actually prefer to magnify dissimilarity so that similar patients are closer (i.e. have smaller edge weight) and dissimilar patients are farther apart (i.e. have larger edge weight). So the network generated in this step is really a dissimilarity network. The steps for computing this network are:
- Concatenate all feature-selected networks
- For all patient pairs, collapse edges by taking the mean similarity.
- Convert to dissimilarity (
1-mean_similarity
). - For the Cytoscape visualization, keep a certain fraction of the top edges (e.g. top 20% disimilarities).
The function
plotIntegratedPSN()
runs these computations. It therefore needs access to the folder that contains all the feature-selected networks. Here we providerng1/
for that purpose using thebaseDir
argument.
netInfo <- plotIntegratedPSN(pheno=pheno,baseDir=sprintf("%s/rng1",inDir),
netNames=featSelNet,outDir=outDir)
## 2 classes: { SURVIVEYES,SURVIVENO }
## * Style exists, not creating
## Group SURVIVEYES
## Group SURVIVENO
## * Computing aggregate net
##
## Writing aggregate PSN
## Loading required package: reshape2
##
## 12234 pairs have no edges (counts directed edges)
## Sparsity = 10266/11175 (92 %)
## * Creating network in Cytoscape
## * Create network URL
## Network ID is : 65311
## * Applying layout
## * Applying style
## * Exporting to PNG
The Cytoscape session should now show a network that looks like this:
The returned netInfo
object contains several pieces of information related to the integrated network shown above, including the path to the full similarity network (aggPSN_FULL
), the pruned dissimilarity network (aggPDN_pruned
), and information about the resulting view in Cytoscape (network_suid
and netView
).
summary(netInfo)
## Length Class Mode
## aggPSN_FULL 1 -none- character
## aggPDN_pruned 1 -none- character
## incNets 7 -none- character
## network_suid 1 -none- numeric
## netView 1 -none- character
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
##
## locale:
## [1] C/C/C/C/C/en_CA.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] reshape2_1.4.2 netDx.examples_0.1 netDx_0.94
## [4] RColorBrewer_1.1-2 pracma_2.0.7 ROCR_1.0-7
## [7] gplots_3.0.1 GenomicRanges_1.26.4 GenomeInfoDb_1.10.3
## [10] IRanges_2.8.2 S4Vectors_0.12.2 BiocGenerics_0.20.0
## [13] combinat_0.0-8 doParallel_1.0.10 iterators_1.0.8
## [16] foreach_1.4.3 bigmemory_4.5.19 bigmemory.sri_0.1.3
## [19] EasycyRest_0.1 r2cytoscape_0.0.3 XML_3.98-1.9
## [22] RJSONIO_1.3-0 httr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] gtools_3.5.0 colorspace_1.3-2 htmltools_0.3.6
## [4] yaml_2.1.14 rlang_0.1.1 withr_1.0.2
## [7] plyr_1.8.4 stringr_1.2.0 zlibbioc_1.20.0
## [10] munsell_0.4.3 gtable_0.2.0 devtools_1.13.2
## [13] caTools_1.17.1 codetools_0.2-15 memoise_1.1.0
## [16] evaluate_0.10 knitr_1.16 curl_2.8.1
## [19] Rcpp_0.12.11 KernSmooth_2.23-15 backports_1.1.0
## [22] scales_0.4.1 gdata_2.18.0 XVector_0.14.1
## [25] ggplot2_2.2.1 digest_0.6.12 stringi_1.1.5
## [28] rprojroot_1.2 grid_3.3.3 quadprog_1.5-5
## [31] tools_3.3.3 bitops_1.0-6 magrittr_1.5
## [34] RCurl_1.95-4.8 lazyeval_0.2.0 tibble_1.3.3
## [37] pkgconfig_2.0.1 rmarkdown_1.6 R6_2.2.2
## [40] igraph_1.1.2 git2r_0.18.0
- Yuan, Y. et al. (2014) Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol 32, 644-52.
- The Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43-9.
- Merico, D., Isserlin, R. & Bader, G.D. (2011). Visualizing gene-set enrichment results using the Cytoscape plug-in enrichment map. Methods Mol Biol 781, 257-77.
- Kucera, M., Isserlin, R., Arkhangorodsky, A. & Bader, G.D. (2016). AutoAnnotate: A Cytoscape app for summarizing networks with semantic annotations. F1000Res 5, 1717.