\section{Introduction}
With its new emphasis on collecting larger datasets, data sharing, deep phenotyping, and multimodal integration, neuroimaging has become a data intensive science. This is particularly true for connectomics, where grass-roots initiatives (e.g. the 1000 Functional Connectomes Project (FCP) \cite{Biswal2010} and the International Neuroimaging Data-sharing Initiative (INDI) \cite{Mennes2013}) and large-scale international projects (the Human Connectome Project (HCP) \cite{RosenHCP2010,VanEssen2012}, the Brainnetome \cite{Jiang2013}, the EU CONNECT project \cite{Assaf2013}, the Pediatric Imaging, Neurocognition, and Genetics (PING) study \cite{JerniganPING}, the Philadelphia Neurodevelopmental Cohort \cite{Satterthwaite2014}, the Brain Genomics Superstruct Project (GSP) \cite{BucknerGSP2014}, and the National Database for Autism Research (NDAR) \cite{NDAR}) are collecting and openly sharing thousands of brain imaging scans, each of which consists of hundreds of observations of thousands of variables. Although this deluge of complex data promises to enable the investigation of neuroscientific questions that were previously inaccessible, it is quickly overwhelming the capacity of existing tools and algorithms to extract meaningful information. This, combined with a new focus on discovery science, is creating a plethora of opportunities for data scientists from a wide range of disciplines, such as computer science, engineering, mathematics, and statistics, to make substantial contributions to neuroscience. The goal of this review is to describe the state of the art in connectomics research and to enumerate opportunities for data scientists to contribute to the field.
The human connectome is a comprehensive map of the brain's circuitry, which consists of brain areas, their structural connections, and their functional interactions. The connectome can be measured with a variety of different imaging techniques, but magnetic resonance imaging (MRI) is the most common, primarily due to its near-ubiquity, non-invasiveness, and high spatial resolution \cite{Craddock2013}. As measured with MRI, brain areas are patches of cortex (approximately 1\si{\centi\meter\cubed} in volume) containing thousands of neurons \cite{Varela2001}; structural connections are long-range fiber tracts that are inferred from the motion of water molecules measured by diffusion-weighted MRI (dMRI); and functional interactions are inferred from synchronized brain activity measured by functional MRI (fMRI) \cite{Behrens2012}. Addressing the current state of the art for both functional and structural connectivity is well beyond the scope of a single review. Instead, this review will focus on functional connectivity, which is particularly fast growing and offers many exciting opportunities for data scientists.
The advent of functional connectivity analyses has popularized the application of discovery science to brain function, marking a shift in emphasis from hypothesis testing to supervised and unsupervised methods for learning statistical relationships from the data \cite{Biswal2010}. Since functional connectivity is inferred from statistical dependencies between physiological measures of brain activity (i.e. correlations between the dependent variables), it can be measured without an experimental manipulation. Thus, functional connectivity is most commonly measured from ``resting state'' fMRI scans, during which the study participant is lying quietly and not performing an experimenter-specified task; when measured in this way, it is referred to as intrinsic functional connectivity (iFC) \cite{Biswal1995}. Once iFC is measured, data mining techniques can be applied to identify iFC patterns that covary with phenotypes such as indices of cognitive abilities, personality traits, or disease state, severity, and prognosis \cite{Varoquaux2013}. In a time dominated by skepticism about the ecological validity of psychiatric diagnoses \cite{Kapur2012}, iFC analyses have become particularly important for identifying subgroups within patient populations by similarity in brain architecture, rather than similarity in symptom profiles. This new emphasis on discovery necessitates a new breed of data analysis tools equipped to deal with the issues inherent to functional neuroimaging data.
\section{The Connectome Analysis Paradigm}
In 2005, Sporns and Hagmann \cite{Sporns2005,Hagmann2005} independently coined the term \textit{the human connectome}, which embodies the notion that the set of all connections within the human brain can be represented and understood as a graph. In the context of iFC, graphs provide a mathematical representation of the functional interactions between brain areas: nodes in the graph represent brain areas and edges indicate their functional connectivity. While general graphs can have multiple edges between two nodes, brain graphs tend to be simple graphs with a single undirected edge between each pair of nodes (i.e. the direction of influence between nodes is unknown). Additionally, edges in graphs of brain function tend to be weighted, i.e. annotated with a value that indicates the similarity between nodes. Analyzing functional connectivity involves 1) preprocessing the data to remove confounding variation and to make it comparable across datasets, 2) specifying the brain areas to be used as nodes, 3) identifying edges from the iFC between nodes, and 4) analyzing the graph (i.e. its structure and edges) to identify relationships with inter- or intra-individual variability. All of these steps have been well covered by other reviews \cite{Craddock2013,Kelly2012,Varoquaux2013}, and repeating that information provides little value. Instead, we will focus on exciting areas of the functional connectomics literature that we believe provide the greatest opportunities for data scientists in this quickly advancing field.
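This graph representation can be sketched concretely. The toy example below (all variable names and the simulated time courses are illustrative, not from any of the cited datasets) stores a weighted, undirected simple graph as a symmetric adjacency matrix:

```python
import numpy as np

# Illustrative sketch: a functional brain graph as a symmetric adjacency
# matrix. Nodes are brain areas; edge weights are statistical dependencies
# (here, Pearson correlations) between simulated node time courses.
rng = np.random.default_rng(0)
n_nodes, n_timepoints = 4, 200
time_courses = rng.standard_normal((n_nodes, n_timepoints))

adjacency = np.corrcoef(time_courses)  # weighted edges
np.fill_diagonal(adjacency, 0.0)       # simple graph: no self-loops
```

Because the direction of influence between nodes is unknown, the matrix is symmetric, with a single undirected edge per pair of nodes.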
\subsection{Modeling functional interactions within the connectome}
Defining the nodes to use for a connectivity graph is a well described problem that has become the focus of considerable research \cite{Thirion2014}. From a neuroscientific perspective, there is meaningful spatial variation in brain function at resolutions much finer than what can be measured using modern non-invasive neuroimaging techniques. But connectivity graphs generated at the spatial resolution of these techniques are unwieldy, and there is insufficient fine-grained information about brain function to interpret connectivity results at that level. For this reason, the number of nodes in the connectome is commonly reduced by combining voxels into larger brain areas for analysis. This is accomplished using boundaries derived from anatomical landmarks \cite{Desikan2006,AAL2002}, regions of homogeneous cyto-architecture as determined by post-mortem studies \cite{Eickhoff2008}, or clusters determined by applying unsupervised learning methods to functional data \cite{Bellec2006,Craddock2012}. The latter approach tends to be preferred, since it is not clear that brain function respects anatomical subdivisions, and similar cells may support very different brain functions \cite{Craddock2012}. Quite a few clustering approaches have been applied to the problem of parcellating brain data into functionally homogeneous brain areas, each varying in the constraints it imposes on the clustering solution \cite{Craddock2012,Blumensath2013,Bellec2006,Thirion2006,Zalesky2010,Flandin2002,Thirion2014}. There is growing agreement in the literature that hierarchical clustering based methods perform best \cite{Blumensath2013,Thirion2014}, but no single clustering level has emerged as optimal. Instead, it appears that there is a range of suitable clustering solutions from which to choose \cite{Craddock2012,Thirion2014}.
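A minimal sketch of the hierarchical clustering step is given below, assuming simulated voxel time courses drawn from two underlying functional systems. The spatial constraints used by the real parcellation methods cited above are deliberately omitted for brevity:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Sketch: group voxel time courses into functionally homogeneous parcels
# with agglomerative (hierarchical) clustering. Simulated data only.
rng = np.random.default_rng(42)
n_voxels, n_timepoints = 30, 120

# Two simulated "functional systems": each voxel follows one of two signals
signals = rng.standard_normal((2, n_timepoints))
membership = np.repeat([0, 1], n_voxels // 2)
data = signals[membership] + 0.5 * rng.standard_normal((n_voxels, n_timepoints))

# 1 - correlation as a functional dissimilarity between voxels,
# passed to linkage() in condensed (upper-triangle) form
dissim = 1 - np.corrcoef(data)
Z = linkage(dissim[np.triu_indices(n_voxels, k=1)], method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the resulting dendrogram at different heights yields the range of clustering levels discussed above; here a two-cluster cut recovers the two simulated systems.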
Once the nodes of a connectivity graph have been chosen, the functional connectivity between them is estimated from statistical dependencies between their time courses of brain activity. Although a variety of bivariate and multivariate methods have been proposed for this purpose \cite{Smith2011,Varoquaux2013}, there is considerable room for new techniques that provide better estimates of the dependencies, or more information about their nature. iFC is most commonly inferred using bivariate tests for statistical dependence, typically Pearson's correlation coefficient \cite{Biswal1995}. Since these methods only consider two brain areas at a time, they cannot differentiate between direct and indirect relationships. Indirect relationships can be excluded from the graph using partial correlation or inverse covariance matrix estimation, but regularized estimators must be employed when the number of brain areas is large \cite{Ryali2012,Varoquaux2013}.
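The direct versus indirect distinction can be illustrated with a simulated three-node chain, where partial correlation (computed here from the unregularized precision matrix, which is only viable when time points far outnumber nodes) removes the indirect edge that Pearson correlation reports:

```python
import numpy as np

def partial_correlation(ts):
    """Partial correlation from the inverse covariance (precision) matrix.
    ts: (n_nodes, n_timepoints). For many nodes relative to time points, a
    regularized estimator (e.g. graphical lasso) should be used instead."""
    prec = np.linalg.inv(np.cov(ts))
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Simulated chain A -> B -> C: A and C are related only through B
rng = np.random.default_rng(1)
n = 2000
a = rng.standard_normal(n)
b = a + 0.5 * rng.standard_normal(n)
c = b + 0.5 * rng.standard_normal(n)
ts = np.vstack([a, b, c])

full = np.corrcoef(ts)                # reports a strong (indirect) A-C edge
partial = partial_correlation(ts)     # A-C vanishes once B is controlled for
```

The full correlation between $A$ and $C$ is large, while the partial correlation is near zero, which is exactly the indirect relationship the text describes.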
Tests of statistical dependence between brain regions are sufficient for determining whether or not two nodes are connected, but it should be possible to construct a more precise mathematical description of the relationship between brain areas \cite{Friston1994}. Several different modeling techniques have been proposed to this end. Confirmatory approaches such as structural equation modeling (SEM) \cite{Buchel1997} and dynamic causal modeling (DCM) \cite{Friston2003} can offer fairly detailed descriptions of node relationships, but they rely on the pre-specification of a model and are limited in the size of the network that can be modeled. Cross-validation methods have been proposed to systematically search for the best model \cite{Zhuang2005,Penny2010,James2009}, but simulations have shown that these methods do not necessarily converge to the correct model \cite{Lohmann2012}. Granger causality is another exploratory, data-driven modeling technique that has been particularly popular due to its promise of identifying causal relationships between nodes based on temporal lags between them \cite{Deshpande2011}. But the assumptions underlying Granger causality do not quite fit fMRI data \cite{Smith2011}, where delays in the time courses between regions may reflect physiological phenomena, such as a perfusion deficit \cite{Lv2013}, rather than causal relationships between brain areas. Alternatively, brain connectivity can be inferred from a multivariate regression that is solved using either dimensionality reduction \cite{Friston1994} or regularization \cite{Craddock2013b}. These more precise mathematical models of connectivity have shown great promise for testing hypotheses of brain organization \cite{Craddock2013b}, predicting response to rehabilitation after stroke \cite{James2009b}, and serving as biomarkers of disease \cite{Brodersen2011}.
Functional interactions within the connectome are commonly considered to be static over the course of an imaging experiment, but a growing body of research has demonstrated that connectivity between brain regions changes dynamically over time \cite{Hutchison2013}. Most studies have measured connectivity within a short window of the fMRI time course that is moved forward along time \cite{Keilholz2013,Chang2010,Yang2014,Allen2014}, although other methods have been employed with similar results \cite{Majeed2011,Smith2012}. Several problems must be overcome in order to reliably measure changing functional connectivity patterns from the inherently slow and poorly sampled fMRI signal. First, the variance of correlation estimates increases with decreasing window size, meaning that unless proper statistical controls are utilized, the observed dynamics may arise solely from this increased variance \cite{Handwerker2012}. This issue may be mitigated by newer high-speed imaging methods, which have already shown promise for extracting dynamic network modes using temporal ICA, although very large numbers of observations are still necessary \cite{Smith2012}. Node definition is another issue, as it is unclear whether brain areas defined from static iFC are appropriate for dynamic iFC, although initial work has shown that parcellations of at least some brain regions from dynamic iFC are consistent with those found with static iFC \cite{Yang2014}.
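The sliding-window approach can be sketched as follows, using two simulated time courses that are coupled only in the first half of the scan (window and step sizes are illustrative; real studies choose them relative to the repetition time):

```python
import numpy as np

def sliding_window_fc(x, y, window, step=1):
    """Correlation between two time courses within a moving window.
    Short windows inflate the variance of the estimate, so observed
    fluctuations must be compared against a proper null model."""
    starts = range(0, len(x) - window + 1, step)
    return np.array([np.corrcoef(x[s:s + window], y[s:s + window])[0, 1]
                     for s in starts])

rng = np.random.default_rng(7)
n = 400
noise = rng.standard_normal((2, n))
shared = np.r_[rng.standard_normal(200), np.zeros(200)]  # coupling ends mid-scan
x = noise[0] + shared
y = noise[1] + shared

r = sliding_window_fc(x, y, window=60, step=10)  # one estimate per window
```

Early windows show strong coupling while late windows hover near zero, reproducing in miniature the connectivity dynamics the cited studies report.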
\subsection{Comparing brain graphs}
The ultimate goal of connectomics is to map the brain's functional architecture and to annotate that architecture with the cognitive or behavioral functions it subtends. The latter pursuit is achieved by group level analyses in which variations in the connectome are mapped to inter-individual variations in phenotype \cite{Kelly2012}, clinical diagnosis \cite{Castellanos2013}, or intra-individual responses to experimental perturbations \cite{Shirer2012}. Several different analyses have been proposed for accomplishing these goals, and they all require some mechanism for comparing brain graphs \cite{Varoquaux2013}.
Approaches to comparing brain graphs can be differentiated by how they treat the statistical relationships between edges. One approach, referred to as \emph{bag of edges}, is to treat each edge in the brain graph as a sample from some random variable. Thus, a set of $N$ brain graphs, each with $M$ edges, will have $N$ observations for each of the $M$ random variables. In this case, the adjacency (or similarity) matrix that describes each brain graph can be flattened into a vector representation, and any of the well explored similarity or dissimilarity metrics can be applied to the data \cite{Craddock2013}. One benefit of this representation is the ability to treat each edge as independent of all other edges and to compare graphs using mass univariate analysis, in which a separate univariate statistical test (e.g. t-test, ANOVA, or ANCOVA) is performed at each edge. This results in a very large number of comparisons, and an appropriate correction for multiple comparisons, such as the Network Based Statistic \cite{Zalesky2012}, Spatial Pairwise Clustering \cite{Zalesky2012}, Statistical Parametric Networks \cite{Ginestat2011}, or group-wise false discovery rate \cite{Benjamini2001}, must be employed to control the number of false positives. Alternatively, the interdependencies between edges can be modeled at the node level using multivariate distance multiple regression (MDMR) \cite{Shehzad2014}, or across all edges using machine learning methods \cite{Craddock2009, Dosenbach2010, Richiardi2011}.
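To make the bag-of-edges analysis concrete, the sketch below flattens simulated connectomes into edge vectors, runs one t-test per edge, and corrects across edges. A plain Benjamini-Hochberg procedure stands in for the more specialized network-aware corrections cited above:

```python
import numpy as np
from scipy import stats

# "Bag of edges": flatten each subject's symmetric connectivity matrix into
# a vector of upper-triangle edges, then run a mass univariate analysis.
rng = np.random.default_rng(3)
n_nodes, n_subj = 10, 40
iu = np.triu_indices(n_nodes, k=1)          # indices of the M unique edges

def random_connectomes(n):
    mats = np.tanh(0.2 * rng.standard_normal((n, n_nodes, n_nodes)))
    return np.array([m[iu] for m in mats])  # shape: (subjects, edges)

group_a = random_connectomes(n_subj)
group_b = random_connectomes(n_subj)
group_b[:, 0] += 0.5                        # plant one truly different edge

t, p = stats.ttest_ind(group_a, group_b, axis=0)  # one test per edge

# Benjamini-Hochberg step-up FDR correction across all edges
m = len(p)
order = np.argsort(p)
passed = p[order] <= 0.05 * np.arange(1, m + 1) / m
significant = np.zeros(m, dtype=bool)
if passed.any():
    k = np.nonzero(passed)[0].max()
    significant[order[:k + 1]] = True
```

The planted edge survives correction while the null edges are (with high probability) rejected, illustrating why correction is indispensable at this number of comparisons.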
Despite the successful application of this technique, a drawback of representing a brain graph as a bag of edges is that it discards all information about the structure of the graph. Alternative methods such as Frequent Subgraph Mining (FSM) rely on graph structure to discover features that better discriminate between different groups of graphs \cite{Thoma2010}. For instance, Bogdanov et al. \cite{Bogdanov2014} were able to identify functional connectivity subgraphs with high predictive power for high versus low learners of motor tasks. A recent comprehensive review \cite{Richiardi2013} outlines other approaches that take graph structure into account, e.g. the graph edit distance and a number of different graph kernels. All of these methods are under active development and have not yet been widely adopted by the connectomics community.
Another approach to graph similarity using all the vertices involves computing a set of \emph{graph invariants}, such as node centrality, modularity, and global efficiency, and using the values of these measures to represent the graph \cite{Rubinov2010,Bullmore2011}. Depending on the invariant used, this approach may permit the direct comparison of graphs that are not aligned. Another advantage is that invariants substantially reduce the dimensionality of the graph comparison problem. On the other hand, representing the graph by its computed invariants discards information about the graph's vertex labels \cite{Vogelstein2013}. Moreover, after computing these invariants it is often unclear how they can be interpreted biologically. It is important that the invariant used matches the relationships represented by the graph. Since edges in functional brain graphs represent statistical dependencies between nodes and not anatomical connections, many of the path based invariants do not make sense, as indirect relationships are not interpretable \cite{Rubinov2010}. For example, the relationships $A \leftrightarrow B$ and $B \leftrightarrow C$ do not imply that there is a path between nodes $A$ and $C$; if a statistical relationship between $A$ and $C$ existed, they would be connected directly.
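As a minimal example of an invariant that does remain interpretable for functional graphs, node strength (weighted degree centrality) simply sums each node's edge weights without invoking paths. The adjacency matrix below is illustrative:

```python
import numpy as np

# Node strength: a path-free graph invariant that stays interpretable
# when edges are statistical dependencies. Matrix values are illustrative.
adjacency = np.array([[0.0, 0.8, 0.1],
                      [0.8, 0.0, 0.6],
                      [0.1, 0.6, 0.0]])

strength = adjacency.sum(axis=1)   # per-node sum of edge weights
hub = int(np.argmax(strength))     # the node participating in the strongest edges
```

Here the middle node emerges as the hub because it carries the two strongest edges; no claim about indirect paths through it is made or needed.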
\subsubsection{Prediction}
Resting state fMRI and iFC analyses are commonly applied to studying clinical disorders, and to this end the ultimate goal is the identification of biomarkers of disease state, severity, and prognosis \cite{Castellanos2013}. Predictive modeling has become a popular analysis method because it most directly addresses the question of biomarker efficacy \cite{Craddock2009, Dosenbach2010, Richiardi2013}. Additionally, the prediction framework provides a principled means for validating multivariate models that more accurately deal with the statistical dependencies between edges than mass univariate techniques, all while obviating the need to correct for multiple comparisons.
The general predictive framework involves learning a relationship between a \emph{training} set of brain graphs and a corresponding categorical or continuous variable. The brain graphs can be represented by any of the previously discussed features. The learned model is then applied to an independent \emph{testing} set of brain graphs to decode or \emph{predict} their corresponding values of the variable. These values are compared to their ``true'' values to estimate \emph{prediction accuracy}, a measure of how well the model generalizes to new data. Several different strategies can be employed to split the data into training and testing datasets, although leave-one-out cross-validation has high variance and should be avoided \cite{james2014introduction}.
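The train/test framework can be sketched with a toy nearest-centroid classifier under 5-fold cross-validation on simulated bag-of-edges features. The classifier is a deliberately simple stand-in for the SVMs and other methods used in practice:

```python
import numpy as np

# Simulated bag-of-edges features for two groups separated in every edge
rng = np.random.default_rng(5)
n_per_group, n_edges = 30, 50
X = np.vstack([rng.standard_normal((n_per_group, n_edges)) - 0.5,
               rng.standard_normal((n_per_group, n_edges)) + 0.5])
y = np.repeat([0, 1], n_per_group)

def nearest_centroid_cv(X, y, k=5):
    """k-fold cross-validation: fit on training folds, predict the held-out
    fold, and report accuracy over all held-out predictions."""
    idx = rng.permutation(len(y))
    classes = np.unique(y)
    correct = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        centroids = np.array([X[train][y[train] == c].mean(axis=0)
                              for c in classes])
        d = np.linalg.norm(X[fold][:, None, :] - centroids[None], axis=2)
        correct += (classes[d.argmin(axis=1)] == y[fold]).sum()
    return correct / len(y)

accuracy = nearest_centroid_cv(X, y)
```

Each subject is predicted exactly once by a model that never saw it during training, which is the property that makes the accuracy an estimate of generalization rather than of fit.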
A variety of machine learning algorithms have been applied to analyzing brain graphs in this manner, but by far the most commonly employed has been the support vector machine \cite{Castellanos2013}. Although these methods offer excellent prediction accuracy, they are often black boxes, for which the information used to make the predictions is not easily discernible. The extraction of neuroscientifically meaningful information from the learned model can be achieved by employing sparse methods \cite{Ryali2010} and feature selection methods \cite{Craddock2009} to reduce the input variables to only those that are essential for prediction \cite{Varoquaux2013}. There is still considerable work to be done in improving the extraction of information from these models, in developing techniques that permit multiple labels to be considered jointly, and in developing kernels for measuring distances between graphs.
A few common analytical and experimental details limit the utility of putative biomarkers learned through predictive modeling analyses. Generalization ability is most commonly used to measure the quality of predictive models, but since this measure does not consider the prevalence of the disorder in the population, it does not provide an accurate picture of how well a clinical diagnostic based on the model would perform. Instead, it is important to estimate positive and negative predictive values \cite{Grimes2002,Altman1994} using disease prevalence information from resources such as the Centers for Disease Control and Prevention Morbidity and Mortality Weekly Reports \cite{CDCMMWR}. Also, the majority of neuroimaging studies are designed to differentiate between an ultra-healthy cohort and a single severely-ill population, which further undermines estimates of specificity. It is also important to validate a biomarker's ability to differentiate between several different disease populations, an understudied area of connectomics research \cite{Kapur2012}.
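The prevalence effect is easy to compute directly from the standard definitions of positive and negative predictive value (the 90\%/90\% operating point below is illustrative):

```python
# Why generalization accuracy alone misleads: positive and negative
# predictive values depend on disease prevalence, not just on the
# classifier's sensitivity and specificity.
def predictive_values(sensitivity, specificity, prevalence):
    tp = sensitivity * prevalence              # true positive rate in population
    fp = (1 - specificity) * (1 - prevalence)  # false positive rate in population
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)      # (PPV, NPV)

# A classifier with 90% sensitivity and specificity looks strong in a
# balanced case-control study, but at 1% population prevalence most of
# its positive calls are false alarms.
ppv, npv = predictive_values(0.90, 0.90, 0.01)
```

At 1\% prevalence the PPV falls below 10\% even though the classifier is ``90\% accurate'' in both directions, which is precisely why prevalence-aware validation matters for clinical deployment.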
Most predictive modeling based explorations of connectomes have utilized classification methods that are sensitive to noisy labels. This is particularly problematic given the growing uncertainty about the biological validity of classical categorizations of mental health disorders \cite{Kapur2012}, and it necessitates the use of methods that are robust to noisy labels \cite{Lugosi1992,Scott2013}. Many such techniques require quantifying the uncertainty of each training example's label, which can be very difficult to estimate for clinical classifications. Another approach that is being embraced by the psychiatric community is to abandon classification altogether and instead focus on dimensional measures of symptoms \cite{Insel2010}. In the context of predictive modeling, this translates into a shift in focus toward regression models, which to date have been underutilized for analyzing connectomes \cite{Castellanos2013}.
The aforementioned dissatisfaction with extant clinical categories opens up opportunities to redefine clinical populations based on their biology rather than their symptomatology. This can be accomplished by using unsupervised learning techniques to identify subpopulations of individuals based on indices of brain function and then identifying their associated phenotypes \cite{Gates2014}. As with predictive modeling, a major challenge of this approach is finding the features that are most important for defining groups. Another problem is regularizing the clustering solution to ensure that it is relevant to the phenotypes under evaluation. These issues can be addressed using semi-supervised techniques or ``multi-way'' methods that incorporate phenotypic information to guide clustering \cite{Morup2011}. Along these lines, joint- or linked-ICA methods have been used to fuse different imaging modalities \cite{Franco2008, Groves2011} as well as genetics and EEG data with imaging data \cite{Calhoun2009}.
\subsection{Model Selection}
Analyzing functional connectivity data requires choosing the preprocessing strategy for removing noise, the parcellation method and scale for defining graph nodes, the measure for defining connectivity, and the features and methods for comparing connectivity across participants, among other parameters. Several different possibilities have been proposed for each of these steps, and choosing the best analysis strategy is a critical problem for connectome researchers. The complexity of this problem is highlighted by observations that both uncorrected noise sources \cite{Birn2012, Power2012, VanDijk2012, yan2013comprehensive, satterthwaite2012impact} and denoising strategies \cite{Murphy2009, Saad2012} can introduce artifactual findings. Ideally, the choices for each of these parameters would be determined by maximizing the ability of the analysis to replicate some ground truth, but, as with most biomedical research, the ground truth is unknown. Simulations provide a useful means for comparing the performance of different algorithms and parameter settings, but are limited by the same lack of knowledge that necessitates their use. Instead, researchers are forced to rely on criteria such as prediction accuracy, reliability, and reproducibility for model selection \cite{strother2006}. Most published evaluations of different connectivity analysis strategies focus on a single optimization criterion in isolation, but doing so may result in a sub-optimal choice. For example, head motion has high test-retest reliability, as do the artifacts that it induces \cite{yan2013comprehensive}. As such, focusing solely on test-retest reliability may lead to the conclusion that motion correction should not be employed. Likewise, when learning a classifier for a hyperkinetic population, head motion induced artifacts will improve prediction accuracy \cite{satterthwaite2012improved}. Instead, several, ideally orthogonal, metrics should be combined for model selection.
In the case of motion correction, it might be useful to also include an estimate of residual head motion effects in the data \cite{Power2012, VanDijk2012, yan2013comprehensive, satterthwaite2012impact}. But failing to include measures of prediction accuracy and reproducibility in the optimization may result in a strategy that is too aggressive and removes biological signal \cite{laconte2003evaluation, strother2002quantitative}. Going forward, the development of new frameworks and metrics for determining the best algorithms for connectivity analysis will continue to be a crucial area of research.
\section{Making Data Sharing Accessible to Data Scientists}
Significant barriers exist for ``big data'' scientists who wish to engage in connectomics research. The aforementioned imaging repositories have made significant progress in assembling and openly sharing large datasets comprised of high-quality data from well-characterized populations. But before the data can be analyzed, it must be preprocessed to remove nuisance variation and to make it comparable across individuals \cite{strother2006}. Additionally, the quality of the data must be assessed to determine whether it is suitable for analysis. Both of these are daunting chores, and although several open source toolsets are available for performing these tasks, they require a significant amount of domain-specific knowledge and manpower. The Preprocessed Connectomes Project (PCP) \cite{CraddockPCP}, the Human Connectome Project (HCP) \cite{RosenHCP2010,VanEssen2012}, and others are directly addressing this challenge by sharing data in its preprocessed form. The biggest challenge faced by these preprocessing initiatives is determining the preprocessing pipeline to implement. The HCP takes advantage of the uniformity of its data collection to choose a single optimized preprocessing pipeline \cite{Glasser2013}. Favoring plurality, the PCP approaches this problem by preprocessing the data using a variety of different processing tools and strategies. After an analysis is complete, the results can be compared to previous results from other analyses to assess their validity and to assist in their interpretation. Several hand-curated and automatically generated databases of neuroimaging results exist to aid in this effort \cite{Fox2002, Yarkoni2011, Neurovault, Brainspell}.
\section{Conclusion}
Functional connectomics is a ``big data'' science. As highlighted in this review, the challenge of learning statistical relationships between very high dimensional feature spaces and noisy or underspecified labels is rapidly emerging as a rate-limiting step for this burgeoning field and its promise to transform clinical knowledge. Accelerating the pace of discovery in functional connectivity research will require attracting data scientists to develop new tools and techniques that address these challenges. It is our hope that the recent augmentation of open data-sharing initiatives with preprocessing efforts will catalyze the involvement of these researchers by reducing common barriers to entry.