\section{Introduction}
With its new emphasis on collecting larger datasets, data sharing, deep phenotyping, and multimodal integration, neuroimaging has become a data intensive science. This is particularly true for connectomics, where grass-roots initiatives (e.g. the 1000 Functional Connectomes Project (FCP) \cite{Biswal2010} and the International Neuroimaging Data-sharing Initiative (INDI) \cite{Mennes2013}) and large-scale international projects (e.g. the Human Connectome Project \cite{Sotiropoulus2013,VanEssen2012}, the Brain Connectome project from China \cite{Jiang2013}, the European CONNECT project \cite{Assaf2013}, PING, NDAR \todo{i am not sure that all of these share data}) are collecting and openly sharing thousands of brain imaging scans, each of which consists of hundreds of observations of thousands of variables. Although this deluge of complex data promises to enable the investigation of neuroscientific questions that were previously inaccessible, it is quickly overwhelming the capacity of existing tools and algorithms to extract meaningful information from the data. This, combined with a new focus on discovery science, is creating a plethora of opportunities for data scientists from a wide range of disciplines such as computer science, engineering, mathematics, and statistics to make substantial contributions to neuroscience. The goal of this review is to describe the state-of-the-art in connectomics research and enumerate opportunities for data scientists to contribute to the field.
The human connectome is a comprehensive map of the brain's circuitry, which consists of brain areas, their structural connections, and their functional interactions. The connectome can be measured with a variety of different imaging techniques, but magnetic resonance imaging (MRI) is the most common, primarily due to its near-ubiquity, non-invasiveness, and high spatial resolution. As measured by MRI: brain areas are patches of cortex (approximately 1 \si{\centi\meter\cubed} in volume) containing millions of neurons, structural connections are long-range fiber tracts that are inferred from the motion of water molecules measured by diffusion weighted MRI (dMRI), and functional interactions are inferred from synchronized brain activity measured by functional MRI (fMRI). Addressing the current state-of-the-art for both functional and structural connectivity is well beyond the scope of a single review. Instead, this review will focus on functional connectivity, which is a particularly fast growing area that offers many exciting opportunities for data scientists.
The advent of functional connectivity analyses has popularized the application of discovery science to brain function, which marks a shift in emphasis from hypothesis testing to supervised and unsupervised methods for learning statistical relationships from the data. Since functional connectivity is inferred from statistical dependencies between physiological measures of brain activity (i.e. correlations between the dependent variables), it can be measured without an experimental manipulation. Thus, functional connectivity is most commonly measured from ``resting state'' fMRI scans, during which the study participant is lying quietly and not performing an experimenter-specified task; when measured in this way, it is referred to as intrinsic functional connectivity (iFC). Once iFC is measured, data mining techniques can be applied to identify iFC patterns that covary with phenotypes, such as indices of cognitive abilities, personality traits, or disease state, severity, and prognosis, to name a few. In a time dominated by skepticism about the validity of psychiatric diagnoses, iFC analyses have become particularly important for identifying subgroups within patient populations by similarity in brain architecture, rather than similarity in symptom profiles. This new emphasis on discovery necessitates a new breed of data analysis tools that are equipped to deal with the issues inherent to functional neuroimaging data.
No matter what type of method is being applied, iFC analyses are plagued by the curse of dimensionality and an inability to validate their results against a gold standard. Functional MRI datasets typically involve hundreds or thousands of scans, each of which consists of a time series of hundreds of 3D brain volumes, which in turn contain measures from hundreds of thousands of brain locations (variables). Whether performing an analysis within a scan or across scans, the number of variables ($P$) is much larger than the number of observations ($N$). Different analyses have approached the dimensionality problem in different ways, but there is no consensus on the best algorithm for brain data. Indeed, there is no consensus about the best methods to use for any step of an iFC analysis, and this lack of agreement is due to the lack of a ground truth or gold standard for comparing different techniques. Although several simulations have been proposed, there is always some skepticism about their accuracy and comprehensiveness. In their place, cross-validation techniques have been utilized to compare methods based on some measure of reproducibility and/or generalizability.
This review is a primer on intrinsic functional connectivity analyses for interdisciplinary data scientists. It begins in Section 2 by providing a description of fMRI data and the technical issues encountered when working with the data. Section 3 describes graph theoretic approaches for analyzing iFC data. Section 4 highlights several cutting-edge iFC analytical paradigms that are being employed, along with open issues that will benefit from an increased engagement of data science practitioners. The review concludes in Section 5 with a list of several valuable open science resources that make connectomics research accessible to the larger scientific community.
\section{Intrinsic functional connectivity}
Functional MRI, in its most conventional form, is a brain imaging modality in which approximately forty 3-\si{\milli\meter} thick, $64 \times 64$ voxel (a voxel is a 3D pixel; $3\times3$ \si{\milli\meter\squared} in-plane resolution) brain slices are sequentially acquired every 2--3 seconds. The fMRI signal relies on the blood oxygenation level dependent (BOLD) contrast to derive a relative measure of brain function. The BOLD signal originates in the magnetic properties of hemoglobin, the protein in blood that binds to oxygen. When hemoglobin is not bound to oxygen (deoxygenated), its four iron atoms generate a local magnetic field gradient that dephases a region of the fMRI signal. When bound to oxygen (oxygenated), the influence of the iron on the magnetic field is blocked, and the fMRI signal is preserved. The magnitude of the fMRI signal in an image voxel is proportional to the ratio between oxygenated and deoxygenated hemoglobin in or near that voxel. Perhaps counterintuitively, when brain activity increases, so does the oxy-deoxy ratio, resulting in an increase in the magnitude of the fMRI signal \cite{fmribook}.
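To give a sense of the data volume involved, the following minimal sketch (in Python, with hypothetical but typical acquisition parameters) computes the size of a single conventional resting state scan:
\begin{verbatim}
# Minimal sketch: size of one conventional fMRI scan, assuming 40 slices of
# 64 x 64 voxels, a repetition time (TR) of 2 s, and a 6 minute scan.
# All parameter values are illustrative, not taken from a specific study.
import numpy as np

n_x, n_y, n_slices = 64, 64, 40       # spatial grid
tr = 2.0                              # repetition time in seconds
scan_duration = 6 * 60                # seconds
n_volumes = int(scan_duration / tr)   # 180 time points

data = np.zeros((n_x, n_y, n_slices, n_volumes), dtype=np.float32)
print(data.shape)                     # (64, 64, 40, 180)
print(round(data.nbytes / 1e6), "MB") # roughly 118 MB per scan
\end{verbatim}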
The changes in BOLD due to brain activity (the hemodynamic response), as measured using task fMRI, are fairly slow, peaking 1--2 seconds after neuronal activity begins and persisting 4--6 seconds after it ends \cite{fmribook}. The resting state fluctuations that underpin iFC are consistently localized to the $0.01-0.08$ \si{\hertz} band, with a peak somewhere around $0.035$ \si{\hertz} \cite{biswal,cordes}. Since the BOLD signal originates from magnetic fields generated by blood, it tends to be localized to the larger vasculature and smears out beyond an active brain region. As a consequence of these two properties of BOLD, there is temporal and spatial autocorrelation in fMRI data that must be accounted for in the analysis (e.g. when estimating statistical significance).
Although fMRI data contains several sources of noise, such as linear drifts and Rician-distributed noise from the MRI hardware \cite{Gudbjartsson1995}, it is dominated by physiological noise. The scanned participant's heartbeat generates a pulsatile motion in the brain and, consequently, fluctuations in the fMRI signal \cite{birn2012}. Changes in abdominal volume due to breathing cause variations in the magnetic field profile, which, in turn, result in global fluctuations in image intensity \cite{birn2012}. Also, changes in the rate and depth of respiration lead to changes in the brain's oxygenation level and longer-term modulations of the fMRI signal \cite{birn2012}. Perhaps worst of all is participant head motion, which not only results in the misalignment of brain regions between brain volumes, but also modulates fMRI signal intensity through partial volume and spin-history effects. Although head motion can be mitigated during the acquisition process, other sources of noise can only be removed using post-hoc corrections to the fMRI signal.
\subsection{Data preprocessing}
Several preprocessing procedures have been developed to precondition fMRI data by removing noise and other sources of systematic variation that might confound analysis results. Since iFC analyses rely on correlations within the fMRI data, all of which are confounded by the same noise sources, these systematic variations may lead to false positives or negatives. There is no consensus on the best algorithms to use for preprocessing, or the order in which they should be applied to the data, and this situation persists due to the lack of a gold standard for comparing different strategies. Several of the preprocessing steps have been evaluated in isolation \cite{Powers, VanDijk, Yan, Satterthwaite, Murphy, Birn, Chang, Saad}, but there are likely to be interdependencies between these steps that will not be fully accounted for unless the full pipeline is evaluated simultaneously (Strother EMBS). The nonparametric prediction, activation, influence, and reproducibility resampling (NPAIRS) framework \cite{Strother2002, Laconte2003}, which has recently been adapted to iFC analyses \cite{Chu2012, Craddock2013}, offers a principled method for performing these comparisons, but a lot of work remains in comparing different preprocessing strategies and developing new evaluation frameworks.
fMRI data is acquired sequentially, one slice at a time, resulting in phase delays that vary between slices; these delays can be removed by interpolating all of the data to the same time grid. Imaging volumes can be spatially co-registered to correct their misalignment due to head motion. Although low frequency scanner drift can be removed from the data by high-pass filtering ($f<0.01$ \si{\hertz}), the slow sampling rates of typical fMRI sequences ($0.33 \leq f_s \leq 0.5$ \si{\hertz}) make it impossible to filter out the higher frequency ($1 \leq f \leq 2$ \si{\hertz}) contributions of physiological noise. Instead, heart rate, respiration, and head motion effects are modeled using a regression framework, and the result is subtracted from the fMRI data \cite{lund_nvr}. For heartbeat and respiration this can be accomplished using physiological recordings of these signals \cite{Hu1999,Glover2002}, but due to the difficulties of making these recordings in the scanning environment, surrogate signals from cerebrospinal fluid and white matter \cite{fox2005}, which contain fluctuations due to physiological noise but no neuronal signal, are commonly used to model their effects. Several different methods have been proposed to account for head-motion induced fluctuations in the fMRI signal, and most include either the 6 motion parameters ($\delta_{x}[t]$, $\delta_{y}[t]$, $\delta_{z}[t]$, pitch, roll, yaw) calculated during coregistration \cite{fox2005}, or a 24-parameter model derived from these parameters (the six parameters, their squares, their 1-lagged versions, and the squares of the lagged versions) \cite{friston1996}, with some evidence supporting the 24-parameter model as best \cite{Yan2013, Satterthwaite2013}. More aggressive methods censor offending time points, either by deleting them altogether \cite{Power2012} or by modeling them in the regression framework (spike regression) \cite{gabrieli_whitefield}. Some researchers have advocated regressing out the global signal as a non-specific measure of noise \cite{fox_response}, but this method is widely criticized since it mathematically centers the distribution of iFC values, which jeopardizes the interpretation of negative correlations \cite{Murphy, Saad}.
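As an illustration of the nuisance regression step, the following minimal sketch builds the 24-parameter motion model and regresses a set of confounds out of the data. It assumes \texttt{ts} is a (time $\times$ voxels) array, \texttt{motion} is a (time $\times$ 6) array of realignment parameters, and \texttt{csf} and \texttt{wm} are surrogate physiological time courses; all names are illustrative rather than taken from any specific software package.
\begin{verbatim}
import numpy as np

def friston_24(motion):
    """24-parameter motion model: the 6 realignment parameters, their
    squares, their 1-lagged versions, and the squares of the lagged
    versions (the first lagged row is zero-padded)."""
    lagged = np.vstack([np.zeros((1, motion.shape[1])), motion[:-1]])
    return np.hstack([motion, motion ** 2, lagged, lagged ** 2])

def regress_nuisance(ts, motion, csf, wm):
    """Fit the confound model with least squares and return the residuals."""
    confounds = np.column_stack([
        np.ones(ts.shape[0]),   # intercept
        friston_24(motion),     # head motion model
        csf, wm,                # surrogate physiological signals
    ])
    beta, *_ = np.linalg.lstsq(confounds, ts, rcond=None)
    return ts - confounds @ beta
\end{verbatim}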
Preprocessing fMRI data relies on the analysis of high resolution structural MRI data that is usually acquired during the same scanning session. It is from the structural data that the images are segmented into cerebrospinal fluid, white matter, and gray matter areas of the brain. Additionally, the structural data is used to calculate a spatial transformation to a brain template to account for volumetric and morphometric differences between participants. In general, the quality of an iFC analysis depends on the quality of the structural data and its processing.
\subsection{iFC and regional measures}
A variety of different strategies have been developed for analyzing iFC that fall outside what might be considered \emph{connectome}-style analyses, but are nevertheless invaluable methods for investigating the brain's functional architecture. iFC began with seed-based correlation analyses, which identify an iFC map for a particular region of interest (seed) from the Pearson correlation of that region's mean time course with the time courses of every other voxel in the brain \cite{biswal1995}. This hypothesis-oriented approach has been complemented by unsupervised approaches such as clustering and independent component analysis (ICA) for defining \emph{intrinsic connectivity networks} (ICNs). Using ICA, researchers have identified 6--10 ICNs that are reproducible across participants \cite{beckmann,damasoix}, time \cite{zuo}, and cognitive tasks \cite{smith2009}. Several regional measures have also been proposed that provide sensitivity to different aspects of the data's structure. The fractional amplitude of low frequency fluctuations (fALFF) is a normalized measure of each voxel's power in the frequency band commonly associated with iFC (0.01--0.08 Hz) and provides a relative measure of brain activity \cite{zang}. Regional homogeneity (ReHo) measures the degree to which a voxel's 6-, 18-, or 26-voxel neighborhood is synchronized \cite{}. Local functional connectivity density (LFCD) is a complementary method that measures the size of a voxel's neighborhood, defined as the connected region of voxels that exhibit a supra-threshold correlation with the target voxel \cite{Tomasi}. Voxel-mirrored homotopic connectivity is measured from the correlation between a voxel's time course and the time course of its corresponding voxel in the contralateral hemisphere \cite{VMHC}. Another approach, which is measured from the time lag that maximizes the correlation between a voxel's time course and a reference time course such as the global signal, has been shown to be sensitive to blood flow deficits, particularly in stroke \cite{lv}. The result of each of these methods is a voxel-wise map of statistics that can be compared across participants using either voxel-by-voxel univariate statistical tests, such as t-tests or ANOVAs, or multivariate supervised learning techniques \cite{ReHoPrediction}.
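As an example of the most basic of these analyses, a minimal sketch of a seed-based correlation map is given below, assuming \texttt{ts} is a (time $\times$ voxels) array of preprocessed time series and \texttt{seed\_mask} is a boolean vector selecting the seed voxels; both names are hypothetical.
\begin{verbatim}
import numpy as np

def seed_correlation_map(ts, seed_mask):
    """Pearson correlation between the mean seed time course and every
    voxel's time course (assumes no voxel has zero variance)."""
    seed = ts[:, seed_mask].mean(axis=1)
    z_seed = (seed - seed.mean()) / seed.std()
    z_ts = (ts - ts.mean(axis=0)) / ts.std(axis=0)
    return z_ts.T @ z_seed / ts.shape[0]   # one r value per voxel
\end{verbatim}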
\section{The Connectome Analysis Paradigm}
In 2005, Sporns and Hagmann \cite{Sporns2005,Hagmann2005} independently and in parallel coined the term \textit{the human connectome}, which embodies the notion that the set of all connections within the human brain can be represented and understood as a graph. In the context of iFC, graphs provide a mathematical representation of the functional interactions between brain areas. In this representation, \emph{nodes} are brain areas that are connected by \emph{edges} indicating their functional connectivity. Any graph analysis follows these procedural steps: (1) identification of the nodes, (2) identification of the edges, i.e. the associations between nodes, and (3) analysis of the graph (i.e. its structure and edges) to identify relationships with inter- or intra-individual variability. Analysis of the human connectome adheres to these steps, as outlined in this section.
\subsection{Defining the nodes of the connectome}
The first step in constructing brain graphs from fMRI data is the specification of nodes. Single voxels might be used as nodes, but this results in computationally expensive graphs that can contain billions of edges. Due to the spatial correlation inherent in the data, neighboring voxels are sufficiently similar to be combined into larger brain areas. It is important for nodes to be functionally homogeneous, as any mixing of time courses between different functional areas will result in an inaccurate reproduction of brain connectivity \cite{smith2010}. Nodes can be circumscribed using anatomical atlases, meta-analyses of functional studies, or data-driven parcellation approaches. Anatomical atlases subdivide the brain based on anatomical landmarks \cite{AAL, HO}, such as sulcal patterns \cite{varoquaux}, or into areas containing similar cell types, as determined by microscopic visualization of post-mortem brains \cite{TT,Eickhoffzilles}. However, since the definitions of brain areas in these atlases did not incorporate functional information, they are unlikely to be functionally homogeneous. Another approach is to define regions from the overlap of activations found in other functional neuroimaging studies \cite{Dosenbach}, which works well as long as the regions are relatively similar in size and are non-overlapping \cite{VaroquauxCraddock2013}. In the last approach, an unsupervised or semi-supervised learning algorithm is employed to subdivide the functional data into homogeneous units. Several different algorithms have been applied to data-driven node specification that vary in their definition of clustering cost, their incorporation of explicit constraints, and whether they are performed at the individual or group level \cite{bellec, flandin, craddock, blumensath, kiviniemi}. This prompts the inevitable question of which parcellation is the best representation of the human connectome. Overall, there is no consensus on the best node definition to use, although there is strong evidence that data-driven approaches outperform anatomical atlases (figure from cameron paper).
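To make the data-driven option concrete, the following minimal sketch parcellates voxel time series into a fixed number of spatially contiguous regions using spatially constrained Ward clustering; this is one of several possible algorithms, not the specific method of any study cited above, and the parameters are illustrative.
\begin{verbatim}
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

def parcellate(data, n_parcels=200):
    """data: 4D array (x, y, z, time). Returns a 3D array of parcel labels.
    In practice a brain mask would be applied first; it is omitted here
    to keep the sketch short."""
    nx, ny, nz, nt = data.shape
    X = data.reshape(-1, nt)                  # voxels x time
    connectivity = grid_to_graph(nx, ny, nz)  # merge only spatial neighbours
    ward = AgglomerativeClustering(n_clusters=n_parcels,
                                   connectivity=connectivity,
                                   linkage="ward")
    return ward.fit_predict(X).reshape(nx, ny, nz)
\end{verbatim}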
A considerable drawback of using data-driven approaches for defining brain regions is that the number of regions in the clustering solution, or conversely their size, must be specified. The choice of node size has been shown to have a considerable effect on the resulting graph topology \cite{Zalesky} and can also affect the outcome of an analysis \cite{Cecci_2009}. There are a variety of mechanisms for optimizing the number of brain areas in a parcellation, many of which use cross-validation, but in application they do not tend to converge to a single optimum and instead provide a range of suitable parcellations \cite{craddock2012}. The choice of parcellation will ultimately depend on the problem to be solved by the analysis and the amount of error that one can allow when interpreting the results.
\subsection{Defining the edges}
After deciding the granularity of each region, i.e. whether it is an individual voxel, a set of spatially contiguous voxels, or even a set of overlapping regions (with the ICA approach, for example, one voxel can belong to more than one node of the graph), one needs to construct the edges that describe the relationships between nodes. While general graphs can have multiple edges between two nodes, brain graphs tend to be simple graphs with a single undirected edge between each pair of nodes (i.e. the direction of influence between nodes is unknown). Additionally, edges in graphs of brain function tend to be weighted, i.e. annotated with a value that indicates the similarity between nodes.
The first step in defining edges is summarizing the functional information present in each node. A representative time course for a node may be created either by averaging the time courses of all the voxels in the node, or by taking the first eigenvariate from a singular value decomposition of those time courses \cite{friston_functional_localizers}. The former is preferred because it tends to represent all of the time courses in a node equally, rather than favoring those that most commonly occur \cite{craddock2012}. These summary time courses can either be compared directly between nodes, or be used in a seed-based correlation analysis to construct node-specific whole brain iFC maps, which are then compared spatially between nodes. Although these two features are seemingly similar, they can lead to very different estimates of similarity between nodes \cite{craddock2012}.
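A minimal sketch of the two node summaries discussed above is given below, assuming \texttt{ts} is a (time $\times$ voxels) array containing the time courses of one node's voxels; the function names are hypothetical.
\begin{verbatim}
import numpy as np

def node_mean(ts):
    """Representative time course as the average across voxels."""
    return ts.mean(axis=1)

def node_first_eigenvariate(ts):
    """Representative time course as the first left singular vector of the
    demeaned voxel time courses, scaled by its singular value."""
    centered = ts - ts.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]
\end{verbatim}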
Quite a few different measures have been used in the iFC literature for quantifying the similarity between time courses, many of which have been reviewed in an extensive simulation study \cite{smith2010}. By far the most commonly employed measure is Pearson's correlation coefficient, which performs well in simulation \cite{smith2010}. A benefit of Pearson's correlation is that it is scaled so that it is insensitive to differences in scale and shifts between time courses. This is particularly useful for fMRI, which is a relative measure of brain function that can vary widely across the brain for reasons that are non-neuronal in origin, e.g. proximity to a vein \cite{smithbook}. A drawback is that Pearson's correlation is a bivariate measure that does not adequately address the simultaneous influence that multiple nodes have on one another. The resulting loss of specificity can produce ``phantom'' edges between regions that are a by-product of shared variance from a mutual relationship with another region (or several other regions) \cite{VaroquauxCraddock2013}. Partial correlation methods simultaneously take all of the nodes into account and as such have higher specificity. However, since there are often more nodes in a graph than independent observations (particularly when accounting for temporal autocorrelation), partial correlations cannot be directly measured from the data; instead, regularized regression techniques can be employed \cite{VaroquauxCraddock2013}. Approaches for measuring the spatial similarity between iFC maps have been much less studied in the literature. Commonly employed measures include $\eta^2$ \cite{} and the concordance correlation coefficient \cite{}, both of which measure how identical the two maps are, including differences in shift and scale. We have yet to find a report in which partial correlation was used to calculate the similarity between iFC maps.
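As an illustration, the sketch below estimates both full correlations and regularized partial correlations for a set of node time courses stored in a hypothetical (time $\times$ nodes) array \texttt{node\_ts}; the graphical lasso is used here as one example of a regularized estimator, not as the method endorsed by any particular study.
\begin{verbatim}
import numpy as np
from sklearn.covariance import GraphicalLassoCV

def full_correlation(node_ts):
    """Pearson correlation between every pair of node time courses."""
    return np.corrcoef(node_ts, rowvar=False)

def partial_correlation(node_ts):
    """Partial correlations derived from a sparse (graphical lasso)
    estimate of the inverse covariance matrix."""
    precision = GraphicalLassoCV().fit(node_ts).precision_
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr
\end{verbatim}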
Once the similarity between nodes has been calculated, a threshold must be chosen to indicate the presence of an edge. Negative relationships are possible given the commonly employed similarity measures, but these are typically removed from the graph, since it is unclear how to interpret negative similarity and since many graph theoretical techniques require nonnegative edge weights. It is common to apply a significance threshold to the graph to exclude false positive edges. These thresholds can be determined by parametric tests, such as converting Pearson's correlation to a p-value, or by non-parametric tests such as wavestrapping \cite{breakspear}, the circular block bootstrap \cite{bellec}, or phase randomization \cite{zhen}. It is important to correctly account for autocorrelation in the data when employing these techniques. Alternatively, a sparsity threshold may be employed to retain a fixed percentage of the edges with the strongest weights \cite{}. This is useful when calculating graph statistics that are highly sensitive to the number of edges in the graph, and may serve to ameliorate batch effects that exist between datasets acquired on different scanners or at different sites \cite{Chaogan}. Once the below-threshold edges have been excluded from the brain graphs, they are ready for group-level analysis.
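A minimal sketch of the sparsity-based option is shown below for a symmetric similarity matrix \texttt{adj}; the matrix name and the default edge density are illustrative choices.
\begin{verbatim}
import numpy as np

def sparsity_threshold(adj, density=0.10):
    """Keep the strongest `density` fraction of positive, off-diagonal
    edge weights and set all other entries to zero."""
    A = adj.copy()
    np.fill_diagonal(A, 0.0)
    A[A < 0] = 0.0                        # drop negative weights
    upper = A[np.triu_indices_from(A, k=1)]
    cutoff = np.quantile(upper, 1.0 - density)
    A[A < cutoff] = 0.0
    return A
\end{verbatim}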
Despite the successful application of the techniques described in this section, a drawback of representing a brain graph as a simple collection of edges is that this representation throws away all information about the structure of the graph. Retaining this structure within an analysis, an approach commonly known as frequent subgraph mining (FSM), has facilitated the discovery of features that better discriminate between different groups of graphs \cite{Harrison2013}. For instance, the authors of \cite{Bogdanov2014} identified discriminative subgraphs from functional connectivity graphs that had high predictive power for high versus low learners of specific motor tasks. The review in \cite{Richiardi2013} outlines other approaches that take graph structure into account, e.g. the graph edit distance and a number of different graph kernels. All of these methods are under active development and have not yet been widely adopted by the connectomics community.
\subsection{Comparing brain graphs}
The ultimate goals of connectomics are to map the brain's functional architecture and to annotate that architecture with the cognitive or behavioral functions that it subtends. The latter pursuit is achieved by a group-level analysis in which variations in the connectome are mapped to inter-individual variations in phenotype \cite{Kelly2011}, or to intra-individual responses to experimental perturbations \cite{Shirer}. Several different analyses have been proposed for accomplishing these goals, and they all require some mechanism for comparing brain graphs.
Approaches to comparing brain graphs can be differentiated by how they treat the statistical relationships between edges. One such approach, referred to as the \emph{bag of edges}, treats each edge in the brain graph as a sample from some random variable. Thus, a set of $N$ brain graphs, each with $M$ edges, provides $N$ observations for each of the $M$ random variables. In this case, the adjacency (or similarity) matrix that describes each brain graph can be flattened into a vector representation, and any of the well-explored similarity or dissimilarity metrics can be applied to the data \cite{Ravindran}. One of the benefits of this representation is the ability to treat each edge as independent of all other edges and to compare graphs using a mass univariate analysis, in which a separate univariate statistical test (e.g. t-test, ANOVA, or ANCOVA) is performed at each edge. This results in a very large number of comparisons, and an appropriate correction for multiple comparisons, such as the Network Based Statistic \cite{Zalesky2011}, Spatial Pairwise Clustering \cite{Zalesky2012}, Statistical Parametric Networks \cite{Ginestet2013}, or the group-wise false discovery rate \cite{}, must be employed to control the number of false positives. Alternatively, the interdependencies between edges can be modeled at the node level using multivariate distance multiple regression (MDMR) \cite{Shehzad2014}, or across all edges using machine learning methods \cite{Craddock2009, Dosenbach2010, Richiardi2011}. Despite the successful application of this technique, a drawback of representing a brain graph as a bag of edges is that this representation throws away all information about the structure of the graph. In an effort to overcome these limitations, work is being done on sub-graph approaches \cite{} \todo{how about directly comparing graphs using a graph similarity metric}
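A minimal sketch of the bag-of-edges approach with an edge-wise t-test is given below, assuming \texttt{graphs\_a} and \texttt{graphs\_b} are (subjects $\times$ nodes $\times$ nodes) arrays of connectivity matrices for two groups; the Benjamini--Hochberg false discovery rate is used as one example of a multiple comparison correction.
\begin{verbatim}
import numpy as np
from scipy import stats

def edgewise_ttest(graphs_a, graphs_b, alpha=0.05):
    """Two-sample t-test at every edge, followed by Benjamini-Hochberg
    FDR correction. Returns t-values, p-values, and a significance mask
    over the vectorized upper triangle."""
    iu = np.triu_indices(graphs_a.shape[1], k=1)
    edges_a = graphs_a[:, iu[0], iu[1]]          # subjects x edges
    edges_b = graphs_b[:, iu[0], iu[1]]
    t, p = stats.ttest_ind(edges_a, edges_b, axis=0)

    order = np.argsort(p)                        # BH step-up procedure
    thresholds = alpha * np.arange(1, p.size + 1) / p.size
    passed = p[order] <= thresholds
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    significant = np.zeros(p.size, dtype=bool)
    significant[order[:k]] = True
    return t, p, significant
\end{verbatim}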
Another approach for comparing graphs using all of the vertices involves computing a set of \emph{graph invariants}, such as node centrality, modularity, or global efficiency, and using the values of these measures to represent the graph \cite{rubinov,bullmoreReview}. Depending on the invariant used, this approach may permit the direct comparison of graphs that are not aligned. Another advantage is that invariants substantially reduce the dimensionality of the graph comparison problem. On the other hand, representing the graph by its computed invariants throws away information about the graph's vertex labels \cite{Vogelstein2012}. Moreover, after computing these invariants it is often unclear how they can be interpreted biologically. It is important that the invariant used matches the relationships represented by the graph. Since edges in functional brain graphs represent statistical dependencies between nodes and not anatomical connections, many of the path-based invariants do not make sense, as indirect relationships are not interpretable \cite{}. For example, the relationships $A \leftrightarrow B$ and $B \leftrightarrow C$ do not imply that there is a path between nodes $A$ and $C$; if a statistical relationship between $A$ and $C$ existed, they would be connected directly.
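The sketch below computes a few commonly reported invariants for a thresholded, non-negative adjacency matrix \texttt{adj} using NetworkX; note that, for brevity, several of these functions treat the graph as unweighted, and the choice of invariants is illustrative.
\begin{verbatim}
import networkx as nx
import numpy as np

def graph_invariants(adj):
    """A few whole-graph summary measures for a thresholded adjacency
    matrix (edge weights are ignored by some of these measures)."""
    G = nx.from_numpy_array(adj)
    communities = nx.algorithms.community.greedy_modularity_communities(G)
    return {
        "mean_degree_centrality": float(
            np.mean(list(nx.degree_centrality(G).values()))),
        "global_efficiency": nx.global_efficiency(G),
        "average_clustering": nx.average_clustering(G),
        "modularity": nx.algorithms.community.modularity(G, communities),
    }
\end{verbatim}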
\section{Emerging Techniques}
\todo{Goal for each of the sections below is a 200-word maximum description of the emerging technique and its challenges.}
\subsection{High-speed image acquisition}
The emergence of high-speed functional imaging techniques is perhaps the biggest advance in fMRI since the discovery of BOLD. Although many of these techniques involve 3D imaging \cite{glover, MRNguy} or echo shifting techniques \cite{PRESTO}, which come at a considerable cost in resolution or image distortion, the recently introduced multi-band imaging can obtain images at the same resolution as classical methods with acceleration factors as high as $8\times$ \cite{Feinberg}. One hope is that the increase in temporal resolution afforded by these techniques will improve the ability to model brain connectivity, particularly its temporal dynamics \cite{smithTICA}, but it is not clear that the additional observations translate into increased degrees of freedom \cite{MRNguy}. What is certain is that faster acquisitions improve the ability to model noise, and they are approaching sampling rates that will enable physiological noise to be removed by digital filtering \cite{beckmann}; however, these speedups come at the cost of increased spatial smoothing and reduced image contrast \cite{needfind}. The growing popularity of high-speed acquisition techniques will require the development of new analysis methods to take full advantage of the increases in data size and information they afford.
\subsection{Dynamic connectivity}
Standard seed- and ICA-based methods for mapping iFC assume that it is stationary, and derive connectivity patterns from the entirety of the available fMRI time course. Recent studies, however, have demonstrated that connectivity between brain regions changes dynamically over time \cite{Chang, Keilholz, Hutchinson2013, Fu2013}. A variety of investigations of dynamic iFC have already been performed, most of which measure connectivity within a small window of the fMRI time course that is gradually moved forward in time \cite{}. Several problems must be overcome in order to reliably measure changing functional connectivity patterns from the inherently slow and poorly sampled fMRI signal. First, the variance of correlation estimates increases with decreasing window size, meaning that unless proper statistical controls are utilized, the observed dynamics may arise solely from this increased variance \cite{}. This issue may be mitigated by the new high-speed imaging methods, which have already shown promise for extracting dynamic network modes using temporal ICA, although a very large number of observations is still necessary \cite{Smith2012}. \todo{verify this} Node definition is another issue, as it is unclear whether brain areas defined from static iFC are appropriate for dynamic iFC, although initial work has shown that parcellations of at least some brain regions derived from dynamic iFC are consistent with those found with static iFC \cite{Yang2013}. As the connectomics field moves toward dynamic connectivity, there will be a large need for new analysis paradigms and tools for its identification and interpretation.
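For reference, the sliding-window approach described above can be sketched as follows, assuming \texttt{node\_ts} is a (time $\times$ nodes) array; the window length and step size are illustrative and would in practice be chosen with the statistical caveats above in mind.
\begin{verbatim}
import numpy as np

def sliding_window_fc(node_ts, window=30, step=5):
    """Stack of correlation matrices, one per window position, with shape
    (n_windows, n_nodes, n_nodes)."""
    n_t = node_ts.shape[0]
    mats = [np.corrcoef(node_ts[start:start + window], rowvar=False)
            for start in range(0, n_t - window + 1, step)]
    return np.stack(mats)
\end{verbatim}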
\subsection{Prediction}
Resting state fMRI and iFC analyses are most commonly applied to studying clinical disorders, and to this end the ultimate goal is the identification of biomarkers of disease state, severity, and prognosis \cite{DiMartino}. Prediction modeling has become a popular analysis method because it most directly addresses the question of biomarker efficacy \cite{craddock,Dosenbach,review}. Additionally, the prediction framework provides a principled means for validating multivariate models that deal with the statistical dependencies between edges more accurately than mass univariate techniques, all while obviating the need to correct for multiple comparisons. The general framework involves learning a relationship between a \emph{training} set of brain graphs and a corresponding categorical or continuous variable. The features for the brain graphs can be (1) a set of topological properties of each brain graph \cite{Cecci2009, Bassett2012}, (2) a vector embedding of the brain graphs \cite{Richiadi2013,Luo2003}, or (3) the result of passing the brain graphs through a graph kernel \cite{}. The learnt model is then applied to an independent \emph{testing} set of brain graphs to decode, or \emph{predict}, their corresponding values of the variable. These values are compared to their ``true'' values to estimate \emph{prediction accuracy}, a measure of how well the model generalizes to new data. Several different strategies can be employed to split the data into training and testing datasets, although leave-one-out cross-validation has high variance and should be avoided \cite{}. Although the advanced machine learning methods commonly employed in this framework offer excellent prediction accuracy, they are often black boxes for which the information used to make the predictions is not easily discernible. To this end, sparse methods and feature selection can be employed to reduce the input variables to only those that are essential for prediction, thereby aiding the extraction of neuroscientifically meaningful information from the learnt model. A variety of different machine learning algorithms have been applied to analyzing brain graphs in this manner, but by far the most commonly employed has been the support vector machine \cite{DiMartino}. There is still considerable work to be done in improving the extraction of information from these models, in developing techniques that permit multiple labels to be considered jointly, and in developing kernels for measuring distances between graphs.
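A minimal sketch of this framework is given below, assuming \texttt{graphs} is a (subjects $\times$ nodes $\times$ nodes) array of connectivity matrices and \texttt{labels} holds one diagnostic label per subject; a linear support vector machine with stratified k-fold cross-validation is shown as one common, but by no means the only, choice.
\begin{verbatim}
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def predict_from_graphs(graphs, labels, n_folds=5):
    """Cross-validated classification accuracy from bag-of-edges features."""
    iu = np.triu_indices(graphs.shape[1], k=1)
    X = graphs[:, iu[0], iu[1]]                  # vectorized upper triangles
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    return cross_val_score(clf, X, labels, cv=cv)  # accuracy per fold
\end{verbatim}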
\subsection{Subdividing populations}
An essential
Clustering individuals based on neurophenotype a la Damian.
\subsection{Multimodal integration}
Combining information from different modes to learn more
\subsection{Causality}
This will be rewritten to emphasize roles of integration with brain stimulation.
To date, a lot of work has been done to identify nodes within a brain graph that share some degree of mutual information (known as functional connectivity). Another interesting question to answer with regard to brain graphs is which nodes have a direct influence on the signal obtained in other nodes (known as effective connectivity). In other words, how can we identify the directional causal influence of one brain region over another using brain graphs? This is a hard problem to tackle using rsfMRI. Simple correlation analysis does not suffice because a ``high correlation between remote sampling sites might imply some direct connection, or it might imply some third site driving their joint activation, or it might imply them jointly driving some third site. And even if they are connected, it's difficult to tell the direction of the connection, even if there is a `direction' to it''\footnote{\url{http://mindhive.mit.edu/node/58}}.
Current approaches include dynamic causal modeling (DCM) \cite{}, structural equation modeling (SEM) \cite{}, and Granger causality (GC) \cite{}. While these methods have shed some light on the wiring of the brain, some problems remain unresolved. DCM analysis requires an external stimulus that perturbs a given brain network; such a stimulus is not present in rsfMRI data, which is acquired while the subject is at rest. With the GC approach, on the other hand, one attempts to identify a specific node that influences the characteristics of another node in a given brain network. The drawback of this approach is that the hemodynamic response underlying the fMRI signal naturally lags neuronal activity by a few seconds, and this lag can vary across brain regions. Thus, it is non-trivial to conclude that any observed lag in the signal is specifically due to one node exerting an influence over another.
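For concreteness, a pairwise Granger-causality test between two node time courses can be sketched as below using the \texttt{statsmodels} package; this merely illustrates the lag-based logic discussed above (the array names are hypothetical) and is subject to the same caveats about hemodynamic delays.
\begin{verbatim}
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def granger_p_value(target, source, maxlag=2):
    """p-value of the null hypothesis that `source` does not
    Granger-cause `target`, using an F-test on the lagged regression."""
    # grangercausalitytests tests whether the second column Granger-causes
    # the first column
    data = np.column_stack([target, source])
    result = grangercausalitytests(data, maxlag=maxlag)
    return result[maxlag][0]["ssr_ftest"][1]
\end{verbatim}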
\subsection{Batch effects}
With the growing prevalence of openly shared data, the desire to compare data acquired on different scanners, with different scanning protocols, and at different sites is becoming common. Although early studies have found that the amount of biological variation between individuals outweighs between-site variation \cite{biswal}, there is still a need to standardize datasets against batch effects \cite{YanStand}. Ideally, this will entail the acquisition of calibration data, but until an accepted standard for what this calibration data should include exists and its collection becomes commonplace, post-hoc statistical corrections will be essential. Understanding the nature of batch effects and developing adequate correction strategies is in its infancy, and there is substantial need for contributions in this area.
\section{Open Science Resources}
As with other big data problems, the analysis of the type of brain data described above will benefit significantly from open science and open data. By open science we mean the process of making intermediate and end data products available to researchers outside the lab that initiated the investigation, and by open data we mean data sharing and reuse, i.e. allowing groups other than the one that acquired the data to further analyze freely available data \cite{Milham2012}. Both open science and open data have the benefit of enabling questions to be asked of a dataset that the people who acquired it did not conceive. A great example of this is the bioinformatics community, which came into existence due to the availability of large, publicly available genomics datasets that could be used by anybody who possessed the skills and knowledge to mine them \cite{VanHorn2013}. This brought together scientists and researchers from disparate fields, ranging from biology and statistics to computer science and engineering. The brain imaging community can aspire to the same by making data acquired in different labs publicly available. At the moment there are a few public databases, such as the fMRI Data Center\footnote{fmri datacenter webaddress}, LONI IDA\footnote{LONI webaddress}, the 1000 Functional Connectomes Project (FCP)\footnote{\url{http://fcon_1000.projects.nitrc.org}}, the International Neuroimaging Data-sharing Initiative (INDI)\footnote{INDI webaddress}, ABIDE preprocessed\footnote{\url{http://preprocessed-connectomes-project.github.io/abide/}}, the Human Connectome Project\footnote{HCP webaddress}, ADHD-200 preprocessed\footnote{\url{http://neurobureau.projects.nitrc.org/ADHD200/Introduction.html}}, and CoRR\footnote{\url{http://fcon_1000.projects.nitrc.org/indi/CoRR/html/index.html}}. Among these, a few, such as the data from the Human Connectome Project and the Cameron data set, contain clean and tidy data. The latter will speed up knowledge discovery and new method development by alleviating the need for newcomers to acquire the knowledge and skills required to properly clean raw MRI data.