BrainImageR provides a suite of tools to compare gene expression data to post-mortem human brain data from the Allen Brain Atlas. BrainImageR has two main functionalities, the first which is presented in Section 1 is to characterize spatial gene set enrichment (SGSE) in the human brain using either the developing brain or the adult brain as the reference material. The second, presented in Section 2, predicts the approximate point in developmental time that the sample relates to.
Comparing your dataset versus the human post-mortem reference:
- quantify SGSE
- qualitatively plot SGSE maps
- predict developmental time
library(brainImageR)
library(ggplot2)
brainImageR:::loadworkspace()
Quick Run! These instructions are for those that just want to dive straight in. Otherwise, if you would like step-by-step instructions, see "Section 1" below.
#Spatial
data(vth)
composite <- SpatialEnrichment(genes = vth, reps = 2, refset = "developing")
res <- testEnrich(composite, method = "fisher")
composite <- CreateBrain(composite, res, slice = 5, pcut =1)
PlotBrain(composite)
#Temporal
data(dat)
Time <- predict_time(dat, minage = 0 , maxage = 40)
head(Time$pred_age)
The Spatial Gene Set Enrichment (SGSE) tools provide the user with a quantitative measure of gene set enrichment within the postmortem brain as well as additional tools to dig in deeper into the genes that overlap across regions, and an ability to plot the SGSE over reference drawings of the human brain. These set of tools work both for the developing and adult human brain.
One point to note is that a major advantage of brainImageR is the ability to cross-reference two disparate datasets from the Allen Brain Atlas:
- microdissected tissue: Transcription datasets from microdissected tissue. This set is used to identify which genes are present within a given tissue.
- general brain area: Regions noted in the Allen Brain Atlas reference map. This set is used to plot the final SGSE over a map of the human brain.
These two datasets do not share a 1:1 relationship. As a general rule more than one microdissected tissue will be combined to generate the profile plotted over the general brain area reference atlas. We will refer to these two datsets as either microdissected tissue or general brain area throughout the vignette to clearly distinguish the two datasets being queried.
The SGSE analysis works directly off of gene lists such as those that are
returned from a differential expression analysis. We've provided two datasets
to examine SGSE in either the developing ventral thalamus vth
or the
adult hippocampus hip
.
data(vth)
length(vth)
head(vth)
data(hipp)
length(hipp)
head(hipp)
brainImageR
works from gene names in the Human Gene Symbol format.
If your genes are not in this format they must first be converted.
There are many utilities online for this purpose such as GeneIDConversion
from DAVID Bioinformatics. https://david.ncifcrf.gov/conversion.jsp
The first step in SGSE is to compare your query gene list to the genes known
to be expressed in post-mortem human brain microdissected tissue using
SpatialEnrichment
. There are three important arguments
that go into SpatialEnrichment
.
SpatialEnrichment
Arguments:
genes
: The query gene list of common Human Gene Symbols.refset
: The user will pick whethergenes
is compared to the developing human or the adult human by assigning therefset
to "developing" or "adult".reps
: The final significance estimates are determined by eirther a fisher's exact test or by bootstrapping the raw data. If you choose to bootstrap, increasing the number of iterations or "reps" will increase the sensitivity of this analysis. For the initial exploratory analysis it is recommended to set this value low (ex: 20). This will allow you to assess different conditions and gene sets. We suggest that for the final analysis thatreps
be increased to a number that can successfully discern comparisons that are significant after Bonferroni correction (developing = 6540 reps, adult = 3680 rep).
composite <- SpatialEnrichment(genes = vth, reps = 2, refset = "developing")
Additionally you can input a user-defined background list for more nuanced analyses.
background
: A gene list of common Human Gene Symbols to perform a user-defined background correction.
The second step in SGSE is to calculate significance of the observed
fold-change compared to random chance within the microdissected tissue
with testEnrich
.
testEnrich
only requires the output of SpatialEnrichment
to run.
If using method = "fisher" a standard fisher's exact test is called.
If using method = "bootstrap" the p-value will be estimated from
the raw data.
res <- testEnrich(composite, method = "fisher")
It is good practice to save your data at this point.
save(list = c("composite", "res"), file = "/mydir/myfile.rda")
The table produced from testEnrich
reports the enrichment score for all
tested microdissected tissue as well as the raw and
adjusted p-values. These values can be used to identify the exact
microdissected tissues that are over- or under-represented by the
gene list.
Output from testEnrich
:
- count.sample: overlap of query gene list and microdissected tissue gene list
- count.random: average overlap of random gene list and microdissected tissue gene list
- FC: fold-change between sample and random
- p-value: raw p-value
- p-adj: FDR corrected p-value
- abrev: Abreviation of the tissue name provided by Allen Brain Atlas
reference. This abreviation can be used as input to the Allen Brain Atlas
reference map to identify where exactly the tissue is located within the
general brain area. It can also be used as input into
GetGenes
to determine exactly which of your query gene list overlaps with the specific tissue. - structure: Full tissue name.
res <- res[order(res$FC, decreasing=TRUE),]
head(res)
The results from testEnrich
can be easily plotted with PlotEnrich
PlotEnrich(res)
One of the top enriched microdissected tissues is "LHAa" or the lateral hypothalemic area, anterior part. Let's see what genes are overlapping between the vth set and the LHAa tissue.
vth_lha_overlap <- GetGenes(vth, composite, tissue_abrev = "LHAa")
length(vth_lha_overlap)
Then using GO term enrichment packages such as clusterProfiler
we can
identify what GO terms are enriched given these overlapped genes.
library(clusterProfiler)
library(org.Hs.eg.db)
vth_go <- enrichGO(gene = vth_lha_overlap,
OrgDb = org.Hs.eg.db,
keytype = 'SYMBOL',
pvalueCutoff = 0.05,
qvalueCutoff = 0.05)
dotplot(vth_go, showCategory=30)
Next we would like to see if neighboring brain areas share SGSE estimates.
To do this we will use CreateBrain
to merge all of the
microdissected tissues returned from testEnrich
so that each
general brain area present within the Allen Brain Atlas reference map
is supported by the respective set of underlying tissues.
As mentioned above, note that the general brain areas within the Allen Brain Atlas reference map and the microdissected tissues present within the Allen Brain Atlas transcription datasets do not have a 1:1 relationship. Therefore some areas are not supported by transcriptional information in the reference map while other areas are supported by more than one tissue.
CreateBrain
takes the results from SpatialEnrichment
and testEnrich
as input.
There are two additional arguments to consider.
CreateBrain
arguments:
- slice: Image from within either the developing or adult brain set to plot. The choices for both the developing and adult brain are 1-10.
- pcut: p-value threshold for considering a microdissected tissue as being enriched or depleted
Since we initially found the LHAa interesting, let's see what slice of the developing brain contains general brain areas composed of the LHAa.
tis_in_region(composite, "LHAa")
Slices 6 and 7 both have relevant information for LHAa in the thalamus (THM). Let's use slice 6 for downstream analysis.
composite <- CreateBrain(composite, res, slice = 5, pcut = 0.05)
Once the microdissected tissues have been merged into the corresponding general brain areas, the image can be plotted.
PlotBrain(composite)
From this image it looks like the thalamus is indeed enriched for the vth gene set.Interestingly surrounding areas are also enriched for this set. By referencing the Allen Brain Atlas we can see that those are general brain areas like the Putamen (Pu) and Caudate Nucleus (Ca).
Note that you can either reference the Allen Brain Atlas directly or use
available_areanames
to identify the list of available general brain area
names to query.
available_areanames(composite, slice = 6)
Let's look more closely at the general brain area of the Putamen (Pu).
Using tis_set
we can work backwards and identify which
microdissected tissues support the final color code in a given
area of interest.
tis_set(composite, area.name = "Pu", slice = 6)
And now we can go back and observe the gene overlap in a microdissected tissue within the Putamen. Here we focus in on the medioventral part of putamen (Pmv)
vth_pld_overlap <- GetGenes(vth, composite, tissue_abrev = "Pmv")
length(vth_pld_overlap)
##vth_go <- enrichGO(gene = vth_pld_overlap,
## OrgDb = org.Hs.eg.db,
## keytype = 'SYMBOL',
## pvalueCutoff = 0.05,
## qvalueCutoff = 0.05)
##dotplot(vth_go, showCategory=30)
The GO term "hormone activity" has showed up in both the comparison
with LHAa and Pmv. Perhaps there is a shared pathway across these tissues.
Let's use whichtissues
to see if the genes with hormone activity in the
Pmv are also present in the LHAa.
#grab the genes associated with hormone activity
#vth_go2 <- data.frame(vth_go)
#vth_match <- vth_go2$Description == "hormone activity"
#vth_pmv_hormone <- vth_go2[vth_match, "geneID"]
#vth_pmv_hormone <- unlist(strsplit(vth_pmv_hormone,"/"))
#identify the presence of these genes across all tissues
#vth_pmv_hormone_tis <- whichtissues(vth_pmv_hormone, refset = "developing")
#vth_pmv_hormone_tis[,1:10]
#which genes are present (1) in the LHAa
#all <- vth_pmv_hormone_tis[,"LHAa"]
#vth_pmv_hormone[ all == 1]
Indeed all of the genes involved in hormone activity in the Pmv are also expressed in the LHAa.
Using the above mentioned functions, brainImageR has provided us with a easy way to query SGSE with respect to the postmortem human brain. We calculated the significance of the enrichment of gene overlaps and visualized these enrichments with images of the human brain. Now let's move on to predicting developmental time from a dataset.
An additional functionality of brainImageR is predicting the developmental
timepoint of the sample with reference to the postmortem human brain.
This analysis takes a normalized expression matrix as input.
The columns of this matrix should be samples, and the rows should be gene
names in the Gene Symbol format. An example is provided here in dat
.
data(dat)
dim(dat)
head(rownames(dat))
head(colnames(dat))
state <- do.call("rbind", strsplit(colnames(dat), ".", fixed = TRUE))[,1]
In the simplest case, one can predict developmental time with respect
to the full Allen Brain Atlas dataset by providing predict_time
. It
has been shown across multiple studies that neurons derived in-vitro
primarily recapitulate prenatal development. Therefore we've provided
additional flexibility in predict_time
to obtain higher resolution
within this prenatal window. This is done by restricting the analysis
to samples within the Allen Brain Atlas that come from a given class
of data. You can see which datasets that we've precomputed using
availabledatasets
.
#availabledatasets()
Let's start with the default settings which will use all of the available samples to perform the temoral prediction.
time <- predict_time(dat, minage = 8, maxage = 40)
PlotPred
will show the predictions overlaid on top of predictions
against the Allen Brain Atlas reference set (aba) to show how the
model performs on the starting dataset. This plot enables the user
to have an intuitive understanding of how the model is performing
PlotPred(time)
As we can see, using all samples, the prediction is linear across all timepoints. As with most in vitro studies of neurons, our samples are clustering with the prenatal samples. Although our model is good at predicting time over all ages, there is the issue that the prenatal samples are predicted within a small window of each other. To provide higher resolution within this time period lets restrict the analysis to use only prenatal samples in the prediction.
If we select "prenatal" this will restrict the predictions to only those samples that are < 40 weeks post-conception.
#time <- predict_time(dat,dataset = "prenatal")
#PlotPred(time)
Now we can see that the model has higher resolution within the prenatal timepoints. Note that this model should not be used for postnatal samples, as (conversely to the default model) it has a reduced ability to resolve differences between postnatal timepoints.
Now that we have a model that is useful for examining samples with respect to prenatal development, lets compare the NPCs to the Neurons.
#time2 <- data.frame(pred_age = time@pred_age, state)
#time2$state <- factor(time2$state, c("NPC","Neurons"))
#ggplot(time2, aes(state,pred_age))+
# geom_boxplot()+
# ylab("Predicted time, weeks post-conception")+
# labs(title = "predicted developmental time,\ndataset = prenatal")
As expected the neural progenitor cells are predicted to be younger
in developmental time than the neurons. In the manuscript associated
with this package there are additional examples of predicting developmental
time on post-mortem and in vitro-derived neurons showing the utility of
predict_time
across datasets and types.
There are additional options in predict_time
that allow the user
to further refine their analysis.
predict_time arguments:
- dat: A normalized expression matrix where columns are samples, and rows are genes defined by Gene Symbol.
- dataset: If desired, the user can work from a pre-computed model.
Use
availabledatasets
to see what options are available. default = "all". - genelist: If you would like to restrict the predictions to a
certain subset of genes you can retain your full dataset in the argument
dat
and input the restricted gene list here. - minage: Minimum weeks post-conception for running the prediction.
- maxage: Maximum weeks post-conception for running the prediction.
- tissue: Tissue of interest to restrict the dataset to.
- minrsq: (0-1).
minrsq
is correlated with the predictive strength of the model. Reduce the value to allow weaker models to be calculated.
Using these additional options we have more flexibility in our predictions.
For example, let's say that we have generated neurons that we expect to
be functionally similar to the adult amygdala and therefore would prefer
a model that most closely represents that profile. We can set minage
to 40 and tissue
to AMY and recalculate our predictions.
#time <- predict_time(dat,minage = 40, tissue = "AMY")