-
Notifications
You must be signed in to change notification settings - Fork 6
pydpiper: Introduction to twolevel_model_building.py (old)
So you want to run 2 level model building version of pydpiper?
Why: You have MRI scans and want to know if there are volume differences between groups (e.g. stimulated animals vs not, or mutant vs not), AND you have a longitudinal design with multiple scans per subject over time.
Note: This volumetric assay is deformation based (it stretches and fits your brain images to an average and asks HOW MUCH stretching or shrinking was needed to make a single scan overlap with said average). It gives you an estimate of this for every voxel (a jacobian) which you can then test statistically at every voxel. Then you can map these statistics back onto the average to see where groups are different and in what way (bigger or smaller). If you want raw volumes, MAGeT brain is probably a better idea.
Make a directory for your run.
Make a csv file (e.g.: "mnc-list.csv") in this directory with a list of all the ABSOLUTE PATHS (the realpath
command can be helpful here) of scans you want to include. E.g. /data/chamal/projects/chloe/chloe_preprocessed/structural_lsq6/mch_rca111_001_20171121_102212_131073_mri_lsq6.mnc.
The first column contains paths to each scan (all timepoints included, the order does not matter), and the second column a unique identifier for each subject. Subjects must have at least two timepoints (i.e. appear twice in the second column) or the pipeline will fail.
Make a script for the run in the directory. Here is an example:
module load anaconda/5.1.0-python3 minc-toolkit/1.9.17 minc-stuffs/0.1.24^minc-toolkit-1.9.17 qbatch pydpiper/2.0.13
twolevel_model_building.py --verbose --pipeline-name=whateveryouwant \
--csv-file mnc-list.csv --resolution 0.1
twolevel_model_building.py
is the program that runs the pipeline. Type twolevel_model_building.py --help
to get options for the command. "Verbose" means the terminal will talk to you as the cluster crunches through the steps of the pipeline for instance.
Name your pipeline. This one is called "whateveryouwant". You must specify the number of executors (processing units) to us (e.g.: 150).
The lsq6 stage involves the initial major alignment by rotation. This script used the simple option, which means the scans can be assumed to be in the same general orientation and native space as your initial model or atlas. This is generally NOT a safe assumption unless you've done some fancy preprocessing to your images and atlas (as done by our preprocessing script. Most users should use --lsq6-large-rotations
which tests a large amount of orientations before settling on a best fit.
The --init-model
is where you tell the script which atlas to use. Do so.
Two pipeline has the option to run MAGeT in order to segment label volumes, however we choose not to as have found problems with the template library selection. If you are interested in running MAGeT you can do so separately, and refer to our wiki.
Finally, you must point the script at the csv (e.g.: "mnc-list", kept in the same folder as the scans) you previously made with all the absolute paths to scans listed.
Run your script and wait.
So your terminal seems to be accepting commands again, so your pipeline run finished... or did it? How can you tell? Did it do a good job? QC'ing is a must before moving on to look at your results. Your results are meaningless if the process failed at some point.
Inside the folder where you conducted your 2-level run, there should now be a few more folders. You'll notice a first level folder (where each subject's different timepoints were registered together) and a second level folder (where each subject's average was then itself registered to other subjects in one final big average). There are also a ton of logs, notably a pipeline.log which you can open in a text editor if you'd like.
Within the first level folder, you will find a "processed" directory for each subject. In each of these there will be various directories--two of these are the subject's lsq6 and lsq12 directory. In each of these directories should be lsq[6 12]_montage.jpg
files. Use display (lowercase) to open the png. Three slices at different orientations for each scan appear. So if subject 1 has 3 timepoints, three scans (with 3 orientations each) show up plus, for the lsq12 you'll also see the final average (also 3 orientations). The average is the one on the bottom right. Check to make sure the orientations and sizes are similar and in the same place. If all scans look aligned in the same way you're good, but if one or multiple scans look to be oriented differently, registration failed.
You can use the display tool (part of "minc-toolkit" library) to visualize these images.
Now it's time to QC the second level registrations. In the second level _lsq12 folder, there will be an "LSQ12_montage.jpg" that you can use to see if your LSQ12 stage worked for you second level.
#Mask for nlin-3 One more thing for you to check: how is the final mask that your two-level produced? This will be important when you start analyses.
In your run folder, there should be a directory called "[pipeline-name]_second_level-nlin". Go there. Inside you should find multiple mncs that are the averages of the images ([pipeline-name]-nlin-[1 2 3].mnc) and the masks ([pipeline-name]-nlin-[1 2 3]_mask.mnc). They will be differentiated by numbers. We will focus on the last one (with the highest number). To check how good your mask is, type this in your terminal: Display [pipeline-name]-nlin-3.mnc -label [pipeline-name]-nlin-3_mask.mnc (remember to module load minc-toolkit)
In the Display:Menu, click "d" on your keyboard followed by another "d". This will grayscale your background average.
You should see a semi-transparent red mask displayed on a gray background average. Here is mine:
This mask is over segmented, meaning it's including a lot of non-brain voxels as brain. This is especially apparent at the ventral regions; check out those olfactory bulbs! If your mask lines up well, you're good to move on to the next section, but if it looks more like mine, don't worry! There's a fix!
The problem is that our mask is using a very generous function for considering voxels part of the mask called max (voxel-wise maximum). Any voxel that is labelled as a mask in individual scans is considered part of the brain in the final average mask as well. A more conservative approach is voxel-voting, where a voxel must be part of multiple masks for it to end up in the final average.
So, to fix your over-segmented mask, follow these steps:
Load the same modules you used to run pydpiper:
module load minc-toolkit/1.9.16 anaconda/5.1.0-python3 minc-stuffs/0.1.24^minc-toolkit-1.9.16 qbatch pydpiper/2.0.12
If you run 'minchistory [your final mask]', you will see the command used to generate the original nlin-3 mask.
You should see the following: mincmath -clobber -2 -max \ <a long list of files> \ nlin-3_mask.mnc
.
To generate a better mask, we're going to use the voxel_vote
command instead of mincmath
and apply it to the files listed by the minchistory
command. This command works as follows: voxel_vote <list of inputs.mnc> output_mask.mnc
. To do so, move up one directory to your derivatives directory, and run voxel_vote <paste long list of filenames> statsmask.mnc
.
Now you can display your new mask over your old average. Looks much better, eh? See an example below.
Go ahead and use your improved mask in your analysis!
Assuming your run has passed QC, the next step should be to look at your actual results and make fancy brain maps to impress the boss.
As always, make a folder to run your statistical analysis in.
Make a csv with your subject numbers, whatever information you might want to include in your model (timepoint will be present because 2 level model building is longitudinal, then sex, genotype etc), and whatever jacobians you're interested in.
For R to do statistics on your csv, it's easiest to put all these categories into their own labelled columns. Try to avoid special characters in your column headers. Underscores are fine however. Have subject number in one column, sex in another, etc.
When it comes to scan timepoint however, you'll have multiples (due to longitudinality). R likes to have one row per scan, so the csv shown above simply runs through all the subjects for timepoint 1, then repeats all subjects (and all their other info) for timepoint 2 and then again for timepoint 3. R doesn't mind redundancy and doesn't care if subject 1a is listed as being female 3 separate times or just once. Things are organized this way so that every scan can have an associated jacobian (deformation value) file listed with it.
The csv should contain columns with the jacobians you're interested in, listed as absolute paths. So where do you find them? The twolevel_model_building.py will output two .csv files: resampled_determinants.csv and overall_determinants.csv, which have absolute paths to the Jacobian determinants created from your analysis, as well as the corresponding group ID (the same one you specific in your input csv). For your statistics, you'll want to use the Jacobians from the resampled_determinants.csv, as these are resampled into the study average space.
For both of these .csv, you'll have columns with paths to both the Absolute Jacobians (log_full_det) and the Relative Jacobinas (log_nlin_det).
Relative Jacobian determinants explicitly model only the non-linear part of the deformations and remove residual global linear transformations (attributable to differences in total brain size). Absolute Jacobians include the overall linear transformations in addition to the nonlinear ones, so they give you information on overall brain differences. Choosing which ones to look at depends on your population of interest. For a more developmental cohort, you may choose the absolute Jacobians, for a study in which you don't expect large differences in brain volume, Relative Jacobians may be more appropriate. As a rule of thumb, it's always good to look at both.
The next thing to look at is blurring factor, which shows up near the end of the file (0,0.2,). This allows some amount of deformation "bleeding" from one voxel to the nearby voxels. Some amount of influence on neighbouring voxels is necessary. It's unlikely that only a single voxel would grow or shrink in isolation, it must "tug" on surrounding tissue. It smooths out the deformation field which you can imagine as a 3D grid embedded in your scans. Generally, 0.2 is what you want to use.
In the terminal, navigate to the folder you made in step 1 for the statistical analysis. Load up the following modules in this order:
Start rstudio
module load R/3.4.0
module load rstudio
module load R-extras/3.4.0
module load minc-toolkit/1.9.16
module load RMINC/1.5.1.0^minc-toolkit-1.9.16^R-3.4.0
module load rstudio
rstudio
Set the working directory for rstudio to the folder that you made for this analysis.
Write and run a script to model your Jacobians.
The linear mixed effect model (LMER) formula will be different depending on how you want to build your model and what you want to test. "squeak" is the data frame being calling from. Note that you can input a mask here which greatly speeds up the program because it tells R where NOT to compute the model. Parallel processing can speed things up, you can use the snowfall to dedicate more cores.
Degrees of freedom is so messy for LMERs that an entire program is called here to estimate how many there are for the purposes of statistical testing.
FDR or "false discovery rate" corrects for multiple comparisons and the chance of finding significant results when doing the same test over and over and over again. You expect some things to pass by chance. You can see if any voxels pass significance after correction (have different slopes, aka getting bigger or smaller over timepoints on average by group), and at what t-values are necessary to pass at what thresholds you set.
Example below with commentary:
> library(ggplot2)
> library(RMINC)
> library(lme4)
#load necessary libraries
#in a separate R file in the folder where you run your analyses, call it "sge_resources.R", with the following content:
#cluster.functions <- makeClusterFunctionsSGE(system.file("parallel/sge_script.tmpl", package = "RMINC"))
#default.resources <-
# list(nodes = 1,
# cores = 2, ##if the model run fails due to "error in qMincRegistry", then increase the number of cores, save "sge_resources.R", then close and rerun your analysis R file, until it works.
# walltime = "01:00:00")
> options(RMINC_BATCH_CONF = "sge_resources.R")
> squeak<-read.csv(file="jacobian.csv",head=TRUE,sep=",")
#make thing called squeak which summons your csv with jacobians
> mask = "/path/tomask.mnc"
# create an object that is the absolute path of the mask you created running your two-level
# the mask can be found in the nlin directory, which includes nlin3 average and a corresponding mask
> squeak$sex = relevel(squeak$sex, ref = "f")
#set female as default for comparison
> squeak$treat_group = relevel(squeak$treat_group, ref = "sham")
#set sham as default for comparisons
> filtered.squeak = subset(squeak, ! squeak$treat_group == "none")
#subset data to exclude skullhole controls "!" means do not include
> cheese<-mincLmer(reljacob02 ~ treat_group + timepoint + (1|mouse_batch), squeak, mask = mask, parallel = c("sge", 200))
#actual model. Things after + are covariates (sex) and | expressions are random effects to be taken out, reljacob02 is the column header that contained the absolute paths for jacobian files
> cheese = mincLmerEstimateDF(cheese)
#degrees of freedom estimator
> FDR = mincFDR(cheese, mask= mask)
#do FDR correction on cheese
> FDR
#displays FDR corrected stats for your run
This will generate output that looks like this:
Degrees of Freedom: 188.002164322792 188.002164322789 188.002164322792 188.002164322795 188.002164322789
FDR Thresholds:
tvalue-(Intercept) tvalue-genotype2 tvalue-timepoint2 tvalue-timepoint3
0.01 2.801642 NA 2.929002 2.807175
0.05 2.148982 NA 2.259712 2.157496
0.1 1.816631 NA 1.917197 1.825371
0.15 1.599764 NA 1.693466 1.608211
0.2 1.432213 NA 1.520069 1.440741
tvalue-timepoint4
0.01 2.789115
0.05 2.141490
0.1 1.810780
0.15 1.594782
0.2 1.428045
Here are displayed the various components of the formula you constructed with LMER and any t-values that passed FDR at the thresholds listed here. Reference groups set that condition as 0 or baseline. Timepoint 1 was reference in this case, which means the model tested if timepoint 2, 3, and 4 were different. All timepoints tested contained voxels which were significantly different at each threshold, from the less stringent .2 (20% FDR) to the more stringent 0.01 (1% FDR). For instance, at timepoint 3, there were voxels that passed 0.01 FDR. The t-values of these significant voxels will be greater than 2.929002. Positive t-score indicates greater than reference group, negative indicates smaller than reference group. In this case, timepoint 3 scans were significantly larger than timepoint 1 in our case, simply because these younger mice aged 1.5 months in between. NAs indicate no voxels were significantly different at those thresholds.
You can then map the T-values in one of these columns by writing them into a minc file. To to write timepoint 3 to mnc format, we use the mincWriteVolume command.
In this case, the data frame now containing our FDR and t-stat info is now called "cheese". It is followed by a chosen name of the new t-stat mnc file, and what column of t-stat data you'd like to write. The new file is saved in whatever your current working directory is.
> mincWriteVolume(cheese, "myt-statsmap.mnc", column = 'tvalue-timepoint3')
#makes a minc file that puts your t values onto voxels, choose column of tstats from the FDR output
Once you've written your t-values into a minc, you can pull it up in display, mapped on top of the final, average brain.
Hunt for the final average. Go back to your 2-level run folder and look in the "*_secondlevel" folder this time. Then choose the second_level_nlin folder, then second_level-nlin-3.mnc file. Copy it into your R-stats analysis folder where your t-stats map minc file is living.
Navigate terminal back to the R-stats folder and open both files in Display. For example:
> Display second_level-nlin-3.mnc myt-statsmap.mnc
This should conjure your t-stats map on top of your final average.
Keep the FDR output summary handy. Using the t-value cut-off for whatever threshold you choose, you can slide the max and min for the displayed t-values around on the left hand scale of Display. If you want to display t-values at 0.05 FDR cutoff for instance, you'd choose to put the minimum at 2.259712 and max as high as the values actually go. For a more complete guide on changing Display's options to make the map visualizations more visually appealing, see
https://en.wikibooks.org/wiki/File:Display_MINC_toolkit_overlay_images_input.png