-
Notifications
You must be signed in to change notification settings - Fork 6
Running PLS in Python: pyls
Partial least squares (PLS) analysis is a technique that allows you to relate two sets of variables to each other. If you have data about B brain variables (ex: vertex-wise surface area of the striatum, often referred to as the set X) and J behavioural variables (ex: results from your subjects performing cognitive tasks, often referred to as the set Y) in S subjects, your PLS will allow you to figure out which linear combinations of behavioural variable measures have the strongest impact on which linear combinations of brain variables. Mathematically, you are performing a singular value decomposition of the covariance matrix between two sets of features (X and Y). The statistical significance of these latent variables (LV) is assessed through permutation testing: the rows in the brain matrix are shuffled and the singular value of the latent variable is calculated for each permutation. This creates a null distribution and allows for performance of a null-hypothesis test. Similarly, the significance of the contribution of each brain or behaviour variable to a specific LV is calculated using bootstrap resampling. This time, samples are taken with replacement from the available set of participants and the contribution of each variable to the LV is reassessed. The idea behind this is that if the contribution of a brain or behaviour variable to a LV is highly dependent on which participants are included in the sample, the contribution of the variable is unreliable. This reliability is measured as a bootstrap ratio (BSR), which is calculated by multiplying the weight of the variable in the LV by the singular value of that LV and dividing by the standard error. See references below for a more in-depth understanding of the theoretical framework of a PLS analysis.
Ross Markello built a package that allows you to run a partial least squares correlation (PLSC) or regression (PLSR) in Python (instead of Matlab): pyls (https://github.com/rmarkello/pyls). In the following document, we will run you through (1) the creation of a virtual environment in which to run your analysis; (2) tips on how to preprocess your data; (3) how to run the PLS analysis using Ross Markello’s package; (4) how to understand the package outputs; (5) a few ideas on what to do about sex and age; and (6) some ideas on how to visualize your data. You can read about this package’s Installation, Setup, Purpose and Usage on his repository: https://github.com/rmarkello/pyls. It also has a very complete User Guide: https://pyls.readthedocs.io/en/latest/usage.html.
When preparing your analysis, it is critical to figure out whether or not you should be running a PLSC or a PLSR. One important distinction is that, as its name implies, a PLSR assumes that there is a directionality to your variables: indeed, as in any regression analysis, you are trying to predict linear combinations of your dependent variables (from the set Y) from linear combinations of your independent variables (from the set X). From a neuroscience perspective, this would entail making a design choice as to whether or not your brain variables cause changes in your behavioural variables, or vice-versa. A PLSC assumes no such directionality. Mathematically, PLSC performs one single SVD on your covariance matrix between X and Y. PLSR, on the other hand, performs as many recursive SVDs on your recursively residualized covariance matrix as you have components (Krishnan et al. 2011).
Note on PLSR: According to Ross Markello, it uses the SIMPLS algorithm, as opposed to the NIPALS algorithm. The SIMPLS algorithm is the one used by Matlab so the results obtained here should be identical (or nearly identical) to the ones obtained by running the same algorithm in Matlab.
When you remotely connect to a server and load Python, you have access to whatever software packages are available to the default version of your server’s Python. A virtual environment allows you to set up something that you interact with as though it were another computer, hosted on your original server. A virtual environment is a directory that you populate with all of the Python software packages that you need in order to be able to run a specific analysis. When you “activate” your virtual environment, you are making all of the software packages in it available to you (but you do not have access to the default packages of your host server). When you “deactivate” your virtual environment, you only have access to the default Python packages of your host server. This allows you to preserve the correct versions of the packages that you used when writing your data analysis scripts, and it also ensures that you have access to all of these packages’ dependencies. Why run this remotely on Niagara when you can just use your laptop? Depending on the size of your dataset and the number of features under examination, runtimes can become increasingly long.
1. Load python (intel speeds up matrix operations):
module load NiaEnv/2019b intelpython3
2. Make a directory in which you will be storing all of your virtual environments (you can name it anything; we are naming it virtualenvs in this example):
mkdir ~/.virtualenvs
3. Go into that directory:
cd ~/.virtualenvs
4. Initialize your new virtual environment (you can name it anything; we are naming it PLS_env in this example); note that the --system-site-packages option allows you to get all of the optimized Python packages that are already on Niagara and allows you to avoid most of the pip installations that can fill up your $HOME:
virtualenv --system-site-packages ~/.virtualenvs/PLS_venv
5. Screen is a very useful tool: it allows you to open a new terminal session, start running something, and “detach” the screen such that your script keeps running even if you decide to log off the remote server. Open a new screen:
screen
# to detach your screen, press: ctrl+A+D; to exit screen, type exit + enter
# to enter a previously opened screen, type “screen -list”, then “screen -r screen_of_interest_ID”
6. Activate your virtual environment (note that to deactivate your venv, you just need to type “deactivate” into your command line):
source ~/.virtualenvs/PLS_venv/bin/activate
7. Install the packages that you need (the instructions on how to install the Ross Markello’s pyls
package come from here: https://github.com/rmarkello/pyls)
cd PLS_venv/lib/python3.6/site-packages/
git clone https://github.com/rmarkello/pyls.git
cd pyls
python setup.py install
NOTE: if you are using Python 3.6.8 (provided by the intelpython3/2019u4
module), you will additionally need to install the joblib package to take advantage of pyls
' parallel processing. This will speed up permutation testing and any resampling operations, and is especially important if you plan to run split-half resampling.
cd ~/.virtualenvs/PLS_venv/lib/python3.6/site-packages/
pip install joblib
8. Once you have preprocessed your data and prepared a python script that runs a PLS (see step 2 below), prepare a list of jobs to submit to Niagara (joblist.txt). For this example, we will assume that your Python script does not take any command-line inputs and that your joblist.txt has the following format:
python PLS_script.py
9. your commands to run the PLS analysis
screen # if you currently do not have a screen open already
source ~/.virtualenvs/PLS_venv/bin/activate
module load gnu-parallel qbatch
qbatch -c1 --walltime=4:00:00 joblist.txt # the walltime can be changed; the shorter it is, the quicker Niagara will submit your job as they boost short jobs in the cues; max walltime is 24:00:00
1. The steps are pretty similar, but you use the conda instead of pip (CIC does not have virtual environment package; conda environments has many advantages over virtual environments anyway)
module load anaconda/miniconda3
conda info --envs # take a look at what virtual environments you already have, and where they are located
conda create --name PLS_venv python=3.8.3 # once again, PLS_venv is just an arbitrary name choice
source activate PLS_venv
2. One disadvantage is that you have to install all of the packages you will need from scratch. Thus, any packages that you are importing in your python script should be installed via conda:
conda install numpy
conda install pandas
conda install scipy
conda install matplotlib
conda install statsmodels
3. Install the PLS packages that you need (the instructions on how to install the Ross Markello’s pyls
package come from here: https://github.com/rmarkello/pyls):
conda info --envs # obtain the /path/to/PLS_venv
cd /path/to/PLS_venv/lib/python3.8/site-packages
git clone https://github.com/rmarkello/pyls.git
cd pyls
python setup.py install
Additionally install the joblib
package to enable parallelization within pyls
: (you can check first whether joblib
is already installed by running import joblib
within a python instance)
cd ..
pip install joblib
4. Running PLS: once you have prepared your PLS script (see step 2 below), run the following:
python PLS_script.py
It may be useful for you to learn how to modify your Python script to take command-line inputs. For instance, if ever you want to run through a few iterations of the same PLS script (PLS_script.py), with one parameter modified for each iteration (ex: num_boot; see below), you can adapt your Python script to take command line inputs:
import sys
num_boot = str(sys.argv[1])
This would allow you to create a joblist (joblist.txt, see above) to submit to Niagara where each line can take command-line inputs. For instance, if your joblist.txt went as follows:
python PLS_script.py 5
python PLS_script.py 1000
python PLS_script.py 5000
python PLS_script.py 10000
This would mean that your first PLS run would bootstrap your data 5 times; your 2nd PLS run would bootstrap your data 1000 times; your 3rd PLS run would bootstrap your data 5000 times; and your last PLS run would bootstrap your data 10000 times. You would have to make sure to adjust the name of your output_dir (see below) to also use num_boot in its name, not to create the same directory 4 times, or your script would crash:
output_dir = f"/path/to/directory/where/you/save/PLS/outputs_{num_boot}
# NOT: output_dir = "/path/to/directory/where/you/save/PLS/outputs”)
It is important to be aware that this portion of your analysis will be specific to your data. We are just including several tricks that helped us.
The inputs to pyls consist of two matrices:
- Matrix X: Rows correspond to participants, columns correspond to morphometric measures at specific vertices;
- Matrix Y: Rows correspond to participants, columns correspond to behavioural variables of interest.
Note that although the ordering of your X and Y variables does not matter in the context of a PLSC, it is highly recommended to make sure that matrix X is your set that contains the most variables (usually, the set of brain variables) due to the way that the package computes the bootstrapped ratios and bootstrapped confidence intervals.
Neither the column or the row names in either matrix need to be labeled.
If your brain data is vertex-wise, you will need to exclude the medial wall vertices from the X matrix so that they are not considered in PLS correlations. This wiki includes instructions for doing this, as well as for adding back medial wall indices following statistical analysis so that the final stats data are in the correct dimensions to be visualized on a cortical surface (e.g., using create_civet_image.sh).
Every value in the matrix should be numerical: INF and NaN values cannot be handled.
The recommended way to handle NaN values is via imputation by using singular value decomposition (SVD), described here. One issue that arises here is that the SVDmiss function is written in R. If you are looking for a challenge, you can take a look at the source code here and translate it to Python. If you are short on time, you can do something less elegant but just as effective, on the CIC computers:
Y_zscore = (Y-Y.mean())/Y.std() # z-score your dataframe
Y_zscore.to_csv("Y_zScored_with_nas.csv") # save to .csv
# load the right packages by interacting directly with your shell:
! module load minc-toolkit/1.9.17
! module load anaconda/5.1.0-python3
! module load minc-stuffs/0.1.24^minc-toolkit-1.9.17
! module load R
! Rscript SVD_imputation.r # what code to include in SVD_imputation.r can be found in the link above
An alternative would be to drop your columns with two many NaN values (up to you to determine a “threshold” above which you discard a whole column), in dataframe Y:
remove_cols=[]
for i in range(len(Y.columns)):
if (Y.iloc[:,i].isnull().sum() > threshold):
remove_cols.append(Y.columns[i])
Y = Y.drop(columns=remove_cols)
The laziest method would just be to replace your NaN values with the median value of the column they are:
for col in Y.columns:
Y[col].fillna(Y[col].median(), inplace = True)
Here is an example of how you can replace all of your positive and negative inf values with NaN, before handling the NaN values as shown above:
Y = Y.replace([np.inf, -np.inf], np.nan)
Read this to learn more on how to handle missing values.
Be careful with strings representing numbers, they must be converted either into an int or a float. Here are a couple examples of how to do that in the Y data frame:
Y = Y.replace('0',0)
Y = Y.replace(True,1)
Here is how to make sure all of your data is numerical:
for col in Y.columns:
Y[col] = Y[col].astype(float)
Note that for pyls to run, there needs to be a minimum amount of variance within the values of a variable. You can also remove your columns with 0 or very little standard deviation (up to you to set a threshold):
Y = Y.loc[:, Y.std() > threshold]
Make sure that your final X and Y matrices that you will be using for your PLS are numpy arrays (as opposed to pandas dataframes). Although the pyls
should be able to accept both numpy arrays and panda data frames, numpy arrays are a lot less computationally demanding and this may save you a lot of processing time.
The default setting for both is 5000 (i.e. your data gets bootstrapped 5000 times to compute the BSR, and it gets permuted 5000 times to compute a p-value). When testing your script to see whether or not it works, it is advised to set n_boot=1 and n_perm=1 so that it runs faster and you make sure nothing is crashing. Then, we recommend either using n_boot=5000 and n_perm=5000 or n_boot=10000 and n_perm=10000.
In order to run PLS, the following script should be executed from within the pyls directory. Note that we wrote our script such that it saved all of the pyls
object outputs. However, we later found out that the package already does this. This means that you could either use our script below or read the following documentation:
- Saving the object: https://pyls.readthedocs.io/en/latest/generated/pyls.save_results.html#pyls.save_results
- Loading the object: https://pyls.readthedocs.io/en/latest/generated/pyls.load_results.html#pyls.load_results
from pyls import behavioral_pls
import numpy as np
import os
# Import the two matrices
X = numpy_array_with_matrix_X
Y = numpy_array_with_matrix_Y
# Create a directory in which to save the outputs
output_dir = “/path/to/directory/where/you/save/PLS/outputs”
os.makedirs(output_dir, exist_ok = True)
# Run plsc
bpls = behavioral_pls(X,Y, n_perm = desired_number_of_permutations, n_boot = desired_number_of_bootstrap_resamplings, seed = 123, n_proc = number_of_desired_processors)
# n_proc can be set to ‘max’ on niagara (make sure that joblib is installed)
# setting the seed is optional, but highly recommended to ensure your results are reproducible (since the permutation testing and bootstrap resampling are random, your p-values and CIs may be slightly different if you need to rerun the analysis)
np.savetxt(f'{output_dir}/x_weights.csv', bpls['x_weights'], delimiter=',') # p x m
np.savetxt(f'{output_dir}/y_weights.csv', bpls['y_weights'], delimiter=',') # m x m
np.savetxt(f'{output_dir}/x_scores.csv', bpls['x_scores'], delimiter=',') # n x m
np.savetxt(f'{output_dir}/y_scores.csv', bpls['y_scores'], delimiter=',') # n x m
np.savetxt(f'{output_dir}/y_loadings.csv', bpls['y_loadings'], delimiter=',') # m x m
np.savetxt(f'{output_dir}/sinvals.csv', bpls['singvals'], delimiter=',') # (m,)
np.savetxt(f'{output_dir}/varexp.csv', bpls['varexp'], delimiter=',') # (m,)
np.savetxt(f'{output_dir}/permres_pvals.csv', bpls['permres']['pvals'], delimiter=',') # shape: (m,)
np.savetxt(f'{output_dir}/permres_permsamples.csv', bpls['permres']['permsamples'], delimiter=',')
np.savetxt(f'{output_dir}/bootres_x_weights_normed.csv', bpls['bootres']['x_weights_normed'], delimiter=',')
np.savetxt(f'{output_dir}/bootres_x_weights_stderr.csv', bpls['bootres']['x_weights_stderr'], delimiter=',')
np.savetxt(f'{output_dir}/bootres_bootsamples.csv', bpls['bootres']['bootsamples'], delimiter=',')
np.savetxt(f'{output_dir}/bootres_y_loadings.csv', bpls['bootres']['y_loadings'], delimiter=',')
# to save your bootstrapped samples (for the very, very cautious)
os.mkdir(f'{output_dir}/y_loadings_boot')
for i in range(bpls['bootres']['y_loadings_boot'].shape[0]):
np.savetxt(f'{output_dir}/y_loadings_boot/bootres_y_loadings_boot_{i}.csv', bpls['bootres']['y_loadings_boot'][i], delimiter=',')
# saving your confidence intervals
os.mkdir(f'{output_dir}/y_loadings_ci')
for i in range(bpls['bootres']['y_loadings_ci'].shape[0]):
np.savetxt(f'{output_dir}/y_loadings_ci/bootres_y_loadings_ci_behaviour_{i}.csv', bpls['bootres']['y_loadings_ci'][i], delimiter=',')
np.savetxt(f'{output_dir}/cvres_pearson_r.csv', bpls['cvres']['pearson_r'], delimiter=',')
np.savetxt(f'{output_dir}/cvres_r_squared.csv', bpls['cvres']['r_squared'], delimiter=',')
# Saving your input information (makes the code longer, but quite valuable when trouble shooting, or if you've done several runs and want to keep track of the parameters of each):
np.savetxt(f'{output_dir}/inputs_X.csv', bpls['inputs']['X'], delimiter=',')
np.savetxt(f'{output_dir}/inputs_Y.csv', bpls['inputs']['Y'], delimiter=',')
f=open(f'{output_dir}/input_info.txt','w')
f.write(f"Groups: {bpls['inputs']['groups']}\n") # [n]
f.write(f"Num of conditions: {bpls['inputs']['n_cond']}\n") # 1
f.write(f"Num of permutations: {bpls['inputs']['n_perm']}\n") # perm
f.write(f"Bootstrapping: {bpls['inputs']['n_boot']}\n") # boot
f.write(f"Bootstrapping: {bpls['inputs']['test_split']}\n") # default=100
f.write(f"Bootstrapping: {bpls['inputs']['test_size']}\n") # default=0.25
f.write(f"Bootstrapping: {bpls['inputs']['covariance']}\n") # default=False
f.write(f"Rotations: {bpls['inputs']['rotate']}\n") # True
f.write(f"Confidence Intervals: {bpls['inputs']['ci']}\n") # default=95
f.write(f"Verbose: {bpls['inputs']['verbose']}\n") # True
f.close()
You can optionally implement split-half resampling during the permutation testing to assess the reliability of each observed latent variable (or, more specifically, the reliability of the associations captured by each LV). This is useful if you are interested in answering the question: would the same behavioural pattern still link with a similar brain pattern if a different [sub]set of subjects were chosen? Below is a diagram of how the split-half resampling works, from Kovacevic et al., 2013 (see this paper for more information on the method).
To turn on split-half resampling, add the parameter n_split = 200
(or other desired number of resamples) to your behavioral_pls()
command. To save the results, add the following lines to your PLS script:
os.makedirs(f'{out_dir}/splitres/', exist_ok=True)
np.savetxt(f'{out_dir}/splitres/splitres_u-vcorr.csv', np.column_stack((bpls['splitres']['ucorr'], bpls['splitres']['vcorr'])), delimiter=',')
np.savetxt(f'{out_dir}/splitres/splitres_u-vcorr_pvals.csv', np.column_stack((bpls['splitres']['ucorr_pvals'], bpls['splitres']['vcorr_pvals'])), delimiter=',')
np.savetxt(f'{out_dir}/splitres/splitres_ucorr_lo-uplim.csv', np.column_stack((bpls['splitres']['ucorr_lolim'], bpls['splitres']['ucorr_uplim'])),delimiter=',')
np.savetxt(f'{out_dir}/splitres/splitres_vcorr_lo-uplim.csv', np.column_stack((bpls['splitres']['vcorr_lolim'], bpls['splitres']['vcorr_uplim'])),delimiter=',')
Running split-half resampling will significantly increase the computation time. e.g., For a standard PLS run where the brain matrix contains vertex-wise measurements of a single metric (e.g., cortical thickness) for 250 subjects, and you have specified 200 split-halves, you may need to request -w 15:00:00
when submitting the PLS job to Niagara using qbatch
, vs. -w 00:15:00
when split-half resampling is turned off (assuming the joblib
Python package is available in your environment).
More details about each split-half resampling result in the bpls['splitres']
dictionary can be found in the corresponding pyls documentation page.
To determine if an LV is reliable, you can use the following output criteria:
-
splitres_u-vcorr_pvals.csv
(L latent variables x 2): the p values of the split-half correlations of both the left (pUCorr, first col) and right (pVcorr, second col) singular vectors are < 0.05
from pyls import pls_regression
import numpy as np
import os
# Import the two matrices
X = numpy_array_with_matrix_X
Y = numpy_array_with_matrix_Y
# Create a directory in which to save the outputs
output_dir = “/path/to/directory/where/you/save/PLS/outputs”
os.makedirs(output_dir, exist_ok = True)
# Run plsr
plsr = pls_regression(X,Y, n_perm = desired_number_of_permutations, n_boot = desired_number_of_bootstrap_resamplings, n_proc = number_of_desired_processors)
# n_proc can be set to ‘max’ on niagara
# Save all outputs
np.savetxt(f“{output_dir}/x_weights.csv", plsr['x_weights'], delimiter=',')
np.savetxt(f“{output_dir}/x_scores.csv", plsr['x_scores'], delimiter=',')
np.savetxt(f“{output_dir}/y_scores.csv", plsr['y_scores'], delimiter=',')
np.savetxt(f“{output_dir}/y_loadings.csv", plsr['y_loadings'], delimiter=',')
np.savetxt(f“{output_dir}/varexp.csv", plsr['varexp'], delimiter=',')
np.savetxt(f“{output_dir}/permres_pvals.csv", plsr['permres']['pvals'], delimiter=',')
np.savetxt(f“{output_dir}/permres_permsamples.csv", plsr['permres']['permsamples'], delimiter=',')
np.savetxt(f“{output_dir}/bootres_x_weights_normed.csv", plsr['bootres']['x_weights_normed'], delimiter=',')
np.savetxt(f“{output_dir}/bootres_x_weights_stderr.csv", plsr['bootres']['x_weights_stderr'], delimiter=',')
np.savetxt(f“{output_dir}/bootres_bootsamples.csv", plsr['bootres']['bootsamples'], delimiter=',')
np.savetxt(f“{output_dir}/bootres_y_loadings.csv", plsr['bootres']['y_loadings'], delimiter=',')
np.savetxt(f“{output_dir}/cvres_pearson_r.csv', plsr['cvres']['pearson_r'], delimiter=',')
np.savetxt(f“{output_dir}/cvres_r_squared.csv', plsr['cvres']['r_squared'], delimiter=',')
os.mkdir(f“{output_dir}/y_loadings_ci/")
for i in range(plsr['bootres']['y_loadings_ci'].shape[0]):
np.savetxt(f'{output_dir}/y_loadings_ci/bootres_y_loadings_ci_{i}.csv', plsr['bootres']['y_loadings_ci'][i], delimiter=',')
# to save your bootstrapped samples (for the very, very cautious)
os.mkdir(f“{output_dir}/y_loadings_boot")
for i in range(plsr['bootres']['y_loadings_boot'].shape[0]):
np.savetxt(f“{output_dir}/y_loadings_boot/bootres_y_loadings_boot_{i}.csv", plsr['bootres']['y_loadings_boot'][i], delimiter=',')
# Saving your input information:
np.savetxt(f'{output_dir}/inputs_X.csv', bpls['inputs']['X'], delimiter=',')
np.savetxt(f'{output_dir}/inputs_Y.csv', bpls['inputs']['Y'], delimiter=',')
f=open(f'{output_dir}/input_info.txt','w')
f.write(f"Groups: {bpls['inputs']['groups']}\n") # [n]
f.write(f"Num of conditions: {bpls['inputs']['n_cond']}\n") # 1
f.write(f"Num of permutations: {bpls['inputs']['n_perm']}\n") # perm
f.write(f"Bootstrapping: {bpls['inputs']['n_boot']}\n") # boot
f.write(f"Bootstrapping: {bpls['inputs']['test_split']}\n") # default=100
f.write(f"Bootstrapping: {bpls['inputs']['test_size']}\n") # default=0.25
f.write(f"Bootstrapping: {bpls['inputs']['covariance']}\n") # default=False
f.write(f"Rotations: {bpls['inputs']['rotate']}\n") # True
f.write(f"Confidence Intervals: {bpls['inputs']['ci']}\n") # default=95
f.write(f"Verbose: {bpls['inputs']['verbose']}\n") # True
f.close()
- a x b = a rows, b columns
- S: number of subjects
- L: number of components
- J: number of behavioural features (T?)
- B: number of brain features
- boot: number of bootstraps
- perm: number of permutations
-
bootres_bootsamples.csv (S x boot): how to reorder the rows of inputs_X.csv and inputs_Y.csv, in order to conduct bootstrap resampling
- Rows = subject (each subject is assigned a “number”), columns = new ordering of the subjects for the “permuted” inputs_X.csv and inputs_Y.csv matrices
-
bootres_x_weights_normed.csv (B x L): Bootstrap ratio. PLSR: bootres_x_weights_normed= x_weights/bootres_x_weights_stderr. PLSC: bootres_x_weights_normed= (x_weights * singvals)/bootres_x_weights_stderr
- Rows = vertices, columns = LVs
- This is the file that should be used in create_civet_image.sh or create_morpho_image.sh
- These values will have to be thresholded according to the unit normal distribution. A BSR threshold of 3.29 corresponds to p < 0.001, 2.57 to p < 0.01 and 1.95 to p < 0.05.
-
bootres_x_weights_stderr.csv (B x L): Allows for computation of the BSR
- Rows = vertices, Columns = LVs
-
plsr[bootres][y_loadings_boot][ci] (J x L x boot):
- Each .csv file is a behavioural variable
- In each .csv file, each row = LV, col 1 = lower bound of CI, col 2 = upper bound of CI
- Use this to add the error bars to the plot with y_loadings
-
plsr[bootres][y_loadings_boot] (J x L x boot):
- Each .csv file is a behavioural variable
- In each .csv file, each row = LV, each col = boot (ie value of the component found at that instance of the bootstrap resampling)
-
cvres_pearson_r.csv (J/L x 100):Results of cross-validation testing
- Cross-validation procedure: assuming (1) S subjects, (2) J behaviours, (3) a S x J Y-matrix, and (4) that you keep the default values (ie test_size = 0.25, test_split = 100), you rerun your PLS on 75% of the data, then use the obtained weightings to predict the behavioural variables [Y matrix] of the test data points (the 25% of your data that was left out). You then look at the Pearson correlation and r_squared values between each column (i.e. behaviour) of Y_predicted and Y_observed. You repeat this 100 times.
-
cvres_r_squared.csv (J/L x 100): Results of cross-validation testing
- See above
-
inputs_X.csv (S x B): Copy of the input matrix X
- Rows = subjects, Columns = vertices
-
inputs_Y.csv (S x J): Copy of the input matrix X
- Rows = subjects, Columns = behavioural variables
-
permres_permsamples.csv (S x perm): how to reorder the rows of inputs_X.csv, in order to conduct the permutation test that will yield a p-value for each LV
- Rows = subject (each subject is assigned a “number”), columns = new ordering of the subjects for the “permuted” inputs_X.csv matrix, leaving Y unchanged (see Krishan et al. 2011)
-
permres_pvals.csv (i x 1): p-value for each latent variable
- Rows = LVs
-
singvals.csv (L x 1): Singular values used for to calculate the BSR in PLSC and for significance testing, not available in PLSR
- Rows = LVs
-
varexp.csv (L x 1): variance explained by each latent variable
- Rows = LVs
-
x_scores.csv (S x L): Projection of brain pattern onto individual data (“brain scores”)
- Rows = subject, columns = LVs
-
y_scores.csv (S x L): Projection of behaviour pattern onto individual data (“behaviour scores”)
- Rows = subject, columns = LVs
-
x_weights.csv (B x L): Used to compute the BSR
- Rows = vertices, columns = LVs
-
y_loadings.csv (J x L): Contribution of each behavioural variable to each latent variable (sometimes referred to as “cross-loadings”), what you are going to be looking at when visualizing the contribution of each behavioural variable to each LV.
- Rows = behavioural variable, Columns = LVs
-
y_weights.csv (J x L): actual values of the right singular vectors from the singular value decomposition, not available in PLSR
- Rows = behavioural variable, Columns = LVs
In a PLS analysis, one needs to carefully think about how to handle sex and age variables; if you simply include them in your set of behavioural variables, sex and age may become the main variables driving your components. As a result, your analysis does not give you much more information about how the rest of your behavioural variables relate to your brain variables.
Should you then residualize all of your brain (X) and behaviour (Y) variables for sex and age, before running your PLS analysis? Although linear models are pretty robust with regard to treating sex and age variables, you may want to touch your data as little as possible before running it through a PLS.
One clever solution suggested by Dr. Bratislav Misic was to hold sex and age out of your analysis, then see how sex and age correlate with your subject-specific brain scores and behaviour scores (see sections 6.5 and 6.6 below).
There are many ways to visualize PLS outputs; the scripts below just include a few suggestions. Feel free to share figures of your own visualizations in section 6.7.
Comment last two lines if not using split-half resampling
import pandas as pd
import numpy as np
input_dir=/path/to/PLS/outputs # directory in which your PLS outputs have been saved
output_dir=/path/to/visualization # directory in which you will be saving your PLS visualizations
df = pd.read_csv(/path/to/behavioural_data.csv) # data frame where each row is a subject, each column is a behavioural variable
inputs_X = pd.read_csv(f"{input_dir}/inputs_X.csv", header=None)
inputs_Y = pd.read_csv(f"{input_dir}/inputs_Y.csv", header=None)
print("inputs_X shape: ",inputs_X.shape)
print("inputs_Y shape: ",inputs_Y.shape)
x_scores = pd.read_csv(f"{input_dir}/x_scores.csv", header=None)
y_scores = pd.read_csv(f"{input_dir}/y_scores.csv", header=None)
print("x_scores shape: ",x_scores.shape)
print("y_scores shape: ",y_scores.shape)
bootres_bootsamples = pd.read_csv(f"{input_dir}/bootres_bootsamples.csv", header=None)
print("bootres_bootsamples shape: ",bootres_bootsamples.shape)
bootres_x_weights_normed = pd.read_csv(f"{input_dir}/bootres_x_weights_normed.csv", header=None)
print("bootres_x_weights_normed shape: ",bootres_x_weights_normed.shape)
bootres_x_weights_stderr = pd.read_csv(f"{input_dir}/bootres_x_weights_stderr.csv", header=None)
print("bootres_x_weights_stderr shape: ",bootres_x_weights_stderr.shape)
permres_permsamples = pd.read_csv(f"{input_dir}/permres_permsamples.csv", header=None)
print("permres_permsamples shape: ",permres_permsamples.shape)
permres_pvals= pd.read_csv(f"{input_dir}/permres_pvals.csv", header=None)
print("permres_pvals shape: ",permres_pvals.shape)
varexp = pd.read_csv(f"{input_dir}/varexp.csv", header=None)
print("varexp shape: ",varexp.shape)
x_weights= pd.read_csv(f"{input_dir}/x_weights.csv", header=None)
print("x_weights shape: ",x_weights.shape)
y_loadings= pd.read_csv(f"{input_dir}/y_loadings.csv", header=None)
print("y_loadings shape: ",y_loadings.shape)
splitres_u_vcorr_pvals = pd.read_csv(f"{input_dir}/splitres/splitres_u-vcorr_pvals.csv", header=None)
print("splitres_u_vcorr_pvals shape: ",splitres_u_vcorr_pvals.shape)
Select the p-value threshold above which you do not keep a LV:
pval=0.05
Create the directory in which you will store all of the analysis outputs:
import os
os.mkdir(f'{output_dir}')
This will make the downstream visualization of the BSR on 3D brain surfaces a lot easier.
About the example below: This example is built to allow you to understand how to interpret your BSR outputs when you have included several brain structures in one single analysis. You will need to adapt this part of the script to whether or not you used more than one brain structure and to the number of brain variables you have. Here, we are using the vertex-wise surface area values of the striatum, thalamus and globs pallidus.
- We know that the striatum surface has 13000 vertices, the thalamus surface has 6500 and the globus pallidus surface has 3000 vertices.
- We know that the inputs_X matrix rows were ordered in the following way: left striatum, right striatum, left thalamus, right thalamus, left globus pallidus and right globus pallidus.
- Therefore, rows 0 to 12 999 (inclusively) of the inputs_X martix are vertices of the left striatum, rows 13000 to 25 999 of inputs_X matrix are vertices from the right striatum, etc.
- Since for bootres_x_weights_normed, each row is a vertex and each column is a LV, we use the same principle here:
- rows 0 to 12 999 (inclusively) of the bootres_x_weights_normed matrix are part of the left striatum;
- rows 13000 to 25999 (inclusively) of the bootres_x_weights_normed matrix are part of the right striatum;
- rows 26000 to 32499 (inclusively) of the bootres_x_weights_normed matrix are part of the left thalamus;
- rows 32500 to 38999 (inclusively) of the bootres_x_weights_normed matrix are part of the right thalamus;
- rows 39000 to 41999 (inclusively) of the bootres_x_weights_normed matrix are part of the left globus pallidus;
- rows 42000 to 44999 (inclusively) of the bootres_x_weights_normed matrix are part of the right globus pallidus.
os.mkdir(f'{output_dir}/brain_loadings/')
for i in range(permres_pvals.shape[0]): # i ranges from 0 to (num_LVs – 1)
if (permres_pvals.loc[i][0] <= pval):
os.mkdir(f'{output_dir}/brain_loadings/LV_{i+1}')
f=open(f'{output_dir}/info_LV_{i+1}.txt','w')
# num rows: inclusive
str_left = bootres_x_weights_normed.loc[0:12999,i]
str_right = bootres_x_weights_normed.loc[13000:25999,i]
thal_left = bootres_x_weights_normed.loc[26000:32499,i]
thal_right = bootres_x_weights_normed.loc[32500:38999,i]
gp_left = bootres_x_weights_normed.loc[39000:41999,i]
gp_right = bootres_x_weights_normed.loc[42000:44999,i]
str_left.to_csv(f'{output_dir}/brain_loadings/LV_{i+1}/str_left.csv',header=['BSR_ratio'])
str_right.to_csv(f'{output_dir}/brain_loadings/LV_{i+1}/str_right.csv',header=['BSR_ratio'])
thal_left.to_csv(f'{output_dir}/brain_loadings/LV_{i+1}/thal_left.csv',header=['BSR_ratio'])
thal_right.to_csv(f'{output_dir}/brain_loadings/LV_{i+1}/thal_right.csv',header=['BSR_ratio'])
gp_left.to_csv(f'{output_dir}/brain_loadings/LV_{i+1}/gp_left.csv',header=['BSR_ratio'])
gp_right.to_csv(f'{output_dir}/brain_loadings/LV_{i+1}/gp_right.csv',header=['BSR_ratio'])
f.write(f'LV {i+1} varexp = {varexp.loc[i,0]}\n')
f.write(f'LV {i+1} pval = {permres_pvals.loc[i,0]}\n')
f.close()
import math
import matplotlib.pyplot as plt
os.mkdir(f"{output_dir}/behavioural_loadings/")
behav = ordered_list_of_behavioural_variable_names
for i in range(permres_pvals.shape.lod[0]): # i ranges from 0 to (num_LVs – 1)
if (permres_pvals[i][0] <= pval):
y_load = y_loadings[:,i]
x_error = []
for j in range(len(behav)):
y_loadings_ci = np.loadtxt(f"{input_dir}/y_loadings_ci/bootres_y_loadings_ci_{j}.csv",delimiter=",")
pair = y_loadings_ci[i]
entry = [math.fabs(float((y_load[j]-pair[0]))),math.fabs(float((y_load[j]-pair[1])))]
x_error.append(entry)
# SORTING LOADINGS IN ASCENDING ORDER BY:
x_error_new = [x_error[i] for i in new_indices]
colors_new = [colors[i] for i in new_indices]
behav_new = [behav[i] for i in new_indices]
y_load.sort()
x_error_new = np.array(x_error_new)
x_error_new = x_error_new.T
# COLOURING IT SUCH THAT ONLY THE BEHAVIOURS WITH LOWER & UPPER CIs IN THE SAME DIRECTION APPEAR IN RED:
colors=[]
for k in range(x_error_new.shape[1]):
a=y_load[k]-x_error_new[0,k]
b=y_load[k]+x_error_new[1,k]
c=a*b
if (c >= 0):
colors.append("red")
elif (c < 0):
colors.append("blue")
x_error = np.array(x_error)
x_error = x_error.T
fig, ax = plt.subplots(figsize=(15, 10),dpi=300)
ax.barh(y_pos, y_load,xerr=x_error_new, align='center',colors=colors)
ax.set_yticks(y_pos)
ax.set_yticklabels(behav_new)
ax.invert_yaxis()
ax.set_xlabel('Loadings')
ax.set_title(f"LV{i+1} Behavioural Loadings")
plt.tight_layout()
plt.savefig(f"{output_dir}/behavioural_loadings/LV_{i+1}",dpi=300)
plt.show()
p-value by % covariance explained
fig, ax1 = plt.subplots()
LV = list(range(1,df.shape[1]+1))
color = 'tab:red'
ax1.set_xlabel('latent variable')
ax1.set_ylabel('% covariance', color=color)
ax1.scatter(LV, varexp*100, color=color)
ax1.tick_params(axis='y', labelcolor=color)
ax1.tick_params()
ax2 = ax1.twinx() # instantiate a second axes that shares the same x-axis
color = 'tab:blue'
ax2.set_ylabel('p-value', color=color) # we already handled the x-label with ax1
ax2.scatter(LV, permres_pvals, color=color)
ax2.tick_params(axis='y', labelcolor=color)
fig.tight_layout() # otherwise the right y-label is slightly clipped
ax1.set_title("p-value and covariance explained by each latent variable")
fig.savefig(f"{output_dir}/pval_by_covexp_per_LV.png",bbox_inches = "tight", dpi=1000)
split-half resampling p-values
import math
import matplotlib.pyplot as plt
# P-values and covariance explained by each latent variable
fig, ax1 = plt.subplots()
LV = list(range(1,df.shape[1]+1))
LV_sp1=list()
LV_sp2=list()
LV_sp1[:] = [i-0.1 for i in LV]
LV_sp2[:] = [i+0.1 for i in LV]
color = 'tab:red'
ax1.set_xlabel('latent variable')
ax1.set_ylabel('P Ucorr', color=color)
ax1.scatter(LV_sp1, np.minimum(splitres_u_vcorr_pvals[0], 0.199), color=color)
ax1.tick_params(axis='y', labelcolor=color)
ax1.tick_params()
ax1.set_ylim([0, 0.20])
ax2 = ax1.twinx() # instantiate a second axes that shares the same x-axis
color = 'tab:blue'
ax2.set_ylabel('P Vcorr', color=color) # we already handled the x-label with ax1
ax2.scatter(LV_sp2, np.minimum(splitres_u_vcorr_pvals[1], 0.199), color=color)
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim([0, 0.20])
plt.axhline(y=0.05, color="tab:blue")
fig.tight_layout() # otherwise the right y-label is slightly clipped
ax1.set_title("Split-half resampling p-values")
fig.savefig(f"{output_dir}/splitres_u_vcorr_pvals.png",bbox_inches = "tight", dpi=1000)
import statsmodels.api as sm # import statsmodels
import pandas as pd
X = df["Sex"] # or however else the sex variable is called in your df
X = sm.add_constant(X)
os.mkdir(f"{output_dir}/lm_X_sex")
Brain scores:
f=open(f'{output_dir}/lm_X_sex/pearson_r_brain.txt','w')
for i in range(permres_pvals.shape[0]): # i ranges from 0 to (num_LVs – 1)
if (permres_pvals.loc[i][0] <= pval):
behavioural[f"LV_brain_{i+1}"] = x_scores[i]
y = behavioural[f"LV_brain_{i+1}"]
model = sm.OLS(y, X).fit() ## sm.OLS(output, input) #(y, X)
a = model.summary()
b = a.as_csv()
text_file = open(f"{output_dir}/lm_X_sex/lm_ybrain_Xsex_LV{i+1}.csv", "wt")
n = text_file.write(b)
text_file.close()
cc = behavioural[["Gender",f"LV_brain_{i+1}"]].corr()
f.write(f"Pearson R between sex and LV {i+1} brain score: {cc.iloc[0,1]}\n")
f.close()
Behaviour scores:
f=open(f'{output_dir}/lm_X_sex/pearson_r_behav.txt','w')
for i in range(permres_pvals.shape[0]): # i ranges from 0 to (num_LVs – 1)
if (permres_pvals.loc[i][0] <= pval):
behavioural[f"LV_behav_{i+1}"] = y_scores[i]
y = behavioural[f"LV_behav_{i+1}"]
model = sm.OLS(y, X).fit() ## sm.OLS(output, input) #(y, X)
a = model.summary()
b = a.as_csv()
text_file = open(f"{output_dir}/lm_X_sex/lm_ybehav_Xsex_LV{i+1}.csv", "wt")
n = text_file.write(b)
text_file.close()
cc = behavioural[["Gender",f"LV_behav_{i+1}"]].corr()
f.write(f"Pearson R between sex and LV {i+1} behav score: {cc.iloc[0,1]}\n")
f.close()
X = df["age"] # or however else the age variable is called in your df
os.mkdir(f"{output_dir}/lm_X_age")
Brain scores:
f=open(f'{output_dir}/lm_X_age/pearson_r_brain.txt','w')
for i in range(permres_pvals.shape[0]): # i ranges from 0 to (num_LVs – 1)
if (permres_pvals.loc[i][0] <= pval):
behavioural[f"LV_brain_{i+1}"] = x_scores[i]
y = behavioural[f"LV_brain_{i+1}"]
model = sm.OLS(y, X).fit() ## sm.OLS(output, input) #(y, X)
a = model.summary()
b = a.as_csv()
text_file = open(f"{output_dir}/lm_X_age/lm_ybrain_Xage_LV{i+1}.csv", "wt")
n = text_file.write(b)
text_file.close()
cc = behavioural[["Age_in_Yrs",f"LV_brain_{i+1}"]].corr()
f.write(f"Pearson R between age and LV {i+1} brain score: {cc.iloc[0,1]}\n")
f.close()
Behaviour scores:
f=open(f'{output_dir}/lm_X_age/pearson_r_behav.txt','w')
for i in range(permres_pvals.shape[0]): # i ranges from 0 to (num_LVs – 1)
if (permres_pvals.loc[i][0] <= pval):
behavioural[f"LV_behav_{i+1}"] = y_scores[i]
y = behavioural[f"LV_behav_{i+1}"]
model = sm.OLS(y, X).fit() ## sm.OLS(output, input) #(y, X)
a = model.summary()
b = a.as_csv()
text_file = open(f"{output_dir}/lm_X_age/lm_ybehav_Xage_LV{i+1}.csv", "wt")
n = text_file.write(b)
text_file.close()
cc = behavioural[["Age_in_Yrs",f"LV_behav_{i+1}"]].corr()
f.write(f"Pearson R between age and LV {i+1} behav score: {cc.iloc[0,1]}\n")
f.close()
The following figure illustrates the first significant LV of a brain-behaviour PLS correlation analysis of chronically stress and control male mice. A) Behaviour weight for each behavioural variable included in the analysis. Size of the bar is estimated through singular value decomposition, confidence intervals are calculated by bootstrapping. Variables whose lines that cross the zero line should not be considered to significantly weight onto the LV. B) Covariance explained (left Y axis) and permutation p-values (right Y axis) for all 14 LVs in the analysis (For LV1: 45.8% covariance explained, p=.002). C) Individual mouse brain and behaviour score, color coded by group with a trend line per group. D) Brain loaded bootstrapping ratios of deformation patterns overlaid on population average, warm colors (yellow-orange) indicate larger volume whereas cool colors (blue) indicate smaller volumes.
Hansen, J. Y., Markello, R. D., Vogel, J. W., Seidlitz, J., Bzdok, D., & Misic, B. (2020). Molecular signatures of cognition and affect. In bioRxiv (p. 2020.07.16.203026). https://doi.org/10.1101/2020.07.16.203026
Krishnan, A., Williams, L. J., McIntosh, A. R., & Abdi, H. (2011). Partial Least Squares (PLS) methods for neuroimaging: a tutorial and review. NeuroImage, 56(2), 455–475.
McIntosh, A. R., & Lobaugh, N. J. (2004). Partial least squares analysis of neuroimaging data: applications and advances. NeuroImage, 23 Suppl 1, S250–S263.
Zeighami, Y., Fereshtehnejad, S.-M., Dadar, M., Collins, D. L., Postuma, R. B., Mišić, B., & Dagher, A. (2019). A clinical-anatomical signature of Parkinson’s disease identified with partial least squares and magnetic resonance imaging. NeuroImage, 190, 69–78.