Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

smartpca: columns of SNP weight matrix are not orthogonal with ldregress on #42

Open
kristjanmoore opened this issue Feb 2, 2020 · 6 comments

Comments

@kristjanmoore
Copy link

While trying to integrate smartpca output with the R package bigsnpr, I've found that when ldregress mode is on, the snpweightoutname matrix columns are pretty far from being orthogonal -- the values off the diagonal of the cross product are much farther from zero than I'd expect from rounding and floating point errors. (I'm not a real mathsy guy, so I hope I'm using the terminology correctly and not missing anything too obvious.)

From a run calculating 10 PCs on 521616 SNPs x 12782 samples and ldregress: 200, loading the snpweightoutname file into R:

library(dplyr)

snpweight <- read.table(snpweight_file, stringsAsFactors = F)

snpweight_rescaled <- snpweight %>%
    select(-(1:3)) %>%
    sapply(function(x) x / sqrt(sum(x ^ 2))

crossprod(snpweight_rescaled)

#                V4            V5            V6            V7            V8 
# V4   1.0000000000  0.0629886216 -0.0551280009  0.0342755900 -3.184831e-02 ...
# V5   0.0629886216  1.0000000000 -0.0754935467  0.0663489089 -3.037471e-02 ...
# V6  -0.0551280009 -0.0754935467  1.0000000000 -0.0308700444  1.562871e-02 ...
# V7   0.0342755900  0.0663489089 -0.0308700444  1.0000000000  1.967471e-03 ...
# V8  -0.0318483056 -0.0303747146  0.0156287070  0.0019674706  1.000000e+00 ...
# V9   0.0015981099 -0.0198431477 -0.0225956002 -0.0077647879 -1.527105e-02 ...
# V10  0.0029581338  0.0164261311  0.0014516901 -0.0007959342 -3.126125e-05 ...
# V11  0.0006888677 -0.0002119814 -0.0070838543  0.0003065091  1.261866e-03 ...
# V12  0.0008047758 -0.0009930216 -0.0016761682  0.0006491613 -2.979100e-03 ...
# V13 -0.0005483224  0.0015044037 -0.0008509219  0.0047795698  4.093559e-04 ...

But on the exact same run as above except with ldregress off ("vanilla" mode), the cross product is basically identity, off by about as much as you'd expect given that the SNP weights are rounded to 3 dp:

vanilla_snpweight <- read.table(vanilla_snpweight_file, stringsAsFactors = F)

vanilla_snpweight_rescaled <- vanilla_snpweight %>%
    select(-(1:3)) %>%
    sapply(function(x) x / sqrt(sum(x ^ 2))

crossprod(vanilla_snpweight_rescaled)

#                V4            V5            V6            V7            V8
# V4   1.000000e+00 -4.875293e-07  6.548305e-08 -4.504985e-07  4.295304e-07 ...
# V5  -4.875293e-07  1.000000e+00 -1.206785e-06 -2.307409e-07 -1.165748e-06 ...
# V6   6.548305e-08 -1.206785e-06  1.000000e+00 -4.188756e-07 -6.649852e-07 ...
# V7  -4.504985e-07 -2.307409e-07 -4.188756e-07  1.000000e+00 -4.656665e-07 ...
# V8   4.295304e-07 -1.165748e-06 -6.649852e-07 -4.656665e-07  1.000000e+00 ...
# V9  -1.534980e-07 -2.369042e-07 -1.517975e-07  2.652259e-07  2.653215e-07 ...
# V10 -3.775631e-07  1.127952e-06 -1.817513e-06  2.046219e-07  1.711181e-07 ...
# V11  1.740131e-07  9.267356e-08 -2.228536e-07 -1.769993e-06 -4.598688e-07 ...
# V12 -3.790950e-07 -1.005894e-07  1.198180e-07  2.694341e-07  3.500083e-07 ...
# V13  3.474165e-07  2.574883e-08  4.890360e-07 -4.279933e-07 -3.077147e-07 ...

I also observe this in a different, smaller dataset, and also when I just set ldregress: 1.

Is this a bug or (more likely) am I just being dumb? Would there be any way of transforming the SNP weights matrix to one with orthogonal columns?

@bumblenick
Copy link

bumblenick commented Feb 2, 2020 via email

@kristjanmoore
Copy link
Author

Thanks for the extremely prompt response! So the functionality I'm trying to use in bigsnpr (OADP projection) is, according to its creator, dependent on having a SNP weights matrix that has orthogonal (strictly, orthonormal) columns. It isn't the absolute end of the world for me if I can't use this, but smartpca's ldregress and bigsnpr's OADP shrinkage correction both seem to work really well independently and I was looking forward to combining them. I guess I'm just going to have to make allowances and either use plain old LD pruning or a simpler projection method.

@bumblenick
Copy link

bumblenick commented Feb 2, 2020 via email

@kristjanmoore
Copy link
Author

I'll give it a go! Thanks.

@kristjanmoore
Copy link
Author

Think I've cracked it -- orthogonalisation wasn't actually necessary, I'd just calculated the singular values for bigsnpr incorrectly from smartpca's eigenvalues.

Using the orthogonalised SNP weight matrix does noticeably shear the projected coords though; just from eyeballing the plots it seems like the non-orthogonalised SNP weights gives more accurate results, but I might test this more rigorously. Really appreciate the help! Thanks.

@kristjanmoore
Copy link
Author

Just generated some results which illustrate the problem more viscerally.

I performed two PCAs on the same set of ~10k samples using smartpca. In the first, I LD pruned the data and ran it through smartpca without ldregress. In the second, I didn't LD prune and used ldregress instead. Obviously this means there were different numbers of variants in each run so the results aren't entirely equivalent.

Below are "pairs plots" of the first 10 PCs for each of the above. Below the diagonal are hex-bin plots of point density; on diagonal are density plots for that PC; and above the diagonal are computed correlations (I think Pearson's).

No ldregress ("vanilla"):
image

ldregress:
image

Notably, the "vanilla" plots show that all PC dimensions have 0 correlation to 3 d.p., while the ldregress plots show that PC dimensions nearly always show non-zero correlations, which in some cases pass pretty strict significance thresholds, as indicated by the appended asterisks and periods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants