smartpca: columns of SNP weight matrix are not orthogonal with ldregress on #42

kristjanmoore · 2020-02-02T15:04:01Z

While trying to integrate smartpca output with the R package bigsnpr, I've found that when ldregress mode is on, the snpweightoutname matrix columns are pretty far from being orthogonal -- the values off the diagonal of the cross product are much farther from zero than I'd expect from rounding and floating point errors. (I'm not a real mathsy guy, so I hope I'm using the terminology correctly and not missing anything too obvious.)

From a run calculating 10 PCs on 521616 SNPs x 12782 samples and ldregress: 200, loading the snpweightoutname file into R:

library(dplyr)

snpweight <- read.table(snpweight_file, stringsAsFactors = F)

snpweight_rescaled <- snpweight %>%
    select(-(1:3)) %>%
    sapply(function(x) x / sqrt(sum(x ^ 2))

crossprod(snpweight_rescaled)

#                V4            V5            V6            V7            V8 
# V4   1.0000000000  0.0629886216 -0.0551280009  0.0342755900 -3.184831e-02 ...
# V5   0.0629886216  1.0000000000 -0.0754935467  0.0663489089 -3.037471e-02 ...
# V6  -0.0551280009 -0.0754935467  1.0000000000 -0.0308700444  1.562871e-02 ...
# V7   0.0342755900  0.0663489089 -0.0308700444  1.0000000000  1.967471e-03 ...
# V8  -0.0318483056 -0.0303747146  0.0156287070  0.0019674706  1.000000e+00 ...
# V9   0.0015981099 -0.0198431477 -0.0225956002 -0.0077647879 -1.527105e-02 ...
# V10  0.0029581338  0.0164261311  0.0014516901 -0.0007959342 -3.126125e-05 ...
# V11  0.0006888677 -0.0002119814 -0.0070838543  0.0003065091  1.261866e-03 ...
# V12  0.0008047758 -0.0009930216 -0.0016761682  0.0006491613 -2.979100e-03 ...
# V13 -0.0005483224  0.0015044037 -0.0008509219  0.0047795698  4.093559e-04 ...

But on the exact same run as above except with ldregress off ("vanilla" mode), the cross product is basically identity, off by about as much as you'd expect given that the SNP weights are rounded to 3 dp:

vanilla_snpweight <- read.table(vanilla_snpweight_file, stringsAsFactors = F)

vanilla_snpweight_rescaled <- vanilla_snpweight %>%
    select(-(1:3)) %>%
    sapply(function(x) x / sqrt(sum(x ^ 2))

crossprod(vanilla_snpweight_rescaled)

#                V4            V5            V6            V7            V8
# V4   1.000000e+00 -4.875293e-07  6.548305e-08 -4.504985e-07  4.295304e-07 ...
# V5  -4.875293e-07  1.000000e+00 -1.206785e-06 -2.307409e-07 -1.165748e-06 ...
# V6   6.548305e-08 -1.206785e-06  1.000000e+00 -4.188756e-07 -6.649852e-07 ...
# V7  -4.504985e-07 -2.307409e-07 -4.188756e-07  1.000000e+00 -4.656665e-07 ...
# V8   4.295304e-07 -1.165748e-06 -6.649852e-07 -4.656665e-07  1.000000e+00 ...
# V9  -1.534980e-07 -2.369042e-07 -1.517975e-07  2.652259e-07  2.653215e-07 ...
# V10 -3.775631e-07  1.127952e-06 -1.817513e-06  2.046219e-07  1.711181e-07 ...
# V11  1.740131e-07  9.267356e-08 -2.228536e-07 -1.769993e-06 -4.598688e-07 ...
# V12 -3.790950e-07 -1.005894e-07  1.198180e-07  2.694341e-07  3.500083e-07 ...
# V13  3.474165e-07  2.574883e-08  4.890360e-07 -4.279933e-07 -3.077147e-07 ...

I also observe this in a different, smaller dataset, and also when I just set ldregress: 1.

Is this a bug or (more likely) am I just being dumb? Would there be any way of transforming the SNP weights matrix to one with orthogonal columns?

The text was updated successfully, but these errors were encountered:

bumblenick · 2020-02-02T15:07:20Z

I don't use ldregress much, and have to think if there is a bug lurking here -- but is this behavior an obstacle for you? Nick

…

On Sun, Feb 2, 2020 at 10:04 AM Kristján Moore (Kris) < ***@***.***> wrote: While trying to integrate smartpca output with the R package bigsnpr <privefl/bigsnpr#64>, I've found that when ldregress mode is on, the snpweightoutname matrix columns are pretty far from being orthogonal -- the values off the diagonal of the cross product are much farther from zero than I'd expect from rounding and floating point errors. (I'm not a real mathsy guy, so I hope I'm using the terminology correctly and not missing anything too obvious.) From a run calculating 10 PCs on 521616 SNPs x 12782 samples and ldregress: 200, loading the snpweightoutname file into R: library(dplyr) snpweight <- read.table(snpweight_file, stringsAsFactors = F) snpweight_rescaled <- snpweight %>% select(-(1:3)) %>% sapply(function(x) x / sqrt(sum(x ^ 2)) crossprod(snpweight_rescaled) # V4 V5 V6 V7 V8 # V4 1.0000000000 0.0629886216 -0.0551280009 0.0342755900 -3.184831e-02 ...# V5 0.0629886216 1.0000000000 -0.0754935467 0.0663489089 -3.037471e-02 ...# V6 -0.0551280009 -0.0754935467 1.0000000000 -0.0308700444 1.562871e-02 ...# V7 0.0342755900 0.0663489089 -0.0308700444 1.0000000000 1.967471e-03 ...# V8 -0.0318483056 -0.0303747146 0.0156287070 0.0019674706 1.000000e+00 ...# V9 0.0015981099 -0.0198431477 -0.0225956002 -0.0077647879 -1.527105e-02 ...# V10 0.0029581338 0.0164261311 0.0014516901 -0.0007959342 -3.126125e-05 ...# V11 0.0006888677 -0.0002119814 -0.0070838543 0.0003065091 1.261866e-03 ...# V12 0.0008047758 -0.0009930216 -0.0016761682 0.0006491613 -2.979100e-03 ...# V13 -0.0005483224 0.0015044037 -0.0008509219 0.0047795698 4.093559e-04 ... But on the exact same run as above except with ldregress off ("vanilla" mode), the cross product is basically identity, off by about as much as you'd expect given that the SNP weights are rounded to 3 dp: vanilla_snpweight <- read.table(vanilla_snpweight_file, stringsAsFactors = F) vanilla_snpweight_rescaled <- vanilla_snpweight %>% select(-(1:3)) %>% sapply(function(x) x / sqrt(sum(x ^ 2)) crossprod(vanilla_snpweight_rescaled) # V4 V5 V6 V7 V8# V4 1.000000e+00 -4.875293e-07 6.548305e-08 -4.504985e-07 4.295304e-07 ...# V5 -4.875293e-07 1.000000e+00 -1.206785e-06 -2.307409e-07 -1.165748e-06 ...# V6 6.548305e-08 -1.206785e-06 1.000000e+00 -4.188756e-07 -6.649852e-07 ...# V7 -4.504985e-07 -2.307409e-07 -4.188756e-07 1.000000e+00 -4.656665e-07 ...# V8 4.295304e-07 -1.165748e-06 -6.649852e-07 -4.656665e-07 1.000000e+00 ...# V9 -1.534980e-07 -2.369042e-07 -1.517975e-07 2.652259e-07 2.653215e-07 ...# V10 -3.775631e-07 1.127952e-06 -1.817513e-06 2.046219e-07 1.711181e-07 ...# V11 1.740131e-07 9.267356e-08 -2.228536e-07 -1.769993e-06 -4.598688e-07 ...# V12 -3.790950e-07 -1.005894e-07 1.198180e-07 2.694341e-07 3.500083e-07 ...# V13 3.474165e-07 2.574883e-08 4.890360e-07 -4.279933e-07 -3.077147e-07 ... I also observe this in a different, smaller dataset, and also when I just set ldregress: 1. Is this a bug or (more likely) am I just being dumb? Would there be any way of transforming the SNP weights matrix to one with orthogonal columns? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#42?email_source=notifications&email_token=AEE77B44TLCW4572IEFPM43RA3OGHA5CNFSM4KO2FDUKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IKNHDUA>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEE77BZWQTDYLX6GENDSV23RA3OGHANCNFSM4KO2FDUA> .

kristjanmoore · 2020-02-02T15:13:31Z

Thanks for the extremely prompt response! So the functionality I'm trying to use in bigsnpr (OADP projection) is, according to its creator, dependent on having a SNP weights matrix that has orthogonal (strictly, orthonormal) columns. It isn't the absolute end of the world for me if I can't use this, but smartpca's ldregress and bigsnpr's OADP shrinkage correction both seem to work really well independently and I was looking forward to combining them. I guess I'm just going to have to make allowances and either use plain old LD pruning or a simpler projection method.

bumblenick · 2020-02-02T15:20:13Z

One thing you could try is a Gram-Schmidt orthogonalization on the weight vectors returned by smartpca. I'm sure that's easy if you are using R. N

…

On Sun, Feb 2, 2020 at 10:13 AM Kristján Moore (Kris) < ***@***.***> wrote: Thanks for the extremely prompt response! So the functionality I'm trying to use in bigsnpr (OADP projection) is, according to its creator, dependent on having a SNP weights matrix that has orthogonal (strictly, orthonormal) columns. It isn't the absolute end of the world for me if I can't use this, but smartpca's ldregress and bigsnpr's OADP shrinkage correction both seem to work really well independently and I was looking forward to combining them. I guess I'm just going to have to make allowances and either use plain old LD pruning or a simpler projection method. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#42?email_source=notifications&email_token=AEE77B3LNDPCDCINC67G3ELRA3PJXA5CNFSM4KO2FDUKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKRZC3Y#issuecomment-581144943>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEE77BYZWTQAV7ZXVS4TRATRA3PJXANCNFSM4KO2FDUA> .

kristjanmoore · 2020-02-02T15:21:34Z

I'll give it a go! Thanks.

kristjanmoore · 2020-02-02T16:06:46Z

Think I've cracked it -- orthogonalisation wasn't actually necessary, I'd just calculated the singular values for bigsnpr incorrectly from smartpca's eigenvalues.

Using the orthogonalised SNP weight matrix does noticeably shear the projected coords though; just from eyeballing the plots it seems like the non-orthogonalised SNP weights gives more accurate results, but I might test this more rigorously. Really appreciate the help! Thanks.

kristjanmoore · 2020-10-12T15:04:33Z

Just generated some results which illustrate the problem more viscerally.

I performed two PCAs on the same set of ~10k samples using smartpca. In the first, I LD pruned the data and ran it through smartpca without ldregress. In the second, I didn't LD prune and used ldregress instead. Obviously this means there were different numbers of variants in each run so the results aren't entirely equivalent.

Below are "pairs plots" of the first 10 PCs for each of the above. Below the diagonal are hex-bin plots of point density; on diagonal are density plots for that PC; and above the diagonal are computed correlations (I think Pearson's).

No ldregress ("vanilla"):

ldregress:

Notably, the "vanilla" plots show that all PC dimensions have 0 correlation to 3 d.p., while the ldregress plots show that PC dimensions nearly always show non-zero correlations, which in some cases pass pretty strict significance thresholds, as indicated by the appended asterisks and periods.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

smartpca: columns of SNP weight matrix are not orthogonal with ldregress on #42

smartpca: columns of SNP weight matrix are not orthogonal with ldregress on #42

kristjanmoore commented Feb 2, 2020

bumblenick commented Feb 2, 2020 via email

kristjanmoore commented Feb 2, 2020

bumblenick commented Feb 2, 2020 via email

kristjanmoore commented Feb 2, 2020

kristjanmoore commented Feb 2, 2020

kristjanmoore commented Oct 12, 2020

smartpca: columns of SNP weight matrix are not orthogonal with ldregress on #42

smartpca: columns of SNP weight matrix are not orthogonal with ldregress on #42

Comments

kristjanmoore commented Feb 2, 2020

bumblenick commented Feb 2, 2020 via email

kristjanmoore commented Feb 2, 2020

bumblenick commented Feb 2, 2020 via email

kristjanmoore commented Feb 2, 2020

kristjanmoore commented Feb 2, 2020

kristjanmoore commented Oct 12, 2020