-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
smartpca: columns of SNP weight matrix are not orthogonal with ldregress on #42
Comments
I don't use ldregress much, and have to think if there is a bug lurking
here -- but is this behavior
an obstacle for you?
Nick
…On Sun, Feb 2, 2020 at 10:04 AM Kristján Moore (Kris) < ***@***.***> wrote:
While trying to integrate smartpca output with the R package bigsnpr
<privefl/bigsnpr#64>, I've found that when
ldregress mode is on, the snpweightoutname matrix columns are pretty far
from being orthogonal -- the values off the diagonal of the cross product
are much farther from zero than I'd expect from rounding and floating point
errors. (I'm not a real mathsy guy, so I hope I'm using the terminology
correctly and not missing anything too obvious.)
From a run calculating 10 PCs on 521616 SNPs x 12782 samples and ldregress:
200, loading the snpweightoutname file into R:
library(dplyr)
snpweight <- read.table(snpweight_file, stringsAsFactors = F)
snpweight_rescaled <- snpweight %>%
select(-(1:3)) %>%
sapply(function(x) x / sqrt(sum(x ^ 2))
crossprod(snpweight_rescaled)
# V4 V5 V6 V7 V8 # V4 1.0000000000 0.0629886216 -0.0551280009 0.0342755900 -3.184831e-02 ...# V5 0.0629886216 1.0000000000 -0.0754935467 0.0663489089 -3.037471e-02 ...# V6 -0.0551280009 -0.0754935467 1.0000000000 -0.0308700444 1.562871e-02 ...# V7 0.0342755900 0.0663489089 -0.0308700444 1.0000000000 1.967471e-03 ...# V8 -0.0318483056 -0.0303747146 0.0156287070 0.0019674706 1.000000e+00 ...# V9 0.0015981099 -0.0198431477 -0.0225956002 -0.0077647879 -1.527105e-02 ...# V10 0.0029581338 0.0164261311 0.0014516901 -0.0007959342 -3.126125e-05 ...# V11 0.0006888677 -0.0002119814 -0.0070838543 0.0003065091 1.261866e-03 ...# V12 0.0008047758 -0.0009930216 -0.0016761682 0.0006491613 -2.979100e-03 ...# V13 -0.0005483224 0.0015044037 -0.0008509219 0.0047795698 4.093559e-04 ...
But on the exact same run as above except with ldregress off ("vanilla"
mode), the cross product is basically identity, off by about as much as
you'd expect given that the SNP weights are rounded to 3 dp:
vanilla_snpweight <- read.table(vanilla_snpweight_file, stringsAsFactors = F)
vanilla_snpweight_rescaled <- vanilla_snpweight %>%
select(-(1:3)) %>%
sapply(function(x) x / sqrt(sum(x ^ 2))
crossprod(vanilla_snpweight_rescaled)
# V4 V5 V6 V7 V8# V4 1.000000e+00 -4.875293e-07 6.548305e-08 -4.504985e-07 4.295304e-07 ...# V5 -4.875293e-07 1.000000e+00 -1.206785e-06 -2.307409e-07 -1.165748e-06 ...# V6 6.548305e-08 -1.206785e-06 1.000000e+00 -4.188756e-07 -6.649852e-07 ...# V7 -4.504985e-07 -2.307409e-07 -4.188756e-07 1.000000e+00 -4.656665e-07 ...# V8 4.295304e-07 -1.165748e-06 -6.649852e-07 -4.656665e-07 1.000000e+00 ...# V9 -1.534980e-07 -2.369042e-07 -1.517975e-07 2.652259e-07 2.653215e-07 ...# V10 -3.775631e-07 1.127952e-06 -1.817513e-06 2.046219e-07 1.711181e-07 ...# V11 1.740131e-07 9.267356e-08 -2.228536e-07 -1.769993e-06 -4.598688e-07 ...# V12 -3.790950e-07 -1.005894e-07 1.198180e-07 2.694341e-07 3.500083e-07 ...# V13 3.474165e-07 2.574883e-08 4.890360e-07 -4.279933e-07 -3.077147e-07 ...
I also observe this in a different, smaller dataset, and also when I just
set ldregress: 1.
Is this a bug or (more likely) am I just being dumb? Would there be any
way of transforming the SNP weights matrix to one with orthogonal columns?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#42?email_source=notifications&email_token=AEE77B44TLCW4572IEFPM43RA3OGHA5CNFSM4KO2FDUKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IKNHDUA>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEE77BZWQTDYLX6GENDSV23RA3OGHANCNFSM4KO2FDUA>
.
|
Thanks for the extremely prompt response! So the functionality I'm trying to use in bigsnpr (OADP projection) is, according to its creator, dependent on having a SNP weights matrix that has orthogonal (strictly, orthonormal) columns. It isn't the absolute end of the world for me if I can't use this, but smartpca's |
One thing you could try is a Gram-Schmidt orthogonalization on the weight
vectors returned by smartpca.
I'm sure that's easy if you are using R.
N
…On Sun, Feb 2, 2020 at 10:13 AM Kristján Moore (Kris) < ***@***.***> wrote:
Thanks for the extremely prompt response! So the functionality I'm trying
to use in bigsnpr (OADP projection) is, according to its creator, dependent
on having a SNP weights matrix that has orthogonal (strictly, orthonormal)
columns. It isn't the absolute end of the world for me if I can't use this,
but smartpca's ldregress and bigsnpr's OADP shrinkage correction both
seem to work really well independently and I was looking forward to
combining them. I guess I'm just going to have to make allowances and
either use plain old LD pruning or a simpler projection method.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#42?email_source=notifications&email_token=AEE77B3LNDPCDCINC67G3ELRA3PJXA5CNFSM4KO2FDUKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKRZC3Y#issuecomment-581144943>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEE77BYZWTQAV7ZXVS4TRATRA3PJXANCNFSM4KO2FDUA>
.
|
I'll give it a go! Thanks. |
Think I've cracked it -- orthogonalisation wasn't actually necessary, I'd just calculated the singular values for bigsnpr incorrectly from smartpca's eigenvalues. Using the orthogonalised SNP weight matrix does noticeably shear the projected coords though; just from eyeballing the plots it seems like the non-orthogonalised SNP weights gives more accurate results, but I might test this more rigorously. Really appreciate the help! Thanks. |
Just generated some results which illustrate the problem more viscerally. I performed two PCAs on the same set of ~10k samples using Below are "pairs plots" of the first 10 PCs for each of the above. Below the diagonal are hex-bin plots of point density; on diagonal are density plots for that PC; and above the diagonal are computed correlations (I think Pearson's). Notably, the "vanilla" plots show that all PC dimensions have 0 correlation to 3 d.p., while the |
While trying to integrate smartpca output with the R package bigsnpr, I've found that when
ldregress
mode is on, thesnpweightoutname
matrix columns are pretty far from being orthogonal -- the values off the diagonal of the cross product are much farther from zero than I'd expect from rounding and floating point errors. (I'm not a real mathsy guy, so I hope I'm using the terminology correctly and not missing anything too obvious.)From a run calculating 10 PCs on 521616 SNPs x 12782 samples and
ldregress: 200
, loading thesnpweightoutname
file into R:But on the exact same run as above except with
ldregress
off ("vanilla" mode), the cross product is basically identity, off by about as much as you'd expect given that the SNP weights are rounded to 3 dp:I also observe this in a different, smaller dataset, and also when I just set
ldregress: 1
.Is this a bug or (more likely) am I just being dumb? Would there be any way of transforming the SNP weights matrix to one with orthogonal columns?
The text was updated successfully, but these errors were encountered: