Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strange behaviour on a test example #11

Open
mcol opened this issue May 28, 2015 · 11 comments
Open

strange behaviour on a test example #11

mcol opened this issue May 28, 2015 · 11 comments

Comments

@mcol
Copy link
Collaborator

mcol commented May 28, 2015

In bd46c4e I've committed a small test example to understand why we always seem to get equiprobable posteriors (0.333 0.333 0.333). Varying the flanking_snps option to 6 or more (maximum allowed is 10 here), produces those posteriors. For 5 or less, we start seeing different values.

To complicate matters, the same example produces different results according to the machine I'm using. Building with the debug_posterior set to true, on one machine I get this output:

CPU_POSTERIOR: 0 0 0.444444 0.444444 0.111111
CPU_POSTERIOR: 1 0 0.694444 0.277778 0.0277778

and on another I get

CPU_POSTERIOR: 0 0 0.333333 0.333333 0.333333
CPU_POSTERIOR: 1 0 0.333333 0.333333 0.333333

Could you have a try to see what's going on?

@mcol
Copy link
Collaborator Author

mcol commented May 28, 2015

Looking at the code, it seems that a glf should only contain the genotype likelihoods without the first 6 columns of snp information. Is that correct?

Removing those, now both machines get the same results, with posteriors set to (0.333 0.333 0.333) for all snps (independently of the number of flanking snps), which is what we see also on real data. We would expect that, since the genotype likelihoods are all zero, then the posteriors should reflect the prior (haplotype frequencies in the reference panel), and those are not equiprobable. It's almost as if the reference panel is actually not considered, or something else is going on.

@mcol
Copy link
Collaborator Author

mcol commented May 29, 2015

Ok, I've now fixed the glf file, but as I said, the result still doesn't look correct. This should be considered as a small test case for what the results we are getting on real data, so until we understand this, we are stuck.

@mcol
Copy link
Collaborator Author

mcol commented May 29, 2015

The problem is related to the fact that I put 0s in the glf file: the reason for this is that the genotype likelihoods are missing. What is the correct way of specifying missingness?

@gchen98
Copy link
Owner

gchen98 commented May 29, 2015

I apologize. The variable naming was a mess. It really should have been called g_log_penetrance, not g_snp_penetrance. Hence, for the GLF input, penetrance for a missing genotype in probability space is .33 .33 .33 and the log penetrance is -1.09 -1.09 -1.09. Probabilities always add to one. So you can either edit the code to read in things in probability space or log space. Right now it assumes input is in log space.

@gchen98
Copy link
Owner

gchen98 commented May 29, 2015

Here is in indication of what I am getting at:

[garyc@ssb202q-1 src]$ find . -name '.cpp'|xargs grep g_snp_penetrance
./init/denovo_glf.cpp: io_manager->read_input(g_snp_penetrance,this->g_people,this->g_snps);
./init/guide_glf.cpp: io_manager->read_input(ref_haplotype,g_snp_penetrance,this->informative_snp,this->g_people,this->g_snps, this->ref_haplotypes);
./init/guide_glf_opencl.cpp: writeToBuffer(buffer_snp_penetrance,g_people_geno_dim_g_snps,g_snp_penetrance,"buffer_snp_penetrance");
./init/denovo_glf_opencl.cpp: writeToBuffer(buffer_snp_penetrance,g_people_geno_dim_g_snps,g_snp_penetrance,"buffer_snp_penetrance");
./penetrance/denovo_glf.cpp: logpenetrance_new[parent] = g_snp_penetrance[i
(geno_dim*
./penetrance/denovo_glf.cpp: logpenetrance_old[parent] = g_snp_penetrance[i_(geno_dim_
./penetrance/denovo_glf.cpp: float logpenetrance_new = g_snp_penetrance[i_(geno_dim_
./penetrance/denovo_glf.cpp: logpenetrance_old = g_snp_penetrance[i_(geno_dim_
./penetrance/guide_glf.cpp: logpenetrance[parent]+=g_snp_penetrance[i*
./penetrance/guide_glf.cpp: g_snp_penetrance[i*
./penetrance/guide_glf.cpp: logpenetrance+=g_snp_penetrance[i_(geno_dim_g_snps)+geno_dim*(g_left_marker+extended_snp_mapping[l]) + n];
[garyc@ssb202q-1 src]$

@gchen98
Copy link
Owner

gchen98 commented May 29, 2015

Be sure the input matrix of the GLF file is of dimension TOTAL_SNPS rows and TOTAL_PERSONS * 3 columns. Thanks.

@mcol
Copy link
Collaborator Author

mcol commented Jun 1, 2015

Even if I replace all the 0s in the glf file with -1.09, I still get that all posteriors are set to .333. If instead I replace them with a positive value (say 1, or 10), I get more reasonable looking values. So something's not quite right with the handling of missingness.

@gchen98
Copy link
Owner

gchen98 commented Jun 1, 2015

I just went through the code. I forgot that it was probabilities after all, and not log-prob. That makes it easier at least for the user. Also, the method will need to have some information that resembles the template haplotypes. i.e. some of the SNPs should at least match the true template haplotype pair's genotypes. With .3 for everything, I see that I get equal probability for any given template haplotype pair. Can you provide an updated common.30.glf with more informative genotype likelihoods?

@mcol
Copy link
Collaborator Author

mcol commented Jun 1, 2015

Ok, but if the SNPs are missing, shouldn't the posterior reflect the haplotype frequencies? Those are not equiprobable, so I think that the solution you are getting is still incorrect.

@gchen98
Copy link
Owner

gchen98 commented Jun 1, 2015

That's strange. I just tested with a new GLF file and these are excerpts
from my input and output:

[garyc@ssb202q-1 tests]$ head uninf.30.glf POSTERIORS

==> uninf.30.glf <==
.33 .33 .33
.33 .33 .33
.33 .33 .33
.33 .33 .33
.33 .33 .33
.33 .33 .33
.33 .33 .33
.33 .33 .33
.33 .33 .33
.33 .33 .33

==> POSTERIORS <==
0 0.444444 0.444444 0.111111
1 0.694444 0.277778 0.0277778
2 0.183673 0.489796 0.326531
3 0.734694 0.244898 0.0204082
4 0.734694 0.244898 0.0204082
5 0.734694 0.244898 0.0204082
6 0.25 0.5 0.25
7 1 1e-10 1e-10
8 1 1e-10 1e-10
9 0.25 0.5 0.25
[garyc@ssb202q-1 tests]$

On 06/01/2015 02:39 PM, Marco Colombo wrote:

Ok, but if the SNPs are missing, shouldn't the posterior reflect the
haplotype frequencies? Those are not equiprobable, so I think that the
solution you are getting is still incorrect.


Reply to this email directly or view it on GitHub
#11 (comment).

@mcol
Copy link
Collaborator Author

mcol commented Jun 2, 2015

Yes, I get the same values. I've fixed the example to have .333 everywhere instead of 0s (2e6ac8a).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants