---
title: "Week 4 Work"
author: "Eric Weine"
date: "3/24/2022"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r libraries, include=FALSE}
set.seed(1919)
'%>%' <- dplyr::'%>%'
library(ggplot2)
library(rstan)
```
## Revisiting MASH
After finding the output of MASH very strange, I wanted to revisit the code used to produce it. I noticed a bug in the original code: I was not feeding in the data matrices correctly. The fixed code is shown below:
```{r, eval=FALSE}
#' Create 2x2 covariance matrices with optional levels of amplification
#'
#' The function assumes that we have two groups, with the standard deviation of
#' the effect size in one group being 1, and the standard deviation of the effect
#' size in the other group being \code{1 * amp_coef} if \code{amp} is set to
#' \code{TRUE}
#'
#' @param desired_corr Desired level of correlation. Must be in [-1, 1].
#' @param amp_coef Coefficient of amplification, as described above.
#' @param amp Boolean indicating if any amplification should take place
#' @param amp_hs Boolean indicating if amplification should take place in the hs
#' group or in the c group. Only used if \code{amp} is set to \code{TRUE}.
#'
#' @return 2x2 covariance matrix
#' @export
#'
#' @examples
make_amp_cov_mat <- function(
desired_corr, amp_coef = 1, amp = TRUE, amp_hs = TRUE
) {
if (amp_hs && amp) {
ctrl_sd <- 1
hs_sd <- ctrl_sd * amp_coef
} else if(!amp_hs && amp) {
hs_sd <- 1
ctrl_sd <- hs_sd * amp_coef
} else {
hs_sd <- 1
ctrl_sd <- 1
}
# derive covariance from correlation and sds
cov_hs_ctrl <- desired_corr * hs_sd * ctrl_sd
cov_mat <- matrix(
data = c(ctrl_sd ^ 2, cov_hs_ctrl, cov_hs_ctrl, hs_sd ^ 2),
nrow = 2,
byrow = TRUE,
dimnames = list(
rows = c("ctrl", "hs"), cols = c("ctrl", "hs")
)
)
return(cov_mat)
}
# read in 50% of data and select relevant columns
summary_table <- read.delim('data/SummaryTable_allsites_12Nov20.txt')
summary_table_samp <- summary_table %>%
dplyr::sample_frac(.5) %>%
dplyr::select(c(site, pval_CTRL, pval_HS, coef_CTRL, coef_HS, sig_cat))
# replace 0 p-values with small numbers
summary_table_samp <- summary_table_samp %>%
dplyr::mutate(
pval_CTRL = pmax(.00000000001, pval_CTRL),
pval_HS = pmax(.00000000001, pval_HS)
)
# construct std error estimates from coefficients and p-values
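# If the reported p-values come from two-sided z-tests, then
# p = 2 * (1 - pnorm(|coef| / se)), which inverts to
# se = |coef| / qnorm(1 - p / 2) = |coef| / qnorm((2 - p) / 2).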
summary_table_samp <- summary_table_samp %>%
dplyr::mutate(
std_error_ctrl = abs(coef_CTRL) / qnorm((2 - pval_CTRL) / 2),
std_error_hs = abs(coef_HS) / qnorm((2 - pval_HS) / 2)
)
reg_fx_mat <- t(matrix(
data = c(summary_table_samp$coef_CTRL, summary_table_samp$coef_HS),
nrow = 2,
byrow = TRUE
))
colnames(reg_fx_mat) <- c("ctrl", "hs")
reg_se_mat <- t(matrix(
data = c(summary_table_samp$std_error_ctrl, summary_table_samp$std_error_hs),
nrow = 2,
byrow = TRUE
))
colnames(reg_se_mat) <- c("ctrl", "hs")
mash_data <- mashr::mash_set_data(reg_fx_mat, reg_se_mat)
# Now, we want to construct covariance matrices to feed into mash
cov_mat_list <- list()
cov_mat_list[['no_effect']] <- matrix(
data = rep(0, 4), nrow = 2, dimnames = list(
rows = c("ctrl", "hs"), cols = c("ctrl", "hs")
)
)
cov_mat_list[['hs_spec']] <- matrix(
data = c(0, 0, 0, 1), nrow = 2, byrow = TRUE, dimnames = list(
rows = c("ctrl", "hs"), cols = c("ctrl", "hs")
)
)
cov_mat_list[['ctrl_spec']] <- matrix(
data = c(1, 0, 0, 0), nrow = 2, byrow = TRUE, dimnames = list(
rows = c("ctrl", "hs"), cols = c("ctrl", "hs")
)
)
desired_corrs <- seq(from = -1, to = 1, by = .25)
desired_amp <- c(3, 2, 1.5)
for(corr in desired_corrs) {
cov_mat_list[[glue::glue('equal_corr_{corr}')]] <- make_amp_cov_mat(
desired_corr = corr, amp = FALSE
)
for(cond in c("hs", "ctrl")) {
for(amp in desired_amp) {
cov_mat_list[[glue::glue('{cond}_amp_{amp}_corr_{corr}')]] <- make_amp_cov_mat(
desired_corr = corr, amp_hs = (cond == "hs"), amp_coef = amp
)
}
}
}
mash_out <- mashr::mash(
data = mash_data,
Ulist = cov_mat_list,
algorithm.version = "Rcpp",
outputlevel = 1
)
cov_mat_ests <- mashr::get_estimated_pi(mash_out)
```
```{r, include=FALSE}
cov_mat_ests <- readr::read_rds(
"~/Documents/academic/drosophila_longevity/drosophila-longevity/rds_data/cov_mat_ests_half_v2.rds"
)
```
The fitted weights on the covariance matrices are shown below:
```{r}
print(cov_mat_ests[cov_mat_ests > 1e-4])
```
While the weight on the null covariance matrix is still surprisingly low, the other estimates make much more sense. Most signals are estimated to be equal between the two environments, but around 22% of signals are suggested to have a higher-magnitude effect in the high sugar environment than in the control environment. This matches what I saw in week 1 when I looked at the density plots of the regression coefficients.
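As a sanity check on that figure, the total weight on the HS-amplified matrices can be summed directly (this assumes the naming scheme used when building `cov_mat_list` above):
```{r}
# total weight on matrices where the hs effect is amplified
sum(cov_mat_ests[grepl("^hs_amp", names(cov_mat_ests))])
```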
Still unsatisfied with the weight on the null matrix, I wanted to try mash's "nullbiased" prior option. This option penalizes the likelihood in favor of the null. As far as I can tell it acts on the mixture proportions, up-weighting the null component, rather than penalizing the fitted effects themselves (e.g., via an L1 norm), but I cannot find detailed documentation on this, so it may be worth checking into.
```{r, eval=FALSE}
mash_out_nb <- mashr::mash(
data = mash_data,
Ulist = cov_mat_list,
algorithm.version = "Rcpp",
outputlevel = 1,
prior = "nullbiased"
)
cov_mat_ests_nb <- mashr::get_estimated_pi(mash_out_nb)
```
```{r, include=FALSE}
cov_mat_ests_nb <- readr::read_rds(
"~/Documents/academic/drosophila_longevity/drosophila-longevity/rds_data/cov_mat_ests_half_nullb_v2.rds"
)
```
The fitted weights on the covariance matrices are shown below:
```{r}
print(cov_mat_ests_nb[cov_mat_ests_nb > 1e-4])
```
Strangely, the nullbiased results are exactly the same as the standard results. I'm guessing that, with this much data, any penalty is completely overwhelmed by the likelihood, but I'm not completely sure.
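A quick check that the two weight vectors really are identical, rather than merely similar at the printed precision:
```{r}
all.equal(cov_mat_ests, cov_mat_ests_nb)
```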
Another aspect of MASH that I wanted to look at was the lfsr estimates of the fitted effects. If the lfsr is high for many of the fitted effects, this would indicate that even though the weight on the null matrix is very low, there are many SNPs for which mash believes the effect may have the opposite sign (which in reality could indicate that they are null or very close to null). To avoid running out of RAM, I had to fit the model with only a quarter of the data.
```{r, eval=FALSE}
summary_table_samp <- summary_table_samp %>%
dplyr::sample_frac(.5)
reg_fx_mat <- t(matrix(
data = c(summary_table_samp$coef_CTRL, summary_table_samp$coef_HS),
nrow = 2,
byrow = TRUE
))
colnames(reg_fx_mat) <- c("ctrl", "hs")
reg_se_mat <- t(matrix(
data = c(summary_table_samp$std_error_ctrl, summary_table_samp$std_error_hs),
nrow = 2,
byrow = TRUE
))
colnames(reg_se_mat) <- c("ctrl", "hs")
mash_data <- mashr::mash_set_data(reg_fx_mat, reg_se_mat)
mash_out <- mashr::mash(
data = mash_data,
Ulist = cov_mat_list,
algorithm.version = "Rcpp",
outputlevel = 2,
prior = "nullbiased"
)
lfsr_ests <- ashr::get_lfsr(mash_out)
```
```{r, include=FALSE}
lfsr_ests <- readr::read_rds(
"~/Documents/academic/drosophila_longevity/drosophila-longevity/rds_data/lfsr_ests_quarter_nullb_v2.rds"
)
```
The densities of the lfsr estimates for the CTRL and HS groups are shown below. While the two plots look nearly identical, the lfsr estimates from mash are not exactly equal between the two groups.
```{r}
plot(
density(lfsr_ests[,'ctrl'], from = 0, to = 0.5),
main = "Density of CTRL lfsr",
ylim = c(0, 4.5)
)
plot(
density(lfsr_ests[,'hs'], from = 0, to = 0.5),
main = "Density of HS lfsr",
ylim = c(0, 4.5)
)
```
While the mode of these distributions is relatively close to 0, there are quite a few effects with very high lfsrs. It seems that many of the signals we would expect to be null are instead being fit by mash as non-null effects with high lfsrs.
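To put a rough number on this, we can look at the share of effects with lfsr above some cutoff (0.2 here is an arbitrary choice):
```{r}
# fraction of effects with high lfsr in each condition
colMeans(lfsr_ests > 0.2)
```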
### Sampling LD Blocks and Assessing Uncertainty in MASH
One concern with fitting MASH to genetic data is that the level of correlation between SNPs is very high. One potential solution is to divide the genome into separate LD blocks, sample one SNP from each block, and then fit mash. Originally, Arbel suggested dividing the data into megabase (1 Mb) blocks. The code to do this is shown below:
```{r, eval=FALSE}
summary_table_ld <- read.delim('data/SummaryTable_allsites_12Nov20.txt') %>%
dplyr::select(c(site))
sites_df <- data.frame(stringr::str_split_fixed(summary_table_ld$site, ":", 2))
colnames(sites_df) <- c("chromosome", "site_id")
sites_df <- sites_df %>%
dplyr::mutate(site_id = as.numeric(site_id))
# split into blocks of a certain length
split_into_LD_blocks <- function(df, block_length) {
block_range <- seq(from = min(df$site_id), to = max(df$site_id), by = block_length)
df %>%
dplyr::mutate(block_id = plyr::laply(site_id, function(x) sum(x > block_range)))
}
# group by chromosome and then split into blocks
sites_df <- sites_df %>%
dplyr::group_by(chromosome) %>%
dplyr::group_modify(~ split_into_LD_blocks(.x, 1e6))
total_blocks <- sites_df %>% dplyr::distinct(chromosome, block_id) %>% nrow()
```
Unfortunately, given the genomic span covered by the SNPs in this dataset, this method only creates 144 LD blocks, which does not seem like enough to fit complex models. While the authors do seem to do some LD control by eliminating some SNPs before analysis, that control may be insufficient.
Finally, I was thinking about potential bootstrapping methods for assessing uncertainty in the mash weights on covariance matrices. One option is to repeatedly sample one SNP from each LD block, fit the mash model, and look at the distribution of fitted weights. I believe this will work for most datasets. However, if the correlation between effects within each LD block is extremely high, it might underestimate uncertainty in the fitted weights. Instead, it might be better to first sample a set of LD blocks with replacement, then sample one SNP from each sampled block, and then fit the model. This effectively simulates sampling data from a population even if the effects within each LD block were perfectly correlated in the dataset. A sketch of this second scheme follows.
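Here is a minimal sketch of that second scheme, assuming `sites_df` (with `chromosome` and `block_id` columns, as built above) and a hypothetical wrapper `fit_mash_weights()` that refits mash on a given set of sites and returns the estimated weights:
```{r, eval=FALSE}
block_bootstrap_weights <- function(sites_df, n_boot = 100) {
  blocks <- sites_df %>% dplyr::ungroup() %>% dplyr::distinct(chromosome, block_id)
  lapply(seq_len(n_boot), function(b) {
    boot_sites <- blocks %>%
      # resample LD blocks with replacement, tracking each draw separately
      dplyr::slice_sample(n = nrow(blocks), replace = TRUE) %>%
      dplyr::mutate(draw = dplyr::row_number()) %>%
      # attach all SNPs in each sampled block, then keep one SNP per draw
      dplyr::inner_join(dplyr::ungroup(sites_df), by = c("chromosome", "block_id")) %>%
      dplyr::group_by(draw) %>%
      dplyr::slice_sample(n = 1) %>%
      dplyr::ungroup()
    # fit_mash_weights() is a placeholder for re-running the mash pipeline above
    fit_mash_weights(boot_sites)
  })
}
```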
## Revising the STAN Model and Fitting it to the Drosophila Dataset
Recall from last week the following model:
Assume that we have SNPs labelled $1, ..., n$, where SNP $i$ has true effect $\theta_{i}$. Then, assume each value of $\theta$ is drawn independently as
\begin{equation*}
\theta_{1}, ..., \theta_{n} \sim N(\mu_{\theta}, \sigma_{\theta}^{2}),
\end{equation*}
where we put flat, uninformative priors on $\mu_{\theta}$ and $\sigma_{\theta}^{2}$.
Then, assume we have observations $h_{1}, ..., h_{n}$ in the high sugar group with standard errors $s_{1h}, ..., s_{nh}$, and observations $c_{1}, ..., c_{n}$ in the control group with standard errors $s_{1c}, ..., s_{nc}$. We also assume there exists an amplification coefficient $\alpha$ for the effect in the high sugar group, which we give a flat, uninformative prior. This gives the following sampling statements:
\begin{align*}
c_{i} &\sim N(\theta_{i}, s_{ic}^{2})\\
h_{i} &\sim \pi_{0}N(\theta_{i}, s_{ih}^{2}) + (1-\pi_{0})N((1 + \alpha)\theta_{i}, s_{ih}^{2})
\end{align*}
where $\pi_{0}$ is the mixture weight (with an uninformative prior) giving the probability that an $h_{i}$ is drawn without amplification.
One issue with this model is that the number of $\theta$ parameters is equal to the size of the dataset $n$, which makes sampling slow and less likely to converge. However, if we are not interested in estimating the posterior distribution of the $\theta$ parameters, then we can marginalize them out, since a normal likelihood with a normal prior on its mean yields a normal marginal. (Strictly speaking, marginalizing out the shared $\theta_{i}$ also induces correlation between $c_{i}$ and $h_{i}$, which the factorized likelihood below ignores.) Marginally, the model above is equivalent to
\begin{align*}
c_{i} &\sim N(\mu_{\theta}, s_{ic}^{2} + \sigma_{\theta}^{2})\\
h_{i} &\sim \pi_{0}N(\mu_{\theta}, s_{ih}^{2} + \sigma_{\theta}^{2}) + (1-\pi_{0})N((1 + \alpha)\mu_{\theta}, s_{ih}^{2} + (1 + \alpha)^{2}\sigma_{\theta}^{2})
\end{align*}
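As a quick sanity check on this marginalization, we can simulate from the hierarchical model for the control group and compare the empirical standard deviation of $c_{i}$ to $\sqrt{s_{ic}^{2} + \sigma_{\theta}^{2}}$ (a minimal sketch with arbitrary parameter values):
```{r}
# simulate theta ~ N(mu, sigma_theta^2), then c | theta ~ N(theta, s_c^2)
mu_theta_sim <- 0.3; sigma_theta_sim <- 0.5; s_c_sim <- 0.2
theta_sim <- rnorm(1e5, mu_theta_sim, sigma_theta_sim)
c_sim <- rnorm(1e5, theta_sim, s_c_sim)
c(empirical_sd = sd(c_sim), analytical_sd = sqrt(s_c_sim^2 + sigma_theta_sim^2))
```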
This model is written in STAN as follows:
```{stan output.var="mixt_amp_mod"}
data {
int<lower = 1> N; // Sample size
vector[N] h; // high sugar measured effects
vector[N] c; // control measured effects
vector<lower = 0>[N] s_h; // high sugar se
vector<lower = 0>[N] s_c; // control se
}
parameters {
real<lower = 0, upper = 1> pi_0; // Mixture model proportion
//vector[N] theta; // vector of mean parameters
real mu_theta; // mean of theta parameters
real<lower = 0> sigma_theta; // sd of the theta parameters
real alpha; // amplification coefficient
}
model {
pi_0 ~ beta(1, 2);
mu_theta ~ normal(0, sqrt(10));
sigma_theta ~ normal(0, sqrt(5));
alpha ~ normal(0, sqrt(5));
c ~ normal(mu_theta, sqrt(square(s_c) + square(sigma_theta)));
for(i in 1:N) {
target += log_sum_exp(
log(pi_0) +
normal_lpdf(h[i] |
mu_theta,
sqrt(square(s_h[i]) + square(sigma_theta))),
log(1 - pi_0) +
normal_lpdf(h[i] |
(1 + alpha) * mu_theta,
sqrt(square(s_h[i]) + square(1 + alpha) * square(sigma_theta))));
}
}
```
When testing this model on simulated data (see the week 3 code for how to simulate the data), I got much better results, with the chains consistently converging to the true parameter values. I did have to tighten the priors slightly, but I still believe they are reasonable and relatively flat over the feasible region of the parameters.
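For reference, here is a minimal way to simulate data consistent with this model (the parameter values are illustrative, not the week 3 settings), which could then be passed to `rstan::sampling` as below:
```{r, eval=FALSE}
n <- 5000
pi_0 <- 0.7; alpha <- 1; mu_theta <- 0.1; sigma_theta <- 0.3
theta <- rnorm(n, mu_theta, sigma_theta)
s_c <- runif(n, 0.05, 0.2); s_h <- runif(n, 0.05, 0.2)
c_obs <- rnorm(n, theta, s_c)
# each h is amplified with probability 1 - pi_0
amplified <- rbinom(n, 1, 1 - pi_0)
h_obs <- rnorm(n, ifelse(amplified == 1, (1 + alpha) * theta, theta), s_h)
sim_stan_data <- list(N = n, h = h_obs, c = c_obs, s_h = s_h, s_c = s_c)
```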
I then fit this model to the Drosophila dataset. It would be computationally infeasible to do this with the entire dataset, so I used only the SNPs that the authors' analysis labelled as significant in at least one environment.
```{r, eval=FALSE}
summary_table_stan <- summary_table %>%
dplyr::filter(sig_cat != 'NS') %>%
dplyr::select(c(site, pval_CTRL, pval_HS, coef_CTRL, coef_HS, sig_cat))
# replace 0 p-values with small numbers
summary_table_stan <- summary_table_stan %>%
dplyr::mutate(
pval_CTRL = pmax(.000000000001, pval_CTRL),
pval_HS = pmax(.000000000001, pval_HS)
)
# construct std error estimates from coefficients and p-values
summary_table_stan <- summary_table_stan %>%
dplyr::mutate(
std_error_ctrl = abs(coef_CTRL) / qnorm((2 - pval_CTRL) / 2),
std_error_hs = abs(coef_HS) / qnorm((2 - pval_HS) / 2)
)
dros_stan_data <- list(
N = nrow(summary_table_stan),
h = summary_table_stan$coef_HS,
c = summary_table_stan$coef_CTRL,
s_h = summary_table_stan$std_error_hs,
s_c = summary_table_stan$std_error_ctrl
)
fitted_model <- rstan::sampling(
mixt_amp_mod,
data = dros_stan_data,
warmup = 6000,
iter = 12000,
cores = 4
)
posterior_dists <- rstan::extract(fitted_model)
```
```{r, include=FALSE}
posterior_dists <- readr::read_rds(
"~/Documents/academic/drosophila_longevity/drosophila-longevity/rds_data/stan_dros_post.rds"
)
```
However, this model seems to have trouble converging, which may indicate that it does not fit the data particularly well. I'm guessing this is driven by the mixture component, as the posterior distribution of $\pi_{0}$ appears to be bimodal. Over the next day / week I will continue to explore more informative prior distributions that may assist convergence, though at a certain point the priors become so informative that the impartiality of the analysis has to come into question.
```{r}
plot(density(posterior_dists$pi_0, from = 0, to = 1), main = "Posterior Dist of pi0")
```
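If I save the full `fitted_model` object in the future, the standard rstan diagnostics would be the natural next check (a sketch; `fitted_model` is not stored in this notebook's rds files):
```{r, eval=FALSE}
# split R-hat and effective sample sizes for the top-level parameters
print(rstan::summary(
  fitted_model, pars = c("pi_0", "mu_theta", "sigma_theta", "alpha")
)$summary)
# traceplots to inspect mixing across chains
rstan::traceplot(fitted_model, pars = c("pi_0", "alpha"))
```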
## Evaluation of Correlation Between Interaction Term and Time Coefficient
Finally, I wanted to look into the correlation between the fitted interaction term and the time coefficient in the interaction model. I first loaded in and merged the data from Rebecca.
```{r read_data, cache=TRUE, results=FALSE, message=FALSE, warning=FALSE}
int_model_coefs <- readr::read_delim(
file = paste0("~/Documents/academic/drosophila_longevity/drosophila-longevity/",
"data/ALL_int_model_coefs.txt"),
skip = 1,
col_names = FALSE
)
colnames(int_model_coefs) <- c(
"variant_id", "coef_intercept", "coef_time_TN", "coef_time_TN_int_HS",
"coef_seq_batch", "coef_meta_cage", "coef_sex"
)
int_model_coefs <- int_model_coefs %>%
dplyr::select(c(variant_id, coef_time_TN, coef_time_TN_int_HS))
int_model_pvals <- readr::read_delim(
file = paste0("~/Documents/academic/drosophila_longevity/drosophila-longevity/",
"data/ALL_int_model_pvals.txt"),
skip = 1,
col_names = FALSE
)
colnames(int_model_pvals) <- c(
"variant_id", "pval_intercept", "pval_time_TN", "pval_time_TN_int_HS",
"pval_seq_batch", "pval_meta_cage", "pval_sex"
)
int_model_pvals <- int_model_pvals %>%
dplyr::select(c(variant_id, pval_time_TN, pval_time_TN_int_HS))
int_model_se <- readr::read_delim(
file = paste0("~/Documents/academic/drosophila_longevity/drosophila-longevity/",
"data/ALL_int_model_se.txt"),
skip = 1,
col_names = FALSE
)
colnames(int_model_se) <- c(
"variant_id", "se_intercept", "se_time_TN", "se_time_TN_int_HS",
"se_seq_batch", "se_meta_cage", "se_sex"
)
int_model_se <- int_model_se %>%
dplyr::select(c(variant_id, se_time_TN, se_time_TN_int_HS))
int_model_df <- int_model_coefs %>%
dplyr::inner_join(int_model_pvals, by = "variant_id") %>%
dplyr::inner_join(int_model_se, by = "variant_id")
```
First, I plotted the coefficients across all data points.
```{r, cache=TRUE}
# first, look at the correlation without subsetting data based on p-vals
ggplot(int_model_df, mapping = aes(x = coef_time_TN, y = coef_time_TN_int_HS)) +
geom_point(size = .5) +
geom_smooth(method = "lm", formula = y ~ x) +
ggtitle("cor = 3.4%")
```
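For reference, the correlation quoted in the plot title can be recomputed directly:
```{r}
with(int_model_df, cor(coef_time_TN, coef_time_TN_int_HS))
```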
There is a slight positive correlation between the coefficient on time and the interaction term, which provides some evidence for an amplification effect.
However, I'm not sure whether it makes more sense to view all SNPs together or to subset them based on the p-values output from the model. Specifically, it would be interesting to look at the correlation of coefficients for some subset of "significant" SNPs, where at least one (or both) of the coefficients passes some liberal definition of "significant."
```{r}
int_model_signif_df <- int_model_df %>%
dplyr::filter(pval_time_TN < .05 & pval_time_TN_int_HS < .05)
```
Now, we'll re-examine the plot.
```{r, cache=TRUE}
ggplot(int_model_signif_df, mapping = aes(x = coef_time_TN, y = coef_time_TN_int_HS)) +
geom_point(size = .5) +
geom_smooth(method = "lm") +
ggtitle("cor = -5.2%")
```
I'm not sure this negative correlation is reliable, especially because requiring both effects to be "significant" imposes a very particular structure on the data.
It would perhaps be interesting to also look at this plot in the case that at least one of the coefficients is "significant".
```{r}
int_model_signif_df3 <- int_model_df %>%
dplyr::filter(pval_time_TN < .05 | pval_time_TN_int_HS < .05)
```
Now, the correlation is substantially higher, which is expected.
```{r}
ggplot(int_model_signif_df3, mapping = aes(x = coef_time_TN, y = coef_time_TN_int_HS)) +
geom_point(size = .5) +
geom_smooth(method = "lm") +
ggtitle("cor = 17.2%")
```
This coheres much more closely with the amplification hypothesis. However, excluding the subset of data around 0 still makes me a bit uncomfortable (though it's probably fine).
One other possible method of analysis would involve taking a weighted correlation, where we look at the entire dataset but weight each observation by the inverse of the sum of the variances on the two relevant coefficients.
```{r}
int_model_df <- int_model_df %>%
dplyr::mutate(weight = 1 / (se_time_TN ^ 2 + se_time_TN_int_HS ^ 2))
```
```{r}
weights::wtd.cors(
x = int_model_df$coef_time_TN_int_HS,
y = int_model_df$coef_time_TN,
weight = int_model_df$weight
)
```
The correlation seems about the same here as in the overall dataset.
So, overall it seems that the interaction model lends some credence to the amplification hypothesis. However, I'm not really sure of the best way to subset the data to test this hypothesis. Once we determine this precisely, I think it would also be good to construct a permutation test where we shuffle the interaction term to get a null distribution for the correlation coefficient; this would let us more precisely assess the magnitude of the correlation between the coefficients. A sketch of such a test follows.
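As a rough sketch of what that permutation test could look like (shuffling the interaction coefficients relative to the time coefficients to break any true association):
```{r, eval=FALSE}
n_perm <- 1000
obs_cor <- with(int_model_df, cor(coef_time_TN, coef_time_TN_int_HS))
perm_cors <- replicate(n_perm, {
  # permuting one column breaks the pairing between the two coefficients
  with(int_model_df, cor(coef_time_TN, sample(coef_time_TN_int_HS)))
})
# two-sided permutation p-value
mean(abs(perm_cors) >= abs(obs_cor))
```
Note that this naive shuffle treats SNPs as independent; given the LD concerns above, a block-level permutation might give a more honest null distribution.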