---
title: "Week 5 Drosophila Work"
author: "Eric Weine"
date: "3/30/2022"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, include=FALSE}
summary_table <- read.delim('data/SummaryTable_allsites_12Nov20.txt')
```
## Subsetting Data by LD Blocks
This week, all of the data I'll be working with is subset by sampling one SNP per 10 kb LD block. The code to do this is shown below; the only addition from last week is that I am now sampling from each LD block rather than merely creating the blocks.
```{r, eval=FALSE}
library(magrittr) # provides the %>% pipe; all other functions are namespaced

sites_df <- data.frame(stringr::str_split_fixed(summary_table$site, ":", 2))
colnames(sites_df) <- c("chromosome", "site_id")
sites_df <- sites_df %>%
  dplyr::mutate(site_id = as.numeric(site_id))

# split each chromosome into blocks of a given length (in base pairs)
split_into_LD_blocks <- function(df, block_length) {
  block_range <- seq(from = min(df$site_id), to = max(df$site_id), by = block_length)
  df %>%
    dplyr::mutate(block_id = plyr::laply(site_id, function(x) sum(x > block_range)))
}

# group by chromosome and then split into 10 kb blocks
sites_df <- sites_df %>%
  dplyr::group_by(chromosome) %>%
  dplyr::group_modify(~ split_into_LD_blocks(.x, 1e4))

# sample one SNP from each block
sites_sample_df <- sites_df %>%
  dplyr::ungroup() %>%
  dplyr::sample_frac() %>% # randomly shuffle the rows
  dplyr::distinct(chromosome, block_id, .keep_all = TRUE) %>%
  dplyr::select(chromosome, site_id)

# reconstruct site names of the form "chromosome:position"
selected_sites <- purrr::pmap_chr(
  list(sites_sample_df$chromosome, sites_sample_df$site_id),
  function(x, y) glue::glue("{x}:{y}")
)

summary_table_samp <- summary_table %>%
  dplyr::filter(site %in% selected_sites) %>%
  dplyr::select(c(site, pval_CTRL, pval_HS, coef_CTRL, coef_HS))

# replace 0 p-values with a small positive number
summary_table_samp <- summary_table_samp %>%
  dplyr::mutate(
    pval_CTRL = pmax(1e-11, pval_CTRL),
    pval_HS = pmax(1e-11, pval_HS)
  )

# construct standard error estimates from coefficients and two-sided p-values
summary_table_samp <- summary_table_samp %>%
  dplyr::mutate(
    std_error_ctrl = abs(coef_CTRL) / qnorm((2 - pval_CTRL) / 2),
    std_error_hs = abs(coef_HS) / qnorm((2 - pval_HS) / 2)
  )
```
```{r, include=FALSE}
summary_table_samp <- readr::read_rds(
"rds_data/stan_ld_data.rds"
)
```
In total, there are about 12K variants sampled in the reduced dataset.
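As a quick sanity check (not evaluated here), the size of the reduced dataset can be confirmed directly:
```{r, eval=FALSE}
# number of variants retained after sampling one SNP per block
nrow(summary_table_samp)
```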
## Fitting MASH to the subset dataset
Now, we fit MASH to the subset dataset, first without an additional penalty in the prior. The code is the same as last week, except that I realized MASH automatically adds a null covariance matrix for you, so I have removed the "no_effect" matrix I had been feeding mash; it always had zero posterior weight and was clearly redundant.
```{r, eval=FALSE}
reg_fx_mat <- t(matrix(
  data = c(summary_table_samp$coef_CTRL, summary_table_samp$coef_HS),
  nrow = 2,
  byrow = TRUE
))
colnames(reg_fx_mat) <- c("ctrl", "hs")

reg_se_mat <- t(matrix(
  data = c(summary_table_samp$std_error_ctrl, summary_table_samp$std_error_hs),
  nrow = 2,
  byrow = TRUE
))
colnames(reg_se_mat) <- c("ctrl", "hs")

mash_data <- mashr::mash_set_data(reg_fx_mat, reg_se_mat)

# construct covariance matrices to feed into mash
cov_mat_list <- list()

# effects only in the high sugar condition
cov_mat_list[['hs_spec']] <- matrix(
  data = c(0, 0, 0, 1), nrow = 2, byrow = TRUE, dimnames = list(
    rows = c("ctrl", "hs"), cols = c("ctrl", "hs")
  )
)

# effects only in the control condition
cov_mat_list[['ctrl_spec']] <- matrix(
  data = c(1, 0, 0, 0), nrow = 2, byrow = TRUE, dimnames = list(
    rows = c("ctrl", "hs"), cols = c("ctrl", "hs")
  )
)

# equal-effect and amplification matrices across a grid of correlations
desired_corrs <- seq(from = -1, to = 1, by = .25)
desired_amp <- c(3, 2, 1.5)

for (corr in desired_corrs) {
  cov_mat_list[[glue::glue('equal_corr_{corr}')]] <- make_amp_cov_mat(
    desired_corr = corr, amp = FALSE
  )
  for (cond in c("hs", "ctrl")) {
    for (amp in desired_amp) {
      cov_mat_list[[glue::glue('{cond}_amp_{amp}_corr_{corr}')]] <- make_amp_cov_mat(
        desired_corr = corr, amp_hs = (cond == "hs"), amp_coef = amp
      )
    }
  }
}

mash_out <- mashr::mash(
  data = mash_data,
  Ulist = cov_mat_list,
  algorithm.version = "Rcpp",
  outputlevel = 1
)

cov_mat_ests <- mashr::get_estimated_pi(mash_out)
```
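The chunk above relies on `make_amp_cov_mat`, which was defined in last week's write-up and is not repeated here. For reference, below is a minimal sketch of what that helper plausibly looks like, reconstructed only from how it is called above (the argument names `desired_corr`, `amp`, `amp_hs`, and `amp_coef` come from those calls; the actual definition may differ).
```{r, eval=FALSE}
# Hypothetical reconstruction of make_amp_cov_mat: a 2x2 covariance matrix
# over (ctrl, hs) with correlation `desired_corr`. If amp = TRUE, one
# condition's standard deviation is scaled by `amp_coef` (hs if amp_hs = TRUE,
# ctrl otherwise); if amp = FALSE, both conditions get unit variance.
make_amp_cov_mat <- function(desired_corr, amp = TRUE, amp_hs = TRUE, amp_coef = 1) {
  sd_ctrl <- if (amp && !amp_hs) amp_coef else 1
  sd_hs <- if (amp && amp_hs) amp_coef else 1
  off_diag <- desired_corr * sd_ctrl * sd_hs
  matrix(
    data = c(sd_ctrl ^ 2, off_diag, off_diag, sd_hs ^ 2),
    nrow = 2, byrow = TRUE,
    dimnames = list(rows = c("ctrl", "hs"), cols = c("ctrl", "hs"))
  )
}
```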
The updated posterior estimates of the mash mixture weights (only those above 1e-3 are shown) are below:
```{r, echo=FALSE}
cov_mat_ests <- readr::read_rds(
"~/Documents/academic/drosophila_longevity/drosophila-longevity/rds_data/cov_mat_LD_blocks.rds"
)
print(cov_mat_ests[cov_mat_ests >= 1e-3])
```
The estimated weight on the null covariance matrix has increased from about 1% to about 7%, so subsetting the data clearly makes some difference. However, the weight on the null matrix is still substantially lower than what we saw in GxSex.
## Experimenting with null penalization in MASH
Another avenue of exploration in MASH is attempting to bias the fitting procedure toward more null effects. The `mash` function has a `prior` parameter that can be set to either `'nullbiased'` or `'uniform'` (the default). In every dataset I have examined, including the simulated datasets used in the mash tutorial, I have noticed only negligible differences in the posterior mixture weights when switching between these two settings. So, I wanted to better understand the "nullbiased" setting and see whether it was possible to make the bias towards the null stronger than what MASH currently implements.
Based on the mash source code, it seems that mash calls the ash code to estimate the mixture weights. This makes sense: once the likelihoods have been evaluated for the data, estimating the mixture weights involves the same procedure for ash and mash. To encourage sparsity, ash adds a penalization term to the likelihood of the form
\begin{equation*}
\prod_{k=0}^{K}\pi_{k}^{\lambda_{k} - 1},
\end{equation*}
where $\pi_{0}$ is the weight on the null matrix, $\pi_{1}, \dots, \pi_{K}$ are the weights on the $K$ non-null mixture components, and $\lambda_{k}$ is the penalty on the $k^{th}$ mixture component. In ash, $\lambda_0$ can be set via the `nullweight` parameter of the `ash` function (assuming one sets `prior = "nullbiased"`), and all other $\lambda_{k}$ are set to 1. In `mash` there is no `nullweight` parameter (as of this writing); one can only set `prior = "nullbiased"`, which sets $\lambda_{0}$ to $10$ (and keeps all other $\lambda_{k}$ at 1). In practice, setting $\lambda_{0}$ to $10$ seems insufficient to induce sparsity in most cases: when going through the mash tutorial, as well as when testing that value on our data, the weight on $\pi_{0}$ essentially did not change.
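For concreteness, here is a minimal illustration (not run) of the ash-level interface described above, applied to one of our conditions; the column names are the ones constructed earlier, and `nullweight = 100` is just an example value.
```{r, eval=FALSE}
# ash exposes lambda_0 directly: with prior = "nullbiased", nullweight sets
# the penalty on the null component while all other lambda_k stay at 1
ash_fit <- ashr::ash(
  betahat = summary_table_samp$coef_HS,
  sebetahat = summary_table_samp$std_error_hs,
  prior = "nullbiased",
  nullweight = 100
)
```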
I submitted a PR to add this parameter to mash; in the meantime, I was able to build mash locally and run it with a stronger null bias. I set $\lambda_{0}$ to $100$, with the results shown below:
```{r, echo=FALSE}
cov_mat_ests_nb100 <- readr::read_rds(
"rds_data/cov_mat_LD_blocks_nb100.rds"
)
print(cov_mat_ests_nb100[cov_mat_ests_nb100 >= 1e-3])
```
Interestingly, it seems most of the weight that transferred to the null mixture component came from the amplification covariance matrix. This could be because many of the SNPs driving the weight on the amplification matrix had very small effects in the high sugar condition, and were therefore "classified" as null under the stronger penalty. In any event, the weight on the null matrix is still much lower than in GxSex, even with a high penalty on the non-null weights.
## Troubleshooting the Stan model
This week, I also wanted to continue improving the Stan model. Since the dataset contains about 12K variants, my local machine would run out of RAM when running the model, but I was able to spin up an AWS instance with much more RAM that could get the job done. The key insight this week was abandoning the normality assumption for the distribution of effect sizes. Following Arbel's advice, I first attempted to fit two models: one where I fixed the mixture weight $\pi_{0}$ at $0$ (corresponding to amplification of every signal) and one where I fixed $\pi_{0}$ at $1$ (corresponding to no amplification). I was still having some difficulty fitting these models, and I needed quite a few samples before the chains mixed (on the order of 10K, which is always a bad sign). What I noticed was that the posterior seemed to be bimodal in the mean and standard deviation parameters, which was driving the poor chain mixing. Below are plots of the mean and standard deviation parameters from the model with $\pi_{0}$ fixed at $1$.
```{r}
post_group_diff <- readr::read_rds("rds_data/stan_dros_post_group_diff.rds")
stan_ld_data <- readr::read_rds("rds_data/stan_ld_data.rds")

plot(density(post_group_diff$mu_theta), main = "Posterior Density of Effect Mean")
plot(density(post_group_diff$sigma_theta), main = "Posterior Density of Effect sd")
plot(
  post_group_diff$mu_theta,
  post_group_diff$sigma_theta,
  xlab = "Posterior Effect Mean",
  ylab = "Posterior Effect sd"
)
```
To me, this indicated that the tails of the estimated effects might be too heavy for a normal likelihood, with a handful of outlying observations pulling the likelihood toward two modes.
To confirm this, I made a Q-Q plot of the estimated beta-binomial regression effects for each group against the normal distribution (with parameters estimated by the excellent qqconf package).
```{r}
qqconf::qq_conf_plot(
  obs = stan_ld_data$c,
  points_params = list(cex = .5),
  main = "Estimated Control Effects from Regression Model"
)
qqconf::qq_conf_plot(
  obs = stan_ld_data$h,
  points_params = list(cex = .5),
  main = "Estimated HS Effects from Regression Model"
)
```
The deviation from normality is not extreme, but the tails do seem heavy enough that it could make a difference. To remedy this, I decided to use a t likelihood (with one degree of freedom) for the high sugar effects. The mixture model with the t likelihood is shown below:
```{stan output.var="amp_mixt_t", eval = FALSE}
data {
  int<lower = 1> N;          // sample size
  vector[N] h;               // high sugar measured effects
  vector[N] c;               // control measured effects
  vector<lower = 0>[N] s_h;  // high sugar standard errors
  vector<lower = 0>[N] s_c;  // control standard errors
}
parameters {
  real<lower = 0, upper = 1> pi_0; // mixture weight on the no-amplification component
  real mu_theta;                   // mean of the true effects
  real<lower = 0> sigma_theta;     // sd of the true effects
  real alpha;                      // amplification coefficient
}
model {
  // implicit uniform(0, 1) prior on pi_0
  mu_theta ~ normal(0, sqrt(3));
  sigma_theta ~ normal(0, sqrt(3));
  alpha ~ normal(0, sqrt(3));
  c ~ normal(mu_theta, sqrt(square(s_c) + square(sigma_theta)));
  for (i in 1:N) {
    target += log_sum_exp(
      log(pi_0) +
        student_t_lpdf(h[i] | 1,
                       mu_theta,
                       sqrt(square(s_h[i]) + square(sigma_theta))),
      log(1 - pi_0) +
        student_t_lpdf(h[i] | 1,
                       (1 + alpha) * mu_theta,
                       sqrt(square(s_h[i]) + square(1 + alpha) * square(sigma_theta))));
  }
}
```
The inclusion of the t-likelihood seemed to make a very big difference, and convergence occurred quickly.
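For completeness, here is a rough sketch (not run) of how this model could be sampled with rstan. It assumes `stan_ld_data` is a list with elements `N`, `h`, `c`, `s_h`, and `s_c` matching the data block above, and the chain and iteration settings are illustrative only.
```{r, eval=FALSE}
# amp_mixt_t is the compiled model object produced by the stan chunk above
# (when run with eval = TRUE); the settings below are placeholders
amp_mixt_t_fit <- rstan::sampling(
  amp_mixt_t,
  data = stan_ld_data,
  chains = 4,
  iter = 2000
)
amp_mixt_t_post <- rstan::extract(amp_mixt_t_fit)
```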
```{r, include=FALSE}
# posterior from mixture
amp_mixt_t_post <- readr::read_rds("rds_data/stan_dros_post_mixture.rds")
```
Below are the posterior densities of the parameters of interest.
```{r}
plot(density(amp_mixt_t_post$pi_0), main = "Posterior Density of pi0")
plot(density(amp_mixt_t_post$mu_theta), main = "Posterior Density of mu_theta")
plot(density(amp_mixt_t_post$alpha), main = "Posterior Density of alpha")
```
This model is much less flexible than the MASH model (most notably, there is no null-effect component here). However, the two models seem to "agree" with each other, relatively speaking. The posterior mean of $\pi_{0}$ is about $.65$, which is pretty close to the estimates we've gotten for the matrix of equal effects with a correlation of 1. The posterior mean of $\alpha$ is about $1.25$, which indicates that the control effect is multiplied by about $2.25$ in the high sugar group. The MASH model suggested that the amplification factor was closer to $1.5$, which seems more plausible based on the density plots of the regression coefficients I looked at in week 1.
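As a quick check, these summaries can be pulled directly from the posterior draws loaded above (a small illustration, not evaluated here; the numbers quoted are from my run):
```{r, eval=FALSE}
# posterior mean of the mixture weight and the implied amplification factor
mean(amp_mixt_t_post$pi_0)
1 + mean(amp_mixt_t_post$alpha) # multiplier applied to the control effect
```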
Adding a null effect mixture component could be the next logical step in this model. However, before spending more money on AWS instances, I'd like to discuss where we want to go with this model so that I can reduce cost as much as possible.
## MASH Bootstrap
I was mistaken about what I said last week; I believe the current method is valid for creating bootstrap samples. I had misremembered the procedure for generating bootstrap samples: I thought each bootstrap sample contained only some fraction of the dataset (e.g. $.66 \cdot n$), but in fact if the dataset is of size $n$, then each bootstrap sample draws $n$ data points with replacement. Any data point has a correlation of 1 with itself, so sampling one SNP per block yields a between-sample correlation below 1, and the procedure is therefore valid here. It is still necessary that the blocks be sampled with replacement, in order to follow the standard bootstrapping procedure. A sketch of one bootstrap replicate under this scheme is shown below.
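This is a minimal sketch (not run), assuming the `sites_df` object (with its `block_id` column) from the LD-block code in the first section; it draws blocks with replacement and then one SNP from each drawn block, keeping duplicated blocks.
```{r, eval=FALSE}
# one bootstrap replicate: resample (chromosome, block) pairs with replacement,
# then draw a single SNP from each drawn block
block_ids <- unique(sites_df[, c("chromosome", "block_id")])
drawn_blocks <- block_ids[sample(nrow(block_ids), replace = TRUE), ]

boot_sample <- do.call(rbind, lapply(seq_len(nrow(drawn_blocks)), function(i) {
  in_block <- sites_df$chromosome == drawn_blocks$chromosome[i] &
    sites_df$block_id == drawn_blocks$block_id[i]
  block_snps <- sites_df[in_block, ]
  block_snps[sample(nrow(block_snps), 1), ]
}))
```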
## Fitting MASH for GxExTrait
One other idea I was thinking about this week is extending MASH to the case in which the estimated effects form a 3D array (SNP $\times$ condition $\times$ trait) rather than a 2D matrix. In the GxSex project, we had SNP effects for a number of traits in each sex, and in that case the posterior effects were estimated completely separately. However, it seems possible to imagine a matrix normal generalization of the mash procedure, as follows.
Assume that we have $n$ SNPs of interest, with estimated effect sizes and standard errors for $c$ conditions and $t$ traits. Let the true effect matrix for the $i^{th}$ SNP be the $c \times t$ matrix $X_{i}$. Then, we assume that
\begin{equation*}
X_{1}, \dots, X_{n} \sim MN_{c \times t}(0, U, V),
\end{equation*}
where $MN_{c \times t}$ is the $c \times t$ matrix normal distribution, $U$ is a $c \times c$ matrix representing the covariance between conditions, and $V$ is a $t \times t$ matrix representing the covariance between traits. To actually fit this model with MASH, we can take advantage of the fact that the matrix normal distribution reduces to the multivariate normal distribution in the following way. If
\begin{equation*}
X \sim MN_{n \times p}(M, U, V),
\end{equation*}
then
\begin{equation*}
vec(X) \sim N_{np}(vec(M), V \otimes U),
\end{equation*}
where the $vec$ operator converts a matrix to a column vector by stacking the columns atop each other. Thus, to actually fit this model with mash, we would first define a set of potential covariance matrices for $U$ and $V$
\begin{align*}
W &= \{U_{1}, \dots, U_{z}\}, \\
P &= \{V_{1}, \dots, V_{q}\}, \\
B &= P \times W,
\end{align*}
then apply the Kronecker product $V \otimes U$ to each pair in $B$, which gives our set of candidate covariance matrices. Finally, we simply convert our observations $\hat{X}_{1}, \dots, \hat{X}_{n}$ to $vec(\hat{X}_{1}), \dots, vec(\hat{X}_{n})$ and proceed as usual. A sketch of this construction is shown below.
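Below is a minimal sketch (not run) of the covariance construction, using hypothetical lists `cond_cov_list` (candidate $U$ matrices, $c \times c$) and `trait_cov_list` (candidate $V$ matrices, $t \times t$); each pair yields one $(ct) \times (ct)$ covariance matrix for mash.
```{r, eval=FALSE}
# build V %x% U for every (V, U) pair; base R's kronecker() does the product
kron_cov_list <- list()
for (u_name in names(cond_cov_list)) {
  for (v_name in names(trait_cov_list)) {
    kron_cov_list[[glue::glue("{v_name}_x_{u_name}")]] <-
      kronecker(trait_cov_list[[v_name]], cond_cov_list[[u_name]])
  }
}
# Each SNP's c x t effect estimate X_hat would then be column-stacked with
# as.vector() (i.e. vec(X_hat)) to form one row of the matrix passed to
# mashr::mash_set_data, with kron_cov_list supplied as the Ulist.
```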