-
Notifications
You must be signed in to change notification settings - Fork 1
/
20190115_02_sample_estimation.Rmd
211 lines (181 loc) · 6.76 KB
/
20190115_02_sample_estimation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
---
title: "20190116: Sample estimation for TNSeq of Spyogenes (2017 including Andrew)."
author: "atb"
date: "`r Sys.Date()`"
output:
html_document:
code_download: true
code_folding: show
fig_caption: true
fig_height: 7
fig_width: 7
highlight: tango
keep_md: false
mode: selfcontained
number_sections: true
self_contained: true
theme: readable
toc: true
toc_float:
collapsed: false
smooth_scroll: false
rmdformats::readthedown:
code_download: true
code_folding: show
df_print: paged
fig_caption: true
fig_height: 7
fig_width: 7
highlight: tango
width: 300
keep_md: false
mode: selfcontained
toc_float: true
BiocStyle::html_document:
code_download: true
code_folding: show
fig_caption: true
fig_height: 7
fig_width: 7
highlight: tango
keep_md: false
mode: selfcontained
toc_float: true
---
<style type="text/css">
body, td {
font-size: 16px;
}
code.r{
font-size: 16px;
}
pre {
font-size: 16px
}
</style>
```{r options, include=FALSE}
library("hpgltools")
tt <- devtools::load_all("~/hpgltools")
knitr::opts_knit$set(width=120,
progress=TRUE,
verbose=TRUE,
echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
dpi=96)
old_options <- options(digits=4,
max.print=120,
stringsAsFactors=FALSE,
knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
rundate <- format(Sys.Date(), format="%Y%m%d")
previous_file <- "20190115_01_annotation.Rmd"
ver <- "20190115"
tmp <- sm(loadme(filename=paste0(gsub(pattern="\\.Rmd", replace="", x=previous_file), "-v", ver, ".rda.xz")))
rmd_file <- "20190115_02_sample_estimation.Rmd"
```
# `r rundate` Sample Estimation, Spyogenes: `r ver`
This document is concerned with analyzing TNSeq from S.pyogenes, primarily from
the perspective of changing Zn concentration in the media. This was done with
samples where Cu concentrations were changed; those are not included in the
final analyses, but are important for the variance calculations and so are kept
here.
## Extract the relevant samples
There were some other samples in this series of runs which were not relevant; so
lets subset the data to include only the relevant samples.
```{r extract_samples}
rpmi_expt <- subset_expt(expt=sp_expt, subset="experiment=='metal homeostasis'")
```
## Plot samples of the raw data
```{r initial_estimation, fig.show="hide"}
rpmi_metrics <- sm(graph_metrics(expt=rpmi_expt))
```
I am doing this fresh after a long time; I don't really remember what these
plots look like, but I am betting there is a huge batch-like effect which
depends on the library precursor. But before that, let us look at their
saturations and other metrics.
```{r plot_initial_estimates}
rpmi_metrics$legend
## Don't forget to print the legend!
rpmi_metrics$libsize
## So the saddest library has 3,922,047 reads, that is not terrible I think.
## A few samples might be a problem: hpgl0898, hpgl0879; but I am guessing a factor
## of <4 between the highest and lowest samples should not be too big of a problem.
rpmi_metrics$nonzero
## Lower y-axis samples are more likely to be troublesome (looking at you hpgl0866, 898, 878, 869).
rpmi_metrics$corheat
## There is a pretty harsh batch-like effect
rpmi_metrics$cvplot
## But at least the coefficients of variance are similar. wait, we only have 2 thy samples?
rpmi_metrics$density
## Nice, consistent sample densities
```
## Normalize and replot
I just copied/pasted the following from the previous worksheet, but I am seeing
that I did not actually use these plots. I think I will leave them here for
now.
```{r norm_replot, fig.show="hide"}
rpmi_filt <- sm(normalize_expt(rpmi_expt, filter=TRUE))
## This will likely be our data for DE analyses.
rpmi_norm <- sm(normalize_expt(rpmi_expt, filter=TRUE, convert="cpm", norm="quant",
transform="log2"))
## This, and a sva version of it, will be used for visualizing.
rpmi_batch <- sm(normalize_expt(rpmi_filt, filter=TRUE, convert="cpm",
batch="svaseq", transform="log2"))
rpmi_norm_metrics <- sm(graph_metrics(rpmi_norm))
rpmi_batch_metrics <- sm(graph_metrics(rpmi_batch))
```
## Now some plots from after normalization
After log2 and such, PCA becomes informative. Let us therefore look first at
the clustering before any surrogate estimation.
```{r mouse_show_images_post}
norm_pca <- plot_pca(rpmi_norm, cis=FALSE)
norm_pca$plot
## This clustering is kind of terrible.
rpmi_norm_metrics$disheat
## This is going to be a problem.
```
So I think once again, the primary surrogate variable is not batch per se, but
stems from how the libraries diverge after their selection but before
collection/library prepration.
## Try a couple of surrogate variables
Given the wretched clustering observed, I figure I should try a couple tools
from ruv/sva and see if they help. It looks like I prefered fsva in my earlier
iteration of this visualization. I have since come to the feeling that fsva is
a bit weird; and so I think I will use svaseq instead.
```{r batch_testing}
rpmi_batch1 <- sm(normalize_expt(rpmi_expt, transform="log2", convert="cpm",
filter=TRUE, batch="fsva"))
plot_pca(rpmi_batch1)$plot
plot_corheat(rpmi_batch1)$plot
## This looks a bit more encouraging.
rpmi_batch2 <- sm(normalize_expt(rpmi_expt, transform="log2", convert="cpm",
filter=TRUE, batch="svaseq"))
plot_pca(rpmi_batch2)$plot
## as does this.
rpmi_batch_written <- sm(
write_expt(expt=rpmi_expt, transform="log2", convert="cpm",
filter=TRUE, batch="svaseq", violin=TRUE,
excel=paste0("excel/", rundate, "_rpmi_svaseq-v", ver, ".xlsx")))
```
```{r surrogate_testing}
varpart_test <- varpart(expt=rpmi_filt, predictor=NULL,
factors=c("coverage", "replicate", "time", "cuzn", "medium"))
varpart_test$partition_plot
varpart_test$percent_plot
surrogate_test <- compare_surrogate_estimates(rpmi_filt)
surrogate_test$sva_unsupervised_adjust$svs_sample
rpmi_metrics$libsize
## Given the contribution of coverage in the variancePartition results above, one might
## assume that the library sizes will correspond to the surrogates detected by sva and friends.
## This appears to not be the case.
surrogate_test$plot
## it looks like the various surrogate estimators mostly agree on this data --
## especially the sva-based ones.
```
```{r saveme}
message(paste0("This is hpgltools commit: ", get_git_commit()))
pander::pander(sessionInfo())
this_save <- paste0(gsub(pattern="\\.Rmd", replace="", x=rmd_file), "-v", ver, ".rda.xz")
message(paste0("Saving to ", this_save))
tmp <- sm(saveme(filename=this_save))
```