-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
409 lines (291 loc) · 13.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
---
output: github_document
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
dev = "ragg_png",
dpi = 600,
fig.width = 12,
fig.height = 8,
message = FALSE,
warning = FALSE
)
```
# ScanCentricPeakCharacterization
The goal of the ScanCentricPeakCharacterization (SCPC) package is to facilitate scan-centric, frequency based, peak characterization of profile level, multi-scan, direct-injection Fourier-transform mass spectrometry data.
You can read more about the merits of this scan-centric method in:
RM Flight, JM Mitchell & HNB Moseley, "Scan-Centric, Frequency-Based Method for Characterizing Peaks from Direct Injection Fourier transform Mass Spectrometry Experiments", Metabolites 2022, 12(6), 515; https://doi.org/10.3390/metabo12060515
## License
This package is licensed with a [BSD-like license](LICENSE.md) with a 4th clause: **No commercial use.**
Academics who want to use it at their institution, please try it.
If you are at a business / for-profit and want to use it, please contact the authors (Robert Flight, rflight79 at gmail; Hunter Moseley, hunter dot moseley at gmail) about licensing.
Please contact us even if you aren't sure what would be required for licensing, we do want people to use it.
## Installation
You can install SCPC from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("MoseleyBioinformaticsLab/ScanCentricPeakCharacterization")
```
## Documentation Site
You can browse the documentation online [here](https://moseleybioinformaticslab.github.io/ScanCentricPeakCharacterization).
## Setup
```{r setup}
library(ScanCentricPeakCharacterization)
library(dplyr)
library(patchwork)
library(ggplot2)
#theme_set(cowplot::theme_cowplot())
```
## Theory
### Converting m/z to Frequency
Outside of the scan-centric nature of this peak-characterization, the second most important feature is the conversion from m/z to frequency.
This is done to make evenly spaced data.
If you acquire Orbitrap / ICR type mass spectrometry data over any decent range, there is an increasing spacing between individual m/z points.
We will load up an example direct-injection lipidomics sample acquired on a Thermo-Fisher Fusion instrument to demonstrate.
```{r example_lipidomics}
mzml_lipid = SCMzml$new(system.file("extdata/lipid_example.mzML", package = "ScanCentricPeakCharacterization"))
mzml_lipid$extract_mzml_data()
mzml_lipid$predict_frequency()
```
```{r show_spacing}
mzml_lipid$mzml_df_data[[1]] %>%
dplyr::filter(convertable) %>%
ggplot(aes(x = mz, y = mean_offset)) +
geom_point()
```
We can see here that the difference or offset of m/z points is increasing with m/z.
In contrast, frequency is defined as the difference over m/z, and therefore is constant.
$$mz_{mean} = mean(mz_{p1}, mz_{p2})$$
$$mz_{diff} = mz_{p2} - mz_{p1}$$
$$frequency = \frac{mz_{mean}}{mz_{diff}}$$
```{r show_frequency}
mzml_lipid$mzml_df_data[[1]] %>%
dplyr::filter(convertable) %>%
ggplot(aes(x = mean_frequency, y = mean_freq_diff)) +
geom_point()
```
However, we can more generally define the conversion of m/z to frequency using a linear model of the form:
$$frequency = a + \frac{x}{mz} + \frac{y}{\sqrt{mz}} + \frac{z}{\sqrt[3]{mz}}$$
And we can verify that with a plot of the m/z vs frequency and their predicted values, in a couple of ways, as well as a plot of the residuals.
```{r show_predictions}
mzml_lipid$check_frequency_model()
```
See the example of `SCMzml` below to see how we can change the model being used.
## Basic Objects and Classes
A note about the assumptions baked into SCPC:
* Data is acquired in multiple, direct-injection scans (i.e. there is no chromatography component).
* The scans are not aggregated together.
* The data is in **profile mode**, not centroided.
* The M/Z range for each scan is identical.
* The data is in an open format supported by `MSnbase` and `mzR`.
### SCMzml
We assume `mzML` is the most likely mass-spectrometer format, and `SCMzml` is an R6 container around the `mzML` file (although it should work with **any** of the formats supported by `MSnbase`).
It has basic functionality for loading the `mzML` data using `MSnbase`, filtering out scans that are not required, and converting the M/Z to frequency space for subsequent peak detection.
It is good to check a few samples using `SCMzml` before peak characterization.
This is especially important if you need to define custom scan filtering and choosing a single frequency model before attempting full peak characterization.
Let's go through some of it using the example file included with SCPC.
This sample is a lipidomics sample acquired on a Thermo-Fisher tribrid Orbitrap Fusion over 15 minutes, with a combination of MS1 and MS2 scans, acquired in profile mode.
It has been filtered to a subset of M/Z and total scans to save space.
The first 7.5 minutes are the MS1 primary scans, and the remaining 7.5 minutes are combination of MS1 and MS2 scans of the highest intensity M/Z in the MS1 primary scans.
#### Load the Data
```{r}
#| label: load-data
lipid_sample = system.file("extdata", "lipid_example.mzML", package = "ScanCentricPeakCharacterization")
sc_mzml = SCMzml$new(lipid_sample)
```
After instantiating the object, we extract the data into a list of data.frames that make it easier to work with.
This may change over time to use more efficient data structures.
```{r}
#| label: extract-mzml
sc_mzml$extract_mzml_data()
```
#### Examine the Scan Level Information
We can see what is reported at the scan level:
```{r}
#| label: show-scans
head(sc_mzml$scan_info) |>
knitr::kable()
```
And in the data.frame created from the mzML data (here is a single scan):
```{r}
#| label: show-data
head(sc_mzml$mzml_df_data[[1]]) |>
knitr::kable()
```
#### Frequency and M/Z Models
As discussed above (see [Theory](#theory)), SCPC is based on the idea that we should work in a pseudo-frequency space, which is achieved by calculating an initial frequency based on the M/Z spacing, and then fit that to a model based on the geometry of the Orbitrap or ICR.
For Thermo Orbitrap (Fusion and Fusion Lumos), that our lab primarily uses, this model looks like:
$$frequency = a + \frac{x}{mz} + \frac{y}{\sqrt{mz}} + \frac{z}{\sqrt[3]{mz}}$$
In contrast, for Bruker SolariX, the model has one fewer term:
$$frequency = a + \frac{x}{mz} + \frac{y}{\sqrt{mz}}$$
Note how these models are encoded, the (-) indicates a fraction, and the actual fraction indicates a root (either square or cube in these cases):
```{r}
#| label: example-models
# thermo fusion model:
thermo = c("a.freq" = 0, "x.freq" = -1, "y.freq" = -1/2, "z.freq" = -1/3)
# bruker
bruker = c("a.freq" = 0, "x.freq" = -1, "y.freq" = -1/2)
```
```{r}
#| label: predict-frequency
sc_mzml$predict_frequency()
```
Here is the scan information after prediction, it now has model terms related to the above equations:
```{r}
sc_mzml$scan_info |>
head() |>
knitr::kable()
```
And the mz data.frame now has frequency related data as well:
```{r}
sc_mzml$mzml_df_data[[1]] |>
head() |>
knitr::kable()
```
We can see a bunch more information added to the scan level summary data and the M/Z data.frame after we predict frequency in each scan.
We can check the frequency fit using `check_frequency`.
```{r}
#| label: check-frequency
sc_mzml$check_frequency_model()
```
What we've plotted are for a single scan:
* The M/Z and the calculated frequency (black), and the fit line according to the model above;
* The calculated frequency from M/Z points, and the predicted frequency after fitting a model, as well as the 1 - 1 line (red);
* Calculated - predicted frequency as a function of M/Z
For all scans, we have the median and maximum absolute deviation (MAD) of the calculated - predicted residuals.
#### Changing the Frequency Model
We can change the frequency model and recheck it easily.
Let's use a model where we change the square root to a cube root.
```{r}
#| label: change-model
correct_model = c("a.freq" = 0, "x.freq" = -1, "y.freq" = -1/2, "z.freq" = -1/3)
off_model = c("a.freq" = 0, "x.freq" = -1, "y.freq" = -1/3)
sc_mzml$frequency_fit_description = off_model
sc_mzml$predict_frequency()
sc_mzml$check_frequency_model()
```
From this diagram, we are pretty sure that the model is incorrect for our instrument.
Before proceeding, let's revert it.
```{r}
#| label: revert-model
sc_mzml$frequency_fit_description = correct_model
sc_mzml$predict_frequency()
```
#### Filter Scans
Next we set up our filter scans function.
Let's plot the scans first.
```{r}
#| label: scan-info-plot
sc_mzml$scan_info |>
ggplot(aes(x = rtime, xend = rtime,
y = 0, yend = tic,
color = y.freq)) +
geom_segment()
```
We've plotted the scans by their *retention time* (rtime), and their height is the *total intensity chromatogram* (tic), and colored by the value of the *y frequency term* (y.freq).
For **this** data, we only want those scans below an rtime of 450, and with a y.freq >= 2.9e7.
After using these criteria, the default function also uses `boxplot.stats` on the y.freq term to check for possible outlier scans.
```{r}
#| label: create-filter-scans
sc_mzml$generate_filter_scan_function(rtime = c(NA, 450),
y.freq = c(2.9e7, NA))
sc_mzml$filter_scans()
```
Now we can see which scans are being excluded with this filter.
```{r}
#| label: filter-scans-table
sc_mzml$scan_info |>
head() |>
knitr::kable()
```
```{r}
#| label: filter-scans-plot
sc_mzml$scan_info |>
ggplot(aes(x = rtime, xend = rtime,
y = 0, yend = tic,
color = keep)) +
geom_segment()
```
Changing the filters here using variations of `rtime` and `y.freq` should be simple.
If you have more involved needs, you can write your own filtering function.
`filter_scans_builtin` is another example of a function you can use as a template.
If you want to use a custom function (named *my_custom_filter*), you can add it like this:
```{r}
#| label: filter-scans-custom
#| eval: false
sc_mzml$generate_filter_scan_function(f_function = my_custom_filter)
```
#### Choose Frequency Model
After filtering scans, we need to pick a single frequency model.
Our [publication](https://doi.org/10.3390/metabo12060515) showed using each scan's own model is a bad idea, and we don't advise using some conglomeration of models either.
Instead, the default is to take the model with the y-term closest to the median of all the y-terms.
```{r}
#| label: choose-model
sc_mzml$generate_choose_frequency_model_function()
sc_mzml$choose_frequency_model()
sc_mzml$frequency_coefficients
```
#### Convert to Frequency
Finally, we can convert our M/Z data to frequency for use in peak characterization.
```{r}
#| label: convert-frequency
sc_mzml$convert_to_frequency()
```
### SCCharacterizePeaks
`SCCharacterizePeaks` controls the overall interplay between:
* the `SCZip` container that will hold the original and final data;
* the `SCMzml` object that loads mzml data, transforms it to frequency space, and filters out scans that don't seem to belong;
* the `SCPeakRegion` and `SCPeakRegionFinder` that actually do all of the peak characterization.
Let's give an example using an example lipid file.
This chunk is **not** evaluated due to peak characterization taking so long.
```{r example_char}
lipid_sample = system.file("extdata", "lipid_example.mzML", package = "ScanCentricPeakCharacterization")
sc_char = SCCharacterizePeaks$new(lipid_sample, out_file = here::here("lipid_sample.zip"))
sc_char$load_file()
```
```{r}
#| label: run-all
#| eval: false
sc_char$generate_filter_scan_function(rtime = c(NA, 450),
y.freq = c(2.9e7, NA))
sc_char$generate_choose_frequency_model_function()
sc_char$prepare_mzml_data()
sc_char$find_peaks()
sc_char$summarize()
sc_char$save_peaks()
sc_char$write_zip()
```
#### Run Everything
If you have a large number of samples to run, and you know you will be using the same functions for filtering scans and choosing the frequency model, you can use the `run_all` instead.
It takes two arguments, the `filter_scan_function`, and the `choose_frequency_model_function`.
```{r run_everything, eval = FALSE}
# not run
sc_char = SCCharacterizePeaks$new("file.mzML", out_file = "file.zip")
sc_char$run_all(filter_scan_function = custom_filter,
choose_frequency_model_function = custom_frequency_model)
```
You should see the vignette on custom functions if you want to go this route.
### SCPeakRegionFinder
`SCPeakRegionFinder` is similar to `SCCharacterizePeaks` in that it is more of a controlling workflow object.
It serves to coordinate all the steps that need to happen for peak characterization outside of the conversion to frequency, which is the purview of the `SCMzml` object.
The `SCPeakRegionFinder` acts on the `SCPeakRegions` object, which has all of the data and methods.
### SCPeakRegions
`SCPeakRegions` holds the frequency data and the methods.
It is controlled by `SCPeakRegionFinder`.
### SCZip
We wanted a fairly generic way to store the original mzML file, any metadata generated about it, the binary output of `SCPeakRegionFinder`, and a JSONized peak list that can be used for assignment.
What we decided on was a simple zip file that keeps those objects together.
When we create a new `SCZip`, we actually create a temp directory, and move all the data there, and unzip it so that it is easily accessible and pieces can be modified easily.
For example, we can see where the temp data lives for our previously created `sc_char` object.
```{r}
#| label: show-temp
sc_char$temp_loc
dir(sc_char$temp_loc)
```