forked from sneumann/xcms
-
Notifications
You must be signed in to change notification settings - Fork 0
/
xcms.Rmd
1182 lines (944 loc) · 53.8 KB
/
xcms.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "LCMS data preprocessing and analysis with xcms"
package: xcms
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{LCMS data preprocessing and analysis with xcms}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{xcms,RColorBrewer,faahKO,pander,magrittr,BiocStyle,pheatmap}
%\VignettePackage{xcms}
%\VignetteKeywords{mass spectrometry, metabolomics}
bibliography: references.bib
csl: biomed-central.csl
references:
- id: dummy
title: no title
author:
- family: noname
given: noname
---
```{r biocstyle, echo = FALSE, results = "asis" }
BiocStyle::markdown()
```
**Package**: `r Biocpkg("xcms")`<br />
**Authors**: Johannes Rainer<br />
**Modified**: `r file.info("xcms.Rmd")$mtime`<br />
**Compiled**: `r date()`
```{r init, message = FALSE, echo = FALSE, results = "hide" }
## Silently loading all packages
library(BiocStyle)
library(xcms)
library(faahKO)
library(pander)
## Use socket based parallel processing on Windows systems
if (.Platform$OS.type == "unix") {
register(bpstart(MulticoreParam(3)))
} else {
register(bpstart(SnowParam(3)))
}
```
# Introduction
This documents describes data import, exploration, preprocessing and analysis of
LCMS experiments with `xcms` version >= 3. The examples and basic workflow was
adapted from the original *LC/MS Preprocessing and Analysis with xcms* vignette
from Colin A. Smith.
The new user interface and methods use the `XCMSnExp` object (instead of the
*old* `xcmsSet` object) as a container for the pre-processing results. To
support packages and pipelines relying on the `xcmsSet` object, it is however
possible to convert an `XCMSnExp` into a `xcmsSet` object using the `as` method
(i.e.`xset <- as(x, "xcmsSet")`, with `x` being an `XCMSnExp` object.
# Data import
`xcms` supports analysis of LC/MS data from files in (AIA/ANDI) NetCDF,
mzML/mzXML and mzData format. For the actual data import Bioconductor's
`r Biocpkg("mzR")` is used. For demonstration purpose we will analyze a
subset of the data from [@Saghatelian04] in which the metabolic consequences of
knocking out the fatty acid amide hydrolase (FAAH) gene in mice was
investigated. The raw data files (in NetCDF format) are provided with the
`faahKO` data package. The data set consists of samples from the spinal cords of
6 knock-out and 6 wild-type mice. Each file contains data in centroid mode
acquired in positive ion mode form 200-600 m/z and 2500-4500 seconds.
Below we load all required packages, locate the raw CDF files within the `faahKO`
package and build a *phenodata* data frame describing the experimental setup.
```{r load-libs-pheno, message = FALSE }
library(xcms)
library(faahKO)
library(RColorBrewer)
library(pander)
library(magrittr)
library(pheatmap)
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
recursive = TRUE)
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = c(rep("KO", 6), rep("WT", 6)),
stringsAsFactors = FALSE)
```
Subsequently we load the raw data as an `OnDiskMSnExp` object using the
`readMSData` method from the `r Biocpkg("MSnbase")` package. The `MSnbase`
provides based structures and infrastructure for the processing of mass
spectrometry data. Also, `MSnbase` can be used to *centroid* profile-mode MS
data (see the corresponding vignette in the `MSnbase` package).
```{r load-with-msnbase, message = FALSE }
raw_data <- readMSData(files = cdfs, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
```
```{r silent-subsetting, message = FALSE, echo = FALSE}
raw_data <- filterRt(raw_data, c(2500, 4000))
```
The resulting `OnDiskMSnExp` object contains general information about the
number of spectra, retention times, the measured total ion current etc, but does
not contain the full raw data (i.e. the m/z and intensity values from each
measured spectrum). Its memory footprint is thus rather small making it an ideal
object to represent large metabolomics experiments while allowing to perform
simple quality controls, data inspection and exploration as well as data
sub-setting operations. The m/z and intensity values are imported from the raw
data files on demand, hence the location of the raw data files should not be
changed after initial data import.
# Initial data inspection
The `OnDiskMSnExp` organizes the MS data by spectrum and provides the methods
`intensity`, `mz` and `rtime` to access the raw data from the files (the measured
intensity values, the corresponding m/z and retention time values). In addition,
the `spectra` method could be used to return all data encapsulated in `Spectrum`
objects. Below we extract the retention time values from the object.
```{r data-inspection-rtime, message = FALSE }
head(rtime(raw_data))
```
All data is returned as one-dimensional vectors (a numeric vector for `rtime`
and a `list` of numeric vectors for `mz` and `intensity`, each containing the
values from one spectrum), even if the experiment consists of multiple
files/samples. The `fromFile` function returns an integer vector providing
the mapping of the values to the originating file. Below we use the `fromFile`
indices to organize the `mz` values by file.
```{r data-inspection-mz, message = FALSE }
mzs <- mz(raw_data)
## Split the list by file
mzs_by_file <- split(mzs, f = fromFile(raw_data))
length(mzs_by_file)
```
As a first evaluation of the data we plot below the base peak chromatogram (BPC)
for each file in our experiment. We use the `chromatogram` method and set the
`aggregationFun` to `"max"` to return for each spectrum the maximal intensity
and hence create the BPC from the raw data. To create a total ion chromatogram
we could set `aggregationFun` to `sum`.
```{r data-inspection-bpc, message = FALSE, fig.align = "center", fig.width = 12, fig.height = 6 }
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(raw_data, aggregationFun = "max")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(bpis, col = group_colors[raw_data$sample_group])
```
The `chromatogram` method returned a `Chromatograms` object that organizes
individual `Chromatogram` objects (which in fact contain the chromatographic
data) in a two-dimensional array: columns represent samples and rows
(optionally) m/z and/or retention time ranges. Below we extract the chromatogram
of the first sample and access its retention time and intensity values.
```{r data-inspection-chromatogram, message = FALSE }
bpi_1 <- bpis[1, 1]
head(rtime(bpi_1))
head(intensity(bpi_1))
```
The `chromatogram` method supports also extraction of chromatographic data from
a m/z-rt slice of the MS data. In the next section we will use this method to
create an extracted ion chromatogram (EIC) for a selected peak.
Note that `chromatogram` reads the raw data from each file to calculate the
chromatogram. The `bpi` and `tic` methods on the other hand do not read any data
from the raw files but use the respective information that was provided in the
header definition of the input files (which might be different from the actual
data).
Below we create boxplots representing the distribution of total ion currents per
file. Such plots can be very useful to spot problematic or failing MS runs.
```{r data-inspection-tic-boxplot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4, fig.cap = "Distribution of total ion currents per file." }
## Get the total ion current by file
tc <- split(tic(raw_data), f = fromFile(raw_data))
boxplot(tc, col = group_colors[raw_data$sample_group],
ylab = "intensity", main = "Total ion current")
```
Also, we can cluster the samples based on similarity of their base peak
chromatogram. This can also be helpful to spot potentially problematic samples
in an experiment or generally get an initial overview of the sample grouping in
the experiment. Since the retention times between samples are not exactly
identical, we use the `bin` function to group intensities in fixed time ranges
(bins) along the retention time axis. In the present example we use a bin size
of 1 second, the default is 0.5 seconds. The clustering is performed using
complete linkage hierarchical clustering on the pairwise correlations of the
binned base peak chromatograms.
```{r data-inspection-bpc-heatmap, message = FALSE, fig.align = "center", fig.width = 7, fig.height = 6, fig.cap = "Grouping of samples based on similarity of their base peak chromatogram."}
## Bin the BPC
bpis_bin <- bin(bpis, binSize = 2)
## Calculate correlation on the log2 transformed base peak intensities
cormat <- cor(log2(do.call(cbind, lapply(bpis_bin, intensity))))
colnames(cormat) <- rownames(cormat) <- raw_data$sample_name
## Define which phenodata columns should be highlighted in the plot
ann <- data.frame(group = raw_data$sample_group)
rownames(ann) <- raw_data$sample_name
## Perform the cluster analysis
pheatmap(cormat, annotation = ann,
annotation_color = list(group = group_colors))
```
The samples cluster in a pairwise manner, the KO and WT samples for the sample
index having the most similar BPC. In addition, 3 larger clusters are present
grouping KO and WT samples 21 and 22 together as well as 18 and 19. KO and WT
samples 15 and 16 separate from these two clusters.
# Chromatographic peak detection
Next we perform the chromatographic peak detection using the *centWave*
algorithm [@Tautenhahn:2008fx]. Before running the peak detection it is however
strongly suggested to visually inspect e.g. the extracted ion chromatogram of
internal standards or known compounds to evaluate and adapt the peak detection
settings since the default settings will not be appropriate for most LCMS
experiments. The two most critical parameters for *centWave* are the `peakwidth`
(expected range of chromatographic peak widths) and `ppm` (maximum expected
deviation of m/z values of centroids corresponding to one chromatographic peak;
this is usually much larger than the ppm specified by the manufacturer)
parameters. To evaluate the typical chromatographic peak width we plot the EIC
for one peak.
```{r peak-detection-plot-eic, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 5, fig.cap = "Extracted ion chromatogram for one peak." }
## Define the rt and m/z range of the peak area
rtr <- c(2700, 2900)
mzr <- c(334.9, 335.1)
## extract the chromatogram
chr_raw <- chromatogram(raw_data, mz = mzr, rt = rtr)
plot(chr_raw, col = group_colors[chr_raw$sample_group])
```
Note that `Chromatogram` objects extracted by the `chromatogram` method contain
an `NA` value if in a certain scan (i.e. for a specific retention time) no
signal was measured in the respective mz range. This is reflected by the lines
not being drawn as continuous lines in the plot above.
The peak above has a width of about 50 seconds. The `peakwidth` parameter should
be set to accommodate the expected widths of peak in the data set. We set it to
`20,80` for the present example data set.
For the `ppm` parameter we extract the full MS data (intensity, retention time
and m/z values) corresponding to the above peak. To this end we first filter the
raw object by retention time, then by m/z and finally plot the object with `type
= "XIC"` to produce the plot below. We use the *pipe* (`%>%`) command better
illustrate the corresponding workflow. Note also that in this type of plot
identified chromatographic peaks would be indicated by default if present.
```{r peak-detection-plot-ms-data, message = FALSE, warning = FALSE, fig.aligh = "center", fig.width = 14, fig.height = 14, fig.cap = "Visualization of the raw MS data for one peak. For each plot: upper panel: chromatogram plotting the intensity values against the retention time, lower panel m/z against retention time plot. The individual data points are colored according to the intensity." }
raw_data %>%
filterRt(rt = rtr) %>%
filterMz(mz = mzr) %>%
plot(type = "XIC")
```
In the present data there is actually no variation in the m/z values. Usually
one would see the m/z values (lower panel) scatter around the *real* m/z value
of the compound. The first step of the *centWave* algorithm defines so called
regions of interest (ROI) based on the difference of m/z values from consecutive
scans. In detail, m/z values from consecutive scans are included into a ROI if
the difference between the m/z and the mean m/z of the ROI is smaller than the
user defined `ppm` parameter. A reasonable choice for the `ppm` could thus be
the maximal m/z difference of data points from neighboring scans/spectra that
are part of the chromatographic peak. It is suggested to inspect the ranges of
m/z values for many compounds (either internal standards or compounds known to
be present in the sample) and define the `ppm` parameter for *centWave*
according to these.
Note that we can also perform the peak detection on the extracted ion
chromatogram. This can help to evaluate different peak detection settings. Only
be aware that peak detection on an extracted ion chromatogram will not consider
the `ppm` parameter and that the estimation of the background signal is
different to the peak detection on the full data set; values for the `snthresh`
will hence have different consequences. Below we perform the peak detection with
the `findChromPeaks` function on the extracted ion chromatogram. The submitted
*parameter* object defines which algorithm will be used and allows to define the
settings for this algorithm. We use the *centWave* algorithm with default
settings, except for `snthresh`.
```{r peak-detection-eic, message = FALSE}
xchr <- findChromPeaks(chr_raw, param = CentWaveParam(snthresh = 2))
```
We can access the identified chromatographic peaks with the `chromPeaks`
function.
```{r peak-detection-eic-chromPeaks}
head(chromPeaks(xchr))
```
Parallel to the `chromPeaks` matrix there is also a data frame `chromPeakData`
that allows to add arbitrary annotations to each chromatographic peak. Below we
extract this data frame that by default contains only the MS level in which the
peak was identified.
```{r peak-detection-chromatogram-chromPeakData}
chromPeakData(xchr)
```
Next we plot the identified chromatographic peaks in the extracted ion chromatogram. We use the `col`
parameter to color the individual chromatogram lines. Colors can also be
specified for the identified peaks, `peakCol` for the foreground/border color,
`peakBg` for the background/fill color. One color has to be provided for each
chromatographic peak listed by `chromPeaks`. Below we define a color to indicate
the sample group from which the sample is and use the sample information in the
peaks' `"sample"` column to assign the correct color to each chromatographic
peak. More peak highlighting options are described further below.
```{r peak-detection-eic-plot, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Signal for an example peak. Red and blue colors represent KO and wild type samples, respectively. Peak area of identified chromatographic peaks are highlighted in the sample group color."}
sample_colors <- group_colors[xchr$sample_group]
plot(xchr, col = sample_colors,
peakBg = sample_colors[chromPeaks(xchr)[, "column"]])
```
Finally we perform the chromatographic peak detection on the full data set. Note
that we set the argument `noise` to `5000` to reduce the run time of this
vignette. With this setting we consider only signals with a value larger than
5000 in the peak detection step.
```{r peak-detection-centwave, message = FALSE, results = "hide" }
cwp <- CentWaveParam(peakwidth = c(20, 80), noise = 5000)
xdata <- findChromPeaks(raw_data, param = cwp)
```
The results are returned as an `XCMSnExp` object which extends the
`OnDiskMSnExp` object by storing also LC/GC-MS preprocessing results. This means
also that all methods to sub-set and filter the data or to access the (raw) data
are inherited from the `OnDiskMSnExp` object and can thus be re-used. The
results from the chromatographic peak detection can be accessed with the
`chromPeaks` method.
```{r peak-detection-chromPeaks, message = FALSE }
head(chromPeaks(xdata))
```
The returned `matrix` provides the m/z and retention time range for each
identified chromatographic peak as well as the integrated signal intensity
("into") and the maximal peak intensitity ("maxo"). Columns "sample" contains
the index of the sample in the object/experiment in which the peak was
identified.
Annotations for each individual peak can be extracted with the `chromPeakData`
function. This data frame could also be used to add/store arbitrary annotations
for each detected peak.
```{r peak-detection-chromPeakData}
chromPeakData(xdata)
```
Below we use the data from the `chromPeaks` matrix to calculate some per-file
summaries.
```{r peak-detection-peaks-per-sample, message = FALSE, results = "asis" }
summary_fun <- function(z)
c(peak_count = nrow(z), rt = quantile(z[, "rtmax"] - z[, "rtmin"]))
T <- lapply(split.data.frame(
chromPeaks(xdata), f = chromPeaks(xdata)[, "sample"]),
FUN = summary_fun)
T <- do.call(rbind, T)
rownames(T) <- basename(fileNames(xdata))
pandoc.table(T,
caption = paste0("Summary statistics on identified chromatographic",
" peaks. Shown are number of identified peaks per",
" sample and widths/duration of chromatographic ",
"peaks."))
```
We can also plot the location of the identified chromatographic peaks in the
m/z - retention time space for one file using the `plotChromPeaks`
function. Below we plot the chromatographic peaks for the 3rd sample.
```{r peak-detection-chrom-peaks-plot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 8, fig.cap = "Identified chromatographic peaks in the m/z by retention time space for one sample." }
plotChromPeaks(xdata, file = 3)
```
To get a global overview of the peak detection we can plot the frequency of
identified peaks per file along the retention time axis. This allows to identify
time periods along the MS run in which a higher number of peaks was identified
and evaluate whether this is consistent across files.
```{r peak-detection-chrom-peak-image, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Frequency of identified chromatographic peaks along the retention time axis. The frequency is color coded with higher frequency being represented by yellow-white. Each line shows the peak frequency for one file." }
plotChromPeakImage(xdata)
```
Next we highlight the identified chromatographic peaks for the example peak from
before. Evaluating such plots on a list of peaks corresponding to known peaks or
internal standards helps to ensure that peak detection settings were appropriate
and correctly identified the expected peaks. We extract the ion chromatogram
from the peak detection result object, which contains then also the identified
chromatographic peaks for that ion that we can extract with the `chromPeaks`
function.
```{r peak-detection-eic-example-peak, message = FALSE}
chr_ex <- chromatogram(xdata, mz = mzr, rt = rtr)
chromPeaks(chr_ex)
```
We can also plot the extracted ion chromatogram. Identified chromatographic
peaks will be automatically highlighted in the plot. Below we highlight
chromatographic peaks with a rectangle from the peak's minimal to maximal rt and
from an intensity of 0 to the maximal signal of the peak.
```{r peak-detection-highlight-chrom-peaks-plot, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Signal for an example peak. Red and blue colors represent KO and wild type samples, respectively. The rectangles indicate the identified chromatographic peaks per sample." }
sample_colors <- group_colors[chr_ex$sample_group]
plot(chr_ex, col = sample_colors, peakType = "rectangle",
peakCol = sample_colors[chromPeaks(chr_ex)[, "sample"]],
peakBg = NA)
```
Alternatively to the rectangle visualization above, it is possible to represent
the apex position of each peak with a single point (passing argument
`type = "point"` to the function), or draw the actually identified peak by
specifying `type = "polygon"`. To completely omit highlighting the identified
peaks (e.g. to plot base peak chromatograms or similar) `type = "none"` can be
used. Below we use `type = "polygon"` to fill the peak area
for each identified chromatographic peak in each sample. Whether individual
peaks can be still identified in such a plot depends however on the number of
samples from which peaks are drawn.
```{r peak-detection-highlight-chrom-peaks-plot-polygon, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Signal for an example peak. Red and blue colors represent KO and wild type samples, respectively. The signal area of identified chromatographic peaks are filled with a color." }
plot(chr_ex, col = group_colors[chr_raw$sample_group], lwd = 2,
peakBg = sample_colors[chromPeaks(chr_ex)[, "sample"]])
```
Note that we can also specifically extract identified chromatographic peaks for
a selected region by providing the respective m/z and retention time ranges with
the `mz` and `rt` arguments in the `chromPeaks` method.
```{r peak-detection-chrom-peak-table-selected, message = FALSE, results = "asis" }
pander(chromPeaks(xdata, mz = mzr, rt = rtr),
caption = paste("Identified chromatographic peaks in a selected ",
"m/z and retention time range."))
```
Finally we plot also the distribution of peak intensity per sample. This allows
to investigate whether systematic differences in peak signals between samples
are present.
```{r peak-detection-chrom-peak-intensity-boxplot, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Peak intensity distribution per sample." }
## Extract a list of per-sample peak intensities (in log2 scale)
ints <- split(log2(chromPeaks(xdata)[, "into"]),
f = chromPeaks(xdata)[, "sample"])
boxplot(ints, varwidth = TRUE, col = group_colors[xdata$sample_group],
ylab = expression(log[2]~intensity), main = "Peak intensities")
grid(nx = NA, ny = NULL)
```
# Alignment
The time at which analytes elute in the chromatography can vary between samples
(and even compounds). Such a difference was already observable in the extracted
ion chromatogram plot shown as an example in the previous section. The alignment
step, also referred to as retention time correction, aims at adjusting this by
shifting signals along the retention time axis to align the signals between
different samples within an experiment.
A plethora of alignment algorithms exist (see [@Smith:2013gr]), with some of
them being implemented also in `xcms`. The method to perform the
alignment/retention time correction in `xcms` is `adjustRtime` which uses
different alignment algorithms depending on the provided parameter class.
In the example below we use the *obiwarp* method [@Prince:2006jj] to align the
samples. We use a `binSize = 0.6` which creates warping functions in mz bins of
0.6. Also here it is advisable to modify the settings for each experiment and
evaluate if retention time correction did align internal controls or known
compounds properly.
```{r alignment-obiwarp, message = FALSE, results = "hide" }
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.6))
```
`adjustRtime`, besides calculating adjusted retention times for each spectrum,
does also adjust the reported retention times of the identified chromatographic
peaks.
To extract the adjusted retention times we can use the `adjustedRtime` method,
or simply the `rtime` method that, if present, returns by default adjusted
retention times from an `XCMSnExp` object.
```{r alignment-rtime, message = FALSE }
## Extract adjusted retention times
head(adjustedRtime(xdata))
## Or simply use the rtime method
head(rtime(xdata))
```
*Raw* retention times can be extracted from an `XCMSnExp` containing
aligned data with `rtime(xdata, adjusted = FALSE)`.
To evaluate the impact of the alignment we plot the BPC on the adjusted data. In
addition we plot the differences of the adjusted- to the raw retention times per
sample using the `plotAdjustedRtime` function. Note that we **don't** want to
highlight any chromatographic peaks (because there would simply be too many) and
set thus `peakType = "none"` in the `plot` call.
```{r alignment-obiwarp-plot, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 10, fig.cap = "Obiwarp aligned data. Base peak chromatogram after alignment (top) and difference between adjusted and raw retention times along the retention time axis (bottom)." }
## Get the base peak chromatograms.
bpis_adj <- chromatogram(xdata, aggregationFun = "max")
par(mfrow = c(2, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis_adj, col = group_colors[bpis_adj$sample_group], peakType = "none")
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group])
```
Too large differences between adjusted and raw retention times could indicate
poorly performing samples or alignment.
Alternatively to *obiwarp* we could use the *peak groups* alignment method that
adjusts the retention time by aligning previously identified *hook peaks*
(chromatographic peaks present in most/all samples). Ideally, these hook peaks
should span most part of the retention time range. Below we first restore the
raw retention times (also of the identified peaks) using the `dropAdjustedRtime`
methods. Note that a `drop*` method is available for each preprocessing step
allowing to remove the respective results from the `XCMSnExp` object.
```{r alignment-drop, message = FALSE }
## Does the object have adjusted retention times?
hasAdjustedRtime(xdata)
## Drop the alignment results.
xdata <- dropAdjustedRtime(xdata)
## Does the object have adjusted retention times?
hasAdjustedRtime(xdata)
```
**Note**: `XCMSnExp` objects hold the raw along with the adjusted retention
times and subsetting will in most cases drop the adjusted retention
times. Sometimes it might thus be useful to **replace** the raw retention times
with the adjusted retention times. This can be done with the
`applyAdjustedRtime`.
As noted above the *peak groups* method requires peak groups (features) present
in most samples to perform the alignment. We thus have to perform a first
correspondence run to identify such peaks (details about the algorithm used are
presented in the next section). We use here again default settings, but it is
strongly advised to adapt the parameters for each data set. The definition of
the sample groups (i.e. assignment of individual samples to the sample groups in
the experiment) is mandatory for the `PeakDensityParam`. If there are no sample
groups in the experiment `sampleGroups` should be set to a single value for each
file (e.g. `rep(1, length(fileNames(xdata))`).
```{r alignment-peak-groups, message = FALSE, warning = FALSE}
## Correspondence: group peaks across samples.
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.8)
xdata <- groupChromPeaks(xdata, param = pdp)
## Now the retention time correction.
pgp <- PeakGroupsParam(minFraction = 0.85)
## Get the peak groups that would be used for alignment.
xdata <- adjustRtime(xdata, param = pgp)
```
Note also that we could use the `adjustRtimePeakGroups` method on the object
before alignment to evaluate on which features (peak groups) the alignment would
be performed. This can be useful to test different settings for the peak groups
algorithm. Also, it is possible to manually select or define certain peak groups
(i.e. their retention times per sample) and provide this matrix to the
`PeakGroupsParam` class with the `peakGroupsMatrix` argument.
Below plot the difference between raw and adjusted retention times using the
`plotAdjustedRtime` function, which, if the *peak groups* method is used for
alignment, also highlights the peak groups used in the adjustment.
```{r alignment-peak-groups-plot, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 8, fig.cap = "Peak groups aligned data." }
## Plot the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group],
peakGroupsCol = "grey", peakGroupsPch = 1)
```
At last we evaluate the impact of the alignment on the test peak.
```{r alignment-peak-groups-example-peak, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 10, fig.cap = "Example extracted ion chromatogram before (top) and after alignment (bottom)." }
par(mfrow = c(2, 1))
## Plot the raw data
plot(chr_raw, col = group_colors[chr_raw$sample_group])
## Extract the chromatogram from the adjusted object
chr_adj <- chromatogram(xdata, rt = rtr, mz = mzr)
plot(chr_adj, col = group_colors[chr_raw$sample_group], peakType = "none")
```
## Subset-based alignment
In some experiments it might be helpful to perform the alignment based on only a
subset of the samples, e.g. if QC samples were injected at regular intervals or
if the experiment contains blanks. Alignment method in `xcms` allow to
estimate retention time drifts on a subset of samples (either all samples
excluding blanks or QC samples injected at regular intervals during a
measurement run) and use these to adjust the full data set.
Parameters `subset` (of the `PeakGroupsParam` or `ObiwarpParam` object) can be
used to define the subset of samples on which the alignment of the full data set
will be based (e.g. `subset` being the index of QC samples), and parameter
`subsetAdjust` allows to specify the method by which the *left-out* samples will
be adjusted. There are currently two options available:
- `subsetAdjust = "previous"`: adjust the retention times of a non-subset
sample based on the alignment results of the previous subset sample (e.g. a
QC sample). If samples are e.g. in the order *A1*, *B1*, *B2*, *A2*, *B3*,
*B4* with *A* representing QC samples and *B* study samples, using
`subset = c(1, 4)` and `subsetAdjust = "previous"` would result in all *A*
samples to be aligned with each other and non-subset samples *B1* and *B2*
being adjusted based on the alignment result of subset samples *A1* and *B3*
and *B4* on those of *A2*.
- `subsetAdjust = "average"`: adjust retention times of non-subset samples based
on an interpolation of the alignment results of the previous and subsequent
subset sample. In the example above, *B1* would be adjusted based on the
average of adjusted retention times between subset (QC) samples *A1* and
*A2*. Since there is no subset sample after non-subset samples *B3* and *B4*
these will be adjusted based on the alignment results of *A2* alone. Note
that a weighted average is used to calculate the adjusted retention time
averages, which uses the inverse of the difference of the index of the
non-subset sample to the subset samples as weights. Thus, if we have a
setup like *A1*, *B1*, *B2*, *A2* the adjusted retention times of *A1*
would get a larger weight than those of *A2* in the adjustment of
non-subset sample *B1* causing it's adjusted retention times to be closer
to those of *A1* than to *A2*. See below for examples.
Both cases require a meaningful/correct ordering of the samples within the
object (e.g. ordering by injection index).
The examples below aim to illustrate the effect of these alignment options.
First we assume samples 1, 6 and 11 in the *faahKO* data set to be QC samples
(sample pools). We thus want to perform the alignment based on these samples and
subsequently adjust the retention times of the left-out samples (2, 3, 4, 5, 7,
8, 9, 10 and 12) based on interpolation of the results from the neighboring
*subset* samples. After initial peak grouping we perform below the alignment
with the *peak groups* method passing the indices of the samples on which we
want the alignment to be based with the `subset` argument and specify
`subsetAdjust = "average"` to adjust the study samples based on interpolation
of the alignment results from neighboring subset/QC samples.
Note that for any subset-alignment all parameters such as `minFraction` are
relative to the `subset`, not the full experiment!
```{r alignment-subset, message = FALSE, warning = FALSE}
xdata_subs <- dropAdjustedRtime(xdata)
## Define the experimental layout
xdata_subs$sample_type <- "study"
xdata_subs$sample_type[c(1, 6, 11)] <- "QC"
## Initial peak grouping. Use sample_type as grouping variable
pdp_subs <- PeakDensityParam(sampleGroups = xdata_subs$sample_type,
minFraction = 0.9)
xdata_subs <- groupChromPeaks(xdata_subs, param = pdp_subs)
## Define subset-alignment options and perform the alignment
pgp_subs <- PeakGroupsParam(minFraction = 0.85,
subset = which(xdata_subs$sample_type == "QC"),
subsetAdjust = "average", span = 0.4)
xdata_subs <- adjustRtime(xdata_subs, param = pgp_subs)
```
To illustrate the alignment results we plot the results only for samples 1 to 6
(1 and 6 being QC samples and 2 to 5 study samples that were adjusted based on
the QC samples). This nicely shows how the interpolation of
the `subsetAdjust = "average"` works: retention times of sample 2 are adjusted
based on those from subset sample 1 and 6, giving however more weight to the
closer subset sample 1 which results in the adjusted retention times of 2 being
more similar to those of sample 1. Sample 5 on the other hand gets adjusted
giving more weight to the second subset sample (6).
```{r alignment-subset-plot-1, message = FALSE, warning = FALSE, fig.align = "center", fig.width = 10, fig.height = 5, fig.cap = "Alignment results for samples 1 to 6."}
clrs <- rep(NA, 12)
clrs[1:6] <- paste0(brewer.pal(6, "Dark2"), 80)
plotAdjustedRtime(xdata_subs, col = clrs, lwd = 2, peakGroupsCol = NA)
legend("topleft", legend = 1:6, col = brewer.pal(6, "Dark2"), lwd = 2)
```
We also plot the full alignment results, drawing the `subset` samples as green
and study samples as grey lines.
```{r alignment-subset-plot-2, message = FALSE, warning = FALSE, fig.align = "center", fig.width = 10, fig.height = 10, fig.cap = "Subset-alignment results with option average. Difference between adjusted and raw retention times along the retention time axis. Samples on which the alignment models were estimated are shown in green, study samples in grey." }
clrs <- rep("#00000040", 12)
clrs[xdata_subs$sample_type == "QC"] <- c("#00ce0080")
par(mfrow = c(2, 1), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(xdata_subs, aggregationFun = "sum"),
col = clrs, peakType = "none")
plotAdjustedRtime(xdata_subs, col = clrs, peakGroupsPch = 1,
peakGroupsCol = "#00ce0040")
```
Option `subsetAdjust = "previous"` adjusts the retention times of a non-subset
sample based on a single subset sample (the previous), which results in most
cases in the adjusted retention times of the non-subset sample being highly
similar to those of the subset sample which was used for adjustment.
Also obiwarp alignment supports the same types of subset-alignment as shown
below.
```{r alignment-subset-obiwarp}
xdata_subs <- dropAdjustedRtime(xdata_subs)
prm <- ObiwarpParam(subset = which(xdata_subs$sample_type == "QC"),
subsetAdjust = "average")
xdata_subs <- adjustRtime(xdata_subs, param = prm)
```
We again plot the results.
```{r alignment-subset-obiwarp-plot, message = FALSE, warning = FALSE, fig.align = "center", fig.width = 10, fig.height = 10, fig.cap = "Obiwarp subset-alignment results with option average. Difference between adjusted and raw retention times along the retention time axis. Samples on which the alignment models were estimated are shown in green, study samples in grey." }
clrs <- rep("#00000040", 12)
clrs[which(xdata_subs$sample_type == "QC")] <- "#00ce0080"
par(mfrow = c(2, 1), mar = c(4, 4.5, 1, 0.5))
plot(chromatogram(xdata_subs, aggregationFun = "sum"),
col = clrs, peakType = "none")
plotAdjustedRtime(xdata_subs, col = clrs, peakGroupsPch = 1)
```
# Correspondence
The final step in the metabolomics preprocessing is the correspondence that
matches detected chromatographic peaks between samples (and depending on the
settings, also within samples if they are adjacent). The method to perform the
correspondence in `xcms` is `groupChromPeaks`. We will use the *peak density*
method [@Smith:2006ic] to group chromatographic peaks. The algorithm combines
chromatographic peaks depending on the density of peaks along the retention time
axis within small slices along the mz dimension. To illustrate this we plot
below the chromatogram for an mz slice with multiple chromatographic peaks
within each sample. We use below a value of 0.4 for the `minFraction` parameter
hence only chromatographic peaks present in at least 40% of the samples per
sample group are grouped into a feature. The sample group assignment is
specified with the `sampleGroups` argument.
```{r correspondence-example-slice, message = FALSE, fig.align = "center", fig.width = 10, fig.height = 10, fig.cap = "Example for peak density correspondence. Upper panel: chromatogram for an mz slice with multiple chromatographic peaks. Middle and lower panel: identified chromatographic peaks at their retention time (x-axis) and index within samples of the experiments (y-axis) for different values of the bw parameter. The black line represents the peak density estimate. Grouping of peaks (based on the provided settings) is indicated by grey rectangles." }
## Define the mz slice.
mzr <- c(305.05, 305.15)
## Extract and plot the chromatograms
chr_mzr <- chromatogram(xdata, mz = mzr, rt = c(2500, 4000))
par(mfrow = c(3, 1), mar = c(1, 4, 1, 0.5))
cols <- group_colors[chr_mzr$sample_group]
plot(chr_mzr, col = cols, xaxt = "n", xlab = "",
peakBg = sample_colors[chromPeaks(chr_mzr)[, "sample"]])
## Define the parameters for the peak density method
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.4, bw = 30)
par(mar = c(4, 4, 1, 0.5))
plotChromPeakDensity(xdata, mz = mzr, col = cols, param = pdp,
pch = 16, xlim = c(2500, 4000))
## Use a different bw
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.4, bw = 20)
plotChromPeakDensity(xdata, mz = mzr, col = cols, param = pdp,
pch = 16, xlim = c(2500, 4000))
```
The upper panel in the plot above shows the extracted ion chromatogram for each
sample with the detected peaks highlighted. The middle and lower plot shows the
retention time for each detected peak within the different samples. The black
solid line represents the density distribution of detected peaks along the
retention times. Peaks combined into *features* (peak groups) are indicated with
grey rectangles. Different values for the `bw` parameter of the
`PeakDensityParam` were used:`bw = 30` in the middle and `bw = 20` in the lower
panel. With the default value for the parameter `bw` the two neighboring
chromatographic peaks would be grouped into the same feature, while with a `bw`
of 20 they would be grouped into separate features. This grouping depends on
the parameters for the density function and other parameters passed to the
algorithm with the `PeakDensityParam`.
```{r correspondence, message = FALSE }
## Perform the correspondence
pdp <- PeakDensityParam(sampleGroups = xdata$sample_group,
minFraction = 0.4, bw = 20)
xdata <- groupChromPeaks(xdata, param = pdp)
```
The results from the correspondence can be extracted using the
`featureDefinitions` method, that returns a `DataFrame` with the definition of
the features (i.e. the mz and rt ranges and, in column `peakidx`, the index of
the chromatographic peaks in the `chromPeaks` matrix for each feature).
```{r correspondence-featureDefs, message = FALSE }
## Extract the feature definitions
featureDefinitions(xdata)
```
The `featureValues` method returns a `matrix` with rows being features and
columns samples. The content of this matrix can be defined using the `value`
argument. The default is`value = "index"` which simply returns the index of the
peak (in the `chromPeaks` matrix) assigned to a feature in a sample. Setting
`value = "into"` returns a matrix with the integrated signal of the peaks
corresponding to a feature in a sample. Any column name of the `chromPeaks`
matrix can be passed to the argument `value`. Below we extract the integrated
peak intensity per feature/sample.
```{r correspondence-feature-values, message = FALSE }
## Extract the into column for each feature.
head(featureValues(xdata, value = "into"))
```
This feature matrix contains `NA` for samples in which no chromatographic peak
was detected in the feature's m/z-rt region. While in many cases there might
indeed be no peak signal in the respective region, it might also be that there
is signal, but the peak detection algorithm failed to detect a chromatographic
peak. `xcms` provides the `fillChromPeaks` method to *fill in* intensity data
for such missing values from the original files. The *filled in* peaks are added
to the `chromPeaks` matrix and get a `TRUE` value in the `"is_filled"` column of
the `chromPeakData` data frame. Below we perform such a filling-in of missing
peaks.
```{r fill-chrom-peaks, message = FALSE }
## Filling missing peaks using default settings. Alternatively we could
## pass a FillChromPeaksParam object to the method.
xdata <- fillChromPeaks(xdata)
head(featureValues(xdata))
```
For features without detected peaks in a sample, the method extracts all
intensities in the mz-rt region of the feature, integrates the signal and adds a
*filled-in* peak to the `chromPeaks` matrix. No peak is added if no signal is
measured/available for the mz-rt region of the feature. For these, even after
filling in missing peak data, a `NA` is reported in the `featureValues` matrix.
It should be mentioned that `fillChromPeaks` uses columns `"rtmin"`, `"rtmax"`,
`"mzmin"` and `"mzmax"` from the `featureDefinitions` table to define the region
from which the signal should be integrated to generate the filled-in peak
signal. These values correspond however to the positions of the peak apex not
the peak boundaries of all chromatographic peaks assigned to a feature. It might
be advisable to increase this area in retention time dimension by a constant
value appropriate to the average peak width in the experiment. Such a value can
be specified with `fixedRt` of the `FillChromPeaksParam`. If the average peak
width in the experiment is 20 seconds, specifying`fixedRt = 10` ensures that the area
from which peaks are integrated is at least 20 seconds wide.
Below we compare the number of missing values before and after filling in
missing values. We can use the parameter `filled` of the `featureValues` method
to define whether or not filled-in peak values should be returned too.
```{r fill-chrom-peaks-compare, message = FALSE }
## Missing values before filling in peaks
apply(featureValues(xdata, filled = FALSE), MARGIN = 2,
FUN = function(z) sum(is.na(z)))
## Missing values after filling in peaks
apply(featureValues(xdata), MARGIN = 2,
FUN = function(z) sum(is.na(z)))
```
Next we use the `featureSummary` function to get a general per-feature summary
that includes the number of samples in which a peak was found or the number of
samples in which more than one peak was assigned to the feature. Specifying also
sample groups breaks down these summary statistics for each individual sample
group.
```{r featureSummary, message = FALSE }
head(featureSummary(xdata, group = xdata$sample_group))
```
The performance of peak detection, alignment and correspondence should always be
evaluated by inspecting extracted ion chromatograms e.g. of known compounds,
internal standards or identified features in general. The `featureChromatograms`
function allows to extract chromatograms for each feature present in
`featureDefinitions`. The returned `Chromatograms` object contains an ion
chromatogram for each feature (each row containing the data for one feature) and
sample (each column representing containing data for one sample). Below we
extract the chromatograms for the first 4 features.
```{r featureChromatograms, message = FALSE }
feature_chroms <- featureChromatograms(xdata, features = 1:4)
feature_chroms
```
And plot the extracted ion chromatograms. We again use the group color for each
identified peak to fill the area.
```{r feature-eic, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 8, fig.cap = "Extracted ion chromatograms for features 1 to 4." }
plot(feature_chroms, col = sample_colors,
peakBg = sample_colors[chromPeaks(feature_chroms)[, "sample"]])
third <- feature_chroms[3, ]
plot(third)
chromPeaks(third)
plot(feature_chroms[3, ], col = group_colors[feature_chroms$sample_group])
plot(feature_chroms[4, ], col = group_colors[feature_chroms$sample_group])
```
At last we perform a principal component analysis to evaluate the grouping of
the samples in this experiment. Note that we did not perform any data
normalization hence the grouping might (and will) also be influenced by
technical biases.
```{r final-pca, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 8, fig.cap = "PCA for the faahKO data set, un-normalized intensities." }
## Extract the features and log2 transform them
ft_ints <- log2(featureValues(xdata, value = "into"))
## Perform the PCA omitting all features with an NA in any of the
## samples. Also, the intensities are mean centered.
pc <- prcomp(t(na.omit(ft_ints)), center = TRUE)
## Plot the PCA
cols <- group_colors[xdata$sample_group]
pcSummary <- summary(pc)
plot(pc$x[, 1], pc$x[,2], pch = 21, main = "",
xlab = paste0("PC1: ", format(pcSummary$importance[2, 1] * 100,
digits = 3), " % variance"),
ylab = paste0("PC2: ", format(pcSummary$importance[2, 2] * 100,
digits = 3), " % variance"),
col = "darkgrey", bg = cols, cex = 2)
grid()
text(pc$x[, 1], pc$x[,2], labels = xdata$sample_name, col = "darkgrey",
pos = 3, cex = 2)
```
We can see the expected separation between the KO and WT samples on PC2. On PC1
samples separate based on their ID, samples with an ID <= 18 from samples with
an ID > 18. This separation might be caused by a technical bias
(e.g. measurements performed on different days/weeks) or due to biological
properties of the mice analyzed (sex, age, litter mates etc).
# Further data processing and analysis
Normalizing features' signal intensities is required, but at present not (yet)
supported in `xcms` (some methods might be added in near future). Also, for the
identification of e.g. features with significant different
intensities/abundances it is suggested to use functionality provided in other R
packages, such as Bioconductor's excellent `limma` package. To enable support
also for other packages that rely on the *old* `xcmsSet` result object, it is
possible to coerce the new `XCMSnExp` object to an `xcmsSet` object using
`xset <- as(x, "xcmsSet")`, with `x` being an `XCMSnExp` object.
# Additional details and notes
For a detailed description of the new data objects and changes/improvements
compared to the original user interface see the *new\_functionality* vignette.
## Evaluating the process history
`XCMSnExp` objects allow to capture all performed pre-processing steps along
with the used parameter class within the `@processHistory` slot. Storing also
the parameter class ensures the highest possible degree of analysis
documentation and in future might enable to *replay* analyses or parts of it.
The list of all performed preprocessings can be extracted using the
`processHistory` method.
```{r processhistory, message = FALSE }
processHistory(xdata)
```
It is also possible to extract specific processing steps by specifying its
type. Available *types* can be listed with the `processHistoryTypes`
function. Below we extract the parameter class for the alignment/retention time
adjustment step.
```{r processhistory-select, message = FALSE }
ph <- processHistory(xdata, type = "Retention time correction")
ph
```
And we can also extract the parameter class used in this preprocessing step.
```{r processhistory-param, message = FALSE }
## Access the parameter
processParam(ph[[1]])
```
## Subsetting and filtering