Adding new peak quality metrics to findChromPeaks for CentWave #685

Merged: 27 commits into sneumann:devel on Nov 28, 2023

Conversation

@wkumler (Contributor) commented Jul 31, 2023

Hi again,

I've found myself frustrated with XCMS's large number of noise peaks and have designed a few additional metrics that can be used to identify "good-looking" peaks more easily. The core idea is to compare the raw chromatogram to an "idealized" curve with varying degrees of skew; the deviation from that curve can then be used both to estimate how bell-shaped the peak is and to estimate the signal-to-noise ratio from the residuals.

[Figure: peak metrics illustrated on a single chromatogram]
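In rough terms, the metric can be sketched as follows (a minimal illustration in R rather than the PR's actual implementation; the intensity values below are made up):

# Compare a raw peak to "idealized" skewed bell curves (beta distributions)
# and derive a shape score (max correlation) plus a residual-based SNR
int <- c(5, 20, 80, 200, 340, 290, 150, 60, 18, 6)    # hypothetical EIC intensities
x <- seq(0, 1, length.out = length(int))               # scan positions scaled to [0, 1]
skews <- c(3, 3.5, 4, 4.5, 5)                          # degrees of skew to test

ideal <- sapply(skews, function(s) dbeta(x, shape1 = s, shape2 = 5))
cors <- cor(int, ideal)                                # how bell-shaped is the peak?
best <- which.max(cors)

scale01 <- function(v) (v - min(v)) / (max(v) - min(v))
noise <- sd(diff(scale01(ideal[, best]) - scale01(int)))
c(shape_cor = max(cors), snr = log10(max(int) / noise))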

The results from this seem fairly promising. On my own data, I get pretty nice separation between the peaks I've manually labelled as "Good" and "Bad", with the Pearson's r metric performing a bit better than the new SNR metric. Still, the new SNR metric seemed worth including because it uses the signal within the peak and therefore fixes the problem the old "sn" metric has when there isn't enough data outside the peak to properly estimate a baseline.

[Figures: distributions of the new metrics for manually labelled "Good" and "Bad" peaks]

When calculated as a summary statistic across multiple files, performance improves even further, and some internal testing shows that these metrics can enormously reduce the number of false positives (at the risk of a few additional false negatives). I trained a basic logistic regression model on the two new metrics alone, tested it on a second labeled dataset as validation, and obtained the following confusion matrix:

                  Labelled "Bad"   Labelled "Good"
Predicted "Bad"             1008                32
Predicted "Good"              28               209

in which the false discovery rate is reduced from ~80% to about 10%, although we lose out on 32 features that were incorrectly predicted to be "Bad" when I manually labeled them as "Good".
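For context, a minimal sketch of how such a model can be fit and evaluated in base R (train and test here are hypothetical data frames with beta_cor, beta_snr, and a manual label column - not the actual code or data used):

# Fit a logistic regression on the two new metrics using manually labelled peaks
fit <- glm(factor(label) ~ beta_cor + beta_snr, data = train, family = binomial)

# Predict on a held-out labelled dataset and build the confusion matrix
pred_prob <- predict(fit, newdata = test, type = "response")   # P(label == "Good")
pred_class <- ifelse(pred_prob > 0.5, "Good", "Bad")
table(Predicted = pred_class, Labelled = test$label)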

I originally considered implementing these as a separate function in XCMS or in my own package, but the recalculation requires going back to the raw peak data, which is quite slow. In this PR, I implemented it within the CentWave algorithm since the raw intensity and RT data are already available there, and additional columns can be added to the chromPeaks slot without any apparent difficulties. The feature is exposed in CentWaveParam via the boolean parameter "verboseBetaColumns", so named because the bell curve is made using a beta distribution and the name follows the existing "verboseColumns" syntax. The only change this really makes is that chromPeaks is now returned with two additional columns, beta_cor and beta_snr, corresponding to the Pearson's r metric and the new SNR metric, respectively.

         mz mzmin mzmax       rt    rtmin    rtmax       into       intb   maxo    sn  beta_cor beta_snr sample
CP001 453.2 453.2 453.2 2506.073 2501.378 2527.982  1007409.0  1007380.8  38152 38151 0.7810597 5.482615      1
CP002 302.0 302.0 302.0 2617.185 2595.275 2640.659   687146.6   671297.8  30552    46 0.9852819 5.865214      1
CP003 344.0 344.0 344.0 2679.783 2646.919 2709.517  5210015.9  5135916.9 152320    68 0.9240250 6.456823      1
CP004 430.1 430.1 430.1 2681.348 2639.094 2712.647  2395840.3  2299899.6  65752    42 0.9030323 6.249482      1
CP005 366.0 366.0 366.0 2679.783 2642.224 2718.907  3365174.0  3279468.3  79928    49 0.9407829 6.310853      1
CP006 343.0 343.0 343.0 2678.218 2637.529 2712.647 24147443.2 23703761.7 672064    87 0.9491984 7.286360      1

When I tested this on the faahko dataset, here are the kinds of peaks that receive low (upper row of plots) or high (lower row of plots) scores on the Pearson's r metric:

[Figure: peaks with low (top row) and high (bottom row) beta_cor scores]

and the new SNR metric:

[Figure: peaks with low (top row) and high (bottom row) beta_snr scores]

as tested with the below code:

devtools::install_github("https://github.com/wkumler/xcms")
library(xcms)
# Copied from the testthat.R file in XCMS
faahko_3_files <- c(system.file('cdf/KO/ko15.CDF', package = "faahKO"),
                    system.file('cdf/KO/ko16.CDF', package = "faahKO"),
                    system.file('cdf/KO/ko18.CDF', package = "faahKO"))
faahko_od <- readMSData(faahko_3_files, mode = "onDisk")

# Note the use of verboseBetaColumns = TRUE
cwp <- CentWaveParam(noise = 10000, snthresh = 40, prefilter = c(3, 10000), verboseBetaColumns = TRUE)
faahko_xod_beta <- findChromPeaks(faahko_od, param=cwp)

# Plot the EIC around the peak ranked row_number-th by beta_snr
# (worst-first by default, best-first with bestfirst = TRUE)
plot_chrom <- function(row_number, bestfirst=FALSE, axes=TRUE){
  cp <- chromPeaks(faahko_xod_beta)
  chosen_peak <- cp[order(cp[,"beta_snr"], decreasing = bestfirst),][row_number,]
  mzrange <- chosen_peak["mz"]+c(-0.001, 0.001)
  rtrange <- c(chosen_peak["rtmin"], chosen_peak["rtmax"])+c(-20, 20)
  file_num <- chosen_peak["sample"]
  chrom <- chromatogram(filterFile(faahko_od, file_num), mz=mzrange, rt=rtrange)
  plot(chrom, col="black", lwd=2, xlab="", ylab="", axes=axes)
  abline(v = chosen_peak["rtmin"], col="green")
  abline(v = chosen_peak["rtmax"], col="red")
  if(axes)legend("topleft", legend = paste0("sn=", chosen_peak["sn"]))
}

par(mfrow=c(2,3), mar=c(2.1, 2.1, 2.1, 0.1))
# Bad peaks avoided
plot_chrom(1)
plot_chrom(2)
plot_chrom(3)
# Good peaks approved
plot_chrom(1, bestfirst = TRUE)
plot_chrom(2, bestfirst = TRUE)
plot_chrom(3, bestfirst = TRUE)
layout(1)

# Additional demo peaks below
# par(mfrow=c(4, 5), mar=c(0, 0, 1, 0))
# sapply(1:20, plot_chrom, axes=FALSE, bestfirst=TRUE)

Of course, the additional calculations come at the cost of a slightly longer peak-picking time (but only if enabled in the CentWaveParam). In my (limited) testing across ~40 files, XCMS took an additional 15 seconds, going from ~4 minutes to ~4.2. The absolute increase will grow with more files and lower thresholds, but since my thresholds are already pretty generous, the relative increase of a few percent is probably about what others can expect as well.

[Figure: runtime comparison]

I've also implemented a few unit tests to ensure the objects are identical when the parameter is not passed and that the existing calculations are unaffected by the additional parameters, but I wasn't able to test the package comprehensively as it seems to be failing too many unit tests at the moment for unrelated reasons (running devtools::test() on a freshly cloned version of the repo fails in the same way).

Let me know if this is a PR you're interested in or if there's more I can do to help this get implemented. If you're curious about the labeling performance or interested in more details, I've got a preprint up on bioRxiv here.

@wkumler (Contributor, Author) commented Aug 1, 2023

Okay, apparently GitHub Actions is able to run the checks without problems, which I'm unable to do on my end for some reason. The last attempt produced a couple of errors/notes which may have been resolved with the above commits. Two problems were fairly simple: adding verboseBetaColumns to the isotope version of CentWave and documenting the new matrix column names in the output. I believe I caught all the places these changes needed to be made, but we'll find out when this latest set of checks finishes running.

More problematically, I wasn't able to resolve one error in the examples. The demos load an old version of an XCMSnExp object faahko_sub and then do some hacky workarounds to get it to function on other computers by overriding the path to the .cdf files in the faahKO package. Unfortunately, this means that the old object fails to pass quality checks with the new code and we get the below error in the logs.

> processParam(processHistory(faahko_sub)[[1]])
Object of class:  CentWaveParam 
Error in slot(x, name = snames[i]) : 
  no slot of name "verboseBetaColumns" for this object of class "CentWaveParam"

I don't see an easy way around this other than reconstructing the faahko_sub object with the new code. I was able to reverse-engineer the parameters for the original object with processHistory and get an object with the same chromPeaks as the original, but couldn't replicate that process for the MsExperiment equivalent faahko_sub2 so it feels dangerous to update one and not the other.

@wkumler (Contributor, Author) commented Aug 2, 2023

Update: I was able to mock up a new faahko_sub object with the new parameters, but faahko_sub2 is still out of date. However, since I didn't implement these checks for the newer object type, the testing passes and R CMD check now succeeds on all platforms. I did have to rebuild the object in a slightly hacky way since the default dirname function has been overridden. Code for this below:

init_dirs <- c("/usr/local/lib/R/host-site-library/faahKO/cdf/KO/ko15.CDF", 
               "/usr/local/lib/R/host-site-library/faahKO/cdf/KO/ko16.CDF", 
               "/usr/local/lib/R/host-site-library/faahKO/cdf/KO/ko18.CDF")
faahko_sub_files <- list.files(system.file("cdf/KO", package = "faahKO"), full.names = TRUE)[1:3]
sub_pdata <- as(data.frame(sampleNames=c("ko15.CDF", "ko16.CDF", "ko18.CDF")), "AnnotatedDataFrame")
faahko_sub_msn <- readMSData(faahko_sub_files, msLevel. = 1, pdata = sub_pdata, mode = "onDisk")
sub_cwp <- CentWaveParam(snthresh = 40, prefilter = c(3, 10000), noise = 10000)
faahko_sub <- findChromPeaks(faahko_sub_msn, sub_cwp)
# Temporarily strip the S4 class so the slot holding the original file
# paths can be overwritten, then restore the class
class(faahko_sub) <- "temp"
attr(faahko_sub, "processingData")@files <- init_dirs
class(faahko_sub) <- "XCMSnExp"
save(faahko_sub, file = "data/faahko_sub.RData")

Unit testing results:

Direct clone from https://github.com/sneumann/xcms: [ FAIL 0 | WARN 1022 | SKIP 13 | PASS 3951 ]
New implementation from https://github.com/wkumler/xcms: [ FAIL 0 | WARN 1022 | SKIP 13 | PASS 3970 ]

@sneumann (Owner) commented Aug 8, 2023

Hi @wkumler , thanks for this really well-done PR!

I think the additional runtime is well-spent.
Fixing the faahKO data after the PR should also be doable. In addition to faahKO from an older HPLC and single-quadrupole MS system, BioC also has the MTBLS2 package with data from a UPLC+QTOF system.

Have you also checked the fitgauss aspect in centWave? My gut feeling is that the Gaussian fitting might be slower but not better.
Finally: could you create a PCA on the sn, beta_cor, beta_snr and fitgauss values? I am curious about the loadings plot (or biplot) and which of the components contribute to the separation between your good/bad peaks.

Yours,
Steffen

@wkumler (Contributor, Author) commented Aug 8, 2023

The fitgauss parameter actually provides a lot more useful information than I'd expected, given that the documentation says it basically just affects the retention time of the final peak. I initially thought there was a bug because the verbose columns all returned NA, not realizing that it was because fitgauss was disabled - perhaps better documentation would help here, as well as an option to calculate and report these values without having XCMS use them. Unfortunately, enabling fitgauss also changes the number of reported peaks in an unknown way, so I wasn't able to directly reuse the labels I'd assigned. Instead, I had to extract the mz/rt bounding boxes for each mass feature in the labeled data and then search for features within each box in the new fitgauss version.

It looks like the new beta_cor parameter does perform a bit better at separating the good (feat_quality = 1) and the bad (feat_quality = 0) peaks than any of the existing metrics, but it is similar to egauss and sigma in the PCA plot. beta_snr also seems to be better aligned with the feat_quality arrow than either sn or h, likely because those two parameters are better correlated with peak height, which is essentially uncorrelated with feature quality. PC1 = ~36% variance explained, PC2 = ~20%.

[Figures: PCA scores and loadings for the labelled dataset]

The PCA above is the product of log-scaling maxo, sn, h, and sigma and then running the PCA with scale. = TRUE and center = TRUE, because the non-log-scaled version showed a large skew along the maxo axis and was less informative (I'm happy to provide it if anyone is interested). Each feature has a mean value calculated from all of its individual peaks to reduce overplotting, though the plot looks very similar when using the median instead, and the distribution is about the same when plotting every peak's values individually.
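For reference, roughly how a PCA like this can be produced in R (a sketch only; the peak_metrics data frame, its columns, and the per-feature averaging are assumptions rather than the code actually used):

library(dplyr)
# Average each metric per feature, log-scaling the strongly skewed ones first
feat_means <- peak_metrics %>%
  mutate(across(c(maxo, sn, h, sigma), log10)) %>%
  group_by(feature) %>%
  summarise(across(c(maxo, sn, h, sigma, mu, egauss,
                     beta_cor, beta_snr, feat_quality), mean))

pca <- prcomp(select(feat_means, -feature), center = TRUE, scale. = TRUE)
biplot(pca)                                  # loadings + scores in one plot
summary(pca)$importance[2, 1:2]              # proportion of variance for PC1/PC2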

Do you know what the egauss, sigma, mu, and h metrics represent? I've struggled to find documentation on any of the verbose column metrics but I'm guessing mu and sigma are the fitted values from the best Gaussian (corresponding to $b$ and $c$ in the parametric extension). h also looks like it's returned by the fitGauss function by default but I'm unable to map it to comprehensible parameters in that expression. And egauss is some awful combination of the other three according to
sqrt((1/length(td_lm)) * sum(((d1 - gauss(td_lm, pgauss$h/md, pgauss$mu, pgauss$sigma))^2))). Looks like maybe some sort of distance/error metric?

For the MTBLS2 data, after following the same log-scaling steps as above, I obtain the PCA below. Here beta_cor aligns with egauss (which is very surprising, given that they're strongly negatively correlated in the prior plot, but maybe I'm just misunderstanding PCA loadings as usual) as well as with mu and sigma, while beta_snr aligns well with h and maxo, though I don't know where the best peaks would fall in this dataset. PC1 = ~40% of the variance, PC2 = ~25% here.

[Figure: PCA of peak metrics for the MTBLS2 data]

@sneumann (Owner) commented Oct 8, 2023

Hi @jorainer , what do you think here ? Yours, Steffen

@jorainer (Collaborator) commented:

Really sorry, I wanted to read this when I had enough time - and then kept forgetting about it. Will check it today - promised ;)

@jorainer (Collaborator) left a review comment:

Thanks William @wkumler for this great PR!
I like your contribution, also because it does not interfere with the peak detection itself but rather evaluates the peaks identified by vanilla centWave. I have some comments on the code but am generally OK with adding this functionality to xcms.

Also, what would be super nice is if there were a way to calculate these metrics on EICs as well. Could you maybe add a function that takes just two numeric vectors with retention time and intensity values and calculates these scores on them? We could then use this for generic data too, independent of centWave.


# Implement a fit of a skewed gaussian (beta distribution)
# for peak shape and within-peak signal-to-noise ratio
# See [biorxiv link]
@jorainer (Collaborator) commented on this code:

Can you please provide the actual bioRxiv link/DOI here?

beta_cors <- cor(pd, beta_vals)
best_cor <- max(beta_cors)
best_curve <- beta_vals[,which.max(beta_cors)]
scale_zero_one <- function(x)(x-min(x))/(max(x)-min(x))
@jorainer (Collaborator) commented on this code:

Please move that function outside the loop - maybe even just add it as an internal .scale_zero_one function within this .R file - then it would not need to be defined over and over again for each peak.

R/DataClasses.R Outdated
#' correlation coefficient to a bell curve with several degrees of skew as well
#' as an estimate of signal-to-noise using the residuals from the best-fitting
#' bell curve. See https://github.com/sneumann/xcms/pull/685 for more
#' information.
@jorainer (Collaborator) commented on this code:

Please add the DOI of your bioRxiv preprint here as well.

@jorainer (Collaborator) commented:

@sneumann I'm OK with this. As mentioned in the review above, I would like to see a base function that returns these values given simple vectors of retention times and intensity values...

@jorainer (Collaborator) commented:

Hi @wkumler , just a friendly reminder - do you think you could address my requested changes?

@pablovgd (Contributor) commented:

Hi @wkumler, super nice work - I also really enjoyed reading the publication in BMC Bioinformatics. It's nice to see other people manually evaluating their picked peaks in samples to benchmark tools instead of relying only on analytical standard mixtures. Since I'm doing some work on extracting targeted peaks from LC-MS data, I was also planning to include some quality metrics. My guess is that the easiest way to calculate your new metrics on new rt/int data is the function qscoreCalculator()?

Furthermore - maybe a more open question - what would we practically do with the extra information these columns give us in the output? My first hunch was to combine different peak quality metrics into a single weighted score that lets the analyst go over the picked peaks and easily flag the problematic ones. That of course also has its downsides, and choosing the weights would be quite subjective...

Or maybe I should also train a logistic regression model on the values of these metrics for analytical standards and/or QC samples of said targeted peaks, and then let it classify the peaks in samples?

Love to hear your thoughts.

Cheers,
Pablo

@jorainer (Collaborator) commented:

Thanks for the updates @wkumler - and I really like the discussions happening here :) - and I agree, here we just want to add tools to get some more information on individual peaks. How these metrics are then used to discriminate between good and bad peaks will be heavily data set (and instrumentation/column) dependent and should ideally go into a different package focused on exactly that.

Please re-request my review once you're OK with your commits.

@wkumler (Contributor, Author) commented Nov 21, 2023

Okay, I think I'm satisfied with the PR now. Most of the NAs being introduced were due to peak filling, not any flaw in the algorithm itself, so I added the calculation to the peak filling step as well, applied whenever the beta_cor and beta_snr columns exist in the chromPeaks slot. However, this didn't seem to improve the model's performance very much because the recalculated values were consistently lower and more variable than the ones calculated during the initial peakpicking step. I've kept the code in place because I think it could still be valuable for other setups and because it's easy enough to determine which values were filled in using the chromPeakData function.

@jorainer (Collaborator) commented:

Regarding the differences of the betas from the initial and the gap-filled data - I guess that has to do with .rawMat reporting only non-missing intensities. Say, for a certain m/z range and retention time there is simply no measured intensity. Could that be an explanation? If so, maybe add a note to the .get_beta_values function clarifying this.

Maybe a quick example what I mean: assume the intensities for an EIC would be like this:

ints <- c(12, NA, 143, 2000, 45, 8)

i.e. for one spectrum no intensity is measured within the m/z range. The .rawMat function (as I understand it) would however return ints <- c(12, 143, 2000, 45, 8), i.e. drop the missing values. The correlation of that to a Gaussian curve might now be lower.

Does this make sense @wkumler ? Should we make sure to use a value for each spectrum (retention time) as an input to your .get_beta_values function?

@jorainer (Collaborator) commented:

Note: I had a look at .rawMat that is used in the gap filling for XCMSnExp objects. No way that we could "rescue" 0-intensity or missing peaks. So, I guess we need to live with the fact that gap-filled betas are worse.

However, another consideration: the getEIC C function (which is used within centWave) returns 0-intensity peaks if, for a certain spectrum, no peak was measured within the m/z of the ROI. This will result in intensity vectors like e.g. d <- c(0, 13, 0, 24, 50, 30, 5, 1). Again, this will have an impact on the cor function you are using. I would suggest, within your .get_beta_values function, replacing 0 intensities with NA and running cor with use = "pairwise.complete.obs". Open for discussion.
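To illustrate the suggestion (a minimal sketch, not code from the PR):

# Intensities as returned by getEIC, with 0s where no peak was measured
d <- c(0, 13, 0, 24, 50, 30, 5, 1)
d[d == 0] <- NA                                   # treat 0-intensity scans as missing
x <- seq(0, 1, length.out = length(d))
ideal <- dbeta(x, shape1 = 5, shape2 = 5)         # one candidate bell curve
cor(d, ideal, use = "pairwise.complete.obs")      # NAs are dropped pairwise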

@jorainer (Collaborator) left a review comment:

Thanks for the update, and also for adding the code to the gap filling! Really great! Thanks @wkumler!

It might however be that we have some issues because of 0 intensities that we should address, I feel. Happy to discuss.

Review comments on R/do_findChromPeaks-functions.R (resolved).
@wkumler (Contributor, Author) commented Nov 23, 2023

@jorainer Thanks for the feedback! I'm also not sure how best to handle "missing" scans - that was part of a discussion with my collaborators that never really got resolved to my satisfaction. For something like shape-based testing, I don't think replacing the missing values with zeroes is the right move because, from what I've seen, those "missing" values are much better approximated by linear interpolation between the two nearest points. So replacing the zeroes we get from .rawMat with NAs is a reasonable thing to do.

When I first wrote the function for use with the re-extracted data, I actually zero-one scaled the retention times as well, then calculated the beta distribution values at the corresponding retention times. So a peak with scan RTs of c(8.0, 8.2, 8.4, 8.45, 8.5) would become c(0, 0.4, 0.8, 0.9, 1.0), which can then be passed directly to dbeta (e.g. dbeta(c(0, 0.4, 0.8, 0.9, 1.0), shape1 = 5, shape2 = 5)) to remove the assumption of equal spacing on the RT axis. This still feels like the best way to handle missing scans, but unfortunately I wasn't able to figure out which variables contained the associated RT data for each peak so I wasn't able to implement this approach in the PR. If I remember correctly, this would be easier (but still costly) to do in the original peakpicking and much more difficult during peak filling, when I didn't see a variable for the scan RTs at all. On the plus side though, this new beta calculation seems to perform slightly better than the original one in distinguishing good peaks from bad, likely because peaks with many missing scans tended to be lower quality in my dataset (and others?). I can provide that data and code if you're curious but don't have it to hand at the moment.

For the filled peaks, I'm not actually surprised that they tend to be lower quality. Presumably if they were higher quality, centWave would have picked them up the first time through, so we have a selection bias towards lower-quality peaks when we forcibly fill in all the ones missed by the original peakpicking. I think the better solution there would be to use a more careful summary statistic when calculating a per-feature peak shape statistic instead of the overall median. For example, if a peak only appeared in 3/30 files but was very high quality in those files, then the original picker would return high correlation values for just those 3 peaks and NA for the remainder (because no peak was detected), so a median/mean correlation value across all the files would be either NA or quite high. When additional beta values are calculated during the peak filling step and we still take the mean/median, we're suddenly looking at a much lower average because in 27/30 files there are data points there, corresponding either to background signal or a lower-quality peak. So I'd argue that this is a downstream problem for the later feature extraction - perhaps there need to be both a "median original peak correlation" and a "median filled peak correlation", where large differences between the two imply high-quality peaks in only a few files, while similar values indicate that there are other good peaks that were simply missed.

@jorainer (Collaborator) commented:

Thanks for the reply. Yes, I know, getting hold of retention times in the original centWave code is very tricky. I like your idea of scaling provided retention times and I think it would be best if your function would actually consider them (also thinking of possible future use cases of the function).

I would maybe then suggest the following change to your function:

.get_beta_values <- function(intensity, rtime = seq_along(intensity), skews=c(3, 3.5, 4, 4.5, 5), zero.rm = TRUE) {
    if (zero.rm) {
        ## remove 0 or NA intensities
        keep <- which(intensity > 0)
        intensity <- intensity[keep]
        rtime <- rtime[keep]
    }
    if(length(intensity)<5){
      best_cor <- NA
      beta_snr <- NA
    } else {
      beta_sequence <- rep(.scale_zero_one(rtime), each=5)
      beta_vals <- t(matrix(dbeta(beta_sequence, shape1 = skews, shape2 = 5), nrow = 5))
      # matplot(beta_vals)
      beta_cors <- cor(intensity, beta_vals)
      best_cor <- max(beta_cors)
      best_curve <- beta_vals[, which.max(beta_cors)]
      noise_level <- sd(diff(.scale_zero_one(best_curve)-.scale_zero_one(intensity)))
      beta_snr <- log10(max(intensity)/noise_level)
    }
    c(best_cor=best_cor, beta_snr=beta_snr)
}

So, essentially my suggested changes:

  • rename the parameter eic_int to intensity - to use a name consistent with other functions.
  • add a parameter rtime that allows retention times to be provided; if not provided, equally spaced intensity values are assumed.
  • allow removing zero-intensity (and NA) values; I don't think it makes sense to keep them for the correlation calculation.
  • return a numeric vector of length 2 instead of a list.

This function would allow passing actual retention times (when available, such as during peak filling or re-integration of the data) to calculate formally correct betas. During peak detection it would be run with only the intensities. Removing 0 or NA values is, I think, an important step that definitely should be implemented. Yes, a peak with many missing values is less reliable, but IMHO this should not influence the present calculation. That should be a different score; mixing it in here would conflate two different things (Gaussian peak shape and number of missing values) into the same score.

In the end, this same function could also be applied to the data from an extracted ion chromatogram (i.e. on a MChromatogram object) to evaluate peak shapes.

Would you be OK making these additional changes @wkumler ?

@wkumler (Contributor, Author) commented Nov 24, 2023

Looks good. I've implemented this function and updated the unit tests & documentation to reflect the output changing from a list to a vector. I made a few changes, though. First, the function as written will introduce NAs because the default for rtime is lazily evaluated - its value is only calculated the first time rtime is used. If we calculate keep on the full vector, subset intensity, and only then evaluate rtime as seq_along(intensity), rtime will have the length of the new intensity vector, not the original. So I just moved the intensity subsetting after the rtime subsetting. Second, I fixed a bug I hadn't realized was present: when I changed skews to an argument of arbitrary length instead of hard-coding it, I didn't update the size of the matrix to match (nrow = 5 instead of nrow = length(skews)).
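For reference, a sketch of what the reordered and corrected function might look like (the exact code that was merged may differ slightly):

.scale_zero_one <- function(x) (x - min(x)) / (max(x) - min(x))

.get_beta_values <- function(intensity, rtime = seq_along(intensity),
                             skews = c(3, 3.5, 4, 4.5, 5), zero.rm = TRUE) {
    if (zero.rm) {
        keep <- which(intensity > 0)
        ## Subset rtime before intensity: the default seq_along(intensity) is
        ## evaluated lazily, so it must be forced while intensity is still full-length
        rtime <- rtime[keep]
        intensity <- intensity[keep]
    }
    if (length(intensity) < 5) {
        best_cor <- NA
        beta_snr <- NA
    } else {
        ## Use length(skews) rather than a hard-coded 5 when building the matrix
        beta_sequence <- rep(.scale_zero_one(rtime), each = length(skews))
        beta_vals <- t(matrix(dbeta(beta_sequence, shape1 = skews, shape2 = 5),
                              nrow = length(skews)))
        beta_cors <- cor(intensity, beta_vals)
        best_cor <- max(beta_cors)
        best_curve <- beta_vals[, which.max(beta_cors)]
        noise_level <- sd(diff(.scale_zero_one(best_curve) - .scale_zero_one(intensity)))
        beta_snr <- log10(max(intensity) / noise_level)
    }
    c(best_cor = best_cor, beta_snr = beta_snr)
}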

@jorainer (Collaborator) left a review comment:

Looks good to me. Thanks a lot @wkumler !

@jorainer (Collaborator) commented:

Thanks William! Really great contribution! @sneumann - from my side I would be OK to merge.

@sneumann (Owner) commented:

Hi, thanks everyone for your efforts and patience. Yours, Steffen

@sneumann sneumann merged commit fe214c9 into sneumann:devel Nov 28, 2023
3 checks passed
@pablovgd (Contributor) commented:

Hi all, I've been testing the new functionality and have a (maybe obvious) question. After using the standard xcms workflow for peak detection with centWave, correspondence, etc., I can access the peaks using chromPeaks, which also includes the new beta_snr and beta_cor values for all the peaks. However, is there an easy way to find out which features are linked to which chromatographic peaks so I can inspect their beta_snr, beta_cor, etc. values?
Thanks a lot!

Kind regards,
Pablo

@wkumler (Contributor, Author) commented Dec 12, 2023

Here's what I do to turn the S4 object into a flat file peak list using chromPeaks and featureDefinitions by joining on the peakidx column. Not sure if there's an easier way but the joins here aren't too bad.

library(tidyverse)
msnexp_filled <- readRDS(paste0(output_folder, "msnexp_filled.rds"))
peak_data_long <- msnexp_filled %>%
  chromPeaks() %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  mutate(peakidx=row_number())
peak_data <- msnexp_filled %>%
  featureDefinitions() %>%
  as.data.frame() %>%
  select(mzmed, rtmed, npeaks, peakidx) %>%
  rownames_to_column("id") %>%
  unnest_longer(peakidx) %>%
  rename_with(~paste0("feat_", .x), .cols = -peakidx) %>%
  dplyr::rename(feature="feat_id") %>%
  left_join(peak_data_long, by = join_by(peakidx))

@pablovgd (Contributor) commented:

Thanks @wkumler , in the meantime, I have found some other (hopefully) interesting behaviour to report on.

I'm currently testing xcms preprocessing on LC-MS data from DNA adducts for a colleague, to see whether the centWave algorithm manages to find the targets they are interested in and whether I can optimize the parameters for untargeted peak detection that is as good as possible. After playing a bit with the parameters, I managed to retrieve all 10 targeted peaks within the untargeted data. So far so good, but because the peak quality of the targets isn't that great, I had to reduce the noise threshold etc., which of course leads to more false positive peaks in the final untargeted feature table. So, a great opportunity to test out the beta_cor and beta_snr quality metrics 😉

Based on their feature chromatogram plots, all 10 targeted peaks were fairly good looking and would definitely be included by our analysts if they encountered them in manual targeted analysis. For beta_snr, all observations had a value over 6.4, so if I were to look at more peaks, I could consider something around 5.5-6 as a cutoff for this dataset. For beta_cor, most peaks also had quite high values (mean 0.7651, 3rd Qu. 0.8934, max 0.9893), but one peak scored only 0.4396. When looking at the feature chromatogram of this peak, it actually looked quite OK:

[Figure: feature chromatogram of the peak with beta_cor = 0.4396]

So I extracted the intensity and retention time vectors and calculated the scores again using the qscoreCalculator function. Because the int vector had some NAs (5 of the 42 values), I changed those to zero to allow the function to work. This resulted in a big increase for both metrics, better representing the actual good peak shape:

$SNR
[1] 13.13252

$peak_cor
[1] 0.7370644

So probably the findChromPeaks implementation handles NAs in a different way? Or penalizes the peak for them in some way (which in some cases might be a good thing, I guess). Any idea how we could improve this? At least based on this example, I think it could increase the performance of your new metrics.

These were the parameters for centWave:

cwp <- CentWaveParam(ppm = 10, peakwidth = c(10,80), integrate = 2, snthresh = 0, prefilter = c(3,10e4), extendLengthMSW = T, verboseColumns = T,  verboseBetaColumns = TRUE )

xdata_S0095B <- findChromPeaks(data_S0095B, param = cwp)

Hope my explanation is clear, let me know if you need more code/files if you want to replicate this.

Cheers & enjoy the holidays,
Pablo

@wkumler (Contributor, Author) commented Dec 21, 2023

@pablovgd That's correct - qscoreCalculator does have a slightly different implementation from the one now in the findChromPeaks algorithm, specifically with respect to missing values. As @jorainer and I discuss above regarding the zero-one scaling, this is largely due to the difficulty of extracting the retention time vector alongside the intensity vector. When both are available, the exact positions of the intensity values can be passed to the beta distribution, but without the retention times we're forced to assume they're equally spaced, which is generally a good but not an excellent assumption.

I'll also note that these metrics tend to perform much better when calculated as a mean/median across multiple files for a given feature instead of a single peak. Below I've attached several images showing the distribution of beta_cor and beta_snr in a couple ways for a subset of pooled QC samples from my own data:

Here's the "feature-based" mindset with a partially reviewed dataset. Manually reviewed features are shown by color, while model-predicted classes are shown as peak shape. You can see the line that the simple two-feature logistic regression algorithm draws in this space from the middle-top down to the bottom right based on how the shapes change above/below that line, as well as the few peaks that were either misclassified or were outliers in the model.

[Figure: feature-level beta_cor vs. beta_snr, colored by manual label and shaped by predicted class]

Here's what the peak-based metrics look like for just the "Good" and "Bad" peaks - I typically construct these plots for every dataset because the thresholds are difficult to determine a priori and I favor the trained model, but you may get some guidance from them nonetheless. Points are colored by feature number, and you can see how peaks from the same feature tend to group closely in this space, with six dots (one from each mzML file) of the same color clustering together.

[Figure: peak-level beta_cor vs. beta_snr for "Good" and "Bad" peaks, colored by feature]

If I were to use a beta_cor threshold alone for this dataset, I'd probably settle on a value of 0.8 or higher for the median feature beta_cor. That helps to ignore some of the single-sample outliers in the 0.5-0.75 range. Are you often working with individual samples, or do you typically have additional samples / replicates / pooled QC samples that would allow you to view things in feature space instead of peak space? I certainly agree that the algorithm could use some future refinement, and hopefully @jorainer's suggestion to implement it as a single function makes tinkering easier.
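As a sketch of how such a median-based cutoff could be applied using the flattened peak_data table from my earlier snippet (the 0.8 value is purely illustrative):

library(dplyr)
# Median metric per feature across files, then keep features above the cutoff
good_features <- peak_data %>%
  group_by(feature) %>%
  summarise(med_beta_cor = median(beta_cor, na.rm = TRUE),
            med_beta_snr = median(beta_snr, na.rm = TRUE)) %>%
  filter(med_beta_cor >= 0.8)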

@v-v1150n commented Apr 15, 2024

Sorry, I am a rookie at using xcms. I am currently looking for a way to check peak quality, and @jorainer recommended this issue for reference. At present, it seems that peak quality can be checked through two indicators: beta_cor and beta_snr. I tried to use CentWaveParam with verboseBetaColumns = TRUE, but an error occurred.

Error in CentWaveParam(ppm = 10, peakwidth = c(10, 80), integrate = 2,  : 
  unused argument (verboseBetaColumns = TRUE)

However, the packages I use have been updated to the latest versions. I don't know if there is a way around this, or other ways to check peak quality. In addition, my data come from GC-MS; I am not sure whether this will affect the quality check. Thank you.

@sneumann (Owner) commented:

Hi @v-v1150n, welcome here, let's try to help. Since R and Bioconductor have some opinions about versioning, let's double-check first that you have the correct versions. What is the output of the command sessionInfo() after loading xcms? Yours, Steffen

@v-v1150n commented:

Hi @sneumann, here is the version information for my packages:

> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Taipei
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3        forcats_1.0.0          stringr_1.5.1          dplyr_1.1.4           
 [5] purrr_1.0.2            readr_2.1.5            tidyr_1.3.1            tibble_3.2.1          
 [9] ggplot2_3.5.0          tidyverse_2.0.0        Matrix_1.6-5           faahKO_1.42.0         
[13] MsExperiment_1.4.0     MsBackendMsp_1.6.0     Spectra_1.12.0         IPO_1.28.0            
[17] CAMERA_1.58.0          rsm_2.10.4             MetaboAnnotation_1.6.1 xcms_4.0.2            
[21] MSnbase_2.28.1         ProtGenerics_1.34.0    S4Vectors_0.40.2       mzR_2.36.0            
[25] Rcpp_1.0.12            Biobase_2.62.0         BiocGenerics_0.48.1    BiocParallel_1.36.0   

loaded via a namespace (and not attached):
  [1] splines_4.3.3                 later_1.3.2                   bitops_1.0-7                 
  [4] filelock_1.0.3                preprocessCore_1.64.0         graph_1.80.0                 
  [7] XML_3.99-0.16.1               rpart_4.1.23                  lifecycle_1.0.4              
 [10] doParallel_1.0.17             lattice_0.22-6                MASS_7.3-60.0.1              
 [13] MultiAssayExperiment_1.28.0   backports_1.4.1               magrittr_2.0.3               
 [16] limma_3.58.1                  Hmisc_5.1-2                   rmarkdown_2.26               
 [19] yaml_2.3.8                    httpuv_1.6.15                 MsCoreUtils_1.14.1           
 [22] DBI_1.2.2                     RColorBrewer_1.1-3            abind_1.4-5                  
 [25] zlibbioc_1.48.2               GenomicRanges_1.54.1          AnnotationFilter_1.26.0      
 [28] RCurl_1.98-1.14               nnet_7.3-19                   rappdirs_0.3.3               
 [31] GenomeInfoDbData_1.2.11       IRanges_2.36.0                ncdf4_1.22                   
 [34] codetools_0.2-20              CompoundDb_1.6.0              DelayedArray_0.28.0          
 [37] DT_0.33                       xml2_1.3.6                    tidyselect_1.2.1             
 [40] matrixStats_1.3.0             BiocFileCache_2.10.2          base64enc_0.1-3              
 [43] jsonlite_1.8.8                multtest_2.58.0               Formula_1.2-5                
 [46] survival_3.5-8                iterators_1.0.14              foreach_1.5.2                
 [49] tools_4.3.3                   progress_1.2.3                glue_1.7.0                   
 [52] gridExtra_2.3                 SparseArray_1.2.4             xfun_0.43                    
 [55] MatrixGenerics_1.14.0         GenomeInfoDb_1.38.8           withr_3.0.0                  
 [58] BiocManager_1.30.22           fastmap_1.1.1                 fansi_1.0.6                  
 [61] digest_0.6.35                 timechange_0.3.0              R6_2.5.1                     
 [64] mime_0.12                     colorspace_2.1-0              rsvg_2.6.0                   
 [67] RSQLite_2.3.6                 utf8_1.2.4                    generics_0.1.3               
 [70] data.table_1.15.4             robustbase_0.99-2             prettyunits_1.2.0            
 [73] httr_1.4.7                    htmlwidgets_1.6.4             S4Arrays_1.2.1               
 [76] pkgconfig_2.0.3               gtable_0.3.4                  blob_1.2.4                   
 [79] impute_1.76.0                 MassSpecWavelet_1.68.0        XVector_0.42.0               
 [82] htmltools_0.5.8.1             RBGL_1.78.0                   MALDIquant_1.22.2            
 [85] clue_0.3-65                   scales_1.3.0                  png_0.1-8                    
 [88] knitr_1.45                    MetaboCoreUtils_1.10.0        rstudioapi_0.16.0            
 [91] tzdb_0.4.0                    rjson_0.2.21                  checkmate_2.3.1              
 [94] curl_5.2.1                    cachem_1.0.8                  BiocVersion_3.18.1           
 [97] parallel_4.3.3                foreign_0.8-86                AnnotationDbi_1.64.1         
[100] mzID_1.40.0                   vsn_3.70.0                    pillar_1.9.0                 
[103] grid_4.3.3                    vctrs_0.6.5                   MsFeatures_1.10.0            
[106] RANN_2.6.1                    pcaMethods_1.94.0             promises_1.3.0               
[109] dbplyr_2.5.0                  xtable_1.8-4                  cluster_2.1.6                
[112] htmlTable_2.4.2               evaluate_0.23                 cli_3.6.2                    
[115] compiler_4.3.3                rlang_1.1.3                   crayon_1.5.2                 
[118] QFeatures_1.12.0              ChemmineR_3.54.0              affy_1.80.0                  
[121] plyr_1.8.9                    fs_1.6.3                      stringi_1.8.3                
[124] munsell_0.5.1                 Biostrings_2.70.3             lazyeval_0.2.2               
[127] hms_1.1.3                     bit64_4.0.5                   KEGGREST_1.42.0              
[130] statmod_1.5.0                 shiny_1.8.1.1                 SummarizedExperiment_1.32.0  
[133] interactiveDisplayBase_1.40.0 AnnotationHub_3.10.1          igraph_2.0.3                 
[136] memoise_2.0.1                 affyio_1.72.0                 DEoptimR_1.1-3               
[139] bit_4.0.5                    

@pablovgd (Contributor) commented:

Hi @v-v1150n ,

I might be wrong, but I think the quality metrics are currently only implemented in the development version of xcms: https://www.bioconductor.org/packages/devel/bioc/html/xcms.html

You'll need Bioconductor version 3.19 and R version 4.4.0, see also: https://www.bioconductor.org/install/

Hope this helps!

@wkumler (Contributor, Author) commented Apr 16, 2024

Yep, can confirm that verboseBetaColumns went live in xcms 4.1.3 according to the NEWS. Easiest way to get it is probably via devtools with devtools::install_github("https://github.com/sneumann/xcms") if you don't want to fiddle with multiple R versions, though there may be issues with dependencies if going the devtools route instead of waiting for the stable 3.19 version to drop.

Also, as some shameless self-promotion I've been working on wrapping a lot of these same metrics into a package for visualization and labeling over at https://github.com/wkumler/squallms. The package won't be on Bioconductor until the 3.20 update this fall but it's at a stable state and I'm actively collecting feedback from others if you want to give it a try!

@pablovgd (Contributor) commented:

@wkumler nice, looks good! I'll give it a try. I also implemented beta_cor and the updated SNR in a tool I'm working on for targeted peak extraction, and overall they align quite well with our analysts' peak quality labels!

@v-v1150n commented:

I upgraded R and updated Bioconductor and can now use these two metrics (beta_cor and beta_snr).
I am still looking for a more convenient way to check peak quality and tune the parameters, and would like to try other approaches if there are any.
I will try the method provided by @wkumler first - thank you for your help!
