Adding new peak quality metrics to findChromPeaks for CentWave #685
Okay, apparently GitHub Actions is able to run the checks without problems, which I'm unable to do on my end for some reason. The last attempt produced a couple of errors/notes which may have been resolved with the above commits. Two problems were fairly simple - adding the … More problematically, I wasn't able to resolve one error in the examples. The demos load an old version of an XCMSnExp object.

I don't see an easy way around this other than reconstructing the object. The prior attempt was done without redefining the object itself because I hadn't used the dev version coded here.

Update: I was able to mock up a new XCMSnExp object.

Unit testing results, direct clone from https://github.com/sneumann/xcms: [ FAIL 0 | WARN 1022 | SKIP 13 | PASS 3951 ]
Hi @wkumler , thanks for this really well-done PR! I think the additional runtime is well spent. Have you also checked the fitgauss aspect in centWave? My gut feeling is that the gauss fitting might be slower but not better. Yours, Steffen
The … It looks like the new … The PCA above is the product of log-scaling … Do you know what the … For the MTBLS2 data I obtain the following PCA after following the same log-scaling steps above, making …
Hi @jorainer , what do you think here? Yours, Steffen
Really sorry, I wanted to read this when I had enough time - and then always forgot about it. Will check it today - promised ;)
Thanks William @wkumler for this great PR!
I like your contribution, also because it does not interfere with the peak detection itself, but evaluates the peaks identified by vanilla centWave. I have some comments on the code but am generally OK with adding this functionality to xcms.
Also, what would be super nice is if there was a way to calculate these metrics also on EICs. Could you maybe add a function that takes just two numeric vectors with retention time and intensity values and calculates these scores on them? We could then use this also for generic data, independent of centWave.
R/do_findChromPeaks-functions.R
Outdated
```r
# Implement a fit of a skewed gaussian (beta distribution)
# for peak shape and within-peak signal-to-noise ratio
# See [biorxiv link]
```
Can you please provide the actual biorxiv link/doi here?
R/do_findChromPeaks-functions.R
Outdated
```r
beta_cors <- cor(pd, beta_vals)
best_cor <- max(beta_cors)
best_curve <- beta_vals[, which.max(beta_cors)]
scale_zero_one <- function(x) (x - min(x)) / (max(x) - min(x))
```
Please move that function outside the loop - maybe even just add it as an internal .scale_zero_one function within this .R file - then it would not need to be defined over and over again for each peak.
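For illustration, here is the same zero-one scaling helper written once at top level (a Python sketch mirroring the R snippet above; the name `scale_zero_one` is taken from that snippet):

```python
def scale_zero_one(x):
    """Linearly rescale a sequence onto [0, 1], mirroring the R expression
    (x - min(x)) / (max(x) - min(x)) from the review snippet above."""
    lo, hi = min(x), max(x)
    if hi == lo:
        raise ValueError("cannot scale a constant sequence")
    return [(v - lo) / (hi - lo) for v in x]
```

Defining the helper once at module/file scope, rather than inside the per-peak loop, avoids re-creating the function object for every detected peak.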
R/DataClasses.R
Outdated
```r
#' correlation coefficient to a bell curve with several degrees of skew as well
#' as an estimate of signal-to-noise using the residuals from the best-fitting
#' bell curve. See https://github.com/sneumann/xcms/pull/685 for more
#' information.
```
Please add the doi of your biorxiv preprint.
@sneumann I'm OK with this. As mentioned in the review above, I would like to see a base function that returns these values when provided simple vectors of retention times and intensity values...
Hi @wkumler , just a friendly reminder - do you think you could address my requested changes?
Hi @wkumler , super nice work, also really enjoyed reading the publication in BMC Bioinformatics. Nice to see other people who are also manually evaluating their picked peaks in samples to benchmark tools, instead of relying only on analytical standard mixtures. Since I'm doing some stuff with extracting targeted peaks from LC-MS data, I was also planning on including some quality metrics. My guess is that the easiest way to calculate your new metrics on new rt/int data is the function qscoreCalculator()? Furthermore, maybe a more open question: what would we practically do with the extra info that these columns give us in the output? My first hunch was to combine different peak quality metrics into a weighted single-number score, which would allow the analyst to go over the picked peaks and detect the problematic ones in an easy way. That of course also has its downsides, and calculating a weighted score will be quite subjective... Or maybe I should train a logistic regression model on the values of these metrics for analytical standards and/or QC samples of said targeted peaks? And then let it classify the peaks in samples? Love to hear your thoughts. Cheers,
Thanks for the updates @wkumler - and I really like the discussions happening here :) - and I agree, here we just want to add tools to get some more information on individual peaks. How these metrics are then used to discriminate between good and bad peaks will be heavily data set (and instrumentation/column) dependent and should ideally go into a different package focused on exactly that. Please re-request my review once you're OK with your commits.
(Just for testing on GH Actions over the weekend)
Okay, I think I'm satisfied with the PR now. Most of the NAs being introduced were due to peak filling, not due to any flaw in the algorithm itself, so I added the calculation to the peak filling step as well when the verboseBetaColumns parameter is enabled.
Regarding the differences of the betas between the initial and the gap-filled data - I guess that has to do with … Maybe a quick example of what I mean: assume the intensities for an EIC would be like this: `ints <- c(12, NA, 143, 2000, 45, 8)`, i.e. for one spectrum no intensity is measured within the m/z range. Does this make sense @wkumler? Should we make sure to use a value for each spectrum (retention time) as an input to your function?
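To make the trade-off discussed here concrete, a small Python sketch (function name hypothetical) contrasting zero-filling with linear interpolation for the `ints <- c(12, NA, 143, 2000, 45, 8)` example; interpolation places the missing value between its measured neighbors rather than dragging the trace down to zero:

```python
def fill_missing(ints, mode="interpolate"):
    """Replace None entries (missing scans) either with 0.0 ("zero") or by
    linear interpolation between the nearest measured neighbors
    ("interpolate"). Assumes gaps are interior, not at the ends."""
    out = list(ints)
    for i, v in enumerate(out):
        if v is not None:
            continue
        if mode == "zero":
            out[i] = 0.0
        else:
            # nearest measured (or already-filled) neighbor on each side
            lo = next(j for j in range(i - 1, -1, -1) if out[j] is not None)
            hi = next(j for j in range(i + 1, len(out)) if out[j] is not None)
            frac = (i - lo) / (hi - lo)
            out[i] = out[lo] + frac * (out[hi] - out[lo])
    return out
```

For the example above, interpolation yields 77.5 (halfway between 12 and 143) where zero-filling would introduce an artificial dip in the peak shape.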
Note: I had a look at … However, another consideration: the …
Thanks for the update, and also for adding the code to the gap filling! Really great! Thanks @wkumler !
It might however be that we have some issues because of 0 intensities that we should address, I feel. Happy to discuss.
@jorainer Thanks for the feedback! I'm also not sure how best to handle "missing" scans, and that was part of a discussion I had with my collaborators that didn't really get resolved to my satisfaction. For something like shape-based testing I don't think replacing the values with zeroes is the right move, because from what I've seen the "missing" values are much more likely to be better approximated by linear interpolation between the two nearest points. So replacing the zeroes we get from .rawMat with NAs is a reasonable thing.

When I first wrote the function for use with the re-extracted data, I actually zero-one scaled the retention times as well, then calculated the beta distribution values at the corresponding retention times. So a peak with scan RTs of c(8.0, 8.2, 8.4, 8.45, 8.5) would become c(0, 0.4, 0.8, 0.9, 1.0), which can then be passed directly to the beta density calculation.

For the filled peaks, I'm not actually surprised that they tend to be lower quality. Presumably if they were higher quality, centWave would've picked them up the first time through, so we have a selection bias towards lower quality peaks when we're forcibly filling in all the ones missed by the original peakpicking. I think the better solution there would be to use a more careful summary statistic when calculating a net feature peak shape statistic instead of the overall median. For example, if a peak only appeared in 3/30 files but was very high quality in those files, then the original picker would return high correlation values for just those 3 peaks and return NA for the remainder (because there was no peak detected), so a median/mean correlation value across all the files would be either NA or quite high. When additional beta values are calculated during the peak filling step and we still calculate the mean/median, we're suddenly looking at a much lower average, because in 27/30 files the data points there correspond either to background signal or a lower-quality peak.

So I'd argue that this is a downstream problem for the later feature extraction - perhaps there needs to be both a "median original peak correlation" and a "median filled peak correlation", where large differences between the two imply high-quality peaks in only a few files, while similar values indicate that there are other good peaks there that were just missed.
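The two-summary-statistic idea could be sketched roughly as follows (a Python sketch with a hypothetical data layout, not xcms code): compute a separate median correlation for originally detected versus gap-filled peaks of each feature:

```python
from statistics import median

def split_medians(peaks):
    """peaks: iterable of (feature_id, beta_cor, is_filled) tuples.
    Returns {feature_id: (median_original_cor, median_filled_cor)};
    a side with no peaks yields None (analogous to NA in R)."""
    by_feature = {}
    for fid, cor, filled in peaks:
        # index 0 collects originally detected peaks, index 1 gap-filled ones
        by_feature.setdefault(fid, ([], []))[1 if filled else 0].append(cor)
    return {
        fid: tuple(median(side) if side else None for side in sides)
        for fid, sides in by_feature.items()
    }
```

A large gap between the two medians would then flag features that look great in a few files but were force-filled elsewhere.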
Thanks for the reply. Yes, I know, getting hold of retention times in the original centWave code is very tricky. I like your idea of scaling provided retention times and I think it would be best if your function actually considered them (also thinking of possible future use cases of the function). I would maybe then suggest the following change to your function:

```r
.get_beta_values <- function(intensity, rtime = seq_along(intensity),
                             skews = c(3, 3.5, 4, 4.5, 5), zero.rm = TRUE) {
    if (zero.rm) {
        ## remove 0 or NA intensities
        keep <- which(intensity > 0)
        intensity <- intensity[keep]
        rtime <- rtime[keep]
    }
    if (length(intensity) < 5) {
        best_cor <- NA
        beta_snr <- NA
    } else {
        beta_sequence <- rep(.scale_zero_one(rtime), each = 5)
        beta_vals <- t(matrix(dbeta(beta_sequence, shape1 = skews, shape2 = 5),
                              nrow = 5))
        # matplot(beta_vals)
        beta_cors <- cor(intensity, beta_vals)
        best_cor <- max(beta_cors)
        best_curve <- beta_vals[, which.max(beta_cors)]
        noise_level <- sd(diff(.scale_zero_one(best_curve) -
                               .scale_zero_one(intensity)))
        beta_snr <- log10(max(intensity) / noise_level)
    }
    c(best_cor = best_cor, beta_snr = beta_snr)
}
```

So, essentially my suggested changes:
This function would allow passing actual retention times (when available, such as during peak filling or re-integration of the data) to calculate formally correct betas. During peak detection it would be run providing only the intensities. Removing 0s or NA values is I think an important step that definitely should be implemented. Yes, a peak with many missing values is less reliable, but this should IMHO not influence the present calculation. That should be a different score; otherwise we would mix two different things (gaussian peak shape and number of missing values) into the same score. In the end, this same function could also be applied to the data from an extracted ion chromatogram (i.e. on a `Chromatogram` object). Would you be OK making these additional changes @wkumler?
Looks good. I've implemented this function and updated the unit tests & documentation to reflect the output changing from a list to a vector. I made a few changes though. First, the function as written will introduce NAs because providing …
Looks good to me. Thanks a lot @wkumler !
Thanks William! Really great contribution! @sneumann - from my side I would be OK to merge.
Hi, thanks everyone for your efforts and patience. Yours, Steffen
Hi all, I've been testing the new functionality and have a (maybe obvious) question. So after using the standard xcms workflow for peak detection using centWave and correspondence etc., I can access the peaks using … Kind regards,
Here's what I do to turn the S4 object into a flat file peak list using …
Thanks @wkumler , in the meantime I have found some other (hopefully) interesting behaviour to report on. I'm currently testing xcms preprocessing on LC-MS data from DNA adducts for a colleague, to see if the centWave algorithm manages to find the targets that they are interested in and if I can optimize the parameters for as good as possible untargeted peak detection. After playing a bit with the parameters, I managed to retrieve all 10 targeted peaks within the untargeted data. So far so good, but because the peak quality of the targets isn't that great, I had to reduce the noise threshold etc., which of course leads to more false positive peaks in the final untargeted feature table. So, a great opportunity to test out the … Based on their feature chromatogram plot, all 10 targeted peaks were fairly good looking and definitely would be included by our analysts if they encountered them in manual targeted analysis. For … So I extracted the intensity and retention time and calculated the scores again using the …

So probably the way the scores are calculated within the … These were the parameters for centWave: …

Hope my explanation is clear, let me know if you need more code/files if you want to replicate this. Cheers & enjoy the holidays,
@pablovgd That's correct - qscoreCalculator does differ slightly from the implementation now in the findChromPeaks algorithm, specifically with respect to missing values. As @jorainer and I discuss above with the zero-one scaling, this is largely due to the difficulty of extracting the retention time vector alongside the intensity vector. When both are available, the exact positions of the intensity peaks can be passed to the beta distribution; without it, we're forced to assume they're equally spaced, which is generally a good but not an excellent assumption.

I'll also note that these metrics tend to perform much better when calculated as a mean/median across multiple files for a given feature instead of on a single peak. Below I've attached several images showing the distribution of beta_cor and beta_snr in a couple of ways for a subset of pooled QC samples from my own data. Here's the "feature-based" mindset with a partially reviewed dataset. Manually reviewed features are shown by color, while model-predicted classes are shown as point shape. You can see the line that the simple two-feature logistic regression algorithm draws in this space from the middle-top down to the bottom right based on how the shapes change above/below that line, as well as the few peaks that were either misclassified or were outliers in the model.

Here's what the peak-based metrics look like for just the "Good" and "Bad" peaks - I typically construct these plots for every dataset because the thresholds are difficult to determine a priori and I favor the trained model, but you may get some guidance from them nonetheless. Points are colored by feature number, and you can kind of see how peaks tend to group pretty closely in this space, with six dots (one from each mzML file) of the same color grouping close together. If I were to use a beta_cor threshold alone for this dataset, I'd probably settle on 0.8 or higher for the median feature beta_cor. That helps to ignore some of the single-sample outliers in the 0.5-0.75 range.

Are you often working with individual samples, or do you typically have additional samples / replicates / pooled QC samples that would allow you to view things in feature space instead of peak space? I certainly agree that the algorithm could use some future refinement, and hopefully @jorainer's suggestion to implement it as a single function makes tinkering easier.
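Since a two-feature logistic regression on (beta_cor, beta_snr) comes up repeatedly in this thread, here is a dependency-free Python sketch of the idea. The toy values are hypothetical; the actual model in the discussion was trained on manually labeled peaks:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent for a two-feature logistic regression.
    X: list of (beta_cor, beta_snr) pairs; y: 0/1 labels (Bad/Good)."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        gw0 = gw1 = gb = 0.0
        for (x1, x2), t in zip(X, y):
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - t
            gw0 += err * x1
            gw1 += err * x2
            gb += err
        w[0] -= lr * gw0 / n
        w[1] -= lr * gw1 / n
        b -= lr * gb / n
    return w, b

def classify(w, b, feat):
    """Predict 'Good' when the modeled probability reaches 0.5."""
    return sigmoid(w[0] * feat[0] + w[1] * feat[1] + b) >= 0.5
```

The decision boundary this produces is exactly the straight line described in the scatter-plot discussion above: everything on one side is predicted Good, everything on the other side Bad.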
Sorry, I am a rookie in using … However, the packages I used have been updated to the latest version. I don't know if there is a way to exclude … or other ways to check the quality of the peaks. In addition, my experimental material is GC-MS. I am not sure whether this will affect the quality check. Thank you.
Hi @v-v1150n , welcome here, let's try to help. Since R and Bioconductor have some opinions about versioning, let's double-check first that you have the correct versions. What is the output of the command … ?
Hi @sneumann , here are the versions of my packages: …
Hi @v-v1150n , I might be wrong, but I think the quality metrics are currently only implemented in the development version of xcms: https://www.bioconductor.org/packages/devel/bioc/html/xcms.html You'll need Bioconductor version 3.19 and R version 4.4.0, see also: https://www.bioconductor.org/install/ Hope this helps!
Yep, can confirm that verboseBetaColumns went live in xcms 4.1.3 according to the NEWS. Easiest way to get it is probably via devtools with …

Also, as some shameless self-promotion, I've been working on wrapping a lot of these same metrics into a package for visualization and labeling over at https://github.com/wkumler/squallms. The package won't be on Bioconductor until the 3.20 update this fall, but it's at a stable state and I'm actively collecting feedback if you want to give it a try!
@wkumler nice, looks good! I'll give it a try. I implemented the beta_cor & updated SNR also in a tool I'm working on for targeted peak extraction, and overall they align quite well with the peak quality labeling of our analysts!
I upgraded R and updated Bioconductor to be able to run these two parameters (beta_cor and beta_snr).
Hi again,
I've found myself frustrated with XCMS's large number of noise peaks and have designed a few additional metrics that can be used to identify "good-looking" peaks more easily. The core idea is to compare the raw chromatogram to an "idealized" bell curve with several degrees of skew; the deviation from that curve can then be used to estimate both how bell-shaped the peak is and the signal-to-noise ratio, via the residuals.
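As a rough illustration of this idea (a self-contained Python sketch, not the PR's actual R code): evaluate beta-distribution curves at several skews across the peak's scaled retention times, take the best Pearson correlation as the shape score, and derive an SNR from the residuals against the best-fitting curve:

```python
import math
from statistics import stdev

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x; 0 outside (0, 1)."""
    if x <= 0 or x >= 1:
        return 0.0
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta)

def scale01(v):
    lo, hi = min(v), max(v)
    return [(x - lo) / (hi - lo) for x in v]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x)) *
           math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den

def beta_metrics(intensity, rtime=None, skews=(3, 3.5, 4, 4.5, 5)):
    """Return (best_cor, beta_snr): best correlation against a family of
    skewed beta curves, plus a residual-based signal-to-noise estimate."""
    if rtime is None:  # fall back to equally spaced scans
        rtime = list(range(len(intensity)))
    keep = [i for i, v in enumerate(intensity) if v is not None and v > 0]
    intensity = [intensity[i] for i in keep]
    rtime = [rtime[i] for i in keep]
    if len(intensity) < 5:  # too few points to say anything about shape
        return float("nan"), float("nan")
    x = scale01(rtime)
    curves = [[beta_pdf(xi, a, 5) for xi in x] for a in skews]
    cors = [pearson(intensity, c) for c in curves]
    best = max(range(len(cors)), key=cors.__getitem__)
    resid = [a - b for a, b in zip(scale01(curves[best]), scale01(intensity))]
    noise = stdev(resid[i + 1] - resid[i] for i in range(len(resid) - 1))
    return cors[best], math.log10(max(intensity) / noise)
```

The key property is that a clean, roughly bell-shaped trace correlates highly with at least one member of the curve family regardless of its skew direction, while noise or shoulders drive both the correlation down and the residual-based noise estimate up.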
The results from this seem fairly promising. On my own data, I get pretty nice separation between the peaks I've manually labelled as "Good" and "Bad", with the Pearson's r metric performing a bit better than the new SNR metric. The new SNR metric does, however, fix the problem with the old "sn" metric when there isn't enough data outside the peak to properly estimate a baseline, by using the signal within the peak instead, so it seemed worth including as well.
When calculated as a summary statistic across multiple files, the performance is strengthened even further, and some internal testing reveals that you can use these metrics to enormously reduce the number of false positives (at the risk of a few additional false negatives). I trained a basic logistic regression model on these two new metrics alone, tested it using a second labeled dataset as validation, and obtained the following confusion matrix:
in which the false discovery rate is reduced from ~80% to about 10%, although we lose out on 32 features that were incorrectly predicted to be "Bad" when I manually labeled them as "Good".
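For reference, the false discovery rate quoted here is simply FP/(FP+TP) over the peaks predicted "Good". A tiny sketch with illustrative numbers (not the PR's actual confusion matrix):

```python
def confusion_rates(tp, fp, fn, tn):
    """Basic rates from a 2x2 confusion matrix of predicted vs labeled peaks.
    tp/fp/fn/tn follow the usual convention with 'Good' as the positive class."""
    fdr = fp / (fp + tp)      # predicted Good that were actually Bad
    recall = tp / (tp + fn)   # labeled Good that the model kept
    return fdr, recall
```

Dropping the FDR from ~80% to ~10% while giving up some recall is exactly the trade the confusion matrix above describes: fewer false positives at the cost of a handful of genuinely good features predicted "Bad".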
I originally considered implementing these as a separate function in XCMS or my own package, but the recalculation requires going back to the raw peak data, which is time-consuming and quite slow. In this PR, I implemented it within the CentWave algorithm since the raw intensity and RT data are already available, and additional columns can be added to the chromPeaks slot without any apparent difficulties. This was then made available in CentWaveParam with the boolean parameter "verboseBetaColumns", because the bell curve is made using a beta distribution and the name follows the "verboseColumns" syntax that already exists. The only change this really makes is that chromPeaks is now returned with two additional columns, beta_cor and beta_snr, corresponding to the Pearson's r metric and the new SNR metric respectively.

When I tested this on the faahko dataset, here's the kind of peaks that receive low (upper row of plots) / high (lower row of plots) scores on the Pearson's r metric and the new SNR metric, as tested with the below code:
Of course, the additional calculations come at the cost of a slightly longer peakpicking time (but only if enabled in the CentWaveParam). In my (limited) testing across ~40 files, XCMS took an additional 15 seconds, going from ~4 minutes to ~4.2. This will increase with more files and lower thresholds, but mine are pretty generous to begin with, so I'd expect this percentage increase to be roughly what others can expect as well.
I've also implemented a few unit tests to ensure the objects are the same when the parameter is not passed and that the calculations are unaffected by the additional parameters, but I wasn't able to test the package comprehensively, as it seems to be failing too many unit tests at the moment for unrelated reasons (running devtools::test() on a freshly cloned version of the repo fails in the same way). Let me know if this is a PR you're interested in or if there's more I can do to help get this implemented. If you're curious about the labeling performance or interested in more details, I've got a preprint up on bioRxiv here.