
Added expansion factors for the mz and rt range used in fillPeaks.chrom. #3

Open · wants to merge 6 commits into base: master

Conversation

stanstrup
Contributor

No description provided.

@sneumann
Owner

Hi, I am trying a few things with the new code, and would like to add
a unitTest before pulling.

    library(faahKO)
    library(RUnit)

    xsg <- group(faahko)

    xsgf1 <- fillPeaks(xsg, method="chrom")
    xsgf0909 <- fillPeaks(xsg, method="chrom", expand.rt=0.9, expand.mz=0.9)
    xsgf1111 <- fillPeaks(xsg, method="chrom", expand.rt=1.1, expand.mz=1.1)

    # Perfect correlation of xsgf1 against itself, as expected:
    plot(peaks(xsgf1)[xsgf1@filled, "into"], peaks(xsgf1)[xsgf1@filled, "into"])

    # Shouldn't all peaks in xsgf1111 be >= xsgf1?
    plot(peaks(xsgf1)[xsgf1@filled, "into"], peaks(xsgf1111)[xsgf1111@filled, "into"])

    # Shouldn't all peaks in xsgf0909 be <= xsgf1?
    plot(peaks(xsgf1)[xsgf1@filled, "into"], peaks(xsgf0909)[xsgf0909@filled, "into"])

    # Actual tests:
    checkTrue(all(peaks(xsgf1)[xsgf1@filled, "into"] == peaks(xsgf1)[xsgf1@filled, "into"]))
    checkTrue(all(peaks(xsgf1)[xsgf1@filled, "into"] <= peaks(xsgf1111)[xsgf1111@filled, "into"]))
    checkTrue(all(peaks(xsgf1)[xsgf1@filled, "into"] >= peaks(xsgf0909)[xsgf0909@filled, "into"]))

@stanstrup
Contributor Author

Yes, I would expect the same. Is that not the case? If not, I can investigate next week when I get back.

@sneumann
Owner

[screenshot from 2014-04-23 20:05:58]

@stanstrup
Contributor Author

Ok. Damn. I will investigate next week. I wonder if these are at the extremes of the retention time range and getPeaks doesn't like it or something.

@stanstrup
Contributor Author

The added zeros were indeed because getPeaks doesn't handle the case where the rt range of the file is exceeded. I have added a check that limits rtmin and rtmax to the limits of each file (in my fork; pull request not submitted yet).
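
A minimal base-R sketch of such a clamp (the function name clamp_rt and the numbers are made up for illustration; the real patch operates inside fillPeaks.chrom):

```r
# Clamp a requested rt window to the scan times actually present in one
# file, so getPeaks is never asked to integrate outside the data.
clamp_rt <- function(rtmin, rtmax, stime) {
  c(rtmin = max(rtmin, min(stime)),
    rtmax = min(rtmax, max(stime)))
}

stime <- seq(2500, 4500, by = 2)   # hypothetical scan times of one file
clamp_rt(2400, 2600, stime)        # rtmin raised to 2500
clamp_rt(4400, 4700, stime)        # rtmax lowered to 4500
```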

There is still something strange going on: getPeaks is returning slightly lower values for some peaks in some samples even with an expanded peakrange. I have not figured out why yet. I guess I will have to understand exactly how getPeaks works to figure that out, so no pull request yet.

An example showing the problem:

    library(faahKO)
    library(RUnit)
    library(stringr)

    xsg <- group(faahko)

    xsgf1 <- fillPeaks(xsg, method="chrom")
    xsgf1111 <- fillPeaks(xsg, method="chrom", expand.rt=1.1, expand.mz=1.1)

    select <- !(peaks(xsgf1)[xsgf1@filled, "into"] <= peaks(xsgf1111)[xsgf1111@filled, "into"])

    View(peaks(xsgf1)[xsgf1@filled[select], ])
    View(peaks(xsgf1111)[xsgf1111@filled[select], ])

Any ideas?

@stanstrup
Contributor Author

I have tracked down the problem: getPeaks calculates an average peakwidth:

    pwid <- diff(stime[iret])/diff(iret)

With the expanded peakrange this leads to the inclusion of additional scans, as intended. The problem is that the extra scans might have zero intensity, and then the only difference in the calculated peak area comes from the average peakwidth used in the calculation:

    rmat[i,7] <- pwid*sum(ymax)

This average peakwidth is occasionally lower if the extra scan has a slightly shorter than "normal" width, which in turn makes the calculated area lower. The difference is really small in the cases I have seen.

So the only solutions I can see are to use the per-scan peakwidth in the calculation, or to drop this particular unit test. I would be hesitant to change such a fundamental calculation in xcms...
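
The averaging effect described above can be reproduced with a tiny standalone example (stime and ymax are invented numbers, not faahKO data):

```r
# One peak with an extra, slightly narrower zero-intensity scan on each side.
stime <- c(100.0, 101.0, 102.0, 103.0, 103.8)  # last scan gap is shorter
ymax  <- c(0, 50, 80, 50, 0)

# Area the way getPeaks computes it: average scan width times summed intensity.
area_avg <- function(first, last) {
  iret <- c(first, last)
  pwid <- diff(stime[iret]) / diff(iret)  # average peakwidth over the range
  pwid * sum(ymax[first:last])
}

area_avg(2, 4)  # 180: pwid = 1.0 over the original range
area_avg(1, 5)  # 171: only zeros were added, yet pwid drops to 0.95
```

So the expanded range can report a smaller area even though no intensity was removed, which is exactly the behaviour the unit test tripped over.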

@stanstrup
Contributor Author

I have added min.width.mz and min.width.rt arguments. In my test dataset, setting a reasonable min.width.mz was much more important than expanding the range by a factor: it almost cut the zeros in half, whereas the expansion factor only cut a few percentage points.

The dataset had some features with ridiculously narrow mz ranges, making re-integration impossible:

[image: mz_range]
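
A hypothetical illustration of the min.width idea (ensure_min_width and min.width are illustrative names; the actual arguments in the pull request are min.width.mz and min.width.rt):

```r
# Widen a range symmetrically around its centre if it is narrower than a
# minimum width; otherwise leave it untouched.
ensure_min_width <- function(lo, hi, min.width) {
  if ((hi - lo) < min.width) {
    centre <- (lo + hi) / 2
    lo <- centre - min.width / 2
    hi <- centre + min.width / 2
  }
  c(lo, hi)
}

ensure_min_width(200.001, 200.002, 0.01)  # widened to a 0.01-wide window
ensure_min_width(200.0, 200.5, 0.01)      # already wide enough: unchanged
```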

…krange matrix must not be coerced to a vector when there is only one peak to expand.
@ColeWunderlich

I think this adds some nice features that can optionally be 'turned off' if the user wants the original functionality. I support pulling this, and am currently using it in my own code to prevent certain cases where pwid = NA, which I believe is caused by the m/z range being too narrow.

@trljcl
Contributor

trljcl commented Jun 25, 2014

Sorry, I think there are a number of issues here. If this is implemented, the defaults should be FALSE or NULL. Can someone please explain why expand.rt and expand.mz are needed in fillPeaks in the first place? Is this:

  1. A workaround to counter the shrunken mz and rt range for some peaks in the group, which is the default fillPeaks behaviour related to using group medians? The mz and rt ranges passed to getPeaks within fillPeaks are effectively medians of the group minima and maxima for the mz and rt values. This is a compromise: using the medians means half the peaks in the group have predicted mz and rt boundaries narrower than actual, and the other half wider. Adding expansion factors would be similar to using the minimum of the group minima and the maximum of the group maxima (instead of the medians); the rt and mz boundaries would then capture all peaks in the group (and this may be a better way of defining expansion factors, as they would change dynamically for each group). Predictions would be closer to actual areas, but ONLY if there are no overlapping peaks from adjacent groups, in which case overestimations would occur. I suppose this is why the median is used: it is a compromise between underestimating areas for peaks in the group and overestimating areas by including peaks outside the group.

  2. A workaround because getPeaks uses the profile matrix, and a narrow mz boundary range may actually fall within less than one profile bin? getPeaks has a default step value of 0.1, a legacy of unit-mass-resolution data with peaks picked using matchedFilter. However, if findPeaks.centWave is used on high-resolution data (so no profile matrix is used for peak picking, but a profile matrix IS generated with step = 0.1 for fillPeaks), it is perfectly possible that there are peak groups with mz widths < 0.01. Setting the xcmsSet object's @profinfo$step to a low value (e.g. 0.001) before running fillPeaks ensures a more accurate set of bins is used to estimate areas in getPeaks. It is not possible to set step as an argument to fillPeaks; it uses the value stored in the @profinfo slot.

  3. Maybe not directly relevant here, but there are integration errors if scan-to-scan times vary at all. I pointed this out in https://groups.google.com/forum/#!topic/xcms-devel/bdkvPGSWOU0 and benjiec has implemented a fix in a fork of xcms, but only for the initial peak finding done by centWave, and not for any subsequent peak filling done by getPeaks, so in this case there will be inconsistencies (see https://github.com/benjiec/xcms/tree/5532c360b02bbc823193d994d2eae9e32f985f04). I think this fix should be implemented across all findPeaks methods and getPeaks to ensure consistent integration.
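
Point 2 can be illustrated numerically: with the default profile step of 0.1, a very narrow mz window maps onto a single profile bin, so the integration boundary cannot be narrower than the bin itself (the numbers below are invented):

```r
# An mz window of width 0.01 versus a profile bin width of 0.1.
step      <- 0.1
mz_window <- c(300.002, 300.012)   # hypothetical centWave group boundaries

bins <- floor(mz_window / step)    # profile bin index of each boundary
length(unique(bins))               # 1: both boundaries fall in the same bin
```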

cheers
Tony

@stanstrup
Contributor Author

This is a workaround for when the mz and rt ranges found in the original peak picking might not be representative, however they are defined. The reasons I could come up with for why this seems to be a pervasive problem with some datasets include:

  • The picked peaks likely come from the samples with the highest intensities, so I figured the mz ranges are typically narrower than can be expected for smaller peaks.
  • If peak groups are based on only a few samples, both mz and rt ranges might not be representative of the whole dataset. That you initially have only a few well-behaved samples might be due to differences in intensities, but also, for peaks with a low number of scans, the peaks might not always be found.

As you can see from the plot above, mz ranges go all the way down to 0, so the re-integration of peaks is done on completely different terms, even if you probably wouldn't argue that the accuracy differs that much.

It is true that using the expansion factors runs the risk of integrating other peaks. But if a reasonable setting of min.width.rt, and in particular min.width.mz, creates overlap, then this peak can never be trusted anyway: you have overlap that cannot be resolved by your system. I would think that the minimum-width settings would be much less dangerous than using the group minima and maxima.

Why must the defaults be NULL/FALSE? The current defaults (expand.mz=1, expand.rt=1, min.width.mz=0, min.width.rt=0) leave the ranges unchanged and should give results identical to the current version.
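
The expansion arithmetic itself makes this easy to check: scaling a range about its centre by a factor of 1 is the identity, so the defaults cannot change any range (expand_range is an illustration, not the PR's exact code):

```r
# Scale a range symmetrically about its centre by a factor.
expand_range <- function(lo, hi, expand) {
  centre <- (lo + hi) / 2
  half   <- (hi - lo) / 2 * expand
  c(centre - half, centre + half)
}

expand_range(3400, 3420, 1)    # c(3400, 3420): unchanged, the default
expand_range(3400, 3420, 1.1)  # c(3399, 3421): 10% wider
```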

@stanstrup
Contributor Author

It looks like my first pull request (the expand.mz and expand.rt one) was accepted, but not the min.width.mz and min.width.rt pull request. Was that intended?

@stanstrup
Contributor Author

Was a decision made on this issue?

@stanstrup
Contributor Author

I'd still very much like to get this finished...
