From ec8578e46e049f7beaca9e57bf56d8eb3cc63e19 Mon Sep 17 00:00:00 2001
From: skschum
Date: Fri, 14 Aug 2020 09:13:21 -0400
Subject: [PATCH] Update README.md

---
 README.md | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/README.md b/README.md
index 4846adf..0897329 100644
--- a/README.md
+++ b/README.md
@@ -7,25 +7,27 @@ The MFAssignR package was designed for multi-element molecular formula (MF) assi

 ## Molecular Formula (MF) Assignment

-The MF assignment algorithm in MFAssign was adapted from the low mass moiety CHOFIT assignment algorithm developed by Green and Perdue (2015). Briefly, the CHOFIT algorithm uses low mass moieties such as CH4O-1 and C4O-3 to move around in the O/C and H/C space to assign MF with C, H, and O without conventional loops. The MFAssignCHO() function uses the CHOFIT strategy to assign MF with C, H, and O. Additional combinatorial assignments with various heteroatoms are made using nested loops that subtract the mass of a heteroatom from the measured ion mass, creating a CHO “core” mass, which can then be assigned using the low mass moiety CHOFIT approach. The MFAssign() function uses this latter approach with several additional heteroatoms. Further information is available in Green and Perdue (2015) and Perdue and Green (2015).
+The MF assignment algorithm in MFAssign was adapted from the low mass moiety CHOFIT assignment algorithm developed by Green and Perdue (2015). Briefly, the CHOFIT algorithm uses low mass moieties such as CH4O-1 and C4O-3 to move around in the O/C and H/C space to assign MF with C, H, and O without conventional loops. The MFAssignCHO function uses the CHOFIT strategy to assign MF with C, H, and O. Additional combinatorial assignments with various heteroatoms are made using nested loops that subtract the mass of a heteroatom from the measured ion mass, creating a CHO “core” mass, which can then be assigned using the low mass moiety CHOFIT approach. The MFAssign function uses this latter approach with several additional heteroatoms. Further information is available in Green and Perdue (2015) and Perdue and Green (2015).

 ### MFAssign()

-Using the low mass moiety and combinatorial assignment approach, MFAssign() can be used to assign MF with 12C, 1H, and 16O and a variety of heteroatoms and isotopes, including 2H, 13C, 14N, 15N, 31P, 32S, 34S, 35Cl, 37Cl,and 19F. It can also assign Na+ adducts, which are common in positive ion mode. Due to the increasing number of chemically reasonable MF with the increasing number of possible elements and increasing molecular weight, the output will provide a list of ambiguous and unambiguous MF.
+MFAssign can be used to assign molecular formulas to two-column or three-column data frames where the first column is ion mass, the second column is intensity, and the third column can be anything else, but was designed for retention time, allowing better formula assignment of LC-MS data.

-In MFAssignR, we use a ‘de novo’ concept for MF assignment, where ‘de novo’ means the first in series. This approach takes advantage of the naturally occurring mass spectral patterns typically observed in natural organic matter. The most frequent nominal mass difference patterns include: 2, 14, and 16 that correspond to H2, CH2, and O. Thus, these patterns are used to restrain the number of chemically reasonable MF assigned to masses above the user defined ‘de novo’ cutoff (e.g., 300). In MFAssign(), this is done using Kendrick mass defects and z* sorting. First, Kendrick mass defects (KMD) and z* values are calculated with a CH2 Kendrick base to sort the measured masses into CH2 homologous series (Stenson et al., 2003). The function then selects 1 to 3 members of each CH2 homologous series with masses below the user defined cutoff and attempts to assign MF. The ambiguous MF are then returned to the unassigned list. Then, the unambiguous MF are used as seeds for additional assignments using CH2, O, H2, H2O, and CH2O MF extensions (Kujawinski and Behn, 2006). To do the formula extensions, the KMD and z* values for each of these bases are calculated and then used to assign MF through the addition or subtraction of the series bases.
+Using the low mass moiety and combinatorial assignment approach, MFAssign can be used to assign MF with 12C, 1H, and 16O and a variety of heteroatoms and isotopes, including 2H, 13C, 14N, 15N, 31P, 32S, 34S, 35Cl, 37Cl, 19F, 79Br, 81Br, and 127I. It can also assign Na+ adducts, which are common in positive ion mode. Due to the increasing number of chemically reasonable MF with the increasing number of possible elements and increasing molecular weight, the output will provide a list of ambiguous and unambiguous MF.

-MFAssign tracks how many “paths” can be used to assign each MF and if a single mass has multiple MF. By default, the function will choose the MF that has the largest number of paths that intercept with it. For example, if a single mass has two possible MF and one has 20 potential “paths” to it, while the other has 4, the function will choose the MF with 20 paths. Work is ongoing to track these paths and the associated MF in the data frame output of these functions. Overall, the multi-path MF extension approach greatly reduces the number of ambiguous assignments and provides an increased level of confidence in the final MF list because the MF are related to unambiguous MF assigned below the user defined cutoff. To reduce the number of ambiguous sulfur assignments, sulfur containing MF used as seeds must be unambiguous and have a matching 34S peak.
+In MFAssignR, we use a de novo concept for MF assignment, where de novo means the first in series. This approach takes advantage of the naturally occurring mass spectral patterns typically observed in natural organic matter. The most frequent mass difference patterns include 2.01565, 14.01565, and 15.99491, which correspond to H2, CH2, and O. Thus, these patterns are used to constrain the number of chemically reasonable MF assigned to ions above the user defined ‘de novo’ cutoff (e.g., m/z 300). In MFAssign, this is done using Kendrick mass defects and z* sorting. First, Kendrick mass defects (KMD) and z* values are calculated with a CH2 Kendrick base to sort the measured masses into CH2 homologous series (Stenson et al., 2003). The function then selects 1 to 3 members of each CH2 homologous series with ions below the user defined cutoff and attempts to assign MF. The ambiguous MF are then returned to the unassigned list. Then, the unambiguous MF are used as seeds for additional assignments using CH2, O, H2, H2O, and CH2O MF extensions (Kujawinski and Behn, 2006). To do the formula extensions, the KMD and z* values for each of these bases are calculated and then used to assign MF through the addition or subtraction of the series bases.

-To allow ambiguity in the formula assignments there is the "Ambig" parameter which can be turned "on" or "off". This option turns off the path frequency prioritization step for the formula assignments as described above, which allows all chemically reasonably MF assignments to be retained for each mass. Additionally, an "MSMS" parameter is available, which can be used to assign MF in a data set that is not very continuous (e.g., MS/MS data). In this case, no pre-filtering of the masses below the “DeNovo” threshold is done, meaning that all masses below the threshold will be assigned directly. This causes the function to run somewhat slower, but can improve assignment coverage in some situations. These parameters replace the MFAssignAll() and MFAssignMSMS() functions from previous versions (<= v.0.0.3).
+“MFAssign” functions track how many “paths” can be used to assign each MF and whether a single mass has multiple MF. By default, the functions will choose the MF that has the largest number of paths that intercept with it. For example, if a single mass has two possible MF and one has 20 potential “paths” to it, while the other has 4, the function will choose the MF with 20 paths. Work is ongoing to track these paths and the associated MF in the data frame output of these functions. Overall, the multi-path MF extension approach greatly reduces the number of ambiguous assignments and provides an increased level of confidence in the final MF list because the MF are related to unambiguous MF assigned below the user defined cutoff. To reduce the number of ambiguous sulfur assignments, sulfur containing MF used as seeds must be unambiguous and have a matching 34S peak.
+
+To allow ambiguity in the formula assignments there is the "Ambig" parameter which can be turned "on" or "off". This option turns off the path frequency prioritization step for the formula assignments as described above, which allows all chemically reasonable MF assignments to be retained for each mass. Additionally, an "MSMS" parameter is available, which can be used to assign MF in a data set that is not very continuous (e.g., MS/MS data). In this case, no pre-filtering of the ions below the “DeNovo” threshold is done, meaning that all ions below the threshold will be assigned directly. This causes the function to run somewhat slower, but can improve assignment coverage in some situations. These parameters replace the MFAssignAll and MFAssignMSMS functions from previous versions (<= v.0.0.3).

 ### MFAssignCHO()

-MFAssignCHO() is a simplified version of MFAssign() used only to assign MF with CHO elements. MFAssignCHO() runs faster than MFAssign() and can be used for preliminary MF assignments prior to the selection of internal recalibration ions in conjunction with RecalList() and Recal(), which are described below.
+MFAssignCHO is a simplified version of MFAssign used only to assign MF with CHO elements. MFAssignCHO runs faster than MFAssign and can be used for preliminary MF assignments prior to the selection of internal recalibration ions in conjunction with RecalList and Recal, which are described below.

 ## Isotope Filtering

-The IsoFiltR() function can identify prospective 13C and 34S isotope masses. This is done to avoid incorrect monoisotopic MF assignments. This function operates on a two column data frame using the same structure as the MFAssign() function.
+The IsoFiltR function can identify prospective 13C and 34S isotope ions. This is done to avoid incorrect monoisotopic MF assignments. This function operates on a two-column or three-column data frame using the same structure as the MFAssign function.

 IsoFiltR() identifies potential isotope masses using a four-step identification method.

@@ -45,27 +47,27 @@ When the two data frame outputs from IsoFiltR() are put into MFAssign(), the fun

 ## Molecular Formula (MF) Quality Assurance

-MFAssign() includes a number of quality assurance (QA) steps to ensure output of chemically reasonable MF. In general, the default settings are relatively lenient to yield a wide range of chemically reasonable MF for a broad range of experiments. However, many of the parameters are customizable, including DBE-O limits (Herzsprung et al. Anal. and Bioanal. Chem. 2014), O/C ratio limits, H/C ratio limits, and minimum number of O. The HetCut parameter can be used to select the MF with the lowest number of heteroatoms, if more than one MF is assigned to a single mass (Ohno and Ohno, 2013). The NMScut parameter identifies the CH4 vs O exchange series in each nominal mass as described in Koch et al. (2007), which can be used to limit ambiguous assignments. Additional non-adjustable QA parameters are used in all of the MFAssign functions, including the nitrogen rule, large atom rule, the maximum number of H rule, maximum DBE rule (Lobodin et al., 2012), and the Senior rules (Kind et al. 2007).
+MFAssign includes a number of quality assurance (QA) steps to ensure output of chemically reasonable MF. In general, the default settings are relatively lenient to yield a wide range of chemically reasonable MF for a broad range of experiments. However, many of the parameters are customizable, including DBE-O limits (Herzsprung et al. Anal. and Bioanal. Chem. 2014), O/C ratio limits, H/C ratio limits, and minimum number of O. The HetCut parameter can be used to select the MF with the lowest number of heteroatoms, if more than one MF is assigned to a single mass (Ohno and Ohno, 2013). The NMScut parameter identifies the CH4 vs O exchange series in each nominal mass as described in Koch et al. (2007), which can be used to limit ambiguous assignments. Additional non-adjustable QA parameters are used in all of the “MFAssign” functions, including the nitrogen rule, large atom rule, the maximum number of H rule, maximum DBE rule (Lobodin et al., 2012), and the Senior rules (Kind et al. 2007).

 ## Noise Assessment

-Noise level assessment can be accomplished using the either the HistNoise() or KMDNoise() functions in conjunction with the SNplot() functions. The HistNoise() method is based on the method developed by Zhurov et al. (2014), and KMDNoise() is a new custom method based on our observations of raw data Kendrick mass defect analysis.
+Noise level assessment can be accomplished using either the HistNoise or KMDNoise function in conjunction with the SNplot function. The HistNoise method is based on the method developed by Zhurov et al. (2014), and KMDNoise is a new custom method based on our observations of raw data Kendrick mass defect analysis.

-The Zhurov et al. (2014) method uses a histogram distribution of the natural log intensities in the measured raw mass spectrum to determine the point where noise peaks give way to analyte signal. The HistNoise() function attempts to identify this point and reports the noise level so that the signal-to-noise threshold can be determined. The threshold is shown in the output plot with red and blue colors, where red indicates noise. If the function does not predict a reasonable noise level, the threshold can be set manually by the user. We frequently observed this function to fail to separate the distributions when the analyte signal tapers into the noise. For this reason, we developed the KMDNoise() function described below.
+The Zhurov et al. (2014) method uses a histogram distribution of the natural log intensities in the measured raw mass spectrum to determine the point where noise peaks give way to analyte signal. The HistNoise function attempts to identify this point and reports the noise level so that the signal-to-noise threshold can be determined. The threshold is shown in the output plot with red and blue colors, where red indicates noise. If the function does not predict a reasonable noise level, the threshold can be set manually by the user. We frequently observed this function to fail to separate the distributions when the analyte signal tapers into the noise. For this reason, we developed the KMDNoise function described below.

-The KMDNoise() method is based on the observation that the CH2 based KMD values of noise and analyte masses are naturally separated in a KMD plot, allowing the function to select a region with only noise to calculate the average intensity. We refer to this as the KMD slice method. In principle, this is similar to what was described in Reidel and Dittmar (2014), but instead of using a static range of normal mass defects (0.3-0.9), our method uses a mass dependent KMD region, which avoids potentially doubly charged masses with a mass defect of ~0.5, which would be considered as noise in the Reidel and Dittmar method. Additionally, the user can set limits on the mass range to use to estimate the noise, if that is necessary to avoid specific high intensity peaks.
+The KMDNoise method is based on the observation that the CH2 based KMD values of noise and analyte masses are naturally separated in a KMD plot, allowing the function to select a region with only noise to calculate the average intensity. We refer to this as the KMD slice method. In principle, this is similar to what was described in Riedel and Dittmar (2014), but instead of using a static range of normal mass defects (0.3-0.9), our method uses a mass dependent KMD region, which avoids potentially doubly charged masses with a mass defect of ~0.5, which would be considered as noise in the Riedel and Dittmar method. Additionally, the user can set limits on the mass range to use to estimate the noise, if that is necessary to avoid specific high intensity peaks.

-At least one of these noise estimation functions should be run on the mass list prior to MF assignment with MFAssign() or isotope filtering with IsoFiltR(). Setting a reasonable S/N threshold greatly increases the speed of the functions and improves the output quality.
+At least one of these noise estimation functions should be run on the mass list prior to MF assignment with MFAssign or isotope filtering with IsoFiltR. Setting a reasonable S/N threshold greatly increases the speed of the functions and improves the output quality.

-The SNplot() function is used to show the mass spectrum with the masses below and above the threshold shown in the output plot with red to blue colors, where red indicates noise.
+The SNplot function is used to show the mass spectrum with the masses below and above the threshold shown in the output plot in red and blue, where red indicates noise.

-##Internal Mass Recalibration
+## Internal Mass Recalibration

-The internal mass recalibration method in MFAssignR was adapted from Kozhinov et al. (2013) and Savory et al. (2011). It uses a polynomial central moving average to estimate the weights used to recalibrate the masses (Kozhinov et al., 2013) applied to spectral segments (Savory et al., 2011) to perform a walking recalibration.
+The internal mass recalibration method in MFAssignR was adapted from Kozhinov et al. (2013) and Savory et al. (2011). It uses a polynomial central moving average to estimate the weights used to recalibrate the measured ion masses (Kozhinov et al., 2013) applied to spectral segments (Savory et al., 2011) to perform a walking recalibration.

-First, the function RecalList() can be used with the output of MFAssign() or MFAssignCHO() to generate a data frame containing potential recalibrant CH2 homologous series. There are a variety of metrics included in the output of this function to aid the user in picking suitable recalibrant mass series. Some of the more useful parameters for ensuring complete coverage of the spectrum are “Number Observed”, “Mass Range”, and “Tall Peak”. The quality of the series with regard to whether the series has “holes” in it, how tall the tallest peak is relative to other peaks in the region, and how close the two tallest peaks are to each other are estimated with the “Series Score”, “Peak Distance”, and “Abundance Score” respectively. Please see the User Manual/Vignette for more information about each parameter in RecalList(). Combined, these series should cover the full mass spectral range to provide the best overall recalibration. The best series to choose are generally long and combined have a “Tall Peak” at least every 100 m/z.
+First, the function RecalList can be used with the output of MFAssign or MFAssignCHO to generate a data frame containing potential recalibrant CH2 homologous series. There are a variety of metrics included in the output of this function to aid the user in picking suitable recalibrant mass series. Some of the more useful parameters for ensuring complete coverage of the spectrum are “Number Observed”, “Mass Range”, and “Tall Peak”. The quality of the series with regard to whether the series has gaps in it, how tall the tallest peak is relative to other peaks in the region, and how close the two tallest peaks are to each other are estimated with the “Series Score”, “Peak Distance”, and “Abundance Score” respectively. Please see the User Manual/Vignette for more information about each parameter in RecalList. Combined, these series should cover the full mass spectral range to provide the best overall recalibration. The best series to choose are generally long and combined have a “Tall Peak” at least every 100 m/z.

-Up to ten of these series can be chosen to be used in the Recal() function, which recalibrates the spectrum. Choosing appropriate recalibrants is a critical aspect of recalibrating a mass spectrum effectively. After selecting the recalibrant series and entering them to the Recal() function, the parameter in Recal() most likely to be changed is “mzRange” which sets the recalibration segment length and has a default value of 30. If this value does not work a warning will be printed to the R console telling the user to increase the value. Formula extension via H2 and O homologous series uses the user defined recalibrant series as a base to find addtional recalibrants. It is limited to a user defined number of steps (+/- H2 or O) and generates a pool of potential recalibrants. Formula extension occurs between the assigned unambiguous molecular formulas and then each of the potential recalibrants are also check for a matching 13C peak. If there is a matching 13C then it is also added to the pool of recalibrants to be used. This pool of recalibrants are separated into each user defined segment and used to calculate a mass error correction term based on Kozhinov et al. (2013). These mass correction terms are then used to recalibrate each segment independently, removing the systematic error present in a mass spectrum.
+Up to ten of these series can be chosen to be used in the Recal function, which recalibrates the spectrum. Choosing appropriate recalibrants is a critical aspect of recalibrating a mass spectrum effectively. After selecting the recalibrant series and entering them into the Recal function, the parameter in Recal most likely to be changed is “mzRange”, which sets the recalibration segment length and has a default value of 30. If this value does not work, a warning will be printed to the R console telling the user to increase the value. Formula extension via H2 and O homologous series uses the user defined recalibrant series as a base to find additional recalibrant ions. It is limited to a user defined number of steps (± H2 or O) and generates a pool of potential recalibrant ions. Formula extension occurs between the assigned unambiguous molecular formulas and then each of the potential recalibrant ions is checked for a matching 13C peak. If there is a matching 13C then it is also added to the pool of recalibrant ions to be used. This pool of recalibrant ions is separated into each user defined segment and used to calculate a mass error correction term based on Kozhinov et al. (2013). These mass correction terms are then used to recalibrate each segment independently, removing the systematic error present in a mass spectrum.

 # Function Examples
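
As a hedged illustration of the CH2 Kendrick sorting described in the MFAssign() section of this patch, the sketch below groups ions into CH2 homologous series by Kendrick mass defect (KMD) and z*. It is written in Python purely for illustration and is not MFAssignR code. The helper names `kendrick_ch2` and `group_series`, the rounding of KMD to three decimals, the use of the rounded measured mass as the nominal mass, and the example masses are all assumptions; other KMD sign and nominal-mass conventions exist.

```python
CH2_EXACT = 14.01565  # exact mass of CH2, as quoted in the README text
CH2_NOMINAL = 14      # nominal mass of CH2


def kendrick_ch2(mass):
    """Return (KMD, z*) for one measured mass using a CH2 Kendrick base."""
    # Rescale so that each CH2 unit contributes exactly 14 mass units.
    kendrick_mass = mass * CH2_NOMINAL / CH2_EXACT
    # Assumption: nominal mass is the measured mass rounded to an integer.
    nominal = round(mass)
    # Assumption: KMD = nominal mass - Kendrick mass, kept to 3 decimals.
    kmd = round(nominal - kendrick_mass, 3)
    # z* separates series that share a KMD but sit on different nominal masses.
    z_star = nominal % CH2_NOMINAL - CH2_NOMINAL
    return kmd, z_star


def group_series(masses):
    """Group masses that share the same (KMD, z*) pair into one series."""
    series = {}
    for mass in masses:
        series.setdefault(kendrick_ch2(mass), []).append(mass)
    return series


# Hypothetical ion masses: the first three differ by CH2, the fourth by O.
example = [201.0405, 215.05615, 229.0718, 217.03541]
for key, members in group_series(example).items():
    print(key, members)
```

Members of one CH2 series share a (KMD, z*) key because adding CH2 raises the Kendrick mass and the nominal mass by exactly 14 each, while adding O changes both values and so starts a different series.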