From f5c524d8dd5d516a325231cd2b9221d16b7db436 Mon Sep 17 00:00:00 2001
From: skschum
Date: Mon, 13 May 2019 09:30:06 -0400
Subject: [PATCH 1/2] Update README.md

---
 README.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 7f7b14d..cd2bec6 100644
--- a/README.md
+++ b/README.md
@@ -2,22 +2,21 @@
 ## Package Overview and References
-The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. A number of tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous MF assignments are provided. This package contains MFAssign(), MFAssign_RMD(), MFAssignCHO(), MFAssignCHO_RMD(), MFAssignAll(), MFAssignAll_MSMS(), SNplot(), HistNoise(), KMDNoise(), RecalList(), Recal(), and IsoFiltR() described in the sections below. Note, the functions with “RMD” were designed to be run within an R Markdown file and are otherwise identical to the corresponding non-”RMD” versions. To learn more, please see the section titled “Semi-Automated MFAssignR Functions”. User caution with the function parameter settings and output evaluation is required; thus, several function outputs are provided to assist the user with these evaluations.
+The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. A number of tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous MF assignments are provided. This package contains MFAssign(), MFAssign_RMD(), MFAssignCHO(), MFAssignCHO_RMD(), SNplot(), HistNoise(), KMDNoise(), RecalList(), Recal(), Recal_2(), RecalX(), Recal_2X(), and IsoFiltR() described in the sections below. Note, the functions with “RMD” were designed to be run within an R Markdown file and are otherwise identical to the corresponding non-“RMD” versions. To learn more, please see the section titled “Semi-Automated MFAssignR Functions”. User caution with the function parameter settings and output evaluation is required; thus, several function outputs are provided to assist the user with these evaluations.
 ## Molecular Formula (MF) Assignment
-The MF assignment algorithm in MFAssign was adapted from the low mass moiety CHOFIT assignment algorithm developed by Green and Perdue (2015). In total there are 4 versions of MF Assign, including MFAssign(), MFAssignCHO(), MFAssignAll(), and MFAssignAll_MSMS(). Where MFAssign(), MFAssignAll(), and MFAssignAll_MSMS() include external nested loops to assign additional heteroatoms, as described in Green and Perdue (2015) while MFAssignCHO() does not. Briefly, the CHOFIT algorithm uses low mass moieties such as CH4O-1 and C4O-3 to move around in the O/C and H/C space to assign MF with C, H, and O (CHO MF). These low mass moieties efficiently assign CHO MF without conventional loops. Additional combinatorial assignments with various heteroatoms are made using nested loops that subtract the mass of a heteroatom from the measured ion mass, creating a CHO “core” mass, which can then be assigned using the low mass moiety CHOFIT approach. This is further explained in Green and Perdue (2015) and Perdue and Green (2015).
+The MF assignment algorithm in MFAssign was adapted from the low mass moiety CHOFIT assignment algorithm developed by Green and Perdue (2015). In total there are 2 versions of MF Assign, including MFAssign() and MFAssignCHO(). MFAssign() includes external nested loops to assign additional heteroatoms, as described in Green and Perdue (2015), while MFAssignCHO() does not. Briefly, the CHOFIT algorithm uses low mass moieties such as CH4O-1 and C4O-3 to move around in the O/C and H/C space to assign MF with C, H, and O (CHO MF). These low mass moieties efficiently assign CHO MF without conventional loops. Additional combinatorial assignments with various heteroatoms are made using nested loops that subtract the mass of a heteroatom from the measured ion mass, creating a CHO “core” mass, which can then be assigned using the low mass moiety CHOFIT approach. This is further explained in Green and Perdue (2015) and Perdue and Green (2015).
 ### MFAssign()
 Using the low mass moiety and combinatorial assignment approach, MFAssign() can be used to assign MF with 12C, 1H, and 16O and a variety of heteroatoms and isotopes, including 2H, 13C, 14N, 15N, 31P, 32S, 34S, 35Cl, 37Cl,and 19F. It can also assign Na+ adducts, which are common in positive ion mode. Due to the increasing number of chemically reasonable MF with the increasing number of possible elements and increasing molecular weight, the output will provide a list of ambiguous and unambiguous MF. Advanced Kendrick mass and z* sorting tools are used to reduce the number of ambiguous MF in MFAssign(). First, Kendrick mass defect (KMD) and z* values are calculated with a CH2 Kendrick base to sort the measured masses into CH2 homologous series (Stenson et al., 2003). The function then selects 1 to 3 members of each CH2 homologous series with masses below the user defined cutoff and attempts to assign MF. The ambiguous MF are then returned to the unassigned list. Then, the unambiguous MF are used as seeds for additional assignments using CH2, O, H2, H2O, and CH2O MF extensions (Kujawinski and Behn, 2006). To do the formula extensions the KMD and z* values for each of these bases are calculated and then used to assign MF through the addition or subtraction of the series bases. MFAssign() (and MFAssignCHO()) tracks how many different “paths” can be used to assign each MF and if a single mass has multiple MF, the function will choose the MF that has the largest number of paths that intercept with it. For example, if a single mass has two possible MF and one has 20 potential “paths” to it, while the other has 4, the function will choose the MF with 20 paths. Work is ongoing to track these paths and the removed MF in the data frame output of these functions. Overall, the multi-path MF extension approach greatly reduces the number of ambiguous assignments and provides an increased level of confidence in the final MF list because the MF are related to unambiguous MF assigned below the user defined cut point. An additional step to decrease the number of ambiguous and/or incorrect sulfur assignments was also added. This step requires that for a sulfur containing compound to act as a seed it must be unambiguous and have a matching 34S peak, when both monoisotopic and isotopic mass lists from the IsoFiltR() function are are assigned MF. This has been implemented for all versions of the MFAssign functions.
+To allow for more ambiguity in the formula assignment, the "Ambig" parameter can be turned "on" or "off". This option turns off the path-choosing step for formula assignment, described above, which allows more assignments to be kept for each mass. Additionally, the "MSMS" parameter can help to assign molecular formulas in a data set that is not very continuous with respect to homologous series, such as MS/MS data. It removes the pre-filtering of masses below the DeNovo threshold, so all masses below that point are assigned directly. This causes the function to run somewhat slower, but can help to get better assignments. These parameters replace the MFAssignAll() and MFAssignAll_MSMS() functions from previous versions (<= v.0.0.2).
+
 ### MFAssignCHO()
 MFAssignCHO() is a simplified version of MFAssign() used only to assign MF with CHO elements. MFAssignCHO() runs faster than MFAssign() and is best used as a preliminary MF assignment step prior to the selection of recalibrant ions in conjunction with MFRecalList() and MFRecalCheck(), which are described below.
-### MFAssignAll() and MFAssignAll_MSMS()
-MFAssignAll() uses the low mass moiety and combinatorial assignment approach with a simplified MF extension approach. However, only CH2 and H2O formula extensions are used for MF assignment. This function results in a significantly higher number of ambiguous MF and was intended to be used after MFAssign() or on short mass lists without a complex mixture. MFAssignAll_MSMS() is a further simplified version of MFAssignAll(), which runs somewhat slower, but is more effective for assigning small mass lists with very few homologous series relationships as can be observed in MS/MS data.
-
 ## Isotope Filtering
 The IsoFiltR() function can identify many of the 13C and 34S isotope masses, which when removed from the mass list can lower the number of peaks assigned with an incorrect MF. This function operates on a two column data frame using the same structure as the MFAssign() function.

From 0213adf7975382183665604fc2050f9e5ba52b3a Mon Sep 17 00:00:00 2001
From: skschum
Date: Mon, 13 May 2019 13:29:31 -0400
Subject: [PATCH 2/2] Update README.md

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index cd2bec6..b0060d3 100644
--- a/README.md
+++ b/README.md
@@ -49,7 +49,8 @@ At least one of these noise estimation functions should be run on the mass list
 The SNplot function is used to show the mass spectrum with the masses below and above the cut point denoted using the same color scheme as in the histogram plots from either HistNoise() or KMDNoise().
 ## Internal Mass Recalibration
-RecalList(), Recal(), and Recal_2() are functions pertaining to the internal mass recalibration method adapted from Kozhinov et al. (2013) and Savory et al. (2011) using a polynomial central moving average to estimate the weights used to recalibrate the masses (Kozhinov et al., 2013) applied to spectral segments (Savory et al., 2011). The function RecalList() can be used with the output of MFAssign() or MFAssignCHO() to generate a data frame containing potential recalibrant CH2 homologous series. There are a variety of metrics included in the output of this function to aid the user in picking suitable recalibrant series, these are described in greater detail in the example of RecalList() below. The user can select up to 10 homologous series as inputs for the mass recalibration with Recal() and Recal_2(). Recal() uses H2 and O KMD and z* series to identify additional MF that are related to the user selected recalibrants. In contrast, Recal_2() does not used those series to expand the pool of potential recalibrants, using only the peaks that correspond to the homologous series chosen as recalibrants. Other than this difference Recal() and Recal_2() work exactly the same. To avoid recalibration problems associated with too many recalibrant masses, the function uses a user-defined number of tallest peaks within a user-defined mass range “bin”. For example, if the bin width is set at 20 and the number of peaks is set at 2, the function will select the two tallest peaks within each 20 m/z window across the range of the spectrum. Additionally, when the monoisotopic peak chosen as a recalibrant has an identified 13C peak, that isotopic peak will also be added to the pool of recalibrants being used. After the recalibrants have been selected, they are split into mass windows of a user defined width (default is 50 m/z) and used to calculate the correction term according to the the adapted form of the Kozhinov et al. method. This will provide a different mass correction term for each mass window in the spectrum. Then the raw mass list(s) that are being recalibrated are split into the same mass windows, and the correction term that is associated with each window is used to correct the masses in that window, thus recalibrating the full spectrum section by section. In addition to the output of recalibrated mass lists the function also generates a plot that shows the recalibration peaks that were used in context with the overall mass spectrum, and produces an output data frame containing the mass, abundance, formula, and error for the recalibrants that were used.
+RecalList(), Recal(), Recal_2(), RecalX(), and Recal_2X() are functions pertaining to the internal mass recalibration method adapted from Kozhinov et al. (2013) and Savory et al. (2011) using a polynomial central moving average to estimate the weights used to recalibrate the masses (Kozhinov et al., 2013) applied to spectral segments (Savory et al., 2011). The function RecalList() can be used with the output of MFAssign() or MFAssignCHO() to generate a data frame containing potential recalibrant CH2 homologous series. There are a variety of metrics included in the output of this function to aid the user in picking suitable recalibrant series; these are described in greater detail in the example of RecalList() below. The user can select up to 10 homologous series as inputs for the mass recalibration with Recal() and Recal_2(). Recal() uses H2 and O KMD and z* series to identify additional MF that are related to the user selected recalibrants. In contrast, Recal_2() does not use those series to expand the pool of potential recalibrants, using only the peaks that correspond to the homologous series chosen as recalibrants. Other than this difference Recal() and Recal_2() work exactly the same. To avoid recalibration problems associated with too many recalibrant masses, the function uses a user-defined number of tallest peaks within a user-defined mass range “bin”. For example, if the bin width is set at 20 and the number of peaks is set at 2, the function will select the two tallest peaks within each 20 m/z window across the range of the spectrum. Additionally, when the monoisotopic peak chosen as a recalibrant has an identified 13C peak, that isotopic peak will also be added to the pool of recalibrants being used. After the recalibrants have been selected, they are split into mass windows of a user defined width (default is 50 m/z) and used to calculate the correction term according to the adapted form of the Kozhinov et al. method. This will provide a different mass correction term for each mass window in the spectrum. Then the raw mass list(s) that are being recalibrated are split into the same mass windows, and the correction term that is associated with each window is used to correct the masses in that window, thus recalibrating the full spectrum section by section. In addition to the output of recalibrated mass lists the function also generates a plot that shows the recalibration peaks that were used in context with the overall mass spectrum, and produces an output data frame containing the mass, abundance, formula, and error for the recalibrants that were used.
+RecalX() and Recal_2X() are similar to Recal() and Recal_2(), but provide some iteration of the mass calibration and can be used more effectively with small mass windows. The homologous series are chosen in the same way as in Recal() and Recal_2(), but then they are used to do a single term recalibration for the entire spectrum instead of segments. These calibrated masses are then used to do a segmented recalibration. Within each segment the recalibrants from the previous step are used and then the tallest peaks assigned a molecular formula within each window are selected as recalibrants, with half above the central recalibrant and half below.
 # Function Examples
 ## Recommended Order of Operations
@@ -64,7 +65,7 @@ The functions will be described in the order that they are most effectively used
 5. Use RecalList() to generate a list of the potential recalibrant series.
-6. After choosing a few recalibrant series, use Recal() (or Recal_2()) to check whether they are good recalibrants and recalibrate the mass lists using those recalibrants.
+6. After choosing a few recalibrant series, use Recal() (or Recal_2(), RecalX(), Recal_2X()) to check whether they are good recalibrants and recalibrate the mass lists using those recalibrants.
 7. Use MFAssign() with the recalibrated mass lists to assign MF to the data.
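
Reviewer note on the recalibrant selection described in the Internal Mass Recalibration text of this patch ("the two tallest peaks within each 20 m/z window"): the binning rule can be sketched as below. This is only an illustration in Python, not the package's R code. The function name `pick_recalibrants`, the `(mz, abundance)` tuple layout, and the choice to anchor bins at 0 m/z are all assumptions made for the example; the actual functions may place bin edges differently.

```python
from collections import defaultdict


def pick_recalibrants(peaks, bin_width=20.0, n_per_bin=2):
    """Keep the n_per_bin most abundant peaks in each m/z bin.

    peaks: iterable of (mz, abundance) tuples (hypothetical layout).
    Bins are [0, bin_width), [bin_width, 2 * bin_width), ... which is an
    assumption of this sketch, not documented package behavior.
    """
    bins = defaultdict(list)
    for mz, abundance in peaks:
        # Integer index of the bin that this m/z value falls into.
        bins[int(mz // bin_width)].append((mz, abundance))

    selected = []
    for index in sorted(bins):
        # Rank the peaks in this bin by abundance, tallest first.
        tallest = sorted(bins[index], key=lambda p: p[1], reverse=True)
        # Keep the top n_per_bin and report them in ascending m/z order.
        selected.extend(sorted(tallest[:n_per_bin]))
    return selected


if __name__ == "__main__":
    example = [(201.0, 5.0), (205.0, 9.0), (212.0, 7.0), (225.0, 3.0), (231.0, 8.0)]
    print(pick_recalibrants(example, bin_width=20.0, n_per_bin=2))
```

With a bin width of 20 and two peaks per bin, the three peaks between 200 and 220 m/z are reduced to the two most abundant, while the two peaks between 220 and 240 m/z are both kept. Capping the count per bin is what keeps dense regions of the spectrum from dominating the recalibrant pool.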