diff --git a/MFAssignR Vignette_User Manual.html b/MFAssignR Vignette_User Manual.html
index e12e031..dcf96d9 100644
--- a/MFAssignR Vignette_User Manual.html
+++ b/MFAssignR Vignette_User Manual.html
@@ -1,1090 +1,1066 @@
-The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. It provides tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous MF assignment. This package contains MFAssign(), MFAssign_RMD(), MFAssignCHO(), MFAssignCHO_RMD(), MFAssignAll(), MFAssignAll_MSMS(), SNplot(), HistNoise(), KMDNoise(), RecalList(), Recal(), and IsoFiltR(), described in the sections below. Note that the functions with “RMD” in their names were designed to be run within an R Markdown file and are otherwise identical to the corresponding non-“RMD” versions. To learn more, please see the section titled “Semi-Automated MFAssignR Functions”. User caution with the function parameter settings and output evaluation is required; thus, several function outputs are provided to assist the user with these evaluations.
-The functions in the MFAssignR package were developed by adapting methods and algorithms from the peer reviewed literature. The following references are referred to in this document:
-Green, N. W. and Perdue, E. M.: Fast Graphically Inspired Algorithm for Assignment of Molecular Formulae in Ultrahigh Resolution Mass Spectrometry, Anal Chem, 87(10), 5086–5094, doi:10.1021/ac504166t, 2015.
-Gross, J. H.: Mass Spectrometry, doi:10.1007/978-3-319-54398-7, 2017.
-Herzsprung, P., Hertkorn, N., Tümpling, W. von, Harir, M., Friese, K. and Schmitt-Kopplin, P.: Understanding molecular formula assignment of Fourier transform ion cyclotron resonance mass spectrometry data of natural organic matter from a chemical point of view, Anal Bioanal Chem, 406(30), 7977–7987, doi:10.1007/s00216-014-8249-y, 2014.
-Koch, B. P., Dittmar, T., Witt, M. and Kattner, G.: Fundamentals of Molecular Formula Assignment to Ultrahigh Resolution Mass Data of Natural Organic Matter, Anal Chem, 79(4), 1758–1763, doi:10.1021/ac061949s, 2007.
-Kozhinov, A. N., Zhurov, K. O. and Tsybin, Y. O.: Iterative Method for Mass Spectra Recalibration via Empirical Estimation of the Mass Calibration Function for Fourier Transform Mass Spectrometry-Based Petroleomics, Anal Chem, 85(13), 6437–6445, doi:10.1021/ac400972y, 2013.
-Kujawinski, E. B. and Behn, M. D.: Automated Analysis of Electrospray Ionization Fourier Transform Ion Cyclotron Resonance Mass Spectra of Natural Organic Matter, Anal Chem, 78(13), 4363–4373, doi:10.1021/ac0600306, 2006.
-Lobodin, V. V., Marshall, A. G. and Hsu, C. S.: Compositional Space Boundaries for Organic Compounds, Anal Chem, 84(7), 3410–3416, doi:10.1021/ac300244f, 2012.
-Ohno, T. and Ohno, P. E.: Influence of heteroatom pre-selection on the molecular formula assignment of soil organic matter components determined by ultrahigh resolution mass spectrometry, Anal Bioanal Chem, 405(10), 3299–3306, doi:10.1007/s00216-013-6734-3, 2013.
-Perdue, E. M. and Green, N. W.: Isobaric Molecular Formulae of C, H, and O: A View from the Negative Quadrants of van Krevelen Space, Anal Chem, 87(10), 5079–5085, doi:10.1021/ac504165k, 2015.
-Savory, J. J., Kaiser, N. K., McKenna, A. M., Xian, F., Blakney, G. T., Rodgers, R. P., Hendrickson, C. L., and Marshall, A. G.: Parts-Per-Billion Fourier Transform Ion Cyclotron Resonance Mass Measurement Accuracy with a “Walking” Calibration Equation, Anal Chem, 83, 1732-1736, doi:10.1021/ac102943z, 2011.
-Zheng, Q., Morimoto, M., Sato, H. and Fouquet, T.: Resolution-enhanced Kendrick mass defect plots for the data processing of mass spectra from wood and coal hydrothermal extracts, Fuel, 235, 944–953, doi:10.1016/j.fuel.2018.08.085, 2019.
-Zhurov, K. O., Kozhinov, A. N., Fornelli, L. and Tsybin, Y. O.: Distinguishing Analyte from Noise Components in Mass Spectra of Complex Samples: Where to Cut the Noise, Anal Chem, 86(7), 3308–3316, doi:10.1021/ac403278t, 2014.
-The MF assignment algorithm in MFAssign was adapted from the low mass moiety CHOFIT assignment algorithm developed by Green and Perdue (2015). In total there are four versions of MFAssign: MFAssign(), MFAssignCHO(), MFAssignAll(), and MFAssignAll_MSMS(). MFAssign(), MFAssignAll(), and MFAssignAll_MSMS() include external nested loops to assign additional heteroatoms, as described in Green and Perdue (2015), while MFAssignCHO() does not. Briefly, the CHOFIT algorithm uses low mass moieties such as CH4O-1 and C4O-3 to move around in the O/C and H/C space to assign MF with C, H, and O (CHO MF). These low mass moieties efficiently assign CHO MF without conventional loops. Additional combinatorial assignments with various heteroatoms are made using nested loops that subtract the mass of a heteroatom from the measured ion mass, creating a CHO “core” mass, which can then be assigned using the low mass moiety CHOFIT approach. This is further explained in Green and Perdue (2015) and Perdue and Green (2015).
-Using the low mass moiety and combinatorial assignment approach, MFAssign() can be used to assign MF with 12C, 1H, and 16O and a variety of heteroatoms and isotopes, including 2H, 13C, 14N, 15N, 31P, 32S, 34S, 35Cl, 37Cl, and 19F. It can also assign Na+ adducts, which are common in positive ion mode. Because the number of chemically reasonable MF grows with the number of possible elements and with molecular weight, the output provides separate lists of ambiguous and unambiguous MF.
-Advanced Kendrick mass and z* sorting tools are used to reduce the number of ambiguous MF in MFAssign(). First, Kendrick mass defect (KMD) and z* values are calculated with a CH2 Kendrick base to sort the measured masses into CH2 homologous series (Stenson et al., 2003). The function then selects 1 to 3 members of each CH2 homologous series with masses below the user defined cutoff and attempts to assign MF. The ambiguous MF are then returned to the unassigned list. Then, the unambiguous MF are used as seeds for additional assignments using CH2, O, H2, H2O, and CH2O MF extensions (Kujawinski and Behn, 2006). To do the formula extensions, the KMD and z* values for each of these bases are calculated and then used to assign MF through the addition or subtraction of the series bases. MFAssign() (and MFAssignCHO()) tracks how many different “paths” can be used to assign each MF, and if a single mass has multiple MF, the function will choose the MF that has the largest number of paths that intersect with it. For example, if a single mass has two possible MF and one has 20 potential “paths” to it, while the other has 4, the function will choose the MF with 20 paths. Work is ongoing to track these paths and the removed MF in the data frame output of these functions. Overall, the multi-path MF extension approach greatly reduces the number of ambiguous assignments and provides an increased level of confidence in the final MF list because the MF are related to unambiguous MF assigned below the user defined cut point. An additional step to decrease the number of ambiguous and/or incorrect sulfur assignments was also added. This step requires that for a sulfur containing compound to act as a seed it must be unambiguous and have a matching 34S peak, when both monoisotopic and isotopic mass lists from the IsoFiltR() function are assigned MF. This has been implemented for all versions of the MFAssign functions.
-MFAssignCHO() is a simplified version of MFAssign() used only to assign MF with CHO elements. MFAssignCHO() runs faster than MFAssign() and is best used as a preliminary MF assignment step prior to the selection of recalibrant ions in conjunction with RecalList() and Recal(), which are described below.
-MFAssignAll() uses the low mass moiety and combinatorial assignment approach with a simplified MF extension approach. However, only CH2 and H2O formula extensions are used for MF assignment. This function results in a significantly higher number of ambiguous MF and was intended to be used after MFAssign() or on short mass lists without a complex mixture. MFAssignAll_MSMS() is a further simplified version of MFAssignAll(), which runs somewhat slower, but is more effective for assigning small mass lists with very few homologous series relationships as can be observed in MS/MS data.
-Below is a table highlighting the features of the various forms of MFAssign, which were described above.
-| Features | MFAssign | MFAssignCHO | MFAssignAll | MFAssignAll_MSMS | MFAssign_RMD | MFAssignCHO_RMD |
-|---|---|---|---|---|---|---|
-| Heteroatoms | X | | X | X | X | |
-| Odd Electron and Sodium Adduct Assignment | X | X | X | X | X | X |
-| CH2 KMD Pre-Filtering | X | X | X | | X | X |
-| DeNovo cut | X | X | | | X | X |
-| Simple Formula Extension | | | X | X | | |
-| Advanced Formula Extension | X | X | | | X | X |
-| Standard QA Parameters | X | X | X | X | X | X |
-| Advanced QA Parameters | X | X | X | X | X | X |
-| Increased Unambiguity | X | X | | | X | X |
-| Full Ambiguity | | | X | X | | |
-| Isotope Matching | X | X | X | X | X | X |
-| Automatic .rs.Restart() | X | X | X | X | | |
-| Compatible with Rmarkdown | | | | | X | X |
-| Improved Performance for MS/MS Data | | | | X | | |
Simple Formula Extension uses CH2 and H2O formula extensions.
Advanced Formula Extension uses the combination of CH2, H2O, H2, O, and CH2O formula extensions with multiple iterations to improve assignment.
Standard QA includes the user-defined O/C, H/C, DBE, and error limits in addition to the non-adjustable Senior Rules, Nitrogen Rule, Rule 13, and Maximum H Rule.
Advanced QA includes a sulfur isotope check, the heteroatom cut (HetCut), and the nominal mass series cut (NMScut); HetCut and NMScut can be turned on or off externally.
The IsoFiltR() function can identify many of the 13C and 34S isotope masses, which when removed from the mass list can lower the number of peaks assigned with an incorrect MF. This function operates on a two column data frame using the same structure as the MFAssign() function.
-IsoFiltR() identifies potential isotope masses using a four-step identification method.
-First, the mass list is transformed to identify mass difference pairs appropriate for the element being investigated (delta mass for C (1.003355) or S (1.995797), with +/- 5 ppm mass error). Only pairs that meet this criterion move on to step 2.
Using the mass difference between 12C/13C (1.003355) or 32S/34S (1.995797), the KMD value can be calculated for a specific isotope. This means that the 12C (32S) monoisotopic peak will be in a KMD homologous series with its matching 13C (34S) isotopic peak, analogous to homologous series of CH2. If the KMD values are equivalent for the candidate pair, the peaks can be considered to be in a series and the pair will move on to the third step. The equations for 13C are: KM = 1/1.003355 * m/z, KMD = nominal mass - KM. Replacing 1/1.003355 with 2/1.995797 makes this work for 34S.
Isotopic pairs are separated using a “Resolution Enhanced KMD” approach adapted from Zheng et al. (2019). Resolution enhanced KMD values are calculated by dividing the mass of a homologous series base (in this case CH2) by an integer that was experimentally determined to accomplish the desired separation. The divided base mass is then used in the usual KM and KMD calculation. As an example, BaseMass_adj = 14.01565 / 21 is the integer divided base mass, which is used in the KM calculation: KMr = (round(BaseMass_adj) / BaseMass_adj) * m/z, followed by KMDr = round(KMr) - KMr to calculate the resolution enhanced KMD. For 13C the integer is 21, while for 34S it is 12. After this calculation, 12C/13C or 32S/34S pairs have characteristic KMDr differences, which are used to select the most likely isotope pairs. The KMDr difference is calculated by subtracting the KMDr value of the suspected isotope mass from the KMDr of the suspected monoisotopic mass. The expected values are -0.291 and 0.709 for 32S/34S and -0.496 and 0.503 for 12C/13C. If the difference for a candidate pair matches one of these values, the pair moves on to step four.
The fourth step uses abundance ratios to constrain the remaining isotope pairs to ensure that the isotope peaks are not too large or too small relative to the intensity of the monoisotopic peak. The limits on this are loose due to the variation of isotope abundance with analyte signal (similar to isotope dilution) as observed in ultrahigh resolution Orbitrap and FT-ICR measurements.
The candidate pairs that make it through these four steps are put into two data frames, Mono and Iso, which contain the monoisotopic and isotopic peaks respectively. Then all peaks that were not flagged as possible mono/iso pairs are added to the Mono output data frame. In complex mixtures, some peaks can be flagged as both monoisotopic and isotopic. In these cases, the masses are included in both outputs and are classified as either monoisotopic or isotopic after the MF assignment.
-When the two data frame outputs from IsoFiltR() are put into MFAssign(), the function will match the assigned monoisotopic masses to their corresponding isotopic masses. Additional work would be needed to use the isotopes to reduce ambiguous MF assignments assigned to a single mass. Thus IsoFiltR() should not be considered as definitive proof of the presence or absence of 13C or 34S in MF, but it does assign MF with these expected naturally occurring isotopes and limits the chances that they are incorrectly assigned with a monoisotopic MF.
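The first three IsoFiltR() screening steps described above can be sketched for a single idealized 12C/13C pair. This is a minimal illustration, not the package's implementation: the helper names and the example mass are hypothetical, the ppm tolerance is assumed to be taken relative to the monoisotopic mass, and "nominal mass" in the KMD equation is read here as round(KM).

```r
# Constants taken from the text above
delta_13C <- 1.003355   # 13C - 12C mass difference
ch2_mass  <- 14.01565   # CH2 Kendrick base mass
divisor_C <- 21         # integer divisor reported for 13C

# Step 1: does the mass difference match the isotope delta within `ppm`?
# (assumption: the tolerance is relative to the monoisotopic mass)
delta_matches <- function(mono, iso, delta, ppm = 5) {
  abs((iso - mono) - delta) / mono * 1e6 <= ppm
}

# Step 2: isotope-based KMD, KM = (1 / delta) * m/z, KMD = round(KM) - KM
isotope_kmd <- function(mz, delta) {
  km <- mz / delta
  round(km) - km
}

# Step 3: resolution enhanced KMD with an integer divided CH2 base
kmd_r <- function(mz, divisor, base = ch2_mass) {
  base_adj <- base / divisor
  km_r <- (round(base_adj) / base_adj) * mz
  round(km_r) - km_r
}

mono <- 301.0718          # hypothetical monoisotopic m/z
iso  <- mono + delta_13C  # idealized 13C partner

step1 <- delta_matches(mono, iso, delta_13C)
step2 <- abs(isotope_kmd(mono, delta_13C) - isotope_kmd(iso, delta_13C)) < 1e-4
kmdr_diff <- kmd_r(mono, divisor_C) - kmd_r(iso, divisor_C)
step3 <- any(abs(kmdr_diff - c(-0.496, 0.503)) < 0.001)
```

Because the isotope peak shifts KMr by a fixed amount, the KMDr difference can only land on one of the two values quoted in the text, depending on how the two KMr values round. The abundance ratio check of step four is not shown.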
-MFAssign() includes a number of quality assurance (QA) steps to check the assigned MF for chemical reasonableness. Relatively lenient default settings are provided to avoid removing chemically reasonable ambiguous MF assignments. Many of these parameters are customizable, including DBE-O limits (Herzsprung et al., 2014), O/C ratio limits, H/C ratio limits, and minimum number of O. The HetCut parameter can be used to select the MF with the lowest number of heteroatoms, if more than one MF is assigned to a single mass (Ohno and Ohno, 2013). The NMScut parameter identifies the CH4 vs O exchange series in each nominal mass as described in Koch et al. (2007), which can be used to limit ambiguous assignments. Additional non-adjustable QA parameters are used in all of the MFAssign functions, including the nitrogen rule, the large atom rule, the maximum number of H rule, the maximum DBE rule (Lobodin et al., 2012), and the Senior rules (Kind et al. 2007).
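As an illustration of the adjustable QA limits, the sketch below filters a few hypothetical CHO candidates by O/C, H/C, and DBE-O. The function name and candidate formulas are invented for this example; the default limits are the ones listed for the MFAssignCHO() parameters in this document, and DBE is assumed to be computed for a neutral CHO formula as C - H/2 + 1. The package itself may apply these checks differently.

```r
# Hypothetical candidate CHO formulas (illustrative only)
cand <- data.frame(
  formula = c("C10H12O5", "C10H2O1", "C5H4O14"),
  C = c(10, 10, 5),
  H = c(12, 2, 4),
  O = c(5, 1, 14),
  stringsAsFactors = FALSE
)

# Keep only rows inside the O/C, H/C, and DBE-O limits
qa_filter <- function(df, O_Cmin = 0, O_Cmax = 2.5, H_Cmin = 0.3, H_Cmax = 3,
                      DBEOmin = -13, DBEOmax = 13) {
  oc    <- df$O / df$C
  hc    <- df$H / df$C
  dbe   <- df$C - df$H / 2 + 1   # assumption: DBE of a neutral CHO formula
  dbe_o <- dbe - df$O
  keep <- oc >= O_Cmin & oc <= O_Cmax &
    hc >= H_Cmin & hc <= H_Cmax &
    dbe_o >= DBEOmin & dbe_o <= DBEOmax
  df[keep, , drop = FALSE]
}

passed <- qa_filter(cand)
```

Here the second candidate fails the minimum H/C limit and the third exceeds the maximum O/C limit, so only the first formula is retained.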
-Noise level assessment can be accomplished using either the HistNoise() or KMDNoise() function in conjunction with the SNplot() function. The HistNoise() method is based on the method developed by Zhurov et al. (2014), and KMDNoise() is a new custom method based on our observations of raw data Kendrick mass defect analysis.
-The Zhurov et al. (2014) method uses a histogram of the natural log intensities in the measured raw mass spectrum to determine the point where noise peaks give way to analyte signal. The HistNoise() function attempts to identify this point and reports the noise level so that the signal-to-noise cut point can be determined. The cut point is shown in an output plot colored from red to blue, where red indicates noise. The cut point can also be set manually if the function does not predict a reasonable noise level. We have observed this function to be confounded by distributions that do not match the theoretical distribution, making it difficult or impossible for the function to identify the correct noise cut point. For this reason, we developed the KMDNoise() function described below.
-The KMDNoise() method is based on the observation that the CH2 based KMD values of noise peaks and analyte peaks are naturally separated in a KMD plot, allowing the function to select a region with only noise peaks and use the average intensity of these values to estimate the noise. We refer to this as the KMD slice method. In principle, this is similar to what was briefly described in Reidel and Dittmar (2014), but instead of using a static range of normal mass defects (0.3-0.9), our method uses a mass dependent KMD region, which avoids potentially doubly charged peaks with a mass defect of ~0.5, which would be considered as noise in the Reidel and Dittmar method.
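The KMD slice idea can be sketched as follows. This is a hedged illustration rather than the KMDNoise() source: the function name and toy peaks are hypothetical, the CH2-based KMD is assumed to be round(mass) - mass * 14 / 14.01565, and the slope of the slice boundaries is an assumed value exposed as a parameter, since the text only specifies the y-intercepts (0.05 and 0.2).

```r
# Estimate noise as the mean abundance of peaks inside a mass-dependent
# KMD slice. Column 1 is ion mass, column 2 is abundance.
kmd_slice_noise <- function(df, upper = 0.2, lower = 0.05, slope = 0.0011232) {
  mass  <- df[[1]]
  abund <- df[[2]]
  km  <- mass * (14 / 14.01565)   # CH2-based Kendrick mass
  kmd <- round(mass) - km         # assumption: KMD = nominal mass - KM
  in_slice <- kmd > slope * mass + lower & kmd < slope * mass + upper
  mean(abund[in_slice])
}

# Toy spectrum: two peaks that fall in the slice and two that do not
toy <- data.frame(
  mass      = c(300.90, 301.0718, 500.85, 501.15),
  abundance = c(100, 5000, 140, 8000)
)

noise <- kmd_slice_noise(toy)
```

In this toy case only the two low-abundance peaks fall between the boundary lines, so the estimate is their mean abundance.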
-At least one of these noise estimation functions should be run on the mass list prior to MF assignment with MFAssign() or isotope filtering with IsoFiltR(). Setting a reasonable S/N cut point greatly increases the speed of the functions and improves the output quality.
-The SNplot function is used to show the mass spectrum with the masses below and above the cut point denoted using the same color scheme as in the histogram plots from either HistNoise() or KMDNoise().
-RecalList(), Recal(), and Recal_2() are functions pertaining to the internal mass recalibration method adapted from Kozhinov et al. (2013) and Savory et al. (2011), using a polynomial central moving average to estimate the weights used to recalibrate the masses (Kozhinov et al., 2013) applied to spectral segments (Savory et al., 2011). The function RecalList() can be used with the output of MFAssign() or MFAssignCHO() to generate a data frame containing potential recalibrant CH2 homologous series. There are a variety of metrics included in the output of this function to aid the user in picking suitable recalibrant series; these are described in greater detail in the example of RecalList() below. The user can select up to 10 homologous series as inputs for the mass recalibration with Recal() and Recal_2(). Recal() uses H2 and O KMD and z* series to identify additional MF that are related to the user selected recalibrants. In contrast, Recal_2() does not use those series to expand the pool of potential recalibrants, using only the peaks that correspond to the homologous series chosen as recalibrants. Other than this difference, Recal() and Recal_2() work exactly the same. To avoid recalibration problems associated with too many recalibrant masses, the function uses a user-defined number of tallest peaks within a user-defined mass range “bin”. For example, if the bin width is set at 20 and the number of peaks is set at 2, the function will select the two tallest peaks within each 20 m/z window across the range of the spectrum. Additionally, when the monoisotopic peak chosen as a recalibrant has an identified 13C peak, that isotopic peak will also be added to the pool of recalibrants being used. After the recalibrants have been selected, they are split into mass windows of a user defined width (default is 50 m/z) and used to calculate the correction term according to the adapted form of the Kozhinov et al. method.
This will provide a different mass correction term for each mass window in the spectrum. Then the raw mass list(s) that are being recalibrated are split into the same mass windows, and the correction term that is associated with each window is used to correct the masses in that window, thus recalibrating the full spectrum section by section. In addition to the output of recalibrated mass lists the function also generates a plot that shows the recalibration peaks that were used in context with the overall mass spectrum, and produces an output data frame containing the mass, abundance, formula, and error for the recalibrants that were used.
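The "tallest peaks per bin" selection described above can be sketched as below. The function name, column names, and toy data are hypothetical, and the bins are assumed to start at multiples of the bin width; the package may place its windows differently.

```r
# Keep the `n_peaks` most abundant candidate recalibrants in each m/z bin
pick_recalibrants <- function(df, bin_width = 20, n_peaks = 2) {
  bin <- floor(df$mass / bin_width)   # assumption: bins start at multiples of bin_width
  picked <- lapply(split(df, bin), function(d) {
    head(d[order(d$abundance, decreasing = TRUE), , drop = FALSE], n_peaks)
  })
  out <- do.call(rbind, picked)
  rownames(out) <- NULL
  out[order(out$mass), ]
}

# Toy candidate list: three peaks in the 200-220 bin, two in the 220-240 bin
toy <- data.frame(
  mass      = c(201, 205, 215, 222, 230),
  abundance = c(10, 50, 30, 5, 7)
)

selected <- pick_recalibrants(toy, bin_width = 20, n_peaks = 2)
```

With a bin width of 20 and two peaks per bin, the least abundant peak in the first bin (m/z 201) is dropped and the other four are kept.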
-The functions will be described in the order that they are most effectively used. The functions do not have to be run in this order, but the best results will likely be obtained in this way. A list of the functions in the recommended order is given below:
-1. Run HistNoise() or KMDNoise() to determine the noise level for the data.
-2. Check effectiveness of S/N cut point using SNplot().
3. Use IsoFiltR() to identify potential 13C and 34S isotope peaks.
4. Using the S/N cut point, and the two data frames output from IsoFiltR(), run MFAssignCHO() to assign CHO MF to potentially be used as recalibration ions.
5. Use RecalList() to generate a list of the potential recalibrant series.
6. After choosing a few recalibrant series, use Recal() (or Recal_2()) to check whether they are good recalibrants and recalibrate the mass lists using those recalibrants.
7. Use MFAssign() with the recalibrated mass lists to assign MF to the data.
8. Check the output plots from MFAssign() to check the quality of the assignment.
The following functions are used for mass lists containing noise.
-This function is an adaptation of the method developed by Zhurov et al. (2014). It is used to estimate the noise level for raw mass spectral data from both FT-ICR and Orbitrap MS. There should, in theory, be a significant first peak that contains the measured masses due to random noise, followed by additional distinct peaks. This function finds the valley between the random noise and the analyte signal of the histogram output. The output noise level can then be used to estimate the signal to noise cut level and constrain the masses that are considered in the MFAssign() function. In some cases the data does not form the expected distribution and even when it does, sometimes the function does not identify the correct valley. In these cases, the KMD slice method generally provides a useful estimate of the noise because it is more general to samples regardless of their noise distribution.
-Data <- read.csv("YourMassList.csv")
-#You can read in an external data set. Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance).
-
-HistNoise(Data, SN = 0, bin = 0.01)
-df - a two column data frame containing measured ion mass (first column) and ion abundance (second column).
SN - a manual S/N cut point if the function does not find an acceptable value; default is 0.
bin - the binwidth for generating the histogram; default is 0.01.
The output of HistNoise() is a list containing the following components:
-“Noise” - a numeric value containing the estimated noise level.
“Hist” - a histogram of the intensity distribution of the peaks in the mass spectrum. It is color coded to highlight peaks below (red) and above (blue) the estimated noise level.
This function implements the KMD slice method of estimating the noise level for a mass spectrum, described previously in this document.
-Data <- read.csv("YourMassList.csv")
-#You can read in an external data set. Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance).
-
-KMDNoise(Data, upper = 0.2, lower = 0.05)
-upper - the y-intercept for the upper boundary of the KMD slice; default is 0.2.
lower - the y-intercept for the lower boundary of the KMD slice; default is 0.05.
The output of KMDNoise() is a list containing the following components:
-“Noise” - a numeric value containing the estimated noise level.
“KMD” - a KMD plot showing the KMD values for all peaks in the spectrum, with the selected noise estimation region bounded by red lines.
This function generates a mass spectrum with color coded mass peaks to indicate if they are below (red) or above (blue) the S/N cut point. This can be used as a qualitative check of the suggested output from HistNoise() or KMDNoise(). It can also be used for qualitative investigation of the S/N level in the mass spectrum independent of the two noise estimation functions.
-Data <- read.csv("YourMassList.csv")
-#You can read in an external data set. Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance).
-
-SNplot(Data, cut = 500, mass = 400, window.x = 0.5, window.y = 10)
-df - a two column data frame containing measured ion mass (first column) and ion abundance (second column).
cut - the signal-to-noise cut point.
mass - the center mass of the window.
window.x - the width of window on either side of the center mass; default is 0.5.
window.y - sets the upper limit of the y axis of the plot as the cut point multiplied by this value; default is 10.
IsoFiltR() provides a tentative filtering of masses with 13C and 34S from the overall mass list, as described previously. This decreases the likelihood of incorrect assignments. Be sure to include a noise level, which lessens the number of peaks being considered. The way isotopes are identified requires the generation of very large data frames, so if too many peaks are considered the function will take a long time to run, or will not be able to finish at all.
-Data <- read.csv("YourMassList.csv")
-#You can read in an external data set. Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance).
-#Be sure to include a signal-to-noise level cut value so that the function will work properly.
-
-Mono_Iso <- IsoFiltR(Data, SN = 500, Diffrat = 0.1)
-
-Mono <- Mono_Iso[["Mono"]]
-Iso <- Mono_Iso[["Iso"]]
-Data - a two column data frame containing measured ion mass (first column) and ion abundance (second column).
SN - a user defined signal-to-noise cut point; default is 0.
Diffrat - a user defined ratio to tighten (larger value) or loosen (lower value) the intensity thresholds for identifying a peak as an isotopic peak; default is 0.1.
MFAssignCHO() is a simpler version of MFAssign() that only assigns CHO MF. This can be helpful when trying to do a quick assignment prior to internal mass recalibration. The MF assignment algorithm is based on the same principles as the full MFAssign(). It uses the CHOFIT algorithm to do a preliminary assignment of the CH2 homologous series on a subset of the masses and Kendrick Mass Defect and z* series analysis to extend the assignments to related remaining masses. An example of its usage is shown below, along with its input parameters.
-Assign <- MFAssignCHO(peaks = Mono, isopeaks = Iso, ionMode = "pos", lowMW = 100, highMW = 1000, Mx = 1, ppm_err = 3, H_Cmin = 0.3)
-#This is a typical set of parameters for positive ion data.
-
-#The output list includes the following datasets.
-Unambig <- Assign[["Unambig"]] #Unambiguous MF assignments data frame
-Ambig <- Assign[["Ambig"]] #Ambiguous MF assignments data frame
-Unassigned <- Assign[["None"]] #Unassigned values data frame
-
-#The output list includes the following plots.
-MSAssign <- Assign[["MSAssign"]] #Mass spectrum showing assigned, unassigned, and isotope peaks. Assigned peaks are in green, unassigned peaks are in red, and isotope peaks are in blue.
-
-Error <- Assign[["Error"]] #Error plot with m/z vs. absolute error (ppm) including the unambiguous MF in blue and the ambiguous MF in red.
-
-MSgroups <- Assign[["MSgroups"]] #Mass spectrum showing the assigned peaks with color to indicate the elemental group. The plot is faceted by ambiguity of the MF assignments.
-
-VK <- Assign[["VK"]] #van Krevelen plot colored by elemental group and faceted by ambiguity of the MF assignments.
-
-MSAssign #Print MSAssign
-Error #Print Error
-MSgroups #Print MSgroups
-VK #Print VK
-peaks - the input data frame with the measured ion mass in the first column followed by measured ion abundance in the second column; the column names can be anything.
isopeaks - the input isotopic masses data frame with the same structure as “peaks”; if the two data frames (peaks and isopeaks) come from the IsoFiltR function they will be formatted correctly.
ionMode - the ionization mode, either “pos” for positive mode or “neg” for negative mode; the values are case sensitive.
POEx - the assignment of positive odd or even electron ions. When POEx is set to 0, only positive even electron ions are permitted. When POEx is set to 1, positive odd electron ions are allowed, in addition to even electron ions. The default is 0. This option is useful when the measured ions were generated by either atmospheric pressure chemical ionization (APCI) or photoionization (APPI).
NOEx - the assignment of negative odd electron ions. When NOEx is set to 0, only negative even electron ions are permitted. When NOEx is set to 1, negative odd electron ions are allowed, in addition to even electron ions (1 = on). The default is 0. This option is useful when the measured ions were generated by either atmospheric pressure chemical ionization (APCI) or photoionization (APPI).
lowMW - the minimum ion mass to be assigned. The default is 100.
highMW - the maximum ion mass to be assigned. The default is 1000.
Ex - the maximum number of 13C to be used in the function.
Mx - the maximum number of Na+ adducts to be used in the function. Note that this is important for most positive mode ESI data. The default is 0.
NH4x - the maximum number of NH4+ adducts to be used in the function. The default is 0. Note that this will replace one N and 4 H in a CHNO MF that does not have an NH4+ adduct, so great care should be taken with the MF assignments to understand if they are correct.
--Note that the addition of more heteroatoms or adducts will decrease the speed of the function. This is especially true if more than one type of heteroatom is allowed.
-
Zx - the maximum number of charges allowed. The default is 1. In principle, MFAssign() could assign multiply charged compounds, but this is not supported at this time.
Ox - the maximum number of O to be assigned. It also indirectly sets the number of loops for the core CHOFIT algorithm. Each core loop increases the number of O by 3, which means that setting this value to 30 allows 10 loops of the inner algorithm, which seems to be a good balance between function speed and reasonable MF assignment.
ppm_err - the maximum allowable error for MF assignment and monoisotope/polyisotope peak matching. The value is in parts per million (ppm) and the default is 3.
SN - the signal-to-noise level for the data. This is useful if the signal-to-noise value for the data is known from an external source (such as HistNoise() or KMDNoise()). MFAssign() does not have the ability to independently determine the signal-to-noise level. The default is 0, and the input value must be consistent with the type of abundance in column 2 of peaks (i.e., intensity or relative abundance).
O_Cmin - the minimum allowable O/C ratio for the assigned MF. The default is 0.
O_Cmax - the maximum allowable O/C ratio for the assigned MF. The default is 2.5.
H_Cmin - the minimum allowable H/C ratio for the assigned MF. The default is 0.3.
H_Cmax - the maximum allowable H/C ratio for the assigned MF. The default is 3.
DBEOmin - the minimum allowable DBE minus O value. The default is -13, consistent with Herzsprung et al. 2014.
DBEOmax - the maximum allowable DBE minus O value. The default is 13, consistent with Herzsprung et al. 2014.
Omin - the minimum allowable number of O for a MF. The default is 0.
HetCut - a filtering step that compares ambiguous MF and removes the MF with the higher number of heteroatoms (heteroatoms are defined as all elements that are not C or H). This parameter is based on Ohno and Ohno (2013). The default setting is “off” because this can lead to incorrect assignments, especially if many heteroatoms are expected for the data. The input values are “on” or “off” and are case sensitive.
NMScut - a filtering step based on nominal mass patterns as described by Koch et al. (2007). It helps to decrease the number of ambiguous MF. The default setting is “on”; to turn this option off, use “NMScut = off”.
DeNovo - a threshold where m/z values above this threshold are only assigned MF via a formula extension; values below this threshold are not restricted to a formula extension relationship. The default setting is 1000 for CHO assignment.
nLoop - the number of times that the formula extension component of MFAssignCHO() will loop to assign MF, which were not previously assigned using the CHOFIT algorithm. The default number of loops is 5.
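The ratio and DBE-O limits described above behave like simple filters on candidate formulas. The following is a minimal base R sketch of that idea, not the package's internal code: the data frame `cand` and the helper `passes_limits()` are hypothetical names introduced for illustration, and DBE is computed with the CHO-only expression C - H/2 + 1.

```r
# Hypothetical candidate formulas (element counts only); not MFAssignR output.
cand <- data.frame(
  formula = c("C10H8O5", "C5H12O13", "C20H2O1"),
  C = c(10, 5, 20),
  H = c(8, 12, 2),
  O = c(5, 13, 1),
  stringsAsFactors = FALSE
)

# Illustrative helper (an assumption, not an MFAssignR function):
# apply the default limits listed in the parameter descriptions.
passes_limits <- function(df, O_Cmin = 0, O_Cmax = 2.5,
                          H_Cmin = 0.3, H_Cmax = 3,
                          DBEOmin = -13, DBEOmax = 13, Omin = 0) {
  O_C  <- df$O / df$C
  H_C  <- df$H / df$C
  DBE  <- df$C - df$H / 2 + 1   # CHO-only double bond equivalents
  DBEO <- DBE - df$O
  O_C >= O_Cmin & O_C <= O_Cmax &
    H_C >= H_Cmin & H_C <= H_Cmax &
    DBEO >= DBEOmin & DBEO <= DBEOmax &
    df$O >= Omin
}

cand$keep <- passes_limits(cand)
print(cand)
```

In this sketch the second candidate fails the O/C maximum and the third fails the H/C minimum, so only the first would be retained.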
The output of the function is a list containing 3 data frames and 4 plots. The data frames will be described here first.
-The first data frame (Unambig) contains the assigned unambiguous MF along with other parameters that are useful for data interpretation, such as O/C, H/C, DBE, and more. The column headers and a brief description of each are given below.
-The second data frame (Ambig) contains the assigned ambiguous MF, with the same additional information as the Unambig data frame.
-The third data frame (None) contains the ion masses that were not assigned to a MF, along with their corresponding abundance.
-The following column headers are the same for both Unambig and Ambig data frames.
abundance - the measured abundance of each identified species; it is identical to the input data frame values.
-exp_mass - the measured experimental ion mass from the input data frame; it is identical to the input data frame values.
formula - the assigned MF for the experimental mass.
class - the heteroatom class of the MF based on the number of heteroatoms.
group - the elemental group of the MF (CHO, CHNO, etc.).
C - the total number of assigned 12C + 13C atoms.
H - the number of assigned 1H atoms.
O - the number of assigned 16O atoms.
N - the number of assigned 14N atoms.
S - the number of assigned 32S atoms.
P - the number of assigned 31P atoms.
E - the number of assigned 13C atoms.
S34 - the number of assigned 34S atoms.
N15 - the number of assigned 15N atoms.
D - the number of assigned 2H atoms.
Cl - the number of assigned 35Cl atoms.
Cl37 - the number of assigned 37Cl atoms.
M - the number of assigned Na+ adducts.
NH4 - the number of assigned NH4+ adducts.
POE - indicates whether the assigned MF is a positive odd electron mass (1) or not (0).
NOE - indicates whether the assigned MF is a negative odd electron mass (1) or not (0).
Z - the charge on the mass.
neutral_mass - the neutral mass defined as Exp_mass plus or minus its adduct (either H+ or Na+) depending on whether the mass was collected in the negative or positive mode.
O_C - the O/C ratio for the assigned MF.
H_C - the H/C ratio for the assigned MF.
theor_mass - the theoretical neutral mass for the assigned MF using the exact masses of the atoms.
DBE - the number of double bond equivalents for the assigned MF. Note, only the lowest valence number is considered for multivalent elements. Therefore, this does not include unsaturations associated with oxidized elements such as S.
err_ppm - the error between the measured mass and the theoretical mass for the assigned MF in parts per million (ppm). The adduct mass is considered in this calculation.
AE_ppm - the absolute value of the err_ppm.
KM - the Kendrick mass using CH2 as the Kendrick base.
KMD - the Kendrick mass defect using CH2 as the Kendrick base.
max_LA - the theoretical maximum allowable number of large atoms (elements larger than 2H) for a measured mass based on the ‘Rule of 13’.
actual_LA - the actual number of large atoms (elements larger than 2H) in the assigned MF.
rule_13 - the ratio of actual_LA-to-max_LA. If the ratio is less than 1, the MF is chemically feasible based on the ‘Rule of 13’.
DBEO - the DBE value minus the number of O atoms in the MF.
max_H - the maximum possible number of H atoms for the measured mass.
H_test - the number of H atoms divided by the max_H value. If the ratio is less than 1, the MF is chemically feasible based on this parameter.
C13_mass - the measured mass of the single 13C polyisotopic mass from the “isopeaks” input that was matched to the assigned monoisotopic mass from the “peaks” input.
C13_abund - the measured abundance of the single 13C polyisotopic mass that was matched to the assigned monoisotopic mass.
C13_mass2 - the measured mass of the double 13C polyisotopic mass from the “isopeaks” input that was matched to the assigned monoisotopic mass from the “peaks” input.
C13_abund2 - the measured abundance of the double 13C polyisotopic mass that was matched to the assigned monoisotopic mass.
S34_mass - the measured mass of the single S34 polyisotopic mass from the “isopeaks” input that was matched to the assigned monoisotopic mass from the “peaks” input.
S34_abund - the measured abundance of the single S34 polyisotopic mass that was matched to the assigned monoisotopic mass.
tag - a tag identifying whether an assignment is ambiguous or unambiguous. It is denoted as “Ambiguous” or “Unambiguous”.
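Several of the columns above are simple functions of the element counts and the measured mass. The sketch below works through one entry from the negative ion example data (C10H8O5 measured at 207.0301) in base R. It assumes a deprotonated ion, rounded exact masses, and an error sign convention of (measured - theoretical); the package may round or sign these values differently.

```r
# Rounded exact masses (u); values are approximations for illustration.
m_C <- 12.000000
m_H <- 1.007825
m_O <- 15.994915
m_proton <- 1.007276

exp_mass <- 207.0301             # measured [M-H]- ion mass
n_C <- 10; n_H <- 8; n_O <- 5    # assigned MF: C10H8O5

O_C  <- n_O / n_C                # O/C ratio
H_C  <- n_H / n_C                # H/C ratio
DBE  <- n_C - n_H / 2 + 1        # CHO-only double bond equivalents
DBEO <- DBE - n_O                # DBE minus O

theor_mass   <- n_C * m_C + n_H * m_H + n_O * m_O  # theoretical neutral mass
neutral_mass <- exp_mass + m_proton                # negative mode: add the proton back
err_ppm <- (neutral_mass - theor_mass) / theor_mass * 1e6  # assumed sign convention
AE_ppm  <- abs(err_ppm)

print(c(O_C = O_C, H_C = H_C, DBE = DBE, DBEO = DBEO, err_ppm = err_ppm))
```

With these inputs the error is roughly 1 ppm, which is inside the default ppm_err of 3.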
The third data frame (None) contains the measured masses that were not assigned with a MF. These can be further analyzed using MFAssignAll().
-There are four plot outputs in the MFAssignCHO() function.
-MSAssign - the mass spectrum of the assigned, unassigned, and isotope peaks shown in different colors (green, red, and blue, respectively).
Error - an error plot with the Exp_mass vs. absolute error for the assigned MF. Unambiguous MF are blue and ambiguous MF are red.
MSgroups - a reconstructed mass spectrum of the assigned peaks colored by their elemental composition (CHO, CHNO, etc.). CHO, CHNO, CHOS, CHNOS, CH, CHN elemental groups are considered, all other molecular groups are classified as “Other”. The plot is faceted to separate the ambiguous and unambiguous MF assignments.
VK - the van Krevelen plot of the assigned MF colored by their elemental composition, similar to the MSgroups plot. The plot is faceted to separate the ambiguous and unambiguous MF assignments.
The following two functions are used for internal mass calibration: MFRecalList() provides qualitative metrics for the selection of possible recalibrant series and MFRecalCheck() performs a mass recalibration using the approach described in Kozhinov et al. (2013).
-RecalList() is a function that takes the output of MFAssign() or MFAssignCHO() and provides metrics to rank the suitability of each homologous series for use as recalibrants. The function selects CHO homologous series with at least three members. The homologous series are evaluated to determine the number of observations in each series (Number Observed), the mass range of each series (Mass Range), the mass of the tallest peak in each series (Tall Peak), and the “Abundance Score”, which shows the percentage difference between the mean abundance of a homologous series and the median abundance within the mass range that the “Tall Peak” of each series falls in (for example m/z 200-300). The output is a data frame with all eligible series present for the user to review. In general, a series with many members and a high abundance score is a good place to start. The goal in this recalibration method is to have recalibrants with high local abundance across the entire range of the spectrum.
-#The input for this function is the output from any of the MFAssign functions.
-RecalList <- RecalList(df = Unambig)
-The output is a data frame with nine columns for user evaluation of the possible recalibrant series.
-Series - the heteroatom class (e.g., “O6”), DBE (e.g., 3), and adduct type (“H” or “Na”) concatenated into a single term (class_Adduct_DBE). This series information can be used to identify homologous series to be used as recalibrants in RecalCheck().
Number Observed - the number of observed masses in each homologous series.
Series Index - a number indicating the length of the series relative to the other identified series, the smaller the number the longer the series.
Mass Range - the mass range from the smallest member of the homologous series to the largest.
Tall Peak - the mass of the most abundant peak in each series.
Abundance Score - the percentage difference between the mean abundance of a homologous series and the median abundance within the mass range the “Tall Peak” falls in.
Peak Score - the intensity of the tallest peak in a given series compared to the second tallest peak in the series. This comparison is calculated as log10(Max Peak Intensity / Second Peak Intensity). Values closer to 0 are preferred.
Peak Distance - the number of CH2 units between the tallest and second tallest peak in each series. Values closer to 1 are preferred.
Series Score - the number of actual observations in each series compared to the theoretical maximum number based on the CH2 homologous series. Values closer to 1 are preferred.
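To make the last three metrics concrete, the following base R sketch computes them for a made-up four-member CH2 series. The masses and abundances are hypothetical, and the Series Score expression (observed members divided by the number of CH2 steps that fit in the mass range, plus one) is an assumption about how the "theoretical maximum" could be counted, not the package's confirmed formula.

```r
ch2 <- 14.01565   # exact mass of a CH2 unit

# Hypothetical homologous series: four members spaced by CH2.
series <- data.frame(
  exp_mass  = 207.0301 + ch2 * 0:3,
  abundance = c(100, 400, 320, 50)
)

ord    <- order(series$abundance, decreasing = TRUE)
tall   <- series[ord[1], ]   # tallest peak
second <- series[ord[2], ]   # second tallest peak

# Peak Score: log10 ratio of the two tallest peaks (closer to 0 is preferred).
peak_score <- log10(tall$abundance / second$abundance)

# Peak Distance: CH2 units between the two tallest peaks (closer to 1 is preferred).
peak_distance <- round(abs(tall$exp_mass - second$exp_mass) / ch2)

# Series Score (assumed form): observed members / possible members in the range.
n_possible   <- round(diff(range(series$exp_mass)) / ch2) + 1
series_score <- nrow(series) / n_possible

print(c(peak_score = peak_score, peak_distance = peak_distance,
        series_score = series_score))
```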
Recal() performs recalibration on the Mono and Iso outputs from the IsoFiltR() function and generates a mass spectrum highlighting the selected recalibrant series. The recalibration is based on the first step of the recalibration method described by Kozhinov et al. (2013), which uses a polynomial central moving average to estimate the weights used to recalibrate the masses. Additionally, the concept of a segmented “walking” recalibration from Savory et al. (2011) is used to remove systematic biases in the calibration. The recalibrated output can then be fed directly into MFAssign() for MF assignment of the recalibrated masses. The function also outputs a data frame containing the recalibrants with their original mass error and the new, recalibrated mass error. To improve the mass recalibration across the studied mass range, Recal() finds additional recalibrants related by H2 or O homologous series using Kendrick mass analysis and then selects the tallest peaks within a user defined mass range. Recal_2() uses only the peaks that are part of the chosen recalibrant series, with no automatic additional peak selection. After the recalibrants are selected, the mass spectrum is split into segments of a user defined width and the recalibrants within each segment are used to recalibrate each section.
-Recalcheck <- Recal(df = Unambig, peaks = Mono, isopeaks = Iso, mode = "neg", SN = 500, mzRange = 50, series1 = "O4_Na_2", series2 = "O4_H_8", series3 = "O6_Na_8")
-
-Plot <- Recalcheck[["Plot"]]
-Plot
-Recal_Mono <- Recalcheck[["Mono"]]
-Recal_Iso <- Recalcheck[["Iso"]]
-List <- Recalcheck[["RecalList"]]
-df - the input data frame in the format of the output from MFAssign() or MFAssignCHO().
peaks - the input data frame of two columns with measured ion mass in the first column and measured ion abundance in the second column; using our recommended sequence, this is the “Mono” output from IsoFiltR().
isopeaks - the input data frame of two columns with the measured ion mass in the first column and measured ion abundance in the second column, typically the “Iso” output from IsoFiltR().
mode - a character string denoting whether the data was collected in negative (“neg”) or positive (“pos”) ion mode.
SN - a numeric value that sets the signal-to-noise threshold for the purposes of the output plot; default is 0.
mzRange - a numeric value that sets the user defined mass segment width; default is 50.
series(1-10) - a character denoting the recalibrant series (e.g., “O6_H_4”); up to 10 recalibrant series may be entered.
min - the minimum mass to be considered; default is 100.
max - the maximum mass to be considered; default is 1000.
bin - the mass window range for recalibrant selection; default is 10.
obs - the number of required recalibrant peaks within each bin; default is 2.
Plot - mass spectrum with recalibrant series highlighted in blue with the rest of the mass spectrum in gray.
Mono - a data frame of the recalibrated monoisotopic ion masses and their abundance, formatted for input to MFAssign().
Iso - a data frame of the recalibrated isotopic ion masses and their abundance, formatted for input to MFAssign().
List - data frame containing the selected recalibrant masses and their assigned MF.
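The segmented idea behind mzRange can be illustrated with a deliberately simplified base R sketch. It is not the Kozhinov et al. (2013) polynomial moving average used by Recal(): here each segment is corrected by the mean ppm error of its recalibrants, which is a stand-in chosen only to show how segments are formed and applied. All masses, the `min_mz` value, and the helper `segment_of()` are hypothetical.

```r
# Hypothetical recalibrants: measured ion mass and known theoretical ion mass.
recal <- data.frame(
  exp_mass   = c(207.0303, 235.0616, 263.0929, 291.1242),
  theor_mass = c(207.0299, 235.0612, 263.0925, 291.1238)
)
recal$err_ppm <- (recal$exp_mass - recal$theor_mass) / recal$theor_mass * 1e6

mzRange <- 50    # segment width, as in the mzRange parameter
min_mz  <- 200   # lower mass bound assumed for this example

# Illustrative helper: index of the segment that a mass falls in.
segment_of <- function(mass) floor((mass - min_mz) / mzRange)
recal$segment <- segment_of(recal$exp_mass)

# Simplified stand-in for the calibration function: mean ppm error per segment.
seg_err <- tapply(recal$err_ppm, recal$segment, mean)

# Hypothetical peaks to be recalibrated with their segment's correction.
peaks <- data.frame(exp_mass = c(221.0459, 277.1085), abundance = c(1200, 800))
peaks$segment    <- segment_of(peaks$exp_mass)
shift_ppm        <- as.numeric(seg_err[as.character(peaks$segment)])
peaks$recal_mass <- peaks$exp_mass / (1 + shift_ppm / 1e6)

print(peaks)
```

Because every recalibrant in this example reads slightly high, the recalibrated masses are shifted slightly lower than the measured ones.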
MFAssign() is the function typically used for the final MF assignment with additional heteroatoms (e.g., N and S). The general parameters and method of MF assignment are the same as MFAssignCHO(), the major difference is that multiple heteroatoms and isotopes can be included. However, an increasing number of chemically reasonable MF are possible with an increasing number of possible elements and increasing molecular weight. For this reason this function uses a multi-path formula extension approach to reduce the number of ambiguous MF assignments. Thus, the final MF list contains unambiguous MF that may have been selected based on formula extensions that are expected in environmental complex mixtures and ambiguous MF. An additional consequence of this increased complexity is that the default DeNovo cut is lowered to 500 from 1000 in order to limit incorrect assignments. Some unassigned masses are also expected to remain; these could be passed through MFAssignAll().
-An example of the usage of MFAssign is shown below.
-Assign <- MFAssign(peaks = Mono, isopeaks = Iso, ionMode = "pos", lowMW =100, highMW = 1000, Nx = 3, Sx = 1, Mx = 1, ppm_err = 3, H_Cmin = 0.3)
-#The parameter settings are fairly typical for positive ion data.
-
-Unambig <- Assign[["Unambig"]] #Unambiguous MF assignments data frame
-
-Ambig <- Assign[["Ambig"]] #Ambiguous MF assignments data frame
-
-Unassigned <- Assign[["None"]] #Unassigned masses data frame
-
-MSAssign <- Assign[["MSAssign"]] #Mass spectrum showing assigned, unassigned, and isotope peaks. Assigned peaks are in green, unassigned peaks are in red, and isotope peaks are in blue.
-
-Error <- Assign[["Error"]] #Error plot with m/z vs. Absolute Error (ppm) colored to indicate unambiguous MF (blue) and ambiguous MF (red).
-
-MSgroups <- Assign[["MSgroups"]] #Reconstructed mass spectrum showing the assigned peaks colored to indicate the elemental group. The plot is faceted by the ambiguity of the MF assignments.
-
-VK <- Assign[["VK"]] #van Krevelen plot colored to indicate the elemental group and faceted by the ambiguity of the MF assignments.
-
-MSAssign #Print MSAssign
-Error #Print Error
-MSgroups #Print MSgroups
-VK #Print VK
-Many of the input parameters are common between MFAssign() and MFAssignCHO(), so only the new parameters are defined here.
-Nx - the maximum number of 14N atoms.
Sx - the maximum number of 32S atoms.
Px - the maximum number of 31P atoms.
S34x - the maximum number of 34S atoms.
N15x - the maximum number of 15N atoms.
Dx - the maximum number of 2H atoms.
Ex - the maximum number of 13C atoms.
Clx - the maximum number of 35Cl atoms.
Cl37x - the maximum number of 37Cl atoms.
Fx - the maximum number of 19F atoms.
--Note that an increased number of heteroatoms or adducts will decrease the speed of the function. This is especially true if more than one type of heteroatom is allowed.
-
The output of this function is the same as for MFAssignCHO(); see the description in MFAssignCHO() Output above.
-MFAssignAll() can be used to assign isolated or previously unassigned masses from the other MFAssign functions. This version does not have a mass cut point and only performs 2 rudimentary formula extensions (CH2 and H2O only), which causes it to run somewhat slower than the other versions of the function, especially on large mass lists. Thus, this version is recommended to be run on limited mass lists, especially when trying to assign isolated masses. The lack of advanced formula extension and mass cut point yields many more ambiguous MF assignments, so extra care is required when interpreting the results. To speed up MF assignment, MFAssignAll_MSMS() is reduced even further and does not take any homologous series into consideration; it is useful for small data sets that may not have CH2 or H2O homologous series relationships between peaks. It was intended for MS/MS data. All of its parameters are the same as for MFAssignAll(), so the description of MFAssignAll() is equally applicable to MFAssignAll_MSMS().
-An example of the usage of MFAssignAll is shown below.
-Assign <- MFAssignAll(peaks = Mono, isopeaks = Iso, ionMode = "pos", lowMW =100, highMW = 1000, Nx = 3, Sx = 1, Mx = 1, ppm_err = 3, H_Cmin = 0.3)
-#The parameter settings are fairly typical for positive ion data.
-
-Unambig <- Assign[["Unambig"]] #Unambiguous MF assignments data frame
-
-Ambig <- Assign[["Ambig"]] #Ambiguous MF assignments data frame
-
-Unassigned <- Assign[["None"]] #Unassigned peaks data frame
-
-MSAssign <- Assign[["MSAssign"]] #Mass spectrum showing assigned, unassigned, and isotope peaks. Assigned peaks are in green, unassigned peaks are in red, and isotope peaks are in blue.
-
-Error <- Assign[["Error"]] #Error plot with m/z vs. Absolute Error (ppm) and colored to indicate the unambiguous MF (blue) and ambiguous MF (red).
-
-MSgroups <- Assign[["MSgroups"]] #Reconstructed mass spectrum showing the assigned peaks colored to indicate the elemental group. The plot is faceted by ambiguity of the MF assignments.
-
-VK <- Assign[["VK"]] #van Krevelen plot colored to indicate the elemental group and faceted by the ambiguity of the MF assignments.
-
-MSAssign #Print MSAssign
-Error #Print Error
-MSgroups #Print MSgroups
-VK #Print VK
-Many of the input parameters are common between MFAssignAll() and MFAssign(); the only difference is that MFAssignAll() does not have the mcut and nLoop input parameters. The input parameters are described in the MFAssign() and MFAssignCHO() function sections.
-The output of this function is the same as for MFAssignCHO() and MFAssign(); see the description in MFAssignCHO() Output above.
-Although many well-vetted methods of decreasing the number of MF assigned to each mass have been employed in the MFAssign functions, it is important to take time to assess the correctness of the assigned MF based on realistic expectations for the specific data being analyzed.
-It is important to ensure that each input parameter is set properly for the analysis being performed. There are warning messages that will be displayed if the parameters are set outside the typical range (or include a typo), but these warnings will not stop the function from running. If too many heteroatoms are included, the runtime of the function will be greatly increased, and the results may be unreasonable. On the other hand, if too few or incorrect heteroatoms are included, the results may be incorrect.
-If at any time MFAssign(), MFAssignCHO(), MFAssignAll(), or MFAssignAll_MSMS() need to be stopped before they have completed, it is recommended to run the following line of code “.rs.restartR()”. This is required to clear the memory, so that the functions can be run at a reasonable speed in future attempts. This line of code is included within the MFAssign functions and so R will automatically restart as necessary to ensure continued good performance.
-Whenever possible, limit the number of masses in the input data frame to as small a number as possible, since the speed of the function decreases significantly with an increasing number of masses. The performance is reasonably good (about 60 sec or less) up to 10,000 masses with a moderate number of elements (3-5). However, both a higher number of masses being evaluated and higher maximum numbers of heteroatoms decrease the function speed, and the runtime can become unreasonable depending on the computer (e.g., >30 minutes).
-Setting a reasonable signal-to-noise (SN) threshold can be very important with regard to function speed. Thus, prior to running a raw mass list, you should attempt to estimate the SN or set it to a reasonable value. This can be done with HistNoise(), KMDNoise(), and/or SNplot(). Otherwise the function will try to assign MF to every mass peak (including the noise), which slows down the function and increases the likelihood of incorrect assignments.
-The two input columns can have any name, but it is very important that the measured ion mass is in the first column and the measured abundance (or ion intensity) is in the second column. The function only attempts to assign MF to the first column. It is also important to put only ion masses into the function, using neutral masses will not work.
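A short base R sketch of preparing an input data frame in the required column order is given here. The raw data frame, its column names, and the helper `check_input()` are hypothetical; the check only covers the basic layout described above (two numeric columns, mass first) and is not a function from the package.

```r
# Hypothetical raw export with the columns in the wrong order.
raw <- data.frame(
  intensity = c(1200, 800, 450),
  mz        = c(207.0301, 235.0251, 321.0620)
)

# Reorder so that ion mass is column 1 and abundance is column 2.
peaks <- raw[, c("mz", "intensity")]

# Illustrative helper: basic layout check before passing data to an assignment function.
check_input <- function(df) {
  ncol(df) == 2 &&
    all(vapply(df, is.numeric, logical(1))) &&
    all(df[[1]] > 0)
}

check_input(peaks)
```

The column names themselves do not matter; only the positions do.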
-As noted in the opening paragraph of this vignette, this package contains MFAssign_RMD() and MFAssignCHO_RMD() functions that were designed to be run within an R Markdown file. The functions are identical to the corresponding non-“RMD” versions (MFAssign() and MFAssignCHO(), respectively) with one exception: they do not include the command “.rs.restartR()”. The restart command line is used to clear the working memory to prevent degraded performance with sequential/repetitive analyses. Similarly, R Markdown automatically includes a restart to clear the working memory after a document is rendered.
-A recommended data analysis practice is to use R Markdown documents to serve as a record of the actual data files that serve as inputs and the parameters used in the individual functions. Thus, we have started using them with MFAssignR functions for record keeping and learned that we can semi-automatically process similar data sets using a short .R script that performs a loop over a selected set of data files.
-MFAssign_RMD() and MFAssignCHO_RMD() should only be used within an R Markdown document and should only be called once each within the document; otherwise the runtime slows down as described above. These RMD functions are particularly useful when they are used in a looped R Markdown document, which will allow the user to semi-automate the MF assignment and improve data throughput. Care must be taken when using these scripts to ensure the correct parameters are set. An R Markdown template with an entire MFAssignR pipeline designed for R Markdown is included as an additional file in this repository. The descriptions for MFAssign() and MFAssignCHO() are applicable to MFAssign_RMD() and MFAssignCHO_RMD(); thus, they will not be repeated here. As noted previously, the only difference between these functions is that the RMD versions do not have an “.rs.restartR()” command built into them, which allows them to run within an R Markdown document.
-Located in the GitHub repository for this package are an R script and an R Markdown document that can be used to semi-automate MF assignment over a set of data input files. The files can be edited based on the requirements of the user, but the basic template will generate a report showing the outputs of the various functions and the .csv outputs of the unambiguous, ambiguous, and unassigned mass lists, along with a list of the recalibrant ions. As noted, these scripts are best used when you have several data sets that are expected to have the same function parameters applied to them. For example, a set of sample replicates each collected using identical instrument settings could be semi-automated using the same function parameters. However, unrelated samples with very different MF compositions would be less favorable because the recalibrant ions may differ. As previously described the recalibrant ions are one of the parameters that must be manually determined before running the MF assignment scripts.
-To use the MF assignment looping scripts you must put all of the data sets into a single folder. These data sets must all have a consistent name extension; the default is “_MS”, which should be placed at the end of your file name, for example “YourData_MS.csv”. Place the “reader” script “Formula Assignment Reader.R” and the markdown template “Formula Assignment Markdown.Rmd” within the folder with the data to be analyzed. Then, add to this folder an empty subfolder called “Assigned Formulas” so that the output .csv files can be saved there. Next, check the reader script to ensure that the working directory matches the location where your data is stored. No other changes are required to make this reader script work, apart from ensuring that the name of the R Markdown file being rendered is the same as the one in your folder (the default name is mentioned above).
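The folder layout described above can be sketched in base R as follows. This is not the contents of “Formula Assignment Reader.R”; it is a hedged illustration that builds a throwaway folder in tempdir(), finds the files ending in “_MS.csv”, and loops over them. The sample file names and report names are made up, and the commented rmarkdown::render() call assumes the template declares a parameter named `data`, which is an assumption.

```r
# Build a throwaway demo folder that mimics the expected layout.
data_dir <- file.path(tempdir(), "mfassignr_demo")
dir.create(file.path(data_dir, "Assigned Formulas"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(
  data_dir, c("SampleA_MS.csv", "SampleB_MS.csv", "notes.csv")
)))

# Select only the data sets carrying the "_MS" name extension.
ms_files <- sort(list.files(data_dir, pattern = "_MS\\.csv$"))

for (f in ms_files) {
  out_name <- sub("_MS\\.csv$", "_report.html", f)
  # In a real run, the template would be rendered once per file, for example:
  # rmarkdown::render("Formula Assignment Markdown.Rmd",
  #                   output_file = out_name,
  #                   params = list(data = f))  # 'data' is an assumed parameter name
  message("Would render ", f, " -> ", out_name)
}
```

Files that lack the “_MS” extension (here, notes.csv) are ignored by the pattern.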
-For the R Markdown template there are a few changes that will need to be considered before assigning MF. These include the function parameters for heteroatoms, ionization mode, signal-to-noise threshold, and recalibrant series. The changes for heteroatoms and ionization mode can be selected based on what type of data you are analyzing and the expected heteroatoms. In the default template the noise is estimated by the KMDNoise() function and is designed so that the only thing that needs to be changed is the multiplier value (the default is 6). This value can be changed at the top of the document and will be applied to the rest of the document. The determination of the recalibrant series requires the user to take one representative sample and experimentally determine the best recalibrant series to use. These recalibrant series can then be put in the Recal() function and will be applied to all the data sets within the folder.
-After all of these changes are made, you can run the “reader” script (all of the lines) and it will begin assigning MF to each of the datasets in your folder. This can take a significant amount of time, depending on the number of files. On a relatively new Dell XPS 15 laptop with a 7th generation i7 processor with 16 GB of RAM, one sample takes about 4 minutes to be processed, depending on the size of the mass list and the number of heteroatoms being assigned.
-The package functions have a significant reliance on dplyr and tidyr functions for some of the data manipulations. Plots are generated using the ggplot2 package, and the colorRamps package provides the color scheme for the output of KMDNoise().
-There are multiple warning messages that are reported when the functions are being run. Unless the function stops working, these warning messages do not otherwise impact the functioning of the code.
-There are a variety of additional functions included in this package, but many of them are sub-functions that are necessary for the MFAssign functions and are not independently operational. Only the independent functions have been described in this document.
-There is a large mass list built into the package called CHNOS_ML_Ex, which can be used to test whether the function is working correctly. It is a negative mode mass list with even electron ions generated by electrospray ionization. When a maximum of 3 nitrogen and 1 sulfur are allowed, MFAssign() assigns 2116 of 2121 total masses using DeNovo = 500. Additionally, there is a smaller data frame (Short_CHO_neg) of 13 observations randomly sampled from CHNOS_ML_Ex, which is more effective for checking whether or not MF are correct. There is also an 8 observation data frame (Short_CHO_pos) of positive even electron ions generated by electrospray ionization, which can be used to check the MFAssign parameters for positive ions. All functions should be able to assign all the masses in these short example data sets if DeNovo = 1000. Additionally, a raw mass list containing negative even electron ion data is included and can be used to test the Noise, SNplot, IsoFiltR, Recal and MFAssign functions. Its name is Raw_Neg_ML.
-This is the short data frame of negative ion data for testing purposes. The masses and correctly assigned unambiguous MF are shown below.
-Exp_mass | formula
---|---
531.2092 | C24H36O13
235.0251 | C11H8O6
563.1992 | C24H36O15
331.1767 | C16H28O7
391.0676 | C18H16O10
403.0524 | C15H16O13
321.0620 | C15H14O8
363.1091 | C18H20O8
683.2931 | C33H48O15
207.0301 | C10H8O5
523.1102 | C23H24O14
437.1460 | C21H26O10
487.1465 | C21H28O13
This is the short data frame of positive ion data for testing purposes. The masses and correctly assigned unambiguous MF are shown below.
-Exp_mass | formula
---|---
415.1235 | C18H22O11
325.2162 | C22H28O2
271.0812 | C12H14O7
265.0859 | C17H12O3
195.0652 | C10H10O4
303.0863 | C16H14O6
271.1176 | C13H18O6
267.1591 | C15H22O4
The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. A number of tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous MF assignments are provided. This package contains MFAssign(), MFAssign_RMD(), MFAssignCHO(), MFAssignCHO_RMD(), SNplot(), HistNoise(), KMDNoise(), RecalList(), Recal(), Recal2(), RecalX(), Recal_2X(), and IsoFiltR() described in the sections below. Note, the functions with “RMD” were designed to be run within an R Markdown file and are otherwise identical to the corresponding non-”RMD” versions. To learn more, please see the section titled “Semi-Automated MFAssignR Functions”. User caution with the function parameter settings and output evaluation is required; thus, several function outputs are provided to assist the user with these evaluations.
+The functions in the MFAssignR package were developed by adapting methods and algorithms from the peer reviewed literature. The following references are referred to in this document:
+Green, N. W. and Perdue, E. M.: Fast Graphically Inspired Algorithm for Assignment of Molecular Formulae in Ultrahigh Resolution Mass Spectrometry, Anal Chem, 87(10), 5086–5094, doi:10.1021/ac504166t, 2015.
+Gross, J. H.: Mass Spectrometry, doi:10.1007/978-3-319-54398-7, 2017.
+Herzsprung, P., Hertkorn, N., Tümpling, W. von, Harir, M., Friese, K. and Schmitt-Kopplin, P.: Understanding molecular formula assignment of Fourier transform ion cyclotron resonance mass spectrometry data of natural organic matter from a chemical point of view, Anal Bioanal Chem, 406(30), 7977–7987, doi:10.1007/s00216-014-8249-y, 2014.
+Koch, B. P., Dittmar, T., Witt, M. and Kattner, G.: Fundamentals of Molecular Formula Assignment to Ultrahigh Resolution Mass Data of Natural Organic Matter, Anal Chem, 79(4), 1758–1763, doi:10.1021/ac061949s, 2007.
+Kozhinov, A. N., Zhurov, K. O. and Tsybin, Y. O.: Iterative Method for Mass Spectra Recalibration via Empirical Estimation of the Mass Calibration Function for Fourier Transform Mass Spectrometry-Based Petroleomics, Anal Chem, 85(13), 6437–6445, doi:10.1021/ac400972y, 2013.
+Kujawinski, E. B. and Behn, M. D.: Automated Analysis of Electrospray Ionization Fourier Transform Ion Cyclotron Resonance Mass Spectra of Natural Organic Matter, Anal Chem, 78(13), 4363–4373, doi:10.1021/ac0600306, 2006.
+Lobodin, V. V., Marshall, A. G. and Hsu, C. S.: Compositional Space Boundaries for Organic Compounds, Anal Chem, 84(7), 3410–3416, doi:10.1021/ac300244f, 2012.
+Ohno, T. and Ohno, P. E.: Influence of heteroatom pre-selection on the molecular formula assignment of soil organic matter components determined by ultrahigh resolution mass spectrometry, Anal Bioanal Chem, 405(10), 3299–3306, doi:10.1007/s00216-013-6734-3, 2013.
+Perdue, E. M. and Green, N. W.: Isobaric Molecular Formulae of C, H, and O: A View from the Negative Quadrants of van Krevelen Space, Anal Chem, 87(10), 5079–5085, doi:10.1021/ac504165k, 2015.
+Savory, J. J., Kaiser, N. K., McKenna, A. M., Xian, F., Blakney, G. T., Rodgers, R. P., Hendrickson, C. L. and Marshall, A. G.: Parts-Per-Billion Fourier Transform Ion Cyclotron Resonance Mass Measurement Accuracy with a “Walking” Calibration Equation, Anal Chem, 83, 1732–1736, doi:10.1021/ac102943z, 2011.
+Zheng, Q., Morimoto, M., Sato, H. and Fouquet, T.: Resolution-enhanced Kendrick mass defect plots for the data processing of mass spectra from wood and coal hydrothermal extracts, Fuel, 235, 944–953, doi:10.1016/j.fuel.2018.08.085, 2019.
+Zhurov, K. O., Kozhinov, A. N., Fornelli, L. and Tsybin, Y. O.: Distinguishing Analyte from Noise Components in Mass Spectra of Complex Samples: Where to Cut the Noise, Anal Chem, 86(7), 3308–3316, doi:10.1021/ac403278t, 2014.
+The MF assignment algorithm in MFAssign was adapted from the low mass moiety CHOFIT assignment algorithm developed by Green and Perdue (2015). In total there are 4 versions of MFAssign: MFAssign(), MFAssignCHO(), MFAssignAll(), and MFAssignAll_MSMS(). MFAssign() includes external nested loops to assign additional heteroatoms, as described in Green and Perdue (2015), while MFAssignCHO() does not. Briefly, the CHOFIT algorithm uses low mass moieties such as CH4O-1 and C4O-3 to move around in the O/C and H/C space to assign MF with C, H, and O (CHO MF). These low mass moieties efficiently assign CHO MF without conventional loops. Additional combinatorial assignments with various heteroatoms are made using nested loops that subtract the mass of a heteroatom from the measured ion mass, creating a CHO “core” mass, which can then be assigned using the low mass moiety CHOFIT approach. This is further explained in Green and Perdue (2015) and Perdue and Green (2015).
+Using the low mass moiety and combinatorial assignment approach, MFAssign() can be used to assign MF with 12C, 1H, and 16O and a variety of heteroatoms and isotopes, including 2H, 13C, 14N, 15N, 31P, 32S, 34S, 35Cl, 37Cl, and 19F. It can also assign Na+ adducts, which are common in positive ion mode. Because the number of chemically reasonable MF grows with the number of possible elements and with molecular weight, the output provides separate lists of ambiguous and unambiguous MF.
+Advanced Kendrick mass and z* sorting tools are used to reduce the number of ambiguous MF in MFAssign(). First, Kendrick mass defect (KMD) and z* values are calculated with a CH2 Kendrick base to sort the measured masses into CH2 homologous series (Stenson et al., 2003). The function then selects 1 to 3 members of each CH2 homologous series with masses below the user defined cutoff and attempts to assign MF. The ambiguous MF are returned to the unassigned list, and the unambiguous MF are used as seeds for additional assignments using CH2, O, H2, H2O, and CH2O MF extensions (Kujawinski and Behn, 2006). To do the formula extensions, the KMD and z* values for each of these bases are calculated and then used to assign MF through the addition or subtraction of the series bases. MFAssign() (and MFAssignCHO()) tracks how many different “paths” can be used to assign each MF, and if a single mass has multiple MF, the function will choose the MF that has the largest number of paths that intersect with it. For example, if a single mass has two possible MF and one has 20 potential “paths” to it, while the other has 4, the function will choose the MF with 20 paths. Work is ongoing to track these paths and the removed MF in the data frame output of these functions. Overall, the multi-path MF extension approach greatly reduces the number of ambiguous assignments and provides an increased level of confidence in the final MF list because the MF are related to unambiguous MF assigned below the user defined cut point. An additional step to decrease the number of ambiguous and/or incorrect sulfur assignments was also added. This step requires that, for a sulfur containing compound to act as a seed, it must be unambiguous and have a matching 34S peak when both the monoisotopic and isotopic mass lists from the IsoFiltR() function are assigned MF. This has been implemented for all versions of the MFAssign functions as the “SulfCheck” parameter, which can be turned “on” or “off”.
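The path-based choice described above can be illustrated with a minimal sketch. It is not the package's internal code: the data frame, the `paths` column, the candidate labels, and the helper `keep_most_paths()` are all hypothetical and only show the idea of keeping, for each measured mass, the candidate MF reached by the largest number of extension paths.

```r
# Hypothetical candidate table: the first mass has two competing MF.
# Masses, labels, and path counts are invented for illustration.
candidates <- data.frame(
  exp_mass = c(301.0718, 301.0718, 315.0874),
  formula  = c("candidate_A", "candidate_B", "candidate_C"),
  paths    = c(20, 4, 11),
  stringsAsFactors = FALSE
)

# For each measured mass, keep the candidate with the most paths.
keep_most_paths <- function(df) {
  best <- ave(df$paths, df$exp_mass, FUN = max)  # per-mass maximum path count
  df[df$paths == best, , drop = FALSE]
}

chosen <- keep_most_paths(candidates)
print(chosen)
```

In this sketch a tie in path counts would keep both candidates, which is one possible way of leaving such a mass ambiguous.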
+To allow for more ambiguity in the formula assignment, the “Ambig” parameter can be turned “on” or “off”. Turning it on disables the path choosing step described above, so that more assignments are kept for each mass. Additionally, the “MSMS” parameter can help to assign molecular formulas in a data set that is not very continuous with respect to homologous series, such as MS/MS data. It removes the pre-filtering of masses below the DeNovo threshold, meaning that all masses below that point are assigned directly. This causes the function to run somewhat slower, but can help to get better assignments. These parameters replace the MFAssignAll() and MFAssignAll_MSMS() functions from previous versions (<= v.0.0.3).
+MFAssignCHO() is a simplified version of MFAssign() used only to assign MF with CHO elements. MFAssignCHO() runs faster than MFAssign() and is best used as a preliminary MF assignment step prior to the selection of recalibrant ions in conjunction with RecalList() and Recal(), which are described below.
+Below is a table highlighting the features of the various forms of MFAssign, which were described above.
+Features | MFAssign | MFAssignCHO | MFAssign_RMD | MFAssignCHO_RMD
---|---|---|---|---
+Heteroatoms | X | | X |
+Odd Electron and Sodium Adduct Assignment | X | X | X | X
+CH2 KMD Pre-Filtering | X | X | X | X
+DeNovo cut | X | X | X | X
+Advanced Formula Extension | X | X | X | X
+Standard QA Parameters | X | X | X | X
+Advanced QA Parameters | X | X | X | X
+Increased Unambiguity | X | X | X | X
+Full Ambiguity | X* | X* | X* | X*
+Isotope Matching | X | X | X | X
+Automatic .rs.Restart() | X | X | |
+Compatible with Rmarkdown | | | X | X
+Improved Performance for MS/MS Data | X^ | X^ | X^ | X^
“*” Available using the “Ambig” option
+“^” Available using the “MSMS” option
+Simple Formula Extension uses only CH2 and H2O formula extensions.
Advanced Formula Extension uses the combination of CH2, H2O, H2, O, and CH2O formula extensions with multiple iterations to improve assignment.
Standard QA includes the user-defined O/C, H/C, DBE, and error limits in addition to the non-adjustable Senior Rules, Nitrogen Rule, Rule of 13, and Maximum H Rule.
Advanced QA includes a sulfur isotope check, the heteroatom cut (HetCut), and the nominal mass series cut (NMScut); HetCut and NMScut can be turned on or off externally.
The IsoFiltR() function can identify many of the 13C and 34S isotope masses, which when removed from the mass list can lower the number of peaks assigned with an incorrect MF. This function operates on a two column data frame using the same structure as the MFAssign() function.
+IsoFiltR() identifies potential isotope masses using a four-step identification method.
+First, the mass list is transformed to identify mass difference pairs appropriate for the element being investigated (delta mass for C (1.003355) or S (1.995797), with +/- 5 ppm mass error). Only the pairs that meet this criterion move on to step 2.
Using the mass difference between 12C/13C (1.003355) or 32S/34S (1.995797), the KMD value can be calculated for a specific isotope. This means that the 12C (32S) monoisotopic peak will be in a KMD homologous series with its matching 13C (34S) isotopic peak, analogous to homologous series of CH2. If the KMD values are equivalent for the candidate pair, the peaks can be considered to be in a series and the pair will move on to the third step. The equations for 13C are: KM = 1/1.003355 * m/z, KMD = nominal mass - KM. Replacing 1/1.003355 with 2/1.995797 makes this work for 34S.
Isotopic pairs are separated using a “Resolution Enhanced KMD” approach adapted from Zheng et al. (2019). Resolution enhanced KMD values are calculated by dividing the mass of some homologous series base (in this case CH2) by an integer that was experimentally determined to accomplish the desired separation. This value is then used in the typical KM and KMD calculation in order to calculate the “resolution enhanced” KMD. As an example, BaseMass_adj = 14.01565 / 21 can be considered the integer divided base mass, which is then used in the KM calculation: KMr = (round(BaseMass_adj) / BaseMass_adj) * m/z, followed by KMDr = round(KMr) - KMr to calculate the resolution enhanced KMD. For 13C the integer for this calculation is 21, while for 34S it is 12. After this calculation, peaks that are 12/13 C or 32/34 S pairs will have one of two characteristic KMDr differences, which can be used to select the pairs that are most likely to be isotope pairs. The KMDr difference is calculated by subtracting the KMDr value of the suspected isotope mass from the KMDr of the suspected monoisotopic mass. The values are -0.291 and 0.709 for 32/34 S and -0.496 and 0.503 for 12/13 C. If the difference for the candidate pair matches one of these values, the pair moves on to step four.
The fourth step uses abundance ratios to constrain the remaining isotope pairs to ensure that the isotope peaks are not too large or too small relative to the intensity of the monoisotopic peak. The limits on this are loose due to the variation of isotope abundance with analyte signal (similar to isotope dilution) as observed in ultrahigh resolution Orbitrap and FT-ICR measurements.
The candidate pairs that make it through these four steps are put into two data frames, Mono and Iso, which contain the monoisotopic and isotopic peaks respectively. Then all peaks that were not flagged as possible mono/iso pairs are added to the Mono output data frame. In complex mixtures, some peaks can be flagged as both monoisotopic and isotopic. In these cases, the masses are included in both outputs and are classified as either monoisotopic or isotopic after the MF assignment.
+When the two data frame outputs from IsoFiltR() are put into MFAssign(), the function will match the assigned monoisotopic masses to their corresponding isotopic masses. Additional work would be needed to use the isotopes to reduce ambiguous MF assignments assigned to a single mass. Thus IsoFiltR() should not be considered as definitive proof of the presence or absence of 13C or 34S in MF, but it does allow MF to be assigned with these expected naturally occurring isotopes and limits the chances that isotopic peaks are incorrectly assigned a monoisotopic MF.
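The mass difference check (step 1) and the resolution enhanced KMD difference (step 3) described above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the IsoFiltR() source: the helper names are hypothetical, the example m/z value is invented, and the 5 ppm tolerance is assumed here to be taken relative to the monoisotopic mass.

```r
delta_13C <- 1.003355  # 12C/13C mass difference from the text
delta_34S <- 1.995797  # 32S/34S mass difference from the text

# Step 1 (assumed form): does the pair match the isotope delta within ppm?
is_delta_match <- function(mono, iso, delta, ppm = 5) {
  abs((iso - mono) - delta) / mono * 1e6 <= ppm
}

# Step 3: resolution enhanced KMD using CH2 divided by an integer.
kmd_enhanced <- function(mz, divisor, base_mass = 14.01565) {
  base_adj <- base_mass / divisor
  km_r <- (round(base_adj) / base_adj) * mz
  round(km_r) - km_r
}

# KMDr of the suspected monoisotopic mass minus KMDr of the suspected isotope.
kmdr_diff <- function(mono, iso, divisor) {
  kmd_enhanced(mono, divisor) - kmd_enhanced(iso, divisor)
}

mono <- 301.0718                                 # hypothetical monoisotopic m/z
d_c <- kmdr_diff(mono, mono + delta_13C, 21)     # 13C pair, integer 21
d_s <- kmdr_diff(mono, mono + delta_34S, 12)     # 34S pair, integer 12
print(c(d_c, d_s))
```

For an exact isotope pair the difference lands on one of the two values quoted in step 3; which of the two depends on how the rounding falls for that particular mass.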
+MFAssign() includes a number of quality assurance (QA) steps to check the assigned MF for chemical reasonability. Relatively lenient default settings are provided to avoid removing chemically reasonable ambiguous MF assignments. Many of these parameters are customizable, including DBE-O limits (Herzsprung et al., 2014), O/C ratio limits, H/C ratio limits, and the minimum number of O. The HetCut parameter can be used to select the MF with the lowest number of heteroatoms, if more than one MF is assigned to a single mass (Ohno and Ohno, 2013). The NMScut parameter identifies the CH4 vs O exchange series in each nominal mass as described in Koch et al. (2007), which can be used to limit ambiguous assignments. Additional non-adjustable QA parameters are used in all of the MFAssign functions, including the nitrogen rule, the large atom rule, the maximum number of H rule, the maximum DBE rule (Lobodin et al., 2012), and the Senior rules (Kind et al. 2007).
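As a rough illustration of the adjustable limits, the sketch below filters a small table of CHO formulas by O/C, H/C, and DBE-O. It is not the package's QA code: `qa_filter()` is a hypothetical helper, the limit values are the defaults listed for the corresponding parameters later in this document, the DBE expression is the standard one for CHO-only formulas, and the second and third example formulas are deliberately unreasonable.

```r
# Hypothetical QA filter for CHO formulas using the documented default limits.
qa_filter <- function(mf, O_Cmin = 0, O_Cmax = 2.5, H_Cmin = 0.3, H_Cmax = 3,
                      DBEOmin = -13, DBEOmax = 13) {
  mf$O_C  <- mf$O / mf$C
  mf$H_C  <- mf$H / mf$C
  mf$DBE  <- mf$C - mf$H / 2 + 1   # DBE for a formula containing only C, H, O
  mf$DBEO <- mf$DBE - mf$O
  keep <- mf$O_C >= O_Cmin & mf$O_C <= O_Cmax &
    mf$H_C >= H_Cmin & mf$H_C <= H_Cmax &
    mf$DBEO >= DBEOmin & mf$DBEO <= DBEOmax
  mf[keep, , drop = FALSE]
}

mf <- data.frame(
  formula = c("C16H14O6", "C10H2O12", "C20H62O2"),
  C = c(16, 10, 20), H = c(14, 2, 62), O = c(6, 12, 2),
  stringsAsFactors = FALSE
)

passed <- qa_filter(mf)
print(passed)
```

Only the first formula survives: the second fails the minimum H/C limit and the third exceeds the maximum H/C limit.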
+Noise level assessment can be accomplished using either the HistNoise() or KMDNoise() function in conjunction with the SNplot() function. The HistNoise() method is based on the method developed by Zhurov et al. (2014), and KMDNoise() is a new custom method based on our observations of raw data Kendrick mass defect analysis.
+The Zhurov et al. (2014) method uses a histogram distribution of the natural log intensities in the measured raw mass spectrum to determine the point where noise peaks give way to analyte signal. The HistNoise() function attempts to identify this point and reports the noise level so that the signal-to-noise cut point can be determined. The cut point is shown in an output plot using red and blue colors, where red indicates noise. The cut point can also be set manually, if the function does not predict a reasonable noise level. We have observed this function to be confounded by distributions that do not match the theoretical distribution, making it difficult or impossible for the function to identify the correct noise cut point. For this reason, we developed the KMDNoise() function described below.
+The KMDNoise() method is based on the observation that the CH2 based KMD values of noise peaks and analyte peaks are naturally separated in a KMD plot, allowing the function to select a region with only noise peaks and use the average intensity of these peaks to estimate the noise. We refer to this as the KMD slice method. In principle, this is similar to what was briefly described in Riedel and Dittmar (2014), but instead of using a static range of normal mass defects (0.3-0.9), our method uses a mass dependent KMD region, which avoids potentially doubly charged peaks with a mass defect of ~0.5, which would be considered as noise in the Riedel and Dittmar method.
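The KMD slice idea can be sketched as follows. This is a simplified illustration, not the KMDNoise() source: the function name `kmd_slice_noise()` is hypothetical, the KMD convention (rounded mass minus Kendrick mass) and the slope of the slice boundaries (0.0011232) are assumptions made for the sketch, and the four peaks are invented. Only the `upper` and `lower` intercepts and their defaults come from this document.

```r
# Hypothetical KMD slice noise estimate; first column mass, second abundance.
kmd_slice_noise <- function(df, upper = 0.2, lower = 0.05, slope = 0.0011232) {
  names(df)[1:2] <- c("mass", "abundance")
  km  <- df$mass * (14 / 14.01565)   # CH2-based Kendrick mass
  kmd <- round(df$mass) - km         # assumed KMD convention
  # Mass dependent slice: two parallel lines with intercepts lower and upper.
  in_slice <- kmd > slope * df$mass + lower & kmd < slope * df$mass + upper
  mean(df$abundance[in_slice])       # average intensity of the noise-only region
}

# Invented peaks: two fall in the slice (noise), two are analyte-like.
spec <- data.frame(
  mass      = c(400.90, 401.0929, 500.88, 501.1500),
  abundance = c(100, 5000, 140, 8000)
)

noise <- kmd_slice_noise(spec)
print(noise)
```

Because the slice boundaries rise with mass, the selected region follows the gap between analyte bands across the spectrum instead of using one fixed mass defect range.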
+At least one of these noise estimation functions should be run on the mass list prior to MF assignment with MFAssign() or isotope filtering with IsoFiltR(). Setting a reasonable S/N cut point greatly increases the speed of the functions and improves the output quality.
+The SNplot() function is used to show the mass spectrum with the masses below and above the cut point denoted using the same color scheme as in the histogram plots from either HistNoise() or KMDNoise().
+RecalList(), Recal(), and Recal_2() are functions pertaining to the internal mass recalibration method adapted from Kozhinov et al. (2013) and Savory et al. (2011) using a polynomial central moving average to estimate the weights used to recalibrate the masses (Kozhinov et al., 2013) applied to spectral segments (Savory et al., 2011). The function RecalList() can be used with the output of MFAssign() or MFAssignCHO() to generate a data frame containing potential recalibrant CH2 homologous series. A variety of metrics are included in the output of this function to aid the user in picking suitable recalibrant series; these are described in greater detail in the example of RecalList() below. The user can select up to 10 homologous series as inputs for the mass recalibration with Recal() and Recal_2(). Recal() uses H2 and O KMD and z* series to identify additional MF that are related to the user selected recalibrants. In contrast, Recal_2() does not use those series to expand the pool of potential recalibrants, using only the peaks that correspond to the homologous series chosen as recalibrants. Other than this difference, Recal() and Recal_2() work exactly the same. To avoid recalibration problems associated with too many recalibrant masses, the function uses a user-defined number of tallest peaks within a user-defined mass range “bin”. For example, if the bin width is set at 20 and the number of peaks is set at 2, the function will select the two tallest peaks within each 20 m/z window across the range of the spectrum. Additionally, when the monoisotopic peak chosen as a recalibrant has an identified 13C peak, that isotopic peak will also be added to the pool of recalibrants being used. After the recalibrants have been selected, they are split into mass windows of a user defined width (default is 50 m/z) and used to calculate the correction term according to the adapted form of the Kozhinov et al. method.
This will provide a different mass correction term for each mass window in the spectrum. Then the raw mass list(s) that are being recalibrated are split into the same mass windows, and the correction term that is associated with each window is used to correct the masses in that window, thus recalibrating the full spectrum section by section. In addition to the output of recalibrated mass lists the function also generates a plot that shows the recalibration peaks that were used in context with the overall mass spectrum, and produces an output data frame containing the mass, abundance, formula, and error for the recalibrants that were used.
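The “tallest peaks per bin” selection described above can be sketched with base R. This is only an illustration of the selection rule, not the Recal() implementation: `pick_tallest()` is a hypothetical helper, the peak list is invented, and the sketch assumes that the windows start at multiples of the bin width.

```r
# Keep the n most abundant peaks within each mass window of width bin.
pick_tallest <- function(df, bin = 20, n = 2) {
  window <- floor(df$mass / bin)            # window index for each peak
  by_window <- split(df, window)
  top <- lapply(by_window, function(w) head(w[order(-w$abundance), ], n))
  out <- do.call(rbind, top)
  rownames(out) <- NULL
  out
}

# Invented candidate recalibrant peaks spanning two 20 m/z windows.
peaks <- data.frame(
  mass      = c(201.1, 205.3, 214.9, 223.0, 231.7, 238.2),
  abundance = c(50, 900, 300, 120, 700, 650)
)

recal_pool <- pick_tallest(peaks, bin = 20, n = 2)
print(recal_pool)
```

With a bin width of 20 and two peaks per bin, the weakest peak in each window (201.1 and 223.0 here) is dropped from the recalibrant pool.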
+RecalX() and Recal_2X() are similar to Recal() and Recal_2(), but provide some iteration of the mass calibration and can be used more effectively with small mass windows. The homologous series are chosen in the same way as in Recal() and Recal_2(), but then they are used to do a single term recalibration for the entire spectrum instead of segments. These calibrated masses are then used to do a segmented recalibration. Within each segment the recalibrants from the previous step are used and then the tallest peaks assigned a molecular formula within each window are selected as recalibrants, with half above the central recalibrant and half below.
+The functions will be described in the order that they are most effectively used. The functions do not have to be run in this order, but the best results will likely be obtained in this way. A list of the functions in the recommended order is given below:
1. Run HistNoise() or KMDNoise() to determine the noise level for the data.
2. Check effectiveness of S/N cut point using SNplot().
3. Use IsoFiltR() to identify potential 13C and 34S isotope peaks.
4. Using the S/N cut point, and the two data frames output from IsoFiltR(), run MFAssignCHO() to assign CHO MF to potentially be used as recalibration ions.
5. Use RecalList() to generate a list of the potential recalibrant series.
6. After choosing a few recalibrant series, use Recal() (or Recal_2()) to check whether they are good recalibrants and recalibrate the mass lists using those recalibrants.
7. Use MFAssign() with the recalibrated mass lists to assign MF to the data.
8. Check the output plots from MFAssign() to check the quality of the assignment.
The following functions are used for mass lists containing noise.
+This function is an adaptation of the method developed by Zhurov et al. (2014). It is used to estimate the noise level for raw mass spectral data from both FT-ICR and Orbitrap MS. There should, in theory, be a significant first peak that contains the measured masses due to random noise, followed by additional distinct peaks. This function finds the valley between the random noise and the analyte signal of the histogram output. The output noise level can then be used to estimate the signal to noise cut level and constrain the masses that are considered in the MFAssign() function. In some cases the data does not form the expected distribution and even when it does, sometimes the function does not identify the correct valley. In these cases, the KMD slice method generally provides a useful estimate of the noise because it is more general to samples regardless of their noise distribution.
+Data <- read.csv("YourMassList.csv")
+#You can read in an external data set. Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance).
+
+HistNoise(Data, SN = 0, bin = 0.01)
+df - a two column data frame containing the measured ion mass (first column) and the measured ion abundance (second column).
SN - a manual S/N cut point if the function does not find an acceptable value; default is 0.
bin - the binwidth for generating the histogram; default is 0.01.
The output of HistNoise() is a list containing the following components:
+“Noise” - a numeric value containing the estimated noise level.
“Hist” - a histogram of the intensity distribution of the peaks in the mass spectrum. It is color coded to highlight peaks below (red) and above (blue) the estimated noise level.
This function implements the KMD slice method of estimating the noise level for a mass spectrum, described previously in this document.
+Data <- read.csv("YourMassList.csv")
+#You can read in an external data set. Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance).
+
+KMDNoise(Data, upper = 0.2, lower = 0.05)
+upper - the y-intercept for the upper boundary of the KMD slice; default is 0.2.
lower - the y-intercept for the lower boundary of the KMD slice; default is 0.05.
The output of KMDNoise() is a list containing the following components:
+“Noise” - a numeric value containing the estimated noise level.
“KMD” - a KMD plot showing the KMD values for all peaks in the spectrum, with the selected noise estimation region bounded by red lines.
This function generates a mass spectrum with color coded mass peaks to indicate if they are below (red) or above (blue) the S/N cut point. This can be used as a qualitative check of the suggested output from HistNoise() or KMDNoise(). It can also be used for qualitative investigation of the S/N level in the mass spectrum independent of the two noise estimation functions.
+Data <- read.csv("YourMassList.csv")
+#You can read in an external data set. Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance).
+
+SNplot(Data, cut = 500, mass = 400, window.x = 0.5, window.y = 10)
+df - a two column data frame containing the measured ion mass (first column) and the measured ion abundance (second column).
cut - the signal-to-noise cut point.
mass - the center mass of the window.
window.x - the width of window on either side of the center mass; default is 0.5.
window.y - sets the upper limit of the y axis of the plot as the cut point multiplied by this value; default is 10.
IsoFiltR() provides a tentative filtering of masses with 13C and 34S from the overall mass list, as described previously. This decreases the likelihood of incorrect assignments. Be sure to include a noise level, which lessens the number of peaks being considered. The way isotopes are identified requires the generation of very large data frames, so if too many peaks are considered the function will take a long time to run, or will not be able to finish at all.
+Data <- read.csv("YourMassList.csv")
+#You can read in an external data set. Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance).
+#Be sure to include a signal-to-noise level cut value so that the function will work properly.
+
+Mono_Iso <- IsoFiltR(Data, SN = 500, Diffrat = 0.1)
+
+Mono <- Mono_Iso[["Mono"]]
+Iso <- Mono_Iso[["Iso"]]
+Data - a two column data frame containing the measured ion mass (first column) and the measured ion abundance (second column).
SN - a user defined signal-to-noise cut point; default is 0.
Diffrat - a user defined ratio to tighten (larger value) or loosen (lower value) the intensity thresholds for identifying a peak as an isotopic peak; default is 0.1.
MFAssignCHO() is a simpler version of MFAssign() that only assigns CHO MF. This can be helpful when trying to do a quick assignment prior to internal mass recalibration. The MF assignment algorithm is based on the same principles as the full MFAssign(). It uses the CHOFIT algorithm to do a preliminary assignment of the CH2 homologous series on a subset of the masses and Kendrick Mass Defect and z* series analysis to extend the assignments to related remaining masses. An example of its usage is shown below, along with its input parameters.
+Assign <- MFAssignCHO(peaks = Mono, isopeaks = Iso, ionMode = "pos", lowMW = 100, highMW = 1000, Mx = 1, ppm_err = 3, H_Cmin = 0.3)
+#This is a typical set of parameters for positive ion data.
+
+#The output list includes the following datasets.
+Unambig <- Assign[["Unambig"]] #Unambiguous MF assignments data frame
+Ambig <- Assign[["Ambig"]] #Ambiguous MF assignments data frame
+Unassigned <- Assign[["None"]] #Unassigned values data frame
+
+#The output list includes the following plots.
+MSAssign <- Assign[["MSAssign"]] #Mass spectrum showing assigned, unassigned, and isotope peaks. Assigned peaks are in green, unassigned peaks are in red, and isotope peaks are in blue.
+
+Error <- Assign[["Error"]] #Error plot with m/z vs. absolute error (ppm) including the unambiguous MF in blue and the ambiguous MF in red.
+
+MSgroups <- Assign[["MSgroups"]] #Mass spectrum showing the assigned peaks with color to indicate the elemental group. The plot is faceted by ambiguity of the MF assignments.
+
+VK <- Assign[["VK"]] #van Krevelen plot colored by elemental group and faceted by ambiguity of the MF assignments.
+
+MSAssign #Print MSAssign
+Error #Print Error
+MSgroups #Print MSgroups
+VK #Print VK
+peaks - the input data frame with the measured ion mass in the first column followed by measured ion abundance in the second column; the column names can be anything.
isopeaks - the input isotopic masses data frame with the same structure as “peaks”; if the two data frames (peaks and isopeaks) come from the IsoFiltR function they will be formatted correctly.
ionMode - the ionization mode, either “pos” for positive mode or “neg” for negative mode; the values are case sensitive.
POEx - the assignment of positive odd or even electron ions. When POEx is set to 0, only positive even electron ions are permitted. When POEx is set to 1, positive odd electron ions are allowed, in addition to even electron ions. The default is 0. This option is useful when the measured ions were generated by either atmospheric pressure chemical ionization (APCI) or photoionization (APPI).
NOEx - the assignment of negative odd electron ions. When NOEx is set to 0, only negative even electron ions are permitted. When NOEx is set to 1, negative odd electron ions are allowed, in addition to even electron ions (1 = on). The default is 0. This option is useful when the measured ions were generated by either atmospheric pressure chemical ionization (APCI) or photoionization (APPI).
lowMW - the minimum ion mass to be assigned. The default is 100.
highMW - the maximum ion mass to be assigned. The default is 1000.
Ex - the maximum number of 13C to be used in the function.
Mx - the maximum number of Na+ adducts to be used in the function. Note that this is important for most positive mode ESI data. The default is 0.
NH4x - the maximum number of NH4+ adducts to be used in the function. The default is 0. Note that this will replace one N and 4 H in a CHNO MF that does not have an NH4+ adduct, so great care should be taken with the MF assignments to understand if they are correct.
+Note that the addition of more heteroatoms or adducts will decrease the speed of the function. This is especially true if more than one type of heteroatom is allowed.
+
Zx - the maximum number of charges allowed. The default is 1. In principle, MFAssign() could assign multiply charged ions, but this is not supported at this time.
Ox - the maximum number of O to be assigned. It also indirectly sets the number of loops for the core CHOFIT algorithm. Each core loop increases the number of O by 3, which means that setting this value to 30 allows 10 loops of the inner algorithm, which seems to be a good balance between function speed and reasonable MF assignment.
ppm_err - the maximum allowable error for MF assignment and monoisotope/polyisotope peak matching. The value is in parts per million (ppm) and the default is 3.
SN - the signal-to-noise level for the data. This is useful if the signal-to-noise value for the data is known from an external source (such as HistNoise() or KMDNoise()). MFAssign() does not have the ability to independently determine the signal-to-noise level. The default is 0, and the input value must be consistent with the type of abundance in the abundance column of peaks (i.e., intensity or relative abundance).
O_Cmin - the minimum allowable O/C ratio for the assigned MF. The default is 0.
O_Cmax - the maximum allowable O/C ratio for the assigned MF. The default is 2.5.
H_Cmin - the minimum allowable H/C ratio for the assigned MF. The default is 0.3.
H_Cmax - the maximum allowable H/C ratio for the assigned MF. The default is 3.
DBEOmin - the minimum allowable DBE minus O value. The default is -13, consistent with Herzsprung et al. 2014.
DBEOmax - the maximum allowable DBE minus O value. The default is 13, consistent with Herzsprung et al. 2014.
Omin - the minimum allowable number of O for a MF. The default is 0.
HetCut - a filtering step that compares ambiguous MF and removes the MF with the higher number of heteroatoms (heteroatoms are defined as all elements that are not C or H). This parameter is based on Ohno and Ohno (2013). The default setting is “off” because this can lead to incorrect assignments, especially if many heteroatoms are expected for the data. The input values are “on” or “off” and are case sensitive.
NMScut - a filtering step based on nominal mass patterns as described by Koch et al. (2007). It helps to decrease the number of ambiguous MF. The default setting is “on”; to turn this option off, use NMScut = “off”.
DeNovo - a threshold where m/z values above this threshold are only assigned MF via a formula extension; values below this threshold are not restricted to a formula extension relationship. The default setting is 1000 for CHO assignment.
nLoop - the number of times that the formula extension component of MFAssignCHO() will loop to assign MF, which were not previously assigned using the CHOFIT algorithm. The default number of loops is 5.
Ambig - Turns on or off the QA component of formula extension. The default is “off”.
MSMS - Turns on or off CH2 KMD prescreening of masses below the DeNovo threshold before initial assignment. Default is “off”.
The output of the function is a list containing 3 data frames and 4 plots. The data frames will be described here first.
+The first data frame (Unambig) contains the assigned unambiguous MF along with other parameters that are useful for data interpretation, such as O/C, H/C, DBE, and more. The column headers and a brief description of each are given below.
+The second data frame (Ambig) contains the assigned ambiguous MF, with the same additional information as the Unambig data frame.
+The third data frame (None) contains the ion masses that were not assigned to a MF, along with their corresponding abundance.
+The following column headers are the same for both Unambig and Ambig data frames.
abundance - the measured abundance of each identified species; it is identical to the input data frame values.
+exp_mass - the measured experimental ion mass from the input data frame; it is identical to the input data frame values.
formula - the assigned MF for the experimental mass.
class - the heteroatom class of the MF based on the number of heteroatoms.
group - the elemental group of the MF (CHO, CHNO, etc.).
C - the total number of assigned 12C + 13C atoms.
H - the number of assigned 1H atoms.
O - the number of assigned 16O atoms.
N - the number of assigned 14N atoms.
S - the number of assigned 32S atoms.
P - the number of assigned 31P atoms.
E - the number of assigned 13C atoms.
S34 - the number of assigned 34S atoms.
N15 - the number of assigned 15N atoms.
D - the number of assigned 2H atoms.
Cl - the number of assigned 35Cl atoms.
Cl37 - the number of assigned 37Cl atoms.
M - the number of assigned Na+ adducts.
NH4 - the number of assigned NH4+ adducts.
POE - indicates whether the assigned MF is a positive odd electron mass (1) or not (0).
NOE - indicates whether the assigned MF is a negative odd electron mass (1) or not (0).
Z - the charge on the mass.
neutral_mass - the neutral mass, defined as exp_mass plus or minus the mass of its adduct (either H+ or Na+), depending on whether the mass was collected in the negative or positive mode.
O_C - the O/C ratio for the assigned MF.
H_C - the H/C ratio for the assigned MF.
theor_mass - the theoretical neutral mass for the assigned MF using the exact masses of the atoms.
DBE - the number of double bond equivalents for the assigned MF. Note, only the lowest valence number is considered for multivalent elements. Therefore, this does not include unsaturations associated with oxidized elements such as S.
err_ppm - the error between the measured mass and the theoretical mass for the assigned MF in parts per million (ppm). The adduct mass is considered in this calculation.
AE_ppm - the absolute value of the err_ppm.
KM - the Kendrick mass using CH2 as the Kendrick base.
KMD - the Kendrick mass defect using CH2 as the Kendrick base.
max_LA - the theoretical maximum allowable number of large atoms (elements larger than 2H) for a measured mass based on the ‘Rule of 13’.
actual_LA - the actual number of large atoms (elements larger than 2H) in the assigned MF.
rule_13 - the ratio of actual_LA-to-max_LA. If the ratio is less than 1, the MF is chemically feasible based on the ‘Rule of 13’.
DBEO - the DBE value minus the number of O atoms in the MF.
max_H - the maximum possible number of H atoms for the measured mass.
H_test - the number of H atoms divided by the max_H value. If the ratio is less than 1, the MF is chemically feasible based on this parameter.
C13_mass - the measured mass of the single 13C polyisotopic mass from the “isopeaks” input that was matched to the assigned monoisotopic mass from the “peaks” input.
C13_abund - the measured abundance of the single 13C polyisotopic mass that was matched to the assigned monoisotopic mass.
C13_mass2 - the measured mass of the double 13C polyisotopic mass from the “isopeaks” input that was matched to the assigned monoisotopic mass from the “peaks” input.
C13_abund2 - the measured abundance of the double 13C polyisotopic mass that was matched to the assigned monoisotopic mass.
S34_mass - the measured mass of the single S34 polyisotopic mass from the “isopeaks” input that was matched to the assigned monoisotopic mass from the “peaks” input.
S34_abund - the measured abundance of the single S34 polyisotopic mass that was matched to the assigned monoisotopic mass.
tag - a tag identifying whether an assignment is ambiguous or unambiguous. It is denoted as “Ambiguous” or “Unambiguous”.
The third data frame (None) contains the measured masses that were not assigned with a MF. These can be further analyzed using MFAssignAll().
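Several of the columns listed above are simple functions of the measured mass and the assigned formula. The base R sketch below reproduces a few of them for one entry of the Short_CHO_neg example data (Exp_mass 235.0251, C11H8O6). The variable names, the atomic mass of H, and the KMD sign convention (nominal KM minus KM) are assumptions made for illustration; the package may round or sign these values differently.

```r
# Illustrative only: names and conventions are assumptions,
# not the package's internal code.
C <- 11; H <- 8; O <- 6; N <- 0      # element counts for C11H8O6
exp_mass <- 235.0251                 # measured ion mass (negative mode)

O_C  <- O / C                        # O/C ratio
H_C  <- H / C                        # H/C ratio
DBE  <- C - H / 2 + N / 2 + 1        # double bond equivalents (lowest valence)
DBEO <- DBE - O                      # DBE minus the number of O atoms

# Kendrick mass with CH2 as the base: rescale so that CH2 weighs exactly 14
CH2_exact <- 12 + 2 * 1.00782503
KM  <- exp_mass * 14 / CH2_exact
KMD <- round(KM) - KM                # one common sign convention (assumed)

print(round(c(O_C = O_C, H_C = H_C, DBE = DBE, DBEO = DBEO, KM = KM, KMD = KMD), 4))
```

Members of the same CH2 homologous series share (nearly) the same KMD, which is why the Kendrick columns are useful when reviewing assignments.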
+There are four plot outputs in the MFAssignCHO() function.
+MSAssign - the mass spectrum of the assigned, unassigned, and isotope peaks shown in different colors (green, red, and blue, respectively).
Error - an error plot with the Exp_mass vs. absolute error for the assigned MF. Unambiguous MF are blue and ambiguous MF are red.
MSgroups - a reconstructed mass spectrum of the assigned peaks colored by their elemental composition (CHO, CHNO, etc.). CHO, CHNO, CHOS, CHNOS, CH, CHN elemental groups are considered, all other molecular groups are classified as “Other”. The plot is faceted to separate the ambiguous and unambiguous MF assignments.
VK - the van Krevelen plot of the assigned MF colored by their elemental composition, similar to the MSgroups plot. The plot is faceted to separate the ambiguous and unambiguous MF assignments.
The following two functions are used for internal mass calibration: RecalList() provides qualitative metrics for the selection of possible recalibrant series and Recal() performs a mass recalibration using the approach described in Kozhinov et al. (2013).
+RecalList() is a function that takes the output of MFAssign() or MFAssignCHO() and provides metrics to rank the suitability of homologous series as recalibrants. The function selects CHO homologous series with at least three members. The homologous series are evaluated to determine the number of observations in each series (Number Observed), the mass range of each series (Mass Range), the mass of the tallest peak in each series (Tall Peak), and the “Abundance Score”, which shows the percentage difference between the mean abundance of a homologous series and the median abundance within the mass range that the “Tall Peak” of each series falls in (for example m/z 200-300). The output is a data frame with all eligible series present for the user to review. In general, a series with many members and a high abundance score is a good place to start. The goal in this recalibration method is to have recalibrants with high local abundance across the entire range of the spectrum.
+#The input for this function is the output from any of the MFAssign functions.
+RecalList <- RecalList(df = Unambig)
+The output is a data frame with nine columns for user evaluation of the possible recalibrant series.
+Series - the heteroatom class (e.g., “O6”), adduct type (“H” or “Na”), and DBE (e.g., 3) concatenated into a single term (class_Adduct_DBE). This series information can be used to identify homologous series to be used as recalibrants in Recal().
Number Observed - the number of observed masses in each homologous series.
Series Index - a number indicating the length of the series relative to the other identified series, the smaller the number the longer the series.
Mass Range - the mass range from the smallest member of the homologous series to the largest.
Tall Peak - the mass of the most abundant peak in each series.
Abundance Score - the percentage difference between the mean abundance of a homologous series and the median abundance within the mass range the “Tall Peak” falls in.
Peak Score - the intensity of the tallest peak in a given series compared to the second tallest peak in the series. This comparison is calculated as log10(Max Peak Intensity / Second Peak Intensity). Values closer to 0 are preferred.
Peak Distance - the number of CH2 units between the tallest and second tallest peak in each series. Values closer to 1 are preferred.
Series Score - the number of actual observations in each series compared to the theoretical maximum number based on the CH2 homologous series. Values closer to 1 are preferred.
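To make the Peak Score and Peak Distance definitions concrete, the base R sketch below computes them for a made-up CH2 homologous series. The masses, abundances, and variable names are hypothetical, and the code is an illustration of the stated formulas rather than the package's own implementation.

```r
# Hypothetical CH2 homologous series (masses and abundances are made up)
CH2 <- 14.01565
series <- data.frame(
  mass  = 251.0561 + CH2 * 0:5,
  abund = c(1200, 5400, 9800, 7600, 3100, 900)
)

ord    <- order(series$abund, decreasing = TRUE)
tall   <- series[ord[1], ]   # tallest peak ("Tall Peak")
second <- series[ord[2], ]   # second tallest peak

# Peak Score: log10 ratio of the two largest intensities (closer to 0 is preferred)
peak_score <- log10(tall$abund / second$abund)

# Peak Distance: number of CH2 units between the two tallest peaks
peak_distance <- round(abs(tall$mass - second$mass) / CH2)

# Mass Range: smallest to largest member of the series
mass_range <- range(series$mass)

print(c(peak_score = peak_score, peak_distance = peak_distance))
```

In this toy series the two tallest peaks are neighbors and similar in height, which is the pattern the "closer to 0" and "closer to 1" guidance favors.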
Recal() performs recalibration on the Mono and Iso outputs from the IsoFiltR() function and generates a mass spectrum highlighting the selected recalibrant series. The recalibration is based on the first step of the recalibration method described by Kozhinov et al. (2013), which uses a polynomial central moving average to estimate the weights used to recalibrate the masses. Additionally, the concept of a segmented “walking” recalibration from Savory et al. (2011) is used to remove systematic biases in the calibration. The recalibrated output can then be fed directly into MFAssign() for MF assignment of the recalibrated masses. The function also outputs a data frame containing the recalibrants with their original mass error and the new, recalibrated mass error. To improve the mass recalibration across the studied mass range, Recal() finds additional recalibrants related by H2 or O homologous series using Kendrick mass analysis and then selects the tallest peaks within a user defined mass range. Recal_2() uses only the peaks that are part of the chosen recalibrant series, with no automatic additional peak selection. After the recalibrants are selected, the mass spectrum is split into segments of a user defined width and the recalibrants within each segment are used to recalibrate that segment. For the purposes of running the functions, RecalX() and Recal_2X() are the same as Recal() and Recal_2(), the only exception being the addition of the “num” parameter for the X versions.
+Recalcheck <- Recal(df = Unambig, peaks = Mono, isopeaks = Iso, mode = "neg", SN = 500, mzRange = 50, series1 = "O4_Na_2", series2 = "O4_H_8", series3 = "O6_Na_8")
+
+Plot <- Recalcheck[["Plot"]]
+Plot
+Recal_Mono <- Recalcheck[["Mono"]]
+Recal_Iso <- Recalcheck[["Iso"]]
+List <- Recalcheck[["RecalList"]]
+df - the input data frame in the format of the output from MFAssign() or MFAssignCHO().
peaks - the input data frame of two columns with measured ion mass in the first column and measured ion abundance in the second column; using our recommended sequence, this is the “Mono” output from IsoFiltR().
isopeaks - the input data frame of two columns with the measured ion mass in the first column and measured ion abundance in the second column, typically the “Iso” output from IsoFiltR().
mode - a character string denoting whether the data was collected in negative (“neg”) or positive (“pos”) ion mode.
SN - a numeric value that sets the signal-to-noise threshold for the purposes of the output plot; default is 0.
mzRange - a numeric value that sets the user defined mass segment width; default is 50.
series(1-10) - a character denoting the recalibrant series (e.g., “O6_H_4”); up to 10 recalibrant series may be entered.
min - the minimum mass to be considered; default is 100.
max - the maximum mass to be considered; default is 1000.
bin - the mass window range for recalibrant selection; default is 10.
obs - the number of required recalibrant peaks within each bin; default is 2.
num - sets the number of peaks on either side of defined recalibrant to choose as additional recalibrants. Default is 5. (RecalX() and Recal_2X() only)
Plot - mass spectrum with recalibrant series highlighted in blue with the rest of the mass spectrum in gray.
Mono - a data frame of the recalibrated monoisotopic ion masses and their abundance, formatted for input to MFAssign().
Iso - a data frame of the recalibrated isotopic ion masses and their abundance, formatted for input to MFAssign().
List - data frame containing the selected recalibrant masses and their assigned MF.
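The segmented recalibration described for Recal() can be pictured with a deliberately simplified base R sketch: the spectrum is cut into segments of width mzRange, and the recalibrants that fall in each segment set the correction for that segment. Here the mean ppm error of the recalibrants stands in for the polynomial moving-average weighting of Kozhinov et al. (2013), so this is not the package's algorithm. All masses, abundances, and variable names are hypothetical.

```r
# Hypothetical recalibrants with measured and theoretical ion masses
recal <- data.frame(
  measured    = c(151.0401, 251.0562, 263.0774, 355.1036),
  theoretical = c(151.0400, 251.0560, 263.0771, 355.1032)
)
# Hypothetical peak list to be recalibrated (mass, abundance)
peaks <- data.frame(
  mass  = c(120.0450, 180.0660, 230.1010, 290.1500, 340.2000, 399.9000),
  abund = c(5e5, 8e5, 1.2e6, 9e5, 4e5, 2e5)
)

mzRange <- 100                          # segment width (assumed value)
breaks  <- seq(100, 400, by = mzRange)  # segment boundaries

# Assign every recalibrant and every peak to a segment
recal$segment <- findInterval(recal$measured, breaks)
peaks$segment <- findInterval(peaks$mass, breaks)

# Mean ppm error of the recalibrants in each segment (simplified stand-in)
recal$err_ppm <- (recal$measured - recal$theoretical) / recal$theoretical * 1e6
seg_err <- tapply(recal$err_ppm, recal$segment, mean)

# Shift every peak by the mean error of its own segment
peaks$shift_ppm  <- as.numeric(seg_err[as.character(peaks$segment)])
peaks$recal_mass <- peaks$mass / (1 + peaks$shift_ppm / 1e6)

print(peaks)
```

A segment without any recalibrant would receive no correction in this sketch (its shift would be NA), which illustrates why recalibrants spread across the whole mass range are preferred.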
MFAssign() is the function typically used for the final MF assignment with additional heteroatoms (e.g., N and S). The general parameters and method of MF assignment are the same as for MFAssignCHO(); the major difference is that multiple heteroatoms and isotopes can be included. However, an increasing number of chemically reasonable MF are possible with an increasing number of possible elements and increasing molecular weight. For this reason, this function uses a multi-path formula extension approach to reduce the number of ambiguous MF assignments. Thus, the final MF list contains ambiguous MF as well as unambiguous MF that may have been selected based on formula extensions that are expected in environmental complex mixtures. An additional consequence of this increased complexity is that the default DeNovo cut is lowered to 500 from 1000 in order to limit incorrect assignments. Some unassigned masses are also expected to remain; these could be run with Ambig = “on”.
+An example of the usage of MFAssign is shown below.
+Assign <- MFAssign(peaks = Mono, isopeaks = Iso, ionMode = "pos", lowMW =100, highMW = 1000, Nx = 3, Sx = 1, Mx = 1, ppm_err = 3, H_Cmin = 0.3)
+#The parameter settings are fairly typical for positive ion data.
+
+Unambig <- Assign[["Unambig"]] #Unambiguous MF assignments data frame
+
+Ambig <- Assign[["Ambig"]] #Ambiguous MF assignments data frame
+
+Unassigned <- Assign[["None"]] #Unassigned masses data frame
+
+MSAssign <- Assign[["MSAssign"]] #Mass spectrum showing assigned, unassigned, and isotope peaks. Assigned peaks are in green, unassigned peaks are in red, and isotope peaks are in blue.
+
+Error <- Assign[["Error"]] #Error plot with m/z vs. Absolute Error (ppm) colored to indicate unambiguous MF (blue) and ambiguous MF (red).
+
+MSgroups <- Assign[["MSgroups"]] #Reconstructed mass spectrum showing the assigned peaks colored to indicate the elemental group. The plot is faceted by the ambiguity of the MF assignments.
+
+VK <- Assign[["VK"]] #van Krevelen plot colored to indicate the elemental group and faceted by the ambiguity of the MF assignments.
+
+MSAssign #Print MSAssign
+Error #Print Error
+MSgroups #Print MSgroups
+VK #Print VK
+Many of the input parameters are common between MFAssign() and MFAssignCHO(), so only the new parameters are defined here.
+Nx - the maximum number of 14N atoms.
Sx - the maximum number of 32S atoms.
Px - the maximum number of 31P atoms.
S34x - the maximum number of 34S atoms.
N15x - the maximum number of 15N atoms.
Dx - the maximum number of 2H atoms.
Ex - the maximum number of 13C atoms.
Clx - the maximum number of 35Cl atoms.
Cl37x - the maximum number of 37Cl atoms.
Fx - the maximum number of 19F atoms.
SulfCheck - turns on or off the option for a sulfur isotope check for QA purposes.
+Note that an increased number of heteroatoms or adducts will decrease the speed of the function. This is especially true if more than one type of heteroatom is allowed.
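One way to see why additional heteroatoms slow the assignment is to count how many heteroatom and adduct combinations are possible for a given set of maxima. The base R sketch below does only that counting; the helper count_combos() is hypothetical, and the package's actual search over C, H, and O and its filtering steps are not modeled here.

```r
# Hypothetical helper: enumerate all combinations of 0..max for each element
count_combos <- function(...) {
  maxima <- c(...)
  grid <- expand.grid(lapply(maxima, function(m) 0:m))
  nrow(grid)
}

cho_only <- count_combos(Nx = 0, Sx = 0, Mx = 0)                   # CHO only
typical  <- count_combos(Nx = 3, Sx = 1, Mx = 1)                   # settings like the example above
extended <- count_combos(Nx = 3, Sx = 1, Px = 1, Mx = 1, Ex = 1)   # two more elements allowed

print(c(cho_only = cho_only, typical = typical, extended = extended))
```

The count grows multiplicatively with each element that is allowed, so every added heteroatom type multiplies the number of candidate combinations to be evaluated per mass.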
+
The output of this function is the same as for MFAssignCHO(); see the description in MFAssignCHO() Output above.
+Although many well-vetted methods of decreasing the number of MF assigned to each mass have been employed in the MFAssign functions, it is important to take time to assess the correctness of the assigned MF based on realistic expectations for the specific data being analyzed.
+It is important to ensure that each input parameter is set properly for the analysis being performed. There are warning messages that will be displayed if the parameters are set outside the typical range (or include a typo), but these warnings will not stop the function from running. If too many heteroatoms are included, the runtime of the function will be greatly increased, and the results may be unreasonable. On the other hand, if too few or incorrect heteroatoms are included, the results may be incorrect.
+If at any time MFAssign() or MFAssignCHO() needs to be stopped before it has completed, it is recommended to run the following line of code: “.rs.restartR()”. This is required to clear the memory, so that the functions can be run at a reasonable speed in future attempts. This line of code is also included within the MFAssign functions, so R will automatically restart as necessary to ensure continued good performance.
+Whenever possible, keep the number of masses in the input data frame as small as possible, since the speed of the function decreases significantly with an increasing number of masses. The performance is reasonably good (roughly 60 sec or less) up to 10,000 masses with a moderate number of elements (3-5). However, both a larger number of masses and higher maximum numbers of heteroatoms decrease the function speed, and the runtime can become unreasonable depending on the computer (e.g., >30 minutes).
+Setting a reasonable signal-to-noise (SN) threshold can be very important with regard to function speed. Thus, prior to running a raw mass list, you should attempt to estimate the SN or set it to a reasonable value. This can be done with HistNoise(), KMDNoise(), and/or SNplot(). Otherwise the function will try to assign MF to every mass peak (including the noise), which slows down the function and increases the likelihood of incorrect assignments.
+The two input columns can have any name, but it is very important that the measured ion mass is in the first column and the measured abundance (or ion intensity) is in the second column. The function only attempts to assign MF to the first column. It is also important to put only ion masses into the function; neutral masses will not work.
+As noted in the opening paragraph of this vignette, this package contains MFAssign_RMD() and MFAssignCHO_RMD() functions that were designed to be run within an R Markdown file. The functions are identical to the corresponding non-“RMD” versions (MFAssign() and MFAssignCHO(), respectively) with one exception: they do not include the command “.rs.restartR()”. The restart command is used to clear the working memory to prevent degraded performance with sequential/repetitive analyses. Similarly, R Markdown automatically includes a restart to clear the working memory after a document is rendered.
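The input layout and the signal-to-noise guidance above can be combined in a short base R sketch. The mass list, the noise estimate, and the multiplier are all made-up values for illustration; in practice the noise level would come from one of the noise estimation functions, whose exact outputs are not reproduced here.

```r
# Hypothetical raw mass list; column names are arbitrary, the order is what matters:
# ion mass in the first column, abundance in the second
raw <- data.frame(
  mz        = c(207.0301, 235.0251, 301.1123, 321.0620, 363.1091),
  intensity = c(8.4e5, 2.1e6, 3.0e4, 9.7e5, 4.5e4)
)

# Basic checks on the two-column layout before any assignment is attempted
stopifnot(ncol(raw) == 2, is.numeric(raw[[1]]), is.numeric(raw[[2]]))

noise      <- 1.0e4   # assumed noise estimate
multiplier <- 6       # assumed signal-to-noise multiplier

# Keep only peaks above the threshold so noise peaks are not sent for assignment
keep <- raw[raw[[2]] > noise * multiplier, ]
print(keep)
```

Filtering by position (raw[[1]], raw[[2]]) rather than by column name mirrors the requirement that only the column order matters.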
+A recommended data analysis practice is to use R Markdown documents to serve as a record of the actual data files that serve as inputs and the parameters used in the individual functions. Thus, we have started using them with MFAssignR functions for record keeping and learned that we can semi-automatically process similar data sets using a short .R script that performs a loop over a selected set of data files.
+MFAssign_RMD() and MFAssignCHO_RMD() should only be used within an R Markdown document and should only be called once each within the document; otherwise the runtime slows down as described above. These RMD functions are particularly useful when they are used in a looped R Markdown document, which will allow the user to semi-automate the MF assignment and improve data throughput. Care must be taken when using these scripts to ensure the correct parameters are set. An R Markdown template with an entire MFAssignR pipeline designed for R Markdown is included as an additional file in this repository. The descriptions for MFAssign() and MFAssignCHO() are applicable to MFAssign_RMD() and MFAssignCHO_RMD(); thus, they will not be repeated here. As noted previously, the only difference between these functions is that the RMD versions do not have an “.rs.restartR()” command built into them, which allows them to run within an R Markdown document.
+Located in the GitHub repository for this package are an R script and an R Markdown document that can be used to semi-automate MF assignment over a set of data input files. The files can be edited based on the requirements of the user, but the basic template will generate a report showing the outputs of the various functions and the .csv outputs of the unambiguous, ambiguous, and unassigned mass lists, along with a list of the recalibrant ions. As noted, these scripts are best used when you have several data sets that are expected to have the same function parameters applied to them. For example, a set of sample replicates each collected using identical instrument settings could be semi-automated using the same function parameters. However, unrelated samples with very different MF compositions would be less favorable because the recalibrant ions may differ. As previously described the recalibrant ions are one of the parameters that must be manually determined before running the MF assignment scripts.
+To use the MF assignment looping scripts you must put all of the data sets into a single folder. These data sets must all have a consistent name extension; the default is “_MS”, which should be placed at the end of your file name, for example “YourData_MS.csv”. Place the “reader” script “Formula Assignment Reader.R” and the markdown template “Formula Assignment Markdown.Rmd” within the folder with the data to be analyzed. Then, add to this folder an empty subfolder called “Assigned Formulas” so that the output .csv files can be saved there. Next, check the reader script to ensure that the working directory matches the location where your data is stored. No other changes are required to make this reader script work, apart from ensuring that the name of the R Markdown file being rendered is the same as the one in your folder (the default name is mentioned above).
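The folder layout described above can be sketched in base R as follows. This is not the contents of the reader script shipped with the package; it only illustrates how files carrying the “_MS” extension could be collected and how the “Assigned Formulas” subfolder fits in. The temporary folder, the sample file names, and the output file name pattern are hypothetical.

```r
# Self-contained demo in a temporary folder (paths and file names are examples)
data_dir <- file.path(tempdir(), "mf_demo")
dir.create(data_dir, showWarnings = FALSE)
file.create(file.path(data_dir, c("SampleA_MS.csv", "SampleB_MS.csv", "notes.txt")))

# Output subfolder expected by the workflow described above
out_dir <- file.path(data_dir, "Assigned Formulas")
dir.create(out_dir, showWarnings = FALSE)

# Find every data set carrying the "_MS" name extension
files <- list.files(data_dir, pattern = "_MS\\.csv$", full.names = TRUE)

for (f in files) {
  sample_name <- sub("_MS\\.csv$", "", basename(f))
  out_file <- file.path(out_dir, paste0(sample_name, "_Unambig.csv"))  # hypothetical name
  # A real reader script would render the R Markdown template for this file here;
  # that step is omitted so the sketch runs with base R only.
  message("Would process ", basename(f), " -> ", basename(out_file))
}
```

Files without the expected extension (here “notes.txt”) are simply ignored by the pattern match, which is why a consistent naming scheme matters.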
+For the R Markdown template there are a few changes that will need to be considered before assigning MF. These include the function parameters for heteroatoms, ionization mode, signal-to-noise threshold, and recalibrant series. The changes for heteroatoms and ionization mode can be selected based on what type of data you are analyzing and the expected heteroatoms. In the default template the noise is estimated by the KMDNoise() function and is designed so that the only thing that needs to be changed is the multiplier value (the default is 6). This value can be changed at the top of the document and will be applied to the rest of the document. The determination of the recalibrant series requires the user to take one representative sample and experimentally determine the best recalibrant series to use. These recalibrant series can then be put in the Recal() function and will be applied to all the data sets within the folder.
+After all of these changes are made, you can run the “reader” script (all of the lines) and it will begin assigning MF to each of the datasets in your folder. This can take a significant amount of time, depending on the number of files. On a relatively new Dell XPS 15 laptop with a 7th generation i7 processor with 16 GB of RAM, one sample takes about 4 minutes to be processed, depending on the size of the mass list and the number of heteroatoms being assigned.
+The package functions have a significant reliance on dplyr and tidyr functions for some of the data manipulations. Plots are generated using the ggplot2 package, and the colorRamps package provides the color scheme for the output of KMDNoise().
+There are multiple warning messages that are reported when the functions are being run. Unless the function stops working, these warning messages do not otherwise impact the functioning of the code.
+A variety of additional functions are included in this package, but many of them are sub-functions that are necessary for the MFAssign functions and are not independently operational. Only the independent functions have been described in this document.
+There is a large mass list built into the package called CHNOS_ML_Ex, which can be used to test whether the function is working correctly. It is a negative mode mass list with even electron ions generated by electrospray ionization. When a maximum of 3 nitrogen and 1 sulfur are allowed, MFAssign() assigns 2116 of 2121 total masses using DeNovo = 500. Additionally, there is a smaller data frame (Short_CHO_neg) of 13 observations randomly sampled from CHNOS_ML_Ex, which is more effective for checking whether or not MF are correct. There is also an 8 observation data frame (Short_CHO_pos) of positive even electron ions generated by electrospray ionization, which can be used to check the MFAssign parameters for positive ions. All functions should be able to assign all the masses in these short example data sets if DeNovo = 1000. Additionally, a raw mass list containing negative even electron ion data is included and can be used to test the Noise, SNplot, IsoFiltR, Recal and MFAssign functions. Its name is Raw_Neg_ML.
+This is the short data frame of negative ion data for testing purposes. The masses and correctly assigned unambiguous MF are shown below.
Exp_mass | formula |
---|---|
531.2092 | C24H36O13 |
235.0251 | C11H8O6 |
563.1992 | C24H36O15 |
331.1767 | C16H28O7 |
391.0676 | C18H16O10 |
403.0524 | C15H16O13 |
321.0620 | C15H14O8 |
363.1091 | C18H20O8 |
683.2931 | C33H48O15 |
207.0301 | C10H8O5 |
523.1102 | C23H24O14 |
437.1460 | C21H26O10 |
487.1465 | C21H28O13 |
This is the short data frame of positive ion data for testing purposes. The masses and correctly assigned unambiguous MF are shown below.
Exp_mass | formula |
---|---|
415.1235 | C18H22O11 |
325.2162 | C22H28O2 |
271.0812 | C12H14O7 |
265.0859 | C17H12O3 |
195.0652 | C10H10O4 |
303.0863 | C16H14O6 |
271.1176 | C13H18O6 |
267.1591 | C15H22O4 |
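As an independent check of the positive ion table, the base R sketch below parses each formula, computes the theoretical ion mass, and reports the error in ppm. It assumes that every ion is the protonated molecule [M+H]+ and uses (measured - theoretical) / theoretical as the error convention; both are assumptions made for this example, and the helper functions are hypothetical rather than part of the package.

```r
# Monoisotopic atomic masses and the proton mass (standard reference values)
atom_mass <- c(C = 12, H = 1.00782503207, O = 15.99491461956)
proton    <- 1.00727646688

# Hypothetical helper: element counts from a formula string such as "C18H22O11"
parse_formula <- function(f) {
  tokens <- regmatches(f, gregexpr("[A-Z][a-z]?[0-9]*", f))[[1]]
  elem   <- sub("[0-9]+$", "", tokens)
  n      <- sub("^[A-Za-z]+", "", tokens)
  n      <- ifelse(n == "", 1, as.numeric(n))   # a missing count means one atom
  tapply(n, elem, sum)
}

# Hypothetical helper: theoretical neutral monoisotopic mass
neutral_mass <- function(f) {
  counts <- parse_formula(f)
  sum(as.numeric(counts) * atom_mass[names(counts)])
}

pos <- data.frame(
  Exp_mass = c(415.1235, 325.2162, 271.0812, 265.0859,
               195.0652, 303.0863, 271.1176, 267.1591),
  formula  = c("C18H22O11", "C22H28O2", "C12H14O7", "C17H12O3",
               "C10H10O4", "C16H14O6", "C13H18O6", "C15H22O4"),
  stringsAsFactors = FALSE
)

pos$theor_ion <- vapply(pos$formula, neutral_mass, numeric(1)) + proton  # assumes [M+H]+
pos$err_ppm   <- (pos$Exp_mass - pos$theor_ion) / pos$theor_ion * 1e6

print(pos)
```

Under the [M+H]+ assumption, all eight entries agree with their listed formulas to within 1 ppm.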