diff --git a/MFAssignR/vignettes/MFAssignR Vignette.Rmd b/MFAssignR/vignettes/MFAssignR Vignette.Rmd new file mode 100644 index 0000000..17aea06 --- /dev/null +++ b/MFAssignR/vignettes/MFAssignR Vignette.Rmd @@ -0,0 +1,603 @@ +--- +title: "MFAssignR" +author: "Simeon Schum, Lynn Mazzoleni, and Laura Brown" +date: "`r Sys.Date()`" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{MFAssignR} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +##Package Overview and References +The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. A number of tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous formula selections are provided. This package contains MFAssign(), MFAssignCHO(), MFAssignAll(), SNplot(), SNcutCheck(), MFRecalList(), MFRecalCheck(), and IsoFiltR() described in the sections below. Caution with parameter settings and output evaluation is recommended. + +The functions in the MFAssignR package were developed by adapting methods and algorithms from the peer reviewed literature. The following references are referred to in this document: + +Green, N. W. and Perdue, E. M.: Fast Graphically Inspired Algorithm for Assignment of Molecular Formulae in Ultrahigh Resolution Mass Spectrometry, Anal Chem, 87(10), 5086–5094, doi:10.1021/ac504166t, 2015. + +Gross, J. H.: Mass Spectrometry, doi:10.1007/978-3-319-54398-7, 2017. + +Herzsprung, P., Hertkorn, N., Tümpling, W. von, Harir, M., Friese, K. and Schmitt-Kopplin, P.: Understanding molecular formula assignment of Fourier transform ion cyclotron resonance mass spectrometry data of natural organic matter from a chemical point of view, Anal Bioanal Chem, 406(30), 7977–7987, doi:10.1007/s00216-014-8249-y, 2014. + +Koch, B. P., Dittmar, T., Witt, M. 
and Kattner, G.: Fundamentals of Molecular Formula Assignment to Ultrahigh Resolution Mass Data of Natural Organic Matter, Anal Chem, 79(4), 1758–1763, doi:10.1021/ac061949s, 2007. + +Kozhinov, A. N., Zhurov, K. O. and Tsybin, Y. O.: Iterative Method for Mass Spectra Recalibration via Empirical Estimation of the Mass Calibration Function for Fourier Transform Mass Spectrometry-Based Petroleomics, Anal Chem, 85(13), 6437–6445, doi:10.1021/ac400972y, 2013. + +Kujawinski, E. B. and Behn, M. D.: Automated Analysis of Electrospray Ionization Fourier Transform Ion Cyclotron Resonance Mass Spectra of Natural Organic Matter, Anal Chem, 78(13), 4363–4373, doi:10.1021/ac0600306, 2006. + +Lobodin, V. V., Marshall, A. G. and Hsu, C. S.: Compositional Space Boundaries for Organic Compounds, Anal Chem, 84(7), 3410–3416, doi:10.1021/ac300244f, 2012. + +Ohno, T. and Ohno, P. E.: Influence of heteroatom pre-selection on the molecular formula assignment of soil organic matter components determined by ultrahigh resolution mass spectrometry, Anal Bioanal Chem, 405(10), 3299–3306, doi:10.1007/s00216-013-6734-3, 2013. + +Perdue, E. M. and Green, N. W.: Isobaric Molecular Formulae of C, H, and O: A View from the Negative Quadrants of van Krevelen Space, Anal Chem, 87(10), 5079–5085, doi:10.1021/ac504165k, 2015. + +Zhurov, K. O., Kozhinov, A. N., Fornelli, L. and Tsybin, Y. O.: Distinguishing Analyte from Noise Components in Mass Spectra of Complex Samples: Where to Cut the Noise, Anal Chem, 86(7), 3308–3316, doi:10.1021/ac403278t, 2014. + +##Molecular Formula Assignment +The molecular formula assignment algorithm in MFAssign was adapted from the low mass moiety CHOFIT assignment algorithm developed by Green and Perdue (2015). Briefly, the CHOFIT algorithm uses low mass moieties such as CH4O-1 and C4O-3 to move around in the O/C and H/C space to assign masses to CHO formulas. These low mass moieties efficiently assign CHO formulas without conventional loops. 
Additional combinatorial assignments with various heteroatoms are made using nested loops that subtract the mass of a heteroatom from the measured ion mass, creating a CHO “core” mass, which can then be assigned using the low mass moiety CHOFIT approach. This is further explained in Green and Perdue (2015) and Perdue and Green (2015). + +###MFAssign Functions +There are three versions of MFAssign: MFAssign(), MFAssignCHO(), and MFAssignAll(). MFAssign() and MFAssignAll() include external nested loops to assign additional heteroatoms, as described in Green and Perdue (2015); MFAssignCHO() does not. + +####MFAssign() +Using the low mass moiety and combinatorial assignment approach, MFAssign() can be used to assign molecular formulas with 12C, 1H, and 16O and a variety of heteroatoms and isotopes, including 2H, 13C, 14N, 15N, 31P, 32S, 34S, 35Cl, and 37Cl. It can also assign Na+ adducts, which are common in positive ion mode. Because the number of chemically reasonable molecular formulas grows with the number of possible elements and with molecular weight, the output provides lists of both ambiguous and unambiguous molecular formulas. + +Advanced Kendrick mass and z* sorting tools are used to reduce the number of ambiguous molecular formulas in MFAssign(). First, Kendrick mass defect (KMD) and z* values are calculated with a CH2 Kendrick base to sort the measured masses into CH2 homologous series (Stenson et al., 2003). The function then selects 1 to 3 members of each CH2 homologous series with masses below the user defined cutoff and attempts to assign molecular formulas. The ambiguous MF are returned to the unassigned list, and the unambiguous MF are used as seeds for additional assignments using CH2, O, H2, H2O, and CH2O molecular formula extensions (Kujawinski and Behn, 2006). 
To perform the formula extensions, the KMD and z* values for each of these bases are calculated and then used to assign MF through the addition or subtraction of the series bases. MFAssign() (and MFAssignCHO()) tracks how many different “paths” can be used to assign each formula, and if a single mass has multiple formulas, the function chooses the formula with the largest number of paths leading to it. For example, if a single mass has two possible MF and one has 20 potential “paths” to it, while the other has 4, the function will choose the MF with 20 paths. Work is ongoing to track these paths and the removed MF in the data frame output of these functions. Overall, the multi-path MF extension approach greatly reduces the number of ambiguous assignments and provides an increased level of confidence in the final MF list because the MF are related to unambiguous MF assigned below the user defined cut point. + +####MFAssignCHO() +MFAssignCHO() is a simplified version of MFAssign() used to assign CHO molecular formulas only. MFAssignCHO() runs faster than MFAssign() and is best used as a preliminary formula assignment step prior to the selection of recalibrant ions in conjunction with MFRecalList() and MFRecalCheck(), which are described below. + +####MFAssignAll() +MFAssignAll() uses the low mass moiety and combinatorial assignment approach with a simplified MF extension approach in which only the CH2 and H2O Kendrick bases are used for MF assignment. This function results in a significantly higher number of ambiguous MF and is best used to assign previously unassigned masses after MFAssign() or on short mass lists without a complex mixture. + +###Preliminary Isotope Filtering +The IsoFiltR() function can identify and pre-screen many of the 13C isotope masses, which can lower the number of peaks assigned an incorrect molecular formula. This function operates on a two column dataframe using the same structure as the MFAssign() function. 
IsoFiltR() works by finding the 13C Kendrick mass defects and then using those in conjunction with the CH2 z* values to differentiate the possible monoisotope/polyisotope pairs. As a QA step, the function checks that the "polyisotopic 13C" peak is less than a certain fraction of the abundance of the "monoisotopic" peak. There are two data frame outputs from this function, "Mono" and "Iso". The "Mono" dataframe contains the masses identified as likely monoisotopes from monoisotope/polyisotope pairs and the remaining unmatched masses. The "Iso" dataframe contains only the masses that were identified as polyisotopic 13C masses from the monoisotope/polyisotope pairs. + +When the two dataframe outputs from IsoFiltR() are put into MFAssign(), the function will match the assigned monoisotopic mass peaks to their corresponding polyisotopic mass peak. Additional work is needed to use the isotopes to reduce the number of ambiguous molecular formulas assigned to a single mass. IsoFiltR() should not be considered as definitive proof of the presence or absence of polyisotopic 13C molecular formulas, but it does provide some ability to identify these masses and limit the chances that they are incorrectly assigned a monoisotopic formula. + +###Molecular Formula Quality Assurance +MFAssign() includes a number of quality assurance (QA) steps to ensure that the assigned molecular formulas are chemically reasonable. Relatively lenient default settings are provided to avoid removing chemically reasonable ambiguous molecular formula assignments. Many of these parameters are customizable, including DBE-oxygen limits (Herzsprung et al., 2014), oxygen-to-carbon ratio limits, hydrogen-to-carbon ratio limits, and the minimum number of oxygen atoms. The HetCut parameter can be used to select the molecular formula with the lowest number of heteroatoms if more than one molecular formula is assigned to a single mass (Ohno and Ohno, 2013). 
The NMScut parameter identifies the CH4 vs O exchange series in each nominal mass as described in Koch et al. (2007), which can be used to limit ambiguous assignments. Additional non-adjustable QA parameters are used in MFAssign(), including the nitrogen rule, the large atom rule, the maximum number of hydrogen rule, the maximum DBE rule (Lobodin et al., 2012), and the Senior rules (Kind et al., 2007). + +##Signal-to-Noise Assessment +Signal-to-noise level assessment can be accomplished using the SNcutCheck() and SNplot() functions, which are based on the method developed by Zhurov et al. (2014). This method uses the histogram distribution of the natural log intensities in a raw mass spectrum to determine the point where noise peaks give way to analyte signal. Additionally, the SNplot() function allows qualitative assessment of the effectiveness of the chosen S/N cut. + +The SNcutCheck() function is used to estimate the signal-to-noise cut for the raw mass spectrum output mass list. The recommended signal-to-noise cut is reported in the console and a histogram showing the distribution of the natural log of the abundance (or intensity) values is generated. The cut point is denoted using red for values below the cut point and blue for those above. This function should be run on the mass list prior to molecular formula assignment with MFAssign(). Setting a reasonable S/N cut point greatly increases the speed of MFAssign() and improves the output quality. + +The SNplot() function is used to show the mass spectrum with the peaks below the cut and above the cut denoted by the same colors as in the histogram plot from SNcutCheck(). + +##Internal Mass Recalibration +MFRecalList() and MFRecalCheck() are functions pertaining to the internal mass recalibration method adapted from Kozhinov et al. (2013) using a polynomial central moving average to estimate the weights used to recalibrate the masses. 
The function MFRecalList() can be used with the output of MFAssign() or MFAssignCHO() to generate a data frame containing potential recalibrant CH2 homologous series. The output of this function includes a variety of metrics to aid the user in picking suitable recalibrant series; these are described in greater detail in the example of MFRecalList() below. The user can select up to 10 homologous series as inputs for the mass recalibration with MFRecalCheck(). MFRecalCheck() takes the chosen series and then uses the H2 and O KMD and z* series to identify additional formulas that are related to the chosen recalibrants. To avoid recalibration problems, the function selects the user defined number of tallest peaks within a user defined mass range “bin” as recalibrants. For example, if the bin width is set at 20 and the number of peaks is set at 2, the function will select the two tallest peaks within each 20 m/z window across the range of the spectrum. The user can set the mass window bins and the number of peaks that will be chosen as recalibrants within each bin. This function then recalibrates the mass list and generates a plot with the input recalibrant series highlighted for a qualitative look at their overall quality. + +##Recommended Order of Operations +The functions will be described in the order that they are most effectively used. The functions do not have to be run in this order, but the best results will likely be obtained in this way. A list of the functions in the recommended order is given below: + +1. Run SNcutCheck() to determine the signal-to-noise cut point for the data + +2. Check effectiveness of S/N cut point using SNplot() + +3. Use IsoFiltR() to identify potential 13C isotope peaks + +4. Using the S/N cut point determined by SNcutCheck(), and the two dataframes output from IsoFiltR(), run MFAssignCHO() to assign CHO formulas for use in identifying recalibrant ions. + +5. 
Use MFRecalList() to generate a list of potential recalibrant series. + +6. After choosing a few recalibrant series, use MFRecalCheck() to check whether they are good recalibrants and recalibrate the mass lists using those recalibrants. + +7. Use MFAssign() with the recalibrated mass lists to assign molecular formulas to the data. + +8. Check the output plots from MFAssign() to check the quality of the assignment. + +##Function Examples +###Signal-to-Noise Functions + +The following two functions are used for mass lists containing noise. + +####SNcutCheck() + +This function is an adaptation of the method developed by Zhurov et al. (2014). It is used to estimate the S/N cut point for raw mass spectral data from both FT-ICR and Orbitrap MS. In theory, the histogram should contain a significant first peak made up of the measured masses that are due to random noise. This function finds the valley in the histogram between the random noise and the analyte signal. The output signal-to-noise value can then be used to constrain the masses that are considered in the MFAssign() function. In some cases the data do not form the expected distribution; in that case, estimating the S/N cut point manually and checking it with SNplot() is likely the best course of action. + +```{r, eval = FALSE} +Data <- read.csv("YourMassList.csv") +#You can read in an external data set; or use whatever method is most convenient to prepare the two column dataset. +#Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance). + + +SNcutCheck(Data, SN = 0, bin = 0.01) +``` + +#####SNcutCheck() Parameters + +* df - two column dataframe containing measured ion mass (first column) and ion abundance (second column). 
+ +* SN - sets a manual S/N cut point if the function does not find an acceptable value; default is 0 + +* bin - sets the binwidth for generating the histogram; default is 0.01 + +#####SNcutCheck() Output + +* A histogram showing the distribution of intensities, with the cut point highlighted by red (below cut point) and blue (above cut point). + +* The S/N cut recommended by the function is printed in the console. + +####SNplot() + +This function generates a mass spectrum with color coded mass peaks to indicate if they are below (red) or above (blue) the S/N cut point. This can be used as a qualitative check of the suggested output from SNcutCheck(). As mentioned previously, it can also be used for qualitative investigation of the S/N level in the mass spectrum independent of SNcutCheck(). + +```{r, eval = FALSE} +Data <- read.csv("YourMassList.csv") +#You can read in an external data set; or use whatever method is most convenient to prepare the two column dataset. +#Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance). + + + +SNplot(Data, cut = 500, mass = 400, window.x = 0.5, window.y = 10) + +``` + +#####SNplot() Parameters + +* df - two column dataframe containing measured ion mass (first column) and ion abundance (second column). + +* cut - the signal-to-noise cut point. + +* mass - sets the center mass of the window. + +* window.x - sets the width of window on either side of the center mass; default is 0.5. + +* window.y - sets the y axis of the plot by multiplying the cut point by this value; default is 10. + +#####SNplot() Output + +* A plot based on the chosen parameters that shows the mass spectrum with peaks below the cut point in red and those above in blue. + +###Isotope Filtering Function + +####IsoFiltR() +IsoFiltR() provides a tentative filtering of 13C masses from the overall mass list, as described previously. This decreases the likelihood of incorrect assignments. 
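The pairing idea can be illustrated with a much simpler stand-in. The sketch below is not the IsoFiltR() algorithm, which relies on 13C Kendrick mass defects and CH2 z* values; it only assumes that a 13C isotopologue sits about 1.0033548 Da above its monoisotopic peak and is less abundant than it. The example mass list, the 3 ppm tolerance, and the `max_ratio` abundance cap are hypothetical values chosen for illustration.

```{r, eval = FALSE}
# Hypothetical two column mass list: ion mass first, abundance second.
peaks <- data.frame(
  mass      = c(215.0925, 216.0959, 229.1081, 243.1238),
  abundance = c(1000, 110, 800, 600)
)

c13_shift <- 1.0033548  # mass difference between 13C and 12C (Da)
ppm_tol   <- 3          # hypothetical matching tolerance (ppm)
max_ratio <- 0.5        # hypothetical cap on iso/mono abundance ratio

# For each peak, look for a less abundant peak one 13C shift higher.
pairs <- do.call(rbind, lapply(seq_len(nrow(peaks)), function(i) {
  target <- peaks$mass[i] + c13_shift
  err    <- abs(peaks$mass - target) / target * 1e6
  j <- which(err <= ppm_tol &
               peaks$abundance < max_ratio * peaks$abundance[i])
  if (length(j) == 0) return(NULL)
  data.frame(mono_mass = peaks$mass[i], iso_mass = peaks$mass[j])
}))

# Split the list in the same spirit as the "Mono" and "Iso" outputs:
# matched isotope peaks go to Iso, everything else stays in Mono.
Iso  <- peaks[peaks$mass %in% pairs$iso_mass, ]
Mono <- peaks[!peaks$mass %in% pairs$iso_mass, ]
```

In this toy list only the peak at 216.0959 is flagged as a candidate 13C peak, paired with 215.0925. For real data, the IsoFiltR() call shown next should be used instead of this simplification.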
+ +```{r, eval = FALSE} +Data <- read.csv("YourMassList.csv") +#You can read in an external data set; or use whatever method is most convenient to prepare the two column dataset. +#Make sure the first column is the measured ion mass and the second column is the measured ion abundance (intensity or relative abundance). + +Mono_Iso <- IsoFiltR(Data) + +Mono <- Mono_Iso[["Mono"]] +Iso <- Mono_Iso[["Iso"]] +``` + +#####IsoFiltR() Output + +* The output of this function is a list containing two data frames. The first data frame is “Mono” and contains the monoisotopic masses and the masses that were not classified as either monoisotopic or polyisotopic. The second data frame is “Iso” which contains the masses identified as polyisotopic. + +###Preliminary CHO Assignment + +####MFAssignCHO() +MFAssignCHO() is a simpler version of MFAssign() that only assigns CHO molecular formulas. This can be helpful when trying to do a quick assignment prior to internal mass recalibration. The molecular formula assignment algorithm is based on the same principles as the full MFAssign(). It uses the CHOFIT algorithm to do a preliminary assignment of the CH2 homologous series on a subset of the masses and Kendrick Mass Defect and z* series analysis to extend the assignments to related remaining masses. An example of its usage is shown below, along with its input parameters. + +```{r, eval = FALSE} +Assign <- MFAssignCHO(peaks = Mono, isopeaks = Iso, ionMode = "pos", lowMW =100, highMW = 1000, Mx = 1, ppm_err = 3, H_Cmin = 0.3) +#This is a set of typical parameter settings for positive ion data. + +#The output list includes the following datasets. +Unambig <- Assign[["Unambig"]] #Unambiguous MF assignments data frame +Ambig <- Assign[["Ambig"]] #Ambiguous MF assignments data frame +Unassigned <- Assign[["None"]] #Unassigned values data frame + +#The output list includes the following plots. +MSAssign <- Assign[["MSAssign"]] #Mass spectrum showing assigned, unassigned, and isotope peaks. 
Assigned peaks are in green, unassigned peaks are in red, and isotope peaks are in blue. + +Error <- Assign[["Error"]] #m/z vs. absolute error (ppm) plot with unambiguous MF in blue and ambiguous MF in red. + +MSgroups <- Assign[["MSgroups"]] #Mass spectrum showing the assigned peaks colored by the molecular group they belong to. The plot is faceted by ambiguity of the assignments. + +VK <- Assign[["VK"]] #van Krevelen plot colored by molecular group and faceted by ambiguity of the assignments. + +MSAssign #Print MSAssign +Error #Print Error +MSgroups #Print MSgroups +VK #Print VK +``` + +#####MFAssignCHO() Parameters + +* peaks - the input data frame with the measured ion mass in the first column followed by measured ion abundance in the second column. The column names can be anything. + +* isopeaks - the input isotopic masses dataframe with the same structure as "peaks"; if the two dataframes (peaks and isopeaks) come from the IsoFiltR function they will be formatted correctly. + +* ionMode - sets the ionization mode. It has two possible inputs; "pos" for positive mode and "neg" for negative mode. The options are case sensitive. + +* POEx - allows the assignment of positive odd electron ions. When it is set to 0, positive odd electron ions are not permitted (0 = off). When it is set to 1, positive odd electron ions are allowed (1 = on). The default is 0. This option is likely only needed in certain circumstances possibly including the assignment of ions measured after atmospheric pressure chemical ionization (APCI) or photoionization (APPI). Note, this only works for positive ion data and extra care must be taken with the assignments to ensure they are chemically reasonable. This option does not currently work, but it is planned for a future release. + +* lowMW - sets the minimum ion mass to be assigned. The default is 100. + +* highMW - sets the maximum ion mass to be assigned. The default is 1000. 
+ +* Ex - sets the maximum number of 13C to be used in the function. + +* Mx - sets the maximum number of Na+ adducts to be used in the function. Note that this is important for most positive mode ESI data. The default is 0. + +* NH4x - sets the maximum number of NH4+ adducts to be used in the function. The default is 0. Note that this will replace one nitrogen and 4 hydrogens in a CHNO formula that does not have an NH4+ adduct, so great care should be taken with the formula assignments to understand if they are correct. + +> Note that the addition of more heteroatoms or adducts will decrease the speed of the function. This is especially true if more than one type of heteroatom is allowed. + +* Zx - sets the maximum number of charges allowed. The default is 1. In principle, MFAssign() could assign multiply charged compounds, but this is not supported at this time. + +* Ox - sets the maximum number of O to be assigned. It also indirectly sets the number of loops for the core CHOFIT algorithm. Each core loop increases the number of oxygen by 3, so setting this value to 30 allows 10 loops of the inner algorithm. This seems to be a good balance between function speed and reasonable formula assignment. + +* ppm_err - sets the maximum allowable error for MF assignment and monoisotope/isotope peak matching. The value is in parts per million (ppm) and the default is 3. + +* SN - sets the signal-to-noise level for the data. This is useful if the signal-to-noise value for the data is known from an external source (such as SNcutCheck()). MFAssign() does not have the ability to independently determine the signal-to-noise level. The default is 0, and the input value must be consistent with the type of abundance in the second column of peaks (i.e., intensity or relative abundance). + +* O_Cmin - sets the minimum allowable oxygen-to-carbon ratio for the assigned formulas. The default is 0. + +* O_Cmax - sets the maximum allowable oxygen-to-carbon ratio for the assigned formulas. 
The default is 2.5. + +* H_Cmin - sets the minimum allowable hydrogen-to-carbon ratio for the assigned formulas. The default is 0.1. + +* H_Cmax - sets the maximum allowable hydrogen-to-carbon ratio for the assigned formulas. The default is 3. + +* DBEOmin - sets the minimum allowable DBE minus oxygen value. The default is -13, consistent with Herzsprung et al. (2014). + +* DBEOmax - sets the maximum allowable DBE minus oxygen value. The default is 13, consistent with Herzsprung et al. (2014). + +* Omin - sets the minimum allowable number of oxygen for a molecular formula. The default is 0. + +* HetCut - this removes the ambiguous molecular formulas with a greater number of heteroatoms (heteroatoms are defined as all elements that are not C or H) if an MF with fewer heteroatoms was also assigned to the same mass. This parameter is based on Ohno and Ohno (2013). The default setting is "off" because this can sometimes lead to incorrect assignments if many heteroatoms are expected for the data. The only inputs are "on" or "off" and they are case sensitive. + +* NMScut - this is a QA parameter based on nominal mass patterns as described by Koch et al. (2007). It helps to decrease the number of ambiguous formulas. The default setting is “on”; to turn this option off, the input is “off”. + +* DeNovo - this defines the m/z cut point below which formulas can be assigned with the CHOFIT algorithm and above which formulas are assigned with the formula extension method. The default setting is 1000 for CHO assignment. + +* nLoop - this defines the number of times that the formula extension component of MFAssignCHO() will loop to assign the masses not unambiguously assigned with the CHOFIT algorithm. The default number of loops is 5. + +#####MFAssignCHO() Output + +The output of the function is a list containing 3 data frames and 4 plots. The dataframes will be described here first. 
+ +The first data frame (Unambig) contains the unambiguous assigned molecular formulas along with other parameters that are useful for data interpretation, such as O/C, H/C, DBE, and more. The column headers and a brief description of each are given below. + +The second data frame (Ambig) contains the ambiguous assignments, with the same additional information as the Unambig data frame. + +The third data frame (None) contains the ion masses that were not assigned a molecular formula, along with their corresponding abundance. + +The following column headers are the same for both Unambig and Ambig data frames. + +* Abundance - the abundance of each identified species; it is unaltered from what was in the input dataframe. + +* Exp_mass - the experimental ion mass from the input dataframe; it is unaltered. + +* formula - the assigned molecular formula for the mass. + +* class - the molecular class of the molecular formula based on the number of heteroatoms it contains. + +* group - the molecular group of the molecular formula (CHO, CHNO, etc.). + +* C - the number of assigned 12C + 13C atoms. + +* H - the number of assigned 1H atoms. + +* O - the number of assigned 16O atoms. + +* N - the number of assigned 14N atoms. + +* S - the number of assigned 32S atoms. + +* P - the number of assigned 31P atoms. + +* E - the number of assigned 13C atoms. + +* S34 - the number of assigned 34S atoms. + +* N15 - the number of assigned 15N atoms. + +* D - the number of assigned 2H atoms. + +* Cl - the number of assigned 35Cl atoms. + +* Cl37 - the number of assigned 37Cl atoms. + +* M - the number of assigned Na+ adducts. + +* NH4 - the number of assigned NH4+ adducts. + +* POE - indicates whether the assigned formula is a positive odd electron mass (1) or not (0). + +* Z - the charge on the mass. + +* Neutral_mass - the neutral mass defined as Exp_mass plus or minus its adduct (either H+ or Na+) depending on whether the mass was collected in the negative or positive mode. 
+ +* O_C - the oxygen-to-carbon ratio for the assigned molecular formula. + +* H_C - the hydrogen-to-carbon ratio for the assigned molecular formula. + +* theor_mass - the theoretical neutral mass for the assigned molecular formula using the exact masses of the atoms. + +* DBE - the number of double bond equivalents for the assigned molecular formula. Note, only the lowest valence number is considered for multivalent elements. Therefore, this does not include unsaturations associated with oxidized elements such as S. + +* err_ppm - the error between the measured mass and the theoretical mass for the assigned molecular formula in parts per million (ppm). The adduct mass is considered in this calculation. + +* AE_ppm - the absolute value of the err_ppm. + +* KM - Kendrick mass using CH2 as the Kendrick base. + +* KMD - Kendrick mass defect using CH2 as the Kendrick base. + +* max_LA - the theoretical maximum allowable number of large atoms (elements larger than 2H) for a measured mass based on the ‘Rule of 13’. + +* actual_LA - the actual number of large atoms (elements larger than 2H) in the assigned molecular formula. + +* rule_13 - the ratio of actual_LA-to-max_LA. If the ratio is less than 1, the molecular formula is chemically feasible based on the ‘Rule of 13’. + +* DBEO - the DBE value minus the number of oxygen atoms in the molecular formula. + +* max_H - the maximum possible number of hydrogen atoms for the measured mass. + +* H_test - the number of hydrogen atoms divided by the max_H value. If the ratio is less than 1, the molecular formula is chemically feasible based on this parameter. + +* Iso_mass - the measured mass of the paired polyisotopic mass from the "isopeaks" input that was matched to the assigned monoisotopic mass from the "peaks" input. + +* Iso_RA - the measured abundance of the paired polyisotopic mass that was matched to the assigned monoisotopic mass. + +* Tag - a tag denoting whether an assignment is ambiguous or unambiguous. 
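Several of the columns above are simple functions of the element counts and masses, so they can be recomputed by hand as a sanity check. The sketch below does this for a hypothetical CHO assignment (C10H16O5 with a made-up measured neutral mass). It is an illustration under stated assumptions, not the package code: the DBE expression is the usual lowest-valence form, and both the sign convention for KMD and the choice of mass used for the Kendrick calculation are assumptions that may differ from what MFAssignCHO() does internally.

```{r, eval = FALSE}
# Exact masses of the most abundant isotopes (Da).
mass_C <- 12.000000
mass_H <- 1.00782503
mass_O <- 15.99491462

# Hypothetical assignment and hypothetical measured neutral mass.
C <- 10; H <- 16; O <- 5; N <- 0
neutral_mass <- 216.1000

# Theoretical neutral mass from the element counts.
theor_mass <- C * mass_C + H * mass_H + O * mass_O

# Elemental ratios.
O_C <- O / C
H_C <- H / C

# Double bond equivalents (lowest-valence form) and DBE minus oxygen.
DBE  <- C - H / 2 + N / 2 + 1
DBEO <- DBE - O

# Mass error in ppm and its absolute value.
err_ppm <- (neutral_mass - theor_mass) / theor_mass * 1e6
AE_ppm  <- abs(err_ppm)

# Kendrick mass with a CH2 base: rescale so that CH2 weighs exactly 14.
mass_CH2 <- mass_C + 2 * mass_H
KM  <- neutral_mass * 14 / mass_CH2
KMD <- round(KM) - KM  # assumed convention: nominal KM minus KM

data.frame(theor_mass, O_C, H_C, DBE, DBEO, err_ppm, AE_ppm, KM, KMD)
```

For this hypothetical formula DBE is 3 and DBEO is -2, and the mass error falls inside the default 3 ppm window, so the assignment would pass the ppm_err and DBEOmin/DBEOmax checks at their default settings.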
+ +The unassigned masses in the third data frame (None) can be further analyzed using MFAssignAll(). + +#####MFAssignCHO() Output Plots + +There are four plot outputs in the MFAssignCHO() function. + +* MSAssign - the mass spectrum of the assigned, unassigned, and isotope peaks shown in different colors (green, red, and blue, respectively). + +* Error - the Exp_mass vs. absolute error for the assigned molecular formulas. Unambiguous MF are blue and ambiguous MF are red. + +* MSgroups - a reconstructed mass spectrum of the assigned peaks colored by their elemental composition (CHO, CHNO, etc.). The CHO, CHNO, CHOS, CHNOS, CH, and CHN elemental groups are highlighted; all other molecular groups are classified as "Other". The plot is faceted to separate the ambiguous and unambiguous MF assignments. + +* VK - the van Krevelen plot of the assigned MF colored by their elemental composition, similar to the MSgroups plot. The plot is faceted to separate the ambiguous and unambiguous MF assignments. + +###Recalibration Functions +The following two functions are used for internal mass calibration: MFRecalList() provides qualitative metrics for the selection of possible recalibrant series and MFRecalCheck() performs a mass recalibration using the approach described in Kozhinov et al. (2013). + +####MFRecalList() + +MFRecalList() is a function that takes the output of MFAssign() or MFAssignCHO() and provides metrics to rank the suitability of each homologous series for use as a recalibrant. The function selects CHO homologous series with at least three members. 
The homologous series are evaluated to determine the number of observations in each series (Number Observed), the mass range of each series (Mass Range), the mass of the tallest peak in each series (Tall Peak), and the "Abundance Score", which shows the percentage difference between the mean abundance of a homologous series and the median abundance within the mass range that the "Tall Peak" of the series falls in (for example m/z 200-300). The output is a data frame with all eligible series present for the user to review. In general, a series with many members and a high abundance score is a good place to start. The goal in this recalibration method is to have recalibrants with high local abundance across the entire range of the spectrum. + +```{r, eval = FALSE} +#The input for this function is the output from any of the MFAssign functions. +RecalList <- MFRecalList(df = Unambig) + +``` + +#####MFRecalList() Output +The output is a data frame with nine columns which the user can use to inform their decision about which recalibrant series to choose. + +* Series - the molecular class (e.g., "O6"), the adduct type ("H" or "Na"), and the DBE (e.g., 3) concatenated into a single term (class_Adduct_DBE). This series information is needed to identify which homologous series is to be used as a recalibrant in MFRecalCheck(). + +* Number Observed - the number of observed masses in each homologous series. + +* Series Index - a number indicating the length of the series relative to the other identified series; the smaller the number, the longer the series. + +* Mass Range - the mass range from the smallest member of the homologous series to the largest. + +* Tall Peak - the mass of the most abundant peak in each series. + +* Abundance Score - the percentage difference between the mean abundance of a homologous series and the median abundance within the mass range the "Tall Peak" falls in. 
+ +* Peak Score - the intensity of the tallest peak in a given series compared to the second tallest peak in the series. This comparison is calculated as log10(Max Peak Intensity / Second Peak Intensity). Values closer to 0 are preferred. + +* Peak Distance - the number of CH2 units between the tallest and second tallest peak in each series. Values closer to 1 are preferred. + +* Series Score - the number of actual observations in each series compared to the theoretical maximum number based on the CH2 homologous series. Values closer to 1 are preferred. + + +####MFRecalCheck() +MFRecalCheck() performs recalibration on the Mono and Iso outputs from the IsoFiltR() function and generates a mass spectrum highlighting the selected recalibrant series. The recalibration is based on the first step of the recalibration method described by Kozhinov et al. (2013), which uses a polynomial central moving average to estimate the weights used to recalibrate the masses. The recalibrated output can then be fed directly into MFAssign() to assign molecular formulas with the recalibrated masses. Additionally, the function outputs a data frame containing the recalibrants with their original mass error and their new, recalibrated mass error. To improve the mass recalibration across the studied mass range, this function finds additional recalibrants related by H2 or O homologous series using Kendrick mass analysis and then selects the tallest peaks within a user-defined mass range. + +```{r, eval = FALSE} + +Recalcheck <- MFRecalCheck(df = Unambig, peaks = Mono, isopeaks = Iso, mode = "neg", SN = 500, series1 = "O4_Na_2", series2 = "O4_H_8", series3 = "O6_Na_8") + +Plot <- Recalcheck[["Plot"]] +Plot +Recal_Mono <- Recalcheck[["Mono"]] +Recal_Iso <- Recalcheck[["Iso"]] +List <- Recalcheck[["RecalList"]] + +``` + +#####MFRecalCheck() Input + +* df - input data frame in the format of the output from MFAssign() or MFAssignCHO().
+ +* peaks - input data frame of two columns with measured ion mass in the first and measured ion abundance in the second, typically the “Mono” output from IsoFiltR(). + +* isopeaks - input data frame of two columns with measured ion mass in the first and measured ion abundance in the second, typically the “Iso” output from IsoFiltR(). + +* mode - a character string denoting whether the data was collected in negative ("neg") or positive ("pos") ion mode. + +* SN - a numeric value that sets the signal-to-noise cut for the purposes of the output plot. Default is 0. + +* series(1-10) - a character string denoting the recalibrant series (e.g., "O6_H_4"). Up to 10 recalibrant series may be entered. + +* min - a numeric value that sets the minimum mass to be considered. Default is 100. + +* max - a numeric value that sets the maximum mass to be considered. Default is 1000. + +* bin - a numeric value that sets the mass window range for recalibrant selection. Default is 20. + +* obs - a numeric value that sets the number of recalibrant peaks within each bin. Default is 2. + + +#####MFRecalCheck() Output + +* Plot - mass spectrum with recalibrant series highlighted in blue with the rest of the mass spectrum in gray. + +* Mono - data frame of the recalibrated ion masses from the peaks input and their abundance, formatted for the peaks argument of MFAssign(). + +* Iso - data frame of the recalibrated ion masses from the isopeaks input and their abundance, formatted for the isopeaks argument of MFAssign(). + +* List - data frame containing the selected recalibrant masses and formulas (accessed as Recalcheck[["RecalList"]] in the example above). + + +###Final Molecular Formula Assignment +####MFAssign() + +MFAssign() is the function used for the final MF assignment with heteroatoms. The general parameters and method of formula assignment are the same as for MFAssignCHO(); the major difference is that multiple heteroatoms and isotopes can be included. However, an increasing number of chemically reasonable MF are possible with an increasing number of possible elements and increasing molecular weight.
For this reason, this function uses a multi-path MF extension approach to reduce the number of ambiguous assignments. Thus, the final MF list contains both ambiguous MF and unambiguous MF, some of which may have been selected based on relationships that are expected in complex environmental mixtures. Some unassigned masses are also expected to remain; these could be passed through MFAssignAll(). + +An example of the usage of MFAssign() is shown below. + +```{r, eval = FALSE} +Assign <- MFAssign(peaks = Mono, isopeaks = Iso, ionMode = "pos", lowMW = 100, highMW = 1000, Nx = 3, Sx = 1, Mx = 1, ppm_err = 3, H_Cmin = 0.3) +#The parameter settings are fairly typical for positive ion data. + +Unambig <- Assign[["Unambig"]] #Unambiguous MF assignments data frame + +Ambig <- Assign[["Ambig"]] #Ambiguous MF assignments data frame + +Unassigned <- Assign[["None"]] #Unassigned masses data frame + +MSAssign <- Assign[["MSAssign"]] #Mass spectrum showing assigned, unassigned, and isotope peaks. Assigned peaks are in green, unassigned peaks are in red, and isotope peaks are in blue. + +Error <- Assign[["Error"]] #m/z vs. Absolute Error (ppm) colored to indicate unambiguous MF (blue) and ambiguous MF (red). + +MSgroups <- Assign[["MSgroups"]] #Reconstructed mass spectrum showing the assigned peaks colored to indicate the elemental group. The plot is faceted by ambiguity of the MF assignments. + +VK <- Assign[["VK"]] #van Krevelen plot colored to indicate the elemental group and faceted by ambiguity of the MF assignments. + +MSAssign #Print MSAssign +Error #Print Error +MSgroups #Print MSgroups +VK #Print VK +``` + +#####MFAssign() Input Parameters + +Many of the input parameters are common between MFAssign() and MFAssignCHO(), so only the new parameters are defined here. + +* Nx - sets the maximum number of 14N atoms. + +* Sx - sets the maximum number of 32S atoms. + +* Px - sets the maximum number of 31P atoms. + +* S34x - sets the maximum number of 34S atoms.
+ +* N15x - sets the maximum number of 15N atoms. + +* Dx - sets the maximum number of 2H atoms. + +* Ex - sets the maximum number of 13C atoms. + +* Clx - sets the maximum number of 35Cl atoms. + +* Cl37x - sets the maximum number of 37Cl atoms. + +>Note that an increased number of heteroatoms or adducts will decrease the speed of the function. This is especially true if more than one type of heteroatom is allowed. + +#####MFAssign() Output + +The output of this function is the same as for MFAssignCHO(); see the description in MFAssignCHO() Output above. + +###Additional Assignments + +####MFAssignAll() + +MFAssignAll() can be more effective at assigning isolated masses or masses left unassigned by the other MFAssign functions. This version does not have a mass cut point and only performs rudimentary formula extension (CH2 and H2O only), which causes it to run somewhat slower than the other versions of the function, especially on large mass lists. Thus, it is recommended to run this version on limited mass lists, especially when trying to assign isolated masses. The lack of advanced formula extension and of a mass cut point allows for many more ambiguous formula assignments, so care should be taken when interpreting the results. + +An example of the usage of MFAssignAll() is shown below. + +```{r, eval = FALSE} +Assign <- MFAssignAll(peaks = Mono, isopeaks = Iso, ionMode = "pos", lowMW = 100, highMW = 1000, Nx = 3, Sx = 1, Mx = 1, ppm_err = 3, H_Cmin = 0.3) +#The parameter settings are fairly typical for positive ion data. + +Unambig <- Assign[["Unambig"]] #Unambiguous MF assignments data frame + +Ambig <- Assign[["Ambig"]] #Ambiguous MF assignments data frame + +Unassigned <- Assign[["None"]] #Unassigned masses data frame + +MSAssign <- Assign[["MSAssign"]] #Mass spectrum showing assigned, unassigned, and isotope peaks. Assigned peaks are in green, unassigned peaks are in red, and isotope peaks are in blue. + +Error <- Assign[["Error"]] #m/z vs.
Absolute Error (ppm) colored to indicate unambiguous MF (blue) and ambiguous MF (red). + +MSgroups <- Assign[["MSgroups"]] #Reconstructed mass spectrum showing the assigned peaks colored to indicate the elemental group. The plot is faceted by ambiguity of the MF assignments. + +VK <- Assign[["VK"]] #van Krevelen plot colored to indicate the elemental group and faceted by ambiguity of the MF assignments. + +MSAssign #Print MSAssign +Error #Print Error +MSgroups #Print MSgroups +VK #Print VK +``` + +#####MFAssignAll() Input Parameters + +Many of the input parameters are common between MFAssignAll() and MFAssign(); the only difference is that MFAssignAll() does not have the mcut and nLoop input parameters. The input parameters are described in the MFAssign() and MFAssignCHO() function sections. + + +#####MFAssignAll() Output + +The output of this function is the same as for MFAssignCHO() and MFAssign(); see the description in MFAssignCHO() Output above. + + +##Best Practices + +Although many well-vetted methods of decreasing the number of MF assigned for each mass have been employed in the MFAssign functions, it is important to take time to assess the correctness of the assigned MF based on realistic expectations for the specific data being analyzed. + +It is important to ensure each input parameter is set properly for the analysis being performed. Warning messages are displayed if the parameters are set outside typical bounds (or include a typo), but these warnings will not stop the function from running. If too many heteroatoms are included, the runtime of the function will be greatly increased, and the results may become unwieldy. On the other hand, if too few or incorrect heteroatoms are included, the results may be unreasonable. + +If at any time MFAssign(), MFAssignCHO(), or MFAssignAll() needs to be stopped before it has completed, it will be necessary to run the following line of code ".rs.restartR()".
This is necessary to free up the memory so that the functions can be run at a reasonable speed in future attempts. This line is included within the MFAssign functions, so R will automatically restart as necessary to ensure continued good performance. Unfortunately, this prohibits running the function across multiple datasets in a loop, or in an R Markdown document. Work is ongoing to address this issue in a future release. + +Whenever possible, limit the number of masses in the input data frame to as small a number as possible, since the speed of the function decreases significantly with an increasing number of masses. The performance is reasonably good (about 60 seconds or less) up to 10,000 masses with a moderate number of elements (3-5). However, both a higher number of masses being evaluated and higher maximum numbers of heteroatoms decrease the function speed, and the runtime can become unreasonable depending on the computer (e.g., >30 minutes). + +Setting a reasonable signal-to-noise cut point (SN) can be very important with regard to function speed. Prior to running a raw mass list, you should attempt to estimate the SN or set it to a reasonable value. This can be done with SNcutCheck() and/or SNplot(). Otherwise, the function will try to assign MF to every mass (including the noise), which slows down the function and increases the likelihood of incorrect assignments. + +The two input columns can have any name, but it is very important that the measured ion mass is in the first column and the measured abundance (or ion intensity) is in the second column. The function only attempts to assign MF to the first column. It is also important to only put ion masses into the function; using neutral masses will not work. + +###Additional Notes and Test Data + +In their current form, the package functions rely significantly on dplyr and tidyr functions for some aspects of the data manipulation. + +There are multiple warning messages that are reported when the functions are being run.
Unless the function stops working, these warning messages do not otherwise impact the functioning of the code. + +There are a variety of additional functions included in this package, but many of them are sub-functions that are necessary for the MFAssign functions and are not independently operational. Only the independent functions have been described in this document. + +There is a large mass list built into the package called CHNOS_ML_Ex which can be used to test whether the function is working correctly. It is a negative mode mass list with even electron ions generated by electrospray ionization. When a maximum of 3 nitrogen and 1 sulfur are allowed, MFAssign() should assign 2116 of 2121 total masses when DeNovo is set to 500. Additionally, there is a smaller data frame (Short_CHO_neg) of 13 observations randomly sampled from CHNOS_ML_Ex, which is more effective for checking whether or not formulas are correct. There is also an 8 observation data frame (Short_CHO_pos) of positive even electron ions generated by electrospray ionization, which can be used for checking the MFAssign parameters for positive ions and ensuring they are assigned correctly. All functions should be able to assign all the masses in these short example data sets if the DeNovo cut point is set to 1000. Additionally, a raw mass list (Raw_Neg_ML) containing negative mode even ion data is included and can be used to test the SNcut, SNplot, IsoFiltR, MFRecal and MFAssign functions. + + +####Short_CHNOS_neg +This is the short data frame of negative ion data for testing purposes. The masses and correctly assigned unambiguous MF are shown below.
+ +| Exp_mass | formula | +|:--------:|:-------:| +|531.2092 | C24H36O13 | +|235.0251 | C11H8O6 | +|563.1992 | C24H36O15 | +|331.1767 | C16H28O7 | +|391.0676 | C18H16O10 | +|403.0524 | C15H16O13 | +|321.0620 | C15H14O8 | +|363.1091 | C18H20O8 | +|683.2931 | C33H48O15 | +|207.0301 | C10H8O5 | +|523.1102 | C23H24O14 | +|437.1460 | C21H26O10 | +|487.1465 | C21H28O13 | + +####Short_CHNO_pos +This is the short dataframe of positive ion data for testing purposes. The masses and correctly assigned unambiguous MF are shown below. + +| Exp_mass | formula | +|:--------:|:-------:| +|415.1235 | C18H22O11 | +|325.2162 | C22H28O2 | +|271.0812 | C12H14O7 | +|265.0859 | C17H12O3 | +|195.0652 | C10H10O4 | +|303.0863 | C16H14O6 | +|271.1176 | C13H18O6 | +|267.1591 | C15H22O4 | + + +--------------------------
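
The expected formulas in the two tables above can be sanity-checked independently of the package by recomputing their theoretical m/z. The sketch below is a minimal base R illustration and is not part of MFAssignR: the helper names (parse_formula, theo_mz, ppm_error) are hypothetical, it handles only C, H, and O, it assumes singly charged [M-H]- ions for the negative table and [M+H]+ ions for the positive table, and the sign convention of the error (measured minus theoretical) is an assumption.

```{r, eval = FALSE}
#Monoisotopic masses (Da) of the most abundant isotopes of C, H, and O.
iso_mass <- c(C = 12.000000, H = 1.00782503, O = 15.99491462)
proton_mass <- 1.00727646 #Mass of a proton (H atom minus one electron)

#Hypothetical helper: split a formula such as "C18H22O11" into element counts.
parse_formula <- function(formula) {
  tokens <- regmatches(formula, gregexpr("[A-Z][a-z]?[0-9]*", formula))[[1]]
  element <- sub("[0-9]*$", "", tokens)
  count <- sub("^[A-Za-z]+", "", tokens)
  count[count == ""] <- "1" #An element without a number occurs once
  stats::setNames(as.numeric(count), element)
}

#Hypothetical helper: theoretical m/z, assuming [M-H]- ("neg") or [M+H]+ ("pos").
theo_mz <- function(formula, mode = c("neg", "pos")) {
  mode <- match.arg(mode)
  counts <- parse_formula(formula)
  neutral <- sum(iso_mass[names(counts)] * counts)
  if (mode == "neg") neutral - proton_mass else neutral + proton_mass
}

#Hypothetical helper: mass error in ppm (measured minus theoretical).
ppm_error <- function(exp_mass, formula, mode) {
  theo <- theo_mz(formula, mode)
  (exp_mass - theo) / theo * 1e6
}

ppm_error(531.2092, "C24H36O13", "neg") #First row of the negative ion table
ppm_error(415.1235, "C18H22O11", "pos") #First row of the positive ion table
```

An absolute error within the ppm_err tolerance used in the examples above (3 ppm) suggests that the tabulated formula is consistent with the measured mass under these assumptions.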