
How to use peak picking in processed spectra #6

Open
ranabanik opened this issue Jul 26, 2022 · 13 comments

Comments

@ranabanik

I have a different number of m/z values per pixel/spectrum (processed). How should I use this tool to homogenize all the spectra?
Currently, I am binning and upsampling (summing all intensities between bins) all the spectra with a step size of 0.01/0.02, which ends up at roughly 130,000 m/z values per spectrum. Then I take the average of the spectra and perform peak picking on that to find peak indices...

peak_list = pick_peaks(refmz,
                       meanSpec,
                       fit_type=picking_method,
                       signal_to_noise_threshold=snr,
                       intensity_threshold=intensity_threshold,
                       integrate=False)
peak_mzs = [np.round(p.mz, 5) for p in peak_list]   # only for visualisation
peak_ranges = [
    (
        p.mz - (p.full_width_at_half_max * fwhm_expansion),
        p.mz + (p.full_width_at_half_max * fwhm_expansion),
    )
    for p in peak_list]
# print("peak ranges", peak_ranges)
peak_indices = [
    (self._find_nearest(refmz, p[0]), self._find_nearest(refmz, p[1])) for p in peak_ranges
]
spectra_ = []
for spectrum in spectra:
    peaks = []
    for p in peak_indices:
        peaks.append(np.sum(spectrum[p[0]: p[1]]))
    spectra_.append(peaks)
spectra = np.array(spectra_, dtype=np.float32)

refmz = common m/z axis with step 0.01.
meanSpec = mean of the spectra after upsampling.

I am using the following parameters:
picking_method="quadratic", snr=10, intensity_threshold=5, fwhm_expansion=2
But the resultant spectrum doesn't look anywhere near the raw spectrum when plotted.

Moreover, is any additional alignment required?

@mobiusklein
Owner

This type of binning might produce tall but jagged peaks which would violate the signal to noise ratio threshold. Without seeing the actual values in the upsampled arrays, I couldn't say what's actually going on though. Alignment could be the issue.

If you want to merge profile spectra, you can use ms_peak_picker.average_signal. This function will re-bin your data onto a uniform m/z grid with a delta m/z of dx. For high resolution data, you can pick a value between 0.01 and 0.001 depending upon how much time and memory you have. Then you might smooth your spectra with a Savitzky-Golay filter, though I don't know if that will be necessary.

import ms_peak_picker

spectra = [(mz_array, intensity_array) , (mz_array2, intensity_array2), ...]
mz_grid, intensity_average = ms_peak_picker.average_signal(spectra, dx=0.001)
intensity_average *= len(spectra) # scale the average by the number of spectra to convert to the sum of spectra
mz_grid, intensity_smoothed = ms_peak_picker.scan_filter.SavitskyGolayFilter()(mz_grid, intensity_average)

peak_list = ms_peak_picker.pick_peaks(mz_grid, intensity_smoothed)

If you want average peak lists, you can use ms_peak_picker.reprofile. The same algorithm that converts one list of peaks back into continuous data can also add them together. If you want your peaks to be extra-wide to mimic the method you're using here, you can pass override_fwhm to force the peak shape to be extra broad.

import ms_peak_picker

spectra = [peaks1, peaks2, ...]
mz_grid, intensity_average = ms_peak_picker.reprofile(spectra, dx=0.001)
intensity_average *= len(spectra) # scale the average by the number of spectra to convert to the sum of spectra
mz_grid, intensity_smoothed = ms_peak_picker.scan_filter.SavitskyGolayFilter()(mz_grid, intensity_average)

peak_list = ms_peak_picker.pick_peaks(mz_grid, intensity_smoothed)

@ranabanik
Author

import ms_peak_picker

spectra = [(mz_array, intensity_array) , (mz_array2, intensity_array2), ...]
mz_grid, intensity_average = ms_peak_picker.average_signal(spectra, dx=0.001)
intensity_average *= len(spectra) # scale the average by the number of spectra to convert to the sum of spectra
mz_grid, intensity_smoothed = ms_peak_picker.scan_filter.SavitskyGolayFilter()(mz_grid, intensity_average)

peak_list = ms_peak_picker.pick_peaks(mz_grid, intensity_smoothed)

After peak_list, what is the best way to get a homogenized 2D matrix of spectra (#spectra x #peaks)?
For example, I have uploaded demo data in Google Drive with a .bin extension, which contains 10 processed spectra (m/z and intensity arrays).
How would you make it 10 x nPeak in dimension?

With the peak_list from the code you shared, I ran the same code I mentioned (fwhm_expansion = 1.0), which gives the following peak spectra:

My goal is to compare the spectra, so I actually may not need average spectra.

@mobiusklein
Owner

To be precise, you do not wish to have centroids, but a 2-D matrix where each row is a spectrum and each column is the intensity at a specific m/z shared across all spectra?

import pickle

import numpy as np
from scipy.spatial import distance

import ms_peak_picker

%matplotlib inline
from matplotlib import pyplot as plt
from ms_peak_picker.plot import draw_raw, draw_peaklist

plt.rcParams['figure.figsize'] = 10, 6

# Load spectra
spectra = pickle.load(open("./rawmzspectra1.bin", 'rb'))

# Rebin with linear interpolation without merging or any other transformation.
binner = ms_peak_picker.scan_filter.LinearResampling(0.001)
rebinned = [binner(*s) for s in spectra]

# Prove that all m/z axes are identical
for i, ri in enumerate(rebinned):
    if i < len(rebinned) - 1:
        rj = rebinned[i + 1]
    else:
        rj = rebinned[0]
        
    assert np.allclose(ri[0], rj[0])

# Construct the intensity matrix and store a copy of the shared m/z axis
mz_grid = rebinned[0][0]
intensity_grid = np.vstack([ri[1] for ri in rebinned])

# Example pairwise cosine similarity 
print(distance.cosine(intensity_grid[0, :], intensity_grid[1, :]))

# Confirm that peak signal is broadly the same location but different magnitudes
ax = draw_raw(mz_grid, intensity_grid[0, :], alpha=0.5)
draw_raw(mz_grid, intensity_grid[1, :], alpha=0.5, ax=ax)
plt.savefig("spectrum_comparison.pdf")

# Compute pairwise distance matrix over all spectra
distmat = distance.squareform(
    distance.pdist(intensity_grid, metric='cosine', )
)

# Plot pairwise distance matrix as a heat map
q = plt.pcolormesh(distmat)
cbar = plt.colorbar(q)
cbar.ax.set_ylabel("Cosine Similarity", rotation=270, size=12, labelpad=15)
_ = plt.title("Cosine Similarity Pairwise Matrix", size=16)
_ = plt.xticks(np.arange(0.5, 10.5), np.arange(0, 10))
_ = plt.yticks(np.arange(0.5, 10.5), np.arange(0, 10))


@ranabanik
Author

Yes, I want the data structure to be a 2D matrix as you've mentioned with a uniform m/z grid.
ms_peak_picker.scan_filter.LinearResampling(0.01)
This way of homogenizing the spectra depends on the input spacing (=0.01, which generates around 149k points) and also on linear 1D interpolation, which doesn't work well on noisy data with a high median value (see the plot).

Moreover, equal-step binning does not preserve the distribution of m/z values.
Please see the kernel density estimation plot of the m/z distribution.

The blue vertical lines are centroids.

Is there a way to homogenize the spectra with ms_peak_picker while also preserving the m/z distribution?
Will the centroids you mentioned help?

@mobiusklein
Owner

mobiusklein commented Jul 28, 2022

The spacing argument of the LinearResampling object controls the resolution of the resampler: setting it to a smaller value increases the resolution and better recapitulates the shape of the signal, but makes the rebinned signal more expensive to work with in time and space. Either way, the spacing of the m/z axis is identical across all spectra. This is why I used a value of 0.001 instead of 0.01, to better preserve the actual content of the signal.
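The resolution tradeoff can be illustrated with a plain-NumPy sketch, using np.interp as a stand-in for the linear interpolation that LinearResampling performs (this is an illustration of the idea, not the library's actual implementation):

```python
import numpy as np

# An irregularly sampled profile peak
mz = np.array([100.00, 100.04, 100.05, 100.06, 100.10])
intensity = np.array([0.0, 5.0, 10.0, 5.0, 0.0])

def resample(mz, intensity, spacing):
    """Linearly interpolate a signal onto a uniform m/z grid."""
    grid = np.arange(mz[0], mz[-1] + spacing, spacing)
    return grid, np.interp(grid, mz, intensity)

coarse_grid, coarse = resample(mz, intensity, 0.01)
fine_grid, fine = resample(mz, intensity, 0.001)
# The finer grid has ~10x the points and tracks the peak apex more
# closely, at ~10x the memory cost.
```

Because both grids start from the same origin and spacing, every spectrum resampled with the same `spacing` shares an identical m/z axis, which is what makes stacking them into one matrix valid.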

You could then use pick_peaks on each rebinned spectrum to get a list of centroid peaks if you wanted to, but then you need to define your similarity metric in terms of discrete values or bin them again, albeit more coarsely. You can compute the cosine similarity of two lists of discrete peaks using the following code:

import math
from collections import defaultdict

def sparse_peak_set_similarity(peak_set_a, peak_set_b, precision=0):
    """Computes the normalized dot product, also called cosine similarity, between
    two peak sets, a similarity metric ranging from 0 (dissimilar) to 1.0 (similar).
    Parameters
    ----------
    peak_set_a : Iterable of Peak-like
    peak_set_b : Iterable of Peak-like
        The two peak collections to compare. It is usually only useful to
        compare the similarity of peaks of the same class, so the types
        of the elements of `peak_set_a` and `peak_set_b` should match.
    precision : int, optional
        The precision of rounding to use when binning spectra. Defaults to 0
    Returns
    -------
    float
        The similarity between peak_set_a and peak_set_b. Between 0.0 and 1.0
    """
    bin_a = defaultdict(float)
    bin_b = defaultdict(float)

    positions = set()

    for peak in peak_set_a:
        mz = round(peak.mz, precision)
        bin_a[mz] += peak.intensity
        positions.add(mz)

    for peak in peak_set_b:
        mz = round(peak.mz, precision)
        bin_b[mz] += peak.intensity
        positions.add(mz)

    return bin_dot_product(positions, bin_a, bin_b)


def bin_dot_product(positions, bin_a, bin_b, normalize=True):
    '''Compute a normalized dot product between two aligned intensity maps
    '''
    z = 0
    n_a = 0
    n_b = 0

    for mz in positions:
        a = bin_a[mz]
        b = bin_b[mz]
        z += a * b
        n_a += a ** 2
        n_b += b ** 2

    if not normalize:
        return z

    n_ab = math.sqrt(n_a) * math.sqrt(n_b)
    if n_ab == 0.0:
        return 0.0
    else:
        return z / n_ab

This method doesn't even require you to homogenize the m/z axis. You select the level of granularity by raising or lowering the precision argument, which you might rewrite as a scale-and-floor instead of a flat round operation, since that gives you finer-grained control.
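A scale-and-floor variant of that binning step might look like the sketch below. Here `(mz, intensity)` tuples stand in for the Peak-like objects used above, and `scale=10` (0.1 m/z wide bins) is an arbitrary choice for illustration:

```python
import math
from collections import defaultdict

def bin_peaks_scaled(peaks, scale=10):
    """Bin (mz, intensity) pairs by flooring mz * scale, giving
    bins of width 1/scale without round-half-to-even surprises."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[math.floor(mz * scale)] += intensity
    return bins

bins = bin_peaks_scaled([(100.04, 1.0), (100.06, 2.0), (100.14, 4.0)], scale=10)
# 100.04 and 100.06 fall in bin 1000; 100.14 falls in bin 1001
```

Unlike `round(mz, precision)`, which only allows power-of-ten bin widths, any positive `scale` works here, so you can tune the bin width continuously to the resolution of your data.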

The KDE plot you showed is either using a bandwidth that is way too broad or you're trying to summarize big chunks of the m/z dimension together intentionally. I can't think of a reason you'd want to do that with high resolution spectra.

I don't understand why the scipy.spatial.distance matrix isn't what you're looking for either though.

@ranabanik
Author

ranabanik commented Jul 28, 2022

Hello Joshua,
Thanks for the explanation.

I don't understand why the scipy.spatial.distance matrix isn't what you're looking for either though.

Sorry for the confusion. I was actually not looking to compare separate spectra. I could do that with cosine similarity, or even pearsonr, or the scipy functions you already mentioned.

The reasons to homogenize the spectra into a 2D array with the same m/z grid are to:

  1. generate spatial ion images, with each m/z as a channel and each spectrum's intensity/peak as a pixel;
  2. run downstream image processing and spatial analysis.
peak_list_a = ms_peak_picker.pick_peaks(mz_grid, intensity_grid[8], fit_type='quadratic')
peak_list_b = ms_peak_picker.pick_peaks(mz_grid, intensity_grid[9], fit_type='quadratic')

In general the above generates 1603 and 1749 peaks respectively. That looks good, as I can fetch individual fitted peaks and their properties (SNR, intensity, FWHM, etc.).

Is there a way to align the peaks or m/z values using the positions and precision matching? Something like converting the current 10 x 1493146 upsampled binned array into an aligned, peak-positioned 10 x ~2500 array, simply reducing the dimensionality of the matrix while still maintaining the spectrum.

If you want your peaks to be extra-wide to mimic the method you're using here, you can pass override_fwhm to force the peak shape to be extra broad.

One other question: in this context, since we are taking one intensity value per m/z into consideration, how significant is the width of the peaks?

The KDE plot you showed is either using a bandwidth that is way too broad or you're trying to summarize big chunks of the m/z dimension together intentionally. I can't think of a reason you'd want to do that with high resolution spectra.

Yes, you are right; the KDE plot comes from an optimized bandwidth. I got the concept from this paper, which uses the KDE-generated clusters for Peak Detection and Filtering Based on Spatial Distribution.

I am relatively new to MSI data, so I have a few questions; sorry for that.

@ranabanik
Author

ranabanik commented Jul 28, 2022

Hello Joshua,
I have the following wrapper function to generate the peak spectra matrix:

import math
import numpy as np
import ms_peak_picker

def ms_peak_picker_wrapper(rawspectra, step=0.01, snr=7, mode='centroid', int_thr=1000, fit='quadratic',
                           precision=5, fwhm_expansion=1.2):
    """
    :param rawspectra: list/ndarray object with #spectrum x Nmz x Nint (N varies for processed spectra)
    :param step: upsampling step
    :param snr: SNR threshold
    :param mode: 'profile' or 'centroid'
    :param int_thr:
    :param fit:
    :param precision:
    :param fwhm_expansion: higher value increases peak width + height as intervals are summed

    :return: peakspectra -> 2d array #spectrum x nPeaks
             peakmzs -> list m/zs where peaks were found
    """
    def _find_nearest(array, value):
        idx = np.searchsorted(array, value, side="left")  # Find indices
        if idx > 0 and (
                idx == len(array)
                or math.fabs(value - array[idx - 1]) < math.fabs(value - array[idx])
        ):
            return idx - 1
        else:
            return idx
    binner = ms_peak_picker.scan_filter.LinearResampling(step)
    rebinned = [binner(*s) for s in rawspectra]
    intensity_grid = np.vstack([ri[1] for ri in rebinned])
    mz_grid, intensity_average = ms_peak_picker.average_signal(rawspectra, dx=step)
    intensity_average *= len(rawspectra)

    peak_list = ms_peak_picker.pick_peaks(mz_grid, intensity_average, signal_to_noise_threshold=snr,
                                          peak_mode=mode, intensity_threshold=int_thr*len(rawspectra),
                                          fit_type=fit)
    peakmzs = [np.round(p.mz, precision) for p in peak_list]
    print("There will be #{} peaks".format(len(peakmzs)))
    peak_ranges = [
        (
            p.mz - (p.full_width_at_half_max * fwhm_expansion),
            p.mz + (p.full_width_at_half_max * fwhm_expansion),
        )
        for p in peak_list]
    peak_indices = [(_find_nearest(mz_grid, p[0]), _find_nearest(mz_grid, p[1]))
                    for p in peak_ranges]
    peakspectra = []
    for spectrum in intensity_grid:
        peaks = []
        for p in peak_indices:
            peaks.append(np.sum(spectrum[p[0]: p[1]]))
        peakspectra.append(peaks)
    peakspectra = np.array(peakspectra, dtype=np.float32)
    return peakspectra, peakmzs

The final matrix size is 10x4251 which I am pretty happy with.
Just have a little confusion about:

intensity_threshold=int_thr*len(rawspectra)

this was done to avoid noisy peaks, since the average intensity of the spectra was multiplied by len(rawspectra). Is it necessary?
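The scaling logic can be checked directly with NumPy: multiplying an averaged signal by the number of spectra turns it into a sum, so a per-spectrum threshold of `int_thr` corresponds to `int_thr * len(rawspectra)` on the rescaled signal. A minimal check (assuming, as a simplification, that average_signal approximates a per-grid-point mean):

```python
import numpy as np

rng = np.random.default_rng(0)
n_spectra = 10
spectra = rng.random((n_spectra, 100))  # 10 toy spectra on a shared grid

summed = spectra.sum(axis=0)
rescaled_mean = spectra.mean(axis=0) * n_spectra

# The rescaled mean equals the sum, so any intensity threshold meant
# per-spectrum must be scaled by n_spectra to stay comparable.
assert np.allclose(summed, rescaled_mean)
```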

  1. It looks like some of the spectra miss major peaks (oval marked). See the plot:

Is there a way to avoid this? One thing I noticed: `rebinned` has one more sample than `intensity_average`. I suppose this shouldn't be an issue here.
  2. Is any sort of alignment necessary after this?

@mobiusklein
Owner

I may have missed something early on, since the spectra you shared were in profile mode. Are you trying to do this with pre-centroided spectra?

@ranabanik
Author

To confirm: is centroid mode where the spectra are like impulses at the centers of the profile peaks? When generating ion images or doing downstream analysis, shouldn't it be a single centroided value per m/z?
One thing I noticed: with the same parameters in the wrapper function above, centroid mode generates 4000+ peaks while profile mode generates 2700.

Could you explain what mode should be chosen in this case and are there specific applications for that?

@mobiusklein
Owner

Yes, a centroid spectrum has already been reduced to the discrete centers of profile peaks, and ideally has no zero-intensity points. Downstream analyses should use centroided spectra, but attempting to put two centroid spectra on the same axis to generate a shared intensity map involves binning those discrete centroids or converting them back into profiles on a shared m/z axis.

When you give ms_peak_picker.pick_peaks's peak_mode argument the value "profile", you tell it that the peaks in this pair of arrays are expressed as continuous profiles that need to be fitted and, based upon the properties of those fits, filtered by noise level and minimum intensity.

When the peak_mode argument's value is "centroid", you tell it that the peaks in this pair of arrays are already centroids; the algorithm will skip fitting, assume each point is a peak, and, because centroids contain little to no information about local noise, apply only simple intensity thresholding.

If you pass pick_peaks a profile spectrum but say peak_mode="centroid", then pick_peaks will blindly assume that every point is a peak, which is usually not what you want.
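The distinction can be sketched in plain NumPy. This is only a conceptual illustration, not the library's fitting code: "profile" mode is represented here by a crude local-maximum search, while "centroid" mode treats every nonzero point as a peak:

```python
import numpy as np

# A profile spectrum: one broad peak plus one narrow one
intensity = np.array([0.0, 1.0, 5.0, 9.0, 5.0, 1.0, 0.0, 3.0, 0.0])
mz = 100.0 + np.arange(intensity.size) * 0.01

# "Profile"-style: only local maxima count as peaks
# (a stand-in for the real fit + SNR filtering)
is_max = (intensity[1:-1] > intensity[:-2]) & (intensity[1:-1] >= intensity[2:])
profile_peaks = mz[1:-1][is_max]

# "Centroid"-style: every nonzero point is assumed to already be a peak
centroid_peaks = mz[intensity > 0]

# profile-style finds 2 apexes; centroid-style reports 6 "peaks"
```

This is why passing a profile spectrum with peak_mode="centroid" inflates the peak count: every sample along each peak's flanks gets reported as its own peak.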

@ranabanik
Author

Thanks for the detailed explanation.
So in my case, the spectra are already in profile mode and I leave the peak_mode param at its default, profile, by which ms_peak_picker will generate centroided peak spectra. Clear.

I want to show you 3 methods:

  1. peak_mode: profile, number of peaks: 2760+

  2. peak_mode: centroid, number of peaks: 4000+

  3. peak_mode: prominence-based (scipy.signal), number of peaks: 760+

Visually, the prominence-based one seems to follow the spectrum (#8) best, with a computationally lower number of peaks.

  1. Are there ways/metrics to compare the peak picking processes?
  2. ms_peak_picker seems to miss the peak that comes around 600 (green marked); any idea why?

@mobiusklein
Owner

What are you doing to produce the data array you're plotting here? Using rawmzspectra1.bin and the SNR threshold of 7 you're using above, I don't see this issue with the following code:

import numpy as np
import ms_peak_picker
import pickle

from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = 12, 6

data = pickle.load(open("./rawmzspectra1.bin", 'rb'))
for i, (mz, intensity) in enumerate(data):
    peaks = ms_peak_picker.pick_peaks(mz, intensity, signal_to_noise_threshold=7, peak_mode='profile')
    ax = ms_peak_picker.plot.draw_raw(mz, intensity, color='red', label='Raw')
    ms_peak_picker.plot.draw_peaklist(peaks, ax=ax, color='blue', label='Picked Peaks', lw=0.5)
    ax.legend()
    ax.set_title(f"Index {i}")
    plt.savefig(f"{i}.png", bbox_inches='tight')
    plt.close(ax.figure)


@ranabanik
Author

I peak picked with the ms_peak_picker_wrapper function I shared. All 10 spectra have the same length for some ML analysis, where all the feature vectors need to be of uniform size.
