Feature: Add PepXMLReader #54

Merged
merged 4 commits into from
Oct 20, 2023
18 changes: 17 additions & 1 deletion CHANGELOG.md
Expand Up @@ -5,6 +5,16 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.0] - 2023-10-19

### Added

- `io`: Added new `io.pepxml` reader

### Fixed

- Docs: Add ionbot to README.rst, fix order in API docs

## [0.5.0] - 2023-09-20

### Added
Expand All @@ -21,6 +31,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `io.mzid`: Allow inconsistent presence of score in PSMs in a single mzid file

### Changed

- `PSM`: Values of the `rescoring_features` dictionary are now coerced to floats
- io: Raise `PSMUtilsIOException` when passed filetype is not known
- `io`: Make io reader `read_file` method inheritable (code cleanup)
Expand All @@ -31,6 +42,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Formatting: Increase max line length to 99 (code formatting)

### Fixed

- `PSMList`: Fix issue where `psm_list["protein_list"]` resulted in a Numpy error due to the inconsistent shape of the lists.
- `io.tsv`: Throw more descriptive `PSMUtilsIOException` when handling tsv errors
- `io.msamanda`: Fix support for N/C-terminal modifications
Expand Down Expand Up @@ -69,15 +81,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [0.3.1] - 2023-06-19

### Changed

- `io.sage`: Change `spectrum_fdr` to `spectrum_q` (cf. lazear/sage#64).

## [0.3.0] - 2023-06-08

### Added

- Add reader for [Sage](https://github.com/lazear/sage) PSM files.
- `io.mzid`: Add reading/writing of PEP and q-values

### Changed

- `psm`: The default values of `PSM.provenance_data`, `PSM.metadata` and `PSM.rescoring_features` are now `dict()` instead of `None`.
- `PSMList`: Also allow Numpy integers for indexing a single PSM
- `io.mzid.MzidReader`: Attempt to parse `retention time` or `scan start time` cvParams from both the SpectrumIdentificationResult and SpectrumIdentificationItem levels. Note that according to the mzIdentML specification document (v1.1.1), neither cvParam is expected to be present at either level.
Expand All @@ -88,6 +103,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Filter warnings from `psims.mzmlb` on import, as `mzmlb` is not used

### Fixed

- `psm`: Fix missing qvalue and pep in docstring
- `peptidoform`: ProForma mass modifications are now correctly parsed within the `rename_modifications` function.
- `io.maxquant.MSMSReader`: Correctly parse empty `Proteins` column to `None`
Expand Down Expand Up @@ -157,7 +173,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- `PSMList`: Truncate __repr__ to first five entries only, avoiding crashing notebook output
- `PSMList`: Truncate `__repr__` to first five entries only, avoiding crashing notebook output
- `Peptidoform`: Minor typing fix
- `add_fixed_modifications`: Allow input as dict as well as list of tuples
- `io`: Fix issue where the `NamedTemporaryFile` for `_supports_write_psm` was seen as invalid Percolator file
Expand Down
2 changes: 2 additions & 0 deletions README.rst
Expand Up @@ -89,11 +89,13 @@ Supported file formats
===================================================================================================================== ======================== =============== ===============
File format psm_utils tag Read support Write support
===================================================================================================================== ======================== =============== ===============
`ionbot CSV <https://ionbot.cloud/>`_ ``ionbot`` ✅ ❌
`OpenMS idXML <https://www.openms.de/>`_ ``idxml`` ✅ ❌
`MaxQuant msms.txt <https://www.maxquant.org/>`_ ``msms`` ✅ ❌
`MS Amanda CSV <https://ms.imp.ac.at/?goto=msamanda>`_ ``msamanda`` ✅ ❌
`mzIdentML <https://psidev.info/mzidentml>`_ ``mzid`` ✅ ✅
`Peptide Record <https://psm-utils.readthedocs.io/en/stable/api/psm_utils.io/#module-psm_utils.io.peptide_record>`_ ``peprec`` ✅ ✅
`pepXML <http://tools.proteomecenter.org/wiki/index.php?title=Formats:pepXML>`_ ``pepxml`` ✅ ❌
`Percolator tab <https://github.com/percolator/percolator/wiki/Interface>`_ ``percolator`` ✅ ✅
Proteome Discoverer MSF ``proteome_discoverer`` ✅ ❌
`Sage <https://github.com/lazear/sage/blob/v0.12.0/DOCS.md#interpreting-sage-output>`_ ``sage`` ✅ ❌
Expand Down
25 changes: 18 additions & 7 deletions docs/source/api/psm_utils.io.rst
Expand Up @@ -6,6 +6,14 @@ psm_utils.io
   :members:


psm_utils.io.ionbot
##########################

.. automodule:: psm_utils.io.ionbot
   :members:
   :inherited-members:



psm_utils.io.idxml
##################
Expand All @@ -24,6 +32,7 @@ psm_utils.io.maxquant
   :inherited-members:



psm_utils.io.msamanda
#####################

Expand Down Expand Up @@ -51,6 +60,15 @@ psm_utils.io.peptide_record



psm_utils.io.pepxml
###########################

.. automodule:: psm_utils.io.pepxml
   :members:
   :inherited-members:



psm_utils.io.percolator
#######################

Expand Down Expand Up @@ -92,10 +110,3 @@ psm_utils.io.xtandem
.. automodule:: psm_utils.io.xtandem
   :members:
   :inherited-members:

psm_utils.io.ionbot
##########################

.. automodule:: psm_utils.io.ionbot
   :members:
   :inherited-members:
Binary file not shown.
2 changes: 1 addition & 1 deletion psm_utils/__init__.py
@@ -1,6 +1,6 @@
"""Common utilities for parsing and handling PSMs, and search engine results."""

__version__ = "0.5.0"
__version__ = "0.6.0"

from warnings import filterwarnings

Expand Down
9 changes: 8 additions & 1 deletion psm_utils/io/__init__.py
Expand Up @@ -9,16 +9,17 @@
from rich.progress import track

import psm_utils.io.idxml as idxml
import psm_utils.io.ionbot as ionbot
import psm_utils.io.maxquant as maxquant
import psm_utils.io.msamanda as msamanda
import psm_utils.io.mzid as mzid
import psm_utils.io.peptide_record as peptide_record
import psm_utils.io.pepxml as pepxml
import psm_utils.io.percolator as percolator
import psm_utils.io.proteome_discoverer as proteome_discoverer
import psm_utils.io.sage as sage
import psm_utils.io.tsv as tsv
import psm_utils.io.xtandem as xtandem
import psm_utils.io.ionbot as ionbot
from psm_utils.io._base_classes import WriterBase
from psm_utils.io.exceptions import PSMUtilsIOException
from psm_utils.psm import PSM
Expand Down Expand Up @@ -49,6 +50,12 @@
        "extension": ".peprec.txt",
        "filename_pattern": r"(^.*\.peprec(?:\.txt)?$)|(?:^peprec\.txt$)",
    },
    "pepxml": {
        "reader": pepxml.PepXMLReader,
        "writer": None,
        "extension": ".pepxml",
        "filename_pattern": r"^.*\.pepxml$",
    },
    "percolator": {
        "reader": percolator.PercolatorTabReader,
        "writer": percolator.PercolatorTabWriter,
Expand Down
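As a rough illustration of what the new `filename_pattern` accepts, the sketch below applies the regex from the `pepxml` entry with Python's `re` module. The helper name `matches_pepxml` is hypothetical, and how `psm_utils` itself applies the pattern (for instance, which regex flags it passes) is not shown in this diff, so the `ignore_case` switch is an assumption made for illustration only.

```python
import re

# Pattern copied from the "pepxml" entry in the diff above
PEPXML_PATTERN = r"^.*\.pepxml$"


def matches_pepxml(filename: str, ignore_case: bool = False) -> bool:
    """Hypothetical helper: test a filename against the pepxml pattern."""
    # ignore_case is an assumption; the diff does not show which flags are used
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(PEPXML_PATTERN, filename, flags=flags) is not None


print(matches_pepxml("results.pepxml"))
print(matches_pepxml("results.pep.xml"))
print(matches_pepxml("results.pepXML", ignore_case=True))
```

With the pattern as written, a name such as `results.pep.xml` is not matched, so such files would presumably need their file type stated explicitly rather than inferred from the filename.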
146 changes: 146 additions & 0 deletions psm_utils/io/pepxml.py
@@ -0,0 +1,146 @@
"""Interface with TPP pepXML PSM files."""

from __future__ import annotations

import logging
from collections import defaultdict
from pathlib import Path
from typing import List, Optional, Union

from pyteomics import pepxml, proforma

from psm_utils.io._base_classes import ReaderBase
from psm_utils.peptidoform import Peptidoform
from psm_utils.psm import PSM
from psm_utils.utils import mass_to_mz

logger = logging.getLogger(__name__)

STANDARD_SEARCHENGINE_SCORES = [
    "expect",
    "EValue",
    "Evalue",
    "SpecEValue",
    "xcorr",  # Fallback if no e-value is present
    "delta_dot",  # SpectraST
    "mzFidelity",
]


class PepXMLReader(ReaderBase):
    def __init__(
        self, filename: Union[str, Path], *args, score_key: Optional[str] = None, **kwargs
    ) -> None:
        """
        Reader for pepXML PSM files.

        Parameters
        ----------
        filename: str, pathlib.Path
            Path to PSM file.
        score_key: str, optional
            Name of the score metric to use as PSM score. If not provided, the score metric is
            inferred from a list of known search engine scores.

        """
        super().__init__(filename, *args, **kwargs)
        self.score_key = score_key or self._infer_score_name()

    def __iter__(self):
        """Iterate over file and return PSMs one-by-one."""
        with pepxml.read(str(self.filename)) as reader:
            for spectrum_query in reader:
                for search_hit in spectrum_query["search_hit"]:
                    yield self._parse_psm(spectrum_query, search_hit)

    def _infer_score_name(self) -> Optional[str]:
        """Infer the score from the list of known PSM scores."""
        # Get scores from first PSM; stays empty if the file holds no spectrum queries
        score_keys = []
        with pepxml.read(str(self.filename)) as reader:
            for spectrum_query in reader:
                # Convert the keys view to a list so that it can be indexed below
                score_keys = list(spectrum_query["search_hit"][0]["search_score"].keys())
                break

        # Infer score name
        if not score_keys:
            logger.warning("No pepXML scores found.")
            return None
        else:
            for score in STANDARD_SEARCHENGINE_SCORES:  # Check for known scores
                if score in score_keys:
                    logger.debug(f"Using known pepXML score `{score}`.")
                    return score
            else:
                logger.warning(f"No known pepXML scores found. Defaulting to `{score_keys[0]}`.")
                return score_keys[0]  # Default to the first one if nothing found

    @staticmethod
    def _parse_peptidoform(peptide: str, modifications: List[dict], charge: Optional[int] = None):
        """Parse pepXML peptide to :py:class:`~psm_utils.peptidoform.Peptidoform`."""
        modifications_dict = defaultdict(list)
        n_term = []
        c_term = []
        for mod in modifications:
            mod_tag = proforma.process_tag_tokens(f"{mod['mass']:+}")
            if mod["position"] == 0:
                n_term.append(mod_tag)
            elif mod["position"] == len(peptide) + 1:
                c_term.append(mod_tag)
            else:
                modifications_dict[mod["position"]].append(mod_tag)

        sequence = [(aa, modifications_dict[i] or None) for i, aa in enumerate(peptide)]
        properties = {
            "n_term": n_term,
            "c_term": c_term,
            "charge_state": proforma.ChargeState(charge) if charge else None,
            "unlocalized_modifications": [],
            "labile_modifications": [],
            "fixed_modifications": [],
            "intervals": [],
            "isotopes": [],
            "group_ids": [],
        }
        return Peptidoform(proforma.ProForma(sequence, properties))

    def _parse_psm(self, spectrum_query: dict, search_hit: dict) -> PSM:
        """Parse pepXML PSM to :py:class:`~psm_utils.psm.PSM`."""
        return PSM(
            peptidoform=self._parse_peptidoform(
                search_hit["peptide"],
                search_hit["modifications"],
                spectrum_query["assumed_charge"],
            ),
            spectrum_id=spectrum_query["spectrum"],
            run=None,
            collection=None,
            spectrum=None,
            is_decoy=None,
            score=search_hit["search_score"][self.score_key],
            qvalue=None,
            pep=None,
            precursor_mz=mass_to_mz(
                spectrum_query["precursor_neutral_mass"], spectrum_query["assumed_charge"]
            ),
            retention_time=spectrum_query["retention_time_sec"],
            ion_mobility=spectrum_query["ion_mobility"]
            if "ion_mobility" in spectrum_query
            else None,
            protein_list=[p["protein"] for p in search_hit["proteins"]],
            rank=search_hit["hit_rank"],
            source=None,
            provenance_data={
                "pepxml_index": str(spectrum_query["index"]),
                "start_scan": str(spectrum_query["start_scan"]),
                "end_scan": str(spectrum_query["end_scan"]),
            },
            metadata={
                "num_matched_ions": str(search_hit["num_matched_ions"]),
                "tot_num_ions": str(search_hit["tot_num_ions"]),
                "num_missed_cleavages": str(search_hit["num_missed_cleavages"]),
                # Unpack the per-score entries; `dict.update()` returns None, not the dict
                **{
                    f"search_score_{key.lower()}": str(search_hit["search_score"][key])
                    for key in search_hit["search_score"]
                },
            },
            rescoring_features=None,
        )
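The score inference in `_infer_score_name` can be tried out without a pepXML file. The following is a minimal stand-alone sketch of that step, not the reader's actual method: the function name `pick_score_key` and the example score dictionaries are hypothetical, and only the preference list is taken from the diff.

```python
from typing import Iterable, List, Optional

# Preference order copied from STANDARD_SEARCHENGINE_SCORES in the diff
KNOWN_SCORES: List[str] = [
    "expect",
    "EValue",
    "Evalue",
    "SpecEValue",
    "xcorr",
    "delta_dot",
    "mzFidelity",
]


def pick_score_key(score_keys: Iterable[str]) -> Optional[str]:
    """Hypothetical stand-alone version of the score-inference step."""
    keys = list(score_keys)  # dict key views cannot be indexed directly
    if not keys:
        return None
    for known in KNOWN_SCORES:
        if known in keys:
            return known
    return keys[0]  # nothing recognized: fall back to the first score present


# Example search_score dicts (values are made up for illustration)
with_expect = {"xcorr": 2.1, "deltacn": 0.3, "expect": 0.001}
unrecognized = {"my_score": 1.0, "other": 2.0}

print(pick_score_key(with_expect.keys()))
print(pick_score_key(unrecognized.keys()))
```

The choice follows the order of the preference list rather than the order of the scores in the file, so `expect` is picked over `xcorr` in the first example even though `xcorr` appears first.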
37 changes: 37 additions & 0 deletions tests/test_io/test_pepxml.py
@@ -0,0 +1,37 @@
from psm_utils.peptidoform import Peptidoform
from psm_utils.io.pepxml import PepXMLReader


class TestPepXMLReader:
    def test_parse_peptidoform(self):
        test_cases = [
            {
                "in": {
                    "peptide": "ACDEK",
                    "modifications": [],
                    "charge": 2,
                },
                "out": Peptidoform("ACDEK/2"),
            },
            {
                "in": {
                    "peptide": "STEEQNGGGQK",
                    "modifications": [
                        {"position": 0, "mass": 43.017841151532004},
                        {"position": 2, "mass": 181.014009},
                    ],
                    "charge": 2,
                },
                "out": Peptidoform("[+43.017841151532004]-STE[+181.014009]EQNGGGQK/2"),
            },
            {
                "in": {
                    "peptide": "ACDEK",
                    "modifications": [{"position": 6, "mass": 181.014009}],
                    "charge": 3,
                },
                "out": Peptidoform("ACDEK-[+181.014009]/3"),
            },
        ]

        for test_case in test_cases:
            assert test_case["out"] == PepXMLReader._parse_peptidoform(**test_case["in"])
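To make the position handling exercised by these test cases easier to follow, here is a minimal string-only sketch. It is an illustration under assumptions, not the reader's implementation: `to_proforma_string` is a hypothetical helper that builds plain text instead of pyteomics `ProForma` objects, and the masses in the example are arbitrary.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def to_proforma_string(
    peptide: str, modifications: List[dict], charge: Optional[int] = None
) -> str:
    """Hypothetical helper mirroring the position handling in the diff."""
    n_term: List[str] = []
    c_term: List[str] = []
    by_index: Dict[int, List[str]] = defaultdict(list)
    for mod in modifications:
        tag = f"[{mod['mass']:+}]"  # signed mass shift, e.g. [+42.010565]
        if mod["position"] == 0:
            n_term.append(tag)  # position 0 is treated as the N-terminus
        elif mod["position"] == len(peptide) + 1:
            c_term.append(tag)  # one past the last residue is the C-terminus
        else:
            by_index[mod["position"]].append(tag)

    # Residues are indexed with enumerate(), i.e. zero-based, as in the diff
    result = "".join(aa + "".join(by_index.get(i, [])) for i, aa in enumerate(peptide))
    if n_term:
        result = "".join(n_term) + "-" + result
    if c_term:
        result = result + "-" + "".join(c_term)
    if charge:
        result += f"/{charge}"
    return result


print(to_proforma_string("ACDEK", [{"position": 6, "mass": 79.966331}], 3))
```

Residue positions are compared against zero-based indices here, which is what the expected strings in the test cases imply. Whether that agrees with the position convention used by a given pepXML producer is worth checking against real files.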