Merge pull request #54 from compomics/feature/pepxml
Feature: Add PepXMLReader
RalfG authored Oct 20, 2023
2 parents ee39fd2 + 0aa5990 commit f153e35
Showing 8 changed files with 229 additions and 10 deletions.
18 changes: 17 additions & 1 deletion CHANGELOG.md
@@ -5,6 +5,16 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.0] - 2023-10-19

### Added

- `io`: Added new `io.pepxml` reader

### Fixed

- Docs: Add ionbot to README.rst, fix order in API docs

## [0.5.0] - 2023-09-20

### Added
@@ -21,6 +31,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `io.mzid`: Allow inconsistent presence of score in PSMs in a single mzid file

### Changed

- `PSM`: Values of the `rescoring_features` dictionary are now coerced to floats
- `io`: Raise `PSMUtilsIOException` when passed filetype is not known
- `io`: Make io reader `read_file` method inheritable (code cleanup)
@@ -31,6 +42,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Formatting: Increase max line length to 99 (code formatting)

### Fixed

- `PSMList`: Fix issue where `psm_list["protein_list"]` resulted in a Numpy error due to the inconsistent shape of the lists.
- `io.tsv`: Throw more descriptive `PSMUtilsIOException` when handling tsv errors
- `io.msamanda`: Fix support for N/C-terminal modifications
@@ -69,15 +81,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [0.3.1] - 2023-06-19

### Changed

- `io.sage`: Change `spectrum_fdr` to `spectrum_q` (cf. lazear/sage#64).

## [0.3.0] - 2023-06-08

### Added

- Add reader for [Sage](https://github.com/lazear/sage) PSM files.
- `io.mzid`: Add reading/writing of PEP and q-values

### Changed

- `psm`: The default values of `PSM.provenance_data`, `PSM.metadata` and `PSM.rescoring_features` are now `dict()` instead of `None`.
- `PSMList`: Also allow Numpy integers for indexing a single PSM
- `io.mzid.MzidReader`: Attempt to parse `retention time` or `scan start time` cvParams from both the SpectrumIdentificationResult and SpectrumIdentificationItem levels. Note that according to the mzIdentML specification document (v1.1.1) neither cvParam is expected to be present at either level.
@@ -88,6 +103,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Filter warnings from `psims.mzmlb` on import, as `mzmlb` is not used

### Fixed

- `psm`: Fix missing qvalue and pep in docstring
- `peptidoform`: ProForma mass modifications are now correctly parsed within the `rename_modifications` function.
- `io.maxquant.MSMSReader`: Correctly parse empty `Proteins` column to `None`
@@ -157,7 +173,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

-- `PSMList`: Truncate __repr__ to first five entries only, avoiding crashing notebook output
+- `PSMList`: Truncate `__repr__` to first five entries only, avoiding crashing notebook output
- `Peptidoform`: Minor typing fix
- `add_fixed_modifications`: Allow input as dict as well as list of tuples
- `io`: Fix issue where the `NamedTemporaryFile` for `_supports_write_psm` was seen as invalid Percolator file
2 changes: 2 additions & 0 deletions README.rst
@@ -89,11 +89,13 @@ Supported file formats
===================================================================================================================== ======================== =============== ===============
File format psm_utils tag Read support Write support
===================================================================================================================== ======================== =============== ===============
`ionbot CSV <https://ionbot.cloud/>`_ ``ionbot`` ✅ ❌
`OpenMS idXML <https://www.openms.de/>`_ ``idxml`` ✅ ❌
`MaxQuant msms.txt <https://www.maxquant.org/>`_ ``msms`` ✅ ❌
`MS Amanda CSV <https://ms.imp.ac.at/?goto=msamanda>`_ ``msamanda`` ✅ ❌
`mzIdentML <https://psidev.info/mzidentml>`_ ``mzid`` ✅ ✅
`Peptide Record <https://psm-utils.readthedocs.io/en/stable/api/psm_utils.io/#module-psm_utils.io.peptide_record>`_ ``peprec`` ✅ ✅
`pepXML <http://tools.proteomecenter.org/wiki/index.php?title=Formats:pepXML>`_ ``pepxml`` ✅ ❌
`Percolator tab <https://github.com/percolator/percolator/wiki/Interface>`_ ``percolator`` ✅ ✅
Proteome Discoverer MSF ``proteome_discoverer`` ✅ ❌
`Sage <https://github.com/lazear/sage/blob/v0.12.0/DOCS.md#interpreting-sage-output>`_ ``sage`` ✅ ❌
25 changes: 18 additions & 7 deletions docs/source/api/psm_utils.io.rst
@@ -6,6 +6,14 @@ psm_utils.io
:members:


psm_utils.io.ionbot
##########################

.. automodule:: psm_utils.io.ionbot
   :members:
   :inherited-members:



psm_utils.io.idxml
##################
@@ -24,6 +32,7 @@ psm_utils.io.maxquant
:inherited-members:



psm_utils.io.msamanda
#####################

@@ -51,6 +60,15 @@ psm_utils.io.peptide_record



psm_utils.io.pepxml
###########################

.. automodule:: psm_utils.io.pepxml
   :members:
   :inherited-members:



psm_utils.io.percolator
#######################

@@ -92,10 +110,3 @@ psm_utils.io.xtandem
.. automodule:: psm_utils.io.xtandem
   :members:
   :inherited-members:

psm_utils.io.ionbot
##########################

.. automodule:: psm_utils.io.ionbot
:members:
:inherited-members:
Binary file not shown.
2 changes: 1 addition & 1 deletion psm_utils/__init__.py
@@ -1,6 +1,6 @@
"""Common utilities for parsing and handling PSMs, and search engine results."""

-__version__ = "0.5.0"
+__version__ = "0.6.0"

from warnings import filterwarnings

9 changes: 8 additions & 1 deletion psm_utils/io/__init__.py
@@ -9,16 +9,17 @@
from rich.progress import track

import psm_utils.io.idxml as idxml
import psm_utils.io.ionbot as ionbot
import psm_utils.io.maxquant as maxquant
import psm_utils.io.msamanda as msamanda
import psm_utils.io.mzid as mzid
import psm_utils.io.peptide_record as peptide_record
import psm_utils.io.pepxml as pepxml
import psm_utils.io.percolator as percolator
import psm_utils.io.proteome_discoverer as proteome_discoverer
import psm_utils.io.sage as sage
import psm_utils.io.tsv as tsv
import psm_utils.io.xtandem as xtandem
import psm_utils.io.ionbot as ionbot
from psm_utils.io._base_classes import WriterBase
from psm_utils.io.exceptions import PSMUtilsIOException
from psm_utils.psm import PSM
@@ -49,6 +50,12 @@
        "extension": ".peprec.txt",
        "filename_pattern": r"(^.*\.peprec(?:\.txt)?$)|(?:^peprec\.txt$)",
    },
    "pepxml": {
        "reader": pepxml.PepXMLReader,
        "writer": None,
        "extension": ".pepxml",
        "filename_pattern": r"^.*\.pepxml$",
    },
    "percolator": {
        "reader": percolator.PercolatorTabReader,
        "writer": percolator.PercolatorTabWriter,
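The `filename_pattern` entries in this hunk suggest that the file type can be inferred from the file name. The following is a minimal sketch of such a lookup using only the two patterns visible above. The function name `guess_filetype`, the use of `re.fullmatch`, and the case-insensitive flag are assumptions made for illustration; the diff does not show how psm_utils itself applies these patterns.

```python
import re
from typing import Optional

# Patterns copied from the two entries visible in the hunk above
FILENAME_PATTERNS = {
    "peprec": r"(^.*\.peprec(?:\.txt)?$)|(?:^peprec\.txt$)",
    "pepxml": r"^.*\.pepxml$",
}


def guess_filetype(filename: str) -> Optional[str]:
    """Return the first tag whose pattern matches the file name, else None.

    Hypothetical helper; matching case-insensitively is an assumption.
    """
    for tag, pattern in FILENAME_PATTERNS.items():
        if re.fullmatch(pattern, filename, flags=re.IGNORECASE):
            return tag
    return None


print(guess_filetype("results.pepxml"))    # pepxml
print(guess_filetype("results.pepXML"))    # pepxml (only because of IGNORECASE)
print(guess_filetype("run1.peprec.txt"))   # peprec
print(guess_filetype("interact.pep.xml"))  # None
```

Note that a name such as `interact.pep.xml` does not match the `pepxml` pattern, because the pattern requires the literal suffix `.pepxml`. For such files the tag would presumably have to be passed explicitly instead of relying on inference.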
146 changes: 146 additions & 0 deletions psm_utils/io/pepxml.py
@@ -0,0 +1,146 @@
"""Interface with TPP pepXML PSM files."""

from __future__ import annotations

import logging
from collections import defaultdict
from pathlib import Path
from typing import List, Optional, Union

from pyteomics import pepxml, proforma

from psm_utils.io._base_classes import ReaderBase
from psm_utils.peptidoform import Peptidoform
from psm_utils.psm import PSM
from psm_utils.utils import mass_to_mz

logger = logging.getLogger(__name__)

STANDARD_SEARCHENGINE_SCORES = [
    "expect",
    "EValue",
    "Evalue",
    "SpecEValue",
    "xcorr",  # Fallback if no e-value is present
    "delta_dot",  # SpectraST
    "mzFidelity",
]


class PepXMLReader(ReaderBase):
    def __init__(
        self, filename: Union[str, Path], *args, score_key: Optional[str] = None, **kwargs
    ) -> None:
        """
        Reader for pepXML PSM files.

        Parameters
        ----------
        filename: str, pathlib.Path
            Path to PSM file.
        score_key: str, optional
            Name of the score metric to use as PSM score. If not provided, the score metric is
            inferred from a list of known search engine scores.

        """
        super().__init__(filename, *args, **kwargs)
        self.score_key = score_key or self._infer_score_name()

    def __iter__(self):
        """Iterate over file and return PSMs one-by-one."""
        with pepxml.read(str(self.filename)) as reader:
            for spectrum_query in reader:
                for search_hit in spectrum_query["search_hit"]:
                    yield self._parse_psm(spectrum_query, search_hit)

    def _infer_score_name(self) -> Optional[str]:
        """Infer the score from the list of known PSM scores."""
        # Get scores from first PSM. Start from an empty list so the name is bound even if
        # the file has no spectrum queries, and copy the keys to a list because a
        # `dict_keys` view cannot be indexed.
        score_keys = []
        with pepxml.read(str(self.filename)) as reader:
            for spectrum_query in reader:
                score_keys = list(spectrum_query["search_hit"][0]["search_score"].keys())
                break

        # Infer score name
        if not score_keys:
            logger.warning("No pepXML scores found.")
            return None

        for score in STANDARD_SEARCHENGINE_SCORES:  # Check for known scores
            if score in score_keys:
                logger.debug(f"Using known pepXML score `{score}`.")
                return score

        logger.warning(f"No known pepXML scores found. Defaulting to `{score_keys[0]}`.")
        return score_keys[0]  # Default to the first one if nothing found

    @staticmethod
    def _parse_peptidoform(peptide: str, modifications: List[dict], charge: Optional[int] = None):
        """Parse pepXML peptide to :py:class:`~psm_utils.peptidoform.Peptidoform`."""
        modifications_dict = defaultdict(list)
        n_term = []
        c_term = []
        for mod in modifications:
            mod_tag = proforma.process_tag_tokens(f"{mod['mass']:+}")
            if mod["position"] == 0:
                n_term.append(mod_tag)
            elif mod["position"] == len(peptide) + 1:
                c_term.append(mod_tag)
            else:
                modifications_dict[mod["position"]].append(mod_tag)

        # NOTE: Modifications are looked up by the 0-based index from `enumerate`, whereas
        # pyteomics reports residue positions 1-based (0 = N-terminus,
        # len(peptide) + 1 = C-terminus).
        sequence = [(aa, modifications_dict[i] or None) for i, aa in enumerate(peptide)]
        properties = {
            "n_term": n_term,
            "c_term": c_term,
            "charge_state": proforma.ChargeState(charge) if charge else None,
            "unlocalized_modifications": [],
            "labile_modifications": [],
            "fixed_modifications": [],
            "intervals": [],
            "isotopes": [],
            "group_ids": [],
        }
        return Peptidoform(proforma.ProForma(sequence, properties))

    def _parse_psm(self, spectrum_query: dict, search_hit: dict) -> PSM:
        """Parse pepXML PSM to :py:class:`~psm_utils.psm.PSM`."""
        return PSM(
            peptidoform=self._parse_peptidoform(
                search_hit["peptide"],
                search_hit["modifications"],
                spectrum_query["assumed_charge"],
            ),
            spectrum_id=spectrum_query["spectrum"],
            run=None,
            collection=None,
            spectrum=None,
            is_decoy=None,
            # No score can be looked up if no score name could be inferred
            score=search_hit["search_score"][self.score_key] if self.score_key else None,
            qvalue=None,
            pep=None,
            precursor_mz=mass_to_mz(
                spectrum_query["precursor_neutral_mass"], spectrum_query["assumed_charge"]
            ),
            retention_time=spectrum_query["retention_time_sec"],
            ion_mobility=spectrum_query["ion_mobility"]
            if "ion_mobility" in spectrum_query
            else None,
            protein_list=[p["protein"] for p in search_hit["proteins"]],
            rank=search_hit["hit_rank"],
            source=None,
            provenance_data={
                "pepxml_index": str(spectrum_query["index"]),
                "start_scan": str(spectrum_query["start_scan"]),
                "end_scan": str(spectrum_query["end_scan"]),
            },
            # Merge both mappings in a single literal: `dict.update()` returns None, so
            # chaining it onto the literal would pass `metadata=None`.
            metadata={
                "num_matched_ions": str(search_hit["num_matched_ions"]),
                "tot_num_ions": str(search_hit["tot_num_ions"]),
                "num_missed_cleavages": str(search_hit["num_missed_cleavages"]),
                **{
                    f"search_score_{key.lower()}": str(search_hit["search_score"][key])
                    for key in search_hit["search_score"]
                },
            },
            rescoring_features=None,
        )
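The score selection in `_infer_score_name` can be tried out without a pepXML file. The sketch below reproduces only the selection logic on plain dictionaries. The function name `infer_score_key` and the example score dictionaries are hypothetical stand-ins for what pyteomics returns under `search_score`; the priority list is copied from `STANDARD_SEARCHENGINE_SCORES` above.

```python
from typing import Iterable, Optional

# Priority order copied from STANDARD_SEARCHENGINE_SCORES in the module above
KNOWN_SCORES = ["expect", "EValue", "Evalue", "SpecEValue", "xcorr", "delta_dot", "mzFidelity"]


def infer_score_key(score_keys: Iterable[str]) -> Optional[str]:
    """Pick the first known score by priority, else fall back to the first key."""
    keys = list(score_keys)  # dict views are not indexable, so copy to a list
    if not keys:
        return None
    for known in KNOWN_SCORES:
        if known in keys:
            return known
    return keys[0]


# Hypothetical `search_score` dictionaries (illustrative values only)
engine_with_evalue = {"xcorr": 2.31, "deltacn": 0.45, "expect": 1.2e-5}
unknown_engine = {"my_score": 12.0, "other": 3.0}

print(infer_score_key(engine_with_evalue.keys()))  # expect
print(infer_score_key(unknown_engine.keys()))      # my_score
print(infer_score_key({}.keys()))                  # None
```

Because the priority list is walked in order, `expect` wins over `xcorr` even though `xcorr` comes first in the dictionary. Passing `score_key` to the reader explicitly bypasses this inference altogether.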
37 changes: 37 additions & 0 deletions tests/test_io/test_pepxml.py
@@ -0,0 +1,37 @@
from psm_utils.peptidoform import Peptidoform
from psm_utils.io.pepxml import PepXMLReader

class TestPepXMLReader:
    def test_parse_peptidoform(self):
        test_cases = [
            {
                "in": {
                    "peptide": "ACDEK",
                    "modifications": [],
                    "charge": 2,
                },
                "out": Peptidoform("ACDEK/2"),
            },
            {
                "in": {
                    "peptide": "STEEQNGGGQK",
                    "modifications": [
                        {"position": 0, "mass": 43.017841151532004},
                        {"position": 2, "mass": 181.014009},
                    ],
                    "charge": 2,
                },
                "out": Peptidoform("[+43.017841151532004]-STE[+181.014009]EQNGGGQK/2"),
            },
            {
                "in": {
                    "peptide": "ACDEK",
                    "modifications": [{"position": 6, "mass": 181.014009}],
                    "charge": 3,
                },
                "out": Peptidoform("ACDEK-[+181.014009]/3"),
            },
        ]

        for test_case in test_cases:
            assert test_case["out"] == PepXMLReader._parse_peptidoform(**test_case["in"])
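The test above loops over a list of cases, so the first failing assertion ends the test without reporting which case failed or whether later cases pass. One possible alternative, sketched here with the standard library only, is `unittest`'s `subTest`, which reports each case separately. The function `format_peptidoform` is a hypothetical stand-in for the function under test, not part of psm_utils.

```python
import unittest


def format_peptidoform(peptide: str, charge: int) -> str:
    """Hypothetical stand-in for the function under test."""
    return f"{peptide}/{charge}"


class TestFormatPeptidoform(unittest.TestCase):
    def test_cases(self):
        test_cases = [
            {"in": {"peptide": "ACDEK", "charge": 2}, "out": "ACDEK/2"},
            {"in": {"peptide": "ACDEK", "charge": 3}, "out": "ACDEK/3"},
        ]
        for test_case in test_cases:
            # Each case is reported on its own; a failure does not stop the loop
            with self.subTest(**test_case["in"]):
                self.assertEqual(format_peptidoform(**test_case["in"]), test_case["out"])


def run_suite() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFormatPeptidoform)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
```

With pytest, as used in this repository's test file, `pytest.mark.parametrize` would be the more common way to get the same per-case reporting.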
