accession attribute in DBSequence should be unique? #91

smdb21 · 2017-08-04T17:11:39Z

When parsing example file https://github.com/HUPO-PSI/mzIdentML/blob/master/examples/1_2examples/crosslinking/xiFDR-CrossLinkExample.mzid, I find these 2 protein entries as DBSequence elements:

<DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_target" name="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2">
  <cvParam cvRef="PSI-MS" accession="MS:1001088" name="protein description" value="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2"></cvParam>
</DBSequence>
<DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_decoy" name="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2">
  <cvParam cvRef="PSI-MS" accession="MS:1001088" name="protein description" value="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2"></cvParam>
</DBSequence>

Although the protein entries are different (one is the decoy entry of the other), the accession attribute is the same.
My question is: should the accession attribute be unique? The specification document says this about the accession:

The unique accession of this sequence

This caused me a problem because I am collecting all proteins in a map in which the key is the accession.

What do you think?


julianu · 2017-08-07T08:49:56Z

I would definitely vote to have these accessions unique. Having the same accessions for differing entries is probably an error, and it leads to inconsistencies when mapping the peptides and PSMs to the proteins, in the given example.

colin-combe · 2019-12-18T10:36:58Z

I would vote for them not being unique.

First, there is the decoy/target example above. More generally, proteins can have the same accession number and different sequences - this is why we're not all clones, right?

If the example above leads to inconsistencies then it is an error in the software reading the file, because the id attributes are different?

lutzfischer · 2019-12-18T14:58:14Z

I also think they should be the same - it is the decoy counterpart for the target - and, unless we have a standard way of denoting them as a target-decoy pair, I would actually ask for them to have the same accession.

Being able to match these up is important for FDR estimation, as only this way can you make a meaningful separate (target-decoy based) FDR for self/internal/intra vs between/inter.

colin-combe · 2022-12-01T13:53:46Z

could this issue be resolved/closed? I think there are reasons why they are not required to be unique.
@andrewrobertjones - what do you think about this?

mobiusklein · 2022-12-01T14:10:56Z

You can have multiple search databases which could have overlapping entries, like searching all of the reviewed sequences of UniProt and then searching again with all the isoforms and unreviewed sequences enabled. The searchDatabase_ref tells you which database an entry should be resolved against. In order for the mzIdentML to be internally consistent, the id is the only field that absolutely has to be unique across all DBSequence entries.
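A minimal sketch of what that means for a reader, assuming Python's standard `xml.etree.ElementTree`: key the primary map on `id`, and treat `(searchDatabase_ref, accession)` as a secondary, possibly many-valued lookup. The two `DBSequence` entries are the ones from the example file with their children omitted; `local_name` and `index_db_sequences` are hypothetical helper names, not part of any mzIdentML library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two entries from the example file, children omitted. Real files declare
# an XML namespace, which is why tags are compared by local name below.
SNIPPET = """
<SequenceCollection>
  <DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_target"/>
  <DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_decoy"/>
</SequenceCollection>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def index_db_sequences(root):
    """Return (by_id, by_accession) for all DBSequence elements under root."""
    by_id = {}
    by_accession = defaultdict(list)
    for elem in root.iter():
        if local_name(elem.tag) != "DBSequence":
            continue
        seq_id = elem.get("id")
        if seq_id in by_id:
            # Only the id is expected to be unique, so only this is an error.
            raise ValueError(f"duplicate DBSequence id: {seq_id}")
        by_id[seq_id] = elem
        key = (elem.get("searchDatabase_ref"), elem.get("accession"))
        by_accession[key].append(seq_id)
    return by_id, dict(by_accession)


by_id, by_accession = index_db_sequences(ET.fromstring(SNIPPET))
for key, ids in by_accession.items():
    print(key, ids)
```

With this layout the shared accession is no longer a collision: both entries survive in `by_id`, and the accession lookup simply returns two ids.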

The "supported" method for including decoy proteins in your search database involves adding some marker to the accession attribute of the DBSequence protein, and specifying a regex for matching that marker in your SearchDatabase element using MS:1001283 decoy DB accession regexp. (edit to correct accession per @colin-combe's catch)

Would it be better if there were an isDecoy attribute like on PeptideEvidence?

colin-combe · 2022-12-01T14:41:47Z

> the id is the only field that absolutely has to be unique across all DBSequence entries

that seems sufficient info to close this

> The "supported" method for including decoy proteins in your search database involves adding some marker to the accession attribute of the DBSequence protein, and specifying a regex for matching that marker in your SearchDatabase element using MS:1001450 decoy DB accession regexp

where is that documented? (apologies if it's obvious and I'm just being blind)

colin-combe · 2022-12-01T19:03:23Z

> where is that documented?

right... it's shown in the example in Section 7.5 of the 1.2.0 spec (though it isn't discussed in the text).

It's because its accession is MS:1001283 (not MS:1001450 as in your message, though the link is correct in your message), that I didn't find it. (I searched for MS:1001450).

@lutzfischer - I think we've been unaware of this?

colin-combe · 2022-12-01T19:18:39Z

also, re. MS:1001283 - it's incorrectly shown as an example CV param for DatabaseName (6.20, pg. 36)?
I say 'incorrectly' because the CV mapping rules given for DatabaseName wouldn't allow it?
All the example CV params given for DatabaseName are wrong?

mobiusklein · 2022-12-02T05:00:45Z

Thanks for catching the accession number error earlier. I was writing in a hurry and must have copied over the wrong accession from OLS.

I think you're right about the parameters in DatabaseName.

As-is, this could only be one of the children given here: https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1001013&lang=en&viewMode=All&siblings=false
or a userParam.

colin-combe · 2022-12-02T07:44:41Z

I will make a separate issue for the incorrect DatabaseName example cvParams.

> Would it be better if there were an isDecoy attribute [on DBSequence] like on PeptideEvidence?

that sounds sensible to me, but then it is a change to the schema

lutzfischer · 2022-12-08T13:18:45Z

Currently the only "reliable" way to detect if a protein is a decoy protein is to go via PeptideEvidence. But I guess there are other ways to have decoys besides extra decoy proteins - concatenated proteins come to mind - where only a part of the "protein" is decoy. Not sure what would be the best way to represent that.
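Going via PeptideEvidence could look roughly like the sketch below, which collects the `isDecoy` values of every `PeptideEvidence` pointing at a given `DBSequence`. The `PeptideEvidence` ids and `peptide_ref` values are made up for illustration, `classify_db_sequences` is a hypothetical helper, and treating a missing `isDecoy` as false is an assumption here. A sequence that collects both values is reported as "mixed", which is one possible way to surface the concatenated case.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# DBSequence ids are from the example file; the PeptideEvidence entries are
# invented for illustration only.
SNIPPET = """
<SequenceCollection>
  <DBSequence id="dbseq_P02768-A_target" accession="P02768-A" searchDatabase_ref="SDB_4299_203"/>
  <DBSequence id="dbseq_P02768-A_decoy" accession="P02768-A" searchDatabase_ref="SDB_4299_203"/>
  <PeptideEvidence id="pe_1" dBSequence_ref="dbseq_P02768-A_target" peptide_ref="pep_1" isDecoy="false"/>
  <PeptideEvidence id="pe_2" dBSequence_ref="dbseq_P02768-A_decoy" peptide_ref="pep_2" isDecoy="true"/>
  <PeptideEvidence id="pe_3" dBSequence_ref="dbseq_P02768-A_decoy" peptide_ref="pep_3" isDecoy="true"/>
</SequenceCollection>
"""


def classify_db_sequences(root):
    """Map each referenced DBSequence id to 'target', 'decoy' or 'mixed'."""
    flags = defaultdict(set)
    for elem in root.iter():
        # Compare by local name so a namespaced document also works.
        if elem.tag.rsplit("}", 1)[-1] != "PeptideEvidence":
            continue
        # xsd:boolean allows "true"/"1"; a missing attribute is treated
        # as false here (assumption).
        is_decoy = elem.get("isDecoy", "false") in ("true", "1")
        flags[elem.get("dBSequence_ref")].add(is_decoy)

    labels = {}
    for seq_id, seen in flags.items():
        if seen == {True}:
            labels[seq_id] = "decoy"
        elif seen == {False}:
            labels[seq_id] = "target"
        else:
            labels[seq_id] = "mixed"
    return labels


labels = classify_db_sequences(ET.fromstring(SNIPPET))
print(labels)
```

A DBSequence with no PeptideEvidence at all gets no label in this sketch, so it cannot be classified this way.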

For the case of distinct decoy proteins, actually the current spec document, at least implicitly, by example, suggests different accessions:

<SearchDatabase location="/localdirectory/18.E_coli_K12_edit.fasta" id="K12_nosignal" name="K12"
numDatabaseSequences="9376" releaseDate="01-2008-08-2008" version="1.0" >
	<FileFormat>
		<cvParam accession="MS:1001348" name="FASTA format" cvRef="PSI-MS"/>
	</FileFormat>
	<DatabaseName>
		<userParam name="18.E_coli_K12_edit.fasta" />
	</DatabaseName>
	<cvParam accession="MS:1001197" name="DB composition target+decoy" cvRef="PSI-MS"/>
	<cvParam accession="MS:1001283" name="decoy DB accession regexp" value="Rnd" cvRef="PSI-MS"/>
	<cvParam accession="MS:1001195" name="decoy DB type reverse" cvRef="PSI-MS"/>
</SearchDatabase>
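As a rough sketch of how a reader might apply that `decoy DB accession regexp`: read the MS:1001283 value from each SearchDatabase and test it against the accession of every DBSequence that references that database. The two DBSequence accessions below are invented for illustration, the SearchDatabase is trimmed to the relevant cvParam, and using an unanchored `re.search` is an assumption, since the thread does not say how the pattern is meant to be matched.

```python
import re
import xml.etree.ElementTree as ET

DECOY_REGEXP_ACCESSION = "MS:1001283"  # decoy DB accession regexp

# SearchDatabase trimmed from the spec example; the DBSequence entries and
# their accessions are hypothetical.
SNIPPET = """
<root>
  <SearchDatabase id="K12_nosignal" name="K12">
    <cvParam accession="MS:1001283" name="decoy DB accession regexp" value="Rnd" cvRef="PSI-MS"/>
  </SearchDatabase>
  <DBSequence id="dbseq_1" accession="P0A7G6" searchDatabase_ref="K12_nosignal"/>
  <DBSequence id="dbseq_2" accession="Rnd1P0A7G6" searchDatabase_ref="K12_nosignal"/>
</root>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def decoy_patterns(root):
    """Map SearchDatabase id -> compiled decoy accession pattern."""
    patterns = {}
    for db in root.iter():
        if local_name(db.tag) != "SearchDatabase":
            continue
        for child in db:
            if (local_name(child.tag) == "cvParam"
                    and child.get("accession") == DECOY_REGEXP_ACCESSION):
                patterns[db.get("id")] = re.compile(child.get("value"))
    return patterns


def decoys_by_regexp(root):
    """Map DBSequence id -> True if its accession matches its database's pattern."""
    patterns = decoy_patterns(root)
    result = {}
    for seq in root.iter():
        if local_name(seq.tag) != "DBSequence":
            continue
        pattern = patterns.get(seq.get("searchDatabase_ref"))
        accession = seq.get("accession", "")
        result[seq.get("id")] = (
            pattern is not None and pattern.search(accession) is not None
        )
    return result


decoys = decoys_by_regexp(ET.fromstring(SNIPPET))
print(decoys)
```

Note that this approach cannot separate the two P02768-A entries from the original question, since their accessions are identical; it only works when the decoy marker is part of the accession.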

colin-combe mentioned this issue Dec 2, 2022

incorrect example cvParams given for DatabaseName (6.20) #134

Closed
