accession attribute in DBSequence should be unique? #91

Open
smdb21 opened this issue Aug 4, 2017 · 11 comments

@smdb21
Contributor

smdb21 commented Aug 4, 2017

When parsing the example file https://github.com/HUPO-PSI/mzIdentML/blob/master/examples/1_2examples/crosslinking/xiFDR-CrossLinkExample.mzid, I find these two protein entries as DBSequence elements:

<DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_target" name="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2">
  <cvParam cvRef="PSI-MS" accession="MS:1001088" name="protein description" value="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2"></cvParam>
</DBSequence>
<DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_decoy" name="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2">
  <cvParam cvRef="PSI-MS" accession="MS:1001088" name="protein description" value="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2"></cvParam>
</DBSequence>

Although the protein entries are different (one is the decoy entry of the other), the accession attribute is the same.
My question is: should the accession attribute be unique? The specification document says this about the accession:

The unique accession of this sequence

This caused me a problem because I am collecting all proteins in a map in which the key is the accession.

What do you think?

@julianu
Contributor

julianu commented Aug 7, 2017

I would definitely vote to have these accessions unique. Having the same accessions for differing entries is probably an error, and it leads to inconsistencies when mapping the peptides and PSMs to the proteins, in the given example.

@colin-combe
Contributor

I would vote for them not being unique.

First, there is the decoy/target example above. More generally, proteins can have the same accession number and different sequences - this is why we're not all clones, right?

If the example above leads to inconsistencies then it is an error in the software reading the file, because the id attributes are different?

@lutzfischer
Contributor

I also think they should be the same - it is the decoy counterpart for the target - and, unless we have a standard way of denoting them as a target-decoy pair, I would actually ask for them to have the same accession.

Being able to match these up is important for FDR estimation, as only this way can you compute a meaningful separate (target-decoy based) FDR for self/internal/intra vs between/inter.

@colin-combe
Contributor

could this issue be resolved/closed? I think there are reasons why they are not required to be unique.
@andrewrobertjones - what do you think about this?

@mobiusklein

mobiusklein commented Dec 1, 2022

You can have multiple search databases which could have overlapping entries, like searching all of the reviewed sequences of UniProt and then searching again with all the isoforms and unreviewed sequences enabled. The searchDatabase_ref tells you which database an entry should be resolved against. In order for the mzIdentML to be internally consistent, the id is the only field that absolutely has to be unique across all DBSequence entries.
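As an illustration of the point about `id` (not part of the original comment), here is a minimal Python sketch that indexes DBSequence entries by `id` and only groups them by `(searchDatabase_ref, accession)` as a secondary lookup. The wrapper element, the trimmed attributes, and the function name `index_db_sequences` are assumptions made for the example; a real mzIdentML file declares an XML namespace, which this sketch assumes has been stripped.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified fragment based on the two entries quoted in this issue.
# Assumption: no XML namespace and a made-up wrapper element.
XML = """
<SequenceCollection>
  <DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_target"/>
  <DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_decoy"/>
</SequenceCollection>
"""


def index_db_sequences(xml_text):
    """Return (by_id, by_accession) lookups for all DBSequence elements."""
    root = ET.fromstring(xml_text)
    by_id = {}
    by_accession = defaultdict(list)
    for elem in root.iter("DBSequence"):
        seq_id = elem.get("id")
        if seq_id in by_id:
            # The id is the attribute that must be unique within the file.
            raise ValueError(f"duplicate DBSequence id: {seq_id}")
        by_id[seq_id] = dict(elem.attrib)
        # An accession may map to several entries, so collect ids in a list.
        key = (elem.get("searchDatabase_ref"), elem.get("accession"))
        by_accession[key].append(seq_id)
    return by_id, dict(by_accession)


by_id, by_accession = index_db_sequences(XML)
print(len(by_id))
print(by_accession[("SDB_4299_203", "P02768-A")])
```

With this layout the target and decoy entries from the example file no longer overwrite each other, while entries sharing an accession can still be found together.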

The "supported" method for including decoy proteins in your search database involves adding some marker to the accession attribute of the DBSequence protein, and specifying a regex for matching that marker in your SearchDatabase element using MS:1001283 decoy DB accession regexp. (edit to correct accession per @colin-combe's catch)
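A rough Python sketch of how a reader might apply that regexp (an illustration, not code from any mzIdentML library): it collects the MS:1001283 value per SearchDatabase and tests each DBSequence accession against the pattern of the database it references. The regexp value "Rnd" mirrors the spec example quoted later in this thread; the DBSequence ids and accessions, the wrapper element, the missing namespace, and the use of `re.search` rather than a full match are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the cvParam value "Rnd" comes from the spec example.
XML = """
<Root>
  <SearchDatabase id="K12_nosignal">
    <cvParam accession="MS:1001283" name="decoy DB accession regexp" value="Rnd" cvRef="PSI-MS"/>
  </SearchDatabase>
  <DBSequence id="dbseq_1" searchDatabase_ref="K12_nosignal" accession="Rnd1_P0A7G6"/>
  <DBSequence id="dbseq_2" searchDatabase_ref="K12_nosignal" accession="P0A7G6"/>
</Root>
"""

DECOY_REGEXP_ACCESSION = "MS:1001283"


def decoy_patterns(root):
    """Map each SearchDatabase id to its compiled decoy accession regexp."""
    patterns = {}
    for db in root.iter("SearchDatabase"):
        for cv in db.findall("cvParam"):
            if cv.get("accession") == DECOY_REGEXP_ACCESSION:
                patterns[db.get("id")] = re.compile(cv.get("value"))
    return patterns


def classify(root):
    """Return {DBSequence id: True if its accession matches the decoy regexp}."""
    patterns = decoy_patterns(root)
    result = {}
    for seq in root.iter("DBSequence"):
        pattern = patterns.get(seq.get("searchDatabase_ref"))
        # A database without a decoy regexp yields False for all its entries.
        result[seq.get("id")] = bool(pattern and pattern.search(seq.get("accession")))
    return result


root = ET.fromstring(XML)
is_decoy = classify(root)
print(is_decoy)
```

Note that this approach cannot separate the two P02768-A entries from the file in this issue, since their accessions are identical.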

Would it be better if there were an isDecoy attribute like on PeptideEvidence?

@colin-combe
Contributor

the id is the only field that absolutely has to be unique across all DBSequence entries

that seems sufficient info to close this

The "supported" method for including decoy proteins in your search database involves adding some marker to the accession attribute of the DBSequence protein, and specifying a regex for matching that marker in your SearchDatabase element using MS:1001450 decoy DB accession regexp

where is that documented? (apologies if it's obvious and I'm just being blind)

@colin-combe
Contributor

where is that documented?

right... it's shown in the example in Section 7.5 of the 1.2.0 spec (though it isn't discussed in the text).

I didn't find it because its accession is MS:1001283 (not MS:1001450 as in your message, though the link in your message is correct); I searched for MS:1001450.

@lutzfischer - I think we've been unaware of this?

@colin-combe
Contributor

also, re. MS:1001283 - it's incorrectly shown as an example CV param for DatabaseName (6.20, pg. 36)?
I say 'incorrectly' because the CV mapping rules given for DatabaseName wouldn't allow it?
All the example CV params given for DatabaseName are wrong?

@mobiusklein

Thanks for catching the accession number error earlier. I was writing in a hurry and must have copied over the wrong accession from OLS.

I think you're right about the parameters in DatabaseName.

As-is, this could only be one of the children given here: https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1001013&lang=en&viewMode=All&siblings=false
or a userParam.

@colin-combe
Contributor

I will make a separate issue for the incorrect DatabaseName example cvParams.

Would it be better if there were an isDecoy attribute [on DBSequence] like on PeptideEvidence?

that sounds sensible to me, but then it is a change to the schema

@lutzfischer
Contributor

currently the only "reliable" way to detect if a protein is a decoy protein is to go via PeptideEvidence. But I guess there are other ways to have decoys besides extra decoy proteins - concatenated proteins come to mind - where only a part of the "protein" is decoy. Not sure what would be the best way to represent that.
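A minimal Python sketch of that PeptideEvidence route, for illustration only: it collects the `isDecoy` flags of every PeptideEvidence pointing at a DBSequence via `dBSequence_ref` and reports "target", "decoy", "mixed" (as might happen with the concatenated case) or "unknown". The PeptideEvidence ids and `peptide_ref` values are made up, the namespace is assumed to be stripped, and treating a missing `isDecoy` as false is an assumption about the schema default.

```python
import xml.etree.ElementTree as ET

# DBSequence entries follow the file in this issue; PeptideEvidence entries are hypothetical.
XML = """
<SequenceCollection>
  <DBSequence id="dbseq_P02768-A_target" accession="P02768-A" searchDatabase_ref="SDB_4299_203"/>
  <DBSequence id="dbseq_P02768-A_decoy" accession="P02768-A" searchDatabase_ref="SDB_4299_203"/>
  <PeptideEvidence id="pe_1" dBSequence_ref="dbseq_P02768-A_target" peptide_ref="pep_1" isDecoy="false"/>
  <PeptideEvidence id="pe_2" dBSequence_ref="dbseq_P02768-A_decoy" peptide_ref="pep_2" isDecoy="true"/>
</SequenceCollection>
"""


def decoy_status_by_dbsequence(root):
    """Infer a per-DBSequence decoy status from PeptideEvidence isDecoy flags."""
    flags = {}
    for pe in root.iter("PeptideEvidence"):
        # xsd:boolean allows "true"/"1"; a missing attribute is treated as false here.
        is_decoy = pe.get("isDecoy", "false") in ("true", "1")
        flags.setdefault(pe.get("dBSequence_ref"), set()).add(is_decoy)

    status = {}
    for seq in root.iter("DBSequence"):
        seen = flags.get(seq.get("id"), set())
        if not seen:
            status[seq.get("id")] = "unknown"  # no PeptideEvidence refers to it
        elif seen == {True}:
            status[seq.get("id")] = "decoy"
        elif seen == {False}:
            status[seq.get("id")] = "target"
        else:
            status[seq.get("id")] = "mixed"  # both target and decoy evidence
    return status


status = decoy_status_by_dbsequence(ET.fromstring(XML))
print(status)
```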

For the case of distinct decoy proteins, actually the current spec document, at least implicitly, by example, suggests different accessions:

<SearchDatabase location="/localdirectory/18.E_coli_K12_edit.fasta" id="K12_nosignal" name="K12"
numDatabaseSequences="9376" releaseDate="01-2008-08-2008" version="1.0" >
	<FileFormat>
		<cvParam accession="MS:1001348" name="FASTA format" cvRef="PSI-MS"/>
	</FileFormat>
	<DatabaseName>
		<userParam name="18.E_coli_K12_edit.fasta" />
	</DatabaseName>
	<cvParam accession="MS:1001197" name="DB composition target+decoy" cvRef="PSI-MS"/>
	<cvParam accession="MS:1001283" name="decoy DB accession regexp" value="Rnd" cvRef="PSI-MS"/>
	<cvParam accession="MS:1001195" name="decoy DB type reverse" cvRef="PSI-MS"/>
</SearchDatabase>
