HELP-217 Missing Ensembl IDs #7

Closed
jeremywalter opened this issue Nov 21, 2022 · 3 comments
@jeremywalter
Contributor

When I run the prep script, it works fine with a protein that has the Ensembl ID "ENSG00000214826". However, when I run the submission tool, it fails. This ID is also not present in the Ensembl TSV in the reference CV folder. There are probably other affected IDs as well, but the submission stopped after the first error.

Best,
René
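
For anyone wanting to check ahead of submission whether an ID is present in a reference file, here is a minimal sketch. It assumes the reference file is a gzipped, tab-separated file with a header row and that the ID column is named `id`; both the column name and the helper names are assumptions for illustration, not taken from the prep script.

```python
import csv
import gzip


def ids_in_reference(tsv_gz_path, id_column="id"):
    """Return the set of IDs listed in a gzipped, tab-separated reference file.

    The column name "id" is an assumption; pass the real header name if it differs.
    """
    with gzip.open(tsv_gz_path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[id_column] for row in reader}


def missing_ids(wanted, tsv_gz_path, id_column="id"):
    """Return the wanted IDs that do not appear in the reference file, sorted."""
    known = ids_in_reference(tsv_gz_path, id_column)
    return sorted(set(wanted) - known)
```

Calling `missing_ids(["ENSG00000214826"], path_to_reference_tsv)` would return a list containing that ID if it is absent from the file, and an empty list otherwise.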

@jeremywalter
Contributor Author

Arthur Brady
June 15, 2022 at 9:50 PM

All set! Instructions:

go to https://osf.io/bq6k9/ and download the latest prepare_C2M2_submission.py

go to the "external_CV_reference_files" subfolder of that directory and get the latest protein.tsv.gz, compound.tsv.gz and substance.tsv.gz

rerun the prep script using your desired protein IDs

Note: of the list of 1,270 error-throwing UniProtKB accessions that you tried to submit in April, all but 15 will now work. The remaining 15 are legitimate errors, in one of three categories:

obsolete/deleted ID (e.g. Q6ZRZ4: https://www.uniprot.org/uniprot/Q6ZRZ4)

accession is for the wrong DB (e.g. Q6ZW33 is a PRO ID, not a UniProtKB accession; cf. https://www.uniprot.org/uniprot/O94851)

accession is a secondary accession (e.g. P35544); please use the primary accession instead (e.g. P62861 is the primary accession for P35544: cf. https://www.uniprot.org/uniprot/P62861). (Note in this case that the secondary accessions will still be automatically loaded, as C2M2 synonyms, when the primary accession is processed by the prep script – so users will be able to search "P35544" and come up with records relating to "P62861" – but only primary accessions should be used for direct data entry when you're creating C2M2 datapackages for submission).
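
The three categories above can be sketched as a small triage function. This is only an illustration: the function name and the lookup tables are hypothetical and would have to be built from UniProtKB data, and it is not how the prep script itself classifies errors.

```python
def classify_accession(accession, primary, secondary_to_primary, obsolete):
    """Sort a submitted accession into one of the categories described above.

    primary: set of current primary accessions
    secondary_to_primary: dict mapping secondary accessions to their primary
    obsolete: set of obsolete or deleted accessions
    All three lookups are hypothetical inputs supplied by the caller.
    """
    if accession in primary:
        return "ok"
    if accession in secondary_to_primary:
        return "secondary accession; use " + secondary_to_primary[accession]
    if accession in obsolete:
        return "obsolete or deleted"
    # Anything else is unknown here, e.g. an ID that belongs to another database.
    return "not found; possibly an ID from another database"


# Example lookups using only the accessions mentioned in this thread.
primary = {"P62861"}
secondary_to_primary = {"P35544": "P62861"}
obsolete = {"Q6ZRZ4"}

for acc in ["P62861", "P35544", "Q6ZRZ4", "Q6ZW33"]:
    print(acc, "->", classify_accession(acc, primary, secondary_to_primary, obsolete))
```

Checking the primary set first means a valid accession is never reported as an error, even if it also appears in one of the other tables.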

Final note: the script will now list all offending IDs for all term types (proteins, genes, etc.) and will not exit until it's printed all the IDs that caused problems, which will hopefully simplify error handling in the future.
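
The collect-then-report behavior described in this note can be sketched roughly as follows. The function names and the shape of the inputs (dicts keyed by term type) are assumptions for illustration; the actual prep script may be organized quite differently.

```python
import sys


def find_offending_ids(submitted, reference):
    """Return {term_type: sorted list of IDs not found in the reference}.

    submitted: dict mapping a term type (e.g. "protein") to an iterable of IDs
    reference: dict mapping a term type to the set of known IDs
    Every term type is checked before anything is reported.
    """
    offending = {}
    for term_type, ids in submitted.items():
        known = reference.get(term_type, set())
        bad = sorted(set(ids) - known)
        if bad:
            offending[term_type] = bad
    return offending


def report_and_exit(offending):
    """Print every offending ID to stderr, then exit non-zero if any were found."""
    for term_type, ids in offending.items():
        for bad_id in ids:
            print(f"unknown {term_type} ID: {bad_id}", file=sys.stderr)
    if offending:
        sys.exit(1)
```

Separating the check from the exit keeps the full list of problems available in one pass, rather than stopping at the first bad ID.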

Arthur Brady
June 15, 2022 at 9:11 PM
Confirmed, I have located and fixed the problem and am updating the prep script and our reference files on OSF now. Stay tuned for detailed instructions on retrying submission as soon as I complete my data transfers.

Arthur Brady
June 15, 2022 at 7:25 PM
I believe I have found the problem with the missing protein IDs and am testing a fix which I believe will work. Assuming all goes well I will push relevant updates to reference TSVs and the submission prep script by tomorrow and notify you here; stand by.

Arthur Brady
June 2, 2022 at 5:41 PM
Cool. I’ll report on the protein issue here when I finish my tests – there’s another ticket mentioning it as well, but I already closed that one and I’m risking straining everyone’s necks with all the ticket consolidation ping pong.

Rene Ranzinger
June 2, 2022 at 5:32 PM
Hi,

I don't think I saw any Ensembl ID errors in the recent error dump I sent you, so we should be good on that front. But the protein ID problem remains. If you want to make a separate ticket for it to avoid confusion, that is fine with me. I just posted all protein-related ID errors into this thread, regardless of whether they involved a protein ID or a gene ID.

Best,
René

@jeremywalter jeremywalter changed the title HELP-217 HELP-217 Missing Ensemble IDs Nov 21, 2022
@jonathancrabtree jonathancrabtree changed the title HELP-217 Missing Ensemble IDs HELP-217 Missing Ensembl IDs Nov 22, 2022
@jonathancrabtree

Hi @ReneRanzinger, the notes I have from Arthur indicate that this case was closed (in June, I'm assuming), and your notes above seem to confirm this (with the exception of the protein ID problem, which I'm guessing was filed as the separate case HELP-357, aka issue #1 in this repository). Can this case be closed? If not, which Ensembl IDs are causing submission problems?

@ReneRanzinger
Member

ReneRanzinger commented Nov 22, 2022
