Add option to generate a nucl database from the Uniprot core proteins #89

alexhbnr · 2022-05-12T15:02:23Z

Instead of downloading the protein sequences directly from Uniprot, this adds the possibility to retrieve the corresponding nucleotide sequences from ENA via metadata stored in XML format.

It iterates over the same input files that are required to retrieve amino acid sequences from Uniprot. However, instead of directly downloading the FastA file, it downloads the XML file from the Uniprot server. The XML file is parsed using an XML schema provided on the Uniprot website; the ENA accession IDs for the nucleotide sequences are then extracted and the FastA sequences downloaded.
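To illustrate the extraction step described above, here is a minimal sketch, not the code of this PR. It uses the standard-library `xml.etree.ElementTree` instead of `xmlschema`, and it assumes that the Uniprot XML uses the namespace `http://uniprot.org/uniprot` and lists ENA cross-references as `dbReference` elements of type `EMBL`. The sample entry, the function names, and the ENA URL pattern are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of Uniprot XML entries.
UNIPROT_NS = {"up": "http://uniprot.org/uniprot"}

# Made-up entry that mimics the assumed layout; not real Uniprot data.
SAMPLE_XML = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00000</accession>
    <dbReference type="EMBL" id="AB000001">
      <property type="protein sequence ID" value="BAA00001.1"/>
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="PDB" id="1ABC"/>
    <dbReference type="EMBL" id="AB000002">
      <property type="protein sequence ID" value="BAA00002.1"/>
    </dbReference>
  </entry>
</uniprot>
"""


def extract_ena_ids(xml_text):
    """Return (accession, ENA record id, ENA protein id) for each EMBL cross-reference."""
    root = ET.fromstring(xml_text)
    hits = []
    for entry in root.findall("up:entry", UNIPROT_NS):
        accession = entry.findtext("up:accession", namespaces=UNIPROT_NS)
        # Keep only cross-references that point to EMBL/ENA.
        for ref in entry.findall("up:dbReference[@type='EMBL']", UNIPROT_NS):
            prop = ref.find("up:property[@type='protein sequence ID']", UNIPROT_NS)
            protein_id = prop.get("value") if prop is not None else None
            hits.append((accession, ref.get("id"), protein_id))
    return hits


def ena_fasta_url(ena_id):
    """Build a download URL; the ENA Browser API pattern here is an assumption."""
    return "https://www.ebi.ac.uk/ena/browser/api/fasta/" + ena_id


if __name__ == "__main__":
    for accession, record_id, protein_id in extract_ena_ids(SAMPLE_XML):
        print(accession, record_id, protein_id, ena_fasta_url(protein_id))
```

The sketch skips schema validation entirely; the PR validates the XML against the schema from the Uniprot website before extracting the ids.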


fasnicar · 2022-05-18T07:40:49Z

Thanks Alex for this PR.
I tried running the new version of phylophlan_setup_database.py adding the xmlschema package (version 1.10.0 from conda-forge) to my conda env. However, I'm getting the following error:

Traceback (most recent call last):
  File "./phylophlan_setup_database.py", line 25, in <module>
    import xmlschema
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/xmlschema/__init__.py", line 14, in <module>
    from .resources import normalize_url, normalize_locations, fetch_resource, \
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/xmlschema/resources.py", line 23, in <module>
    from elementpath import iter_select, XPathContext, XPath2Parser
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/elementpath/__init__.py", line 18, in <module>
    from .exceptions import ElementPathError, MissingContextError, \
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/elementpath/exceptions.py", line 12, in <module>
    from .tdop import Token
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/elementpath/tdop.py", line 405, in <module>
    class Parser(Generic[TK_co], metaclass=ParserMeta):
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

and I'm not 100% sure how to fix it. Do you have any idea?

alexhbnr · 2022-05-18T08:46:37Z

Which exact version of Python are you using on your system, Francesco? I get different results for different versions of Python 3.6, but of course not the same one as you.

fasnicar · 2022-05-18T10:27:19Z

I have the 3.6.15 from conda-forge (hb7a2778_0_cpython).

alexhbnr · 2022-05-18T15:02:13Z

OK, when I create a fresh Python 3.6.15 conda environment and install xmlschema, I can import it without any issues. I only get an error with Python 3.6.0 itself. I will dig a bit further over the next few days to find out what's going on there.

alexhbnr · 2023-03-06T13:20:43Z

Hi @fasnicar,

I am very sorry for the long hiatus. It got lost in my long list of to-dos.

I pulled all the recent changes that you added to v3.0.3 into this PR. I installed the latest version of PhyloPhlAn v3.0.3 via conda/mamba into a new environment using the following command: mamba create -n phylophlan_uniprot_test -c bioconda phylophlan=3.0.3

Afterwards, I installed the changes of this PR using pip3: pip3 install -U git+https://github.com/alexhbnr/phylophlan@uniprot_nuclseq

The pip command installed the Python package xmlschema v2.2.2 and elementpath v4.0.1. When I ran phylophlan_setup_database -h, I didn't get any error message. However, conda/mamba automatically pulled Python version 3.11, and not v3.6 for which you saw the error.
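To compare environments when reproducing this, a small sketch like the following could report the interpreter and package versions on each system. The helper name `report_versions` is hypothetical. It relies on `importlib.metadata`, which is part of the standard library from Python 3.8 on; on 3.6 or 3.7 it falls back to the `importlib_metadata` backport, which would have to be installed separately.

```python
import sys

try:
    from importlib import metadata  # standard library from Python 3.8 on
except ImportError:
    import importlib_metadata as metadata  # backport; assumed installed on 3.6/3.7


def report_versions(packages=("xmlschema", "elementpath")):
    """Return the Python version and the installed version of each package (None if absent)."""
    versions = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(name, version)
```

Running it in both the failing and the working environment would show whether the two differ in Python, xmlschema, or elementpath versions.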

Would you have time to check this PR once more on your system?

Add option to generate a nucl database from the core proteins

dc3cf5a

Instead of downloading the protein sequences directly from Uniprot, this adds the possibility to retrieve the nucleotide sequences from ENA via metadata stored in XML format.

fasnicar self-requested a review May 17, 2022 12:34

Remove redundant period in output filenames

95a09e4

Merge branch 'master' into uniprot_nuclseq

ecde166

Update URL for XML schema and escape deprecated protein ids

2a576c0
