Add option to generate a nucl database from the Uniprot core proteins #89
base: master
Instead of downloading the protein sequences directly from Uniprot, this adds the possibility to retrieve the nucleotide sequences from ENA via metadata stored in XML format.
Thanks Alex for this PR.
and I'm not 100% sure how to fix it. Do you have any idea?
Which exact version of Python are you using on your system, Francesco? I get different results for different versions of Python 3.6, but of course not the same one as you.
I have the 3.6.15 from conda-forge (
OK, when I create a fresh Python 3.6.15 conda environment and install xmlsearch, I can import it without any issues. I only get an error with 3.6.0 itself. I will dig a bit further over the next few days to see what's going on there.
Hi @fasnicar, I am very sorry for the long hiatus. It got lost in my long list of to-dos. I pulled all the recent changes that you added to v3.0.3 into this PR. I installed the latest version of PhyloPhlAn v3.0.3 via conda/mamba into a new environment using the following command: Afterwards, I installed the changes of this PR using The pip command installed the Python package Would you have time to check this PR once more on your system?
Instead of downloading the protein sequences directly from Uniprot, this adds the possibility to retrieve the corresponding nucleotide sequences from ENA via metadata stored in XML format.
It iterates over the same input files that are required for retrieving amino acid sequences from Uniprot. However, instead of downloading the FASTA file directly, it downloads the XML file from the Uniprot server. The XML file is parsed using an XML schema provided by the Uniprot website, then the ENA accession IDs of the nucleotide sequences are extracted and the FASTA sequences are downloaded.
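
To make the extraction step concrete, here is a minimal sketch of how ENA accession IDs could be pulled out of a Uniprot XML record. It is not the code from this PR: it uses the standard library's `xml.etree.ElementTree` instead of the schema-based parser, and it assumes that the relevant cross-references appear as `dbReference` elements with `type="EMBL"` carrying a `property` of type `protein sequence ID`. The helper names, the sample record (including its accessions), and the ENA Browser API URL pattern are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumed ENA Browser API URL pattern; not taken from this PR.
ENA_FASTA_URL = "https://www.ebi.ac.uk/ena/browser/api/fasta/{accession}"


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_ena_accessions(xml_text):
    """Collect protein sequence IDs from EMBL cross-references, in document order."""
    root = ET.fromstring(xml_text)
    accessions = []
    for elem in root.iter():
        # Only EMBL cross-references are of interest here.
        if local_name(elem.tag) != "dbReference" or elem.get("type") != "EMBL":
            continue
        for prop in elem:
            if (local_name(prop.tag) == "property"
                    and prop.get("type") == "protein sequence ID"):
                value = prop.get("value")
                # Skip empty or placeholder values and avoid duplicates.
                if value and value != "-" and value not in accessions:
                    accessions.append(value)
    return accessions


def ena_fasta_url(accession):
    """Build the (assumed) download URL for one ENA accession."""
    return ENA_FASTA_URL.format(accession=accession)


# Invented sample record, shaped like a Uniprot XML entry.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00001</accession>
    <dbReference type="EMBL" id="AB000001">
      <property type="protein sequence ID" value="BAA00001.1"/>
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="PDB" id="1ABC"/>
  </entry>
</uniprot>"""

if __name__ == "__main__":
    for acc in extract_ena_accessions(SAMPLE_XML):
        print(acc, ena_fasta_url(acc))
```

Matching on the local tag name keeps the sketch independent of the exact XML namespace. Validating against the official schema, as this PR does, is stricter and would catch malformed records that this approach silently skips.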