Generating sample table when updating database with MAGs - GTDB taxID and Accession Number #40
Comments
You should get the GTDB taxids via https://github.com/shenwei356/gtdb-taxdump. I used that taxdump for setting taxids in GTDB-r207.
Hi Nick, Thanks for your response. A few things are still not clear to me. I have the GTDB taxids for r207, obtained through the link above, but it's not clear how I generate taxids for my own MAGs. I used the command below, which I found here: gtdb_to_taxdump.py
This shows a taxid in the output file taxID_info.tsv. Many thanks, P
You could go from NCBI taxids for each of your MAGs to GTDB taxids, via Another approach is getting the GTDB taxids directly from the GTDB taxdump created by https://github.com/shenwei356/gtdb-taxdump. You would probably need to create your own script for this, however. The process would likely be
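One way such a script could start, as a minimal sketch and not necessarily the process meant above: read the `names.dmp` file from the GTDB taxdump and build a name-to-taxid lookup. This assumes the taxdump follows the NCBI `names.dmp` layout (fields separated by `\t|\t`) and stores names without the `s__`-style rank prefixes. The function name `parse_names_dmp` and the taxids in the example are hypothetical.

```python
def parse_names_dmp(lines):
    """Build a {scientific name: [taxid, ...]} dict from NCBI-style names.dmp lines."""
    name_to_taxids = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the trailing field terminator
        fields = line.split("\t|\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        taxid, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            # Keep a list: the same name can occur at more than one rank
            name_to_taxids.setdefault(name, []).append(int(taxid))
    return name_to_taxids


# Hypothetical names.dmp excerpt; these taxids are made up for illustration.
example_lines = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "100\t|\tGCA-2688965\t|\t\t|\tscientific name\t|\n",  # family
    "101\t|\tGCA-2688965\t|\t\t|\tscientific name\t|\n",  # genus
    "102\t|\tGCA-2688965 sp002688965\t|\t\t|\tscientific name\t|\n",
]
name_to_taxids = parse_names_dmp(example_lines)
print(name_to_taxids["GCA-2688965 sp002688965"])  # [102]
print(name_to_taxids["GCA-2688965"])              # [100, 101]
```

With a real taxdump, an open file handle can be passed in place of the list, for example `with open("names.dmp") as fh: name_to_taxids = parse_names_dmp(fh)`. Because GTDB placeholder names can be shared between ranks (the family and genus in the lineage quoted in this thread are both `GCA-2688965`), a name alone may map to several taxids; telling them apart would need the rank information in `nodes.dmp`.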
Hi @nick-youngblut, I am experiencing a similar issue. I have a GTDB-Tk output file with GTDB taxonomies, but I don't have any TaxIDs. How do I go about this step of Thank you for any assistance you can provide.
This is what I ended up doing:

    # Assumes `df` is a pandas DataFrame with a "gtdb_classification" column.
    # Create a lineage dataframe (one column per rank) based on that column.
    # This is what a cell looks like: 'd__Archaea;p__Aenigmatarchaeota;c__Aenigmatarchaeia;o__GW2011-AR5;f__GCA-2688965;g__GCA-2688965;s__GCA-2688965 sp002688965'
    # Use df.set_index("genome")["gtdb_classification"] instead to keep the genome IDs as the index.
    gtdb_lineages = df["gtdb_classification"].str.split(";", expand=True)

    # Write a function to extract the scientific name
    def get_sci_name_from_row(row):
        """Read a pandas Series (a row) and return the lowest-level scientific name."""
        # Walk the ranks from species up to domain and return the first non-empty name
        for value in reversed(row.to_list()):
            # Trim the 3-character rank prefix (e.g. 's__'); if the rank is
            # missing or unclassified, fall through to the next higher rank
            if isinstance(value, str) and (value_fmt := value[3:]):
                return value_fmt
        return None

    # Export to a text file (no header row, so every line is "index,name")
    gtdb_lineages.apply(get_sci_name_from_row, axis=1).to_csv("scinames.csv", header=False)

Now I run that with TaxonKit:

    cut -f 2 -d , scinames.csv | taxonkit name2taxid > taxids.csv

This gives me a text file with the TaxIDs from the custom GTDB taxdump. I hope it helps. Best,
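The fallback to a higher rank matters for novel MAGs, where GTDB-Tk leaves the lower ranks empty. The following standalone sketch (plain Python, no pandas) illustrates that behaviour on single lineage strings and also reports which rank the name came from. The function name `lowest_classified_name` and the second, truncated lineage are made up for illustration.

```python
RANK_PREFIXES = ("d__", "p__", "c__", "o__", "f__", "g__", "s__")


def lowest_classified_name(lineage):
    """Return (rank_letter, name) for the lowest classified rank, or None."""
    # Walk from species up to domain and stop at the first non-empty name
    for taxon in reversed(lineage.split(";")):
        taxon = taxon.strip()
        prefix, name = taxon[:3], taxon[3:]
        if prefix in RANK_PREFIXES and name:
            return prefix[0], name
    return None


full = ("d__Archaea;p__Aenigmatarchaeota;c__Aenigmatarchaeia;o__GW2011-AR5;"
        "f__GCA-2688965;g__GCA-2688965;s__GCA-2688965 sp002688965")
# Hypothetical novel MAG classified only down to family level
novel = ("d__Archaea;p__Aenigmatarchaeota;c__Aenigmatarchaeia;o__GW2011-AR5;"
         "f__GCA-2688965;g__;s__")

print(lowest_classified_name(full))   # ('s', 'GCA-2688965 sp002688965')
print(lowest_classified_name(novel))  # ('f', 'GCA-2688965')
```

For a MAG like the second one, a name-based lookup returns the taxid of the family, not a species-level taxid, so several novel MAGs can end up sharing one taxid.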
Hi there,
I am trying to update the GTDB-r207 database I downloaded using Struo2 with my own MAGs. It is not clear how I get some of the required information, including "ncbi_organism_name", "gtdb_taxid" and "accession".
I have annotated my MAGs using GTDB-Tk. Using FastANI, I de-replicated my genomes at 95% ANI, which left me with ~3000 MAGs. Given that these MAGs are not close to any other genome in GTDB, I don't understand how I can get a taxid. I have attached the information I currently have from GTDB about my MAGs.
GTDB_MAG_Information.txt
Your help is greatly appreciated.
Kind regards,
P