ValueError: invalid literal for int() with base 10: 'protein' #13
Comments
Hi Paula,

Thanks for bringing this error to my attention. I have been aware of this bug and am currently considering possible fixes. The header format in your FASTA files is currently not compatible with FeGenie. FeGenie was designed to expect prodigal-formatted headers, where each header ends in an underscore followed by a number (e.g. contig_1 or contig_00001). FeGenie uses that number to keep track of where each ORF is encoded relative to other potential iron-related genes. If your header instead ends in an underscore followed by a non-numeric value (such as 'protein'), FeGenie tries to convert that value to an integer and fails, causing the program to crash. Does this make sense?

One potential fix is to require users to make sure that their provided gene sequences have headers formatted this way. Would you be able to reformat your FASTA amino acid file this way?

Another potential fix is to allow users to provide headers that are not formatted this way, but forgo the relative genomic localization feature of FeGenie. In this case, however, you would see a lot of false positives in your output files. This is because FeGenie uses the known operon structures of iron-related pathways to rule out false-positive hits to iron gene HMMs that may be part of broader gene families but are not necessarily involved in iron-related processes. Genomic context is an important component of FeGenie's identification of iron genes, so I hesitate to make this second option available.

Let me know if you have any thoughts on this, or other questions.

Thanks,
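To make the failure described above concrete, here is a minimal sketch of the kind of parsing involved. It is not FeGenie's actual code: the function name `orf_index` is hypothetical, and the third header is a made-up example chosen only because it ends in `_protein`.

```python
def orf_index(header: str) -> int:
    """Return the integer after the final underscore of a FASTA header.

    Hypothetical helper for illustration; not FeGenie's real implementation.
    """
    # Take whatever follows the last underscore in the header.
    suffix = header.rsplit("_", 1)[-1]
    # int() raises ValueError when the suffix is not a number.
    return int(suffix)


print(orf_index("contig_1"))      # 1
print(orf_index("contig_00001"))  # 1

try:
    # Made-up annotated header ending in a word instead of a number.
    orf_index("unbinned_NHGMMMNG_144129_hypothetical_protein")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: 'protein'
```

Prodigal-style headers pass because their last underscore-delimited token is numeric. An annotated header fails because its last token is part of the annotation.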
Dear Arkadiy,

Thanks for your answer! I can use contig FASTA files and later figure out which genes FeGenie identified in my prokka-annotated amino acid FASTA files. That is a bit more work, but I want to use the prokka annotations for this and other analyses (and later for genome submissions).

As for FeGenie, I see the importance of genomic context. I am not sure how feasible this is, but one idea is to let users specify how contigs and genes are designated in their input files via additional command line options, for instance -contigs letters -genes numbers, maybe with a few restrictions (for example, gene numbers have to be between underscores). FeGenie would then ignore whatever comes next (i.e. an annotation).

Best,
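The idea of a gene number "between underscores", with everything after it ignored, could look roughly like the sketch below. This is an assumption about how such parsing might work, not an existing FeGenie option: `split_header` and `GENE_NUMBER` are hypothetical names, and the rule "first all-digit token delimited by underscores" is one possible restriction among several.

```python
import re

# First run of digits preceded by an underscore and followed by an
# underscore or the end of the header.
GENE_NUMBER = re.compile(r"_(\d+)(?=_|$)")


def split_header(header: str) -> tuple[str, int, str]:
    """Split a header into (name, gene number, trailing annotation).

    Hypothetical helper; assumes the name before the gene number
    contains no purely numeric underscore-delimited token.
    """
    match = GENE_NUMBER.search(header)
    if match is None:
        raise ValueError(f"no underscore-delimited gene number in {header!r}")
    name = header[: match.start()]
    gene = int(match.group(1))
    # Everything after the gene number is treated as annotation.
    annotation = header[match.end():].lstrip("_")
    return name, gene, annotation


print(split_header("unbinned_NHGMMMNG_144129_2E-6E-farnesyl_diphosphate_synthase"))
print(split_header("contig_1"))
```

The main caveat is the assumption in the docstring. A contig name that itself contains a numeric token between underscores would be split in the wrong place, which is why some restriction on the header layout would still be needed.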
Hi Paula,

That sounds like a reasonable approach (using contigs, and then figuring out the prokka ORFs corresponding to the genes identified by FeGenie). It does sound like extra work, but it is definitely doable.

I also like the idea of having additional options that users can use to specify contig names. I tried coding this into the script, which I just uploaded to GitHub. Give it a try and see if it works for you. Assuming all headers are like the one you pasted in the above comment: "unbinned_NHGMMMNG_144129_2E-6E-farnesyl_diphosphate_synthase", if you add the flag

One issue that I foresee is that if you have the prokka-generated name (NHGMMMNG) assigned to all contigs, and the genes numbered sequentially, then you might have cases where two genes appear to be encoded adjacent to each other but are actually on the ends of different contigs. If your assembly isn't too fragmented, this might not be much of an issue, as long as you are aware of the possibility and confirm that the genes in each FeGenie-identified cluster are, indeed, all on the same contig.

Let me know if you have any other issues or questions. I didn't have time to test the new FeGenie script that I just uploaded to GitHub, so if it generates an error, please let me know.

Thanks!
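The check suggested above, confirming that every gene in a reported cluster sits on the same contig, could be sketched as follows. Everything here is hypothetical: the gene numbers and contig names are invented, and the gene-to-contig mapping is assumed to come from somewhere outside FeGenie (for example, the annotation files for the assembly).

```python
def contigs_in_cluster(cluster: list[int], gene_to_contig: dict[int, str]) -> set[str]:
    """Return the set of contigs that the genes of one cluster fall on."""
    return {gene_to_contig[gene] for gene in cluster}


# Invented example: genes are numbered sequentially across contigs,
# so 102 and 103 look adjacent but lie on different contigs.
gene_to_contig = {101: "contig_A", 102: "contig_A", 103: "contig_B"}

for cluster in ([101, 102], [102, 103]):
    contigs = contigs_in_cluster(cluster, gene_to_contig)
    status = "ok" if len(contigs) == 1 else "spans multiple contigs"
    print(cluster, sorted(contigs), status)
```

A cluster that maps to more than one contig is only an apparent neighbourhood created by sequential numbering and should be treated with caution.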
Dear Arkadiy,
I am getting the error below - could you help me understand what is going on and how to fix this?
I do get output files such as FinalSummary.csv and a .csv file for each category (e.g. iron_reduction-summary.csv).
The commands I am running:
The input file is a prodigal-generated and prokka-annotated fasta amino acid file.
Example of sequence in this file:
Here is the full slurm output file:
slurm-835700.txt
Thank you very much for your help!
Paula