MSG: asking for tag value that does not exist locus_tag #6

Open
mujiezhang opened this issue Sep 1, 2022 · 6 comments

Comments

@mujiezhang

When I run the script "Identify_candidate_fragments_from_gbk.pl", I get the following exception:
------------- EXCEPTION -------------
MSG: asking for tag value that does not exist locus_tag
STACK Bio::SeqFeature::Generic::get_tag_values /dssg/home/acct-clsjhh/clsjhh/anaconda3/envs/inovirus_detector/lib/perl5/site_perl/Bio/SeqFeature/Generic.pm:604
STACK toplevel /dssg/home/acct-clsjhh/clsjhh/zmj/software/inovirus_detector/srouxjgi-inovirus-dfc3d5c3b1ac/Inovirus_detector/Identify_candidate_fragments_from_gbk.pl:120
What is the problem? Thanks a lot.

@simroux
Owner

simroux commented Sep 2, 2022

Hi,

This looks like a format issue in the GenBank file you are using as input, as a tag is missing ("locus_tag" should be in there). My recommendation would be to run the pipeline from a FASTA file of the same sequence instead.

If you are already doing this, then I am not sure what is happening and would need to see the output folder.

Best,
Simon
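
The exception in the stack trace comes from `get_tag_values` being asked for a `locus_tag` that a feature does not carry. A minimal sketch, assuming a standard GenBank flat-file layout (feature keys indented by 5 spaces, qualifiers by 21), of how one might list the features that lack the tag before running the pipeline. The function and helper names are illustrative and are not part of Inovirus_detector, and the choice of `CDS` as the feature type to check is an assumption:

```python
import re

# Feature-key lines: 5 spaces, the key, then the location.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S+)")
# Qualifier lines: 21 spaces, then /name or /name=value.
QUALIFIER_LINE = re.compile(r"^ {21}/(\w+)")


def features_missing_tag(gbk_text, feature_key="CDS", tag="locus_tag"):
    """Return locations of `feature_key` features that carry no `tag` qualifier."""
    missing = []
    current = None  # [key, location, has_tag] for the feature being read
    in_table = False

    def close(feat):
        # Record the finished feature if it is of interest and lacks the tag.
        if feat and feat[0] == feature_key and not feat[2]:
            missing.append(feat[1])

    for line in gbk_text.splitlines():
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table:
            continue
        if line and not line.startswith(" "):
            # ORIGIN, CONTIG or // ends the feature table of this record.
            close(current)
            current, in_table = None, False
            continue
        key_match = FEATURE_LINE.match(line)
        if key_match:
            close(current)
            current = [key_match.group(1), key_match.group(2), False]
            continue
        tag_match = QUALIFIER_LINE.match(line)
        if tag_match and current and tag_match.group(1) == tag:
            current[2] = True
    close(current)
    return missing


def feature_line(key, location):
    """Format a feature-key line in the flat-file column layout."""
    return " " * 5 + key.ljust(16) + location


def qualifier_line(text):
    return " " * 21 + text


# Small made-up record: the second CDS has no /locus_tag.
example = "\n".join([
    "LOCUS       demo 100 bp DNA",
    "FEATURES             Location/Qualifiers",
    feature_line("source", "1..100"),
    qualifier_line('/organism="demo"'),
    feature_line("CDS", "1..30"),
    qualifier_line('/locus_tag="DEMO_0001"'),
    feature_line("CDS", "complement(40..90)"),
    qualifier_line('/product="hypothetical protein"'),
    "ORIGIN",
    "//",
])

print(features_missing_tag(example))  # ['complement(40..90)']
```

On a real file, the text would be read from disk first and passed in as a string. Only the first line of a multi-line location is reported, and a wrapped qualifier value whose continuation line happens to start with `/` would be misread, so this is a quick diagnostic rather than a full parser.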

@mujiezhang
Author

Thanks for your suggestion!

  1. Do you mean that I can use the FASTA file to generate a new gbk file and run the pipeline again?
  2. I have another question. Take Shewanella WP3 as an example. Its chromosome contains an inovirus, SW1, which has been isolated and sequenced; its genome is about 7.7 kb long and its att site is longer than 10 bp. inovirus_detector succeeds in finding the main genome of SW1, but the predicted region is 10.6 kb long and has no att site. In addition, I tried three gbk files of WP3: one from GenBank, one from RefSeq, and one generated by Prokka. The predicted lengths are 10.6 kb, 10.8 kb, and 11.9 kb, respectively. Compared with the true length of 7.7 kb, the 11.9 kb prediction is about 4 kb too long, about 50% of the true genome, which may seriously affect downstream analysis. Do you have any suggestions?
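
The size of the discrepancy can be worked out directly from the figures quoted above. A minimal sketch, using only the lengths reported in this comment (the function name `excess` is illustrative):

```python
TRUE_LENGTH_KB = 7.7  # SW1 genome length reported in the thread

# Predicted lengths reported for the three annotations of WP3.
predicted_kb = {"GenBank": 10.6, "RefSeq": 10.8, "Prokka": 11.9}


def excess(predicted, reference):
    """Return (extra length in kb, extra length as % of the reference)."""
    extra = predicted - reference
    return round(extra, 1), round(100 * extra / reference, 1)


for source, length in predicted_kb.items():
    extra_kb, extra_pct = excess(length, TRUE_LENGTH_KB)
    print(f"{source}: +{extra_kb} kb ({extra_pct}% of the true length)")
# GenBank: +2.9 kb (37.7% of the true length)
# RefSeq: +3.1 kb (40.3% of the true length)
# Prokka: +4.2 kb (54.5% of the true length)
```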

@simroux
Owner

simroux commented Sep 3, 2022

  1. Yes, you can use Identify_candidate_fragments_from_fna.pl instead of Identify_candidate_fragments_from_gbk.pl in the first step (see https://github.com/simroux/Inovirus/tree/master/Inovirus_detector#example-with-a-fasta-file-as-input)

  2. I am not sure I understand the question. This set of scripts is an automated inovirus detector, but it is expected that the exact boundaries will not always be found. The tool in this case indicates that it could not identify att sites, so any boundary should be interpreted as "possible" at best. If there are better / refined coordinates for this inovirus, these should be used. And as you mention, any analysis (and specifically results interpretation) should be careful and always take into consideration that prophage boundaries were identified by an automatic tool and are thus likely to include some errors.
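
When no att site is reported, one manual check is to look for a short direct repeat shared by the two flanks of the predicted region, since att sites typically appear as direct repeats on either side of an integrated prophage. The sketch below is only an illustration of that idea; it is not how Inovirus_detector searches for att sites (the thread does not describe that), and the function name, the flank sequences, and the 10 bp minimum are assumptions:

```python
def shared_direct_repeats(left_flank, right_flank, min_len=10):
    """Return (left_pos, right_pos, repeat) for maximal exact repeats of at
    least min_len bases present in both flanks, longest first (0-based)."""
    # Index every min_len-mer of the left flank by its position.
    seeds = {}
    for i in range(len(left_flank) - min_len + 1):
        seeds.setdefault(left_flank[i:i + min_len], []).append(i)

    hits = set()
    for j in range(len(right_flank) - min_len + 1):
        for i in seeds.get(right_flank[j:j + min_len], []):
            # A seed preceded by a matching base lies inside a longer match
            # that is reported from its leftmost seed instead.
            if i > 0 and j > 0 and left_flank[i - 1] == right_flank[j - 1]:
                continue
            # Extend the seed to the right while the flanks still agree.
            n = min_len
            while (i + n < len(left_flank) and j + n < len(right_flank)
                   and left_flank[i + n] == right_flank[j + n]):
                n += 1
            hits.add((i, j, left_flank[i:i + n]))
    return sorted(hits, key=lambda h: (-len(h[2]), h[0], h[1]))


# Made-up flanks sharing one 13 bp repeat.
left = "TTTT" + "ACGTACGGATCCA" + "GGGG"
right = "CCCC" + "ACGTACGGATCCA" + "AAAA"
print(shared_direct_repeats(left, right))  # [(4, 4, 'ACGTACGGATCCA')]
```

Only exact repeats are found, and short repeats can occur by chance in large windows, so any hit is a candidate to inspect rather than a confirmed att site.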

@mujiezhang
Author

  1. Thanks for your suggestion. The Identify_candidate_fragments_from_fna.pl script was not in https://bitbucket.org/srouxjgi/inovirus/src/master/Inovirus_detector/, so I did not notice it.
  2. Let me describe the question more clearly. It is reasonable that this set of scripts will not always find the exact boundaries. My question is: for the same bacterial genome, if I use the Identify_candidate_fragments_from_gbk.pl script to predict inoviruses from the gbk file from GenBank, I get prediction 1, and if I use the Identify_candidate_fragments_from_fna.pl script to predict inoviruses from the fna file from GenBank, I get prediction 2. But prediction 1 is always different from prediction 2. I guess this is due to differences between the protein prediction tools. Do you have any suggestions for getting a more reasonable result?
    Thank you very much for your time!
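
One way to make the difference between two predictions concrete is to compare their coordinates directly. A minimal sketch, assuming each prediction has been reduced to a (start, end) pair in 1-based inclusive coordinates on the same contig. The function name is illustrative, and the coordinates below are hypothetical and are not taken from the WP3 runs:

```python
def compare_predictions(a, b):
    """Compare two (start, end) predictions in 1-based inclusive coordinates."""
    (a_start, a_end), (b_start, b_end) = a, b
    # Number of bases covered by both predictions (0 if they are disjoint).
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
    # Number of bases covered by at least one prediction.
    union = (a_end - a_start + 1) + (b_end - b_start + 1) - overlap
    return {
        "start_shift": b_start - a_start,
        "end_shift": b_end - a_end,
        "overlap_bp": overlap,
        "jaccard": round(overlap / union, 3),
    }


# Hypothetical coordinates, for illustration only.
prediction_1 = (120_000, 130_599)  # 10,600 bp
prediction_2 = (119_500, 131_399)  # 11,900 bp

print(compare_predictions(prediction_1, prediction_2))
# {'start_shift': -500, 'end_shift': 800, 'overlap_bp': 10600, 'jaccard': 0.891}
```

A high overlap with shifted ends suggests that both runs found the same element and disagree only on where it stops, which is the situation described above.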

@simroux
Owner

simroux commented Sep 5, 2022

  1. Yes, this is something we added later on, but we kept the Bitbucket repo exactly as it was when the manuscript was published.
  2. I am not sure what a "reasonable" result is here. I think you are correct: different tools produce different gene predictions, which lead to different predicted boundaries. When you have an experimentally validated prophage, those boundaries should be used and not the predicted ones. Without experimental validation, there is often no obvious way to pick which prediction is the correct one. One possible option is to look at the read data to see if you can identify reads spanning the prophage insertion site (see e.g. https://doi.org/10.1186/s40168-021-01033-w), or to look for a similar bacterium without the prophage (https://doi.org/10.1093/nar/gkaa156 - see Fig. 2). It is possible, however, that none of this works and that there is no way to tell for sure what the exact boundaries of this element are.

@mujiezhang
Author

I got it. Thank you very much for your kind help!
