What are the differences between the inovirus types in the result file? #7

Open
mujiezhang opened this issue Sep 22, 2022 · 4 comments

@mujiezhang

mujiezhang commented Sep 22, 2022

Hi, I have four more questions.

  1. You defined five types of inovirus in the result files: prophage, complete, integrase, DR, tRNA. But the "complete" type seems to have no att sites. Does "complete" mean only circular contigs? And what is a circular inovirus here, and how can circular inoviruses be identified technically?
  2. The lengths of some direct repeats are several thousand bp. Is that possible? Do they need to be corrected?
  3. I ran the tool on the inovirus Acholeplasma phage MV-L51 as a test, but I got four predictions, as below:
    GCA_000847085.1_ViralProj14573_genomic_frag_X58839_1_2_annot_inovirus-predictions-refined.csv
    GCA_000847085.1_ViralProj14573_genomic_frag_X58839_1_4_annot_inovirus-predictions-refined.csv
    GCA_000847085.1_ViralProj14573_genomic_frag_X58839_1_7_annot_inovirus-predictions-refined.csv
    GCA_000847085.1_ViralProj14573_genomic_frag_X58839_1_8_annot_inovirus-predictions-refined.csv
    These four inovirus regions are the same. It may be related to the BLAST hits to Marker_ALV1.faa. So are these four proteins in Marker_ALV1.faa markers of pI?
  4. In other cases, I got two different predicted inoviruses in the same bacterial genome, but the two inovirus regions overlap. What is the reason? And which one should I choose?

Thanks for your kind reply!

@simroux
Owner

simroux commented Sep 23, 2022

Hi,

  1. Sorry for the confusion: "complete" here refers to the contig, not the genome, i.e. the inovirus region spans the entire ("complete") contig, and there is no host region detected upstream or downstream.

  2. These are likely not att sites, but could be artifacts from the assembly (i.e. the genome would still be complete). Or it could be that this prophage is in a region with many repeats (not unusual either). The tool only tries to find the "best" repeat possible, and if there is no repeat of a reasonable size but there are some long repeats, it will use the long ones. These should certainly be individually curated if one is looking for exact prophage insertion sites.

  3. This is likely because ALV1 is actually very divergent, and thus detected by blast against 4 marker genes. It is exceedingly rare to even have one hit to these markers, so that's why we prefer to use 4. Of course, in the case of ALV1 itself, each of the 4 markers leads to a candidate region, and each region is then extended to give an identical prediction.

  4. It is not uncommon for inoviruses to be inserted in tandem, and the tool can have difficulties distinguishing the exact boundaries. It is also not uncommon for some inovirus prophages to be incomplete, and inserted right next to complete prophages. If you have two distinct pI proteins, my interpretation is usually that these are two distinct prophages, and both should be treated independently.

Best,
Simon
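
As a rough illustration of points 3 and 4 above, the sketch below collapses identical predictions (as in the ALV1 case) and flags pairs of predictions that overlap on the same contig so they can be inspected by hand. It is not part of the tool: the `(contig, start, end)` tuples, the contig names, and the function name `collapse_and_find_overlaps` are hypothetical, and the coordinates would have to be taken from the prediction files by whatever means suits their actual layout. Overlapping regions are only reported, not merged, since two distinct pI proteins usually indicate two distinct prophages.

```python
from itertools import combinations


def collapse_and_find_overlaps(predictions):
    """Collapse identical predictions and list overlapping pairs.

    predictions: iterable of (contig, start, end) tuples with inclusive
    coordinates. This tuple layout is an assumption made for the sketch,
    not the format of the tool's result files.
    """
    # Identical (contig, start, end) tuples collapse into one entry.
    unique = sorted(set(predictions))

    overlaps = []
    for a, b in combinations(unique, 2):
        same_contig = a[0] == b[0]
        # Two inclusive intervals overlap when each starts before the other ends.
        if same_contig and a[1] <= b[2] and b[1] <= a[2]:
            overlaps.append((a, b))
    return unique, overlaps


# Hypothetical coordinates, for illustration only.
example = [("contig_1", 100, 7000)] * 4 + [
    ("contig_2", 500, 8000),
    ("contig_2", 7500, 15000),
    ("contig_2", 20000, 27000),
]
unique, overlaps = collapse_and_find_overlaps(example)
for a, b in overlaps:
    print("check manually:", a, b)
```

Pairs reported here are candidates for manual curation (for example, checking whether each region carries its own pI protein), not an automatic decision about which prediction to keep.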

@mujiezhang
Author

mujiezhang commented Sep 23, 2022

Thank you very much for your helpful and professional answers.

Still, I have some questions.

For answer 1: So "complete" in the result file means that the whole contig is an inovirus, while "complete inovirus" in your paper means circular inoviruses and integrated prophages with canonical att sites. How can I pick out the circular inoviruses from the result file? Or how can circular inoviruses be identified technically? You did not mention this in the paper's methods or the supplementary document.

For answer 3: As the pI marker of ALV1 has already been included in the Final_marker_morph.hmm file, why did you include the other three proteins of ALV1 in the Marker_ALV1.faa file? What are their functions? Why can they serve as markers? If the reason is that they are rare, that does not make sense to me, because there are also many singletons in other inoviruses.

For answer 4: For tandem insertions, it is possible to treat the multiple pI proteins independently. And tandem insertions could lead to clusters gathering multiple species. So if I want to cluster different inoviruses into species, the tandem insertions should be processed separately, as you mentioned in your paper. But you did not treat the potential multiple inoviruses within tandem insertions independently. Is that because it is hard to find the boundaries between them?

Question: As you mentioned in the supplementary document, sub-optimal genome assemblies yielding short contigs (i.e. < 5kb) will lead to a large number of false negatives, as these short contigs will not include enough information to identify them as putative inoviruses. And there are also many inoviruses shorter than 5 kb in your inovirus database. Do you have any suggestions for a manual inspection step?

Thanks for your kind reply!

@simroux
Owner

simroux commented Sep 25, 2022

  1. Circular inoviruses will be called "circular" in the tool. They can be detected using standard approaches for detecting circular contigs, i.e. by identifying direct terminal repeats (see www.nature.com/articles/nbt.4306 and https://www.nature.com/articles/s41587-020-00774-7).

  2. ALV1 genes are very rarely detected (in fact, virtually never). The fact that they are so specific to ALV1 means they can be used as markers.

  3. For tandem insertions, pI proteins should definitely be treated independently. We did in fact treat them separately in our paper, by visually inspecting all instances of multiple pI next to each other, and defining boundaries as well as possible. That being said, this also means that we need to be careful in our interpretation, i.e. any new pattern or observation that would be only based on sequences inserted in "tandem insertions" for which boundaries are uncertain should be treated with caution.

Extra question) I don't know that there is a way to manually inspect sequences < 5kb that would not be detected by the tool. You can certainly rule out a number of them because they will have bacterial genes that are unlikely to be encoded by an inovirus genome. But otherwise, you will have many cases where you will have only a pI-like protein and 1 or 2 hypothetical proteins around, and I don't think these are sufficient to determine whether this short contig would be an inovirus or not.
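
To make the direct-terminal-repeat idea from point 1 concrete, here is a minimal sketch, assuming a contig is treated as putatively circular when its first k bases are identical to its last k bases. It is not the tool's implementation nor the method of the cited papers: the function name `find_terminal_repeat` and the `min_len` and `max_len` defaults are assumptions chosen for illustration, and only exact matches are considered.

```python
def find_terminal_repeat(seq, min_len=10, max_len=1000):
    """Return the length of the longest exact direct terminal repeat.

    A direct terminal repeat of length k means seq[:k] == seq[-k:].
    Returns 0 when no repeat of at least min_len bases is found.
    min_len and max_len are illustrative assumptions, not tool defaults.
    """
    seq = seq.upper()
    # The two copies of the repeat must not overlap each other.
    upper = min(max_len, len(seq) // 2)
    for k in range(upper, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


# Toy contig: the same 14 bases at both ends, unrelated sequence in between.
repeat = "ATGCGTACGTTAGC"
middle = "GGGGGCCCCCTTTTTAAAAACCGGTTAA"
contig = repeat + middle + repeat
print(find_terminal_repeat(contig))  # 14
```

A very long terminal repeat found this way may still be an assembly artifact rather than evidence of a complete circular genome, so hits are best reviewed individually.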

@mujiezhang
Author

mujiezhang commented Sep 26, 2022

Thanks!!! Your suggestions are very helpful. Thanks very much!
