Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using non-prodigal gene calls #83

Open
jdwinkler-lanzatech opened this issue Sep 29, 2021 · 11 comments
Open

Using non-prodigal gene calls #83

jdwinkler-lanzatech opened this issue Sep 29, 2021 · 11 comments
Labels
enhancement New feature or request

Comments

@jdwinkler-lanzatech
Copy link

Hi,

I have a set of already called genes (not from prodigal) that I would like to use as input to inStrain. Is there any easy way to do this? Based on a previous issue (#56 specifically), it appears that inStrain depends on prodigal input specifically.

@MrOlm
Copy link
Owner

MrOlm commented Sep 29, 2021

Hello,

Yes, at the moment this is a pretty glaring limitation of inStrain. Fixing this has been near the top of my ToDo list for a while.

Can I ask what format your genes are in? There is experimental support for gene input in the form of genbank files, but that feature is not very well tested.

Best,
MO

@MrOlm MrOlm added the enhancement New feature or request label Sep 29, 2021
@jdwinkler-lanzatech
Copy link
Author

Nothing fancy, it's basically just a fasta file as follows:

>locus tag 1
ATG...
...
TAG

I'm not sure what information inStrain is extracting from the prodigal-do you need coordinates? Maybe the file + GFF annotations might work better than gene calls?

@MrOlm
Copy link
Owner

MrOlm commented Sep 29, 2021

Here is an example prodigal header:

>NC_000913_4 # 3734 # 5020 # 1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.528

The things inStrain picks up from that are:

  1. The scaffold, which is everything before the last _ (in this case the scaffold is NC_000913)

  2. The start location, which is after the first # (in this case 3734; note that prodigal uses 1-based indexing)

  3. The end location, which is after the second # (in this case 5020)

  4. The strand, which is after the thing # (which in this case is 1. The options are 1 or -1

If you add all that to your gene headers it should work while I try and add a better solution.

Best,
MO

Also just FYI, the specific function that does this parsing is parse_prodigal_genes in the file GeneProfile.py (https://github.com/MrOlm/inStrain/blob/master/inStrain/GeneProfile.py)

@jdwinkler-lanzatech
Copy link
Author

Okay, I can whip that up easily enough!

@jdwinkler-lanzatech
Copy link
Author

Unfortunately I get the same coverage KeyError described in #56. I'll see if I can figure out why.

@MrOlm
Copy link
Owner

MrOlm commented Oct 1, 2021

Let me know if this ends up remaining a problem. My guess is that the problem is that the coordinates of the genes are slightly different from how prodigal reports them. InStrain translates the gene into amino acid space based on the gene coordinates, so if they're off-by-one and it ends up with an amount of nucleotides that's not divisible by 3, that'll throw an error.

Let me know if this continues to be an issue, and at the very least I can update inStrain to throw the exception logs into the log file instead of STDOUT so we can at least know what the error code is.

-Matt

@jdwinkler-lanzatech
Copy link
Author

That gives me an idea, maybe I'm outputting the coordinates incorrectly. I'll check that and get back to you!

@nschan
Copy link

nschan commented Sep 5, 2022

Hi,
similar to @jdwinkler-lanzatech I would be interested in being able to use non-prodigal gene calls or annotations.
Following the discussion here, I took a look at GeneProfile.py and I noticed that there appears to be a function for parsing gbk files (https://github.com/MrOlm/inStrain/blob/master/inStrain/GeneProfile.py#L811), which is documented as "EXPERIMENTAL; the name of the gene must be in the gene qualifier". I am currently trying to use gbk files from bakta, which to me seem like standard gbk files, here is a sample:

     gene            779..1435
                     /locus_tag="FCBPMO_00010"
     CDS             779..1435
                     /db_xref="RefSeq:WP_056275225.1"
                     /db_xref="SO:0001217"
                     /db_xref="UniParc:UPI0006F201EE"
                     /db_xref="UniRef:UniRef100_A0A0Q9DN46"
                     /db_xref="UniRef:UniRef50_A0A0M2H4C6"
                     /db_xref="UniRef:UniRef90_A0A0Q9DN46"
                     /product="hypothetical protein"
                     /locus_tag="FCBPMO_00010"
                     /protein_id="gnl|Bakta|FCBPMO_00010"
                     /translation="MTEHLQTTAPSPTPIARILTFLHRAWVAELRVYSSIARAIARRPA
                     VPKGGTGIGYHRPVLTVLFIFIGLSAVEIPILDLIVHQWPVVRITLLVLGIWGVTWMVG
                     LLCAMLMRPHAVGPDGVRVRSGLELDVPIGWADIASIAISRRVDEPKQPRVIGSEYAER
                     MQDETNIEIELERPVAIRLPGLAPKGGTHQVDRIRLWADDPRAFLDAARPFLVAA"
                     /codon_start=1
                     /transl_table=11
                     /inference="ab initio prediction:Prodigal:2.6"
                     /inference="similar to AA sequence:RefSeq:WP_056275225.1"

But instrain profile returns the error

File "/usr/local/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 821, in parse_genbank_genes
      gene = feature.qualifiers[gene_name][0]
  KeyError: 'gene'

I am not completely sure what to make of it, should i replace /locus_tag with /gene?

Thank you


Amendment:
Replacing /locus_tag with /gene resolves the error above. However the issue described in #56 occurs now.
I am using the instrain container (quay.io/biocontainers/instrain:1.6.1--pyhdfd78af_0).

@MrOlm
Copy link
Owner

MrOlm commented Sep 9, 2022

Hi @nschan -

Thanks for reaching out and sorry for this mess. The current .gbk parsing is really weird, and really only works for specific weirdly-formatted .gbk files. I have re-written a good part of this code in a developmental version of inStrain that I'm working on but it's not quite ready yet.

I really hope to have this working and pushed to the development branch of inStrain soon; until then all I can say is that prodigal-formatted genes are really the only supported option.

Apologies again,
Matt

@nschan
Copy link

nschan commented Oct 11, 2022 via email

@MrOlm
Copy link
Owner

MrOlm commented Oct 11, 2022

Hi @nschan -

I will add this to my "To Do" list, but it likely won't be ready for a while. Right now a big reason I do the intragenic gene calling is to call non-synonymous / synonymous mutations, which isn't possible for non-cds annotations, but I do see the value in this.

Best,
Matt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants