Parsing and Inserting gbks from MIBiG #59

Open
nicholascdove opened this issue Dec 20, 2022 · 5 comments

@nicholascdove

Hi Satria,

Thanks for the great package. I'm having difficulty clustering gbks from MIBiG.

I made an input folder, downloaded MIBiG, and placed the gbks in the input folder.

mkdir -p bigslice_input/AgB_gbk/gbks bigslice_input/taxonomy

wget --no-check-certificate https://dl.secondarymetabolites.org/mibig/mibig_gbk_3.1.tar.gz
tar -xf mibig_gbk_3.1.tar.gz 
mv mibig_gbk_3.1/* bigslice_input/AgB_gbk/gbks

I did the same for my own gbks run through AntiSMASH.

mv data/*gbk bigslice_input/AgB_gbk/gbks

I also made a dummy manifest and taxonomy file (I don't use the sqlite db, I end up parsing it and joining taxonomy from a separate database).

echo -e "# Dataset name\tPath to folder\tPath to taxonomy\tDescription" > bigslice_input/datasets.tsv
echo -e "AgB_gbk\tAgB_gbk/\ttaxonomy/AgB_gbk_taxonomy.tsv\tNULL" >> bigslice_input/datasets.tsv

echo -e "# Genome folder\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\tOrganism" > bigslice_input/taxonomy/AgB_gbk_taxonomy.tsv
echo -e "gbks/\tUnknown\tUnknown\tUnknown\tUnknown\tUnknown\tUnknown\tUnknown\tUnknown" >> bigslice_input/taxonomy/AgB_gbk_taxonomy.tsv
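The same placeholder manifest and taxonomy files can also be written from Python, which avoids relying on `echo -e` interpreting `\t` the same way in every shell. This is only a sketch: the function name `write_bigslice_inputs` and its parameters are made up for illustration, and the column headers and values are copied from the commands above.

```python
import csv
from pathlib import Path

TAXONOMY_HEADER = ["# Genome folder", "Kingdom", "Phylum", "Class", "Order",
                   "Family", "Genus", "Species", "Organism"]


def write_bigslice_inputs(root, dataset="AgB_gbk", genome_folder="gbks/"):
    """Create the folder layout plus placeholder datasets.tsv and taxonomy files."""
    root = Path(root)
    (root / dataset / genome_folder).mkdir(parents=True, exist_ok=True)
    (root / "taxonomy").mkdir(parents=True, exist_ok=True)
    taxonomy_rel = f"taxonomy/{dataset}_taxonomy.tsv"

    # Manifest: one header row and one row describing the dataset.
    with open(root / "datasets.tsv", "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["# Dataset name", "Path to folder",
                         "Path to taxonomy", "Description"])
        writer.writerow([dataset, f"{dataset}/", taxonomy_rel, "NULL"])

    # Taxonomy: a single genome folder with every rank set to "Unknown".
    with open(root / taxonomy_rel, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(TAXONOMY_HEADER)
        writer.writerow([genome_folder] + ["Unknown"] * (len(TAXONOMY_HEADER) - 1))
```

Calling `write_bigslice_inputs("bigslice_input")` should leave the same two files as the `echo` commands.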

When I run

bigslice -i bigslice_input \
  --complete \
  -t 4 \
  bigslice_centroids_output

during the parsing and inserting step, I get: gbks/BGC0000056.gbk is not a recognized antiSMASH clustergbk. And, I get the same message for each MIBiG gbk. At the same time, my own gbks seem to work.

Can you help? I'm wondering if it has to do with the eligible regex definitions on a newer release of MIBiG? I'd try to debug myself, but my programming skills are pretty novice.

Thanks!
Nicholas

@nicholascdove
Author

Hmm, maybe it's not a regex thing? I renamed the MIBiG gbks to try to match my gbks that worked.

My gbks that were parsed, inserted, and clustered looked like this: AIM000021_asm31892_contig20486033.region001.gbk
Original MIBiG gbk: BGC0002286.gbk
Trying to add a region string: BGC0002286.region001.gbk
Trying to break the BGC part of the regex definition so that it uses ^.+\\.region[0-9]+$: ABGC0002286.region001.gbk

Unfortunately, none of these naming "hacks" got BiG-SLiCE to recognize the MIBiG gbks as AntiSMASH gbks. Also, all of the files (my gbks and the MIBiG gbks) were in the same folder, so I don't think it's a directory issue.
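A quick way to rule out the file names is to test them against the patterns directly. The sketch below uses the `region` pattern quoted above; the MIBiG-style pattern `^BGC[0-9]{7}$` and the choice to match against the name without its `.gbk` extension are assumptions for illustration, not taken from the BiG-SLiCE source.

```python
import re
from pathlib import Path

# Pattern quoted in the comment above (shown there with a doubled backslash).
REGION_PATTERN = re.compile(r"^.+\.region[0-9]+$")
# Assumed shape of a MIBiG accession pattern; hypothetical, for illustration only.
MIBIG_PATTERN = re.compile(r"^BGC[0-9]{7}$")


def matching_patterns(filename):
    """Return labels of the patterns that match the file name without '.gbk'."""
    stem = Path(filename).stem  # strips only the final extension
    labels = []
    if REGION_PATTERN.match(stem):
        labels.append("region")
    if MIBIG_PATTERN.match(stem):
        labels.append("mibig")
    return labels


for name in ["AIM000021_asm31892_contig20486033.region001.gbk",
             "BGC0002286.gbk",
             "BGC0002286.region001.gbk",
             "ABGC0002286.region001.gbk"]:
    print(name, matching_patterns(name))
```

Under these assumptions every name matches at least one pattern, which is consistent with the renaming having no effect and the cause lying elsewhere.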

@nicholascdove
Author

Not actually "closed"; I just hit the wrong button. :)

@nicholascdove
Author

Looks like my issue has more to do with the parse_gbk() function in bgc.py. On lines 98-170, there is an if/else statement that treats gbks from different AntiSMASH versions differently.

Line 98: if antismash_version.split(".")[0] in ["5", "6"]:
Line 170: else: # assume antiSMASH 4

The problem is that current MIBiG gbks do not have an AntiSMASH version:

[image]

So this if/else treats them like antiSMASH 4 gbks and looks for a single feature of type cluster. When that search fails, the file is not recognized as an antiSMASH gbk.
Lines 170-182:

else:  # assume antiSMASH 4
    cluster = None
    for feature in gbk.features:
        if feature.type == "cluster":
            if cluster:  # contain 2 or more clusters
                cluster = None
                break
            else:
                cluster = feature
    if not cluster:
        print(orig_gbk_path +
              " is not a recognized antiSMASH clustergbk")
        break

Maybe this is the issue? Please let me know. Thanks!
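The branching can be reproduced in isolation. This is a minimal sketch, assuming the version reaches the check as the literal string "False" for MIBiG gbks; `parser_branch` is a hypothetical helper, not a function in bgc.py, and only the condition on its first `if` is copied from line 98.

```python
def parser_branch(antismash_version):
    """Report which branch of the if/else a given version string would take."""
    # Same test as line 98: only the part before the first dot is compared.
    if str(antismash_version).split(".")[0] in ["5", "6"]:
        return "antiSMASH 5/6 branch"
    # Anything else, including a missing version, falls through to here.
    return "antiSMASH 4 fallback"


for version in ["5.0.0", "6.1.1", "4.2.0", "False"]:
    print(version, "->", parser_branch(version))
```

A value of "False" has no dot, so the whole string is compared against "5" and "6" and ends up in the antiSMASH 4 fallback.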

@nicholascdove
Author

Okay, I figured it out. The assumption in my last comment was correct.

For others who are running into a similar issue, my workaround was to change the version in the MIBiG gbk from False to 5.0.0. You can use the following command in a for loop (edit in place with -i; redirecting the output back onto the same file would empty it before sed reads it):
sed -i 's/Version :: False/Version :: 5.0.0/' BGC0000001.gbk

I'm going to leave the issue open so the bug can be fixed in the package :)

@BioGavin

Here is the command for batch modification:
for i in mibig_gbk_3.1/*.gbk; do sed -i 's/Version :: False/Version :: 5.0.0/' "$i"; done
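On systems where `sed -i` behaves differently (BSD/macOS sed expects a backup-suffix argument after `-i`), the same substitution can be done from Python. This is a sketch of the workaround only, with hypothetical helper names `patch_gbk` and `patch_folder`; it rewrites the files in place, so run it on a copy if you want to keep the originals.

```python
from pathlib import Path

OLD = "Version :: False"
NEW = "Version :: 5.0.0"


def patch_gbk(path):
    """Replace the version string in one file; return True if it was changed."""
    path = Path(path)
    # The whole file is read before anything is written, so the input
    # cannot be truncated while it is still being read.
    text = path.read_text()
    if OLD not in text:
        return False
    path.write_text(text.replace(OLD, NEW))
    return True


def patch_folder(folder):
    """Patch every .gbk file in a folder; return the names that were changed."""
    return [p.name for p in sorted(Path(folder).glob("*.gbk")) if patch_gbk(p)]

# Example call (folder name taken from the command above):
# print(patch_folder("mibig_gbk_3.1"))
```

Files that do not contain the string are left untouched, so running it twice is harmless.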
