
How do I extract similar BGC distance values from the interactive output #72

boykawang opened this issue Jul 28, 2023 · 18 comments

@boykawang

Hi, I had some problems extracting data from the interactive output pages.
I successfully ran the query module to analyze the novelty of my BGCs (more than 1000), and here is the command I used:

bigslice --query ./my_BGC_gbk_files --n_ranks 3 ~/BigSlice_1.2M_database/full_run_result

I opened the output results using the Flask server script according to the instructions, and the results were displayed on the web page. The important data I need are the distance values between my BGCs and the reference BGCs (marked in the red box in the following figure). Because of the huge number of reference BGCs, a single BGC needs many web pages to display all the distance values, and the table on the web page cannot be sorted by distance. My goal is to obtain the minimum distance value between each of my BGCs and the reference BGCs.

[screenshot: table of similar BGCs with the distance values marked in red]

I have tried to find the distance values I need in both the interactive web source code and the output SQLite database files, but they are not directly stored in either.

If you have any solutions or instructions, please let me know. Thanks!

@ialas commented Jul 28, 2023

Hi, I am not associated with BiG-SLiCE but I ran into similar issues.
Just to make sure, you:

  • Queried your BGCs against the 1.2 million BGCs in BiG-SLiCE using the query mode.

Now you have the interactive visualization, which shows information about your BGCs. You're interested specifically in the distance value of your BGCs to the closest GCF, which is marked in red.

So, for example, your BGC1 has a distance of 1864 to GCF_008711465.1/NZ_VXKQ01000005.region001, and you want to be able to get the lowest distance value for all 1000 of your BGCs to their closest GCFs?

@boykawang (Author)

Yes, my current situation and what I want to do are just as you described.


@ialas commented Jul 28, 2023

I generated this data in Python 3.7.12. The relevant libraries were sqlite3, pandas, and numpy.
I had variables assigned to:

  • The path to the data.db in /full_run_result/result/.
  • The path to the reports folders in /full_run_result/reports/.
  • The threshold value you used (I used 900, which grouped the 1.2 million BGCs into 29,955 GCFs).
  • The run_id used (each threshold value in BiG-SLiCE is associated with a run_id; the run_id for the 29,955 GCFs formed from threshold 900 is 6).
  • The report numbers that match the data you want to analyze (you can see the numbers associated with the reports you want in the reports folder; mine were reports 111 to 156, with some skipped as I figured things out).

pathToDataDB = "path/to/result/data.db"
pathToReportsDB = "path/to/reports/"
run_id = 6
thresholdValue = 900
allFolders = np.concatenate([np.arange(111,145,1), np.arange(149,157,1)])
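
If your report numbers aren't contiguous, you could also discover them from the reports directory itself instead of hardcoding the ranges (a small sketch, assuming every report lives in a purely numeric subfolder):

import os

# Collect every numeric subfolder name under the reports directory, in order
allFolders = sorted(int(d) for d in os.listdir(pathToReportsDB) if d.isdigit())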

Here's the code for how I scraped the relevant information in Python.

import sqlite3
import numpy as np
import pandas as pd

connMib = sqlite3.connect(pathToDataDB)
currMib = connMib.cursor()
storageDict = {}
for j in allFolders:  # whichever report runs I want to look at
    print('Analyzing folder: ' + str(j))
    connF = sqlite3.connect(pathToReportsDB + str(j) + "/data.db")  # open a connection to that report
    curF = connF.cursor()
    dfGCF = pd.read_sql_query("SELECT * FROM gcf_membership", connF)
    dfSliceGCF = dfGCF[(dfGCF.iloc[:, 3] == 0)]  # keep only rank-0 rows (the closest GCF per BGC)
    listBGC = list(dfSliceGCF.gcf_id)
    numBGCs = len(dfGCF.bgc_id.value_counts())
    list1, list2, list3, list4 = list(), list(), list(), list()
    count12 = 0
    for i in np.arange(1, numBGCs + 1, 1):
        name1 = curF.execute("SELECT name FROM bgc WHERE id=" + str(i)).fetchall()[0]  # modified for this new version
        dist1 = curF.execute("SELECT membership_value, gcf_membership.gcf_id FROM gcf_membership WHERE bgc_id=" + str(i) + " AND rank=0").fetchall()[0]
        gcf1 = currMib.execute("SELECT id_in_run FROM gcf, clustering WHERE gcf.clustering_id=4 AND gcf.id=" + str(dist1[1]) + " AND clustering.run_id=4").fetchall()[0]
        name2 = name1[0].split('/')[0]  # genome name folder
        name22 = name1[0].split('/')[1]  # tig.region
        if name2 != 'm6_a1363':  # hardcoded because I messed this one up
            name3 = name2.split('.')[-2]  # bc_genomeName
            if len(name3.split('_')) < 2:  # handle both bc12_genomeName and bc-genomeName
                name4 = name3.split('-')[-1]
                name42 = name4 + '.' + name22
            else:
                name4 = name3.split('_')[-1]  # just genomeName
                name42 = name4 + '.' + name22  # genomeName.tig00.region00
        else:
            name4 = 'a1363'
            name42 = name4 + '.' + name22
        list1.append(name42)
        list2.append(dist1[0])
        list3.append(dist1[1])
        list4.append(gcf1[0])
        count12 += 1
    genomeName = name4  # name1[0].split('.')[0]
    nameDistDict = {'BGC': list1, 'Distance': list2, 'gcf_ID': list3, 'GCF_Value': list4}
    nameDist = pd.DataFrame(nameDistDict)
    storageDict[genomeName] = nameDist
    connF.close()

This should get you on the right path. I had to hardcode how to parse the folder names, but the general idea of how to access the data in the SQL databases is there.
This was designed specifically with the 29,955 GCFs in mind (so run_id = 6 and thresholdValue = 900), and I don't quite remember how I selected gcf.clustering_id=4 and clustering.run_id=4.
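
If you need to reconstruct that mapping for your own database, one way is to inspect the clustering table directly (a sketch; the exact columns vary by BiG-SLiCE version, so check them first):

import sqlite3

conn = sqlite3.connect(pathToDataDB)
cur = conn.cursor()
# List the clustering table's actual columns (the schema may differ between versions)
print(cur.execute("PRAGMA table_info(clustering)").fetchall())
# Then dump the rows to see which clustering id belongs to which run_id
print(cur.execute("SELECT * FROM clustering").fetchall())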

To figure out generally how to access things in SQL (where the data was being stored), I used this script.

import sqlite3
import numpy as np
import pandas as pd

# First, the completed data database.
conn1 = sqlite3.connect("path/to/result/data.db")
curr1 = conn1.cursor()
# Second, the specific report database (based on a specific query; currently pointed at report 114)
conn2 = sqlite3.connect("path/to/reports/114/data.db")  # a specific report
curr2 = conn2.cursor()
# Third, the overall reports database (not based on a specific query)
conn3 = sqlite3.connect("path/to/reports/reports.db")
curr3 = conn3.cursor()

# Identify all of the tables inside each database.
table_list1 = [a for a in curr1.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]  # 28 tables.
table_list2 = [a for a in curr2.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]  # 10 tables.
table_list3 = [a for a in curr3.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]  # 3 tables.

# Find out specifically what's inside each table inside each database.
storageEx = {}  # Dictionary of dictionaries.
storageEx['Result'] = {}  # One sub-dictionary per database.
storageEx['Sp. Report'] = {}
storageEx['All Reports'] = {}
for i in [len(table_list1), len(table_list2), len(table_list3)]:
    print(i)  # Tracker
    if i == 28:  # Point values to the correct variables.
        table = table_list1
        connA = curr1
        dictN = 'Result'
    if i == 10:
        table = table_list2
        connA = curr2
        dictN = 'Sp. Report'
    if i == 3:
        table = table_list3
        connA = curr3
        dictN = 'All Reports'
    for j in np.arange(0, i):  # Iterate through each table in the database pointed at.
        test1 = connA.execute('SELECT * FROM ' + str(table[j][0]))  # Query that table.
        test3 = [b[0] for b in test1.description]  # Get the headers.
        test2 = pd.DataFrame(test1.fetchmany(180), columns=test3)  # Fetch the first 180 rows of that table as a dataframe, with those headers as columns.
        storageEx[dictN][table[j][0]] = test2  # Put the dataframe in the dictionary that's in the larger dictionary.

This way, I could open up storageEx and see what was inside all the SQL databases so I knew what data I should be looking for.
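
For example, after running it you can peek at the membership table captured from a report (a usage sketch; the table name appears in the script above):

print(storageEx['Sp. Report']['gcf_membership'].head())  # first rows of that report's gcf_membership table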

Hope this helps!

@boykawang (Author)

Hi ialas, thank you for your reply.
I am trying to use your first script to extract the data I need.
I am not very familiar with Python, but based on my understanding of the scripts you provided: the first script extracts data from the two SQLite database files (/full_run_result/result/data.db and /full_run_result/reports/<query number>/data.db). It seems designed to extract the "gcf_membership" values, the query BGC names, and the closest GCF names. If I am right, the distance values marked in the red box below are what the script extracts.

[screenshot]

I used sqlitebrowser to open the SQL database file (/full_run_result/reports/<query number>/data.db), and it indeed contains the "gcf_membership" values.

[screenshot]

If I am right, the data extracted by the first script is not what I want.

The data I need is the distance value under the "Similar BGCs" module.

[screenshot]

My final goal is to judge the novelty of my BGCs based on these distance values: if a query BGC has a minimum distance value (d) greater than 900, it will be considered a novel BGC.
As mentioned before, I opened these two SQLite database files (/full_run_result/result/data.db and /full_run_result/reports/<query number>/data.db) with sqlitebrowser and did not find the data I wanted in either. I think the values displayed on the web page may be generated from these two database files by some calculation step, so extracting them directly from the database files doesn't seem to work.

[screenshots]
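
If the page really does compute these values on the fly, the calculation could in principle be reproduced offline: if I understand the paper correctly, BiG-SLiCE's distances are Euclidean distances in its HMM-feature space. Below is a rough sketch under loud assumptions — that both data.db files expose a bgc_features table with (bgc_id, hmm_id, value) columns (verify with ialas's exploration script above), and that a plain Euclidean distance over those vectors matches what the page shows. With 1.2 million reference BGCs the reference matrix is huge, so subsample or chunk it.

import sqlite3
import numpy as np
import pandas as pd

def load_features(db_path):
    # Load per-BGC feature vectors; table/column names are assumptions, check your db first
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query("SELECT bgc_id, hmm_id, value FROM bgc_features", conn)
    # rows = BGCs, columns = HMM models, absent features = 0
    return df.pivot_table(index="bgc_id", columns="hmm_id", values="value", fill_value=0)

query_feat = load_features("path/to/reports/114/data.db")  # query BGC features
ref_feat = load_features("path/to/result/data.db")         # reference features (chunk this in practice!)

# Align both matrices on the union of HMM columns
cols = query_feat.columns.union(ref_feat.columns)
q = query_feat.reindex(columns=cols, fill_value=0).to_numpy(dtype=float)
r = ref_feat.reindex(columns=cols, fill_value=0).to_numpy(dtype=float)

# Pairwise squared Euclidean distances, then the minimum per query BGC
d2 = (q**2).sum(1)[:, None] - 2 * q @ r.T + (r**2).sum(1)[None, :]
min_dist = np.sqrt(np.clip(d2, 0, None)).min(axis=1)
print(pd.Series(min_dist, index=query_feat.index, name="min_distance"))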

I hope I'm wrong, so I can use your helpful script to solve this problem soon.
Looking forward to your reply.


@ChrisC610

Hi,
I recently executed the BiG-SLiCE query module using the command:

bigslice --query 05bigscapeinput --n_ranks 5 14big-slice/full_run_result -t 14

The full_run_result database was obtained from the official guidance, available at http://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip.

However, when I opened the output results using the Flask server script, I noticed that all BGCs are being uniformly classified into the same GCF model (GCF_0001) with identical distances (199.xxxx).

[screenshot]

While I'm not familiar with Python, it seems that this issue might be tied to the extraction of values from "gcf_membership", as discussed in your previous conversation with @ialas.

I'm seeking guidance on how to address this problem. Could you provide insights into potential missteps in my execution or configuration of BiG-SLiCE? Any assistance would be greatly appreciated.
Thank you.

@boykawang (Author)


Hi,
The results of my previous run are similar to yours, with all gcf_membership values being 199.xxxx. I checked GCF_0001 and found that it may be the GCF that contains the most BGCs, so my query BGCs should belong to this GCF (Gene Cluster Family).
Actually, I want to analyze the novelty of my own BGCs, so gcf_membership is not my concern. I focus on the values shown in the following image.

[screenshot]

I ended up using the command bigslice --query ./my_BGC_gbk_files --run-id 6 --n_ranks 1 ~/BigSlice_1.2M_database/full_run_result, for your reference. This command uses a threshold of 900 (gcf_membership is 158.xxxx), following previous literature.

@ChrisC610


Thank you very much for your response. However, I'm still puzzled by the differences in our results: our goal is the same (to analyze the novelty of our own BGCs), and our commands both use bigslice --query ./my_BGC_gbk_files --n_ranks 1 full_run_result. I would like to understand why my results show only "gcf_membership" values and do not include any results under the "Similar BGCs" module. Also, I noticed that your command includes the "--run-id" parameter, so I tried changing the run-id and rerunning the query, but my results remain unchanged.

[screenshot]

@boykawang (Author)

@ChrisC610
Hi,
The following four screenshots may help you find "Similar BGCs".

[four screenshots stepping through the web interface to the "Similar BGCs" table]

@ChristinaTiantian


Thanks! Your help means a lot to me!

@PannyYi commented Apr 30, 2024

@boykawang
Hi, I have a small question about the query mode of bigslice-2.0. When I run:

bigslice --query 00-01PK.genome.fasta/ --query_name 00-01PK 2test-out/

it reports an error:

[screenshot of the error message]

and here is my input file:

[screenshot of the input directory]

Do you know what's wrong with bigslice? Could you please be so kind as to help me use the query mode to assess the novelty of BGCs against the BiG-FAM database?
Thanks a lot!

@boykawang (Author)


Hi, it appears that your command did not add an output directory argument.
Instructions for use: bigslice --query <antismash_output_folder> --n_ranks <n> <output_folder>
In my experience with bigslice v1.1, the output directory must be the directory where the reference BGCs are located. However, the software authors say that the reference database for version 2 is still being prepared, so it may not work properly at this time.
I hope this is useful.

@PannyYi commented May 11, 2024


Hi, thanks a lot for your reply!
In my command, the output directory is 2test-out/, which only contains a folder "result" (following issue #66). According to your reply, it seems we need to wait for the version 2 reference database.
However, I still have a question. The error reads "can't find a matching hmm library in the database!", and others have hit the same issue in version 1.1.1 (#67). I was wondering whether we need to add some parameters when running antiSMASH (such as --fullhmmer)? Currently I only use the default settings with antiSMASH-7.1.0, like this:

antismash --genefinding-tool prodigal --output-dir xxxx --output-basename xxxx --cb-knownclusters xxxx.genome.fasta

Could you be so kind as to tell me what parameters you used in antiSMASH, so that bigslice can run successfully?
Thank you again for your help.

@boykawang (Author)


Hi. In fact, I used antiSMASH-6.1.1.
You can contact me via my email: [email protected]

@htaohan commented May 22, 2024


Did you solve this problem? I downloaded http://bioinformatics.nl/~kauts001/ltr/bigslice/paper_data/data/full_run_result.zip and used it as the output folder, but it still gives the error: "Can't find a matching HMM library in the database!" This is very confusing to me.

@jinxmeng

@htaohan Running download_bigslice_hmmdb may solve the problem.

@PannyYi commented Nov 4, 2024


@boykawang Hi, sorry to bother you. I have some questions regarding the results from bigslice (v1.1.0). Could you please explain how to turn the report.db results into the visualized web interface? Additionally, my goal is also to analyze the distances between my BGCs and the reference database (BigSlice_1.2M_database/full_run_result) to reflect their novelty. Do you have any solution for obtaining the minimum distance to the reference GCFs in bulk? Thank you! Here are my command and results:

[screenshots]

@Maiya19724

@PannyYi Have you tried running bash <output_folder>/start_server.sh <port(optional)>? Refer to this link for more information.

@jinxmeng commented Nov 7, 2024


I remember the table named 'gcf_membership', which records each BGC's distance to the different GCFs. Specifically, its columns include bgc_id, gcf_id, membership_value, and rank (see the sketch below).
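
Building on that, here is a minimal sketch of pulling each query BGC's closest-GCF distance in bulk and applying the d > 900 novelty criterion from this thread. It assumes the report database layout used in ialas's script above (tables bgc and gcf_membership, with columns bgc_id, gcf_id, membership_value, rank); adjust the path to your own report.

import sqlite3
import pandas as pd

conn = sqlite3.connect("path/to/reports/114/data.db")  # one query report's database
df = pd.read_sql_query(
    """SELECT bgc.name AS bgc_name, gm.gcf_id, gm.membership_value AS distance
       FROM gcf_membership AS gm
       JOIN bgc ON bgc.id = gm.bgc_id
       WHERE gm.rank = 0""",  # rank 0 = closest GCF for each BGC
    conn,
)
df["novel"] = df["distance"] > 900  # novelty criterion discussed in this thread
print(df.sort_values("distance"))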
