Catarina Loureiro edited this page Dec 13, 2024 · 6 revisions

How BiG-SCAPE Cluster works, in a nutshell

BiG-SCAPE 2 can be run in 3 workflows: Cluster, Query, and Benchmark.

bigscape cluster groups biosynthetic gene clusters (BGCs) into Gene Cluster Families (GCFs); it is the equivalent of running BiG-SCAPE 1’s bigscape.py. bigscape query searches for BGCs that are similar to a user-provided query BGC (a .gbk file). bigscape benchmark compares the results of a BiG-SCAPE 2 Cluster run, a BiG-SCAPE 1 run, or a BiG-SLiCE run against a user-provided set of BGC <-> GCF assignments.

BiG-SCAPE Cluster reads BGC information stored in antiSMASH-processed GenBank files. It uses a profile hidden Markov model (pHMM) database (commonly Pfam) and hmmscan from the HMMER suite to predict protein domains in each sequence, thus summarizing each BGC as a linear sequence of protein domains. BGCs can be grouped into bins by their antiSMASH class/category, or placed in a single mixed bin (by using --mix). Each pair of BGCs in a bin is first aligned (based on a user-defined alignment mode), which defines the region to compare. The pairwise distance between the two BGCs is then calculated as a weighted combination of the Jaccard, Adjacency Index (AI) and Domain Sequence Similarity (DSS) indices (more detail here). A cutoff is applied to these distance values, and the remaining pairs are clustered with affinity propagation to generate Gene Cluster Families (GCFs). BiG-SCAPE 2 also accepts more than one cutoff per run, in which case a set of GCFs is generated for each cutoff.
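To make the distance step more concrete, here is a minimal Python sketch of how a weighted combination of the three indices could look. It is not BiG-SCAPE's implementation: the function names, the weights, the example domain lists, and the choice to treat adjacent domain pairs as unordered are all assumptions made for illustration. The DSS value is passed in as a given number, since computing it requires domain sequence alignments, and the affinity propagation step is not shown.

```python
# Illustrative sketch only; all names and weights below are hypothetical.

def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def adjacent_pairs(domains):
    """Unordered pairs of neighbouring domains in a linear domain list."""
    return {frozenset(pair) for pair in zip(domains, domains[1:])}


def adjacency_index(a, b):
    """Jaccard index computed over adjacent domain pairs (assumed definition)."""
    return jaccard(adjacent_pairs(a), adjacent_pairs(b))


def pairwise_distance(a, b, dss, w_jaccard=0.25, w_ai=0.25, w_dss=0.5):
    """Distance = 1 - weighted similarity. The weights are placeholders."""
    similarity = (
        w_jaccard * jaccard(a, b)
        + w_ai * adjacency_index(a, b)
        + w_dss * dss
    )
    return 1.0 - similarity


def pairs_below_cutoffs(distances, cutoffs):
    """For each cutoff, keep the pairs whose distance is below it.

    Assumes a strict '<' comparison; each cutoff yields its own set of pairs,
    which would then be clustered separately.
    """
    return {
        cutoff: sorted(pair for pair, d in distances.items() if d < cutoff)
        for cutoff in cutoffs
    }


# Two made-up BGCs, each summarized as a linear list of Pfam accessions.
bgc_a = ["PF00109", "PF02801", "PF00698", "PF00550"]
bgc_b = ["PF00109", "PF02801", "PF00550"]

# dss=0.8 is an assumed value standing in for a real DSS computation.
d_ab = pairwise_distance(bgc_a, bgc_b, dss=0.8)
kept = pairs_below_cutoffs({("bgc_a", "bgc_b"): d_ab}, cutoffs=[0.3, 0.5])
print(round(d_ab, 3), kept)
```

With these example values the pair survives the looser cutoff but not the stricter one, which is why running with several cutoffs can produce a different set of GCFs for each.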

Learn more about the BiG-SCAPE modes and options with python bigscape.py --help, the tutorials or by reading through this wiki. We suggest going through this wiki in the following order:
