Defining the Comparable Region
Given the predictive nature of BGC data, and the inherent challenges that come with defining BGC borders, it might not always be desirable to compare the complete region as output by antiSMASH. To select the domains predicted in each BGC for the distance calculation, i.e. to define the comparable region between two BGC records, BiG-SCAPE 2 makes use of three strategies: local, glocal and global, which can be selected with --alignment-mode. --alignment-mode auto is also available, which uses glocal mode when at least one of the BGC records in a pair has the contig_edge annotation from antiSMASH v4+, and otherwise uses global mode on that pair.
(In --alignment-mode global)
In Global mode, all domains from both BGC records, i.e. the entire record as generated by antiSMASH, are taken into account.
(In --alignment-mode local)
In Local mode, for each pair of records, BiG-SCAPE 2 tries to find the (internal) coordinates/edges of each BGC record that correspond to the highest common domain content shared between both BGCs, and will only use the domains contained within these borders for the distance calculations.
BiG-SCAPE will first find the Longest Common Substring (LCS) of domains shared between both BGC records (strandedness is taken into account). If the LCS contains core biosynthetic domains (as defined by antiSMASH), and/or is larger than a certain fraction of the total number of domains of the shortest BGC record (see the config.yml file), the extend step can proceed. If this is not the case, the comparison will be performed using the Global mode.
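The LCS search can be illustrated with a minimal sketch. It assumes each record is represented as a list of domain names, and it models strandedness by also trying the second record in reverse order. The function names and this representation are illustrative assumptions, not BiG-SCAPE's internal API.

```python
def longest_common_substring(a, b):
    """Return (a_start, b_start, length) of the longest contiguous run of
    identical domains shared by lists a and b (dynamic programming)."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the run that ended at the previous pair of positions
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


def find_domain_lcs(a, b):
    """Try b in both orientations (assumption: this is how strandedness is
    modelled here). Returns (a_start, b_start, length, reversed_flag);
    b_start refers to the reversed list when reversed_flag is True."""
    forward = longest_common_substring(a, b)
    reverse = longest_common_substring(a, b[::-1])
    if reverse[2] > forward[2]:
        return reverse + (True,)
    return forward + (False,)


record_a = ["PKS_KS", "PKS_AT", "ACP", "TE"]
record_b = ["X", "PKS_AT", "ACP", "Y"]
lcs = find_domain_lcs(record_a, record_b)  # (1, 1, 2, False)
```

The returned coordinates mark the seed from which the extend step would start, provided the LCS passes the checks described above.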
An exception to this rule is the terpene class, in which there is no minimum number of domains required in the LCS, provided the LCS contains a core biosynthetic domain. Users can add more classes to this exception by altering the NO_MIN_CLASSES variable in the config.yml file.
In the extension step, BiG-SCAPE 2 will attempt to extend the comparable region from the LCS outwards to the edge of the record, scoring each extension based on shared domain content between BGC records, and will keep the best scoring coordinates.
Once again, the extended region will be checked for its dimensions and biosynthetic content (see the config.yml file), and if these checks do not pass, the comparison will be performed using the Global mode.
(In --alignment-mode glocal)
Glocal mode is, as the name suggests, an intermediate between Global and Local. It first follows the Local mode workflow to obtain the comparable region with the most shared domains. Subsequently, to include some but not all domain variation in the BGC pair, the comparable region is extended, on each side of the LCS/current extension, to the end of the shorter arm of the BGC record pair.
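The final Glocal step can be sketched as follows. This is one reading of "shortest arm", assuming half-open domain coordinates (start, stop) for the comparable region in each record and that both records are extended by the same number of domains on a given side. The function name and coordinate convention are hypothetical.

```python
def glocal_expand(a_start, a_stop, a_len, b_start, b_stop, b_len):
    """Extend both comparable regions by the length of the shorter arm
    on each side. Coordinates are half-open domain indices."""
    # left arm: domains before the current region in each record
    left = min(a_start, b_start)
    # right arm: domains after the current region in each record
    right = min(a_len - a_stop, b_len - b_stop)
    return (a_start - left, a_stop + right), (b_start - left, b_stop + right)


# Record A has 8 domains with region [2, 5); record B has 10 with region [1, 4).
region_a, region_b = glocal_expand(2, 5, 8, 1, 4, 10)
# region_a == (1, 8), region_b == (0, 7)
```

Under this reading, the record with the shorter arm is always included up to its edge on that side, while the other record keeps some unmatched domains outside the comparable region.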
(In --alignment-mode auto)
In this mode BiG-SCAPE will use glocal when at least one of the BGCs in a pair has the contig_edge annotation from antiSMASH v4+, and will otherwise use the global mode.
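The per-pair decision in auto mode amounts to a simple rule, sketched here. The function name and the boolean inputs are illustrative assumptions; how the contig_edge annotation is read from the antiSMASH output is not shown.

```python
def resolve_auto_mode(a_on_contig_edge: bool, b_on_contig_edge: bool) -> str:
    """Pick the alignment mode for one pair of records under auto."""
    if a_on_contig_edge or b_on_contig_edge:
        return "glocal"
    return "global"


mode = resolve_auto_mode(True, False)  # "glocal"
```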
In (G)Local mode, for each pair of records, BiG-SCAPE will first find the longest common substring (LCS), and then try to extend from the LCS outwards to the edges of the BGC records, to find the (internal) coordinates/edges of each BGC record that correspond to the highest common domain content shared between both BGCs. BiG-SCAPE 2 can do this with one of three available extend strategies: legacy (the same principle as present in BiG-SCAPE 1), simple_match, and greedy.
(In --extend-strategy legacy)
For each record pair, BiG-SCAPE assigns the query role to the BGC record with the fewest domains and the target role to the record with the most domains. Domain selection is performed for the query, based on the domains that are present in the target. If the number of domains up(down)stream is the same, an arbitrary BGC is selected to extend instead.
Note: BiG-SCAPE 2’s Legacy extend closely follows the behaviour of BiG-SCAPE 1’s extend strategy, with the most significant change being that BiG-SCAPE 1 considers CDSs as the units to extend, and BiG-SCAPE 2 considers domains instead. This change was motivated by the fact that two multi-domain CDSs that share 99% of all domains would be seen as a mismatch in BiG-SCAPE 1.
BiG-SCAPE will start the extension from the LCS outwards, and attempt to extend the coordinates of the comparable region once for each side, i.e. upstream and downstream of the LCS. It does this according to the following scoring algorithm:
- The algorithm keeps track of a score, a maximum score, an “extension position”, and a “match position” in the target BGC.
- For each domain in the query BGC, the same domain is searched for sequentially in the target.
- If a domain is found, a match bonus (+5) is added to the score. If no domain is found, a mismatch penalty (-3) is subtracted from the score.
- If a domain is found, but it is not the next position in the search, a match bonus is added (+5) as well as a gap penalty that is proportional to the number of positions between the current extension position and the match position (-2*gap length in domain positions).
- If the current score is greater than or equal to the max score, the extension position is updated to the match position.
- The algorithm will only search a set distance from the current match position before calling a mismatch (default: 10% of total number of domains, rounded down).
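The scoring steps above can be sketched for a single side of the LCS. This is a simplified illustration, not the BiG-SCAPE implementation: the function name, the list-of-domain-names representation, and the max_dist parameter (passed in directly rather than derived from the 10% default) are assumptions, and the gap is measured from the position just after the previous match.

```python
def legacy_extend_one_side(query, target, max_dist,
                           match=5, mismatch=-3, gap=-2):
    """Score an extension of `query` against `target`, both given as lists
    of domain names ordered outwards from the LCS on one side.
    Returns (query_extension, target_extension, max_score)."""
    score = 0
    max_score = 0
    query_ext = 0    # number of query domains included in the extension
    target_ext = 0   # number of target domains included in the extension
    search_from = 0  # position just after the previous match in the target

    for query_idx, domain in enumerate(query):
        # look for the same domain within the allowed search window
        window_end = min(len(target), search_from + max_dist + 1)
        found = None
        for target_idx in range(search_from, window_end):
            if target[target_idx] == domain:
                found = target_idx
                break

        if found is None:
            score += mismatch
            continue

        # match bonus, plus a penalty proportional to the skipped positions
        score += match + gap * (found - search_from)
        search_from = found + 1

        if score >= max_score:
            max_score = score
            query_ext = query_idx + 1
            target_ext = found + 1

    return query_ext, target_ext, max_score


result = legacy_extend_one_side(["A", "B", "X", "C"], ["A", "B", "C", "D"], max_dist=2)
# result == (4, 3, 12): two matches, one mismatch, then a third match
```

Because the extension is only updated when the running score reaches the maximum again, a trailing run of mismatches is never included in the comparable region.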
(In --extend-strategy simple_match)
Simple match is a new BiG-SCAPE 2 extension strategy that has a higher tolerance for diverse regions. Simple match will always perform an extension of each BGC and each side according to the following algorithm:
- The set of domains that are common to both BGCs is stored in a match list.
- The LCS is then extended for each BGC on each side by applying a mismatch penalty (default -3) when no match is found in the match list, or a match bonus (default +5) when a match is found in the match list.
- Domains are never removed from the match list, and domain matches may occur on the opposite end of an LCS.
- The extension with the highest score is selected. If no extension with a score higher than 0 is detected, the extension algorithm defaults to the LCS for that side of the extension.
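A minimal sketch of these steps for one arm of one BGC, assuming domains are plain strings ordered outwards from the LCS. The function name and data representation are hypothetical; the bonus and penalty values are the defaults quoted above.

```python
def simple_match_extend(arm, match_list, match=5, mismatch=-3):
    """Walk outwards along `arm` and return (extension_length, best_score).
    An extension length of 0 means the LCS boundary is kept for this side."""
    score = 0
    best_score = 0
    best_length = 0
    for position, domain in enumerate(arm):
        # the match list is never modified, so repeated domains keep matching
        score += match if domain in match_list else mismatch
        if score > best_score:
            best_score = score
            best_length = position + 1
    return best_length, best_score


common = {"A", "B"}
result = simple_match_extend(["X", "A", "B", "Y"], common)
# result == (3, 7): running scores are -3, 2, 7, 4
```

Since only membership in the match list is checked, the order of domains in the other BGC does not matter, which is what gives this strategy its higher tolerance for diverse regions.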
(In --extend-strategy greedy)
Greedy aims to maximize the region of comparison between BGCs that contain common domains. This strategy stores a set of common domains between the two BGCs and sets the comparable region to start at the CDS containing the first common domain, and end at the CDS containing the last common domain.
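As an illustration, the greedy selection for one record can be sketched like this, assuming the record is a list of CDSs and each CDS is a list of domain names. The function name and the half-open CDS index convention are assumptions made for the example.

```python
def greedy_region(cds_list, common_domains):
    """Return (start_cds, stop_cds) as half-open CDS indices spanning the
    first to the last CDS that contains a common domain, or None if no CDS
    contains one."""
    hits = [
        index
        for index, cds in enumerate(cds_list)
        if any(domain in common_domains for domain in cds)
    ]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


record = [["X"], ["A", "Y"], ["Z"], ["B"], ["W"]]
region = greedy_region(record, {"A", "B"})  # (1, 4)
```

Everything between the first and last hit is kept, including CDSs with no common domains, which is why this strategy produces the largest comparable regions.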