Set comparison method

Nolan Woods edited this page Sep 21, 2022 · 5 revisions

The set comparison method of genomic island detection works on the assumption that a GI is defined by its gene content and the functionality it confers.

The method extracts the CDS records of a query sequence and tries to match the protein sequences to all known proteins in the curated GI database. The protein IDs are used as tokens to uniquely represent the proteins in a set. All curated GIs have their proteins clustered along with the NCBI COG database to unify and de-duplicate their protein IDs across all GI sequences. A database composed of protein ID sets for each curated GI is generated and used to compare to a candidate GI from a query sequence. Each candidate GI is compared to each set in the generated database using a Jaccard index. Any index above a threshold is retained as a possible match to the associated curated GI.
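
The comparison step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `jaccard_index` and `match_candidate`, the GI names, the protein IDs, and the `0.5` threshold are all hypothetical, since the page does not state the real threshold value or data layout.

```python
def jaccard_index(a, b):
    """Jaccard index |A & B| / |A | B| of two ID sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def match_candidate(candidate_ids, curated_sets, threshold=0.5):
    """Return (gi_name, index) pairs whose index is above the threshold.

    `curated_sets` maps a curated GI name to its set of protein IDs.
    The threshold default is an assumption made for this sketch.
    """
    matches = []
    for gi_name, protein_ids in curated_sets.items():
        index = jaccard_index(candidate_ids, protein_ids)
        if index > threshold:
            matches.append((gi_name, index))
    # Best match first.
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical data: two curated GIs and one candidate from a query.
curated_sets = {
    "GI_A": {"p1", "p2", "p3", "p4"},
    "GI_B": {"p7", "p8"},
}
candidate = {"p1", "p2", "p3", "q9"}
result = match_candidate(candidate, curated_sets)
print(result)
```

Here the candidate shares three of five distinct IDs with `GI_A` and none with `GI_B`, so only `GI_A` is retained.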

To reiterate: "known proteins" are ones that have a high-identity sequence match to a protein in the curated GI database.

After the query sequence proteins are assigned any known protein IDs within the curated GI database, the sequence is scanned for candidate GIs.

The positions of the known proteins and the original query proteins along the query sequence can be pictured as:

........|..|||||.....|||........|...|....|......
        ^------^
           ^-----------^

"." original query protein position
"|" known protein position
"^-^" candidate

Both known protein IDs and the original protein IDs from the query are combined with their coordinates while generating the candidate sets. Note that there is a duplicate original protein record for each known record. Candidate genomic islands are built up by scanning over the protein coordinates. The first known protein position encountered starts a candidate. Each subsequent protein position is added to the candidate GI. The number of known proteins divided by the total number of proteins in the candidate must remain greater than the threshold; otherwise the candidate cannot match any curated GI above the threshold, because only known protein IDs can appear in the intersection while every protein in the candidate counts toward the union. If adding a non-known protein ID puts the candidate below the threshold, the candidate is output, and the leftmost known protein is removed along with all non-known proteins up to the next known protein ID. Additionally, if the next protein is more than 14 kb away from the last, the candidate is output and the next known protein starts a new candidate.
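
The scan described above might look something like the sketch below. It rests on several assumptions that the page does not confirm: the `Protein` record, the function names, and the `0.5` threshold are hypothetical; each protein is reduced to a single coordinate; the duplicate original record for each known protein is ignored; and when one trim is not enough to admit the next protein, trimming simply repeats without emitting the intermediate windows.

```python
from collections import deque
from typing import NamedTuple


class Protein(NamedTuple):
    position: int    # coordinate on the query sequence, in bp
    protein_id: str
    known: bool      # True if matched to a protein in the curated GI database


def known_fraction(window, extra=0):
    """Known / total proteins, as if `extra` non-known proteins were appended."""
    return sum(p.known for p in window) / (len(window) + extra)


def trim_left(window):
    """Drop the leftmost known protein and the non-known proteins after it."""
    window.popleft()
    while window and not window[0].known:
        window.popleft()


def scan_candidates(proteins, threshold=0.5, max_gap=14_000):
    """Return candidate GIs as lists of protein IDs (sketch, see assumptions)."""
    candidates = []
    window = deque()  # invariant: if non-empty, window[0] is a known protein
    for protein in sorted(proteins, key=lambda p: p.position):
        # Gap rule: a jump of more than max_gap closes the current candidate.
        if window and protein.position - window[-1].position > max_gap:
            candidates.append([p.protein_id for p in window])
            window.clear()
        if not window:
            # Only a known protein may start a candidate.
            if protein.known:
                window.append(protein)
            continue
        if not protein.known and known_fraction(window, extra=1) <= threshold:
            # Adding this protein would break the threshold: output, then trim.
            candidates.append([p.protein_id for p in window])
            while window and known_fraction(window, extra=1) <= threshold:
                trim_left(window)
            if not window:
                continue
        window.append(protein)
    if window:
        candidates.append([p.protein_id for p in window])
    return candidates


# Hypothetical query: "k*" are known proteins, "n*" are non-known.
query = [
    Protein(1_000, "k1", True), Protein(2_000, "k2", True),
    Protein(3_000, "n1", False), Protein(4_000, "n2", False),
    Protein(30_000, "k3", True), Protein(31_000, "k4", True),
    Protein(50_000, "k5", True),
]
found = scan_candidates(query)
print(found)
```

In this example `n2` triggers the threshold rule and the 19 kb jump before `k5` triggers the gap rule. Each returned list would then be compared against the curated GI sets.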