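There are two main parts to the project: analysis of sequences using Frequency CGR (FCGR), and analysis using Coordinate CGR.
In the frequency CGR method, we divide a square grid into a 2D array of size (√(4^k), √(4^k)), i.e. 2^k × 2^k cells for k-mers of length k, with one quadrant assigned to each nucleotide: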
- A in the top left
- G in the top right
- C in the bottom left
- T in the bottom right
Each quadrant is split recursively according to the same principle for the next nucleotide in the k-mer.
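The quadrant rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's annotated code: the names `QUADRANT`, `kmer_cell` and `fcgr` are hypothetical, the sketch assumes the first nucleotide of a k-mer selects the outermost quadrant, and it assumes k-mers containing ambiguous bases (such as N) are skipped.

```python
from collections import Counter

# Quadrant offsets as (row, col), following the corner layout above:
# A top left, G top right, C bottom left, T bottom right.
QUADRANT = {"A": (0, 0), "G": (0, 1), "C": (1, 0), "T": (1, 1)}


def kmer_cell(kmer):
    """Return the (row, col) cell of a k-mer in the 2^k x 2^k grid."""
    size = 2 ** len(kmer)
    row = col = 0
    for base in kmer:
        size //= 2  # each nucleotide halves the current quadrant
        d_row, d_col = QUADRANT[base]
        row += d_row * size
        col += d_col * size
    return row, col


def fcgr(sequence, k):
    """Build a matrix of k-mer frequencies, normalised to sum to 1."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: k-mers with characters other than A, C, G, T are ignored.
    counts = {m: c for m, c in counts.items() if set(m) <= set(QUADRANT)}
    total = sum(counts.values())
    n = 2 ** k
    matrix = [[0.0] * n for _ in range(n)]
    for kmer, count in counts.items():
        row, col = kmer_cell(kmer)
        matrix[row][col] = count / total
    return matrix
```

For example, `fcgr("AAAT", 2)` returns a 4 × 4 matrix in which the cell for `AA` (row 0, column 0) holds 2/3 and the cell for `AT` (row 1, column 1) holds 1/3.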
The Euclidean distance between the two chaos probability matrices is then calculated.
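As a rough sketch of this step, the distance can be computed element by element over two matrices of the same shape. The function name `matrix_distance` and the two small example matrices are hypothetical and only serve the illustration.

```python
import math


def matrix_distance(p, q):
    """Euclidean distance between two equally sized 2D matrices."""
    if len(p) != len(q) or any(len(a) != len(b) for a, b in zip(p, q)):
        raise ValueError("matrices must have the same shape")
    return math.sqrt(
        sum((x - y) ** 2 for a, b in zip(p, q) for x, y in zip(a, b))
    )


# Two hypothetical 2 x 2 chaos probability matrices (k = 1).
p = [[0.5, 0.5], [0.0, 0.0]]
q = [[0.0, 0.0], [0.5, 0.5]]
print(matrix_distance(p, q))  # 1.0
```

A distance of 0 means the two matrices are identical; larger values indicate more dissimilar k-mer distributions.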
In the coordinate CGR method, we analyse the sequences using coordinates calculated with the following steps:
- Start from the center of the grid
- 1st coordinate - plotted halfway between the center of the square and the vertex representing this nucleotide (A)
- Successive coordinates - plotted halfway between the previous point and the vertex representing the current nucleotide
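The steps above can be sketched as follows. This is an illustrative version under stated assumptions rather than the project's own code: the grid is taken to be the unit square, the vertices reuse the corner layout from the frequency CGR section, ambiguous bases are skipped, and the names `VERTEX` and `cgr_coordinates` are hypothetical.

```python
# Vertices of the unit square as (x, y), with y pointing up. Reusing the
# FCGR corner layout here is an assumption of this sketch.
VERTEX = {"A": (0.0, 1.0), "G": (1.0, 1.0), "C": (0.0, 0.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence):
    """Return one (x, y) point per nucleotide of the sequence."""
    x, y = 0.5, 0.5  # start from the center of the grid
    points = []
    for base in sequence.upper():
        if base not in VERTEX:
            continue  # assumption: skip ambiguous bases such as N
        vx, vy = VERTEX[base]
        # Move halfway from the previous point toward the vertex.
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


print(cgr_coordinates("AG"))  # [(0.25, 0.75), (0.625, 0.875)]
```

Each point depends on the whole prefix of the sequence up to that nucleotide, which is why the list of points can serve as a signature of the sequence.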
The Euclidean distance between the two chaos vectors obtained is then calculated.
Annotated code for our project has been provided. We also used Streamlit, an open-source app framework, to create a custom web app for the project. A folder with the app's source code, screenshots of the expected output, and directions for running it has also been provided.
The data was gathered from NCBI (https://www.ncbi.nlm.nih.gov/) and GISAID (https://www.gisaid.org/). We worked with two categories of data: hCov-19 and BetaCov-19 sequences (DNA_SEQUENCES folder), and human and various animal genome sequences (ANIMAL_GENOME folder).
- Frequency Chaos Game Representation
- Coordinate Chaos Game Representation
- https://towardsdatascience.com/chaos-game-representation-of-a-genetic-sequence-4681f1a67e14
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7497811/
- https://www.hindawi.com/journals/aaa/2013/926519/
In the code, the CGRs of the sequences being analyzed are exported as PNG files in the same folder as the data. We have therefore also uploaded the images from a sample execution to those folders.