DNA-sequencing-NLP-machinelearning-project

DNA sequencing using NLP (Natural Language Processing)

techniques & models

- classification/regression
- CountVectorizer NLP
- MultinomialNB (Naive Bayes Classifier)

libraries

- pandas
- numpy
- matplotlib
- scikit-learn
- ipython

k-mer counting

DNA and protein sequences can be viewed metaphorically as the language of life. This language encodes the instructions and functions of the molecules found in all life forms. The analogy extends further: the genome is the book, subsequences (genes and gene families) are sentences and chapters, k-mers and peptides (motifs) are words, and nucleotide bases and amino acids are the alphabet. Because the analogy is so apt, it stands to reason that the techniques developed in natural language processing should also apply to the natural language of DNA and protein sequences.

The method I use on the datasets here is simple. I first take a long biological sequence and break it into overlapping "words" of length k (k-mers). For example, with words of length 6 (hexamers), "ATGCATGCA" becomes 'ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA'. A sequence of length n yields n - k + 1 overlapping k-mers, so this 9-base example is broken down into 9 - 6 + 1 = 4 hexamer words.
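As a minimal sketch of the splitting step described above (the function name `get_kmers` is an illustrative choice and is not taken from the notebook), a sliding window of width k moved one base at a time produces the overlapping words:

```python
def get_kmers(sequence, k=6):
    """Return all overlapping k-mers of `sequence`, sliding one base at a time.

    `get_kmers` is a hypothetical helper name used for illustration only.
    A sequence shorter than k yields an empty list.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = get_kmers("ATGCATGCA", k=6)
print(words)  # ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA']
```

The window start runs from 0 to len(sequence) - k, which is why a 9-base sequence gives 4 hexamers.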

Here I am using hexamer “words” but that is arbitrary and word length can be tuned to suit the particular situation. The word length and amount of overlap need to be determined empirically for any given application.
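To make the two tunable quantities concrete, here is a small sketch that exposes both the word length and the amount of overlap. The `step` parameter and the function name `get_kmers_with_step` are assumptions introduced for illustration. With a window of length `k` advanced by `step` bases, consecutive words share `k - step` bases.

```python
def get_kmers_with_step(sequence, k=6, step=1):
    """Split `sequence` into words of length k, advancing `step` bases each time.

    Hypothetical helper for illustration: step=1 gives maximal overlap,
    step=k gives non-overlapping words. Trailing bases that do not fill
    a whole window are dropped.
    """
    if k < 1 or step < 1:
        raise ValueError("k and step must be positive integers")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


seq = "ATGCATGCA"
print(get_kmers_with_step(seq, k=6, step=1))  # ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA']
print(get_kmers_with_step(seq, k=6, step=3))  # ['ATGCAT', 'CATGCA']
print(get_kmers_with_step(seq, k=3, step=3))  # ['ATG', 'CAT', 'GCA']
```

In practice one would try several values of `k` and `step` and compare model performance on held-out data, since no single setting suits every application.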

In genomics, we refer to these types of manipulations as "k-mer counting", or counting the occurrences of each possible k-mer sequence. There are specialized tools for this, but the Python natural language processing tools make it super easy.
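One way to do this counting with scikit-learn's `CountVectorizer` is sketched below. The two toy sequences and the helper name `kmer_sentence` are made up for illustration, and the notebook may configure the vectorizer differently. Each sequence is turned into a space-separated "sentence" of hexamers, and the vectorizer then counts how often each hexamer occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer


def kmer_sentence(sequence, k=6):
    """Join the overlapping k-mers of `sequence` into one space-separated string.

    Hypothetical helper; lowercasing matches CountVectorizer's default behaviour.
    """
    seq = sequence.lower()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy sequences invented for this example, not taken from the project's datasets.
sequences = ["ATGCATGCA", "ATGCATATG"]
sentences = [kmer_sentence(s) for s in sequences]

vectorizer = CountVectorizer()               # default word-level tokenization
counts = vectorizer.fit_transform(sentences)  # sparse matrix: rows = sequences, columns = k-mers
vocab = vectorizer.vocabulary_                # maps each k-mer to its column index

print(counts.shape)                   # (2, 7): 2 sequences, 7 distinct hexamers
print(counts[0, vocab["atgcat"]])     # 1
```

A count matrix of this kind is the sort of non-negative feature input that `MultinomialNB` accepts.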

VivanVatsa/DNA-sequencing-NLP-machinelearning-project
