statbiophys/ATrieGC

ATrieGC

A Python/C++ module for storing large numbers of sequences and clustering them by Hamming distance. It should be much faster than the naive method of measuring the Hamming distance between every pair of sequences.
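
For reference, the naive method mentioned above can be sketched in plain Python as follows. This is only an illustrative baseline and not part of ATrieGC; the helper names `hamming` and `naive_close_pairs` are made up for this example.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def naive_close_pairs(seqs, max_distance):
    """Compare every pair of sequences: n*(n-1)/2 distance computations."""
    return [
        (a, b)
        for a, b in combinations(seqs, 2)
        if hamming(a, b) <= max_distance
    ]


seqs = ["AAAATGC", "ATAATGC", "TTTTTGC"]
print(naive_close_pairs(seqs, 1))  # [('AAAATGC', 'ATAATGC')]
```

The number of comparisons grows quadratically with the number of sequences, which is the cost the module is meant to avoid.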

Installation

After cloning the git repository:

pip3 install atriegc

Usage

Working with the nucleotide alphabet

import atriegc

tr = atriegc.TrieNucl()
tr.insert("AAAATGC")
tr.insert("ATAATGC")
tr.insert("TTTTTGC")

max_hamming_distance = 1
print(tr.neighbours("AAATTGC", max_hamming_distance))
print(tr.clusters(max_hamming_distance))
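
To give an intuition for why a trie helps with this kind of query, here is a minimal pure-Python sketch of a neighbour search with a mismatch budget. It is an assumption-laden illustration, not ATrieGC's actual implementation: the functions `build_trie` and `trie_neighbours` are hypothetical, and the output format of the library's own `neighbours` and `clusters` methods is not described here.

```python
def build_trie(seqs):
    """Nested-dict trie; the key None marks the end of a stored sequence."""
    root = {}
    for seq in seqs:
        node = root
        for ch in seq:
            node = node.setdefault(ch, {})
        node[None] = seq
    return root


def trie_neighbours(root, query, max_distance):
    """Stored sequences with the same length as query and at most
    max_distance mismatches."""
    found = []
    stack = [(root, 0, 0)]  # (node, position in query, mismatches so far)
    while stack:
        node, depth, mismatches = stack.pop()
        if depth == len(query):
            if None in node:
                found.append(node[None])
            continue
        for ch, child in node.items():
            if ch is None:
                continue  # end marker of a shorter sequence, not a branch
            cost = mismatches + (ch != query[depth])
            if cost <= max_distance:
                # Branches that already exceed the budget are never visited.
                stack.append((child, depth + 1, cost))
    return sorted(found)


root = build_trie(["AAAATGC", "ATAATGC", "TTTTTGC"])
print(trie_neighbours(root, "AAATTGC", 1))  # ['AAAATGC']
```

Sequences that share a prefix share the work done on that prefix, and a whole subtree is skipped as soon as the mismatch count exceeds the allowed distance.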

Working with the amino acid alphabet

Amino acids are written as capital letters.

tr = atriegc.TrieAA()
tr.insert("CARGKYSPATFDSW")

Working with a generic alphabet

The alphabet should be passed as a string that lists every character that can appear in the sequences.

tr = atriegc.Trie("abcdef")
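
If the alphabet is not known in advance, one option is to derive it from the data before building the trie. The helper `alphabet_of` below is a hypothetical convenience function, not part of ATrieGC, and it assumes the sequences are available as a list of strings.

```python
def alphabet_of(seqs):
    """Sorted string of the distinct characters appearing in seqs."""
    return "".join(sorted(set().union(*seqs)))


seqs = ["abcab", "fedcb"]
alphabet = alphabet_of(seqs)
print(alphabet)  # abcdef

# The resulting string can then be passed to the generic constructor:
# tr = atriegc.Trie(alphabet)
```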
