-
Notifications
You must be signed in to change notification settings - Fork 13
Home
Levenshtein transducers accept a query term and return all terms in a dictionary that are within n spelling errors away from it. They constitute a highly-efficient (space and time) class of spelling correctors that work very well when you do not require context while making suggestions. Forget about performing a linear scan over your dictionary to find all terms that are sufficiently-close to the user's query, using a quadratic implementation of the Levenshtein distance or Damerau-Levenshtein distance, these babies find all the terms from your dictionary in linear time on the length of the query term (not on the size of the dictionary, on the length of the query term).
If you need context, then take the candidates generated by the transducer as a starting place, and plug them into whatever model you're using for context (such as by selecting the sequence of terms that have the greatest probability of appearing together).
For a quick demonstration, please visit the Github Page, here.
For installation instructions regarding specific languages, check out the following:
Below, you will find instructions for how to clone the liblevenshtein
repository and checkout its submodules, which include the source code for all
supported languages:
% git clone https://github.com/universal-automata/liblevenshtein.git
Cloning into 'liblevenshtein'...
remote: Counting objects: 787, done.
remote: Compressing objects: 100% (431/431), done.
remote: Total 787 (delta 387), reused 693 (delta 330)
Receiving objects: 100% (787/787), 737.31 KiB | 1.14 MiB/s, done.
Resolving deltas: 100% (387/387), done.
Checking connectivity... done.
% cd liblevenshtein
% git submodule init
Submodule 'coffeescript' ([email protected]:universal-automata/liblevenshtein-coffeescript.git) registered for path 'coffeescript'
Submodule 'java' ([email protected]:universal-automata/liblevenshtein-java.git) registered for path 'java'
Submodule 'shared' ([email protected]:universal-automata/liblevenshtein-shared.git) registered for path 'shared'
% git submodule update
Cloning into 'coffeescript'...
remote: Counting objects: 73, done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 73 (delta 20), reused 63 (delta 10)
Receiving objects: 100% (73/73), 26.80 KiB | 0 bytes/s, done.
Resolving deltas: 100% (20/20), done.
Checking connectivity... done.
Submodule path 'coffeescript': checked out 'dbe5a0197e09390e0a85bb48d43da6d9f4fe69d8'
Cloning into 'java'...
remote: Counting objects: 446, done.
remote: Compressing objects: 100% (246/246), done.
remote: Total 446 (delta 122), reused 360 (delta 36)
Receiving objects: 100% (446/446), 52.65 KiB | 0 bytes/s, done.
Resolving deltas: 100% (122/122), done.
Checking connectivity... done.
Submodule path 'java': checked out '39e3b90e1adca41747025c7137cfc9c1448cf165'
Cloning into 'shared'...
remote: Counting objects: 21, done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 21 (delta 6), reused 19 (delta 4)
Receiving objects: 100% (21/21), 7.29 KiB | 0 bytes/s, done.
Resolving deltas: 100% (6/6), done.
Checking connectivity... done.
Submodule path 'shared': checked out '8c1ccd5d7cb537e82c87380b7b21f0f922a1484b'
Based largely on the works of Stoyan Mihov, Klaus Schulz, and Petar Nikolaev Mitankin, this library generates Levenshtein transducers using nothing more than an input list of dictionary terms. The referenced literature includes: "Fast String Correction with Levenshtein-Automata", which defines the algorithm used to generate the Levenshtein automata, "Universal Levenshtein Automata. Building and Properties", which provided many mathematical proofs that helped me understand the algorithm and supplied the recursive definitions upon which I based my distance functions, and "Incremental Construction of Minimal Acyclic Finite-State Automata", that defined an algorithm for constructing Minimal Acyclic Finite-State Automata in linear time (i.e. MA-FSA, also known as DAWGs: Directed Acyclic Word Graphs) which I use to store the dictionary of terms.
Upon construction, the list of dictionary terms is indexed as an MA-FSA and a transducer is initialized according to the maximum edit distance and algorithm provided. When queried against, the states of the Levenshtein automaton corresponding to the query term, maximum edit distance, and algorithm specified are constructed on-demand (lazily) as the automaton is evaluated, as described by the paper, "Fast String Correction with Levenshtein-Automata". The Levenshtein automaton is intersected with the dictionary MA-FSA, and every string accepted by both automata is emitted in a list of correction candidates for the query term.
In contrast to many other Levenshtein automata-based algorithms, the entire Levenshtein automaton needn't be constructed prior to evaluation, and only those states of the automaton which are actually required are derived, as needed, thereby greatly improving the efficiency of the transducer in terms of both space and time complexity.
The infamous blog post, "Damn Cool Algorithms: Levenshtein Automata", provided me with a good starting point for this transducer, but the algorithm proposed therein was too inefficient for my needs. Yet, it did reference the paper "Fast String Correction with Levenshtein-Automata", which I ultimately used as the basis of the Levenshtein automata. The same paper also serves as the basis of the Levenshtein automata used by the Apache projects, Lucene and Solr (Lucene's FuzzyQuery is 100 times faster in 4.0), which itself is based on the project by Jean-Philippe Barrette-LaPierre, Moman.
Steve Hanov pointed me to the paper, "Incremental Construction of Minimal Acyclic Finite-State Automata", in his blog post entitled, "Compressing dictionaries with a DAWG". Another of Steve's blogs also made an impact on me, namely "Fast and Easy Levenshtein distance using a Trie".
I've become heavily involved with the online movement regarding MOOCs (Massive Open Online Classrooms), and the following courses taught me much regarding the material within this library:
Finally, I must credit the course which first introduced me to the field of Information Retrieval, "Mathematical Applications in Search Engine Design", taught by Dr. Rao Li at the University of South Carolina Aiken. I highly recommend that course if you are in the area.
liblevenshtein is maintained by @dylon