loading very large dictionaries #31

remkoboschker · 2016-10-07T13:53:58Z

Hi,

I would like to use this module to find spelling alternates for strings. My dictionary has about 4 million words in it. Building the dawg takes a long time and I would like to store it in something like redis. Could you provide any advice? Does the whole graph need to be loaded in memory or can I retrieve nodes or part of the graph as alternates are accepted?

I'm using the javascript library. Perhaps any Java functionality needs porting to facilitate it?

dylon · 2016-10-11T14:15:26Z

Hi @remkoboschker,

"Building the dawg takes a long time"
- You are probably building the DAWG from an unsorted list, or are not specifying that the list has already been sorted. As described by the usage documentation, unless you specify otherwise the dictionary is assumed unsorted and will be sorted before constructing the DAWG.
- If you pre-sort the list (lexicographically, case-sensitive, in ascending order) then you should pass true as the second parameter to liblevenshtein.Builder#dictionary(Array, Boolean).
"I would like to store it in something like redis"
- The library does not support this directly, but you could use the low-level constructor for building a DAWG, pass the DAWG as the first parameter to liblevenshtein.Builder#dictionary(Array, Boolean), and store the DAWG in Redis for reuse. When you need the transducer again, you may retrieve the DAWG from Redis and pass it directly to liblevenshtein.Builder#dictionary(Array, Boolean).
  - Before you construct the DAWG, be sure you sort the list lexicographically in case-sensitive, ascending order.
  - Assuming your sorted list of terms is called list, pass list as the singleton parameter to the constructor liblevenshtein.Dawg(Array), like so:
```
var dawg = new liblevenshtein.Dawg(list); // or retrieve it from Redis
var builder = new levenshtein.Builder().dictionary(dawg);
var transducer = builder.transducer();
```
"can I retrieve nodes or part of the graph as alternates are accepted?"
- This case is not supported right now. You wouldn't want to use this approach unless you are dealing with larger-than-memory dictionaries, as it will seriously affect your performance (network latency + network hops + server overhead + database retrieval time, etc.)
- I will support this eventually.

remkoboschker · 2016-10-11T21:01:01Z

Thank you for your replies. I'll get the list sorted, store the dawg and pass it to the constructor for faster setup. Does the java lib support updating the dawg incrementally as mentioned in the #13 and #14 3.0 milestone enhancements?

dylon · 2016-10-12T09:48:14Z

Not yet, but that will be one of the next features I add.

dylon added the question label Oct 11, 2016

dylon closed this as completed Oct 11, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

loading very large dictionaries #31

loading very large dictionaries #31

remkoboschker commented Oct 7, 2016 •

edited

Loading

dylon commented Oct 11, 2016

remkoboschker commented Oct 11, 2016

dylon commented Oct 12, 2016

loading very large dictionaries #31

loading very large dictionaries #31

Comments

remkoboschker commented Oct 7, 2016 • edited Loading

dylon commented Oct 11, 2016

remkoboschker commented Oct 11, 2016

dylon commented Oct 12, 2016

remkoboschker commented Oct 7, 2016 •

edited

Loading