Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loading very large dictionaries #31

Closed
remkoboschker opened this issue Oct 7, 2016 · 3 comments
Closed

loading very large dictionaries #31

remkoboschker opened this issue Oct 7, 2016 · 3 comments
Labels

Comments

@remkoboschker
Copy link

remkoboschker commented Oct 7, 2016

Hi,

I would like to use this module to find spelling alternates for strings. My dictionary has about 4 million words in it. Building the dawg takes a long time and I would like to store it in something like redis. Could you provide any advice? Does the whole graph need to be loaded in memory or can I retrieve nodes or part of the graph as alternates are accepted?

I'm using the javascript library. Perhaps any Java functionality needs porting to facilitate it?

@dylon dylon added the question label Oct 11, 2016
@dylon
Copy link
Member

dylon commented Oct 11, 2016

Hi @remkoboschker,

  1. "Building the dawg takes a long time"
    • You are probably building the DAWG from an unsorted list, or are not specifying that the list has already been sorted. As described by the usage documentation, unless you specify otherwise the dictionary is assumed unsorted and will be sorted before constructing the DAWG.
    • If you pre-sort the list (lexicographically, case-sensitive, in ascending order) then you should pass true as the second parameter to liblevenshtein.Builder#dictionary(Array, Boolean).
  2. "I would like to store it in something like redis"
    • The library does not support this directly, but you could use the low-level constructor for building a DAWG, pass the DAWG as the first parameter to liblevenshtein.Builder#dictionary(Array, Boolean), and store the DAWG in Redis for reuse. When you need the transducer again, you may retrieve the DAWG from Redis and pass it directly to liblevenshtein.Builder#dictionary(Array, Boolean).
      • Before you construct the DAWG, be sure you sort the list lexicographically in case-sensitive, ascending order.

      • Assuming your sorted list of terms is called list, pass list as the singleton parameter to the constructor liblevenshtein.Dawg(Array), like so:

        var dawg = new liblevenshtein.Dawg(list); // or retrieve it from Redis
        var builder = new levenshtein.Builder().dictionary(dawg);
        var transducer = builder.transducer();
  3. "can I retrieve nodes or part of the graph as alternates are accepted?"
    • This case is not supported right now. You wouldn't want to use this approach unless you are dealing with larger-than-memory dictionaries, as it will seriously affect your performance (network latency + network hops + server overhead + database retrieval time, etc.)
    • I will support this eventually.

@dylon dylon closed this as completed Oct 11, 2016
@remkoboschker
Copy link
Author

Thank you for your replies. I'll get the list sorted, store the dawg and pass it to the constructor for faster setup. Does the java lib support updating the dawg incrementally as mentioned in the #13 and #14 3.0 milestone enhancements?

@dylon
Copy link
Member

dylon commented Oct 12, 2016

Not yet, but that will be one of the next features I add.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants