
Test out running a few dictionaries through this tool #8

Open
waldoj opened this issue Sep 23, 2015 · 18 comments
waldoj commented Sep 23, 2015

Just see what happens.

waldoj self-assigned this Sep 23, 2015
waldoj commented Oct 9, 2015

I've got an instance of this running on an EC2 Micro, ready to step through a corpus of documents.

waldoj commented Oct 9, 2015

I started ingesting the Florida and Virginia dictionaries, but on 1,009 of 155,097 (155,097 what, I don't know), it died with this error:

/var/lib/gems/1.9.1/gems/treat-2.1.0/lib/treat/core/dsl.rb:17:in `method_missing': undefined method `value' for nil:NilClass (NoMethodError)
    from /home/ubuntu/synonyms/Build From Corpus/lib/parse_corpus.rb:42:in `build_word_arcs'
    from /home/ubuntu/synonyms/Build From Corpus/lib/build_dict.rb:36:in `block in perform_step_one'
    from /home/ubuntu/synonyms/Build From Corpus/lib/build_dict.rb:15:in `each'
    from /home/ubuntu/synonyms/Build From Corpus/lib/build_dict.rb:15:in `perform_step_one'
    from ./build_dict.rb:33:in `<main>'

My guess is that there's a data error, but without knowing what the 155,097 represents, it's tough to know where to check. The file in question is 7,359 lines, 2,106,973 bytes, 2,070,718 characters, and 305,621 words. Looking through build_dict.rb, I see that the count is these_words.length, and these_words is created like so:

these_words = read_and_clean_file(file, stops)

I think it's a word count. My guess is that it's using a different method of counting words than wc.
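To illustrate the discrepancy, here's a hypothetical sketch (not the project's actual read_and_clean_file): a cleaning pass that strips punctuation before splitting will count "words" differently than wc -w, which splits on whitespace alone.

```ruby
# Hypothetical illustration of why a cleaned word count can differ
# from `wc -w`: stripping punctuation before splitting turns one
# hyphenated token into two words.
text = "long-term care services, when provided"

wc_style    = text.split(/\s+/).length             # like `wc -w`: whitespace only
cleaned     = text.downcase.gsub(/[^a-z\s]/, ' ')  # punctuation stripped first
clean_style = cleaned.split(/\s+/).reject(&:empty?).length

puts wc_style    # 5 tokens
puts clean_style # 6 tokens: "long-term" became "long" and "term"
```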

waldoj commented Oct 9, 2015

I tried uncommenting words = reject_non_utf8(words) in lib/parse_corpus.rb, in case that might help, but it didn't do any good.
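For reference, here is a minimal sketch of what a reject_non_utf8() helper could do (this is an assumption about its behavior, not the actual code in lib/parse_corpus.rb):

```ruby
# Hypothetical sketch of a reject_non_utf8() helper: drop any token whose
# bytes are not valid UTF-8, rather than letting it blow up downstream.
def reject_non_utf8(words)
  words.select { |w| w.dup.force_encoding('UTF-8').valid_encoding? }
end

reject_non_utf8(["care", "services", "bad\xFFbyte"])
# the token with the invalid \xFF byte is dropped
```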

waldoj commented Oct 9, 2015

I switched from TSV to CSV, in case there was some kind of encoding problem, but it still crashed. So I'll have to dive into the code. (Keeping in mind that I don't know Ruby.)

waldoj commented Oct 9, 2015

OK, so it looks like arc.words isn't being populated fully, so there is no .value method to call.

waldoj commented Oct 9, 2015

I've observed the bug in action, although not the cause of it. build_word_arcs() is expecting a series of three-value arrays (e.g., [Word (23914420), Word (24491240), Word (24910340)]), and when it encounters one that is missing a value (e.g., [Word (25551560), Word (26401100)]), it dies because of the nil value. Presumably there are two potential solutions: skip any array that is missing a value, or solve the underlying problem.

waldoj commented Oct 9, 2015

I enabled the display of the actual words, plus the array values:

Parsing => ../State Dictionaries/Virginia.csv   [1 of 2]    [301 of 203513]
["longterm", "care", "services"]
[Word (38749800), Word (39138840), Word (39556040)]
Parsing => ../State Dictionaries/Virginia.csv   [1 of 2]    [302 of 203513]
["care", "services", "when"]
[Word (40214040), Word (41047380)]

I'm not sure what's going on here, but somehow ["care", "services", "when"] is being turned into a two-entry array, instead of a three-entry array.

waldoj commented Oct 9, 2015

I dialed up the verbosity a bit, and got this:

Parsing => ../State Dictionaries/Virginia.csv   [1 of 2]    [301 of 203513]
["longterm", "care", "services"]
Fragment (37437360)  --- "longterm care services"  ---  {:tag_set=>:penn}   --- [] 
Parsing => ../State Dictionaries/Virginia.csv   [1 of 2]    [302 of 203513]
["care", "services", "when"]
Fragment (38526020)  --- "care services (WRB when)"  ---  {:tag_set=>:penn}   --- []

(WRB when) seems like a big red flag. I see that WRB is the part-of-speech tag to describe "where" or "when". This is the only time in the debug output that the string (WRB appears.

waldoj commented Oct 9, 2015

I've taken the coward's way out—if the array size is less than 3, it skips that set of terms.
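That guard might look something like this (a sketch; complete_arc? and the sample data are hypothetical, not the actual diff):

```ruby
# Hypothetical sketch of the guard described above: skip any word arc
# that came back from the tagger with fewer than three values, instead
# of calling .value on nil and crashing.
def complete_arc?(words)
  words.length == 3
end

arcs = [%w[longterm care services], %w[care services]]
usable = arcs.select { |a| complete_arc?(a) }
# only the complete three-word arc survives
```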

waldoj commented Oct 9, 2015

This seems to be working much better. Also, this is going to take a very long time. About 4 hours just to complete the first of three steps. I may need to step up from an EC2 Micro to something beefier.

waldoj commented Oct 9, 2015

So, while this is running, it's not really doing it right. That is, we have a term → definition, with a known relationship between the two, but we're ignoring that. This system is set up for parsing unstructured text, but our text has structure. We know that a given phrase has a given definition. Instead, we're munging together every term and every definition—eliminating the structure.

The solution is to retool how this works. Instead of using a system meant to work with unstructured text, we need to use one meant to deal with structured text—something that can break down definitions and relate them to terms. Also, I imagine we'll want to pre-process dictionaries to eliminate redundancies, e.g.:

person: "Person" means any human...

We can eliminate the first use of the defined terms, when present within quotes, and the word "means," leaving us with:

person: any human...
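That pre-processing pass could be sketched with a regular expression (hypothetical; clean_definition is not part of the current code):

```ruby
# Hypothetical pre-processing sketch: strip the restated quoted term and
# the word "means" from the front of a definition, per the example above.
def clean_definition(term, definition)
  definition.sub(/\A"#{Regexp.escape(term)}"\s+means\s+/i, '')
end

clean_definition('person', '"Person" means any human...')
# => "any human..."
```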

waldoj commented Oct 9, 2015

Aaand it ran out of memory and crashed when it was 18% finished:

/var/lib/gems/1.9.1/gems/data_objects-0.10.16/lib/data_objects/connection.rb:79:in `initialize': SSL SYSCALL error: EOF detected (DataObjects::ConnectionError)
could not fork new process for connection: Cannot allocate memory

So an EC2 Micro probably won't work. Also, this was going to take a lot longer than 4 hours. Maybe 8 hours? A bigger server is called for.

waldoj commented Oct 9, 2015

Running it again, this time with a beefier instance, so it won't run out of memory. I don't think it's running any faster, though. :(

waldoj commented Oct 9, 2015

I was wrong—this is faster. Maybe 50% faster?

compleatang (Contributor) commented:

Welcome to Casey's explores NLP without knowing what he's doing land :) (hopefully there's some value for you in that work Waldo)

waldoj commented Oct 9, 2015

Of course! Now it's Waldo Explores NLP Without Knowing What He's Doing Land. :) I'm working with @seamuskraft to create a synonyms.txt file based on a ~dozen state and municipal codes, using the dictionaries that we've already auto-assembled from their many glossaries. Step 1 is just to get your code up and running. Step 2 might be to toss it out and start over again, based on what we've learned in the process—who knows? :)

compleatang (Contributor) commented:

:)

seamuskraft commented:

It is worth remembering that Columbus was trying to get hisself to India and China...
