forked from turian/biased-text-sample
-
Notifications
You must be signed in to change notification settings - Fork 0
pmadhyastha/biased-text-sample
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
biased-text-sample ================== by Joseph Turian Make a biased sample of a large text corpus, based upon text in a smaller text corpus. Essentially, Lucene index the large text corpus, and for each document in the smaller corpus retrieve the top ten Lucene results. Pipe a large stream of text into the indexer: /u/turian/data/web_corpus/WaCky2/sentencesplit.py | ./index-sentences.py REQUIREMENTS: * numpy Used for Bloom filter. * murmurhash Used for Bloom filter. * http://github.com/turian/common
About
Perform a biased sample of text data
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published