
StringForwardIndexBuilder #12

Open
elshize opened this issue Feb 2, 2017 · 1 comment
elshize commented Feb 2, 2017

Motivation

A simple forward index builder to use in unit/integration testing.

Description

Write StringForwardIndexBuilder (within module ReSearch/research/index/forward.py), a class with the following interface:

class StringForwardIndexBuilder:
    def build(self, properties, input): ...

where properties is a JSON object with index properties (see ReSearch/test/index/test_forward_index.py for an example), and input is a TextIOBase object, like those created by the following statements:

input = open("myfile.txt", "r", encoding="utf-8")
input = io.StringIO("some initial text data")

Use the NLTK tokenizer:

from nltk.tokenize import word_tokenize

No need for stemming or removing punctuation at this point. This builder is intended for testing and debugging. At a later point, it should share a unified interface with the other builders, but that is not a concern now.
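To make the interface concrete, here is a minimal sketch of what such a builder could look like. It is hypothetical: it assumes one document per line of `input` and builds an in-memory dict, and it uses an injectable tokenizer (defaulting to `str.split`) so the sketch runs without NLTK; the real builder would pass in `nltk.tokenize.word_tokenize`.

```python
import io


class StringForwardIndexBuilder:
    """Sketch: builds an in-memory forward index (doc id -> token list).

    Assumption: each line of `input` is one document. `tokenize` defaults
    to str.split here to keep the sketch dependency-free; in practice it
    would be nltk.tokenize.word_tokenize.
    """

    def __init__(self, tokenize=str.split):
        self.tokenize = tokenize

    def build(self, properties, input):
        # `properties` (the JSON object with index properties) is accepted
        # but unused in this sketch.
        index = {}
        for doc_id, line in enumerate(input):
            index[doc_id] = self.tokenize(line.strip())
        return index


builder = StringForwardIndexBuilder()
index = builder.build({}, io.StringIO("the quick fox\njumps over\n"))
# index == {0: ["the", "quick", "fox"], 1: ["jumps", "over"]}
```

The same `build` call works unchanged with a file handle opened via `open(...)`, since both are `TextIOBase` objects.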

@elshize elshize added this to the 1.0 milestone Feb 2, 2017

elshize commented Feb 9, 2017

Once the above is finished (or in parallel), here are some ideas for term processing. We need two kinds of objects:

  1. Term processors, e.g., stemmers; example from research.term:
from nltk.stem.snowball import SnowballStemmer

class EnglishStemmer:
    def __init__(self):
        self.stemmer = SnowballStemmer("english")

    def process(self, term):
        return self.stemmer.stem(term)
  2. Term pruners; example from research.index.pruning:
from nltk.corpus import stopwords

class EnglishStopWordsPruner:
    def __init__(self):
        self.stopwords = set(stopwords.words('english'))

    def test(self, term):
        return term not in self.stopwords

A list of term processors/pruners should be configurable, possibly through the JSON properties object.
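The two interfaces above compose naturally into a pipeline: every processor's `process` is applied to each term, then the term is kept only if all pruners' `test` calls pass. The sketch below uses hypothetical stand-ins (`LowercaseProcessor`, `MinLengthPruner`, `apply_pipeline`) rather than the NLTK-backed classes, so it runs without the NLTK data files.

```python
class LowercaseProcessor:
    """Hypothetical term processor: same interface as EnglishStemmer."""

    def process(self, term):
        return term.lower()


class MinLengthPruner:
    """Hypothetical term pruner: same interface as EnglishStopWordsPruner."""

    def __init__(self, min_length=2):
        self.min_length = min_length

    def test(self, term):
        return len(term) >= self.min_length


def apply_pipeline(terms, processors=(), pruners=()):
    """Run each term through all processors, then keep it if all pruners pass."""
    out = []
    for term in terms:
        for processor in processors:
            term = processor.process(term)
        if all(pruner.test(term) for pruner in pruners):
            out.append(term)
    return out


terms = apply_pipeline(["The", "Quick", "a", "Fox"],
                       processors=[LowercaseProcessor()],
                       pruners=[MinLengthPruner()])
# terms == ["the", "quick", "fox"]
```

A JSON-driven configuration could then simply map names like "english_stemmer" or "english_stopwords" to instances of these classes and hand the resulting lists to the pipeline.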
