
StringForwardIndexBuilder #12

Open
elshize opened this issue Feb 2, 2017 · 1 comment
elshize commented Feb 2, 2017

Motivation

A simple forward index builder to use in unit/integration testing.

Description

Write StringForwardIndexBuilder (within module ReSearch/research/index/forward.py), a class with the following interface:

class StringForwardIndexBuilder:
    def build(self, properties, input): ...

where properties is a JSON object with index properties (see ReSearch/test/index/test_forward_index.py for an example), and input is a TextIOBase object, like those created by the following statements:

input = open("myfile.txt", "r", encoding="utf-8")
input = io.StringIO("some initial text data")

Use the NLTK tokenizer:

from nltk.tokenize import word_tokenize

No need for stemming or removing punctuation at this point. This builder is intended for testing and debugging. At a later point, it should share a unified interface with the other builders, but that is not a concern now.
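To make the interface concrete, here is a minimal sketch of what such a builder could look like. It is hypothetical: it assumes one document per line of `input` and builds an in-memory dict, and it uses an injectable tokenizer (defaulting to `str.split`) so the sketch runs without NLTK; the real builder would pass in `nltk.tokenize.word_tokenize`.

```python
import io


class StringForwardIndexBuilder:
    """Sketch: builds an in-memory forward index (doc id -> token list).

    Assumption: each line of `input` is one document. `tokenize` defaults
    to str.split here to keep the sketch dependency-free; in practice it
    would be nltk.tokenize.word_tokenize.
    """

    def __init__(self, tokenize=str.split):
        self.tokenize = tokenize

    def build(self, properties, input):
        # `properties` (the JSON object with index properties) is accepted
        # but unused in this sketch.
        index = {}
        for doc_id, line in enumerate(input):
            index[doc_id] = self.tokenize(line.strip())
        return index


builder = StringForwardIndexBuilder()
index = builder.build({}, io.StringIO("the quick fox\njumps over\n"))
# index == {0: ["the", "quick", "fox"], 1: ["jumps", "over"]}
```

The same `build` call works unchanged with a file handle opened via `open(...)`, since both are `TextIOBase` objects.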

@elshize elshize added this to the 1.0 milestone Feb 2, 2017

elshize commented Feb 9, 2017

Once the above is finished (or in parallel), here are some ideas for term processing. We need two kinds of objects:

  1. Term processors, e.g., stemmers; example from research.term:
from nltk.stem.snowball import SnowballStemmer

class EnglishStemmer:
    def __init__(self):
        self.stemmer = SnowballStemmer("english")

    def process(self, term):
        return self.stemmer.stem(term)
  2. Term pruners; example from research.index.pruning:
from nltk.corpus import stopwords

class EnglishStopWordsPruner:
    def __init__(self):
        self.stopwords = set(stopwords.words('english'))

    def test(self, term):
        return term not in self.stopwords

A list of term processors/pruners should be configurable, possibly through the JSON properties object.
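The two interfaces above compose naturally into a pipeline: every processor's `process` is applied to each term, then the term is kept only if all pruners' `test` calls pass. The sketch below uses hypothetical stand-ins (`LowercaseProcessor`, `MinLengthPruner`, `apply_pipeline`) rather than the NLTK-backed classes, so it runs without the NLTK data files.

```python
class LowercaseProcessor:
    """Hypothetical term processor: same interface as EnglishStemmer."""

    def process(self, term):
        return term.lower()


class MinLengthPruner:
    """Hypothetical term pruner: same interface as EnglishStopWordsPruner."""

    def __init__(self, min_length=2):
        self.min_length = min_length

    def test(self, term):
        return len(term) >= self.min_length


def apply_pipeline(terms, processors=(), pruners=()):
    """Run each term through all processors, then keep it if all pruners pass."""
    out = []
    for term in terms:
        for processor in processors:
            term = processor.process(term)
        if all(pruner.test(term) for pruner in pruners):
            out.append(term)
    return out


terms = apply_pipeline(["The", "Quick", "a", "Fox"],
                       processors=[LowercaseProcessor()],
                       pruners=[MinLengthPruner()])
# terms == ["the", "quick", "fox"]
```

A JSON-driven configuration could then simply map names like "english_stemmer" or "english_stopwords" to instances of these classes and hand the resulting lists to the pipeline.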
