This repository has been archived by the owner on Feb 19, 2022. It is now read-only.

Kluge for French stopwords.
I think the problem is with NLTK: its French stopword list is missing
"les" and "a".
mbwolff committed Aug 8, 2013
1 parent 141fe16 commit 6d48f43
Showing 2 changed files with 7 additions and 1 deletion.
6 changes: 5 additions & 1 deletion smartstash/core/utils.py
@@ -33,7 +33,11 @@

 def tokenize(text, lang='en'):
     # if language is not specified or not in our list, fall back to english
-    stopwords = nltk.corpus.stopwords.words(stopword_lang.get(lang, 'english'))
+    stopwords = nltk.corpus.stopwords.words(stopword_lang.get(lang))
+    if lang == 'fr':
+        stopwords.append('les')
+        stopwords.append('a')
+
     tokens = nltk.word_tokenize(text)
     words = [w.lower() for w in tokens
              if w.isalnum() and w.lower() not in stopwords]
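
The change to `tokenize` above can be illustrated with a minimal, self-contained sketch. It is not the project's code: `STOPWORD_LANG`, `BASE_STOPWORDS`, `EXTRA_STOPWORDS`, and `load_stopwords` are hypothetical names, the tiny word lists stand in for NLTK's corpora, and a regular expression stands in for `nltk.word_tokenize`. Because `dict.get` without a default returns `None` for an unlisted language, the sketch keeps an explicit `'english'` default so the fallback described in the code comment still holds.

```python
import re

# Hypothetical stand-ins: the real project maps language codes to NLTK
# corpus names and reads the word lists through nltk.corpus.stopwords.
STOPWORD_LANG = {"en": "english", "fr": "french"}
BASE_STOPWORDS = {
    "english": ["the", "a", "of", "and"],
    "french": ["le", "la", "leur", "de", "et"],
}
# Per-language additions, mirroring the 'les' and 'a' kluge in the commit.
EXTRA_STOPWORDS = {"fr": ["les", "a"]}


def load_stopwords(lang="en"):
    """Return the stopword set for lang, falling back to English."""
    name = STOPWORD_LANG.get(lang, "english")
    words = set(BASE_STOPWORDS[name])  # copy, so the base list is never mutated
    words.update(EXTRA_STOPWORDS.get(lang, []))
    return words


def tokenize(text, lang="en"):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    stopwords = load_stopwords(lang)
    tokens = re.findall(r"\w+", text.lower())  # assumption: regex instead of NLTK
    return [w for w in tokens if w not in stopwords]
```

Keeping the additions in a table and building a fresh set on each call means more languages can be patched without further `if lang == ...` branches, and set membership is cheaper to test than list membership.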
2 changes: 2 additions & 0 deletions smartstash/nltk_data/corpora/stopwords/french
@@ -15,6 +15,7 @@ il
 je
 la
 le
+les
 leur
 lui
 ma
@@ -116,6 +117,7 @@ eu
 eue
 eues
 eus
+a
 ai
 as
 avons
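
One way to confirm which entries a stopword file lacks before patching it is sketched below. The file format is taken from the diff above (one word per line); `missing_stopwords` is a hypothetical helper, not part of NLTK or of this repository, and the sample text is just the first hunk's pre-commit context lines.

```python
def missing_stopwords(stopword_text, required):
    """Return the words in required that do not appear in a
    one-word-per-line stopword list, preserving their order."""
    present = {line.strip() for line in stopword_text.splitlines() if line.strip()}
    return [w for w in required if w not in present]


# Context lines from the first hunk, as they stood before this commit.
sample = "il\nje\nla\nle\nleur\nlui\nma\n"
print(missing_stopwords(sample, ["le", "les", "a"]))  # ['les', 'a']
```

In practice the text would be read from the bundled `smartstash/nltk_data/corpora/stopwords/french` file rather than from an inline string.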