This repository has been archived by the owner on Feb 19, 2022. It is now read-only.

Commit

Merge pull request #132 from mbwolff/FixFrStopwords
Fix fr stopwords
rlskoeser committed Aug 9, 2013
2 parents be7c13e + 6d48f43 commit a679827
Showing 3 changed files with 20 additions and 4 deletions.
16 changes: 13 additions & 3 deletions README.md
@@ -3,9 +3,19 @@

About
-----
-[Serendip-o-matic](http://serendipomatic.org/) connects your sources to digital materials located in libraries, museums, and archives around the world. By first examining your research interests, and then identifying related content in locations such as the Digital Public Library of America (DPLA), Europeana, and Flickr Commons, our serendipity engine helps you discover photographs, documents, maps and other primary sources.
-
-Whether you begin with text from an article, a Wikipedia page, or a full Zotero collection, Serendip-o-matic's special algorithm extracts key terms and returns a surprising reflection of your interests. Because the tool is designed mostly for inspiration, search results aren't meant to be exhaustive, but rather suggestive, pointing you to materials you might not have discovered. At the very least, the magical input-output process helps you step back and look at your work from a new perspective. Give it a whirl. Your sources may surprise you.
+[Serendip-o-matic](http://serendipomatic.org/) connects your sources to digital materials
+located in libraries, museums, and archives around the world. By first examining your
+research interests, and then identifying related content in locations such as the Digital
+Public Library of America (DPLA), Europeana, and Flickr Commons, our serendipity engine
+helps you discover photographs, documents, maps and other primary sources.
+
+Whether you begin with text from an article, a Wikipedia page, or a full Zotero
+collection, Serendip-o-matic's special algorithm extracts key terms and returns a
+surprising reflection of your interests. Because the tool is designed mostly for
+inspiration, search results aren't meant to be exhaustive, but rather suggestive,
+pointing you to materials you might not have discovered. At the very least, the magical
+input-output process helps you step back and look at your work from a new perspective.
+Give it a whirl. Your sources may surprise you.

Installation notes for developers
---------------------------------
6 changes: 5 additions & 1 deletion smartstash/core/utils.py
@@ -33,7 +33,11 @@

 def tokenize(text, lang='en'):
     # if language is not specified or not in our list, fall back to english
-    stopwords = nltk.corpus.stopwords.words(stopword_lang.get(lang, 'english'))
+    stopwords = nltk.corpus.stopwords.words(stopword_lang.get(lang))
+    if lang == 'fr':
+        stopwords.append('les')
+        stopwords.append('a')
+
     tokens = nltk.word_tokenize(text)
     words = [w.lower() for w in tokens
              if w.isalnum() and w.lower() not in stopwords]
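
Note on the hunk above: the new line calls `stopword_lang.get(lang)` without the `'english'` default, although the comment still describes an English fallback, and it appends `'les'` and `'a'` at runtime even though the same commit adds both words to the bundled French stopword file. The following is a minimal, self-contained sketch of how a tokenizer with the documented fallback might look. It is not the project's code: `STOPWORD_LANG` and `STOPWORDS` are hypothetical stand-ins for the real `stopword_lang` dict and the NLTK corpus files (neither is shown in this diff), and a simple regex replaces `nltk.word_tokenize` so the example runs without NLTK data.

```python
import re

# Hypothetical mapping from language codes to stopword list names; the real
# `stopword_lang` dict in smartstash/core/utils.py is not visible in this diff.
STOPWORD_LANG = {'en': 'english', 'fr': 'french'}

# Tiny illustrative stopword sets standing in for the NLTK corpus files.
# The French set already contains 'les' and 'a', mirroring the file change.
STOPWORDS = {
    'english': frozenset(['the', 'a', 'of', 'and', 'to']),
    'french': frozenset(['le', 'la', 'les', 'a', 'de', 'et']),
}


def tokenize(text, lang='en'):
    """Return lowercased alphanumeric tokens with stopwords removed."""
    # Unknown or missing language codes fall back to English,
    # which is what the code comment in the diff describes.
    list_name = STOPWORD_LANG.get(lang, 'english')
    stopwords = STOPWORDS[list_name]

    # Assumption: a plain regex tokenizer used in place of nltk.word_tokenize.
    tokens = re.findall(r"\w+", text)
    return [w.lower() for w in tokens
            if w.isalnum() and w.lower() not in stopwords]
```

With the words present in the stopword list itself, no language-specific branch is needed, and using a set keeps each membership check constant-time. For example, `tokenize("Les chats a Paris", lang='fr')` filters out both `les` and `a`, and an unrecognised code such as `lang='xx'` is handled with the English list.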
2 changes: 2 additions & 0 deletions smartstash/nltk_data/corpora/stopwords/french
@@ -15,6 +15,7 @@ il
je
la
le
+les
leur
lui
ma
@@ -116,6 +117,7 @@ eu
eue
eues
eus
+a
ai
as
avons
