Stopwords... #17

mikesname · 2014-04-14T14:23:55Z

There seems to be an issue with stopwords not being properly excluded by the current search config. For example, if you search for Demandes en obtention, the main hit is "Demandes en obtention d' une autorisation de batir" with a score of 28+ (which is as expected), but all the remaining 128 results just have "en" somewhere in their body, with a score of <=6.

There are similar issues with English and German stopwords.

KepaJRodriguez · 2014-04-14T15:29:04Z

I don't have the data on this computer, but I guess this bug is related to all text information being copied into the generic "text" field.
If the generic search field is a copy of all fields, which stop word list does it use? Language-specific stop word lists can only be applied after language detection. If the generic "text" field uses the English list (I think that is the case), the word "en" is not in the list of English stop words.
Should we maybe merge the stop word lists of all languages, ONLY for searches in this generic field?
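
A minimal sketch of the effect described above, assuming tiny illustrative stop word lists (not the lists actually shipped with Solr or used in this project). The function name `remove_stopwords` is hypothetical. It shows that "en" survives an English-only list and is dropped once the French list is merged in:

```python
# Tiny illustrative stop word lists; real lists are much longer.
ENGLISH_STOPWORDS = {"a", "an", "the", "in", "of", "and"}
FRENCH_STOPWORDS = {"en", "de", "la", "le", "une", "et"}


def remove_stopwords(query, stopwords):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [tok for tok in query.lower().split() if tok not in stopwords]


query = "Demandes en obtention"

# English-only list: "en" is not a stop word, so it stays in the query
# and matches every document that contains it.
english_only = remove_stopwords(query, ENGLISH_STOPWORDS)

# Merged list for the generic field: "en" is now dropped.
merged = remove_stopwords(query, ENGLISH_STOPWORDS | FRENCH_STOPWORDS)

print(english_only)  # ['demandes', 'en', 'obtention']
print(merged)        # ['demandes', 'obtention']
```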

mikesname · 2014-04-14T16:34:01Z

Good thinking. Maybe just merge the really common ones. Wonder what the downsides here would be (valid words which are stopwords in some other language being ignored?)

Relevant: http://lucene.472066.n3.nabble.com/multilingual-list-of-stopwords-td481037.html

KepaJRodriguez · 2014-04-15T07:10:41Z

Yes, the issue is whether words that are stop words in one language and common words in another are relevant for the search or not. Maybe we can begin constructing a new list with the following constraints/criteria:
a) The full English list will be part of the generic stop word list.
b) Which is the second most important language with a Latin alphabet? French or German? Then take those words too; manual supervision here is not difficult.
c) For the other languages with a Latin alphabet, import only words with length >= 3 characters.
d) Stop word lists in other alphabets (e.g. Cyrillic): import them completely.
If constraint (c) is too liberal, maybe we can reduce it to length >= 3. What do you think?
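
Criteria (a) to (d) could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample word lists are tiny and hypothetical, French is picked as the "second language" purely for the example (the question of French vs. German is still open above), and `build_generic_stopwords` is a made-up helper name.

```python
def build_generic_stopwords(lists, full_import, min_length=3):
    """Merge per-language stop word lists into one generic list.

    lists: dict mapping a language code to a set of stop words.
    full_import: language codes whose lists are imported completely
        (criteria a, b and d).
    min_length: for every other language, keep only stop words with at
        least this many characters (criterion c).
    """
    merged = set()
    for lang, words in lists.items():
        if lang in full_import:
            merged |= words
        else:
            merged |= {w for w in words if len(w) >= min_length}
    return merged


# Hypothetical sample data, far smaller than real stop word lists.
sample_lists = {
    "en": {"a", "in", "the"},    # (a) full English list
    "fr": {"en", "de", "une"},   # (b) assumed second language
    "it": {"il", "lo", "per"},   # (c) other Latin-alphabet language
    "ru": {"и", "в", "не"},      # (d) other alphabet, imported completely
}

generic = build_generic_stopwords(sample_lists, full_import={"en", "fr", "ru"})
print(sorted(generic))
```

Note that the length filter is applied to the stop word lists themselves, not to the indexed text, so short acronyms in documents are unaffected unless they happen to collide with an imported stop word.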

juntezhang · 2014-04-15T07:42:14Z

Hi Mike and Kepa,

It has been some time, but you know the language of an EAD finding aid,
right? If so, you can create a fieldtype for each language and assign a
stopword list to it.

But are you sure you need to remove stopwords? Perhaps historians are not
as picky as linguists, but even stopwords can be meaningful for researchers.

I think stopword removal is particularly useful for Web search and very
generic searchers.

I would not recommend pruning by word length, because there may be
instances of acronyms.

Good luck!

Cheers, junte

KepaJRodriguez · 2014-04-15T08:02:05Z

Hi Junte, thanks for your answer. We already have a fieldtype for each language, but there is a generic text field into which all the textual information is copied regardless of language (I know that this is a very problematic approach). The problem with stop words is that in such a multilingual environment the overgeneration of results can increase dramatically, but of course we will need to do more tests on that.

I'm not totally sure about the meaningfulness of stopwords. In other fields they are meaningful when the user is looking for, say, the title of an old text or something similar (that is, when the user searches for a string consisting of more than one word). But in our case, if a user searches for information about an event and gets all the collections containing a string such as "a" or "in", the search can be unsuccessful.

"I would not recommend pruning by word length, because there may be instances of acronyms."
Yes, of course it might be problematic, but I mean pruning by word length among the words in the stop word lists, not among the words in the text.

mikesname · 2014-04-15T10:37:40Z

Hi Junte - good to hear from you! I hope life is treating you well.

As Kepa points out, the underlying problem is probably due to the fact that we have one single default search field (called text) into which everything "text-like" is currently copied. Probably a good start would be to drop that and specify the default fields explicitly. Which I assume you can do, but I'm not entirely sure because the Solr documentation is so dubious.

(Mmmn, just thinking aloud - can case-sensitivity be used to distinguish acronyms from stopwords...???)
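
One way the case-sensitivity idea might look, as a rough sketch only: treat all-uppercase tokens as likely acronyms and exempt them from stop word removal. The stop word set and the `filter_tokens` helper are hypothetical, and this assumes the filter sees tokens before any lowercasing step, which may not match the current analyzer chain.

```python
# Illustrative stop words that can also be read as acronyms (WHO, US, IT).
STOPWORDS = {"it", "us", "who", "en", "in", "the"}


def filter_tokens(query, stopwords=STOPWORDS):
    """Drop stop words, but keep all-uppercase tokens as likely acronyms."""
    kept = []
    for tok in query.split():
        # Single letters such as "A" are not treated as acronyms.
        is_acronym = len(tok) > 1 and tok.isupper()
        if is_acronym or tok.lower() not in stopwords:
            kept.append(tok)
    return kept


print(filter_tokens("WHO reports in the US"))  # ['WHO', 'reports', 'US']
print(filter_tokens("who wrote it"))           # ['wrote']
```

The obvious weakness is text or queries typed entirely in capitals, where every stop word would be kept.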

mikesname added bug labels Apr 14, 2014

mikesname assigned KepaJRodriguez Apr 14, 2014
