Stopwords... #17

Open
mikesname opened this issue Apr 14, 2014 · 6 comments

Comments

@mikesname
Contributor

There seems to be an issue with stopwords not being properly excluded by the current search config. For example, if you search for Demandes en obtention, the main hit is "Demandes en obtention d' une autorisation de batir" with a score of 28+ (which is as expected), but all the remaining 128 results just have "en" somewhere in their body, with a score of <=6.

There are similar issues with English and German stopwords.

@KepaJRodriguez
Contributor

I don't have the data on this computer, but I guess this bug is related to all the text information being copied into the generic "text" field.
If the generic search field is a copy of all fields, which stop word list does it use? Language-specific stop word lists can only be applied after language detection. If the generic "text" field uses the English list (I think that is the case), the word "en" is not in it.
Should we maybe merge the stop word lists of all the languages, but ONLY for searches in this generic field?

@mikesname
Contributor Author

Good thinking. Maybe just merge the really common ones. Wonder what the downsides here would be (valid words which are stopwords in some other language being ignored?)
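A rough way to size up that downside is to check each stopword against the content vocabulary of the other languages. The sketch below is only an illustration: the stopword samples and the `CONTENT_WORDS` vocabularies are tiny hypothetical stand-ins, not the lists used by the search config.

```python
# Hypothetical, abbreviated stopword samples (assumption: real lists are much longer).
STOPWORDS = {
    "en": {"the", "in", "a", "an", "of"},
    "fr": {"en", "de", "une", "la"},
    "de": {"die", "der", "in", "an", "war"},
}

# Hypothetical content vocabularies: words that carry meaning in each language.
CONTENT_WORDS = {
    "en": {"die", "war"},  # German stopwords that are ordinary English words
    "fr": {"an"},          # "an" means "year" in French
}


def cross_language_collisions(stopwords, vocab_by_lang):
    """Map (stopword, its language) to the other languages where it is a content word."""
    collisions = {}
    for stop_lang, words in stopwords.items():
        for word in words:
            hit_langs = sorted(
                lang
                for lang, vocab in vocab_by_lang.items()
                if lang != stop_lang and word in vocab
            )
            if hit_langs:
                collisions[(word, stop_lang)] = hit_langs
    return collisions


collisions = cross_language_collisions(STOPWORDS, CONTENT_WORDS)
for (word, stop_lang), langs in sorted(collisions.items()):
    print(f"{word!r} is a stopword in {stop_lang} but a content word in {langs}")
```

Any word that shows up here would be silently dropped from queries in the other language if the lists were merged wholesale.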

Relevant: http://lucene.472066.n3.nabble.com/multilingual-list-of-stopwords-td481037.html

@KepaJRodriguez
Contributor

Yes, the issue is whether words that are stop words in one language and common words in another are relevant for the search or not. Maybe we can begin constructing a new list with the following constraints/criteria:
a) The full English list will be part of the generic stop word list.
b) Which is the second most important language with a Latin alphabet? French or German? Then take its words too; manual supervision here is not difficult.
c) For the other languages with a Latin alphabet, import only words with length >= 3 characters.
d) Stop word lists in other alphabets (e.g. Cyrillic): import them completely.
If constraint (c) is too liberal, maybe we can reduce it to length >= 3. What do you think?
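The criteria above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name, the sample word lists, the `scripts` mapping and the choice of French as the second language are all assumptions made for the example.

```python
def build_generic_stopwords(lists, scripts, full_import=("en", "fr"), min_len=3):
    """Merge per-language stopword lists into one list for the generic field.

    lists:       {language code: iterable of stopwords}
    scripts:     {language code: "latin" or another script name}
    full_import: languages whose lists are taken whole, criteria (a) and (b)
    min_len:     minimum word length for other Latin-script languages, criterion (c)
    """
    merged = set()
    for lang, words in lists.items():
        if lang in full_import:
            merged.update(words)  # (a), (b): import the complete list
        elif scripts.get(lang) == "latin":
            merged.update(w for w in words if len(w) >= min_len)  # (c)
        else:
            merged.update(words)  # (d): non-Latin scripts are imported completely
    return merged


# Hypothetical, heavily abbreviated sample lists.
SAMPLE_LISTS = {
    "en": ["the", "in", "a"],
    "fr": ["en", "de", "une"],
    "es": ["el", "para", "como"],
    "ru": ["и", "в", "не"],
}
SAMPLE_SCRIPTS = {"en": "latin", "fr": "latin", "es": "latin", "ru": "cyrillic"}

generic = build_generic_stopwords(SAMPLE_LISTS, SAMPLE_SCRIPTS)
print(sorted(generic))
```

Note that the length filter is applied to entries of the stopword lists only, so short tokens in the indexed text are untouched unless they appear in the merged list.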

@juntezhang

Hi Mike and Kepa,

It has been some time, but you know the language of an EAD finding aid,
right? If so, you can create a field type for each language and assign a
stopword list to it.
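As a language-neutral illustration of that idea (in Solr it would be done with field types in the schema, not application code), the sketch below picks a stopword list by the document's known language. The lists and the `analyze` helper are hypothetical.

```python
# Hypothetical, abbreviated per-language stopword lists.
STOPWORDS_BY_LANG = {
    "en": frozenset({"the", "in", "a", "of"}),
    "fr": frozenset({"en", "de", "une", "la"}),
}


def analyze(text, lang):
    """Lowercase, split on whitespace, and drop the stopwords of the given language."""
    stop = STOPWORDS_BY_LANG.get(lang, frozenset())
    return [tok for tok in text.lower().split() if tok not in stop]


print(analyze("Demandes en obtention", "fr"))
print(analyze("Demandes en obtention", "en"))
```

With the French list, "en" is removed; with the English list it survives, which matches the behaviour reported at the top of this issue for the generic field.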

But are you sure you need to remove stopwords? Perhaps historians are not
as picky as linguists, but even stopwords can be meaningful for researchers.

I think stopword removal is particularly useful for Web search and very
generic searchers.

I would not recommend pruning by word length, because there may be
instances of acronyms.

Good luck!

Cheers, junte


@KepaJRodriguez
Contributor

Hi Junte, thanks for your answer. We already have a field type for each language, but there is a generic text field into which all the textual information is copied regardless of language (I know that this is a very problematic approach). The problem with stop words is that in such a multilingual environment the overgeneration of results can increase dramatically, but of course we will need to do more tests on that.

I'm not totally sure about the meaningfulness of stopwords. In other fields they are meaningful when the user is looking for, say, the title of an old text or something similar (that is, when the user searches for a string consisting of more than one word). But in our case, if a user searches for information about an event and gets all the collections containing the string "a" or "in" (for example), the search can be unsuccessful.

"I would not recommend pruning by word length, because there may be instances of acronyms."
Yes, of course it might be problematic, but I mean pruning by word length among the words in the stop word lists, not among the words of the text.

@mikesname
Contributor Author

Hi Junte - good to hear from you! I hope life is treating you well.

As Kepa points out, the underlying problem is probably due to the fact that we have one single default search field (called text) into which everything "text-like" is currently copied. Probably a good start would be to drop that and specify the default fields explicitly. Which I assume you can do, but I'm not entirely sure because the Solr documentation is so dubious.
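For what it's worth, Solr's extended DisMax query parser accepts a `qf` parameter listing the fields to search, so one possible shape for this is sketched below. The field names and boosts are hypothetical placeholders, not the project's schema, and the snippet only builds the request parameters without contacting a server.

```python
from urllib.parse import urlencode


def build_search_params(query, fields):
    """Build Solr request parameters that query explicit fields instead of a catch-all.

    fields: {field name: boost}; the names used below are hypothetical.
    """
    return {
        "q": query,
        "defType": "edismax",
        "qf": " ".join(f"{name}^{boost}" for name, boost in fields.items()),
    }


params = build_search_params(
    "Demandes en obtention",
    {"title_fr": 2.0, "text_fr": 1.0},  # assumed language-specific fields
)
query_string = urlencode(params)
print(query_string)
```

Each field listed in `qf` is analyzed with its own field type, so language-specific stopword lists would apply per field.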

(Mmmn, just thinking aloud - can case-sensitivity be used to distinguish acronyms from stopwords...???)
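A quick sketch of what that could look like, assuming tokens still carry their original case when the stopword check runs (the helper name and the sample list are made up for illustration):

```python
def filter_tokens(tokens, stopwords):
    """Drop stopwords, but keep all-uppercase tokens of two or more characters.

    Assumption: an all-caps token such as "US" or "WHO" is an acronym,
    even if its lowercase form is in the stopword list.
    """
    kept = []
    for tok in tokens:
        is_acronym_like = len(tok) > 1 and tok.isupper()
        if tok.lower() in stopwords and not is_acronym_like:
            continue
        kept.append(tok)
    return kept


SAMPLE_STOPWORDS = {"the", "in", "us", "who", "it"}  # hypothetical sample
result = filter_tokens("the US office in WHO archives".split(), SAMPLE_STOPWORDS)
print(result)
```

The obvious weakness is text written entirely in capitals (titles, headings), where every stopword would look like an acronym, and the check has to happen before any lowercasing step in the analysis chain.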
