Stopwords... #17
Comments
I don't have the data on this computer, but I guess this bug is related to all text information being copied into the generic "text" field. |
Good thinking. Maybe just merge the really common ones. I wonder what the downsides would be (valid words being ignored because they are stopwords in some other language?). Relevant: http://lucene.472066.n3.nabble.com/multilingual-list-of-stopwords-td481037.html |
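One way to explore the "merge the really common ones" idea and its downside is sketched below. This is only an illustration, not the project's configuration: the tiny word lists, the function names `merge_common` and `cross_language_conflicts`, and the reading of "really common" as "a stopword in at least two languages" are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical, deliberately tiny samples; real lists would be loaded from files.
STOPWORDS = {
    "en": {"a", "in", "the"},
    "fr": {"a", "en", "de"},
    "de": {"in", "die", "war"},
}
# Hypothetical content words that should stay searchable in each language.
CONTENT_WORDS = {
    "en": {"die", "war"},
}

def merge_common(stopwords_by_lang, min_langs=2):
    """Keep only words that are stopwords in at least `min_langs` languages."""
    counts = Counter(w for words in stopwords_by_lang.values() for w in set(words))
    return {w for w, n in counts.items() if n >= min_langs}

def cross_language_conflicts(stopwords_by_lang, content_by_lang):
    """Map each word that is a stopword in one language and a content word
    in another to the set of (stopword_lang, content_lang) pairs."""
    conflicts = {}
    for stop_lang, stops in stopwords_by_lang.items():
        for content_lang, content in content_by_lang.items():
            if stop_lang == content_lang:
                continue
            for word in set(stops) & set(content):
                conflicts.setdefault(word, set()).add((stop_lang, content_lang))
    return conflicts

merged = merge_common(STOPWORDS)
conflicts = cross_language_conflicts(STOPWORDS, CONTENT_WORDS)
```

With these sample lists, `merged` contains only "a" and "in", while `conflicts` flags "die" and "war": German stopwords that are meaningful English search terms and would be lost if the German list were applied to the shared field.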
Yes, the issue is whether words that are stop words in one language and common words in another language are relevant for the search or not. Maybe we can begin constructing a new list with the following constraints/criteria. |
Hi Mike and Kepa,
It has been some time ago, but you know the language of an EAD finding aid ...
But are you sure you need to remove stopwords? Perhaps historians are not ...
I think stopword removal is particularly useful for Web search and very ...
I would not recommend pruning by word length, because there may be ...
Good luck!
Cheers, junte
On Tue, Apr 15, 2014 at 9:10 AM, Kepa J. Rodriguez wrote:
|
Hi Junte, thanks for your answer. We already have a field type for each language, but there is a generic text field into which all the textual information is copied regardless of language (I know that this is a very problematic approach). The problem with stop words is that in such a multilingual environment the overgeneration of results can increase dramatically, but of course we will need to do more tests on that. I'm not totally sure about the meaningfulness of stopwords. In other fields they are meaningful when the user is looking for, say, the title of an old text or something similar (that is, when the user looks for a string consisting of more than one word). But in our case, if a user searches for information about an event and gets all the collections containing the string (e.g.) "a" or "in", the search can be unsuccessful.
|
Hi Junte - good to hear from you! I hope life is treating you well. As Kepa points out, the underlying problem is probably due to the fact that we have one single default search field (called (Mmmn, just thinking aloud - can case-sensitivity be used to distinguish acronyms from stopwords...???) |
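The case-sensitivity idea could look roughly like the sketch below. It is only a thought experiment in plain Python, not a Solr analyzer: the function name `filter_stopwords`, the sample stopword set, and the rule "all-uppercase tokens longer than one character are acronyms" are assumptions made for illustration.

```python
def filter_stopwords(text, stopwords):
    """Drop stopwords, but keep all-uppercase tokens as probable acronyms.

    `stopwords` is expected to hold lowercase words.
    """
    kept = []
    for token in text.split():
        # Assumption: a multi-letter, fully uppercase token is an acronym.
        is_acronym = len(token) > 1 and token.isupper()
        if is_acronym or token.lower() not in stopwords:
            kept.append(token)
    return kept

# Hypothetical English stopword sample.
sample_stopwords = {"who", "is", "in", "the", "it"}
tokens = filter_stopwords("WHO is in the IT department", sample_stopwords)
```

Here `tokens` keeps "WHO" and "IT" alongside "department". The obvious weakness is text written entirely in capitals (titles, headings), where every stopword would be mistaken for an acronym, and the check only helps if case is preserved until the stopword filter runs.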
There seems to be an issue with stopwords not being properly excluded by the current search config. For example, if you search for "Demandes en obtention", the main hit is "Demandes en obtention d' une autorisation de batir" with a score of 28+ (which is as expected), but all the remaining 128 results just have "en" somewhere in their body, with a score of <= 6.
There are similar issues with English and German stopwords.
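The symptom can be reproduced in miniature with the toy matcher below. It is a simplified sketch, not how the search engine scores documents: it treats a document as a hit if it shares any query term (OR semantics), and apart from the title quoted above the documents and the function name `matching_docs` are made up for the example.

```python
def matching_docs(query, docs, stopwords=frozenset()):
    """Return documents sharing at least one non-stopword term with the query."""
    terms = {t for t in query.lower().split() if t not in stopwords}
    return [d for d in docs if terms & set(d.lower().split())]

docs = [
    "Demandes en obtention d' une autorisation de batir",
    "Rapport en annexe",   # hypothetical record that only shares "en"
    "Lettres diverses",    # hypothetical record with no shared term
]

hits_without_filter = matching_docs("Demandes en obtention", docs)
hits_with_filter = matching_docs("Demandes en obtention", docs, stopwords={"en"})
```

Without stopword removal the "Rapport en annexe" record is returned purely because of "en"; once "en" is treated as a stopword at query time, only the expected title remains.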