This repository has been archived by the owner on Feb 19, 2022. It is now read-only.

Dealing with stop words and NER in multilingual texts #114

Open
mialondon opened this issue Aug 5, 2013 · 30 comments

Comments

@mialondon
Contributor

Is the current workflow: 'detect language, apply appropriate stopwords' or 'apply generic multilingual stopwords'? If it's the former, can we detect multiple languages and apply the appropriate lists of stopwords?

As this conversation hints, many scholars work in two or more languages https://twitter.com/wilkohardenberg/status/363677752391516161 so ideally we could cope with returning entities and tokens for at least two languages and also apply stop words.

The trickiness of dealing with this might also be a call for more randomness in the way query terms are mixed so people can refresh the results and see different terms applied.
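The "refresh and see different terms" idea could be as simple as random sampling from the candidate terms. A minimal sketch, assuming a hypothetical `pick_query_terms` helper; nothing here is taken from the Serendip-o-matic code:

```python
import random


def pick_query_terms(candidates, k, rng=None):
    """Return up to k distinct terms chosen at random from candidates.

    Hypothetical helper: a fresh call (e.g. on page refresh) may return
    a different selection from the same candidate terms.
    """
    rng = rng or random.Random()
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    return rng.sample(unique, min(k, len(unique)))


terms = ["parco", "caccia", "Gran Paradiso", "hunting", "tourism", "parco"]
print(pick_query_terms(terms, 3))
```

Passing a seeded `random.Random` as `rng` makes the selection repeatable, which may help when writing tests.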

@rlskoeser
Contributor

The current workflow is to detect the language using Python guess-language and then select the appropriate stopwords if it's a language nltk has stopwords for. I hadn't thought about mixed languages, though. It might be helpful to have some sample mixed-language text so we can see what guess-language thinks of it and write some tests.
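For readers following along, the "detect once, then pick a stop word list" flow can be sketched roughly as below. This is an illustration under assumptions, not the project's code: the detector is passed in as a plain callable so no particular guess-language API is assumed, and `STOPWORDS` is a tiny hand-written stand-in for the nltk lists.

```python
# Tiny hand-written stand-ins for per-language stop word lists
# (assumption: the real lists would come from nltk).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in"},
    "it": {"il", "la", "che", "non", "per", "di"},
}


def remove_stopwords(text, detect_language):
    """Detect one language for the whole text, then filter its stop words.

    detect_language is any callable returning a language code; if no
    stop word list exists for that code, the tokens pass through unchanged.
    """
    lang = detect_language(text)
    stops = STOPWORDS.get(lang, set())
    tokens = text.lower().split()
    return lang, [t for t in tokens if t not in stops]


# A fake detector, purely for illustration.
lang, tokens = remove_stopwords("the welfare of tourism", lambda text: "en")
print(lang, tokens)
```

Because the whole text gets a single language code, a mixed-language input only ever has one stop word list applied, which is the limitation this issue is about.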

@wilkohardenberg

Here is some text that hugely confuses the guess-language function:

Later, pressure increased to focus less on animal conservation and more on the welfare of urban-dwellers and tourism promotion. As from 1930 hunting permits were sold and in 1932 the journal of the Italian Alpine Club published an article proposing to transform the Gran Paradiso into a sort of huge open-air zoological garden, with all the features of an urban park. In the same years the Aostan autonomist politician Emile Chanoux lamented that until then the park had stressed too much its scientific aims, forgetting to respond to what it called its “social function”:
"Ma il Parco non deve essere fine a se stesso; deve avere oltre che una funzione scientifica, anche una funzione sociale, deve essere un richiamo per le folle per una vita sana e naturale, deve essere una sorgente di vita per le popolazioni delle montagne sui cui è costituito, deve essere anche (e perché no?) la grande riserva di caccia della Nazionale, poiché anche questo sport della caccia ha motivo di sussistere per le sue utilità sociali."

@mialondon
Contributor Author

After discussing it with my friendly local multilingual historian and thinking over Wilko's issue, I wonder if there are two parts to the problem: the first is dealing with stop words in the appropriate languages, the second is NER (entity recognition) in other languages. Does dbpedia automatically query Wikipedia content from all languages or just English? If not, can we use the current language detection to query the appropriate instances as well as applying different sets of stop words? Thoughts @moltude ?

Also thanks @wilkohardenberg for your input and earlier comments!

@moltude
Contributor

moltude commented Aug 5, 2013

I'm still thinking about this but I have a couple of thoughts so far:

  • Yes, it does look like DBpedia Spotlight supports NER in multiple languages, with a specific REST URL for each language [https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User%27s-manual]
  • If we are able to parse and identify non-English named entities, then our search of the aggregators would require separate queries for each of the languages found (dp.la and Europeana support specifying languages in the query). I'm not sure how those additional queries will impact query times (there may be more efficient ways of doing this).
  • We might consider whether to use the 'en' form of a non-'en' named entity (I think this can be gotten from the Spotlight service) and then add both the original-language form and the 'en' form to the query (or just the 'en' form). I'm not sure how this will actually play out but it seems possible. There is still going to be trouble distinguishing between non-English entities and English stopwords ('den' in Swedish means 'it' in English, and we wouldn't want every query that includes 'it' to also search for 'den').

I'm still chewing on this so any additional thoughts would be appreciated.

@mialondon
Contributor Author

Useful points, thanks! We could possibly assume that any non-English text is more pertinent and prioritise those queries - but do we actually need to run separate queries against the search APIs or do we just add non-English terms into the mix?

@briancroxall
Contributor

Perhaps in the meantime we can make it clear that Serendip-o-matic only supports English language text in the 1.0?

@mbwolff
Contributor

mbwolff commented Aug 8, 2013

Hi everyone. I sent the pull request for FR stop words and was referred to this discussion (thanks Mia!). One way to solve this problem might be to break a text up into chunks and run guess-language on each chunk, aggregating the results to build a list of search terms. Chunks could be separated by punctuation and line breaks. This should work for Wilko's text above. For single words and short phrases from one language inserted into a text written mainly in another language, it may be too much trouble to determine the different languages.

@mialondon
Contributor Author

I was thinking paragraphs, as detected by various forms of line breaks (assuming they're still slightly different between OSs), how does that sound?

@wilkohardenberg

If feasible it sounds good to me. Single words or short sentences should not be too much of a problem in most cases. I wonder however how this should work on a Zotero library: separate language guessing for each entry?

@mbwolff
Contributor

mbwolff commented Aug 8, 2013

Paragraphs are natural chunks, so that works for me.
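The paragraph-chunking approach discussed above might look something like this. It is only a sketch: `detect_language` is a placeholder callable (for example a wrapper around guess-language), and `toy_detector` is a stand-in heuristic for the demo, not a real detector.

```python
import re
from collections import defaultdict


def split_paragraphs(text):
    """Split on runs of line breaks, whatever the OS convention (CRLF, CR or LF)."""
    return [p.strip() for p in re.split(r"(?:\r\n|\r|\n)+", text) if p.strip()]


def detect_by_paragraph(text, detect_language):
    """Group paragraphs by detected language.

    detect_language is a placeholder callable; its behaviour is assumed,
    not taken from the project.
    """
    buckets = defaultdict(list)
    for paragraph in split_paragraphs(text):
        buckets[detect_language(paragraph)].append(paragraph)
    return dict(buckets)


def toy_detector(paragraph):
    # Stand-in heuristic for the demo only: not a real language detector.
    return "it" if " il " in f" {paragraph.lower()} " else "en"


sample = "Later, pressure increased.\r\n\"Ma il Parco non deve essere fine a se stesso.\""
print(detect_by_paragraph(sample, toy_detector))
```

Splitting on any run of line breaks means a quotation that starts on its own line, as in Wilko's sample, gets its own chunk.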

@mialondon
Contributor Author

Can #78 be resolved at the same time?

We'll also have XML markup in various forms if people try copying other reference library formats - @moltude and @amrys came up with a good example of that

@rlskoeser
Contributor

Working by paragraph sounds like a feasible solution, although I worry about how that will scale to larger texts (although I suppose there are probably lots of parts of the code where larger text may cause issues). I also wonder if I could adapt the guess-language code to give multiple languages back if there are multiple languages with very high scores - it looks like it might be possible from glancing at the code, but I would need to experiment some. Is there likely to be a problem with combining stop words from all the languages detected? Although that doesn't help as much for knowing which dbpedia spotlight endpoint to use, I guess.
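To make the "multiple languages with very high scores" idea concrete, here is a rough sketch. The `scores` mapping (language code to confidence) and the `top_languages` helper are assumptions for illustration; guess-language may not expose scores in this shape.

```python
def top_languages(scores, margin=0.1):
    """Return every language whose score is within margin of the best one.

    scores maps language code -> confidence in [0, 1]. This shape is an
    assumption for illustration; guess-language may expose something different.
    """
    if not scores:
        return []
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [lang for lang, score in ranked if best - score <= margin]


print(top_languages({"en": 0.48, "it": 0.45, "fr": 0.07}))  # ['en', 'it']
```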

As for the #78 - we probably need some simple input type detection first - plain text, html/xml, csv, etc - and then do some pre-processing based on the input format before generating search terms.
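A very rough sketch of what "simple input type detection" could mean in practice. The `sniff_input_type` name and its heuristics are hypothetical; real inputs such as BibTeX or TEI would need more careful checks.

```python
import re


def sniff_input_type(text):
    """Very rough guess at the input format: 'markup', 'csv' or 'text'.

    Hypothetical heuristic for illustration only.
    """
    stripped = text.strip()
    # Markup: starts with '<' and contains at least one tag-like token.
    if stripped.startswith("<") and re.search(r"</?[A-Za-z][^>]*>", stripped):
        return "markup"
    lines = [line for line in stripped.splitlines() if line.strip()]
    # CSV: several lines that all have the same non-zero number of commas.
    if len(lines) > 1:
        counts = {line.count(",") for line in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return "csv"
    return "text"


print(sniff_input_type("<TEI><text>Il parco</text></TEI>"))
print(sniff_input_type("title,year\nParco,1932\n"))
print(sniff_input_type("Later, pressure increased."))
```

The result would then select a pre-processing step (strip tags, pick columns, or pass through) before language detection runs.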

@mbwolff
Contributor

mbwolff commented Aug 12, 2013

Hi everyone. Combining stop words from different languages will create problems, e.g. "den" is an article in German and a noun in English.
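A small illustration of the collision, using abbreviated hand-written lists rather than the real nltk ones:

```python
# Abbreviated, hand-written stop word lists for illustration only.
ENGLISH = {"the", "a", "in", "of"}
GERMAN = {"der", "die", "das", "den", "in"}

tokens = "the fox slept in a den".split()

# Per-language filtering keeps the English noun "den".
english_only = [t for t in tokens if t not in ENGLISH]

# A combined list treats "den" as the German article and drops it.
combined = [t for t in tokens if t not in ENGLISH | GERMAN]

print(english_only)  # ['fox', 'slept', 'den']
print(combined)      # ['fox', 'slept']
```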

mw


@mialondon
Contributor Author

We don't need to keep the paragraph structure, just pass things into a bucket for the appropriate language then push each one to the appropriate tokenisation, stop words and entity recognition steps... Though we might want to adjust the mix of query terms according to the proportion of each language - too fussy?
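One way the proportional mix might work, sketched under assumptions: each chunk has already been assigned a language code, and a hypothetical `term_quota` helper splits a fixed budget of query terms by each language's share of chunks.

```python
from collections import Counter


def term_quota(chunk_languages, total_terms):
    """Split a budget of query terms across languages by share of chunks.

    chunk_languages is one detected language code per chunk. Rounding is
    by largest remainder so the quotas always sum to total_terms.
    """
    counts = Counter(chunk_languages)
    n = sum(counts.values())
    if n == 0:
        return {}
    exact = {lang: total_terms * c / n for lang, c in counts.items()}
    quota = {lang: int(x) for lang, x in exact.items()}
    leftover = total_terms - sum(quota.values())
    by_remainder = sorted(exact, key=lambda lang: exact[lang] - quota[lang], reverse=True)
    for lang in by_remainder[:leftover]:
        quota[lang] += 1
    return quota


print(term_quota(["en", "en", "it"], 10))  # {'en': 7, 'it': 3}
```

Weighting by chunk count is the simplest choice; weighting by word count per bucket would be a small change to what gets passed in.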

(At some future point we may want to use the languages detected to query for objects from particular cultures or in particular languages, but that'd need to be considered carefully in relation to 'serendipity' and any future 'hint' function)

@mialondon
Contributor Author

Just a note that it might be easiest to work out and document design decisions on the wiki then return here to finish integrating them https://github.com/chnm/serendipomatic/wiki/Serendipomatic-architecture

@mialondon
Contributor Author

Do we need a chat to decide on the best solution? If so, who's interested?

@mbwolff
Contributor

mbwolff commented Oct 8, 2013

I'm interested.

@moltude
Contributor

moltude commented Oct 9, 2013

I'm also ready to dig back in on this.


@rlskoeser
Contributor

I'm interested too.

@mialondon
Contributor Author

Cool! Is there an asynchronous way we can talk through the options or should we try for a chat? (I'm complicating things slightly by being in a completely different timezone).

@moltude
Contributor

moltude commented Oct 14, 2013

I can make time 9-5 M-F for a chat if that makes the timezone problem easier (Mia, are you GMT?). Other than a chat, I think the best way is to post to the GitHub issue tracker. Other ideas?

Thursday or Friday would be the best day for me this week if we wanted to set up a chat.


@mbwolff
Contributor

mbwolff commented Oct 14, 2013

This Friday afternoon (10/18), US East Coast time, would work for me. Could we videoconference?

mw


@mialondon
Contributor Author

I'm GMT+11, the other East Coast Time (I'm in Australia). I could just about do 7am here, though I'd make more sense at 8am! http://www.timeanddate.com/worldclock/meetingtime.html?iso=20131018&p1=240&p2=179

@mbwolff
Contributor

mbwolff commented Oct 15, 2013

I can meet Friday 8:00 AM Mia's time (Thursday 5:00 PM my time).

mw


@mialondon
Contributor Author

Skype? I don't have a camera on the dinosaur laptop I'm travelling with so it's voice-only for me at the best of times.

@moltude
Contributor

moltude commented Oct 16, 2013

Thursday 5:00 EST on skype would work for me.


@rlskoeser
Contributor

I'm available on Thursday at 5pm EST too. Is Skype audio conference calling free? How do we exchange Skype account names (I'd prefer not to post them publicly, obviously)? When the OWOT team did a video/audio chat last week it was kind of laggy and a bit difficult to communicate at times, which makes me wonder if a text chat might be more useful - but I guess Skype has a chat tool built in that we can use if the audio is too laggy, right? Alternatively we could try a Google+ hangout if we want to do video for those who have cameras.

@mialondon
Contributor Author

The document for collecting sample text for testing is 'Help us collect multilingual text for testing Serendip-o-matic' https://docs.google.com/document/d/100UygYyACS7tgU70FYpc4d00NTwoXaDzDmSUCu3naJE/edit#

@mialondon
Contributor Author

Here's a record of the decisions reached during our chat:

a) set up analytics to keep track of word count, languages
b) hint function still useful future functionality, add language as an option
c) start with sentence level, most common language determines which is used
d) collect multilingual test samples for testing (inc poetry, TEI, whatever)
e) check whether dbpedia is multilingual (I think the answer was yes?)
f) these changes drive need for parallelisation
g) help text on formatting text input (e.g. how to prepare BibTeX, TEI etc formatted text for inclusion)
h) html/xml/whatever detection and graceful management
i) check language options in source APIs
j) refactor so Zotero input arrives at detection process looking like any other text

Of those, a, f, g will be new issues, b adds weight to #11, h is related to #78 and c, d, e, i and j are related to the original issue.
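Decision (c) could be prototyped along these lines. This is a sketch under assumptions: `detect_language` is a placeholder callable, `toy_detector` is a demo stand-in, and the sentence splitter is a naive regex rather than a real tokenizer.

```python
import re
from collections import Counter


def dominant_language(text, detect_language):
    """Detect a language per sentence and return the most common one.

    Returns None for empty input.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    votes = Counter(detect_language(s) for s in sentences)
    return votes.most_common(1)[0][0] if votes else None


def toy_detector(sentence):
    # Stand-in heuristic for the demo only: not a real language detector.
    return "it" if "parco" in sentence.lower() else "en"


text = (
    "Hunting permits were sold. "
    "Il Parco deve essere un richiamo. "
    "The park had stressed its aims."
)
print(dominant_language(text, toy_detector))  # en
```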

@mialondon
Contributor Author

Slightly off-topic, but this article on NER might be worth a look: 'Exploring Entity Recognition and Disambiguation for Cultural Heritage Collections' http://freeyourmetadata.org/publications/named-entity-recognition.pdf


6 participants