Trouble moving from English to Spanish: UPOS loses verb tense; lemma doesn't set dictionary entry #1359

psnider · 2023-05-25T18:30:19Z

I've been using CoreNLP v4.4.0 for English with these annotators: tokenize,cleanxml,ssplit,mwt,pos,lemma,ner
and I've had pretty good results. For example, an input token of "contained" generates: {"pos": "VBD", "lemma": "contain"}.

But I haven't been able to get similar results with Spanish.
For example, an input token of "habló" (the past tense of hablar for él/ella/usted) generates: {"pos": "VERB", "lemma": "habló"}.
So it seems I've lost both the verb tense and the lemma,
and I can't figure out how to get them back.

Most online dictionaries correctly find the lemma as hablar.
For example, https://dle.rae.es/habló
(which becomes https://dle.rae.es/habl%C3%B3 )
redirects to https://dle.rae.es/hablar, as I expect.

I'm using Java 1.8.0_371 on a Mac, and my command line is:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -props spanish -annotators tokenize,cleanxml,ssplit,mwt,pos,lemma,ner -outputFormat json -file INPUT_FILE

I would appreciate any advice you can offer me.

Note that I have upgraded to 4.5.4, and still have the same results as I did for Spanish with 4.4.0.

AngledLuffa (Maintainer) · 2023-05-25T19:18:40Z

There is no Spanish lemmatizer in CoreNLP, and the Spanish POS tagger only uses coarse-grained tags, hence the issues you are running into. You could try Stanza instead, which has both POS features and lemmas in Spanish.
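
For anyone landing here later, a minimal sketch of what the Stanza route might look like in Python. It assumes `pip install stanza` and the default Spanish models; the helper names `parse_feats` and `analyze` and the sample sentence are made up for illustration and are not part of Stanza. In Universal Dependencies output the tense normally shows up in the morphological features string (for example a `Tense=Past` entry) rather than in the UPOS tag, so the sketch splits that string into a dict. The exact tags and lemmas depend on the model version.

```python
from typing import Dict, List, Optional


def parse_feats(feats: Optional[str]) -> Dict[str, str]:
    """Split a UD feature string such as 'Mood=Ind|Tense=Past' into a dict.

    Hypothetical helper for illustration; returns {} when there are no features.
    """
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def analyze(text: str) -> List[dict]:
    """Run Stanza's Spanish pipeline and collect UPOS, features, and lemma per word."""
    import stanza  # imported lazily so parse_feats works without Stanza installed

    stanza.download("es")  # fetches the default Spanish models if not already present
    nlp = stanza.Pipeline("es", processors="tokenize,mwt,pos,lemma")
    doc = nlp(text)
    return [
        {
            "text": word.text,
            "upos": word.upos,
            "feats": parse_feats(word.feats),
            "lemma": word.lemma,
        }
        for sentence in doc.sentences
        for word in sentence.words
    ]


if __name__ == "__main__":
    # Sample sentence is an assumption chosen to include the token from the question.
    for row in analyze("Ella habló con su hermano."):
        print(row)
```

Looking up `row["feats"].get("Tense")` on each verb should then give back the tense information that the coarse `VERB` tag alone does not carry.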

