Replies: 1 comment
CoreNLP has no Spanish lemmatizer, and its Spanish POS tagger only produces
coarse-grained tags (such as VERB), which is why you are running into these
issues. You could try Stanza instead, which provides both POS features and
lemmas for Spanish.
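As a rough illustration of the Stanza suggestion, here is a minimal Python sketch, assuming Stanza is installed and its default Spanish models can be downloaded. The helper names `parse_feats` and `tag_spanish` are made up for this example; the processor list simply mirrors the tokenize, mwt, pos and lemma steps from the question.

```python
from typing import Dict, List, Optional


def parse_feats(feats: Optional[str]) -> Dict[str, str]:
    """Turn a UD feature string such as 'Mood=Ind|Tense=Past' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def tag_spanish(text: str) -> List[dict]:
    """Run Stanza's Spanish pipeline and return one record per word."""
    # Imported here so parse_feats stays usable without Stanza installed.
    import stanza

    stanza.download("es")  # fetches the default Spanish models if needed
    nlp = stanza.Pipeline("es", processors="tokenize,mwt,pos,lemma")
    doc = nlp(text)
    return [
        {
            "text": word.text,
            "upos": word.upos,    # coarse tag, e.g. VERB
            "lemma": word.lemma,
            "feats": parse_feats(word.feats),  # morphological features, if any
        }
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Calling `tag_spanish("Ella habló con su hermano.")` should return one record per word. Tense and person, where the model predicts them, would show up in `feats` rather than in the coarse `upos` tag.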
On Thu, May 25, 2023 at 11:41 AM, psnider wrote:
I've been using CoreNLP v4.4.0 for English with these annotators: tokenize,cleanxml,ssplit,mwt,pos,lemma,ner
and I've had pretty good results. For example, an input token of "contained" generates: {"pos": "VBD", "lemma": "contain"}.
But I haven't been able to get similar results with Spanish.
For example, an input token of "habló" (the third-person singular preterite of hablar, used with él/ella/usted) generates: {"pos": "VERB", "lemma": "habló"}.
So it seems I've lost both the verb tense and the lemma,
and I can't figure out how to get them back.
Most online dictionaries correctly find the lemma as hablar.
For example, https://dle.rae.es/habló
(which becomes https://dle.rae.es/habl%C3%B3 )
redirects to https://dle.rae.es/hablar, as I expect.
I'm using Java 1.8.0_371 on a Mac, and my command line is:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -props spanish -annotators tokenize,cleanxml,ssplit,mwt,pos,lemma,ner -outputFormat json -file INPUT_FILE
I would appreciate any advice you can offer me.
Note that I have since upgraded to 4.5.4 and still get the same Spanish results as I did with 4.4.0.
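For anyone who wants to check the same symptom in their own output, the following Python sketch pulls the word, POS tag and lemma out of the JSON that the command above produces. It assumes the usual CoreNLP JSON layout of a top-level "sentences" list whose entries each hold a "tokens" list. The `SAMPLE` string is hand-written and trimmed to the fields quoted in this post (it is not real pipeline output), and `iter_tokens` is a name invented for this example.

```python
import json
from typing import Iterator, Tuple

# Hand-written stand-in for CoreNLP's JSON, reduced to the fields quoted above.
SAMPLE = """
{"sentences": [{"index": 0, "tokens": [
    {"index": 1, "word": "habló", "pos": "VERB", "lemma": "habló"}
]}]}
"""


def iter_tokens(corenlp_json: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (word, pos, lemma) for every token in a CoreNLP JSON document."""
    data = json.loads(corenlp_json)
    for sentence in data.get("sentences", []):
        for token in sentence.get("tokens", []):
            # pos and lemma are absent if those annotators did not run.
            yield token["word"], token.get("pos", ""), token.get("lemma", "")


for word, pos, lemma in iter_tokens(SAMPLE):
    print(f"{word}\t{pos}\t{lemma}")
```

Running the same loop over the English and Spanish output files side by side makes it easy to see where the lemma is just a copy of the surface form.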