Pronoun/determinative/possessive lemmas #128

Open
nschneid opened this issue Aug 19, 2024 · 5 comments

@nschneid
Contributor

We should standardize these and enforce in the validator. As is, e.g. "its" is sometimes lemmatized as "it".

The UD lemmatization policies have evolved and are summarized here for pronouns. Basically,

  • in the personal pronouns, accusative pronouns are mapped to nominative and independent possessives are mapped to dependent possessives
  • "whom" should be mapped to "who" and "whomever" to "whoever"
  • in demonstratives (which in CGEL are always determinatives), plurals are mapped to singular lemma
  • the article "an" is mapped to "a"

(discussion at UniversalDependencies/docs#517)

We could simply adopt the UD policies; or, because they potentially diverge from CGEL at least with regard to possessives, and as pronouns and determinatives are closed classes, we could simply omit the lemmas from the CGELBank trees, and provide a lookup table for anyone who wants them.
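A lookup table along these lines could be sketched as follows. This is only an illustration of the UD mappings listed above, not code from the CGELBank repository: the table names and the `ud_style_lemma` function are hypothetical, the tables cover only the mappings named in this issue, and the `possessive` flag is an assumed way of disambiguating "her" (accusative vs. dependent possessive).

```python
# Hypothetical lookup tables for UD-style lemmas of closed-class forms.
# Names and structure are assumptions for illustration, not part of cgel.py.

# Accusative personal pronouns map to the nominative form.
ACC_TO_NOM = {"me": "I", "him": "he", "her": "she", "us": "we", "them": "they"}

# Independent possessives map to the dependent possessive form.
INDEP_TO_DEP_POSS = {
    "mine": "my",
    "yours": "your",
    "hers": "her",
    "ours": "our",
    "theirs": "their",
}

# "whom"/"whomever", plural demonstratives, and the article "an".
OTHER = {
    "whom": "who",
    "whomever": "whoever",
    "these": "this",
    "those": "that",
    "an": "a",
}


def ud_style_lemma(form: str, possessive: bool = False) -> str:
    """Return a UD-style lemma for a pronoun or determinative surface form.

    `possessive` should be True when the form is used as a possessive,
    so that possessive "her" is not mapped to "she".
    """
    key = form.lower()
    if key == "i":
        return "I"  # the nominative first-person singular keeps its capital
    if key in INDEP_TO_DEP_POSS:
        return INDEP_TO_DEP_POSS[key]
    if not possessive and key in ACC_TO_NOM:
        return ACC_TO_NOM[key]
    if key in OTHER:
        return OTHER[key]
    return key  # forms not listed are returned lowercased and unchanged
```

Because the classes are closed, a table like this could live in the API rather than in the .cgel files, which is the second option described above.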

Also, for full nouns with a possessive ending, whether that is lemmatized to the non-possessive form should be consistent. (The possessive ending is considered a separate syntactic word in UD, but not in CGEL; in UD-derived data this is made explicit with :subt features.)

@nschneid
Contributor Author

Possible solution that would minimize manual annotation effort:

  • N_pro and D nodes (whether it is a standard form of a pronoun, or has a :correct feature indicating the standard form) never receive an :l feature in the .cgel file
  • possessive Ns (e.g. "John's", "store's") do receive an :l with the genitive ending removed (and converting plural to singular)
  • the cgel.py API provides access to several attributes:
    • the raw :l annotation (if present)
    • the raw :correct annotation (if present)
    • the :l annotation if present else :correct annotation if present else surface form
    • the udlemma, which is a string or list of strings additionally incorporating PRON/DET/possessive lemmatization and tokenization per UD guidelines
    • the cgellemma, which is a string providing a lemma per CGEL guidelines (normalizing across all pronoun cases including genitive, and removing rather than splitting off s-genitive endings)
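The fallback described in the third attribute above (":l if present else :correct if present else surface form") might look roughly like this. The `Node` class here is a hypothetical stand-in, not the actual class in cgel.py, and the attribute name `base_lemma` is an assumption; the udlemma and cgellemma attributes are left out because they depend on the UD and CGEL policies still under discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a CGELBank tree node (not the real cgel.py class)."""

    form: str                      # surface form as it appears in the text
    l: Optional[str] = None        # raw :l annotation, if present
    correct: Optional[str] = None  # raw :correct annotation, if present

    @property
    def base_lemma(self) -> str:
        """Return :l if present, else :correct if present, else the surface form."""
        if self.l is not None:
            return self.l
        if self.correct is not None:
            return self.correct
        return self.form
```

For example, `Node("John's", l="John").base_lemma` gives "John", while a node with no :l and no :correct simply returns its surface form.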

@BrettRey
Collaborator

I guess I'm not following. You write, "We should standardize these and enforce in the validator. As is, e.g. "its" is sometimes lemmatized as "it"." That seems fine. Is the issue that it is only sometimes lemmatized as "it"? Or is there some reason it shouldn't ever be lemmatized?

@nschneid
Contributor Author

A lemma is only sometimes provided explicitly for "its"—the annotations are inconsistent across files.

We have to decide: (1) For pronouns and determinatives, which are a closed set, do we want to ask annotators to specify the lemma explicitly in the .cgel file, or compute it automatically as part of the API? (2) If their lemmas are specified explicitly, do we want to be compatible with UD lemmas?

@BrettRey
Collaborator

OK, I get it.

  1. I see no need for annotators to specify the lemma, but it would be good if they were computed automatically.
  2. I don't have a strong opinion on UD compatibility.

@nschneid
Contributor Author

nschneid commented Sep 8, 2024

A reminder to myself that we DO want hand-specified lemmas not just for nouns and verbs, but also adjectives/adverbs inflected for grade (comparative/superlative).

Coordinators, subordinators, and prepositions are not normally expected to inflect or have lemmas, though a lemma is conceivable in cases of spelling variation ("&" / "and", "@" / "at", etc.). Alternatively, the non-abbreviated form could be indicated as the :correct form.
