Use of language labels and synonyms among the food-related ontologies #25

ddooley · 2022-05-12T05:36:22Z

Bernd Krieg-Brückner has examined the challenge of semantic web related language tagging. Feedback is appreciated on his research!

He says:

"I have analyzed the problem of translation into other languages (and regional languages/dialects) further, propose a pragmatic solution below, and will investigate further with my collaborators here (notably Michaela), how to interrelate with (and possibly pick up translation data from) WikiData.

The ISO Standardization situation

Language Codes are standardized under

- ISO 639-1: two letter codes, including en and la,
  see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
- ISO 639-3: three letter codes, intended for use as metadata codes, purported to contain all languages,
  standardized under SIL (https://iso639-3.sil.org/code_tables/download_tables)
  see https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes
- Unfortunately, another project to standardize language hierarchy, ISO 639-5, is somewhat unfinished and pragmatically of no use. ISO 639-2 is now basically superseded by ISO 639-3.
- The language code tables map many to one: ISO 639-3 => ISO 639-1, where the former may contain a code for the latter, e.g. lat => la, deu => de
- ISO 639-3 does include many dialects (cf. https://en.wikipedia.org/wiki/German_language), notably
  - "Swiss German", code gsw (with Alsatian)
    [Alemannic German ... any of the Alemannic dialects spoken in the German-speaking part of Switzerland],
    not to be confused with
    "Swiss Standard German", or Swiss High German, the written form of High German, one of four official languages in Switzerland (with deviations, mainly in vocabulary, from High German),
    which has NO code [but cf. below the "IETF BCP 47 language tag" de-CH]
  - also Bavarian, [Bairisch, Bairisch-Österreichisch], code 'bar', as a single language spoken in Bavaria and Austria
    [cf. Note below]

For Country Codes and geographic regions there is ISO 3166-1 [https://en.wikipedia.org/wiki/ISO_3166-1].
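
The many-to-one relation between the two code sets (lat => la, deu => de) could be sketched roughly as below. This is only an illustration: the table `ISO_639_3_TO_1` is a hand-written excerpt and the function name `shortest_language_code` is made up here; a real table would presumably be built from the SIL download linked above.

```python
# Hand-written excerpt for illustration only (assumption: a full table
# would be loaded from the SIL ISO 639-3 code tables instead).
ISO_639_3_TO_1 = {
    "lat": "la",
    "deu": "de",
    "eng": "en",
}


def shortest_language_code(code: str) -> str:
    """Return the ISO 639-1 code if one exists, else the code unchanged."""
    code = code.strip().lower()
    return ISO_639_3_TO_1.get(code, code)


print(shortest_language_code("deu"))  # de
print(shortest_language_code("gsw"))  # gsw (no two-letter equivalent)
```

Codes such as gsw and bar pass through unchanged because they have no ISO 639-1 counterpart.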

Standardization on the Internet: IETF language tags

The above situation is a little confusing at first, but becomes clearer when considering the de facto standardization on the Internet. HTML/XML, W3C and notably Wikipedia/WikiData support the IETF BCP 47 language tag: "a standardized code or tag that is used to identify human languages in the Internet", cf.

https://en.wikipedia.org/wiki/IETF_language_tag
with the excellent introduction
https://www.w3.org/International/articles/language-tags/index.en
and the very useful subtag search tool
https://r12a.github.io/app-subtags/
For a list of subtags see
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Although tags may be long with a defined syntax, they may be abbreviated; the recommendation is to keep them as short as possible. The different kinds of subtags can be distinguished by their length (number of characters).

Terminology and examples:

"de-CH" is a language tag. The de and CH parts are referred to as subtags.
"de-CH" means "German as used in Switzerland" i.e. (official) Swiss Standard German
"de" is an IETF primary language subtag, derived from ISO 639-1
"CH" is a Country Code subtag, derived from ISO 3166-1
"bar" is a language subtag from ISO 639-3
[the Austro-Bavarian language family; cf. also Note below]
"zh-yue" is an extlang-subtag (Cantonese Chinese):
the primary language prefix "zh" (from ISO 639-1) precedes the extended language subtag "yue" (from ISO 639-3)
[a tag with an extlang subtag may always be abbreviated to that language subtag alone, here "yue";
this is the IETF recommendation, but see below]
"zh-Hans" (Simplified Chinese) includes the script-subtag "Hans"
[a script subtag always has 4 characters, and is to be omitted if at all possible,
e.g. "es" instead of "es-Latn" since Spanish is always written in Latin script]
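
To make the "distinguish subtags by length" idea concrete, here is a minimal sketch that splits a tag into its parts. It is deliberately simplified and is an assumption-laden illustration, not a BCP 47 validator: it ignores private-use and extension subtags, handles at most one extlang, and does not check anything against the IANA registry. The function name `split_language_tag` is made up for this example.

```python
from typing import Dict, List, Optional, Union

TagParts = Dict[str, Union[Optional[str], List[str]]]


def split_language_tag(tag: str) -> TagParts:
    """Split a language tag into subtags using only length and character class."""
    subtags = tag.split("-")
    parts: TagParts = {
        "language": subtags[0].lower(),
        "extlang": None,
        "script": None,
        "region": None,
        "variants": [],
    }
    rest = subtags[1:]
    # A 3-letter subtag directly after the primary language is read as an extlang.
    if rest and len(rest[0]) == 3 and rest[0].isalpha():
        parts["extlang"] = rest.pop(0).lower()
    # A script subtag always has 4 letters (e.g. Hans, Latn).
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        parts["script"] = rest.pop(0).title()
    # A region subtag is 2 letters (ISO 3166-1) or 3 digits.
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        parts["region"] = rest.pop(0).upper()
    # Whatever is left is treated as variant subtags in this sketch.
    parts["variants"] = [s.lower() for s in rest]
    return parts


print(split_language_tag("de-CH")["region"])    # CH
print(split_language_tag("zh-yue")["extlang"])  # yue
print(split_language_tag("zh-Hans")["script"])  # Hans
```

Note that this length-based reading would also treat "bar" in "de-bar" as an extlang, which is the structure proposed in this thread rather than something the sketch verifies against the registry.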

Apparently, Wikipedia/WikiData uses IETF language tags [the only deviation I found is the tag "simple" meaning "en-simple", while "simple" seems to be documented in IETF to be applicable to any primary language prefix].
Protégé uses/recommends an early version (the present documentation is hopelessly outdated).

BKB's recommendations for language annotations in FoodOn

use label/synonym annotations etc. with IETF language tags; examples:
- en, en-US, en-CA, de-CH
  [do not use an ISO 639-3 code if an equivalent ISO 639-1 code exists; e.g.
  use la instead of lat, de instead of deu]
abbreviate IETF language tags (as in e.g. WikiData)
!! except for regional languages (with ISO 639-3 code) that are sublanguages of a primary language,
which should be kept as prefix (ISO 639-1 code); examples:
- de-bar [NOT bar]
  [Rationale: the structuring of the prefix "de" as a quasi-macrolanguage is maintained;
  it is easily stripped off, but more complex to reconstruct;
  Contra: Wiki(Data) uses the abbreviation only]
use Country Code subtags (ISO 3166-1) possibly plus (regional) Language Code subtags (ISO 639-3)
for "official" written language vs. regional dialect terms; examples:
- currant[en]: Johannisbeere[de], Ribisel[de-AT]
- potato pancake[en]: Kartoffelpuffer[de], Reibekuchen[de], Platzki[de-AT], Reiberdatschi[de-bar]
  [Rationale: de-AT indicates that the term is only used in Austria, not Bavaria etc.
  de-bar indicates that the term is used throughout the Austro-Bavarian language;
  this may not always be true, but there is no way to restrict to Bavaria only]
  [cf. Note below; similarly de-CH, if the term exists there, otherwise de-gsw (or both)]
keep label/synonym annotations for the same primary language together in a separate file; examples:
- de, de-CH, de-AT, de-bar, …
  [Rationale: regional languages or dialects are then directly accessible. In Protégé a View/CustomRendering
  set to "de-AT, de-bar, de" will select the appropriate label, if present, in that order]
  [it makes sense to keep a regional hierarchy of files, e.g. to keep a folder of all (regional) languages in India]
different spellings should probably be synonyms, not labels, e.g.
- Platzka[de-AT] for Platzki[de-AT]
  [the synonym issue is separate from the regional language issue]
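
The ordered lookup described for the Protégé rendering setting ("de-AT, de-bar, de") can be sketched as follows. This is not Protégé's implementation, just an illustration of the fallback idea; the function name `pick_label` and the dictionary layout are assumptions, and the example terms are the ones from the currant example above.

```python
from typing import Dict, Optional, Sequence


def pick_label(labels: Dict[str, str], preference: Sequence[str]) -> Optional[str]:
    """Return the first label whose language tag appears in the preference order."""
    # Language tags are case-insensitive, so compare them in lower case.
    by_tag = {tag.lower(): text for tag, text in labels.items()}
    for tag in preference:
        text = by_tag.get(tag.lower())
        if text is not None:
            return text
    return None


currant = {"en": "currant", "de": "Johannisbeere", "de-AT": "Ribisel"}

print(pick_label(currant, ["de-AT", "de-bar", "de"]))  # Ribisel
print(pick_label(currant, ["de-CH", "de"]))            # Johannisbeere
```

Keeping the "de" prefix on regional tags means a consumer can always end its preference list with plain "de" and still get a usable label.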

Note (mainly, but not only, for Germans)

Wikipedia confuses the written and spoken forms of Bavarian (Bairisch/Boarisch). There are special pages for Bavarian (and WikiData has special entries under https://bar.wikipedia.org/), where the language is "written" in a (to my knowledge) non-standard transliteration of spoken "Boarisch". It includes a relevant page on "kitchen vocabulary" - https://bar.wikipedia.org/wiki/Austro-Boarischa_Kuchlwoatschotz - where the transliterated/spoken language "Boarisch" is contrasted against a "Schriftsproch" meaning a written form of "Bairisch" intelligible by Germans (whereas "Boarisch" is quite unintelligible unless one knows it from years of experience). Terms used only in Austria are marked "Ö" and those only in Bavaria are marked "B". [Actually, that article is written in "Ostmiddlboarisch" or "Weanarisch" [Ostmittelbairisch], a dialect spoken in Vienna; whether this is appropriate for all "Boarisch" is another matter.] My recommendations above refer to "Bairisch" (with code de-bar [or bar, resp.])."

ddooley · 2022-05-12T14:05:27Z

Note that OBOFoundry now officially recommends that labels etc. have language tags rather than the "string" datatype (OBOFoundry/OBOFoundry.github.io#479). I'm checking whether Protégé or other processing tools impose any further restrictions on language tag content. I recall there might be a curve ball about what some software typically accepts.

ddooley · 2022-05-17T13:48:47Z

One Protégé update relevant to this is protegeproject/protege#784. It requires downloading Protégé 5.6.0; however, Stanford's default download for macOS is still 5.5.

oldskeptic · 2022-05-18T13:01:09Z

Most of this is regulated by the RDFS W3C Recommendation, which points to BCP 47, which references RFC 5646, which "recommends" ISO 639-1 (language codes), ISO 3166-1 (region / dialect codes) and ISO 15924 (script codes).

At issue is that the RFCs are recommendations and people tend to implement them "their way". I note the following in RFC 5646: "de-CH-1996" represents German as used in Switzerland and as written using the spelling reform beginning in the year 1996 C.E. That is great for hardcore language nerds, but it has no tool support that I know of. The same goes for ISO 8601 time periods.

As of this ticket, librdf/raptor correctly parses ISO3166-1 regions but users of Virtuoso may encounter bumps in some circumstances (openlink/virtuoso-opensource#710). I've seen some toolchains die on anything but two letter language codes.

With respect to OBOFoundry/OBOFoundry.github.io#479, the RDFS W3C Recommendation states that literals carrying an xml:lang tag are implicitly typed as langString. Most parsers are very loose about this. A blank xml:lang remains useful for cases where a term is language (but not script) agnostic, such as some personal names.

I agree with Bernd that there is a difference between a translation and a synonym. I would also add context to the mix: most crops / varietals / cultivars have common names, trade (commercial) names and scientific names, and occasionally the registration / patent number is used. Common names are locale-specific, trade names and registration numbers are jurisdiction-specific, and scientific names could be Latin / Greek if someone has gotten around to it.

I personally prefer a single rdfs:label or skos:prefLabel per term per language tag as it allows simple SPARQL queries without unintended projection snafus. Putting a single localized label on the screen for the user to read is the 99% use case and should be as easy as possible.
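
One way to enforce the "single label per term per language tag" convention is a small check over exported labels, sketched below. This is only an assumption about how one might test it outside a triple store: the tuple layout, the function name `duplicate_labels`, and the term identifier `EX:0001` are all hypothetical, while the label strings come from the potato pancake example earlier in the thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (term identifier, language tag, label text)
Label = Tuple[str, str, str]


def duplicate_labels(labels: Iterable[Label]) -> Dict[Tuple[str, str], List[str]]:
    """Return every (term, language tag) pair that carries more than one label."""
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for term, tag, text in labels:
        # Language tags are case-insensitive, so normalise before grouping.
        grouped[(term, tag.lower())].append(text)
    return {key: texts for key, texts in grouped.items() if len(texts) > 1}


labels = [
    ("EX:0001", "en", "potato pancake"),    # EX:0001 is a made-up identifier
    ("EX:0001", "de", "Kartoffelpuffer"),
    ("EX:0001", "de", "Reibekuchen"),       # second "de" label: should be a synonym
    ("EX:0001", "de-bar", "Reiberdatschi"),
]

print(duplicate_labels(labels))
# {('EX:0001', 'de'): ['Kartoffelpuffer', 'Reibekuchen']}
```

Any pair reported here is a candidate for demoting one of the strings to a synonym annotation.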

SKOS treats synonyms as skos:altLabels, which is a great way of handling the above constraints. My concern is that in most cases a synonym belongs to a specific context, which is impossible to record with a plain Literal.

SKOS-XL is attractive because skos-xl:Label makes it possible to assign provenance and / or context to the term. Unfortunately, a few disjointness axioms and restrictions on it make it hard to use for recording a full multilingual vocabulary. FIBO solves this type of problem with the use of tags and a node within a scheme for the identifier. Again, your mileage may vary.

As a general rule when handling nomenclature: The thing and the name of the thing are two different things.

ddooley · 2022-05-19T17:55:45Z

Alan Ruttenberg has just commented about https://www.w3.org/International/questions/qa-choosing-language-tags , mentioning it recommends RFC 5646 which you link to above, Rob!

oldskeptic · 2022-05-19T18:03:36Z

I'd also like to recommend http://www.lexvo.org/ which provides a full graph of languages and scripts along with labels for each language in... every other language. This makes it incredibly useful for multilingual UI work.

ddooley mentioned this issue May 12, 2022

Should we recommend specifying language tag? OBOFoundry/OBOFoundry.github.io#479

Open

ddooley assigned ddooley and unassigned ddooley May 17, 2022
