Use of language labels and synonyms among the food-related ontologies #25
Note that OBOFoundry now officially recommends that labels etc. have language tags rather than the "string" datatype (OBOFoundry/OBOFoundry.github.io#479). I'm checking whether there are any other restrictions on language tag content with respect to Protégé or processing tools. I recall there might be a curveball about what is typically accepted by some software.
One Protégé update relevant to this is protegeproject/protege#784, which requires downloading Protégé 5.6.0. However, Stanford's default download for macOS is still 5.5.
Most of this is regulated by the rdfs W3C Recommendation, which points to BCP 47, which references RFC 5646, which "recommends" ISO639-1 (Lang Code), ISO3166-1 (Region / Dialect Codes) and ISO15924 (Script Codes). At issue is that the RFCs are recommendations and people tend to implement them "their way". I note the following in RFC5645:
As of this ticket, librdf/raptor correctly parses ISO3166-1 regions but users of Virtuoso may encounter bumps in some circumstances (openlink/virtuoso-opensource#710). I've seen some toolchains die on anything but two-letter language codes.
With respect to OBOFoundry/OBOFoundry.github.io#479, rdfs W3C Recommendation states that language-typed literals with
I agree with Bernd that there is a difference between a translation and a synonym. I would also add context to the mix: most crops / varietals / cultivars have common names, trade (commercial) names, scientific names and occasionally the registration / patent number is used. Common names are locale specific, trade and registration numbers jurisdiction specific, and scientific names could be Latin / Greek if someone has gotten around to it.
I personally prefer a single
SKOS sees synonyms as
The use of SKOS-XL is attractive through the use of
As a general rule when handling nomenclature: The thing and the name of the thing are two different things.
Alan Ruttenberg has just commented about https://www.w3.org/International/questions/qa-choosing-language-tags , mentioning it recommends RFC 5646 which you link to above, Rob!
I'd also like to recommend http://www.lexvo.org/ which provides a full graph of languages and scripts along with labels for each language in... every other language. This makes it incredibly useful for multilingual UI work.
Bernd Krieg-Brückner has examined the challenge of semantic web related language tagging. Feedback is appreciated on his research!
He says:
"I have analyzed the problem of translation into other languages (and regional languages/dialects) further, propose a pragmatic solution below, and will investigate further with my collaborators here (notably Michaela), how to interrelate with (and possibly pick up translation data from) WikiData.
The ISO Standardization situation
Language Codes are standardized under
ISO 639-1, see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
ISO 639-3, standardized under SIL (https://iso639-3.sil.org/code_tables/download_tables), see https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes
Unfortunately, another project to standardize language hierarchy, ISO 639-5, is somewhat unfinished and pragmatically of no use. ISO 639-2 is now basically superseded by ISO 639-3.
The language code tables map many to one: ISO 639-3 => ISO 639-1, where the former may contain a code for the latter, e.g. lat => la, deu => de
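The many-to-one mapping can be illustrated with a small lookup. This is only a sketch: the table `ISO639_3_TO_1` and the function `shorten_language_code` are hypothetical names, and the table holds just the two pairs quoted above. A real table would presumably be built from the SIL download.

```python
# Illustrative subset only: these two pairs are the examples given in the text.
# A complete table would have to come from the ISO 639-3 code tables.
ISO639_3_TO_1 = {"lat": "la", "deu": "de"}


def shorten_language_code(code: str) -> str:
    """Return the ISO 639-1 code if the table knows one, else the code unchanged."""
    code = code.lower()
    return ISO639_3_TO_1.get(code, code)


if __name__ == "__main__":
    for code in ("deu", "lat", "bar"):
        print(code, "=>", shorten_language_code(code))
```

Codes without a two-letter equivalent in the table (such as `bar`) pass through unchanged, which matches the idea that the three-letter code is only replaced when a shorter one exists.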
ISO 639-3 does include many dialects (cf. https://en.wikipedia.org/wiki/German_language), notably
[Alemannic German ... any of the Alemannic dialects spoken in the German-speaking part of Switzerland],
not to be confused with
which has NO code [but cf. below the "IETF BCP 47 language tag" de-CH]
also
[cf. Note below]
For Country Codes and geographic regions there is ISO 3166-1 [https://en.wikipedia.org/wiki/ISO_3166-1].
Standardization on the Internet: IETF language tags
The above situation is a little confusing at first, but becomes more realistic when considering the de facto standardization on the Internet. HTML/XML, W3C and notably Wikipedia/WikiData support the IETF BCP 47 language tag: "a standardized code or tag that is used to identify human languages in the Internet", cf.
with the excellent introduction
and the very useful subtag search tool
For a list of subtags see
Although tags may be long with a defined syntax, they may be abbreviated; the recommendation is to keep them as short as possible. The different kinds of subtags can be distinguished by their length (number of characters).
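To make the "distinguished by length" remark concrete, here is a rough Python sketch that labels each subtag from its position and length. The function name `classify_subtags` is made up for illustration, and the rules are a simplification: it does not check the order of subtags, does not consult the subtag registry, and ignores extensions, private-use and grandfathered tags.

```python
def classify_subtags(tag: str) -> list[tuple[str, str]]:
    """Label each subtag of a language tag by position and length (simplified)."""
    subtags = tag.split("-")
    # The first subtag is always treated as the primary language.
    result = [(subtags[0], "language")]
    for sub in subtags[1:]:
        if len(sub) == 3 and sub.isalpha():
            kind = "extlang"   # three letters after the primary language
        elif len(sub) == 4 and sub.isalpha():
            kind = "script"    # four letters, e.g. Latn
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            kind = "region"    # two letters or three digits
        elif 5 <= len(sub) <= 8 or (len(sub) == 4 and sub[0].isdigit()):
            kind = "variant"
        else:
            kind = "unknown"
        result.append((sub, kind))
    return result


if __name__ == "__main__":
    for tag in ("de-CH", "es-Latn", "de-bar"):
        print(tag, classify_subtags(tag))
```

The check is purely syntactic, so a tag being classified here says nothing about whether its subtags are actually registered.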
Terminology and examples:
"de-CH" means "German as used in Switzerland" i.e. (official) Swiss Standard German
[the Austro-Bavarian language family; cf. also Note below]
a primary language prefix from ISO 639-1 precedes a language subtag from ISO 639-3
[an extlang-subtag may always be abbreviated by its language subtag;
this is the IETF recommendation, but see below]
[a script subtag always has 4 characters, and is to be omitted if at all possible,
e.g. "es" instead of "es-Latn" since Spanish is always encoded with latin characters]
Apparently, Wikipedia/WikiData uses IETF language tags [the only deviation I found is the tag "simple" meaning "en-simple", while "simple" seems to be documented in IETF to be applicable to any primary language prefix].
Protégé uses/recommends an early version (the present documentation is hopelessly outdated).
BKB's recommendations for language annotations in FoodOn
use label/synonym annotations etc. with IETF language tags; examples:
[do not use an ISO 639-3 code if an equivalent ISO 639-1 code exists; e.g.
use la instead of lat, de instead of deu]
abbreviate IETF language tags (as in e.g. WikiData)
!! except for regional languages (with ISO 639-3 code) that are sublanguages of a primary language,
which should be kept as prefix (ISO 639-1 code); examples:
[Rationale: the structuring of the prefix "de" as a quasi-macrolanguage is maintained;
it is easily stripped off, but more complex to reconstruct;
Contra: Wiki(Data) uses the abbreviation only]
use Country Code subtags (ISO 3166-1) possibly plus (regional) Language Code subtags (ISO 639-3)
for "official" written language vs. regional dialect terms; examples:
[Rationale: de-AT indicates that the term is only used in Austria, not Bavaria etc.
de-bar indicates that the term is used throughout the Austro-Bavarian language;
this may not always be true, but there is no way to restrict to Bavaria only]
[cf. Note below; similarly de-CH, if the term exists there, otherwise de-gsw (or both)]
keep label/synonym annotations for the same primary language together in a separate file; examples:
[Rationale: regional languages or dialects are then directly accessible. In Protégé a View/CustomRendering
set to "de-AT, de-bar, de" will select the appropriate label, if present, in that order]
[it makes sense to keep a regional hierarchy of files, e.g. to keep a folder of all (regional) languages in India]
different spellings should probably be synonyms, not labels, e.g.
[the synonym issue is separate from the regional language issue]
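The ordered selection described above ("de-AT, de-bar, de") can be sketched in Python as follows. This only mimics the described behaviour and is not how Protégé implements its rendering; `pick_label` is a hypothetical helper, and the sample labels (the Austrian "Paradeiser" vs. general German "Tomate") are illustrative data, not taken from FoodOn.

```python
from typing import Optional


def pick_label(labels: dict[str, str], preference: list[str]) -> Optional[str]:
    """Return the label for the first tag in `preference` that has one.

    Tags are compared case-insensitively; None is returned if nothing matches.
    """
    by_tag = {tag.lower(): value for tag, value in labels.items()}
    for tag in preference:
        value = by_tag.get(tag.lower())
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    # Illustrative labels only (assumption, not FoodOn content).
    labels = {"de": "Tomate", "de-AT": "Paradeiser", "en": "tomato"}
    print(pick_label(labels, ["de-AT", "de-bar", "de"]))
```

Keeping the "de" prefix on regional tags means the preference list reads as a chain from most specific to most general, which is the rationale given for not abbreviating them.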
Note (mainly, but not only, for Germans)
There is a confusion between the written and spoken language Bavarian (Bairisch/Boarisch) in Wikipedia. There are special pages for Bavarian (and WikiData has special entries under https://bar.wikipedia.org/), where the language is "written" in a (to my knowledge) non-standard transliteration of spoken "Boarisch". It includes a relevant page on "kitchen vocabulary" - https://bar.wikipedia.org/wiki/Austro-Boarischa_Kuchlwoatschotz - where the transliterated/spoken language "Boarisch" is contrasted against a "Schriftsproch" meaning a written form of "Bairisch" intelligible by Germans (whereas "Boarisch" is quite unintelligible unless one knows it from years of experience). Terms used only in Austria are marked "Ö" and those only in Bavaria are marked "B". [Actually, that article is written in "Ostmiddlboarisch" or "Weanarisch" [Ostmittelbairisch], a dialect spoken in Vienna; whether this is appropriate for all "Boarisch" is another matter.] My recommendations above refer to "Bairisch" (with code de-bar [or bar, resp.]).