
Specify language tag fallback support #17

Merged (1 commit into main, Nov 27, 2024)

Conversation

@domenic (Collaborator) commented Nov 25, 2024

This relies on the infrastructure from ECMA-402 to give sensible answers about language support even in the presence of many subtags.

@aphillips, this is what I came up with after consulting with @sffc.

It has a couple of implementation-defined parts, namely the use of LookupMatchingLocaleByBestFit, and a similar operation when deciding how to allocate "base" languages between more-specific variants. (See the example given for Chinese.) Maybe the latter could be rephrased to use LookupMatchingLocaleByBestFit, to reduce this? Thoughts welcome.

My understanding is that this implementation-definedness is largely a function of everyone relying on ICU, which is not specified but which we've all kind of agreed to be fine with.

This doesn't fully solve the "language arcs" problem discussed in webmachinelearning/translation-api#11 in the context of translation. (And, I wouldn't want to close that issue until we have a full spec for translation anyway.) It's only for the summarizer API so far, which has the simpler question "is this single language supported?" The path to language arcs shouldn't be so hard from here, though.

The end result seems to be pretty reasonable. In particular, it should match ECMA-402 APIs. Since ECMA-402 allows me to do things like new Intl.Collator(["en-US-Braille-x-pirate"]) and get a resolved locale of en-US, or "ja-Bopo-BR" and get a resolved locale of ja, the proposal is that our AI APIs will do the same.
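(An illustrative aside added here, not part of the PR description.) A minimal sketch of that ECMA-402 fallback behavior, assuming an engine whose Intl locale data includes en-US and ja; the resolved locale depends on the engine's available locale data, so the outputs shown are typical rather than guaranteed:

```js
// Requesting an over-specified tag falls back to the best supported locale.
const englishCollator = new Intl.Collator(["en-US-Braille-x-pirate"]);
console.log(englishCollator.resolvedOptions().locale); // e.g. "en-US"

// Unsupported script/region combinations fall back in the same way.
const japaneseCollator = new Intl.Collator(["ja-Bopo-BR"]);
console.log(japaneseCollator.resolvedOptions().locale); // e.g. "ja"
```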


Preview | Diff

This relies on the infrastructure from ECMA-402 to give sensible answers about language support even in the presence of language subtags.
@domenic (Collaborator, Author) commented Nov 27, 2024

I'm going to merge this for now as I am doing some other spec restructuring and I want to put it on top of this. Regardless, any review or help is appreciated, even after merging.

@domenic domenic merged commit da6e057 into main Nov 27, 2024
2 checks passed
@domenic domenic deleted the language-tags branch November 27, 2024 06:49
@aphillips

This seems fine.

My understanding is that this implementation-definedness is largely a function of everyone relying on ICU, which is not specified but which we've all kind of agreed to be fine with.

Yes, although ICU/CLDR is not necessarily everywhere.

This doesn't fully solve the "language arcs" problem discussed in webmachinelearning/translation-api#11 in the context of translation.

It is different, although it has some similarities. This is a 1:1 matching problem (that is, a resource lookup problem), while language arcs have two sides to match (source and target). In this case, one has some text in a language and one wishes to use the best summarizer for it. Most language tag matching schemes match long tags to shorter ones (e.g. zh-Hant-MO-u-ca-islamic-hc-12 to zh-Hant), with some wiggle room for script subtags and the like.

However, you can also have shorter-to-longer matching, e.g. if your document is labeled fr and you have fr-FR and fr-CA summarizers and need to pick one. CLDR (and thus ICU) defines an addLikelySubtags mechanism (this also helps with zh-TW => zh-Hant-TW) which you might want to reference.
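(An aside added for context, not part of the original comment.) The add-likely-subtags data aphillips mentions is already reachable from JavaScript via Intl.Locale, so a rough illustration of the zh-TW case looks like this:

```js
// maximize() applies CLDR's Add Likely Subtags; minimize() is its inverse.
new Intl.Locale("zh-TW").maximize().toString();      // "zh-Hant-TW"
new Intl.Locale("fr").maximize().toString();         // "fr-Latn-FR"
new Intl.Locale("zh-Hant-TW").minimize().toString(); // "zh-TW"
```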

@domenic (Collaborator, Author) commented Nov 29, 2024

Thanks for the review!

However, you can also have shorter-to-longer matching, e.g. if your document is labeled fr and you have fr-FR and fr-CA summarizers and need to pick one.

The way this is handled in the current PR is via the "language tag set completeness rules", which state that if you have fr-FR and fr-CA, you must also have a fr summarizer. We can assume implementations will meet this requirement by choosing one of the two existing ones to represent fr.

It gets trickier when you ask how we can ensure they pick the "correct" one of the two existing ones. (Which is probably fr-FR, right?) For that I have the following text, which I'm not 100% happy with; suggestions welcome:

Append languageTag to either readilyAvailableLanguages or afterDownloadAvailableLanguages. Which of the two sets to append to is implementation-defined, and should be guided by considerations similar to that of LookupMatchingLocaleByBestFit in terms of keeping "best fallback languages" together.
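(A hypothetical sketch added for illustration; the function name and the single-set simplification are mine, not the spec's. In the actual PR text the base language goes into one of two sets, readily available or available after download, and that placement is implementation-defined, as quoted above.)

```js
// Sketch of what the "language tag set completeness rules" imply:
// every supported locale's base language must itself appear in the set.
function withBaseLanguages(supportedLocales) {
  const result = new Set(supportedLocales);
  for (const tag of supportedLocales) {
    // Intl.Locale#language extracts the base language subtag, e.g. "fr".
    result.add(new Intl.Locale(tag).language);
  }
  return result;
}

withBaseLanguages(["fr-FR", "fr-CA"]); // Set { "fr-FR", "fr-CA", "fr" }
```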


CLDR (and thus ICU) defines an addLikelySubtags mechanism (this also helps with zh-TW => zh-Hant-TW) which you might want to reference.

When I asked @sffc about this, he said

The BestFit matcher will inherit zh-TW from zh-Hant because that is a parent locale.

and so explicitly calling addLikelySubtags was not necessary. Do you think that's right?


1. Let |languageTag| be that language, represented as a BCP 47 language tag string. <span class="issue">Describe how to handle subtags.</span>
<div class="example" id="example-subtags-chinese">
A common setup seen in today's software is to support two types of written Chinese: "traditional Chinese" and "simplified Chinese". Let's suppose that the user agent supports summarizing text written in traditional Chinese readily, and simplified Chinese after a download.
A reviewer commented on this part of the diff:
Observation: The idea of "downloadable locales" is something I've proposed multiple times in different forms in ECMA-402, but it so far hasn't landed because of the impact it has on fingerprinting/privacy.


1. Set |availableLanguages|[|languageTag|] to "{{AICapabilityAvailability/readily}}".
One way this could be implemented would be for [=current summarizer language availabilities=] to return that « "`zh-Hant`" » is readily available, and « "`zh`", "`zh-Hans`" » is available after download. This return value conforms to the requirements of the [=language tag set completeness rules=], in ensuring that "`zh`" is present. Per <a class="allow-2119" href="#readily-or-after-download-implementation-defined">the "should"-level guidance</a>, the implementation has determined that "`zh`" belongs in the list of after-download available languages, with "`zh-Hans`", instead of in the list of readily available languages, with "`zh-Hant`".
Another review comment on this part of the diff:

I see why you did this, but it seems like it shouldn't be required for zh-Hans to be supported just because zh-Hant is supported. I filed tc39/ecma402#947

@@ -413,27 +441,66 @@ Every {{AISummarizerCapabilities}} has an <dfn for="AISummarizerCapabilities">av
</div>

<div algorithm>
The <dfn>current summarizer language availability map</dfn> is given by the following steps. They return a [=map=] from strings representing BCP 47 language tags to {{AICapabilityAvailability}} values, or null. [[!RFC5646]]
The <dfn>current summarizer language availabilities</dfn> are given by the following steps. They return a [=list=] containing two [=list/items=]; the items each are [=sets=] of strings representing [=Unicode canonicalized locale identifier=], or null. [[!ECMA-402]]
A reviewer commented on this change:
It's not clear to me whether this function is directly callable from client code, but in ECMA-402 we don't ever return a full list of available locales; instead, you give us a list and we filter the list. This solves a variety of issues including automatically handling fallback. See FilterLocales
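(An illustration added here, not part of the review comment.) The "give us a list and we filter it" pattern the reviewer describes is what Intl's supportedLocalesOf methods expose; the result depends on the engine's locale data, so the output shown is only typical:

```js
// Filter a requested list down to the locales this engine's Collator supports.
Intl.Collator.supportedLocalesOf(["zh-TW", "tlh", "de-CH"], {
  localeMatcher: "best fit",
});
// e.g. ["zh-TW", "de-CH"]; "tlh" (Klingon) is usually not supported
```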

@domenic (Collaborator, Author) replied:

Thanks for checking. Yeah, this is not directly callable. There is a languageAvailable(languageTag) method, which we answer by matching its argument against this list using LookupMatchingLocaleByBestFit.

(The design is changing slightly; see #22. But the principle of only exposing testing APIs remains.)

There are some speculative use cases for exposing a list of locales, which basically become "build me Google Translate using the browser's functionality". There you want a list of all supported translation source/target pairs. But we're resistant to exposing that for fingerprinting reasons, so it's currently not in any explainers. I'll be sure to circle back if we do end up with a strong need to expose that.
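(A hypothetical usage sketch added for illustration. The entry point and method names below follow the explainer shape referenced in this thread, which was still changing at the time (see #22), and the availability strings echo the zh example in the diff; treat all of it as illustrative rather than normative.)

```js
// Hypothetical sketch of the testing-only surface discussed above.
const capabilities = await ai.summarizer.capabilities();

capabilities.languageAvailable("zh-Hant");    // e.g. "readily"
capabilities.languageAvailable("zh-Hans");    // e.g. "after-download"

// Extra subtags fall back via LookupMatchingLocaleByBestFit:
capabilities.languageAvailable("zh-Hant-TW"); // e.g. "readily"
```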
