Specify language tag fallback support #17
Conversation
This relies on the infrastructure from ECMA-402 to give sensible answers about language support even in the presence of language subtags.
I'm going to merge this for now as I am doing some other spec restructuring and I want to put it on top of this. Regardless, any review or help is appreciated, even after merging.
This seems fine.
Yes, although ICU/CLDR is not necessarily everywhere.
It is different, although it has some similarities. This is a 1:1 matching problem (that is, a resource lookup problem), while language arcs have two sides to match (source and target). In this case, one has some text in a language and one wishes to use the best summarizer for it. Most language tag matching schemes match long tags to shorter ones (e.g. […]). However, you can also have shorter-to-longer matching, e.g. if your document is labeled […].
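To make the two matching directions concrete, here is a minimal illustrative sketch (not spec text; the lookupMatch function is made up for this example) of RFC 4647 "Lookup"-style matching, which only handles the longer-to-shorter direction:

```js
// Illustrative only: Lookup-style matching truncates the requested tag from
// the right until it hits a supported tag.
function lookupMatch(requestedTag, supportedTags) {
  const supported = new Set(supportedTags.map(tag => tag.toLowerCase()));
  let candidate = requestedTag.toLowerCase();
  while (true) {
    if (supported.has(candidate)) return candidate;
    const lastDash = candidate.lastIndexOf("-");
    if (lastDash === -1) return null; // no shorter prefix left to try
    candidate = candidate.slice(0, lastDash);
  }
}

lookupMatch("zh-Hant-TW", ["zh", "en"]);   // "zh"  — longer-to-shorter works
lookupMatch("zh", ["zh-Hans", "zh-Hant"]); // null  — shorter-to-longer needs
                                           // likely-subtags data instead
```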
Thanks for the review!
The way this is handled in the current PR is via the "language tag set completeness rules", which state that if you have […]. It gets trickier when you ask: how can we ensure they pick the "correct" one of the two existing ones? (Which is probably […].)
When I asked @sffc about this, he said […], and so explicitly calling addLikelySubtags was not necessary. Do you think that's right?
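For context (this is not part of the PR), the ECMA-402 surface for Add Likely Subtags is Intl.Locale.prototype.maximize(), and a best-fit matcher can consult the same data internally, which is presumably why the explicit call is unnecessary:

```js
// Add Likely Subtags, as exposed by ECMA-402's Intl.Locale.
new Intl.Locale("zh").maximize().toString();      // "zh-Hans-CN"
new Intl.Locale("zh-TW").maximize().toString();   // "zh-Hant-TW"
new Intl.Locale("zh-Hant").maximize().toString(); // "zh-Hant-TW"
```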
1. Let |languageTag| be that language, represented as a BCP 47 language tag string. <span class="issue">Describe how to handle subtags.</span>
<div class="example" id="example-subtags-chinese">
A common setup seen in today's software is to support two types of written Chinese: "traditional Chinese" and "simplified Chinese". Let's suppose that the user agent supports summarizing text written in traditional Chinese readily, and simplified Chinese after a download.
Observation: The idea of "downloadable locales" is something I've proposed multiple times in different forms in ECMA-402, but it so far hasn't landed because of the impact it has on fingerprinting/privacy.
1. Set |availableLanguages|[|languageTag|] to "{{AICapabilityAvailability/readily}}".
One way this could be implemented would be for [=current summarizer language availabilities=] to return that « "`zh-Hant`" » is readily available, and « "`zh`", "`zh-Hans`" » is available after download. This return value conforms to the requirements of the [=language tag set completeness rules=], in ensuring that "`zh`" is present. Per <a class="allow-2119" href="#readily-or-after-download-implementation-defined">the "should"-level guidance</a>, the implementation has determined that "`zh`" belongs in the list of after-download available languages, with "`zh-Hans`", instead of in the list of readily available languages, with "`zh-Hant`".
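A hedged sketch of how an implementation might make that "should"-level allocation decision for the bare "zh" tag, using likely-subtags data (nothing below is required by the spec text; the sets and logic are illustrative only):

```js
// Illustrative only: decide which availability bucket the bare "zh" tag joins,
// based on which script the likely-subtags data says "zh" implies.
const readily = new Set(["zh-Hant"]);
const afterDownload = new Set(["zh-Hans"]);

const likelyScript = new Intl.Locale("zh").maximize().script; // "Hans"
if (afterDownload.has(`zh-${likelyScript}`)) {
  afterDownload.add("zh"); // yields « "zh", "zh-Hans" » after download
} else {
  readily.add("zh");
}
```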
I see why you did this, but it seems like it shouldn't be required for zh-Hans to be supported just because zh-Hant is supported. I filed tc39/ecma402#947.
@@ -413,27 +441,66 @@ Every {{AISummarizerCapabilities}} has an <dfn for="AISummarizerCapabilities">av
</div>
<div algorithm>
The <dfn>current summarizer language availability map</dfn> is given by the following steps. They return a [=map=] from strings representing BCP 47 language tags to {{AICapabilityAvailability}} values, or null. [[!RFC5646]]
The <dfn>current summarizer language availabilities</dfn> are given by the following steps. They return a [=list=] containing two [=list/items=]; the items each are [=sets=] of strings representing [=Unicode canonicalized locale identifier=], or null. [[!ECMA-402]]
It's not clear to me whether this function is directly callable from client code, but in ECMA-402 we don't ever return a full list of available locales; instead, you give us a list and we filter the list. This solves a variety of issues including automatically handling fallback. See FilterLocales
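For reference, that ECMA-402 pattern looks like this in existing APIs (results can vary by build, since the set of available locales is implementation-defined):

```js
// Caller supplies candidate locales; the API filters them rather than
// enumerating everything it supports.
Intl.Collator.supportedLocalesOf(["zh-Hant", "de-CH", "xx"]);
// e.g. ["zh-Hant", "de-CH"] in a typical browser — "xx" is filtered out
```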
Thanks for checking. Yeah, this is not directly callable. There is a languageAvailable(languageTag) method, whose argument we match against this list using LookupMatchingLocaleByBestFit.
(The design is changing slightly; see #22. But the principle of only exposing testing APIs remains.)
There are some speculative use cases for exposing a list of locales, which basically become "build me Google Translate using the browser's functionality". There you want a list of all supported translation source/target pairs. But we're resistant to expose that for fingerprinting reasons so it's currently not in any explainers. I'll be sure to circle back if we do end up with a strong need to expose that.
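For illustration, a hypothetical usage sketch of the testing API described above; the entry point shown here (ai.summarizer.capabilities()) and the exact return values are assumptions, since the design was still changing per #22:

```js
// Hypothetical sketch only — the real entry point and names may differ.
const capabilities = await ai.summarizer.capabilities(); // assumed entry point
capabilities.languageAvailable("zh-Hant"); // e.g. "readily"
capabilities.languageAvailable("zh");      // e.g. "after-download"
capabilities.languageAvailable("xx");      // e.g. "no"
```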
This relies on the infrastructure from ECMA-402 to give sensible answers about language support even in the presence of many subtags.
@aphillips, this is what I came up with after consulting with @sffc.
It has a couple of implementation-defined parts, namely the use of LookupMatchingLocaleByBestFit, and a similar operation when deciding how to allocate "base" languages between more-specific variants. (See the example given for Chinese.) Maybe the latter could be rephrased to use LookupMatchingLocaleByBestFit, to reduce this? Thoughts welcome.
My understanding is that this implementation-definedness is largely a consequence of everyone relying on ICU, which is not specified but which we've all more or less agreed to be fine with.
This doesn't fully solve the "language arcs" problem discussed in webmachinelearning/translation-api#11 in the context of translation. (And, I wouldn't want to close that issue until we have a full spec for translation anyway.) It's only for the summarizer API so far, which has the simpler question "is this single language supported?" The path to language arcs shouldn't be so hard from here, though.
The end result seems to be pretty reasonable. In particular, it should match ECMA-402 APIs. Since ECMA-402 allows me to do things like new Intl.Collator(["en-US-Braille-x-pirate"]) and get a resolved locale of en-US, or "ja-Bopo-BR" and get a resolved locale of ja, the proposal is that our AI APIs will do the same.
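Those ECMA-402 results can be checked directly (the exact resolutions are implementation-defined, but current engines typically behave as described above):

```js
// Resolved-locale fallback in existing ECMA-402 APIs.
new Intl.Collator(["en-US-Braille-x-pirate"]).resolvedOptions().locale; // "en-US"
new Intl.Collator(["ja-Bopo-BR"]).resolvedOptions().locale;             // "ja"
```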