This repository has been archived by the owner on Jan 25, 2020. It is now read-only.

Support for per-key metadata & processing directives (breaking changes) #13

Open
aredridel opened this issue Aug 4, 2014 · 35 comments

@aredridel
Contributor

Erik just suggested something I find compelling.

What about allowing metadata in the key (or, I'd suggest, the value) of the .properties format? (It could be carried over to the JSON format, too):

akey{translate=false}=A Key

I suggested putting the metadata in the value to simplify consumption, so that a plain .properties reader can ignore the metadata entirely and defer stripping it until lookup time:

akey={notranslate}A Key

Specific flags could be defined later, with specific meanings.
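To make the deferral concrete, a reader that ignores metadata until lookup time might look like this minimal sketch (the function names are hypothetical; only the {notranslate} flag comes from this thread):

```javascript
// Sketch: strip a leading {flag,flag2} metadata block from a value at
// lookup time, so the .properties reader itself stays metadata-unaware.
function parseValueMeta(raw) {
  const match = /^\{([^}]*)\}/.exec(raw);
  if (!match) return { flags: [], value: raw };
  return {
    flags: match[1].split(',').map(f => f.trim()).filter(Boolean),
    value: raw.slice(match[0].length)
  };
}

function lookup(table, key) {
  const raw = table[key];
  if (raw === undefined) return undefined;
  return parseValueMeta(raw).value; // metadata ignored for display
}

// lookup({ akey: '{notranslate}A Key' }, 'akey') yields 'A Key'
```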

As far as processing directives, it was suggested to namespace keys for special behavior with @:

@include=../../US/en/base.properties

I heartily agree.

There's a question about whether we should support multiple @include directives in the same file -- if each uses its own @include= line, it can't be done as post-processing of a parsed .properties file, because only one value per key survives parsing; it would have to become a lexical extension.
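To illustrate why multiple includes can't ride on ordinary key/value semantics: a plain .properties parse keeps only one value per key, so repeated @include= lines would collapse. Handling the directive lexically, before key/value folding, sidesteps that (a hypothetical sketch, not a proposal for the exact grammar):

```javascript
// Sketch: scan lines lexically, collecting every @include before the
// remaining lines are folded into a key/value map (where duplicate
// keys would otherwise clobber one another).
function scanDirectives(text) {
  const includes = [];
  const rest = [];
  for (const line of text.split('\n')) {
    const m = /^@include\s*=\s*(.+)$/.exec(line.trim());
    if (m) includes.push(m[1].trim());
    else rest.push(line);
  }
  return { includes, rest };
}
```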

Specific questions: What sigil to use for directives -- $, @, _ ? Any special syntax for multiple includes? Do we need multiple includes? Does the suggestion for per-value flags make sense? Is there a better one?

Would we want to introduce optional quoting of values to handle significant whitespace?

@kkleidal

kkleidal commented Aug 4, 2014

I like the idea of embedded metadata. As a side note: the state-based customization I proposed in #12 would be able to handle key- or value-embedded metadata. You would just have to override the processKeyValue function accordingly. That would allow the user to choose how metadata is represented (in the key, in the value, in the comments, etc.). The upside is more flexibility; the downside is less standardization.

@aredridel
Contributor Author

Yeah, I don't think that's something we'd want to make flexible without a good use-case for the variation. At a certain point you're no longer making a file format but a family of related ones. Expressivity goes up; the meaning you can get from it goes down.

@aredridel
Contributor Author

@kkleidal I'd love push-back on any of this -- jump in, please! If we're going to break compatibility, it'd be nice to have a maintainable format, processable by a wide variety of tools, and not difficult to work with.

@kkleidal

kkleidal commented Aug 4, 2014

Key-embedded metadata sounds promising. It sounds very similar to XML attributes, and the '{' braces don't seem to interfere with the rules already in place for key validation. In contrast, in values the '{' braces would interfere with L10n placeholder tags.

The only problem which persists is the question of losslessness. I know we've discussed readjusting that goal, but my manager just gave me new insight into why it has been a goal in the past. Historically, localization has aimed to be byte-perfect, meaning that PJMs/linguists could do a diff between translated and non-translated files and the only thing the diff should pick up is the translated strings themselves. If deserialization/serialization isn't lossless, this process won't work as-is; it would require a new diff tool (and acceptance of the new diff tool). Abandoning losslessness would mean changing the process of localization (which isn't necessarily bad, it's just... well, change). I just want to offer up that perspective.

@aredridel
Contributor Author

Yeah. I think losslessness for values is super important -- but I don't think it is for metadata at all.

That said, validating work is important.

@aredridel
Contributor Author

The reason I hesitate is that losslessness is expensive. It implies certain shapes of the API, it restricts what must be stored in data structures, it's just a big chunk of stuff hanging out, but not relied on directly. That dependency is implicit, too. It can't be searched for, it's just something you have to maintain and not break, forever.

@kkleidal

kkleidal commented Aug 4, 2014

Well then maybe we should look into rewriting diff. I don't know how deep that rabbit hole runs though; I'll have to talk more with my manager and @diegone (who is out today).

@aredridel
Contributor Author

Yeah. That's a good concern -- it may, also, not be a long term problem if what's stored in repos tends to be the output of the process.

Perhaps adding 'diff' functions to spud is valuable. I think that's a good place for it.

@diegone

diegone commented Aug 5, 2014

Here are my thoughts:

  • hacking the value to something like {notranslate}A Key seems clever from an engineering perspective, in that the parsing/retention of meta is delayed until consumed/needed, but it doesn't seem particularly logical/intuitive. Besides needing different, better separators so they won't be confused with placeholders, they'd need special processing so they are not exposed to linguists, and it would introduce new escaping requirements for the new separators
  • enriching the key to akey{translate=false} starts being awfully similar to an SGML-like <akey translate=false>. Not sure the argument that Shift-, is much more user friendly than Shift-[ will hold
  • I wouldn't advocate it for enterprise use, but for a small project, even a CSV file would be better than this format:
    • it's a standard format everybody understands
    • it can be opened in Excel and almost every l10n vendor supports it
    • it's easy to add metadata to keys, just add a new column!
  • I think the argument that transformation needs to preserve space because we run diffs on them doesn't make much sense. If we have proper validation tools there is little reason to compare source files and target files. However, it makes sense to diff one version of a translated file and another, so we should guarantee that in that case the diff would only show meaningful changes (e.g. order of keys is preserved)
  • preserving leading/trailing white-space in values is not desirable, in fact it should be trimmed. L10n processes/linguists are not guaranteed to maintain non-linguistic spaces, so it should not be relied on
  • if we're going to come up with a new made up format, we should have very good reasons why we can't use an existing format for which parsers/editors/validators/documentations may already exist and be well understood

I think the real objectives here are:

  • standardize l10n concerns in the file format so that they are not custom hacks patched on by some l10n team but are officially sanctioned by the framework
  • standardize an in-memory representation of resource files so that it can be consumed both by runtime and back-office needs => JSON has a JS Object/Array in-memory model, XML has a DOM in-memory model, what is the in-memory model of this format? If the answer is "depends" then it's difficult to build on top of. Not sure why we would want different in-memory models depending on use cases unless the overhead is unbearable
  • be resilient to merge issues: format should prevent as much as possible issues caused by bad merges (e.g. current translate begin/end is convenient but brittle, alternative translate flag per key is more solid but cumbersome)

Besides, here are some l10n concerns not currently addressed:

  • the very basic requirement of specifying what needs to be adapted for what country and translated to what language is missing.
    • Current assumption that US means "everything" is just wrong and too coarse grained leading to over-translations
    • Current assumption that /en overrides the source for a country (and all languages implied by that country) is simplistic and misses the common requirement of "this file is applicable to these 2 countries"
    • A workaround could be to use empty files with an include directive for each file that needs translation, but it's very cumbersome
  • values are not standardized: need to know what's a placeholder, what's translatable, what encoding to apply (e.g. html entities, js escapes, html attributes, dust placeholders)
  • key-level fallback/override
    • runtime fallback and l10n source should be driven by the same meta
    • sometimes developers want to run their app while a few new strings are still being translated and thus missing from the old translated files. It'd be nice to fallback to English rather than snowman (even if the include directive is not there because no adaptation was needed)
    • the include directive can be made to work for adaptations, but if only a string needs to be overridden, a new entire file needs to be created
    • multiple-level fallbacks may be desirable (e.g. AT -> DE -> EMEA -> GLOBAL)
    • per locale fallback may be desirable (e.g. en-AU is often used as the base for some countries)
  • confusion between language/locale/country
    • the intent of the DE/en folder is to place content for the "DE" site. Sites are commonly mapped to countries and this leads to the confusion that "DE" represents a country and it can be derived from the "en_DE" locale or "en-DE" language. However, sometimes sites don't map 1:1 to countries. For example there could be a pan-European EMEA site, China-for-export site, mobile sites, and so on. It would be beneficial treating the top level directory as a site identifier and leave the language intact (e.g. "EMEA/en-US" could hold US English for an EMEA site to be translated into en-GB)
  • because source and target content share the same locales root, there is confusion between source content and target content
    • can't tell whether a DE/en file is an adaptation done by a developer, or a full English file delivered by l10n for the DE/en site
    • can't tell whether GB/en is the en-US source written by the developer or the en-GB target written by l10n
    • the same DE/en file is used by developers to put keys specific to DE-only functionality, but also by content writers to store adaptations. This leads to merge issues/lack of ownership
  • localization notes
  • plurals

Some of the issues above are not spud-specific, so there lies a challenge orchestrating multi-module changes.

@aredridel
Contributor Author

hacking the value to something like {notranslate}A Key seems clever from an engineering perspective, in that the parsing/retention of meta is delayed until consumed/needed, but it doesn't seem particularly logical/intuitive. Besides needing different, better separators so they won't be confused with placeholders, they'd need special processing so they are not exposed to linguists, and it would introduce new escaping requirements for the new separators

Yes. It would be a new file format, with proper support for metadata, explicit, not as "comments". Superficially similar, and pretty easy to learn.

@aredridel
Contributor Author

enriching the key to akey{translate=false} starts being awfully similar to an SGML-like <akey translate=false>. Not sure the argument that Shift-, is much more user friendly than Shift-[ will hold

Yes. Adding metadata explicitly looks something like this in any case.

@aredridel
Contributor Author

  • I wouldn't advocate it for enterprise use, but for a small project, even a CSV file would be better than this format:
    • it's a standard format everybody understands
    • it can be opened in Excel and almost every l10n vendor supports it
    • it's easy to add metadata to keys, just add a new column!

I'd strongly disagree here. CSV leaves major points unspecified: Character set, line ending, quote handling, how multi-line values are handled, what the columns mean, whether to use headers. There are very few interoperable CSV implementations. Excel and LibreOffice disagree on at least three of these points and require explicit configuration to override -- and in some cases, there is no option to override.

It's also distinctly difficult to get right when editing manually, and errors cause misparses rather than outright failures.

In addition, it's merely a transport format with no meaning. Any meaning has to be applied as another layer on top.

@aredridel
Contributor Author

I think the argument that transformation needs to preserve space because we run diffs on them doesn't make much sense. If we have proper validation tools there is little reason to compare source files and target files. However, it makes sense to diff one version of a translated file and another, so we should guarantee that in that case the diff would only show meaningful changes (e.g. order of keys is preserved)

I'm inclined to agree.

@aredridel
Contributor Author

preserving leading/trailing white-space in values is not desirable, in fact it should be trimmed. L10n processes/linguists are not guaranteed to maintain non-linguistic spaces, so it should not be relied on

Makes sense, considering the number of layers that would have to preserve such things. We'll have to call this out explicitly. Right now spaces are preserved in the .properties format.

@aredridel
Contributor Author

if we're going to come up with a new made up format, we should have very good reasons why we can't use an existing format for which parsers/editors/validators/documentations may already exist and be well understood

👍

@aredridel
Contributor Author

standardize l10n concerns in the file format so that they are not custom hacks patched on by some l10n team but are officially sanctioned by the framework

Agreed. Having multiple groups responsible for different aspects of an underspecified file format is asking for trouble. "Those are just comments" from one perspective are "that's critical information" from another, for instance.

@aredridel
Contributor Author

standardize an in-memory representation of resource files so that it can be consumed both by runtime and back-office needs => JSON has a JS Object/Array in-memory model, XML has a DOM in-memory model, what is the in-memory model of this format? If the answer is "depends" then it's difficult to build on top of. Not sure why we would want different in-memory models depending on use cases unless the overhead is unbearable

Consuming translations for display is in essence a hash table. Unordered, fast lookup, by key. In addition, we expect to be able to organize them in groups, and to iterate in order a few specific parts.

Consuming translations for translation is in essence a list, with metadata associated.

These are at odds with each other.

@aredridel
Contributor Author

be resilient to merge issues: format should prevent as much as possible issues caused by bad merges (e.g. current translate begin/end is convenient but brittle, alternative translate flag per key is more solid but cumbersome)

Good concern. This is among my discomforts with begin/end markers, and doubly so with ones that are 'advisory' rather than syntactic as in comments.

@aredridel
Contributor Author

values are not standardized: need to know what's a placeholder, what's translatable, what encoding to apply (e.g. html entities, js escapes, html attributes, dust placeholders)

Agreed, though not sure it should be solved at this layer. We are talking about a markup language at this point.

@totherik
Contributor

totherik commented Aug 5, 2014

This thread got really off topic, so much so I'm inclined to lock it as it no longer addresses the question at hand. This is an attempt to get it back on track.

Proposal

# `@` prefix to denote document-level metadata/pragmas
@include my/other/content.properties

# Use of `{}` delimited KVPs to denote key-level metadata
copyright{translate=false}=Copyright 2014, PayPal, Inc.
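A rough sketch of how a parser might treat the two forms above (the regexes and names are illustrative only, not a spec; quoting and escaping are deliberately out of scope here):

```javascript
// Sketch: split one line into a document-level pragma, or a key/value
// pair with optional {k=v ...} key-level metadata, per the proposal.
function parseLine(line) {
  let m = /^@(\S+)\s+(.+)$/.exec(line);             // @include my/other/…
  if (m) return { type: 'pragma', name: m[1], arg: m[2] };
  m = /^([^{=]+)(?:\{([^}]*)\})?=(.*)$/.exec(line); // key{meta}=value
  if (!m) return null;
  const meta = {};
  for (const pair of (m[2] || '').split(/\s+/).filter(Boolean)) {
    const [k, v] = pair.split('=');
    meta[k] = v;
  }
  return { type: 'pair', key: m[1], meta, value: m[3] };
}
```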

Request for Feedback

The question here is, does this proposal get us closer to where we need to be while leaving us the flexibility to expand upon it in the future?

If it does not, please state specifically which parts will not work and suggestions for altering it such that the design gets us closer to our goal.

(Note: this file format is already an implicit "spec" due to the use of utf8 and the inclusion of support for indexed and subscript key as discrete datatypes. As a result, this proposed change does not result in a "new format.")

@diegone

diegone commented Aug 5, 2014

What is the syntax/semantics of the @ proposal? Is it a k-v pair, or does it depend on the actual directive? What is the implied in-memory representation?

In other words: is @ semantically another comment or another key or something else?

@aredridel
Contributor Author

For the purposes of consuming content for display at runtime, @include otherfile will load the contents of the other file, filling in unset keys. Syntactically, it is not a content key-value pair.

The syntax can be used for other document-level metadata.

Again, for the purposes of display, we eventually need an efficient map from key to value. If we define metadata that would be interpreted at that time, we will need an efficient map from key to an object representing value and line-level metadata.

For the purpose of translation and localization, I can see treating the files separately making a lot of sense, leaving their source structure intact.

There is no implication of in-memory representation in general.
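The fill-in-unset-keys semantics described above can be sketched as follows (assuming included files are already parsed into plain objects; the function name is hypothetical):

```javascript
// Sketch: resolve @include by filling in only the keys the including
// file leaves unset; the including file's own keys always win.
function resolveIncludes(own, includedTables) {
  const out = { ...own };
  for (const table of includedTables) {
    for (const [key, value] of Object.entries(table)) {
      if (!(key in out)) out[key] = value; // fill unset keys only
    }
  }
  return out;
}
```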

@diegone

diegone commented Aug 5, 2014

copyright{translate=false} is better than a comment. How do we support multiple meta keys that may contain whitespace like a localization note? It sounds like we may need quotes around values. Which also means we need some sort of escaping rules.

In terms of extensibility, it'd be nice to make the syntax/semantics extensible so it doesn't become a breaking change every time we introduce something new.

@aredridel
Contributor Author

Semantics are easier to make extensible than syntax, since there are opportunities for namespacing. I think this proposal provides that -- document- and line-level metadata, extensible by allocating another name.

Quoting around values makes sense. How about JS-style quoting, as used for JavaScript property names? Optional for simple bare words; quote and escape otherwise:

PropertyName :
    IdentifierName
    StringLiteral
    NumericLiteral

Syntax being extensible means you leave swathes 'undefined' and then carefully pick around any messes made. Let's avoid that.
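A sketch of the optional-quoting idea for metadata values: bare words pass through, while quoted values may contain whitespace, braces, and apostrophes, so they cannot accidentally close the metadata block (the backslash escape set here is a minimal assumption, not a spec):

```javascript
// Sketch: read one metadata value starting at position i in src.
// Bare words run to whitespace or '}'; quoted values honor \-escapes,
// so braces and apostrophes inside quotes can't close the meta block.
function readMetaValue(src, i) {
  const quote = src[i];
  if (quote !== "'" && quote !== '"') {
    const m = /^[^\s}]+/.exec(src.slice(i));
    return { value: m[0], end: i + m[0].length };
  }
  let out = '';
  for (let j = i + 1; j < src.length; j++) {
    const ch = src[j];
    if (ch === '\\') { out += src[++j]; continue; } // escaped char
    if (ch === quote) return { value: out, end: j + 1 };
    out += ch;
  }
  throw new Error('unterminated quoted value');
}
```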

@diegone

diegone commented Aug 5, 2014

Feel free to tell me to shut up if I'm off-topic.

The reason I was bringing up other requirements was to address the "is it extensible" question and to highlight possible new breaking changes that may be required in the future. If we think after this we're done, we may be good to go. But if we think we're going to add more to it (e.g. to have a standardized message format), and potentially if we want to make it more usable (e.g. today keys are not scoped but flattened out, so you need to repeat a.b.c.d over and over), then the format we end up with deviates from a Java/iOS properties file so much that its similarity may hurt us more than help us.

For example, extending the example above to:

copyright{translate=true locNode='use {currentYear} so that we don't have to change it every year'}=Copyright 1999-{currentYear}, PayPal, Inc.

would lead to parsing errors/unpredictable results and it wouldn't be obvious because this is a new format nobody is familiar with.

What I suggest is that we mock what the end game may look like and then see whether it's better adding more features to this format, coming up with a new one, or extending a different one (e.g. https://github.com/aseemk/json5)

@aredridel
Contributor Author

Yeah. I think we explicitly don't want JSON-like; json5 fixes some of the error prone spots, but I don't think it brings anything useful to the table -- certainly not metadata, we'd have to invent that as a layer on top. If there's something specific we should use, by all means, bring it up. I think we'll probably reject XML out of hand, because developers mostly hate it. I don't know of another format that's particularly suited or has amazing tooling that we care about beyond that.

I'm not sure what your extended example means or what the problem is there, other than that there's an unescaped apostrophe. We can define an unambiguous grammar here, though. This doesn't have to be dicey.

I think message format can be left orthogonal -- and decisions around it deferred -- at this point by accepting any unicode sequence as a value. I don't think there's any value in tying those parts together.

@diegone

diegone commented Aug 6, 2014

That was my point. If you can't figure out what's wrong in my example, the average developer unfamiliar with this custom format would be even more confused. The other mistake in my example is that the } is also unescaped and would inadvertently close the meta block. Nobody is trained to escape braces, so it's error-prone.

The value of adopting something like json or json5 is that everybody knows the rules of correct json or js syntax -- you don't need to explain that you need to escape braces. On top of it you get benefits like nested objects instead of repeated unscoped keys.

What is the argument about preferring an unfamiliar new format instead of something every js developer is already familiar with? (like js syntax)

By using something like json5 you have a rich-enough familiar syntax you can use to support new features later on. I don't think the messageformat discussion is orthogonal because tomorrow when you try to define how to represent a placeholder, it's not going to be "any unicode sequence" anymore and you're going to have to make breaking changes again and invent some new syntax and new escaping rules. All I'm saying is: can we settle on a syntax/format that is robust enough to support future requirements we already know we want?

@aredridel
Contributor Author

No, I can't figure it out because it hasn't been defined. ;-) Using the rules I'd proposed, the braces wouldn't close anything, because they're quoted.

The JSON family though don't give us a structure for metadata -- we'll have to impose a restriction on it to create a format.

I don't think we consider nested objects without the repetition a benefit. Certainly painful for merges.

The argument for a purpose-built format is to restrict it to the task at hand: the continuum from expressiveness to specificity has trade-offs; choose a generic container like JSON and you have to restrict its use quite a lot to express what you mean; choose something more generic still, like bytestreams, and you must restrict it even further to specify what you mean.

On the other end of the spectrum, tighten it too much -- like the existing .properties format -- and you can't express what's needed at all -- you end up recycling underused syntax for metadata, tacking it on in the margins.

@rragan
Contributor

rragan commented Aug 6, 2014

I need to read through all this list but for what it's worth, we have extended .properties to allow target metadata without breaking things for the general user. Syntax is:

contentElement.testTarget.@target[DE,AT,EBAY_DE,EBAY_AT]=Australian DE & AT target TestPrj

where the metadata is in the last element, with a name of @target and values in brackets following it. This format is transparently mapped to/from the 4cb representation.

@aredridel
Contributor Author

Yeah. Overloading the key is one of the options for doing this; that said, at a certain point, you start adding hard-to-preserve semantics. Consuming those with vanilla spud would be difficult, too. A post-processing step is needed in that case.

@diegone

diegone commented Aug 6, 2014

When you say "we have extended" is it implemented in some fork?

@totherik
Contributor

totherik commented Aug 6, 2014

Again, I need to bring this discussion back on track.

If there is a proposal to use an alternate document format, please file that as a separate issue. As far as I can tell, that should merely involve writing a serializer for that particular format. Also, if the use of a properties-based format is a complete non-starter for adoption, a good place to capture that rationale is in the context of a proposed alternate solution or file format. It is not a good use of anyone's time to keep churning in this fashion.

Regarding this proposal, if there are scenarios that seem problematic with the proposed design, please propose alternative solutions. A general "this won't work due to this 2% use-case" is not constructive and has not added value to this discussion.

@totherik
Contributor

totherik commented Aug 6, 2014

Apparently some comments were made while I was authoring my previous one and they're pertinent to this topic, so I want to make sure this isn't lost:

@rragan, per @diegone

When you say "we have extended" is it implemented in some fork?

@rragan
Contributor

rragan commented Aug 6, 2014

Proposal:

Same @xxxx as Erik's for document-level metadata/pragmas

For key-level metadata: @xxx[params].@yyy[params]... where all such metadata appears at the end of the key, after the parts of the key proper. This keeps the key intact as an entity and further limits ambiguity by making all @ metadata be grouped at the end of the key. Examples:

hello=Hi
hello.@target[DE].@translate[no]=Guten Morgen

The syntax is consistent with existing general rules for tokens that can appear in the key. It provides a direct mapping from XML attributes to this format (and back).
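A sketch of pulling the trailing @xxx[params] segments off such a key (the function name is hypothetical; only the @target/@translate examples come from this comment):

```javascript
// Sketch: split a key like hello.@target[DE].@translate[no] into the
// plain key and a metadata map, reading @-segments from the end.
function splitKeyMeta(key) {
  const parts = key.split('.');
  const meta = {};
  while (parts.length) {
    const m = /^@(\w+)\[([^\]]*)\]$/.exec(parts[parts.length - 1]);
    if (!m) break;
    meta[m[1]] = m[2].split(',').map(s => s.trim());
    parts.pop();
  }
  return { key: parts.join('.'), meta };
}
```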

We have a working converter from this format of property file to 4cb's and back again without loss of fidelity. See the ContentConverters repo (an internal repo so not generally accessible). It is in Java because L10N tools are in Java and want to go 4cb<->properties programmatically. Note that there is no support for the target concept in content yet. This work was a step down that path.

Note: Use of [ ]'s vs Erik's { }'s is no big deal. We tried to stay within the general token formats of the key and { } did not do that.

@totherik
Contributor

totherik commented Aug 6, 2014

Great, thanks for the detail @rragan.
