Support for per-key metadata & processing directives (breaking changes) #13
I like the idea of embedded metadata. As a side note: the state-based customization I proposed in #12 would be able to handle key- or value-embedded metadata. You would just have to override the processKeyValue function accordingly. That would allow the user to choose how metadata is represented (in the key, in the value, in the comments, etc.). The upside is more flexibility; the downside is less standardization.

Yeah, I don't think that's a thing we'd want to flex without a good use-case for that variation. At a certain point you're not making a file format but a family of related ones. Expressivity goes up; the meaning you can get from it goes down.

@kkleidal I'd love push-back on any of this -- jump in, please! If we're going to break compatibility, it'd be nice to have a maintainable format, processable by a wide variety of tools, and not difficult to work with.

Key-embedded metadata sounds promising. It sounds very similar to XML attributes, and the '{' brackets don't seem to interfere with the rules already in place for key validation. In contrast, the '{' brackets would interfere with L10n placeholder tags in values. The only problem that persists is the question of losslessness. I know we've discussed readjusting that goal, but my manager just gave me new insight into why it has been a goal in the past. In the past, localization has aimed to be byte-perfect, meaning that PJMs/linguists could do a diff between translated and non-translated files, and the only thing the diff should pick up is the translated strings themselves. If the deserialization/serialization isn't lossless, this process won't work as-is; it would require a new diff tool (and acceptance of the new diff tool). Abandoning losslessness would mean changing the process of localization (which isn't necessarily bad, it's just... well, change). I just want to offer up that perspective.

Yeah. I think losslessness for values is super important -- but I don't think it is for metadata at all. That said, validating work is important.

The reason I hesitate is that losslessness is expensive. It implies certain shapes of the API, it restricts what must be stored in data structures, and it's just a big chunk of stuff hanging out, not relied on directly. That dependency is implicit, too: it can't be searched for; it's just something you have to maintain and not break, forever.

Well then, maybe we should look into rewriting diff. I don't know how deep that rabbit hole runs, though; I'll have to talk more with my manager and @diegone (who is out today).

Yeah. That's a good concern -- it may also not be a long-term problem, if what's stored in repos tends to be the output of the process. Perhaps adding 'diff' functions to spud is valuable; I think that's a good place for it.
Here are my thoughts:
I think the real objectives here are:
Besides, here are some l10n concerns not currently addressed:
Some of the issues above are not spud-specific, so there lies a challenge in orchestrating multi-module changes.
Yes. It would be a new file format, with proper support for metadata -- explicit, not as "comments". Superficially similar, and pretty easy to learn.

Yes. Adding metadata explicitly looks something like this in any case.

I'd strongly disagree here. CSV leaves major points unspecified: character set, line endings, quote handling, how multi-line values are handled, what the columns mean, whether to use headers. There are very few interoperable CSV implementations. Excel and LibreOffice disagree on at least three of these points and require explicit configuration to override -- and in some cases, there is no option to override. It's also distinctly difficult to get right when editing manually, and errors cause misparses rather than outright failures. In addition, it's merely a transport format with no meaning; any meaning has to be applied as another layer on top.

I'm inclined to agree.
Makes sense, considering the number of layers that would have to preserve such things. We'll have to call this out explicitly. Right now spaces are preserved in the

👍

Agreed. Having multiple groups responsible for different aspects of an underspecified file format is asking for trouble. "Those are just comments" from one perspective is "that's critical information" from another, for instance.

Consuming translations for display is in essence a hash table: unordered, fast lookup, by key. In addition, we expect to be able to organize them in groups, and to iterate a few specific parts in order. Consuming translations for translation is in essence a list, with metadata associated. These are at odds with each other.
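The tension between those two access patterns can be sketched as two in-memory shapes; this is illustrative only, not spud's actual API (`displayTable`, `translationList`, and the `maxLength` metadata name are invented for the example):

```javascript
// Display: an unordered hash table for fast lookup by key.
const displayTable = new Map([
  ["greeting", "Hello"],
  ["farewell", "Goodbye"],
]);

// Translation: an ordered list of entries, each carrying its metadata,
// so source order and annotations can survive a round trip.
const translationList = [
  { key: "greeting", value: "Hello", meta: { maxLength: 12 } },
  { key: "farewell", value: "Goodbye", meta: {} },
];

console.log(displayTable.get("greeting"));
console.log(translationList[0].meta.maxLength);
```

Serving both consumers from one format means either building the map from the list at load time, or maintaining two representations.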
Good concern. This is among my discomforts with begin/end markers, and doubly so with ones that are 'advisory' rather than syntactic, as comments are.

Agreed, though I'm not sure it should be solved at this layer. We are talking about a markup language at this point.
This thread has gotten really off-topic, so much so that I'm inclined to lock it, as it no longer addresses the question at hand. This is an attempt to get it back on track.

**Proposal**
**Request for Feedback**

The question here is: does this proposal get us closer to where we need to be, while leaving us the flexibility to expand upon it in the future? If it does not, please state specifically which parts will not work, and suggest alterations that bring the design closer to our goal. (Note: this file format is already an implicit "spec" due to the use of utf8 and the inclusion of support for indexed and subscript keys as discrete datatypes. As a result, this proposed change does not result in a "new format.")
What is the syntax/semantics of the In other words: is

For the purposes of consuming content for display at runtime, The syntax can be used for other document-level metadata. Again, for the purposes of display, we eventually need an efficient map from key to value. If we define metadata that would be interpreted at that time, we will need an efficient map from key to an object representing value and line-level metadata. For the purpose of translation and localization, I can see treating the files separately making a lot of sense, leaving their source structure intact. There is no implication of in-memory representation in general.

In terms of extensibility, it'd be nice to make the syntax/semantics extensible so it doesn't become a breaking change every time we introduce something new.
Semantics are easier to make extensible than syntax, since there are opportunities for namespacing. I think this proposal provides that -- document- and line-level metadata, extensible by allocating another name. Quoting around values makes sense. How about JS-style quoting, as used in JavaScript property names? Optional for simple bare words; quote and escape otherwise:
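A sketch of what that optional JS-style quoting might look like in a key-embedded metadata entry -- the `{...}` syntax and the metadata names here are illustrative, not part of any agreed spec:

```properties
# Bare words need no quotes; anything else is quoted and escaped JS-style.
greeting{screen=main, note="He said \"hi\"", max="12 chars"}=Hello
title{section=home}=Welcome
```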
Syntax being extensible means you leave swathes 'undefined' and then carefully pick around any messes made. Let's avoid that.
Feel free to tell me to shut up if I'm off-topic. The reason I was bringing up other requirements was to address the "is it extensible" question, and to highlight possible new breaking changes that may be required in the future. If we think after this we're done, we may be good to go. But if we think we're going to add more to it (e.g. to have a standardized message format), and potentially if we want to make it more usable (e.g. today keys are not scoped and are flattened out, so you need to repeat a.b.c.d over and over), then the format we end up with deviates from a Java/iOS properties file format so much that its similarity may hurt us more than help us. For example, extending the example above to:

would lead to parsing errors/unpredictable results, and it wouldn't be obvious, because this is a new format nobody is familiar with. What I suggest is that we mock up what the end game may look like, and then see whether it's better to add more features to this format, come up with a new one, or extend a different one (e.g. https://github.com/aseemk/json5).
Yeah. I think we explicitly don't want JSON-like; I'm not sure what your extended example means or what the problem is there, other than that there's an unescaped apostrophe. We can define an unambiguous grammar here, though; this doesn't have to be dicey. I think message format can be left orthogonal -- and decisions around it deferred -- at this point, by accepting any unicode sequence as a value. I don't think there's any value in tying those parts together.
That was my point: if you can't figure out what's wrong in my example, the average developer unfamiliar with this custom format would be even more confused. The other mistake in my example is that the

The value of adopting something like JSON or JSON5 is that everybody knows the rules of correct JSON or JS syntax -- you don't need to explain that you need to escape braces. On top of that, you get benefits like nested objects instead of repeated unscoped keys. What is the argument for preferring an unfamiliar new format over something every JS developer is already familiar with (like JS syntax)? By using something like JSON5, you have a rich-enough, familiar syntax you can use to support new features later on.

I don't think the messageformat discussion is orthogonal, because tomorrow, when you try to define how to represent a placeholder, it's not going to be "any unicode sequence" anymore, and you're going to have to make breaking changes again and invent some new syntax and new escaping rules. All I'm saying is: can we settle on a syntax/format that is robust enough to support the future requirements we already know we want?
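For illustration, the nesting argument looks roughly like this -- a hypothetical JSON5 rendering of scoped keys, not a concrete proposal from the thread:

```json5
// Flat .properties keys repeat the full path on every line:
//   a.b.c.title=Hello
//   a.b.c.body=World
// A JSON5-style document scopes them instead:
{
  a: {
    b: {
      c: {
        title: 'Hello',
        body: 'World',
      },
    },
  },
}
```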
No, I can't figure it out, because it hasn't been defined. ;-) Using the rules I'd proposed, the braces wouldn't close anything, because they're quoted. The JSON family, though, doesn't give us a structure for metadata -- we'd have to impose a restriction on it to create a format. I don't think we consider nested objects without the repetition a benefit; it's certainly painful for merges. The argument for a purpose-built format is to restrict it to the task at hand. The continuum from expressiveness to specificity has trade-offs: choose a generic container like JSON and you have to restrict its use quite a lot to express what you mean; choose something more generic, like bytestreams, and you must restrict it greatly to specify what you mean. On the other end of the spectrum, tighten it too much -- like the existing
I need to read through all this list, but for what it's worth, we have extended .properties to allow target metadata without breaking things for the general user. Syntax is:

Yeah. Overloading the key is one of the options for doing this; that said, at a certain point, you start adding hard-to-preserve semantics. Consuming those with vanilla

When you say "we have extended", is it implemented in some fork?
Again, I need to bring this discussion back on track. If there is a proposal to use an alternate document format, please file it as a separate issue. As far as I can tell, that should merely involve writing a serializer for that particular format. Also, if the use of a properties-based format is a complete non-starter for adoption, a good place to capture that rationale is in the context of a proposed alternate solution or file format. It is not a good use of anyone's time to keep churning in this fashion. Regarding this proposal: if there are scenarios that seem problematic with the proposed design, please propose alternative solutions. A general "this won't work due to this 2% use-case" is not constructive and has not added value to this discussion.
Proposal: same @xxxx as Erik's for document-level metadata/pragmas.

For key-level metadata: @xxx[params].@yyy[params]..., where all such metadata appears at the end of the key, after the parts of the key. This keeps the key intact as an entity and further limits ambiguity by making all @ metadata be grouped at the end of the key.

Examples:

```
hello=Hi
```

The syntax is consistent with the existing general rules for tokens that can appear in the key. It provides a direct mapping from XML attributes to this format (and back). We have a working converter from this format of property file to 4cb's and back again, without loss of fidelity. See the ContentConverters repo (an internal repo, so not generally accessible). It is in Java because the L10N tools are in Java and want to go 4cb<->properties programmatically. Note that there is no support for the target concept in content yet; this work was a step down that path.

Note: use of [ ]'s vs. Erik's { }'s is no big deal. We tried to stay within the general token formats of the key, and { } did not do that.
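For what it's worth, a runtime consumer could strip those trailing `@xxx[params]` groups at lookup time. A minimal sketch, assuming params are plain strings (the `splitKey` helper and its regex are illustrative, not the ContentConverters implementation):

```javascript
// Split a key of the form "name.@meta1[params].@meta2[params]" into the
// bare key and a metadata object. Metadata groups appear only at the end
// of the key, per the proposal, so we peel them off right to left.
function splitKey(rawKey) {
  const meta = {};
  // Matches a trailing ".@name[params]" (or "@name[params]") group.
  const pattern = /\.?@([A-Za-z0-9_]+)\[([^\]]*)\]$/;
  let key = rawKey;
  let m;
  while ((m = key.match(pattern))) {
    meta[m[1]] = m[2];
    key = key.slice(0, m.index);
  }
  return { key, meta };
}

console.log(splitKey("hello.@target[web].@max[12]"));
console.log(splitKey("hello"));
```

A vanilla `.properties` reader would still load such keys verbatim; only metadata-aware consumers would need this step.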
Great, thanks for the detail @rragan.
Erik just suggested something I find compelling. What about allowing metadata in the key (or, I suggest, the value) of the `.properties` format? (It can be carried over to the JSON format, too.)

I suggested putting the metadata in the value to simplify consumption, so that a `.properties` reader can ignore the metadata and defer processing to strip metadata to lookup time. Specific flags could be defined later, with specific meanings.

As far as processing directives, it was suggested to namespace keys for special behavior with `@`. I heartily agree.

There's a question about whether we should support multiple `@include` in the same file -- if it uses `@include=` for each, it can't be done as post-processing in `.properties`, because only one key will be used; it would become a lexical extension.

Specific questions: What sigil to use for directives -- `$`, `@`, or `_`? Any special syntax for multiple includes? Do we need multiple includes? Does the suggestion for per-value flags make sense? Is there a better one? Would we want to introduce optional quoting of values to handle significant whitespace?
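The deferred-stripping idea could look roughly like this: a metadata-aware lookup strips a value prefix on access, while a plain `.properties` reader returns the raw string untouched. The `{flag1,flag2}` prefix syntax and the flag names below are invented for illustration; no flag syntax was actually settled here:

```javascript
// Hypothetical: a value may start with a "{flag1,flag2}" metadata prefix.
// A vanilla .properties reader returns the raw string; this wrapper
// strips the prefix at lookup time instead of at parse time.
function lookup(table, key) {
  const raw = table[key];
  if (raw === undefined) return undefined;
  const m = raw.match(/^\{([^}]*)\}([\s\S]*)$/);
  if (!m) return { value: raw, flags: [] };
  return { value: m[2], flags: m[1].split(",").map((f) => f.trim()) };
}

const table = { greeting: "{noTranslate,html}Hello", plain: "Hi" };
console.log(lookup(table, "greeting"));
console.log(lookup(table, "plain"));
```

The design trade-off is the one noted above: stripping at lookup keeps parsing lossless for diff purposes, at the cost of a small per-access check.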