URI construction for DMLex fragments #111
Merged
9 commits
0726bc0 URI construction for DMLex fragments
c53b7f0 adjusted the URI/IRI addressing description according to discussion:
1192372 mention fragment IRIs in the linking module
09225f1 links to mentioned objects in frag_iri (vojtech-kovar)
8a743a5 add few IRI examples (vojtech-kovar)
a794ecd added myself in the list of editors (vojtech-kovar)
12e272b object IRIs: (vojtech-kovar)
ab739df IRIs -> DMLex fragment identification strings (vojtech-kovar)
9adafad "identification strings" (was: IRIs): use listingOrder when duplicate (vojtech-kovar)
@@ -41,16 +41,16 @@
 <para><literal><olink targetptr="core_example">example</olink></literal></para>
 </listitem>
 </itemizedlist>
-<simplesect id="optionalroots">
+<section id="optionalroots">
 <title>Optional roots</title>
 <para>
 When exchanging data encoded in a DMLex serialization
 which has the concept of a "root" or top-level object, such as XML, JSON or NVH,
 the object types <literal>lexicographicResource</literal> and <literal>entry</literal>
 can serve as such roots.
 </para>
-</simplesect>
-<simplesect id="fragid">
+</section>
+<section id="fragid">
 <title>Fragment identification</title>
 <para>
 Incomplete parts of DMLex objects represent valid fragments as long as it is possible to identify their complete source DMLex object.
@@ -64,7 +64,64 @@
 </listitem>
 </itemizedlist>
 </para>
-</simplesect>
+<section id="frag_iri">
+<title>DMLex fragment identification strings</title>
+<para>DMLex provides a recommended method for addressing DMLex objects present on-line, useful for linking (cf. <xref linkend="linking"/>) and general interoperability. Implementing this method is not <glossterm>required</glossterm> for conformance.</para>
+
+<para>Every fragment <glossterm>should</glossterm> be assigned a unique fragment identification string, composed of <literal>lexicographicResource.uri</literal>, with protocol identification prefix (such as <literal>http://</literal> or <literal>https://</literal>) removed, and a sequence of identifiers that uniquely determines the path in the DMLex tree structure. The DMLex fragment identification string of the root object <literal>lexicographicResource</literal> is the value of its attribute <literal>lexicographicResource.uri</literal>, with protocol identification prefix (such as <literal>http://</literal> or <literal>https://</literal>) removed. The fragment identification strings of its direct children are constructed as follows:</para>
+
+<para><literal>lexicographicResource.uri/objectTypeName/objectID</literal></para>
+
+<para>(We define below how object IDs are created.)</para>
+
+<para>The DMLex fragment identification strings of descendant objects are constructed by appending the children's type names and IDs to the fragment identification strings of their direct parents, using “/” as the delimiter. In other words, the full template for a fragment identification string looks as follows:</para>
+
+<para><literal>lexicographicResource.uri/objectTypeName/objectID/child1TypeName/child1ID/child2TypeName/child2ID/…</literal></para>
+
+<para>For example, a particular <literal><olink targetptr="core_sense">sense</olink></literal> (which is a property of <literal><olink targetptr="core_entry">entry</olink></literal>) is assigned the following fragment identification string:</para>
+
+<para><literal>lexicographicResource.uri/entry/entryID/sense/senseID</literal></para>
+
+<para>A fragment identification string of an <literal><olink targetptr="core_example">example</olink></literal> (which is a property of <literal><olink targetptr="core_sense">sense</olink></literal>, which is a property of <literal><olink targetptr="core_entry">entry</olink></literal>) has the following structure:</para>
+
+<para><literal>lexicographicResource.uri/entry/entryID/sense/senseID/example/exampleID</literal></para>
+
+<section id="objectids">
+<title>Object IDs</title>
[Review comment on the line above] Object ID is potentially ambiguous for etymology
+
+<para>For the purpose of creating DMLex fragment identification strings, each object is assigned a unique ID relative to its parent, based on values of its properties declared as <glossterm>unique</glossterm>. Multiple situations can occur:</para>
+
+<orderedlist>
+<listitem>The object type has a single <glossterm>unique</glossterm> property with an arity of “exactly one”, and the value of the property is a string or a number. In this case, the object ID is the string or the number, with the following modifications performed in that particular order:
+<itemizedlist>
+<listitem>every “\” (ASCII character 5C) is replaced by “\\”</listitem>
+<listitem>every “~” (ASCII character 7E) is replaced by “\~”</listitem>
+<listitem>every “_” (ASCII character 5F) is replaced by “\_”</listitem>
+<listitem>every “0” (zero, ASCII character 30) is replaced by “\0”</listitem>
+<listitem>all IRI-unsafe characters (outside the <literal>iunreserved</literal> class according to [<link linkend="bib_rfc3987">RFC 3987</link>]) are percent-encoded according to [<link linkend="bib_rfc3986">RFC 3986</link>]</listitem>
+</itemizedlist>
+</listitem>
+<listitem>The object type has a single <glossterm>unique</glossterm> property with an arity of “exactly one”, and the value of the property is a child DMLex object. In this case, the object ID is the same as the object ID of the child object. (Note: this case actually does not occur in the specification as such; we list it here to streamline the description of the following cases.)</listitem>
+<listitem>The object type has a single <glossterm>unique</glossterm> property with an arbitrary arity. In this case, all the partial single values or child object IDs are constructed according to the steps 1. and 2., and the resulting object ID is their concatenation using “_” (ASCII character 5F) as a separator. The order of the partial values is driven by the <literal>listingOrder</literal> of the respective objects. If this procedure returns an empty string (which can happen in case of <glossterm>unique</glossterm> attributes that allow the arity of zero), the string “0” (zero, ASCII character 30) is used instead of the empty string.</listitem>
+<listitem>The object type has multiple <glossterm>unique</glossterm> properties. In this case, all the partial values or child object IDs are constructed according to the steps 1., 2. and 3., and the resulting object ID is their concatenation using “~” (ASCII character 7E) as a separator. The order of the partial values is driven by the order of the properties as given in this specification. (Note: all attributes marked as <glossterm>unique</glossterm> need to be represented in the ID, as empty values are replaced by “0” according to step 3. No empty IDs are allowed.)</listitem>
+<listitem>In specific situations it may happen there are multiple different objects with all the <glossterm>unique</glossterm> properties empty, i.e. multiple objects with duplicate IDs (the same sequence of zeros) emerge as the result of the step 4. One example of such a situation is multiple senses without <literal>indicator</literal>s or <literal>definition</literal>s, but with different translations. In that case, and only in that case, the value of <literal>listingOrder</literal> is concatenated to the sequence of zeros, to distinguish between the duplicate IDs. If there is only one such object, <literal>listingOrder</literal> is not concatenated to the sequence of zeros.</listitem>
+</orderedlist>
+
+<para>DMLex does not define the structure of DMLex fragment identification strings for object types without <glossterm>unique</glossterm> properties.</para>
+</section>
+<section id="iri_examples">
+<title>DMLex fragment identification string examples</title>
+<para>Particular examples of DMLex fragment identification strings can then look as follows:</para>
+<itemizedlist>
+<listitem><literal>www.example.com/lexicon/entry/cat~1~noun</literal></listitem>
+<listitem><literal>www.example.com/lexicon/entry/cat~1~noun/sense/0~small%20furry%20animal</literal> (Here we assume that the sense's <literal>indicator</literal> is empty and it has one <literal>definition</literal> which says “small furry animal”).</listitem>
+<listitem><literal>www.example.com/lexicon/entry/cat~1~noun/sense/0~small%20furry%20animal/example/I%20have%20two%20dogs%20and%20a%20cat.</literal></listitem>
+<listitem><literal>www.example.com/lexicon/entry/cat~1~noun/sense/0~0</literal> (Here we assume that both the sense's <literal>definition</literal> and its <literal>indicator</literal> are empty, and there is only one such sense.)</listitem>
+<listitem><literal>www.example.com/lexicon/entry/cat~1~noun/sense/0~02</literal> (Here we assume that both the sense's <literal>definition</literal> and its <literal>indicator</literal> are empty, there are multiple such senses, and this is the sense number 2, of all this entry's senses.)</listitem>
+</itemizedlist>
+</section>
+</section>
+</section>
 <xi:include href="objectTypes/lexicographicResource.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
 <xi:include href="objectTypes/entry.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
 <xi:include href="objectTypes/partOfSpeech.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
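As an illustration only (this is not part of the PR), the construction rules in the "Object IDs" section of the diff above can be sketched in Python: the escaping order, joining with "_" and "~", the "0" placeholder, and the listingOrder tie-break. All function names are hypothetical. Child-object values (step 2) are left out, and as a simplification every non-ASCII character is percent-encoded as UTF-8, which is stricter than the iunreserved class of RFC 3987. The sample call assumes, as the "cat~1~noun" example suggests, that the entry has three unique values.

```python
import re
from urllib.parse import quote


def escape_value(value):
    """Step 1: escape one string or number, then percent-encode it."""
    text = str(value)
    # Backslash must be handled first so the backslashes added below stay single.
    for ch in ("\\", "~", "_", "0"):
        text = text.replace(ch, "\\" + ch)
    # quote() keeps ASCII letters, digits and "-._~"; everything else,
    # including the backslash and (simplification) non-ASCII, becomes %XX.
    return quote(text, safe="")


def property_id(values):
    """Step 3: join the values of one unique property (already sorted by
    listingOrder) with "_"; an empty result becomes the placeholder "0"."""
    return "_".join(escape_value(v) for v in values) or "0"


def object_id(unique_properties, listing_order=None):
    """Steps 4 and 5: join the per-property IDs with "~" in specification
    order. Pass listing_order only when several siblings have all unique
    properties empty; it is then appended to the run of zeros."""
    parts = [property_id(values) for values in unique_properties]
    oid = "~".join(parts)
    if listing_order is not None and all(part == "0" for part in parts):
        oid += str(listing_order)
    return oid


def fragment_id_string(resource_uri, path):
    """Strip the protocol prefix from lexicographicResource.uri and append
    typeName/objectID for every (type name, object ID) pair in path."""
    base = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", resource_uri)
    return "/".join([base] + [f"{name}/{oid}" for name, oid in path])


entry = object_id([["cat"], [1], ["noun"]])
sense = object_id([[], ["small furry animal"]])
print(fragment_id_string("https://www.example.com/lexicon",
                         [("entry", entry), ("sense", sense)]))
```

Under these assumptions the call reproduces the second string in the examples list. Note that a literal "0" value becomes "%5C0" after escaping and percent-encoding, so it cannot be confused with the bare "0" placeholder for an empty property.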
@@ -61,6 +61,15 @@
 </affiliation>
 <email>[email protected]</email>
 </editor>
+<editor>
+<firstname>Vojtěch</firstname>
+<surname>Kovář</surname>
+<affiliation>
+<orgname><ulink url="https://www.muni.cz/">Masaryk University</ulink></orgname>
+<address format="linespecific"><email>[email protected]</email></address>
+</affiliation>
+<email>[email protected]</email>
+</editor>
 <editor>
 <firstname>Simon</firstname>
 <surname>Krek</surname>
@@ -620,6 +629,15 @@
 <title/>
 <bibliomixed id="bcp14">
 <abbrev>BCP 14</abbrev> is a concatenation of [RFC 2119] and [RFC 8174] </bibliomixed>
+<bibliomixed id="bib_rfc3986">
+<abbrev>RFC 3986</abbrev>
+Tim Berners-Lee, Roy T. Fielding, Larry M Masinter
+<title>Uniform Resource Identifier (URI): Generic Syntax</title>,
+<citetitle>
+<ulink url="https://datatracker.ietf.org/doc/rfc3986/">https://datatracker.ietf.org/doc/rfc3986/</ulink>
+</citetitle>
+IETF (Internet Engineering Task Force) RFC 3986, January 2005.
+</bibliomixed>
 <bibliomixed id="bib_rfc3987">
 <abbrev>RFC 3987</abbrev>
 Martin J. Dürst, Michel Suignard,
Shouldn't this be lexicographicResource.uri#objectTypeName/objectID?

This PR claims to create fragments but does not. Fragments are the portion of an (HTTP) URI that occurs after the # symbol. This is an important distinction as, for example, this URL http://www.example.com/lexicon/lexicographicResource/entry/cat refers to a document that describes only the entry cat. In contrast, http://www.example.com/lexicon#lexicographic/entry/cat refers to the identified section of the document http://www.example.com/lexicon.
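The distinction drawn in the comment above can be checked with Python's standard `urllib.parse`. This is a small sketch, not part of the PR, using the two URLs quoted in that comment; the helper name `request_target` is hypothetical. It shows which part of each URL an HTTP client would send to the server, given that the fragment is handled on the client side and is not part of the request.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path (plus query, if any) that an HTTP client sends to the
    server; the fragment after "#" stays on the client side."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


path_style = "http://www.example.com/lexicon/lexicographicResource/entry/cat"
hash_style = "http://www.example.com/lexicon#lexicographic/entry/cat"

for url in (path_style, hash_style):
    print(request_target(url), "| fragment:", repr(urlsplit(url).fragment))
```

For the path-style URL the server is asked for the entry itself and the fragment is empty. For the hash-style URL the server is asked only for `/lexicon`, and `lexicographic/entry/cat` is left for the client to resolve inside the returned document. That difference is what the rest of the thread argues about.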
I think this is just a terminological misunderstanding. "Fragment" in fragment identification does not refer to a URI fragment. It's merely a fragment in the sense of part of the data.

First, the term "fragment" is pretty widely understood and I wouldn't redefine it. Secondly, I think you do want fragments in this sense as otherwise it is very challenging to create URIs that resolve, and this would be a big technical issue.

I don't think this matches our semantics, we want to identify the objects directly, not as anchors within the whole lexicographic resource. (But I don't think it's something extremely important, will not fight against fragments if more of you think it's better.)

URIs starting with http are HTTP URLs. The examples you have given are HTTP URLs so the assumption is pretty clear.
HTTP is a foundational standard of the web, I don't understand why you think no-one follows it
I am not sure I agree on the need for an internal addressing mechanism, but if we do introduce a mechanism like this (once we have ironed out the bugs), it should not be a mechanism that looks like HTTP but does not function like HTTP. Creating our own rules that contradict one of the most widely deployed standards is only likely to lead to confusion and challenges in implementation.
Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments. Otherwise we put a lot of technical questions to implementers, such as whose job it is to validate these identifiers and how can this be implemented on widely-used servers (Apache, nginx, etc.).
I don't understand your resistance here. This comment is about changing one character in a URI to make it conformant with widely-used standards.

(My comments about IRIs/URLs are mainly about resolvability: we do not enforce that, so in that sense that might not be a valid/usable URL (and it will frequently be the case it is not usable) --- but that's a minor point.)
The core of my objection, though, is the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.
So: "Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments" is very much an unwanted behaviour. Depending on the context, I want to be able to make different HTTP fragments over the same DMLex fragment. Such as that you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making anything after the lexicographicResource being a URL fragment makes it impossible to anchor anything within a particular entry.

Our standard says that the lexicon (or entry) is a single document in XML and JSON serializations.
I am not really sure I understand... "navigating to a particular [element] within an HTML page" is the use case of fragments. A particular application could easily further extend this fragment scheme if they wish so there is no challenge with adding extra fragments to the "DMLex fragments", we are simply defining one mechanism within a DMLex document.

I don't think the argument is about following or not following HTTP conventions -- we all want to follow that. I think the core disagreement is about this:
That's right -- but I think this behaviour does not play well with DMLex principles -- I always felt like its nature are interlinked objects, with lexicographicResource being just one of them and nothing special. Using HTTP fragments would make it very special, and would (kind of) enforce downloading the whole lexicographic resource whenever asking for a single entry, sense, or even example. I don't like that. (But I am still new to these discussions, and yes, it's just one character, so do say and I will do as you say ;-))

Yes, but that's a completely different thing and it is relevant only for those two particular serializations. The addressing mechanism is not serialization-specific, so this is not relevant.
That's not true -- the RFC for URI (https://datatracker.ietf.org/doc/html/rfc3986) is very clear that the fragment may not contain a hash sign.
The bottom line is: