URI construction for DMLex fragments #111

Merged
merged 9 commits into from
May 20, 2024
65 changes: 61 additions & 4 deletions dmlex-v1.0/specification/core/specification.xml
@@ -41,16 +41,16 @@
<para><literal><olink targetptr="core_example">example</olink></literal></para>
</listitem>
</itemizedlist>
<simplesect id="optionalroots">
<section id="optionalroots">
<title>Optional roots</title>
<para>
When exchanging data encoded in a DMLex serialization
which has the concept of a "root" or top-level object, such as XML, JSON or NVH,
the object types <literal>lexicographicResource</literal> and <literal>entry</literal>
can serve as such roots.
</para>
</simplesect>
<simplesect id="fragid">
</section>
<section id="fragid">
<title>Fragment identification</title>
<para>
Incomplete parts of DMLex objects represent valid fragments as long as it is possible to identify their complete source DMLex object.
@@ -64,7 +64,64 @@
</listitem>
</itemizedlist>
</para>
</simplesect>
<section id="frag_iri">
<title>DMLex fragment identification strings</title>
<para>DMLex provides a recommended method for addressing DMLex objects present on-line, useful for linking (cf. <xref linkend="linking"/>) and general interoperability. Implementing this method is not <glossterm>required</glossterm> for conformance.</para>

<para>Every fragment <glossterm>should</glossterm> be assigned a unique fragment identification string, composed of <literal>lexicographicResource.uri</literal>, with protocol identification prefix (such as <literal>http://</literal> or <literal>https://</literal>) removed, and a sequence of identifiers that uniquely determines the path in the DMLex tree structure. The DMLex fragment identification string of the root object <literal>lexicographicResource</literal> is the value of its attribute <literal>lexicographicResource.uri</literal>, with protocol identification prefix (such as <literal>http://</literal> or <literal>https://</literal>) removed. The fragment identification strings of its direct children are constructed as follows:</para>

<para><literal>lexicographicResource.uri/objectTypeName/objectID</literal></para>
Contributor

Shouldn't this be lexicographicResource.uri#objectTypeName/objectID?

Contributor

This PR claims to create fragments but does not. Fragments are the portion of an (HTTP) URI that occurs after the # symbol. This is an important distinction: for example, the URL http://www.example.com/lexicon/lexicographicResource/entry/cat refers to a document that describes only the entry cat. In contrast, http://www.example.com/lexicon#lexicographic/entry/cat refers to the identified section of the document http://www.example.com/lexicon.
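
The difference between the two URL shapes can be checked with Python's standard `urllib.parse`. This is only an illustrative sketch: the two URLs are the ones quoted in the comment, and `request_target` is a hypothetical helper (not part of any library) that approximates what an HTTP client would send to the server.

```python
from urllib.parse import urlsplit

# The two URLs quoted in the comment above.
PATH_STYLE = "http://www.example.com/lexicon/lexicographicResource/entry/cat"
FRAGMENT_STYLE = "http://www.example.com/lexicon#lexicographic/entry/cat"


def request_target(url: str) -> str:
    """Hypothetical helper: the path (plus query) a client requests.

    The fragment is resolved on the client side and is not sent to the server.
    """
    parts = urlsplit(url)
    return parts.path + (f"?{parts.query}" if parts.query else "")


for url in (PATH_STYLE, FRAGMENT_STYLE):
    parts = urlsplit(url)
    print(request_target(url), "| fragment:", repr(parts.fragment))
```

With the path style, the server sees the whole object path; with the fragment style, it only sees `/lexicon`, and locating the object within the response is left to the client.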

Contributor

I think this is just a terminological misunderstanding. "Fragment" in fragment identification does not refer to URI fragment. It's merely a fragment in the sense of part of the data.

Contributor

First, the term "fragment" is pretty widely understood and I wouldn't redefine it. Secondly, I think you do want fragments in this sense as otherwise it is very challenging to create URIs that resolve, and this would be a big technical issue.

Contributor Author

  • Well, I was placing my content within the already existing section Fragment identification, so I guess the meaning of the word "fragment" was already (re)defined there, in the sense of (partial) DMLex objects. I could rename 3.2.1 Fragment IRIs to Object IRIs to avoid confusion, but I don't want to touch 3.2 now as it was introduced before me (unless there is a consensus that it should be renamed).
  • I don't think we want URI fragments, and I don't understand how it is challenging; can you explain? My view on URI fragments is that they are anchors within the response to the preceding part of the URI -- e.g. you download a web page based on the URI without the fragment and scroll down to the anchor defined by the fragment. Also, the RFC says:

> The fragment's format and resolution is therefore dependent on the media type of a potentially retrieved representation, even though such a retrieval is only performed if the URI is dereferenced.

I don't think this matches our semantics; we want to identify the objects directly, not as anchors within the whole lexicographic resource. (But I don't think it's extremely important, and I will not fight against fragments if more of you think it's better.)

Contributor

> I do not think we ever discussed that we would want the IRIs to be usable as URLs so this is a far reaching implicit assumption that is false at this moment.

URIs starting with http are HTTP URLs. The examples you have given are HTTP URLs, so the assumption is pretty clear.

> But even if we agree that we want this, I strongly object against adhering to any such notion of document in the sense of what HTTP or similar protocols might have defined ages ago and no one actually follows.

HTTP is a foundational standard of the web; I don't understand why you think no one follows it.

> The IRIs are strictly to be understood as links within DMLex-internal addressing mechanisms, and under no circumstances as addresses giving any expectations as to what they should return, particularly not within the framework of one arbitrary protocol such as HTTP.

I am not sure I agree on the need for an internal addressing mechanism, but if we do introduce a mechanism like this (once we have ironed out the bugs), it should not be a mechanism that looks like HTTP but does not function like HTTP. Creating our own rules that contradict one of the most widely deployed standards is only likely to lead to confusion and challenges in implementation.

> Btw, even for HTTP, I think it is completely fine for implementers to return whole documents, or any portions of them -- DMLex is not designed to be a round-trip mechanism, this is simply out of the scope.

Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments. Otherwise we put a lot of technical questions to implementers, such as whose job it is to validate these identifiers and how this can be implemented on widely-used servers (Apache, nginx, etc.).

I don't understand your resistance here. This comment is about changing one character in a URI to make it conformant with widely-used standards.

Contributor

(My comments about IRIs/URLs are mainly about resolvability: we do not enforce that, so in that sense the result might not be a valid/usable URL (and it will frequently be the case that it is not usable) --- but that's a minor point.)

The core of my objection, though, is the fact that the whole dictionary is not to be seen as one HTTP document -- it is completely up to the implementation what a document is in the context of HTTP.

So: "Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments" is very much an unwanted behaviour. Depending on the context, I want to be able to make different HTTP fragments over the same DMLex fragment. For example, you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making everything after the lexicographicResource a URL fragment makes it impossible to anchor anything within a particular entry.

Contributor

> The core of my objection is though the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.

Our standard says that the lexicon (or entry) is a single document in XML and JSON serializations.

> So: Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments is very much an unwanted behaviour. Depending on the context, I want to be able make different HTTP fragments over the same DMLex fragment. Such as that you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making anything after the lexicographicResource being a URL fragment makes it impossible to anchor anything within a particular entry.

I am not really sure I understand... "navigating to a particular [element] within an HTML page" is the use case of fragments. A particular application could easily further extend this fragment scheme if they wish so there is no challenge with adding extra fragments to the "DMLex fragments", we are simply defining one mechanism within a DMLex document.

Contributor Author

I don't think the argument is about following or not following HTTP conventions -- we all want to follow them. I think the core disagreement is about this:

> Returning the whole lexicon document for all URIs is the behaviour we get for free

That's right -- but I think this behaviour does not play well with DMLex principles. I have always felt that DMLex is by nature a set of interlinked objects, with lexicographicResource being just one of them and nothing special. Using HTTP fragments would make it very special, and would (kind of) enforce downloading the whole lexicographic resource whenever asking for a single entry, sense, or even example. I don't like that. (But I am still new to these discussions, and yes, it's just one character, so do say and I will do as you say ;-))

Contributor

> > The core of my objection is though the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.
>
> Our standard says that the lexicon (or entry) is a single document in XML and JSON serializations.

Yes, but that's a completely different thing and it is relevant only for those two particular serializations. The addressing mechanism is not serialization-specific, so this is not relevant.

> > So: Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments is very much an unwanted behaviour. Depending on the context, I want to be able make different HTTP fragments over the same DMLex fragment. Such as that you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making anything after the lexicographicResource being a URL fragment makes it impossible to anchor anything within a particular entry.
>
> I am not really sure I understand... "navigating to a particular [element] within an HTML page" is the use case of fragments. A particular application could easily further extend this fragment scheme if they wish so there is no challenge with adding extra fragments to the "DMLex fragments", we are simply defining one mechanism within a DMLex document.

That's not true -- the RFC for URI (https://datatracker.ietf.org/doc/html/rfc3986) is very clear that the fragment may not contain a hash sign.

The bottom line is:

  • the addressing mechanism is not serialization-specific; there is no concept of a document
  • if (rarely) the URLs are resolvable, we want to make it possible to anchor to arbitrary parts of the response, therefore we do not want to use the hash sign anywhere.
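
The point about the hash sign can be illustrated with a small check of the `fragment` production from RFC 3986 (`fragment = *( pchar / "/" / "?" )`). This is a minimal sketch rather than a full URI validator; `is_valid_fragment` is a hypothetical helper name, and the sample strings are made up for illustration.

```python
import re

# RFC 3986: fragment = *( pchar / "/" / "?" )
#           pchar    = unreserved / pct-encoded / sub-delims / ":" / "@"
_FRAGMENT_RE = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")


def is_valid_fragment(fragment: str) -> bool:
    """Return True if the string matches the RFC 3986 fragment grammar."""
    return _FRAGMENT_RE.fullmatch(fragment) is not None


# "/" and "~" are allowed inside a fragment, a second "#" is not.
print(is_valid_fragment("entry/cat~1~noun"))          # a DMLex-style path
print(is_valid_fragment("entry/cat~1~noun#example"))  # contains "#"
```

Under this grammar, a DMLex object path placed after `#` leaves no room for an additional application-level anchor, which is the concern raised in the comment.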


<para>(We define below how object IDs are created.)</para>

<para>The DMLex fragment identification strings of descendant objects are constructed by appending the children's type names and IDs to the fragment identification strings of their direct parents, using “/” as the delimiter. In other words, the full template for a fragment identification string looks as follows:</para>

<para><literal>lexicographicResource.uri/objectTypeName/objectID/child1TypeName/child1ID/child2TypeName/child2ID/…</literal></para>

<para>For example, a particular <literal><olink targetptr="core_sense">sense</olink></literal> (which is a property of <literal><olink targetptr="core_entry">entry</olink></literal>) is assigned the following fragment identification string:</para>

<para><literal>lexicographicResource.uri/entry/entryID/sense/senseID</literal></para>

<para>A fragment identification string of an <literal><olink targetptr="core_example">example</olink></literal> (which is a property of <literal><olink targetptr="core_sense">sense</olink></literal>, which is a property of <literal><olink targetptr="core_entry">entry</olink></literal>) has the following structure:</para>

<para><literal>lexicographicResource.uri/entry/entryID/sense/senseID/example/exampleID</literal></para>

<section id="objectids">
<title>Object IDs</title>
Contributor

Object ID is potentially ambiguous for etymology


<para>For the purpose of creating DMLex fragment identification strings, each object is assigned a unique ID relative to its parent, based on values of its properties declared as <glossterm>unique</glossterm>. Multiple situations can occur:</para>

<orderedlist>
<listitem>The object type has a single <glossterm>unique</glossterm> property with an arity of “exactly one”, and the value of the property is a string or a number. In this case, the object ID is the string or the number, with the following modifications performed in that particular order:
<itemizedlist>
<listitem>every “\” (ASCII character 5C) is replaced by “\\”</listitem>
<listitem>every “~” (ASCII character 7E) is replaced by “\~”</listitem>
<listitem>every “_” (ASCII character 5F) is replaced by “\_”</listitem>
<listitem>every “0” (zero, ASCII character 30) is replaced by “\0”</listitem>
<listitem>all IRI-unsafe characters (outside the <literal>iunreserved</literal> class according to [<link linkend="bib_rfc3987">RFC 3987</link>]) are percent-encoded according to [<link linkend="bib_rfc3986">RFC 3986</link>]</listitem>
</itemizedlist>
</listitem>
<listitem>The object type has a single <glossterm>unique</glossterm> property with an arity of “exactly one”, and the value of the property is a child DMLex object. In this case, the object ID is the same as the object ID of the child object. (Note: this case actually does not occur in the specification as such; we list it here to streamline the description of the following cases.)</listitem>
<listitem>The object type has a single <glossterm>unique</glossterm> property with an arbitrary arity. In this case, all the partial single values or child object IDs are constructed according to the steps 1. and 2., and the resulting object ID is their concatenation using “_” (ASCII character 5F) as a separator. The order of the partial values is driven by the <literal>listingOrder</literal> of the respective objects. If this procedure returns an empty string (which can happen in case of <glossterm>unique</glossterm> attributes that allow the arity of zero), the string “0” (zero, ASCII character 30) is used instead of the empty string.</listitem>
<listitem>The object type has multiple <glossterm>unique</glossterm> properties. In this case, all the partial values or child object IDs are constructed according to the steps 1., 2. and 3., and the resulting object ID is their concatenation using “~” (ASCII character 7E) as a separator. The order of the partial values is driven by the order of the properties as given in this specification. (Note: all attributes marked as <glossterm>unique</glossterm> need to be represented in the ID, as empty values are replaced by “0” according to step 3. No empty IDs are allowed.)</listitem>
<listitem>In specific situations it may happen there are multiple different objects with all the <glossterm>unique</glossterm> properties empty, i.e. multiple objects with duplicate IDs (the same sequence of zeros) emerge as the result of the step 4. One example of such a situation is multiple senses without <literal>indicator</literal>s or <literal>definition</literal>s, but with different translations. In that case, and only in that case, the value of <literal>listingOrder</literal> is concatenated to the sequence of zeros, to distinguish between the duplicate IDs. If there is only one such object, <literal>listingOrder</literal> is not concatenated to the sequence of zeros.</listitem>
</orderedlist>

<para>DMLex does not define the structure of DMLex fragment identification strings for object types without <glossterm>unique</glossterm> properties.</para>
</section>
<section id="iri_examples">
<title>DMLex fragment identification string examples</title>
<para>Particular examples of DMLex fragment identification strings can then look as follows:</para>
<itemizedlist>
<listitem><literal>www.example.com/lexicon/entry/cat~1~noun</literal></listitem>
<listitem><literal>www.example.com/lexicon/entry/cat~1~noun/sense/0~small%20furry%20animal</literal> (Here we assume that the sense's <literal>indicator</literal> is empty and it has one <literal>definition</literal> which says “small furry animal”).</listitem>
<listitem><literal>www.example.com/lexicon/entry/cat~1~noun/sense/0~small%20furry%20animal/example/I%20have%20two%20dogs%20and%20a%20cat.</literal></listitem>
<listitem><literal>www.example.com/lexicon/entry/cat~1~noun/sense/0~0</literal> (Here we assume that both the sense's <literal>definition</literal> and its <literal>indicator</literal> are empty, and there is only one such sense.)</listitem>
<listitem><literal>www.example.com/lexicon/entry/cat~1~noun/sense/0~02</literal> (Here we assume that both the sense's <literal>definition</literal> and its <literal>indicator</literal> are empty, there are multiple such senses, and this is the sense number 2, of all this entry's senses.)</listitem>
</itemizedlist>
</section>
</section>
</section>
<xi:include href="objectTypes/lexicographicResource.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="objectTypes/entry.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="objectTypes/partOfSpeech.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
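
The object ID rules added in this hunk (steps 1, 3 and 4) can be sketched in Python as follows. This is an illustrative reading of the proposed text, not a reference implementation: all function names are hypothetical, step 2 (child objects) and step 5 (appending `listingOrder` to duplicate all-zero IDs) are not covered, and the escaping backslash is percent-encoded as `%5C` because a literal reading of step 1 places it outside `iunreserved`. The unique properties assumed for `entry` and `sense` are inferred from the examples in the diff.

```python
from urllib.parse import quote

_ASCII_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _is_ucschar(cp: int) -> bool:
    """Code point ranges of the RFC 3987 `ucschar` production."""
    if 0xA0 <= cp <= 0xD7FF or 0xF900 <= cp <= 0xFDCF or 0xFDF0 <= cp <= 0xFFEF:
        return True
    plane, offset = divmod(cp, 0x10000)
    if 1 <= plane <= 13:
        return offset <= 0xFFFD
    return plane == 14 and 0x1000 <= offset <= 0xFFFD


def escape_value(value) -> str:
    """Step 1: backslash-escape, then percent-encode non-iunreserved characters."""
    text = str(value)
    for special in ("\\", "~", "_", "0"):  # order matters: backslash first
        text = text.replace(special, "\\" + special)
    return "".join(
        ch if ch in _ASCII_UNRESERVED or _is_ucschar(ord(ch)) else quote(ch, safe="")
        for ch in text
    )


def property_part(values) -> str:
    """Step 3: join one property's values with "_"; an empty result becomes "0"."""
    return "_".join(escape_value(v) for v in values) or "0"


def object_id(unique_properties) -> str:
    """Step 4: join the per-property parts with "~", in specification order."""
    return "~".join(property_part(values) for values in unique_properties)


def fragment_string(resource_uri: str, path) -> str:
    """Strip the protocol prefix and append typeName/objectID pairs."""
    segments = [resource_uri.split("://", 1)[-1]]
    for type_name, oid in path:
        segments += [type_name, oid]
    return "/".join(segments)


entry_id = object_id([["cat"], [1], ["noun"]])      # headword, homograph number, part of speech
sense_id = object_id([[], ["small furry animal"]])  # empty indicator, one definition
print(fragment_string("http://www.example.com/lexicon",
                      [("entry", entry_id), ("sense", sense_id)]))
# www.example.com/lexicon/entry/cat~1~noun/sense/0~small%20furry%20animal
```

Escaping a literal `0` is what keeps a real value of zero distinguishable from the `0` placeholder that stands for an empty property.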
18 changes: 18 additions & 0 deletions dmlex-v1.0/specification/dmlex.xml
@@ -61,6 +61,15 @@
</affiliation>
<email>[email protected]</email>
</editor>
<editor>
<firstname>Vojtěch</firstname>
<surname>Kovář</surname>
<affiliation>
<orgname><ulink url="https://www.muni.cz/">Masaryk University</ulink></orgname>
<address format="linespecific"><email>[email protected]</email></address>
</affiliation>
<email>[email protected]</email>
</editor>
<editor>
<firstname>Simon</firstname>
<surname>Krek</surname>
@@ -620,6 +629,15 @@
<title/>
<bibliomixed id="bcp14">
<abbrev>BCP 14</abbrev> is a concatenation of [RFC 2119] and [RFC 8174] </bibliomixed>
<bibliomixed id="bib_rfc3986">
<abbrev>RFC 3986</abbrev>
Tim Berners-Lee, Roy T. Fielding, Larry M. Masinter,
<title>Uniform Resource Identifier (URI): Generic Syntax</title>,
<citetitle>
<ulink url="https://datatracker.ietf.org/doc/rfc3986/">https://datatracker.ietf.org/doc/rfc3986/</ulink>
</citetitle>
IETF (Internet Engineering Task Force) RFC 3986, January 2005.
</bibliomixed>
<bibliomixed id="bib_rfc3987">
<abbrev>RFC 3987</abbrev>
Martin J. Dürst, Michel Suignard,
@@ -24,7 +24,8 @@
<para><literal>ref</literal>
<glossterm>required</glossterm> (exactly one) and <glossterm>unique</glossterm> (in
combination with other unique properties if present). Reference to an object, such as an
entry or a sense.</para>
entry or a sense. The IRI addressing mechanism described in <xref linkend="frag_iri"/>
can be used (but is not <glossterm>required</glossterm>).</para>
</listitem>
<listitem>
<para><literal>role</literal>
5 changes: 5 additions & 0 deletions dmlex-v1.0/specification/modules/linking/specification.xml
@@ -21,6 +21,11 @@
of the <code>member</code> datatype.</para>
<para>The Linking Module can be used to set up relations between objects inside the same
lexicographic resource, or between objects residing in different lexicographic resources.</para>
<para>Linking requires some form of reference ID for each linked object (cf. the
<literal>ref</literal> property in <xref linkend="linking_member"/>). DMLex does not prescribe
the exact form of these IDs; however, a recommended method for creating unique IRIs for
DMLex objects is available in <xref linkend="frag_iri"/>, which may be especially useful
when linking objects from different lexicographic resources on the Web.</para>
<para>Examples: <xref linkend="ex12"/>, <xref linkend="ex13"/>, <xref linkend="ex14"/>, <xref
linkend="ex15"/>, <xref linkend="ex16"/>, <xref linkend="ex17"/>, <xref linkend="ex18"/>. </para>
<xi:include href="extensions/lexicographicResource.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>