URI construction for DMLex fragments #111
Is this related to #97? |
Yes -- sorry for not mentioning that before, and thanks for volunteering to review :) We had a discussion about that at the meeting today after you left, and there will be some changes -- so maybe wait with the review until I implement the changes (tomorrow or Monday, I hope). |
notes from today's meeting:
feel free to add if I forgot anything |
I have some doubts about this scheme:
|
Thanks for the notes, let me add my thoughts:
Not sure if I understand correctly: Do you mean e.g. two different It could anyway be stated more explicitly in the description of UNIQUEness.
Yes, that's right -- I've asked about it and we have discussed this at the meeting after you left, and even considered an option of some hashing, but we agreed we prefer readability and transparency to compression.
I am against using
That's right, thanks for spotting -- I will state that explicitly.
OK
I can do that, too, I just didn't want this feature to be over-presented (maybe it's not that important :) ) -- what do others think? |
- URI -> IRI
- more structure in headings
- say it's a recommendation for on-line resources
- specify order of multivalue IDs
I have now implemented the changes we agreed on, please review if you can :) |
In fact, it is possible to have multiple etymologies without description under the same entry, this is the problem. |
<para><literal>lexicographicResource.uri/entry/entryID/sense/senseID/example/exampleID</literal></para>
<section id="objectids">
<title>Object IDs</title>
Object ID is potentially ambiguous for etymology
Another issue is that the fields are not identified, so in some cases the identifier may be ambiguous:
<entry>
<headword>foo</headword>
<sense>
<indicator>x</indicator>
</sense>
<sense>
<definition>x</definition>
</sense>
</entry>
Both resolve to http://www.example.com/lexicographicResource/entry/foo/sense/x |
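The collision can be reproduced in a few lines. This is only a sketch: `sense_id` and `sense_iri` are hypothetical helpers, and it assumes the object ID is built from the bare values of the identifying properties without naming the property, which is the behaviour the comment objects to.

```python
from typing import Optional

# Base URI taken from the example in the comment above.
BASE = "http://www.example.com/lexicographicResource"


def sense_id(indicator: Optional[str], definition: Optional[str]) -> str:
    """Hypothetical object ID: join the non-empty identifying values, unlabelled."""
    return "".join(value for value in (indicator, definition) if value)


def sense_iri(headword: str,
              indicator: Optional[str] = None,
              definition: Optional[str] = None) -> str:
    """Build the identifier of a sense under the entry with the given headword."""
    return f"{BASE}/entry/{headword}/sense/{sense_id(indicator, definition)}"


first = sense_iri("foo", indicator="x")    # sense with <indicator>x</indicator>
second = sense_iri("foo", definition="x")  # sense with <definition>x</definition>
print(first)
print(first == second)  # the two different senses share one identifier
```

Because the property name is not part of the ID, nothing in the resulting string says whether `x` came from the indicator or from the definition.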
<para>Every fragment <glossterm>should</glossterm> be assigned a unique IRI (Internationalized Resource Identifier [<link linkend="bib_rfc3987">RFC 3987</link>]), composed of <literal>lexicographicResource.uri</literal> and a sequence of identifiers that uniquely determines the path in the DMLex tree structure. The IRI of the root object <literal>lexicographicResource</literal> is the value of its attribute <literal>lexicographicResource.uri</literal>, converted to IRI according to the algorithm specified in [<link linkend="bib_rfc3987">RFC 3987</link>], if needed. The IRIs of its direct children are constructed as follows:</para>
<para><literal>lexicographicResource.uri/objectTypeName/objectID</literal></para>
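A minimal sketch of this construction, assuming the path to an object is given as a list of (objectTypeName, objectID) pairs. The function name `fragment_iri` is hypothetical, and percent-encoding the IDs with `urllib.parse.quote` is an assumption made here for illustration; the proposed text does not prescribe an escaping rule.

```python
from urllib.parse import quote


def fragment_iri(resource_uri: str, path: list[tuple[str, str]]) -> str:
    """Append /objectTypeName/objectID for each step from the root to the target."""
    iri = resource_uri.rstrip("/")
    for object_type_name, object_id in path:
        # Assumption: IDs are percent-encoded so that a "/" inside a value
        # cannot be mistaken for a path separator.
        iri += f"/{object_type_name}/{quote(object_id, safe='')}"
    return iri


print(fragment_iri("http://www.example.com/lexicon",
                   [("entry", "cat"), ("sense", "1")]))
```

Applied to an entry `cat` with a sense `1`, this follows the `lexicographicResource.uri/entry/entryID/sense/senseID` pattern shown earlier in the diff.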
Shouldn't this be lexicographicResource.uri#objectTypeName/objectID?
This PR claims to create fragments but does not. Fragments are the portion of an (HTTP) URI that occurs after the # symbol. This is an important distinction: for example, the URL http://www.example.com/lexicon/lexicographicResource/entry/cat refers to a document that describes only the entry cat. In contrast, http://www.example.com/lexicon#lexicographic/entry/cat refers to the identified section of the document http://www.example.com/lexicon.
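The difference between the two forms is visible when the URLs are split into components. The sketch below uses Python's standard `urllib.parse.urlsplit` on the two example URLs from this comment; the variable names are illustrative only.

```python
from urllib.parse import urlsplit

path_style = urlsplit("http://www.example.com/lexicon/lexicographicResource/entry/cat")
hash_style = urlsplit("http://www.example.com/lexicon#lexicographic/entry/cat")

# Path style: the whole identifier is part of the path sent to the server.
print(path_style.path, repr(path_style.fragment))

# Hash style: only /lexicon is the path; the rest is a fragment that a
# client resolves inside the retrieved representation.
print(hash_style.path, repr(hash_style.fragment))
```

In the first case the server sees the full path; in the second, a client would request only `/lexicon` and interpret the fragment itself.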
I think this is just a terminological misunderstanding. "Fragment" in fragment identification does not refer to URI fragment. It's merely a fragment in the sense of part of the data.
First, the term "fragment" is pretty widely understood and I wouldn't redefine it. Secondly, I think you do want fragments in this sense as otherwise it is very challenging to create URIs that resolve, and this would be a big technical issue.
- well, I was placing my content within the already existing section Fragment identification, so I guess the meaning of the word "fragment" was already (re)defined and it has the sense of (partial) DMLex objects. I could rename 3.2.1 Fragment IRIs to Object IRIs to avoid confusion but I don't want to touch 3.2 now as it was introduced before me (unless there is a consensus that it should be renamed)
- I don't think we want URI fragments, and I don't understand how it is challenging; can you explain? My view on URI fragments is that they are anchors within the response to the previous part of the URI -- e.g. you download a web page based on the URI without the fragment and scroll down to the anchor defined by the fragment. Also, the RFC says:
The fragment's format and resolution is therefore dependent on the media type of a potentially retrieved representation, even though such a retrieval is only performed if the URI is dereferenced.
I don't think this matches our semantics, we want to identify the objects directly, not as anchors within the whole lexicographic resource. (But I don't think it's something extremely important, will not fight against fragments if more of you think it's better.)
I do not think we ever discussed that we would want the IRIs to be usable as URLs so this is a far reaching implicit assumption that is false at this moment.
URIs starting with http are HTTP URLs. The examples you have given are HTTP URLs so the assumption is pretty clear.
But even if we agree that we want this, I strongly object to adhering to any such notion of document in the sense of what HTTP or similar protocols might have defined ages ago and no one actually follows.
HTTP is a foundational standard of the web, I don't understand why you think no-one follows it
The IRIs are strictly to be understood as links within DMLex-internal addressing mechanisms, and under no circumstances as addresses giving any expectations as to what they should return, particularly not within the framework of one arbitrary protocol such as HTTP.
I am not sure I agree on the need for an internal addressing mechanism, but if we do introduce a mechanism like this (once we have ironed out the bugs), it should not be a mechanism that looks like HTTP but does not function like HTTP. Creating our own rules that contradict one of the most widely deployed standards is only likely to lead to confusion and challenges in implementation.
Btw, even for HTTP, I think it is completely fine for implementers to return whole documents, or any portions of them -- DMLex is not designed to be a round-trip mechanism, this is simply out of the scope.
Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments. Otherwise we put a lot of technical questions to implementers, such as whose job it is to validate these identifiers and how this can be implemented on widely-used servers (Apache, nginx, etc.).
I don't understand your resistance here. This comment is about changing one character in a URI to make it conformant with widely-used standards.
(My comments about IRIs/URLs are mainly about resolvability: we do not enforce that, so in that sense that might not be a valid/usable URL (and it will frequently be the case it is not usable) --- but that's a minor point.)
The core of my objection is though the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.
So: "Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments" is very much an unwanted behaviour. Depending on the context, I want to be able to make different HTTP fragments over the same DMLex fragment. For example, you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making everything after the lexicographicResource a URL fragment makes it impossible to anchor anything within a particular entry.
The core of my objection is though the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.
Our standard says that the lexicon (or entry) is a single document in XML and JSON serializations.
So: Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments is very much an unwanted behaviour. Depending on the context, I want to be able to make different HTTP fragments over the same DMLex fragment. Such as that you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making anything after the lexicographicResource being a URL fragment makes it impossible to anchor anything within a particular entry.
I am not really sure I understand... "navigating to a particular [element] within an HTML page" is the use case of fragments. A particular application could easily further extend this fragment scheme if they wish so there is no challenge with adding extra fragments to the "DMLex fragments", we are simply defining one mechanism within a DMLex document.
I don't think the argument is about following or not following HTTP conventions -- we all want to follow that. I think the core disagreement is about this:
Returning the whole lexicon document for all URIs is the behaviour we get for free
That's right -- but I think this behaviour does not play well with DMLex principles -- I always felt that its nature is interlinked objects, with lexicographicResource being just one of them and nothing special. Using HTTP fragments would make it very special, and would (kind of) enforce downloading the whole lexicographic resource whenever asking for a single entry, sense, or even example. I don't like that. (But I am still new to these discussions, and yes, it's just one character, so do say and I will do as you say ;-))
The core of my objection is though the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.
Our standard says that the lexicon (or entry) is a single document in XML and JSON serializations.
Yes, but that's a completely different thing and it is relevant only for those two particular serializations. The addressing mechanism is not serialization-specific, so this is not relevant.
So: Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments is very much an unwanted behaviour. Depending on the context, I want to be able to make different HTTP fragments over the same DMLex fragment. Such as that you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making anything after the lexicographicResource being a URL fragment makes it impossible to anchor anything within a particular entry.
I am not really sure I understand... "navigating to a particular [element] within an HTML page" is the use case of fragments. A particular application could easily further extend this fragment scheme if they wish so there is no challenge with adding extra fragments to the "DMLex fragments", we are simply defining one mechanism within a DMLex document.
That's not true -- the RFC for URI (https://datatracker.ietf.org/doc/html/rfc3986) is very clear that the fragment may not contain a hash sign.
The bottom line is:
- the addressing mechanism is not serialization specific, there is no concept of a document
- if (rarely) the URLs would be resolvable, we want to make it possible to anchor to arbitrary response parts, therefore we do not want to use the hash sign anywhere.
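The point about the hash sign follows from the RFC 3986 grammar, where `fragment = *( pchar / "/" / "?" )` leaves no room for a literal `#`. A rough check of that rule might look like the sketch below; the function name `is_valid_fragment` is hypothetical and the regular expression is a hand transcription of the grammar, not an official validator.

```python
import re

# fragment = *( pchar / "/" / "?" )
# pchar    = unreserved / pct-encoded / sub-delims / ":" / "@"
FRAGMENT_RE = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")


def is_valid_fragment(fragment: str) -> bool:
    """Return True if the string matches the RFC 3986 fragment production."""
    return FRAGMENT_RE.fullmatch(fragment) is not None


print(is_valid_fragment("entry/cat"))            # True
print(is_valid_fragment("entry/cat#example/1"))  # False: second '#' is not allowed
```

So if everything after `lexicographicResource.uri` were already a fragment, an application could not append a further `#anchor` of its own without escaping it.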
Comment on empty specifiers should be added before acceptance of this PR |
A couple more potentially ambiguous results.
<entry homographNumber="0">
<headword>test</headword>
</entry>
<entry>
<headword>test</headword>
<definition>0</definition>
</entry>
<pronunciation soundFile="x"/>
<pronunciation>
<transcription>x</transcription>
</pronunciation>
I checked the others :) |
One further comment, not even sure if this a bug, but it is not possible to construct a fragment identifier for |
Yes, that's correct -- the procedure cannot live without the UNIQUE identifiers. I tried to say it by the following sentence:
should I add anything to it? |
- no empty object IDs (replace empty values with "0" and escape "0")
- avoid conflicts "indicator:x, empty definition" vs. "empty indicator, definition:x"
I see this problem with listingOrder, but currently we also change the URI every time a unique element (e.g., definition) changes, and this is tricky to implement in a dynamic web application use case. We could allow
|
I have thought about this over the weekend and I see four key issues with the proposal as it stands
As solutions, I see the following approaches
|
This discussion is getting repetitive, so let me just summarize why most of the objections are either false or largely miss the point of this PR.
First of all, it needs to be emphasized that the specification is very clear about the fact that it describes an addressing mechanism on the model level, and then there are serialization-specific addressing mechanisms which anyone is free to use (this would be e.g. XPath/XQuery for XML). This answers Objection number 1, because if we are talking about static hosting of data files, those files are necessarily serialized in some format, and then a serialization-specific addressing mechanism should be used.
Objection number 2 says "PR requires data producers to adopt a particular IRI scheme", which is not true (it is optional), and generally completely ignores the primary motivation behind a model-level addressing mechanism, i.e. being able to address without the restrictions of any particular serialization method. This objection, for reasons not explained, instead keeps talking about a request-response processing mechanism, which again is not the primary motivation behind the addressing and can easily be done using any serialization-specific addressing mechanism. Again, the primary motivation of the model-level addressing is to point to a particular DMLex object in a serialization-independent way, not to define a request-response round-trip.
The issues described in Objection number 3 were also discussed multiple times and they are not very relevant to this PR. All this is intentional and in line with best lexicographic as well as data maintenance practices to prevent unintentional data degradation. The principles of DMLex are to remove processing complexity where it is not necessary, not where we would arbitrarily wish to do so. The fact that many tools currently do not exercise these integrity checks suggests that it is all the more important to promote it in the standard.
Objection number 4 is true, but it is important to realize that the links are not meant to be human-processed, or human-presented in the full form. They would be machine-processed and visualized in implementation-specific ways that will suit the user/device/situation context. So yes, the links could sometimes be long and ugly, but also in many cases rather short and easy to interpret.
To sum up, I find all the objections completely invalid and do not understand the motivation behind bringing them again and again without any reasonable justification. |
You are asking why this is important, so I will try to reiterate this:
|
In the beginning of all this we wanted recommended addresses for all DMLex objects, based on the data (namely the values of the UNIQUE properties), not on arbitrary IDs or a particular serialization. It was all about (and only about) suggesting unique identifiers, not prescribing how they should behave if used in HTTP requests or in any other particular scenario. I get it now that @jmccrae does not like this very principle (to put it mildly); on the other hand, we agreed we would do it in a meeting with all of us present, so I took it as agreed.
It was crystal clear from the very beginning that it is not possible to devise a method of addressing that will guarantee that all the possible use cases will work out of the box. I am pretty sure that we cannot even predict any substantial part of the possible use case scenarios; we can just bring some arbitrary examples. But now we are (John is) bringing one arbitrary use case after another and arguing it does not work out of the box for them. Well, it doesn't. It is not possible to satisfy everyone. (And I don't like trying to satisfy all the use cases we can think of, especially by complicating the DMLex model itself, like we did at the last meeting with the new property deciding if '/' or '#' is used. None of the use cases, nor the whole addressing itself, is so important that it would be worth making the model more complex.)
So, instead of fiddling with arbitrary use cases, I think we should answer the main question: "Do we want a model-level mechanism as described in the first paragraph, even though it does not satisfy all the use cases perfectly?" Do we? I think the model-level addressing brings a choice: either use this, even if it requires some extra effort with particular formats/setups, or use a serialization-specific addressing and/or your own IDs if that is more convenient.
The advantage of the former option would be universality (independence of a particular resource, its serialization format and arbitrary IDs -- if you are e.g. a dictionary aggregator, this could make you happy) and readability (even if the address leads nowhere, a human is able to decode/fix it, unlike an address with arbitrary IDs). Of course, we can as well decide to drop all this (John's option 3, and also the current status), which leaves only the latter option. @michmech @DavidFatDavidF please comment |
I think that this is getting a bit out of hand for what is a small part of this overall great project. When summarising the issues discussed in this long thread I have been accused of "bringing them again and again without any reasonable justification" and by defining three use cases I am accused of "bringing one arbitrary use case after another". Can we chill it please? As I have made clear, I am open to compromise (Option 2) although, as is clear, my personal opinion is that user-defined identifiers (Option 3) would be superior to content-based ones. These concerns are based on blocking technical issues that have become clear to me from implementing this system and I have outlined them clearly above. To implement the compromise option (Option 2) I would propose the following text:
<para>Every top-level model object may be assigned one or more identifiers
that uniquely determine the path in the DMLex tree structure. These can be used to construct IRIs by
appending them to the IRI of the root object. The IRI of the root element is the value of its attribute <literal>lexicographicResource.uri</literal>, converted to IRI according to the algorithm specified in
[<link linkend="bib_rfc3987">RFC 3987</link>]. IRIs can be constructed using schemes such as
the following:</para>
<para><literal>lexicographicResource.uri/objectTypeName/objectID</literal></para>
<para><literal>lexicographicResource.uri#objectTypeName/objectID</literal></para>
<para>Other schemes may be adopted by applications. This standard does not mandate the adoption of any
IRI scheme or describe what kind of resources are located by IRIs constructed in this way.</para>
etc... Then all examples are changed so that they do not include the HTTP URI (e.g., This satisfies Problem 1, as it is much more vague and does not mandate a URI schema so more use cases can be satisfied. Problem 2 is mostly side-stepped as this proposal now doesn't require anything of producers or consumers of data. I also think it is closer to what @mjakubicek has in mind, as he doesn't want a "request-response" mechanism based on serialization, while an HTTP URI requires that you can make an HTTP request and receive a serialized response. I would also reiterate the proposal to also allow object IDs by
The adoption of listing order as an alternative mechanism would solve Problem 4, and Problem 3 would be reduced as implementers can choose the option that is more stable for their application. I am happy to turn this into a PR if others are happy with this. |
This is utter nonsense, the fragment ID is just a string. That's it John, a string. You do whatever you like with it.
You see John, this is the problem. You're forcing in your world here, that we are not necessarily interested in. Making an IRI does not bring in RDF, nor does it bring in content negotiation. You have to live with the fact that others do not see things that way. An IRI is just a string. Nothing else. To quote from https://www.ietf.org/rfc/rfc3987.txt: "An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646)" The standard also makes it absolutely clear that IRIs are not bound to a protocol in this regard, in multiple places, e.g. "Applications using IRIs as identity tokens with no relationship to a protocol MUST use the Simple String Comparison" This is exactly our case: it's a string, it compares as a string, and it serves as identification of some DMLex entry part for us. We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.
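As an illustration of what treating identifiers as plain strings means in practice, the sketch below compares two identifiers character by character with no scheme or case normalization. The helper `same_identifier` and the example strings are hypothetical.

```python
def same_identifier(a: str, b: str) -> bool:
    """Character-by-character comparison with no normalization applied."""
    return a == b


lower = "http://www.example.com/lexicon/entry/cat"
upper = "HTTP://www.example.com/lexicon/entry/cat"

print(same_identifier(lower, lower))  # True
print(same_identifier(lower, upper))  # False: differs in the case of the scheme
```

Under this reading the two strings are distinct identifiers, even though an HTTP client would treat them as the same location.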
Yes, all those are valid integrity checks that need to be performed, thank you for that. We all know we need to do more of them, to find out all the forgotten small bugs in the spec here and there. None of that presents any substantial challenge. In any case, this discussion leads nowhere. I find all the issues raised by John as void and none of the proposals by John are acceptable for me, particularly not the variant number 3, which is absolutely disastrous as discussed many times. For the next meeting, I propose voting on this PR as is; and if it is not approved, we simply remove fragment identification from the specs completely and move on. |
Lastly: it does NOT. "an HTTP URI requires that you can make an HTTP request". There is no "HTTP URI". Just "URI", and a URI (or IRI, in our case), unlike a URL, does not mandate that you be able to locate the resource. The name of the protocol does not affect this. But if all that bugs you is the http:// scheme, we may just use urn: instead. It would perhaps fit better even from the theoretical perspective, though that's going to be a very subjective issue. |
@mjakubicek, you continue to make highly uncivil comments on a public forum.
I think this is exactly what I just proposed, right?
HTTP URI is an established term. It is pretty clear it means URIs that use the
I would support this, however I note that it requires a registration process with IANA as described in RFC 8141 |
So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?
Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts John. Facts that you present here that are simply not true, and you continue doing so despite being falsified multiple times.
Only if we would want to make our own namespace which we do not need to, there are other options (e.g. the tag namespace, maybe others too.) which require no central registration. |
I guess so, but I would prefer that they did not start with
"The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location")" [RFC 3986] My facts are pretty clear. |
Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice.
Facts are clear in that you now for the first time talk about a URL (i.e. a Uniform Resource Locator, not URI which is Uniform Resource Identifier), which was never discussed and never considered and never mentioned before. What you were saying before was that "an HTTP URI requires that you can make an HTTP request" -- and this is simply not true, and thus all your seemingly necessary implications you were making thereof are not true as well. |
You have exactly arrived at the solution I proposed this morning. Why would I object? Of course, it needs to be implemented and #123 needs a resolution before this PR can be merged. I also would like us to consider the use of
We have already discussed URLs in fact:
That URLs designate such resources means that you only refer to resources that meet these requirements. Being accessible by HTTP means you can access them by making an HTTP request. Hence "an HTTP URL requires that you can make an HTTP request". |
Because this is not what your initial proposal was (this morning), as everyone can read up in the thread. I do not want the "#" to be part of "DMLex fragment identification strings", which is what your proposal starts with, and then continues on with other things, among others also mentioning this rename. And that's why I'm double checking that we understand that the only change performed would be a wording issue solvable by a simple sed (i.e. find and replace command): sed 's/IRI/DMLex fragment identification strings/g' That's it.
Ok, you got me, we have already ruled them out once ;-) |
In principle that's right, although a quick look at the text shows that a little more care than a text replacement is needed! The other part is removing the
All seem good and avoid creating identifiers that are accidentally non-functioning URLs. |
I don't feel like adding more disagreement to this discussion, and nobody else wrote anything, so I did what you propose (i.e., renamed IRIs to "DMLex fragment identification strings" and removed the Just FTR: Though acceptable, I don't agree with it -- I think one of the reasons why we said first URIs and then IRIs is that they can be used as HTTP(S) URLs which is an advantage, and we are now losing this option (kind of, as adding I have also addressed the problem with #123, using |
Okay, sounds like a good fix. My objection is, I think, not that hard to understand: HTTP URLs that lead nowhere are called broken links and cause many problems, not just for the user experience but also for the SEO of websites. Implementing only working HTTP URLs ensures the global uniqueness of these identifiers and prevents malicious attacks. |