How to get more information from parsers? #45
Comments
I don't know if such information is that valuable. For me the task of a parser is to read and interpret the contents of an RDF document. Therefore, it should return only absolute IRIs, resolving prefixes and base IRI. Accordingly, the information someone could get from a parser should already be included in the triples returned. As a special case, retrieving the base IRI from a Turtle document (or other Notation3 derived format) can be ... dangerous. As stated in RDF 1.1 chapter 6.3
So there can be an unlimited number of
# This is valid Turtle
@base <http://example.com/fun/stuff/> .
<Peter> a <../Human> .
@base <../> .
<Human> <#hasEntity> <stuff/Peter> .
So what base IRI should be returned? Of course, I see that there are cases where it can be beneficial to yield such information and not convert IRIs immediately, but I think those are heavily optimized cases, and in such cases parsers should return an own type of |
Absolutely, I'm not suggesting otherwise. Consider the following use case, though. I get a Turtle file, parse it as a graph, then apply some changes to the graph. Then I want to serialize it back to Turtle, into something as similar as possible to the original file. For this, I need to "remember" which prefixes were declared in the original file (and possibly its Note that in other APIs (such as RDFLib in Python and Jena in Java), every graph has an associated prefix mapping, precisely to address the use case above. But I never liked this design, because for me the graph should be only the abstract syntax. Granted, some Turtle files might override their |
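To make the round-trip use case concrete, here is a minimal sketch of how remembered prefixes could be used when serializing back. The `abbreviate` and `is_simple_local` helpers are hypothetical (they are not part of Sophia or Rio), and the check on local names is deliberately more conservative than what Turtle actually allows.

```rust
use std::collections::BTreeMap;

/// Conservative check: only ASCII alphanumerics and `_` are accepted,
/// which is a strict subset of what Turtle permits in a local name.
fn is_simple_local(local: &str) -> bool {
    local.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Hypothetical helper: abbreviate an absolute IRI with remembered prefixes.
/// Returns `label:local` when a namespace matches, `<iri>` otherwise.
fn abbreviate(iri: &str, prefixes: &BTreeMap<String, String>) -> String {
    // Prefer the longest matching namespace, so nested namespaces
    // are abbreviated as tightly as possible.
    let best = prefixes
        .iter()
        .filter(|(_, ns)| iri.starts_with(ns.as_str()))
        .filter(|(_, ns)| is_simple_local(&iri[ns.len()..]))
        .max_by_key(|(_, ns)| ns.len());
    match best {
        Some((label, ns)) => format!("{}:{}", label, &iri[ns.len()..]),
        None => format!("<{}>", iri),
    }
}
```

With a map binding `foaf` to `http://xmlns.com/foaf/0.1/`, the IRI `http://xmlns.com/foaf/0.1/name` would be written as `foaf:name`, while IRIs outside any known namespace fall back to the bracketed form.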
And for the record, the more I think about it, the more I'm leaning towards the callback solution. Among other things, it has the advantage of allowing us to consume the TripleSource or QuadSource, as we would do with a standard iterator. |
Concerning #23 I get the feeling that this topic is too much detail for a common API. You said yourself in another issue that you like to keep BTW |
I beg to differ: except for N-Triples and N-Quads, all major concrete syntaxes have a notion of base and prefix binding (Turtle, TriG, N3, RDF/XML, JSON-LD, RDFa). |
Hi @pchampin , is there any progress on this? Could we just add a
Sophia is a toolkit to work with RDF and Linked Data, not a canonical representation of RDF. When building a toolkit, it is important to consider developer experience. From my point of view, in an RDF graph there are only 2 things that matter to developers: the list of triples/quads and the map of prefixes. Prefixes are important:
But right now Sophia only lets us work with 1 of those 2 essential components of RDF. Ideally this prefix hashmap would be populated automatically when parsing. The What do you think? I can help with implementing this if needed |
Syntactically the Turtle format allows redefining/shadowing existing prefixes (and bases!) at any point within the file, making that line a pivot point for how prefixed names and relative IRIs are resolved. Two textually identical So a HashMap of prefix-label-to-value already isn't enough to accurately capture the full spectrum of what the file format would permit while being syntactically valid. This flexibility in the format makes armchair design of the proposed API and data types needed to fulfill the ask pretty non-trivial. |
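The shadowing concern can be illustrated with a small sketch. The types and functions below (`PrefixDecl`, `last_wins`, `shadowed_labels`) are hypothetical and only meant to show what an ordered list of declarations preserves that a plain label-to-namespace map does not; they are not an existing Sophia or Rio API.

```rust
use std::collections::HashMap;

/// Hypothetical record of one `@prefix` declaration, kept in document order.
#[derive(Debug, Clone, PartialEq)]
struct PrefixDecl {
    label: String,
    namespace: String,
}

/// Collapse the ordered declarations into a last-wins map.
/// Earlier bindings of a redefined label are lost in the process.
fn last_wins(decls: &[PrefixDecl]) -> HashMap<&str, &str> {
    let mut map = HashMap::new();
    for d in decls {
        map.insert(d.label.as_str(), d.namespace.as_str());
    }
    map
}

/// Labels that were rebound to a different namespace somewhere in the document.
fn shadowed_labels(decls: &[PrefixDecl]) -> Vec<&str> {
    let mut seen: HashMap<&str, &str> = HashMap::new();
    let mut shadowed = Vec::new();
    for d in decls {
        // `insert` returns the previous binding for this label, if any.
        if let Some(prev) = seen.insert(d.label.as_str(), d.namespace.as_str()) {
            if prev != d.namespace.as_str() && !shadowed.contains(&d.label.as_str()) {
                shadowed.push(d.label.as_str());
            }
        }
    }
    shadowed
}
```

A consumer that only needs prefixes for re-serialization could use `last_wins`, while one that wants to detect lossy cases could check whether `shadowed_labels` is empty first.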
Hi @shanesveller, thanks for the feedback! The URIs stored in the graph are already resolved, so it does not matter if some prefixes are redefined midway. We can just keep the last prefix defined (which is what the RIO Turtle parser already does)
We do not want to "capture the full spectrum of what the Turtle file can define". We just want to capture the prefixes used in an already defined RDF resource (e.g. a file). For which a And for the edge case of the 2 people who are having fun redefining prefixes in the middle of their Turtle files: that is OK, we will "lose" the unimportant information of the first prefix defined. But that does not take anything away from how helpful making prefixes available in Sophia will be to RDF developers!
What do you mean by flexibility of the format and armchair design? We are not gonna do any parsing of the Turtle format; I just propose that Sophia properly integrate the prefix hashmap already returned by the parsers it is using. It could be implemented quite fast and easily:
I am not sure what is non-trivial here. There are no changes proposed to the API, and nothing we need to implement. We just pick up the prefixes already served by the RIO Turtle parser at the end of |
Unfortunately not at the moment.
Definitely not! See below.
What this issue is not about
I strongly disagree. Prefix maps are not part of the RDF graph. They are part of some serialization formats, but they are in no way intrinsic to the graph.
No, AFAIK, RDF storage implementations use other strategies for efficiency (like indexing). They don't rely on prefix maps.
Now, I don't deny that prefix maps have a great value for developers, and that's why I opened this issue in the first place: I want to be able to get (a good approximation of) the prefixes declared in the parsed content, so that I can use them later (in particular when I serialize the graph back). However, I don't want to make the prefix map part of the Graph or Dataset traits (even though several other RDF APIs do that), precisely because it perpetuates the misconception that prefixes are "part" of the data model.
What I could live with would be a WithPrefixMap trait, that some implementations of Graph or Dataset could also implement, if they really want to. But that should definitely be a separate trait.
But that's actually not what this issue is about.
What this issue is about
This issue is not about bundling the prefix map in the graph, but about extracting the prefix map from the parser (which, anyway, would be necessary to bundle it with the graph if we really wanted to). Currently, the Rio parsers on which most Sophia parsers are based do not make this possible. That's why this issue has been stalling; we first need to change Rio, then reflect this change in Sophia.
Furthermore, Rio is no longer actively maintained. Its main developer has moved towards a new parser architecture (see for example https://github.com/oxigraph/oxigraph/tree/main/lib/oxttl). Ultimately, I might drop Rio and use the new oxigraph parsers instead. But that's also a major refactoring.
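For illustration, the idea of a separate, opt-in WithPrefixMap trait mentioned above could look roughly like the sketch below. Only the trait name comes from the discussion; its method, the toy `Graph` trait, and the two struct types are assumptions made up for this example and do not reflect Sophia's actual traits.

```rust
/// Toy stand-in for a graph trait; NOT Sophia's actual `Graph` trait.
trait Graph {
    fn triple_count(&self) -> usize;
}

/// Opt-in trait, kept separate so that prefixes never become
/// part of the graph abstraction itself. The method is hypothetical.
trait WithPrefixMap {
    fn prefixes(&self) -> &[(String, String)];
}

/// A graph that knows nothing about prefixes.
struct PlainGraph {
    triples: Vec<[String; 3]>,
}

/// A graph that additionally remembers the prefixes it was parsed with.
struct AnnotatedGraph {
    triples: Vec<[String; 3]>,
    prefixes: Vec<(String, String)>,
}

impl Graph for PlainGraph {
    fn triple_count(&self) -> usize {
        self.triples.len()
    }
}

impl Graph for AnnotatedGraph {
    fn triple_count(&self) -> usize {
        self.triples.len()
    }
}

impl WithPrefixMap for AnnotatedGraph {
    fn prefixes(&self) -> &[(String, String)] {
        &self.prefixes
    }
}

/// Works with any graph, with or without prefixes.
fn describe<G: Graph>(g: &G) -> String {
    format!("{} triple(s)", g.triple_count())
}

/// Only available for graphs that opt into `WithPrefixMap`.
fn describe_with_prefixes<G: Graph + WithPrefixMap>(g: &G) -> String {
    format!("{} triple(s), {} prefix(es)", g.triple_count(), g.prefixes().len())
}
```

Generic code that only needs the abstract syntax keeps bounding on `Graph` alone, while a serializer could ask for `Graph + WithPrefixMap` when it wants nicer output.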
Help is always welcome :) The most future-proof path would probably be
I realize this is a big workplan... |
It's already done with the |
I did look quickly at the oxttl doc (too quickly, obviously!) but I missed it. That's great news. All the more reason to migrate to oxttl... |
Actually I found the We just need to call this method after the last step of parsing :D (but it is also a good opportunity to upgrade to
I don't mind if this is implemented as a different trait
It is nice that we talk about where to put the PrefixMap here, because this way if we want to help you implement it we already know where to start ;) |
😮 ok, sorry for missing that one!
Please note that you currently get a
I don't see an easy way to automate that, because I don't think it even makes sense to attach a prefix-map to a
Help is always welcome 😄. I just want to avoid getting a PR merging two different issues (exposing a |
Beyond the parsed triples/quads, parsers may collect additional useful information, for example prefix declarations or base IRI. What would be the best API to get this information?
My initial idea was to add methods to the triple/quad source returned by the parse methods, to access this information. For example:
The drawback of this approach is that it forces us to keep the triple source, even when it is exhausted. Methods such as Graph::insert_all cannot consume it; they have to borrow it mutably (which is rather counterintuitive).
Another approach would be to use a kind of callback:
This approach might be slightly harder to implement, but offers more flexibility. And it makes it possible to consume sources while still getting the additional information.
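As a rough sketch of the callback idea, assuming a toy line-based parser rather than Sophia's real parser traits: every name below (`ParserEvent`, `ToySource`, `parse_with_callback`, the toy `insert_all`) is hypothetical. The point is only that the source can be moved into a consumer while directives are still reported to the caller.

```rust
/// Hypothetical side information a parser might report while parsing.
#[derive(Debug, Clone, PartialEq)]
enum ParserEvent {
    Prefix { label: String, namespace: String },
    Base(String),
}

/// Extracts the text between the first `<` and the following `>`.
fn iri_in_brackets(s: &str) -> Option<&str> {
    let start = s.find('<')? + 1;
    let end = start + s[start..].find('>')?;
    Some(&s[start..end])
}

/// Toy triple source (NOT Sophia's `TripleSource`): yields every non-directive
/// line as an opaque string and reports directives through `on_event`.
struct ToySource<'a, F: FnMut(ParserEvent)> {
    lines: std::str::Lines<'a>,
    on_event: F,
}

impl<'a, F: FnMut(ParserEvent)> Iterator for ToySource<'a, F> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        for line in self.lines.by_ref() {
            let line = line.trim();
            if let Some(rest) = line.strip_prefix("@prefix ") {
                // `rest` looks like `ex: <http://example.com/> .`
                if let Some((label, tail)) = rest.split_once(':') {
                    if let Some(ns) = iri_in_brackets(tail) {
                        (self.on_event)(ParserEvent::Prefix {
                            label: label.trim().to_string(),
                            namespace: ns.to_string(),
                        });
                    }
                }
            } else if let Some(rest) = line.strip_prefix("@base ") {
                if let Some(base) = iri_in_brackets(rest) {
                    (self.on_event)(ParserEvent::Base(base.to_string()));
                }
            } else if !line.is_empty() {
                return Some(line.to_string());
            }
        }
        None
    }
}

/// Builds a source that forwards directives to the given callback.
fn parse_with_callback<F: FnMut(ParserEvent)>(text: &str, on_event: F) -> ToySource<'_, F> {
    ToySource { lines: text.lines(), on_event }
}

/// Toy consumer that takes the source by value, as an `insert_all` could.
fn insert_all<I: Iterator<Item = String>>(source: I) -> usize {
    source.count()
}
```

A caller would write something like `insert_all(parse_with_callback(doc, |e| events.push(e)))`: the source is consumed by value, and `events` still holds the prefix and base declarations afterwards.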
@Tpt @MattesWhite any thoughts?