
Fix duplicated values (fixes #28) #30

Open
wants to merge 1 commit into
base: main

ckindermann
Contributor

Jena's internal model doesn't treat an RDF graph as a set of triples. Instead, repeated triples (triples with the same subject, predicate, and object) are represented as distinct Java objects, even though they are equal w.r.t. equals. It seems that Jena's internal model repeats triples for blank nodes annotated with an rdf:nodeID (e.g. rdf:nodeID="genid3"), which leads to duplicates in our LDTab output.
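To make the difference between "list of statements" and "set of triples" concrete, here is a minimal sketch. It does not use Jena or LDTab code: the triples are plain Python tuples, and the identifiers (`ex:A`, `_:genid3`) and the `dedupe` helper are made up for illustration.

```python
# Hypothetical stand-in for a parser's output: a plain list of
# (subject, predicate, object) tuples. This is NOT Jena's API.
triples = [
    ("_:genid3", "rdf:type", "owl:Restriction"),
    ("ex:A", "rdfs:subClassOf", "_:genid3"),
    ("_:genid3", "rdf:type", "owl:Restriction"),  # same statement, repeated
]


def dedupe(statements):
    """Drop repeated triples, keeping the first occurrence and input order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(statements))


unique = dedupe(triples)
```

Read as a list, the input has three statements; read as a set, it has two triples. Deduplicating at this level is what collapses the repeated blank node statement.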

The proposed change fixes this issue. However, it would also prevent us from round-tripping files that contain duplicate triples, which is not exactly desirable. I'll see whether I can come up with a better solution. At least we know where the 'bug' is located.

@ckindermann
Contributor Author

ckindermann commented Sep 27, 2024

This is a tricky issue - I don't think there is a 'correct' way to solve this.

The RDF specification says that "[Identifiers] are not persistent or portable [...] for blank nodes. Blank node identifiers are not part of the RDF abstract syntax, but are entirely dependent on the concrete syntax or implementation."

This means we are not required to handle blank node identifiers (such as rdf:nodeID="genid3") in LDTab as far as the RDF spec is concerned. However, handling blank node identifiers is necessary if we want to offer a perfect round-trip service (without some normalization procedure). In other words, we'd need to assign blank node structures (with "datatype":"_JSON") an ID ... but this obviously goes against the design of LDTab to eliminate blank nodes where possible.
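As an illustration of why blank node labels are not something a round trip has to preserve, the sketch below compares two triple lists that differ only in their blank node labels. Everything in it is an assumption for the example: terms are plain strings, anything starting with `_:` is treated as a blank node, and `relabel_blank_nodes` is a hypothetical helper, not part of LDTab or Jena. It is also a naive normalization: it only works when both lists present their triples in the same order, and it is not a general graph isomorphism check.

```python
def relabel_blank_nodes(triples):
    """Rename blank nodes to _:b0, _:b1, ... in order of first appearance."""
    mapping = {}

    def canon(term):
        # Assumption: blank nodes are strings with a "_:" prefix.
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = f"_:b{len(mapping)}"
            return mapping[term]
        return term

    return [tuple(canon(term) for term in triple) for triple in triples]


# Same structure, different (arbitrary) blank node labels.
before = [
    ("ex:A", "rdfs:subClassOf", "_:genid3"),
    ("_:genid3", "owl:onProperty", "ex:p"),
]
after = [
    ("ex:A", "rdfs:subClassOf", "_:x42"),
    ("_:x42", "owl:onProperty", "ex:p"),
]

same_up_to_labels = relabel_blank_nodes(before) == relabel_blank_nodes(after)
```

The two lists are unequal as written but equal after relabeling, which is the sense in which the label `genid3` itself carries no information under the RDF abstract syntax.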

So, we have three options:

  1. RDF as sets of triples: Accept commit f039757 as is (and drop support for persisting any duplicate triples).
  2. RDF with duplicates: Change f039757 to only remove duplicated triples where the subject is a blank node (so there is still support for persisting duplicated triples - but we don't offer support for persisting blank node identifiers).
  3. Full support for blank node identifiers: Introduce a meta key to persist blank node IDs when it matters.
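The practical difference between options 1 and 2 can be sketched as follows. This is not the code in f039757; it is a hedged illustration on plain tuples, with hypothetical helper names (`is_blank`, `dedupe_all`, `dedupe_blank_subjects`) and the assumption that blank nodes are strings prefixed with `_:`. Option 3 is left out because its meta key format is not specified here.

```python
def is_blank(term):
    # Assumption: blank nodes are encoded as strings starting with "_:".
    return term.startswith("_:")


def dedupe_all(triples):
    """Option 1: treat the graph as a set; drop every repeated triple."""
    return list(dict.fromkeys(triples))


def dedupe_blank_subjects(triples):
    """Option 2: drop repeats only when the subject is a blank node."""
    seen = set()
    kept = []
    for triple in triples:
        if is_blank(triple[0]):
            if triple in seen:
                continue  # repeated blank-node statement: skip it
            seen.add(triple)
        kept.append(triple)  # named subjects are always kept, repeats included
    return kept


sample = [
    ("ex:A", "rdfs:label", "A"),
    ("ex:A", "rdfs:label", "A"),  # duplicate with a named subject
    ("_:genid3", "rdf:type", "owl:Restriction"),
    ("_:genid3", "rdf:type", "owl:Restriction"),  # duplicate with a blank subject
]

option_1 = dedupe_all(sample)
option_2 = dedupe_blank_subjects(sample)
```

On this sample, option 1 keeps two triples and option 2 keeps three: the repeated `ex:A` statement survives under option 2, while the repeated blank node statement is removed under both.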

@jamesaoverton
Member

We discussed this on a call, and tentatively agreed on option 1.

@ckindermann ckindermann marked this pull request as ready for review October 28, 2024 05:33
Successfully merging this pull request may close these issues.

Bug: Assigning a blank node an rdf:nodeID can lead to duplicates in LDTab
2 participants