6.6.2 Enhanced Statistics uses LinkSet wrongly #81

Closed
VladimirAlexiev opened this issue Aug 29, 2014 · 22 comments
Comments

@VladimirAlexiev

Sec 6.6.2 uses LinkSet to provide

  • properties and the number of unique objects linked to the property
  • properties and the number of unique literals

This is totally wrong: void:Linkset and void:linkPredicate are used to describe links between datasets, not counts within one dataset.
You should use void:propertyPartition (and maybe void:classPartition within it) and void:distinctSubjects.

@KimJBaran
Contributor

I cannot comment on propertyPartition, classPartition and distinctSubjects, but the use of LinkSet and linkPredicate appears to be wrong.

From http://www.w3.org/TR/void/#linkset:

VoID also allows the description of RDF links between datasets. An RDF link is an RDF triple whose subject and object are described in different datasets.

and

The property void:linkPredicate can be used to specify the type of links that connect two datasets. In other words, it names the RDF property in the predicate position of the link triples.

The following example uses void:linkPredicate to state that the DBpedia and Geonames datasets are linked by triples that have the owl:sameAs predicate:

@KimJBaran KimJBaran added the bug label Aug 29, 2014
@micheldumontier
Member

I disagree. A void:Dataset is a set of RDF triples. A void:Linkset is a collection of RDF triples between two datasets. Therefore, we can create Linksets between any arbitrary datasets.

@VladimirAlexiev
Author

Yes, but the section I'm quoting doesn't talk about 2 datasets. It appears to want to provide some stats of 1 dataset, and uses the wrong class and property.
See http://www.w3.org/TR/void/#class-property-partitions (as opposed to http://www.w3.org/TR/void/#describing-linksets)

@micheldumontier
Member

A dataset is any set of triples. In the formulation for the enhanced statistics, we describe a set of relations (i.e. linkset) between arbitrary partitions of a dataset. Each partition is a dataset in its own right (see void:subset). I think this approach is justifiable and falls within the scope of the VoID constructs provided. You seem not to agree; could you provide an alternative formulation?

@VladimirAlexiev
Author

we describe a set of relations (i.e. linkset) between arbitrary partitions of a dataset.

Not true. E.g. the section "properties and the number of unique objects linked to the property" shows this query:

SELECT  ?p (COUNT(DISTINCT ?o ) AS ?count ) { ?s ?p ?o } GROUP BY ?p

Where do you see 2 arbitrary (i.e. independent) partitions here?

The right way to express this is (see http://www.w3.org/TR/void/#statistics):

:rdfdataset
    void:propertyPartition [
        void:property <property-uri> ;
        void:distinctObjects "###"^^xsd:integer] .

This counts any objects (URIs, blank nodes, literals), as per the above query and the VoID spec. If you want to count only resources, see http://www.w3.org/TR/void/#class-property-partitions and use rdfs:Resource (not rdfs:Class):

:rdfdataset
    void:propertyPartition [void:property <property-uri> ;
      void:classPartition [void:class rdfs:Resource;
        void:distinctObjects "###"^^xsd:integer]].

The key to understanding the above is that both void:propertyPartition and void:classPartition create sub-datasets, which are sets of triples. So it's legitimate to speak of the void:distinctObjects of those triples.
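
As a rough illustration of what these per-property counts mean, here is a minimal Python sketch that mirrors the SPARQL query above on a few made-up triples and prints one void:propertyPartition per property. The toy triples, the ex: names and both function names are assumptions for illustration only; none of them come from the HCLS document or the VoID spec.

```python
from collections import defaultdict

# Hypothetical toy dataset: (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:knows", "ex:c"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:a", "ex:name", '"Alice"'),
]


def distinct_objects_per_property(triples):
    """Same idea as: SELECT ?p (COUNT(DISTINCT ?o) AS ?count) { ?s ?p ?o } GROUP BY ?p"""
    objects = defaultdict(set)
    for _s, p, o in triples:
        objects[p].add(o)  # a set keeps each object once per property
    return {p: len(objs) for p, objs in objects.items()}


def to_void_turtle(counts, dataset=":rdfdataset"):
    """Render the counts as one void:propertyPartition statement per property."""
    blocks = []
    for prop, n in sorted(counts.items()):
        blocks.append(
            f"{dataset}\n"
            f"    void:propertyPartition [\n"
            f"        void:property {prop} ;\n"
            f'        void:distinctObjects "{n}"^^xsd:integer ] .'
        )
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(to_void_turtle(distinct_objects_per_property(TRIPLES)))
```

Each property partition is treated here as its own little set of triples, which is why a distinct-object count per partition makes sense.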

@micheldumontier
Member

We need to specify
1 - the property
2 - the subject class partition
3 - the object class partition

so the reason we started using the linkset was because of "void:subjectsTarget" and "void:objectsTarget" to specify both the subject and target class partitions. Can you elaborate on how we can get this kind of functionality using a void:propertyPartition?

@VladimirAlexiev
Author

Dear Michel,

I cannot see any query in the quoted section that reports on a property and two classes. The closest query that I see is "unique subject types that are linked through a property to unique object types":

SELECT (COUNT(DISTINCT ?s ) AS ?scount ) ?p (COUNT(DISTINCT ?o ) AS ?ocount ) { ?s ?p ?o } GROUP BY ?p

It counts distinct subjects and objects per property. This can be reported as follows:

:rdfdataset
    void:propertyPartition [
        void:property <property-uri> ;
        void:distinctSubjects "###"^^xsd:integer ;
        void:distinctObjects "###"^^xsd:integer] .

However, the same query seems to want to (incorrectly) report on property and two classes:

:rdfdataset
    void:subset [
        a void:LinkSet ; 
        void:linkPredicate <property-uri> ;
        void:subjectsTarget [
            void:class <subject-type-uri> ;
            void:entities "###"^^xsd:integer ;
            void:objectsTarget [
                void:class <object-type-uri> ;
                void:entities "###"^^xsd:integer]]].

To make such a report, you need to use the http://ldf.fi/void-ext ontology (see http://jiemakel.github.io/aether/ for a tool implementing such counts, and a paper explaining it), e.g. like this:

:rdfdataset
    void:propertyPartition [void:property <property-uri> ;
        void:classPartition [void:class <subject-class-uri>;
            void-ext:objectClassPartition [void:class <object-class-uri>;
                void:triples "###"^^xsd:integer]]].

Above we use:

  • void:classPartition "subset of a dataset which describes instances of a particular class", where these instances are in the subject position.
  • void-ext:objectClassPartition: "subset of a void:Dataset that contains only entities of a certain rdfs:Class that occur in the object position of triples in the dataset"

Note that if you have some subclass or subproperty inference in the repository, those partitions won't be exclusive...
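
To make the nesting above more concrete, here is a small Python sketch that counts triples per (property, subject class, object class) over a made-up dataset. It is only an approximation of what such a report would contain: the toy triples, the ex: names and the function name are assumptions, and a real repository would compute this with SPARQL rather than in memory.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical toy dataset: (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:acme", RDF_TYPE, "ex:Company"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:knows", "ex:bob"),
]


def class_pair_counts(triples):
    """Count triples per (property, subject class, object class)."""
    # Collect the declared rdf:type values of every node.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    counts = Counter()
    for s, p, o in triples:
        # A triple falls into one partition per (subject type, object type) pair,
        # so a node with several types is counted in several partitions.
        # Triples whose subject or object has no declared type are in none.
        for subject_class in types.get(s, ()):
            for object_class in types.get(o, ()):
                counts[(p, subject_class, object_class)] += 1
    return counts


if __name__ == "__main__":
    for (prop, sc, oc), n in sorted(class_pair_counts(TRIPLES).items()):
        print(prop, sc, oc, n)
```

The inner double loop is where the non-exclusive partitions show up: with inferred superclasses, the same triple would be counted once per inferred type pair.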

@micheldumontier
Member

So the objectClassPartition is a property of the classPartition? And the void:triples are associated with the objectClassPartition? Strange.

@micheldumontier micheldumontier self-assigned this Sep 29, 2014
@VladimirAlexiev
Author

void-ext:objectClassPartition is analogous to void:classPartition: both make a subset (both are subproperties of void:subset). The difference is that objectClassPartition restricts the objects of triples in the subset, whereas classPartition restricts the subjects.

This needs to be qualified: http://www.w3.org/TR/void/#class-property-partitions says "The (classPartition) contains all triples that describe entities that have this class as their rdf:type". Is it true that the word "describe" means "have as subject"? SPARQL deliberately leaves freedom about how a "DESCRIBE ?s" query is implemented. Most repos return Concise Bounded Description (CBD), which includes all "?s ?p ?o" triples, but also all triples "?s ?p1 ?blank. ?blank ?p2 ?o" where ?blank is a blank node (recursively); and "?statement rdf:subject ?s. ?statement ?p ?o" (i.e. all reified statements about ?s). Others even return Symmetric CBD, which includes statements where ?s is Object.

objectClassPartition is a property of the classPartition?

No: objectClassPartition can be applied to any void:Dataset, no matter whether it's the result of a partition or not. Since the subsets are themselves void:Datasets, you can subdivide them further. You can swap the order/nesting of the propertyPartition, classPartition and objectClassPartition and still get almost the same results. At each level, you need to describe the parameter of the partition: void:property and void:class (twice).

By "almost" I refer to the ambiguity of "describe" above. You also need to be careful about literals: if your repo does not automagically declare all literals to be of class rdfs:Literal, then objectClassPartition will skip all data triples (those having a literal as their object). And "declare literals as rdfs:Literal" means e.g. "123 a rdfs:Literal", which is weird, because in RDF literals cannot be the subject of a statement (RDF 1.1 does not allow that either).
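
The literal caveat can be seen in a small Python sketch: triples whose object has no declared type never land in any object class partition. The toy triples, the convention that literals start with a double quote, and the function names are all assumptions made for this illustration.

```python
RDF_TYPE = "rdf:type"

# Hypothetical toy dataset: (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:age", '"42"'),
]


def is_literal(term):
    # Assumption for this sketch: literals are written with a leading double quote.
    return term.startswith('"')


def split_by_object_typing(triples):
    """Split triples into those whose object has a declared rdf:type and the rest."""
    typed_nodes = {s for s, p, _o in triples if p == RDF_TYPE}
    covered, skipped = [], []
    for triple in triples:
        # Only triples with a typed object can fall into an object class partition.
        (covered if triple[2] in typed_nodes else skipped).append(triple)
    return covered, skipped


if __name__ == "__main__":
    covered, skipped = split_by_object_typing(TRIPLES)
    print("covered:", covered)
    print("skipped literal triples:", [t for t in skipped if is_literal(t[2])])
```

In this toy data the ex:age triple is skipped because its literal object has no type, so statistics built only from object class partitions would silently omit it.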

@micheldumontier
Member

Hi,
OK, I modified the relevant structures. See the diff here:
https://github.com/joejimbo/HCLSDatasetDescriptions/compare/statistics

How does that look?

@micheldumontier
Member

@VladimirAlexiev can you have a look at the diff?

@VladimirAlexiev
Author

  1. Instead of void:entities, I think you should use void:distinctObjects or void:distinctSubjects respectively. Although void:entities is left a bit vague in the spec (the number of "main entities" in a dataset), in this case it would mean "all nodes". But you want to report only the distinct nodes in object or subject position, respectively.
  2. Cosmetic: I'd collapse all closing ] onto the same line (and you don't need the semicolon before a closing ]). So instead of this:

void:entities "###"^^xsd:integer ;
        ]
    ].

Use this:

void:entities "###"^^xsd:integer ]].

Cheers!

@micheldumontier
Member

@VladimirAlexiev OK, I have made the edits. Can you verify the correctness of each statistic?

@micheldumontier
Member

@VladimirAlexiev
Author

Thanks for adding me to the contributors! Could you please change it to this:

<dd>Vladimir Alexiev, Ontotext Corp, Bulgaria &lt;<a href="mailto:[email protected]">[email protected]</a>&gt;</dd>

@micheldumontier
Member

done.

@AlasdairGray

Please ensure that the examples both within the document and hcls.ttl are updated. (Relates to issue #89)

@egombocz

I'll look at the IO Informatics use case and will harmonize it in accordance with the guidelines

@AlasdairGray AlasdairGray added this to the Publication milestone Nov 10, 2014
@AlasdairGray

@egombocz I think your comment relates to issue #74

@mscottm
Contributor

mscottm commented Dec 15, 2014

I sent a note to Vladimir asking him to verify what Michel did (followup to #81 (comment)).

@micheldumontier
Member

@VladimirAlexiev can you have another look at the latest?

@micheldumontier
Member

The refactored statistics have now been merged as per commit e85578a.
