Named Query "Proceedings" too slow and incompatible with QLever #45

WolfgangFahl opened this issue Dec 17, 2022 · 9 comments

@WolfgangFahl
Owner

    PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
    PREFIX p: <http://www.wikidata.org/prop/>
    PREFIX schema: <http://schema.org/>
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX wikibase: <http://wikiba.se/ontology#>
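    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ps: <http://www.wikidata.org/prop/statement/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>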
    SELECT 
      ?item 
      ?itemLabel 
      ?itemDescription 
      ?ceurwspart 
      ?sVolume 
      ?Volume 
      ?short_name 
      ?dblpProceedingsId 
      ?ppnId 
      ?event 
      ?eventLabel 
      ?dblpEventId 
      ?eventSeries 
      ?eventSeriesLabel 
      ?eventSeriesOrdinal 
      ?title 
      ?language_of_work_or_name 
      ?language_of_work_or_nameLabel 
      ?URN_NBN 
      ?publication_date 
      ?fullWorkUrl 
      ?described_at_URL 
      ?homePage 
    WHERE {
      ?item wdt:P31 wd:Q1143604;
        wdt:P179 wd:Q27230297;
        rdfs:label ?itemLabel.
      FILTER((LANG(?itemLabel)) = "en")
      OPTIONAL {
        ?item schema:description ?itemDescription.
        FILTER((LANG(?itemDescription)) = "en")
      }
      OPTIONAL { ?item wdt:P478 ?Volume. }
      OPTIONAL { ?item (p:P179/pq:P478) ?_sVolume. BIND(xsd:integer(?_sVolume) as ?sVolume)}
      OPTIONAL { ?item wdt:P1813 ?short_name. }
      OPTIONAL { ?item wdt:P8978 ?dblpProceedingsId. }
      OPTIONAL { ?item wdt:P6721 ?ppnId. }
      OPTIONAL {?item wdt:P4109 ?URN_NBN.}        
      OPTIONAL { ?item wdt:P1476 ?title. }
      OPTIONAL { ?item wdt:P577 ?publication_date. }
      OPTIONAL { ?item wdt:P953 ?fullWorkUrl. }
      OPTIONAL { ?item wdt:P973 ?described_at_URL. }
      OPTIONAL { ?item wdt:P856 ?homePage. }
      OPTIONAL {
        ?item wdt:P407 ?language_of_work_or_name.
        ?language_of_work_or_name rdfs:label ?language_of_work_or_nameLabel.
        FILTER((LANG(?language_of_work_or_nameLabel)) = "en")
      }
      {
        SELECT 
          ?item 
          (GROUP_CONCAT(?_event; SEPARATOR = "|") AS ?event) 
          (GROUP_CONCAT(?_eventLabel; SEPARATOR = "|") AS ?eventLabel) 
          (GROUP_CONCAT(?_eventSeries; SEPARATOR = "|") AS ?eventSeries) 
          (GROUP_CONCAT(?_eventSeriesLabel; SEPARATOR = "|") AS ?eventSeriesLabel) 
          (GROUP_CONCAT(?_eventSeriesOrdinal; SEPARATOR = "|") AS ?eventSeriesOrdinal)
          (GROUP_CONCAT(?_dblpEventId; SEPARATOR = "|") AS ?dblpEventId) 
        WHERE {
          ?item wdt:P31 wd:Q1143604;
            wdt:P179 wd:Q27230297;
            wdt:P4745 ?_event.
          ?_event rdfs:label ?_eventLabel.
          FILTER((LANG(?_eventLabel)) = "en")
          OPTIONAL { ?_event wdt:P10692 ?_dblpEventId. }
          OPTIONAL {
            ?_event p:P179 ?_partOfTheEventSeriesStmt.
            ?_partOfTheEventSeriesStmt ps:P179 ?_eventSeries;
              pq:P1545 ?_eventSeriesOrdinal.
            ?_eventSeries rdfs:label ?_eventSeriesLabel.
            FILTER((LANG(?_eventSeriesLabel)) = "en")
          }
        }
        GROUP BY ?item
      }
    }
    ORDER BY ?sVolume
@WolfgangFahl
Owner Author

WolfgangFahl commented Dec 17, 2022

More often than not, the above query times out on the Wikidata Query Service. It also doesn't work in the QLever environment.
See #42.

@WolfgangFahl
Owner Author

Event details may be queried separately:

SELECT 
?item 
(GROUP_CONCAT(?_event; SEPARATOR = "|") AS ?event) 
(GROUP_CONCAT(?_eventLabel; SEPARATOR = "|") AS ?eventLabel) 
(GROUP_CONCAT(?_eventSeries; SEPARATOR = "|") AS ?eventSeries) 
(GROUP_CONCAT(?_eventSeriesLabel; SEPARATOR = "|") AS ?eventSeriesLabel) 
(GROUP_CONCAT(?_eventSeriesOrdinal; SEPARATOR = "|") AS ?eventSeriesOrdinal)
(GROUP_CONCAT(?_dblpEventId; SEPARATOR = "|") AS ?dblpEventId) 
WHERE {
  VALUES ?item {
    wd:Q107266045
  }  
  ?item  wdt:P4745 ?_event.
  ?_event rdfs:label ?_eventLabel.
  FILTER((LANG(?_eventLabel)) = "en")
  OPTIONAL { ?_event wdt:P10692 ?_dblpEventId. }
  OPTIONAL {
    ?_event p:P179 ?_partOfTheEventSeriesStmt.
    ?_partOfTheEventSeriesStmt ps:P179 ?_eventSeries;
                               pq:P1545 ?_eventSeriesOrdinal.
    ?_eventSeries rdfs:label ?_eventSeriesLabel.
    FILTER((LANG(?_eventSeriesLabel)) = "en")
  }
}
GROUP BY ?item
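
To run the event query for more than one proceedings item, the VALUES block can be generated from a list of item ids. The following is only a sketch: `values_clause` and `batches` are hypothetical helper names that are not part of this project, and the batch size is an assumption to be tuned against the endpoint.

```python
import re
from typing import Iterable, Iterator, List

# Wikidata item ids look like Q107266045
QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")


def values_clause(qids: Iterable[str], var: str = "item") -> str:
    """Build a SPARQL VALUES clause binding ?var to the given Wikidata items."""
    qid_list = list(qids)
    for qid in qid_list:
        # reject anything that is not a plain item id to avoid broken queries
        if not QID_PATTERN.match(qid):
            raise ValueError(f"not a Wikidata item id: {qid!r}")
    entities = " ".join(f"wd:{qid}" for qid in qid_list)
    return f"VALUES ?{var} {{ {entities} }}"


def batches(qids: List[str], size: int) -> Iterator[List[str]]:
    """Split the id list into chunks so that each query stays small."""
    for start in range(0, len(qids), size):
        yield qids[start:start + size]


if __name__ == "__main__":
    print(values_clause(["Q107266045"]))
```

Each batch yields one VALUES clause that can replace the single-item block in the query above.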

@tholzheim
Collaborator

The part that is incompatible with QLever is the cast to integer that is used for sorting the result:

xsd:integer(?sVolume) is not supported by QLever
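
One possible workaround until the cast is available is to select the raw qualifier value and do the numeric sort on the client. This is a sketch under assumptions: it presumes the query returns the volume as a string under the key `sVolume`, and `volume_number` and `sort_by_volume` are hypothetical names. Records without a usable volume are placed last, which is a design choice and not what SPARQL ORDER BY does with unbound values.

```python
from typing import Dict, List, Optional


def volume_number(record: Dict[str, object]) -> Optional[int]:
    """Return the volume as int, or None if it is missing or not numeric."""
    raw = record.get("sVolume")
    try:
        return int(str(raw).strip())
    except (TypeError, ValueError):
        return None


def sort_by_volume(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Sort records numerically by sVolume; records without a number go last."""
    def key(record: Dict[str, object]):
        number = volume_number(record)
        # the first tuple element pushes records without a number to the end
        return (number is None, number if number is not None else 0)
    return sorted(records, key=key)


if __name__ == "__main__":
    sample = [{"sVolume": "10"}, {"sVolume": "9"}, {}, {"sVolume": "2"}]
    print([r.get("sVolume") for r in sort_by_volume(sample)])
```

A plain string sort would put "10" before "2" and "9", which is why the conversion to int is needed somewhere, either in the query or in the client.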

@tholzheim
Collaborator

Also, adding DISTINCT to the sub-query improves the execution time.

@WolfgangFahl
Owner Author

Please create an issue with QLever for

xsd:integer(?sVolume) is not supported by QLever

upstream

@WolfgangFahl
Owner Author

We need a two-phase query implementation now.

@tholzheim
Collaborator

The query is already two-phased; see

  • used here:
    def update(self, withStore: bool = True):
        """
        update my table from the Wikidata Proceedings SPARQL query
        """
        if self.debug:
            print(f"Querying proceedings from {self.baseurl} ...")
        # query proceedings
        wd_proceedings_records: List[dict] = self.sparql.queryAsListOfDicts(self.wdQuery.query)
        # query events
        event_query = self.qm.queriesByName["EventsByProceeding"]
        wd_event_records: List[dict] = self.sparql.queryAsListOfDicts(event_query.query)
        # add events to proceeding records
        proceedings_event_map, _duplicates = LOD.getLookup(wd_event_records, "item")
        for proceedings_record in wd_proceedings_records:
            item = proceedings_record.get("item")
            if item in proceedings_event_map:
                event_record = proceedings_event_map.get(item)
                proceedings_record.update(**event_record)
        primaryKey = "URN_NBN"
        withCreate = True
        withDrop = True
        entityInfo = self.sqldb.createTable(
            wd_proceedings_records,
            "Proceedings",
            primaryKey,
            withCreate,
            withDrop,
            sampleRecordCount=5000,
            failIfTooFew=False
        )
        procsByURN, duplicates = LOD.getLookup(wd_proceedings_records, 'URN_NBN')
        if withStore:
            self.sqldb.store(procsByURN.values(), entityInfo, executeMany=True, fixNone=True)
        if len(duplicates)>0:
            print(f"found {len(duplicates)} duplicates URN entries")
            if len(duplicates)<10:
                print(duplicates)
        return wd_proceedings_records

which uses the queries

  • for proceedings

    'Proceedings':
      sparql: |
        #
        # get CEUR-WS Proceedings records by Volume with linked Event and EventSeries
        #
        # WF 2022-08-13
        #
        # the Volume number P478 is sometimes available with the proceedings item and sometimes as a qualifier
        # of
        #
        PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
        PREFIX p: <http://www.wikidata.org/prop/>
        PREFIX schema: <http://schema.org/>
        PREFIX wd: <http://www.wikidata.org/entity/>
        PREFIX wdt: <http://www.wikidata.org/prop/direct/>
        PREFIX wikibase: <http://wikiba.se/ontology#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX ps: <http://www.wikidata.org/prop/statement/>
        PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
        SELECT DISTINCT
          ?item
          ?itemLabel
          ?itemDescription
          ?ceurwspart
          ?sVolume
          ?Volume
          ?short_name
          ?dblpProceedingsId
          ?ppnId
          ?event
          ?eventLabel
          ?dblpEventId
          ?eventSeries
          ?eventSeriesLabel
          ?eventSeriesOrdinal
          ?title
          ?language_of_work_or_name
          ?language_of_work_or_nameLabel
          ?URN_NBN
          ?publication_date
          ?fullWorkUrl
          ?described_at_URL
          ?homePage
        WHERE {
          ?item wdt:P31 wd:Q1143604;
            wdt:P179 wd:Q27230297;
            rdfs:label ?itemLabel.
          FILTER((LANG(?itemLabel)) = "en")
          OPTIONAL {
            ?item schema:description ?itemDescription.
            FILTER((LANG(?itemDescription)) = "en")
          }
          OPTIONAL { ?item wdt:P478 ?Volume. }
          OPTIONAL { ?item (p:P179/pq:P478) ?_sVolume. BIND(xsd:integer(?_sVolume) as ?sVolume)}
          OPTIONAL { ?item wdt:P1813 ?short_name. }
          OPTIONAL { ?item wdt:P8978 ?dblpProceedingsId. }
          OPTIONAL { ?item wdt:P6721 ?ppnId. }
          OPTIONAL { ?item wdt:P4109 ?URN_NBN. }
          OPTIONAL { ?item wdt:P1476 ?title. }
          OPTIONAL { ?item wdt:P577 ?publication_date. }
          OPTIONAL { ?item wdt:P953 ?fullWorkUrl. }
          OPTIONAL { ?item wdt:P973 ?described_at_URL. }
          OPTIONAL { ?item wdt:P856 ?homePage. }
          OPTIONAL {
            ?item wdt:P407 ?language_of_work_or_name.
            ?language_of_work_or_name rdfs:label ?language_of_work_or_nameLabel.
            FILTER((LANG(?language_of_work_or_nameLabel)) = "en")
          }
        }
        ORDER BY ?sVolume

  • for aggregated events by proceeding

    'EventsByProceeding':
      'sparql': |
        SELECT DISTINCT
          ?item
          (GROUP_CONCAT(?_event; SEPARATOR = "|") AS ?event)
          (GROUP_CONCAT(?_eventLabel; SEPARATOR = "|") AS ?eventLabel)
          (GROUP_CONCAT(?_eventSeries; SEPARATOR = "|") AS ?eventSeries)
          (GROUP_CONCAT(?_eventSeriesLabel; SEPARATOR = "|") AS ?eventSeriesLabel)
          (GROUP_CONCAT(?_eventSeriesOrdinal; SEPARATOR = "|") AS ?eventSeriesOrdinal)
          (GROUP_CONCAT(?_dblpEventId; SEPARATOR = "|") AS ?dblpEventId)
        WHERE {
          ?item wdt:P31 wd:Q1143604;
            wdt:P179 wd:Q27230297;
            wdt:P4745 ?_event.
          ?_event rdfs:label ?_eventLabel.
          FILTER((LANG(?_eventLabel)) = "en")
          OPTIONAL { ?_event wdt:P10692 ?_dblpEventId. }
          OPTIONAL {
            ?_event p:P179 ?_partOfTheEventSeriesStmt.
            ?_partOfTheEventSeriesStmt ps:P179 ?_eventSeries;
              pq:P1545 ?_eventSeriesOrdinal.
            ?_eventSeries rdfs:label ?_eventSeriesLabel.
            FILTER((LANG(?_eventSeriesLabel)) = "en")
          }
        }
        GROUP BY ?item
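
The merge step of the two-phase approach can be tried out in isolation with plain dictionaries. The sketch below is not the project's implementation: it swaps the `LOD.getLookup` call for a simple dict comprehension, `merge_events` is a hypothetical name, and the event values in the sample data are made up for illustration.

```python
from typing import Dict, List


def merge_events(
    proceedings: List[Dict[str, str]],
    events: List[Dict[str, str]],
    key: str = "item",
) -> List[Dict[str, str]]:
    """Add the aggregated event fields to each proceedings record by key."""
    # phase 2 result indexed by item; a later duplicate overwrites an earlier one
    events_by_item = {event[key]: event for event in events if key in event}
    merged = []
    for record in proceedings:
        combined = dict(record)  # copy so the input records stay untouched
        combined.update(events_by_item.get(record.get(key), {}))
        merged.append(combined)
    return merged


if __name__ == "__main__":
    item = "http://www.wikidata.org/entity/Q107266045"
    proceedings = [{"item": item, "sVolume": "1"}, {"item": "other"}]
    # hypothetical event record, as the EventsByProceeding query might return it
    events = [{"item": item, "eventLabel": "Example Workshop 2021"}]
    for row in merge_events(proceedings, events):
        print(row)
```

Proceedings without a linked event simply keep their original fields, which matches the behavior of an OPTIONAL join without running it on the endpoint.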

@tholzheim
Collaborator

Please create an issue with QLever for

xsd:integer(?sVolume) is not supported by QLever

upstream

see ad-freiburg/qlever#853

@VladimirAlexiev

@WolfgangFahl
I think the main problem with the query is that using OPTIONAL with multi-valued fields causes a Cartesian product (explosion). If the variables have N1, N2, N3 values, then the result set contains N1*N2*N3 rows.

  • If you have this problem, doing DISTINCT is too late: you've already exploded the result set, and DISTINCT has to do a lot of work (materialize the whole result set, sort it, uniquify it)
  • GROUP_CONCAT is not needed on single-valued fields, and is wrong on multi-valued fields (you'd get the same value N1 repeated N2*N3 times)

Use UNION instead of OPTIONAL
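
The difference in row counts can be illustrated without a SPARQL endpoint. The following sketch only simulates the two join shapes for a single item; `optional_rows` and `union_rows` are hypothetical names, and the sample values are made up.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[str], ...]


def optional_rows(*value_lists: Sequence[str]) -> List[Row]:
    """Independent OPTIONALs on one item: every combination becomes a row."""
    # an OPTIONAL without a match still contributes one unbound (None) value
    padded = [list(values) or [None] for values in value_lists]
    return list(product(*padded))


def union_rows(*value_lists: Sequence[str]) -> List[Row]:
    """UNION branches: one row per value, all other variables unbound."""
    rows: List[Row] = []
    for index, values in enumerate(value_lists):
        for value in values:
            row: List[Optional[str]] = [None] * len(value_lists)
            row[index] = value
            rows.append(tuple(row))
    return rows


if __name__ == "__main__":
    events = ["e1", "e2", "e3"]
    urls = ["u1", "u2"]
    languages = ["en", "de"]
    print(len(optional_rows(events, urls, languages)))  # 3 * 2 * 2 rows
    print(len(union_rows(events, urls, languages)))     # 3 + 2 + 2 rows
```

With UNION the client has to regroup the rows per item, but the result size grows with the sum of the value counts and not with their product.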

@WolfgangFahl WolfgangFahl modified the milestones: 1.0, 0.2 Feb 20, 2023
@WolfgangFahl WolfgangFahl modified the milestones: 0.2.0, 0.3 Mar 22, 2023
@WolfgangFahl WolfgangFahl changed the title Named Query "Proceedings" to slow and incompatible with QLever Named Query "Proceedings" too slow and incompatible with QLever Dec 28, 2023
@WolfgangFahl WolfgangFahl modified the milestones: 0.3, 0.4 Dec 28, 2023
@WolfgangFahl WolfgangFahl modified the milestones: 0.4, 0.4.1 Jul 25, 2024