Extract on QuestionnaireResponse takes a very long time #1444

piotrszul · 2023-05-22T05:23:52Z

The extract operation (see below) takes 2.24 minutes on dataricks compare to >10 seconds for the pure SQL based version.
This is on NDJSON dataset with three resources in total and one included with filters.

#
# Initialization code
#

from pathling import PathlingContext
from pathling import Expression as fpe
from pyspark.sql.functions import col, explode, explode_outer, first

pc = PathlingContext.create(spark, max_nesting_level=5)

ds = pc.read.ndjson("s3://pathling-demo/staging/devdays/questionnaire")


#
# SQL version
#

response_ds = ds.read('QuestionnaireResponse').cache()
ritem_ds = response_ds \
    .where(col('id') == 'f201') \
    .withColumn('item0', explode_outer('item')) \
    .withColumn('item1', explode_outer('item0.item')) \
    .withColumn('item2', explode_outer('item1.item')) \
    .select(col('id'), col('subject.reference').alias('subject'), col('item0.linkid').alias('link_id0'), col('item1.linkid').alias('link_id1'), col('item1.text'), col('item1.answer.valueString').getItem(0).alias('answer'))

fitems = ritem_ds.groupBy(col('id'), col('subject')).pivot("link_id1", ['2.1', '2.3', '2.4', '3.1']).agg(first('answer').alias('answer')) \
    .withColumnRenamed('2.1', 'gender') \
    .withColumnRenamed('2.3', 'country_of_birth') \
    .withColumnRenamed('2.4', 'marital_status') \
    .withColumnRenamed('3.1', 'smoker') 
display(fitems)

#
# Extract version
# 

x = ds.extract('QuestionnaireResponse', 
    [
        'id',
        fpe("subject.reference").alias("subject"),
        fpe("item.item.where(linkId='2.1').answer.valueString.first()").alias("gender"), 
        fpe("item.item.where(linkId='2.3').answer.valueString.first()").alias("country_of_birth"), 
        fpe("item.item.where(linkId='2.4').answer.valueString.first()").alias("maritial_status"), 
        fpe("item.item.where(linkId='3.1').answer.valueString.first()").alias("smoker"), 
    ], 
    filters=["id = 'f201'"])
display(x)

The text was updated successfully, but these errors were encountered:

johngrimes added the optimisation Improvement to the performance or efficiency of existing functionality label Jul 6, 2023

johngrimes added this to Pathling Jul 6, 2023

johngrimes moved this from Backlog to In progress in Pathling Jul 6, 2023

github-project-automation bot moved this to Backlog in Pathling Jul 6, 2023

johngrimes self-assigned this Jul 6, 2023

johngrimes added this to the SQL on FHIR Views milestone Jul 6, 2023

johngrimes assigned piotrszul and unassigned johngrimes Aug 22, 2023

johngrimes assigned johngrimes and unassigned piotrszul Jan 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract on QuestionnaireResponse takes a very long time #1444

Extract on QuestionnaireResponse takes a very long time #1444

piotrszul commented May 22, 2023

Extract on QuestionnaireResponse takes a very long time #1444

Extract on QuestionnaireResponse takes a very long time #1444

Comments

piotrszul commented May 22, 2023