Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array representations within FHIRPath queries #1759

Open
2 of 6 tasks
johngrimes opened this issue Oct 3, 2023 · 3 comments
Open
2 of 6 tasks

Array representations within FHIRPath queries #1759

johngrimes opened this issue Oct 3, 2023 · 3 comments
Assignees
Labels
refactoring Improving the design of existing code

Comments

@johngrimes
Copy link
Member

johngrimes commented Oct 3, 2023

This issue represents the work to refactor the FHIRPath query engine to use arrays and array operations to represent FHIRPath expressions. This will have a number of benefits, including the elimination of joins and the simplification of query plans.

This work was initiated within #1438, but we would like to integrate this work back into main without the SQL on FHIR views implementation itself. This will allow us to evaluate this change separately, and control the magnitude of the change.

Outstanding tasks:

  • Create a new branch (off fhir-views) and a PR for this change
  • Do a partial implementation on aggregate to determine overall feasibility
  • Complete implementation of FHIRPath functions, operators and other syntax elements
  • Reactivate aggregate and extract executors and refactor to work with new implementation
  • Reactivate test suite and refactor as necessary
  • Update benchmarking and evaluate relative to main
@johngrimes johngrimes added the refactoring Improving the design of existing code label Oct 3, 2023
@github-project-automation github-project-automation bot moved this to Backlog in Pathling Oct 3, 2023
@johngrimes johngrimes moved this from Backlog to In progress in Pathling Oct 3, 2023
@johngrimes
Copy link
Member Author

I have found an issue with the implementation on issue/1759, given the following query:

observations = data.view(
    "Observation",
    select=[
        {
            "column": [
                {
                    "description": "Observation ID",
                    "path": "getResourceKey()",
                    "name": "id",
                },
                {
                    "description": "Patient ID",
                    "path": "subject.getReferenceKey()",
                    "name": "patient_id",
                },
                {
                    "description": "Observation date",
                    "path": "effectiveDateTime",
                    "name": "date",
                },
            ],
            "select": [
                {
                    "forEach": "code.coding",
                    "column": [
                        {
                            "description": "Observation code",
                            "path": "code",
                            "name": "code",
                        },
                    ],
                },
                {
                    "forEach": "value.ofType(Quantity)",
                    "column": [
                        {
                            "description": "Total cholesterol unit",
                            "path": "unit",
                            "name": "unit",
                        },
                        {
                            "description": "Total cholesterol value",
                            "path": "value",
                            "name": "value",
                        },
                    ],
                },
            ],
        }
    ],
    where=[
        {
            "description": "Total cholesterol > 240 mg/dL",
            "path": "where(code.coding.where(system = 'http://loinc.org').exists(code = '2093-3')).value.ofType(Quantity) > 240 'mg/dL'",
        }
    ],
)

observations.explain()

I get the following plan:

{'resource': 'Observation', 'select': [{'column': [{'description': 'Observation ID', 'path': 'getResourceKey()', 'name': 'id'}, {'description': 'Patient ID', 'path': 'subject.getReferenceKey()', 'name': 'patient_id'}, {'description': 'Observation date', 'path': 'effectiveDateTime', 'name': 'date'}], 'select': [{'forEach': 'code.coding', 'column': [{'description': 'Observation code', 'path': 'code', 'name': 'code'}]}, {'forEach': 'value.ofType(Quantity)', 'column': [{'description': 'Total cholesterol unit', 'path': 'unit', 'name': 'unit'}, {'description': 'Total cholesterol value', 'path': 'value', 'name': 'value'}]}]}], 'where': [{'description': 'Total cholesterol > 240 mg/dL', 'path': "where(code.coding.where(system = 'http://loinc.org').exists(code = '2093-3')).value.ofType(Quantity) > 240 'mg/dL'"}]}
== Physical Plan ==
Project [Observation#974.id_versioned AS id#832, Observation#974.subject.reference AS patient_id#834, Observation#974.effectiveDateTime AS date#836, @8ehd09#982.code AS code#838, Observation#974.valueQuantity.unit AS unit#840, Observation#974.valueQuantity.value AS value#842]
+- Filter isnotnull(@8ehd09#982)
   +- Generate explode(Observation#974.code.coding), [Observation#974], false, [@8ehd09#982]
      +- Project [struct(id, id#883, id_versioned, id_versioned#884, meta, meta#885, implicitRules, implicitRules#886, language, language#887, text, text#888, identifier, identifier#889, basedOn, basedOn#890, partOf, partOf#891, status, status#892, category, category#893, code, code#894, ... 62 more fields) AS Observation#974]
         +- Filter (isnotnull(valueQuantity#909) AND CASE WHEN CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894.coding, lambdafunction((lambda x_0#1015.system = http://loinc.org), lambda x_0#1015, false)), lambdafunction((lambda x_2#1023.code = 2093-3), lambda x_2#1023, false))) THEN (size(filter(filter(code#894.coding, lambdafunction((lambda x_0#1015.system = http://loinc.org), lambda x_0#1015, false)), lambdafunction((lambda x_2#1023.code = 2093-3), lambda x_2#1023, false)), true) = 0) ELSE true END THEN (valueQuantity#909._code_canonicalized = g.m-3) ELSE false END THEN UDF(CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894.coding, lambdafunction((lambda x_1#1016.system = http://loinc.org), lambda x_1#1016, false)), lambdafunction((lambda x_3#1024.code = 2093-3), lambda x_3#1024, false))) THEN (size(filter(filter(code#894.coding, lambdafunction((lambda x_1#1016.system = http://loinc.org), lambda x_1#1016, false)), lambdafunction((lambda x_3#1024.code = 2093-3), lambda x_3#1024, false)), true) = 0) ELSE true END THEN valueQuantity#909._value_canonicalized END, [2400,0]) ELSE false END)
            +- FileScan parquet [id#883,id_versioned#884,meta#885,implicitRules#886,language#887,text#888,identifier#889,basedOn#890,partOf#891,status#892,category#893,code#894,subject#895,focus#896,encounter#897,effectiveDateTime#898,effectiveInstant#899,effectivePeriod#900,effectiveTiming#901,issued#902,performer#903,valueBoolean#904,valueCodeableConcept#905,valueDateTime#906,... 19 more fields] Batched: false, DataFilters: [isnotnull(valueQuantity#909), CASE WHEN CASE WHEN NOT CASE WHEN isnotnull(filter(filter(code#894..., Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[file:/Users/gri306/Library/CloudStorage/OneDrive-CSIRO/Data/synth..., PartitionFilters: [], PushedFilters: [IsNotNull(valueQuantity)], ReadSchema: struct<id:string,id_versioned:string,meta:struct<id:string,versionId:string,versionId_versioned:s...

It would be better if we did not have to retrieve all of the columns for this query, only the ones that are required.

@johngrimes
Copy link
Member Author

I have merged main into issue/1759, and updated the version to 6.5.0-SNAPSHOT.

@johngrimes
Copy link
Member Author

@piotrszul Would you mind creating a pull request for this change, so that it is easier to follow activity on the branch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring Improving the design of existing code
Projects
Status: In progress
Development

No branches or pull requests

2 participants