[NOTES] Experiences with dbt-synthea for Big Patient Datasets #91
Comments
Fantastic, thanks @TheCedarPrince ! I think I'll try it on my end as well so we have 2x OS & laptop worth of data on performance :) |
Nice work @TheCedarPrince ! I was playing around with this in Snowflake over the weekend and generated a million patients, and started running it there to see how it would benchmark compared to duckdb - keen to hear how you get on! |
Question for folks: Can I run dbt-synthea on "chunks" of data I have manually split up (Meaning, I have folders like |
Also, @katy-sadowski , I was going to tinker a little later with getting hyperfine to work with this process for benchmarking purposes. It's considered the state of the art for benchmarking and is used a ton in numerical and HPC environments -- especially in areas like high-energy physics. I ran into some problems yesterday when I tried to naively run it, so there might be some system environment quirks. |
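(As a reference point, a minimal hyperfine benchmark of the full build could look like the sketch below -- a hypothetical invocation assuming dbt and hyperfine are both on the PATH, with illustrative warmup/run counts:

# one warmup run, then three measured runs of the whole dbt build
hyperfine --warmup 1 --runs 3 'dbt run'
)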
Yeah you definitely can with duckdb with next to no effort - you'd have to change the sources to glob like this! |
Interesting -- what files would need to be changed to accommodate this in
So I would want to |
So there are two approaches:
I would modify the sources file (dbt-synthea/models/staging/synthea/_synthea__sources.yml, lines 1 to 24 at 4c34141) to:

version: 2

sources:
  - name: synthea
    meta:
      external_location: "/path/to/synthea/output/{name}/*.csv"
    tables:
      - name: allergies
      - name: careplans
      - name: claims_transactions
      - name: claims
      - name: conditions
      - name: devices
      - name: encounters
      - name: imaging_studies
      - name: immunizations
      - name: medications
      - name: observations
      - name: organizations
      - name: patients
      - name: payer_transitions
      - name: payers
      - name: procedures
      - name: providers
      - name: supplies

and it should work without having to have an explicit load step 🚀 Does that make sense? It blew my mind the first time I realised I could do this! Makes working with partitioned CSV/Parquet files so much easier 😄 |
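(As a quick sanity check of the glob approach, DuckDB can query the partitioned CSVs directly -- a hypothetical query with an illustrative path, which is exactly why no explicit load step is needed:

-- count patients across every chunk folder in one scan
SELECT count(*) AS n_patients
FROM read_csv_auto('/path/to/synthea/output/*/patients.csv');
)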
Latest updates @lawrenceadams and @katy-sadowski : I generated about one million synthetic patients (with some dead patients included), each with a 1 year lookback, using Synthea (see my `synthea.properties`).
And then I did the following:
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt deps
14:58:36 Running with dbt=1.8.7
14:58:37 Installing dbt-labs/dbt_utils
14:58:37 Installed from version 1.3.0
14:58:37 Up to date!
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: false}"
14:59:23 Running with dbt=1.8.7
14:59:23 Registered adapter: duckdb=1.8.0
14:59:24 Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
Following this, I loaded the vocab files:
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"
15:12:50 Running with dbt=1.8.7
15:12:51 Registered adapter: duckdb=1.8.0
15:12:51 Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt seed --select states omop
15:13:48 Running with dbt=1.8.7
15:13:48 Registered adapter: duckdb=1.8.0
15:13:49 Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
15:13:49
15:13:51 Concurrency: 1 threads (target='dev')
15:13:51
15:13:51 1 of 1 START seed file dbt_synthea_dev_map_seeds.states ........................ [RUN]
15:13:52 1 of 1 OK loaded seed file dbt_synthea_dev_map_seeds.states .................... [INSERT 51 in 1.71s]
15:13:52
15:13:52 Finished running 1 seed in 0 hours 0 minutes and 3.67 seconds (3.67s).
15:13:53
15:13:53 Completed successfully
15:13:53
15:13:53 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
And then I ran:
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run
15:13:57 Running with dbt=1.8.7
15:13:57 Registered adapter: duckdb=1.8.0
15:13:58 Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
15:13:58
15:13:58 Concurrency: 1 threads (target='dev')
15:13:58
15:13:58 1 of 85 START sql table model dbt_synthea_dev.dose_era ......................... [RUN]
15:14:00 1 of 85 OK created sql table model dbt_synthea_dev.dose_era .................... [OK in 1.78s]
15:14:00 2 of 85 START sql table model dbt_synthea_dev.episode .......................... [RUN]
15:14:01 2 of 85 OK created sql table model dbt_synthea_dev.episode ..................... [OK in 1.66s]
15:14:01 3 of 85 START sql table model dbt_synthea_dev.episode_event .................... [RUN]
15:14:03 3 of 85 OK created sql table model dbt_synthea_dev.episode_event ............... [OK in 1.50s]
15:14:03 4 of 85 START sql table model dbt_synthea_dev.fact_relationship ................ [RUN]
15:14:04 4 of 85 OK created sql table model dbt_synthea_dev.fact_relationship ........... [OK in 1.23s]
15:14:04 5 of 85 START sql table model dbt_synthea_dev.metadata ......................... [RUN]
15:14:05 5 of 85 OK created sql table model dbt_synthea_dev.metadata .................... [OK in 1.21s]
15:14:05 6 of 85 START sql table model dbt_synthea_dev.note ............................. [RUN]
15:14:06 6 of 85 OK created sql table model dbt_synthea_dev.note ........................ [OK in 1.29s]
15:14:06 7 of 85 START sql table model dbt_synthea_dev.note_nlp ......................... [RUN]
15:14:08 7 of 85 OK created sql table model dbt_synthea_dev.note_nlp .................... [OK in 1.25s]
15:14:08 8 of 85 START sql table model dbt_synthea_dev.specimen ......................... [RUN]
15:14:09 8 of 85 OK created sql table model dbt_synthea_dev.specimen .................... [OK in 1.26s]
15:14:09 9 of 85 START sql table model dbt_synthea_dev.stg_map__states .................. [RUN]
15:14:10 9 of 85 OK created sql table model dbt_synthea_dev.stg_map__states ............. [OK in 1.23s]
15:14:10 10 of 85 START sql table model dbt_synthea_dev.stg_synthea__allergies .......... [RUN]
15:14:12 10 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__allergies ..... [OK in 1.62s]
15:14:12 11 of 85 START sql table model dbt_synthea_dev.stg_synthea__careplans .......... [RUN]
15:14:14 11 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__careplans ..... [OK in 2.10s]
15:14:14 12 of 85 START sql table model dbt_synthea_dev.stg_synthea__claims ............. [RUN]
15:15:24 12 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__claims ........ [OK in 69.77s]
15:15:24 13 of 85 START sql table model dbt_synthea_dev.stg_synthea__claims_transactions [RUN]
15:18:07 13 of 85 ERROR creating sql table model dbt_synthea_dev.stg_synthea__claims_transactions [ERROR in 163.53s]
15:18:07 14 of 85 START sql table model dbt_synthea_dev.stg_synthea__conditions ......... [RUN]
15:18:15 14 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__conditions .... [OK in 8.14s]
15:18:15 15 of 85 START sql table model dbt_synthea_dev.stg_synthea__devices ............ [RUN]
15:18:17 15 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__devices ....... [OK in 1.63s]
15:18:17 16 of 85 START sql table model dbt_synthea_dev.stg_synthea__encounters ......... [RUN]
15:18:33 16 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__encounters .... [OK in 16.23s]
15:18:33 17 of 85 START sql table model dbt_synthea_dev.stg_synthea__imaging_studies .... [RUN]
15:18:50 17 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__imaging_studies [OK in 16.53s]
15:18:50 18 of 85 START sql table model dbt_synthea_dev.stg_synthea__immunizations ...... [RUN]
15:18:52 18 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__immunizations . [OK in 2.35s]
15:18:52 19 of 85 START sql table model dbt_synthea_dev.stg_synthea__medications ........ [RUN]
15:19:12 19 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__medications ... [OK in 19.96s]
15:19:12 20 of 85 START sql table model dbt_synthea_dev.stg_synthea__observations ....... [RUN]
15:19:33 20 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__observations .. [OK in 21.33s]
15:19:33 21 of 85 START sql table model dbt_synthea_dev.stg_synthea__organizations ...... [RUN]
15:19:35 21 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__organizations . [OK in 1.71s]
15:19:35 22 of 85 START sql table model dbt_synthea_dev.stg_synthea__patients ........... [RUN]
15:19:38 22 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__patients ...... [OK in 2.91s]
15:19:38 23 of 85 START sql table model dbt_synthea_dev.stg_synthea__payer_transitions .. [RUN]
15:19:54 23 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__payer_transitions [OK in 15.85s]
15:19:54 24 of 85 START sql table model dbt_synthea_dev.stg_synthea__payers ............. [RUN]
15:19:56 24 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__payers ........ [OK in 1.95s]
15:19:56 25 of 85 START sql table model dbt_synthea_dev.stg_synthea__procedures ......... [RUN]
15:20:00 25 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__procedures .... [OK in 4.16s]
15:20:00 26 of 85 START sql table model dbt_synthea_dev.stg_synthea__providers .......... [RUN]
15:20:02 26 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__providers ..... [OK in 1.95s]
15:20:02 27 of 85 START sql table model dbt_synthea_dev.stg_synthea__supplies ........... [RUN]
15:20:04 27 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__supplies ...... [OK in 2.19s]
15:20:04 28 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept ......... [RUN]
15:20:08 28 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept .... [OK in 3.58s]
15:20:08 29 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [RUN]
15:20:11 29 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [OK in 3.40s]
15:20:11 30 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_class ... [RUN]
15:20:13 30 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_class [OK in 1.85s]
15:20:13 31 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_relationship [RUN]
15:20:18 31 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_relationship [OK in 4.74s]
15:20:18 32 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_synonym . [RUN]
15:20:20 32 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_synonym [OK in 2.36s]
15:20:20 33 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__domain .......... [RUN]
15:20:22 33 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__domain ..... [OK in 1.77s]
15:20:22 34 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__drug_strength ... [RUN]
15:20:24 34 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__drug_strength [OK in 2.13s]
15:20:24 35 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__relationship .... [RUN]
15:20:26 35 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__relationship [OK in 1.80s]
15:20:26 36 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__source_to_concept_map [RUN]
15:20:26 36 of 85 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__source_to_concept_map [ERROR in 0.06s]
15:20:26 37 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__vocabulary ...... [RUN]
15:20:28 37 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__vocabulary . [OK in 1.75s]
15:20:28 38 of 85 START sql table model dbt_synthea_dev.int__er_visits .................. [RUN]
15:20:31 38 of 85 OK created sql table model dbt_synthea_dev.int__er_visits ............. [OK in 3.11s]
15:20:31 39 of 85 START sql table model dbt_synthea_dev.int__ip_visits .................. [RUN]
15:20:35 39 of 85 OK created sql table model dbt_synthea_dev.int__ip_visits ............. [OK in 4.15s]
15:20:35 40 of 85 START sql table model dbt_synthea_dev.int__op_visits .................. [RUN]
15:20:57 40 of 85 OK created sql table model dbt_synthea_dev.int__op_visits ............. [OK in 22.01s]
15:20:57 41 of 85 START sql table model dbt_synthea_dev.care_site ....................... [RUN]
15:20:59 41 of 85 OK created sql table model dbt_synthea_dev.care_site .................. [OK in 1.77s]
15:20:59 42 of 85 START sql table model dbt_synthea_dev.int__person ..................... [RUN]
15:21:01 42 of 85 OK created sql table model dbt_synthea_dev.int__person ................ [OK in 2.49s]
15:21:01 43 of 85 START sql table model dbt_synthea_dev.location ........................ [RUN]
15:21:04 43 of 85 OK created sql table model dbt_synthea_dev.location ................... [OK in 2.64s]
15:21:04 44 of 85 START sql table model dbt_synthea_dev.provider ........................ [RUN]
15:21:06 44 of 85 OK created sql table model dbt_synthea_dev.provider ................... [OK in 1.96s]
15:21:06 45 of 85 START sql table model dbt_synthea_dev.concept ......................... [RUN]
15:21:11 45 of 85 OK created sql table model dbt_synthea_dev.concept .................... [OK in 4.84s]
15:21:11 46 of 85 START sql table model dbt_synthea_dev.int__source_to_source_vocab_map . [RUN]
15:21:16 46 of 85 OK created sql table model dbt_synthea_dev.int__source_to_source_vocab_map [OK in 5.19s]
15:21:16 47 of 85 START sql table model dbt_synthea_dev.concept_ancestor ................ [RUN]
15:21:21 47 of 85 OK created sql table model dbt_synthea_dev.concept_ancestor ........... [OK in 4.67s]
15:21:21 48 of 85 START sql table model dbt_synthea_dev.concept_class ................... [RUN]
15:21:23 48 of 85 OK created sql table model dbt_synthea_dev.concept_class .............. [OK in 1.81s]
15:21:23 49 of 85 START sql table model dbt_synthea_dev.concept_relationship ............ [RUN]
15:21:28 49 of 85 OK created sql table model dbt_synthea_dev.concept_relationship ....... [OK in 5.81s]
15:21:28 50 of 85 START sql table model dbt_synthea_dev.int__source_to_standard_vocab_map [RUN]
15:21:34 50 of 85 OK created sql table model dbt_synthea_dev.int__source_to_standard_vocab_map [OK in 5.33s]
15:21:34 51 of 85 START sql table model dbt_synthea_dev.concept_synonym ................. [RUN]
15:21:37 51 of 85 OK created sql table model dbt_synthea_dev.concept_synonym ............ [OK in 3.53s]
15:21:37 52 of 85 START sql table model dbt_synthea_dev.domain .......................... [RUN]
15:21:39 52 of 85 OK created sql table model dbt_synthea_dev.domain ..................... [OK in 1.79s]
15:21:39 53 of 85 START sql table model dbt_synthea_dev.drug_strength ................... [RUN]
15:21:43 53 of 85 OK created sql table model dbt_synthea_dev.drug_strength .............. [OK in 3.32s]
15:21:43 54 of 85 START sql table model dbt_synthea_dev.relationship .................... [RUN]
15:21:45 54 of 85 OK created sql table model dbt_synthea_dev.relationship ............... [OK in 1.92s]
15:21:45 55 of 85 SKIP relation dbt_synthea_dev.source_to_concept_map ................... [SKIP]
15:21:45 56 of 85 START sql table model dbt_synthea_dev.cdm_source ...................... [RUN]
15:21:46 56 of 85 OK created sql table model dbt_synthea_dev.cdm_source ................. [OK in 1.84s]
15:21:46 57 of 85 START sql table model dbt_synthea_dev.vocabulary ...................... [RUN]
15:21:48 57 of 85 OK created sql table model dbt_synthea_dev.vocabulary ................. [OK in 1.79s]
15:21:48 58 of 85 START sql table model dbt_synthea_dev.int__all_visits ................. [RUN]
15:22:16 58 of 85 OK created sql table model dbt_synthea_dev.int__all_visits ............ [OK in 28.17s]
15:22:16 59 of 85 START sql table model dbt_synthea_dev.person .......................... [RUN]
15:22:19 59 of 85 OK created sql table model dbt_synthea_dev.person ..................... [OK in 2.42s]
15:22:19 60 of 85 START sql table model dbt_synthea_dev.int__encounter_provider ......... [RUN]
15:22:26 60 of 85 OK created sql table model dbt_synthea_dev.int__encounter_provider .... [OK in 6.78s]
15:22:26 61 of 85 START sql table model dbt_synthea_dev.int__drug_immunisations ......... [RUN]
15:22:27 61 of 85 OK created sql table model dbt_synthea_dev.int__drug_immunisations .... [OK in 1.86s]
15:22:27 62 of 85 START sql table model dbt_synthea_dev.int__drug_medications ........... [RUN]
15:22:43 62 of 85 OK created sql table model dbt_synthea_dev.int__drug_medications ...... [OK in 15.90s]
15:22:43 63 of 85 START sql table model dbt_synthea_dev.int__observation_allergies ...... [RUN]
15:22:46 63 of 85 OK created sql table model dbt_synthea_dev.int__observation_allergies . [OK in 2.41s]
15:22:46 64 of 85 START sql table model dbt_synthea_dev.int__observation_conditions ..... [RUN]
15:22:51 64 of 85 OK created sql table model dbt_synthea_dev.int__observation_conditions [OK in 5.41s]
15:22:51 65 of 85 START sql table model dbt_synthea_dev.int__observation_observations ... [RUN]
15:22:57 65 of 85 OK created sql table model dbt_synthea_dev.int__observation_observations [OK in 6.05s]
15:22:57 66 of 85 START sql table model dbt_synthea_dev.int__assign_all_visit_ids ....... [RUN]
15:23:14 66 of 85 OK created sql table model dbt_synthea_dev.int__assign_all_visit_ids .. [OK in 17.05s]
15:23:14 67 of 85 START sql table model dbt_synthea_dev.death ........................... [RUN]
15:23:17 67 of 85 OK created sql table model dbt_synthea_dev.death ...................... [OK in 2.77s]
15:23:17 68 of 85 START sql table model dbt_synthea_dev.observation_period .............. [RUN]
15:23:20 68 of 85 OK created sql table model dbt_synthea_dev.observation_period ......... [OK in 2.63s]
15:23:20 69 of 85 START sql table model dbt_synthea_dev.payer_plan_period ............... [RUN]
15:23:52 69 of 85 OK created sql table model dbt_synthea_dev.payer_plan_period .......... [OK in 32.26s]
15:23:52 70 of 85 START sql table model dbt_synthea_dev.int__final_visit_ids ............ [RUN]
15:24:03 70 of 85 OK created sql table model dbt_synthea_dev.int__final_visit_ids ....... [OK in 10.63s]
15:24:03 71 of 85 START sql table model dbt_synthea_dev.condition_occurrence ............ [RUN]
15:24:15 71 of 85 OK created sql table model dbt_synthea_dev.condition_occurrence ....... [OK in 12.92s]
15:24:15 72 of 85 START sql table model dbt_synthea_dev.device_exposure ................. [RUN]
15:24:21 72 of 85 OK created sql table model dbt_synthea_dev.device_exposure ............ [OK in 5.36s]
15:24:21 73 of 85 START sql table model dbt_synthea_dev.drug_exposure ................... [RUN]
15:25:27 73 of 85 OK created sql table model dbt_synthea_dev.drug_exposure .............. [OK in 66.43s]
15:25:27 74 of 85 START sql table model dbt_synthea_dev.measurement ..................... [RUN]
15:26:11 74 of 85 OK created sql table model dbt_synthea_dev.measurement ................ [OK in 43.41s]
15:26:11 75 of 85 START sql table model dbt_synthea_dev.observation ..................... [RUN]
15:26:38 75 of 85 OK created sql table model dbt_synthea_dev.observation ................ [OK in 27.04s]
15:26:38 76 of 85 START sql table model dbt_synthea_dev.procedure_occurrence ............ [RUN]
15:26:47 76 of 85 OK created sql table model dbt_synthea_dev.procedure_occurrence ....... [OK in 8.86s]
15:26:47 77 of 85 START sql table model dbt_synthea_dev.visit_detail .................... [RUN]
15:27:10 77 of 85 OK created sql table model dbt_synthea_dev.visit_detail ............... [OK in 23.57s]
15:27:10 78 of 85 START sql table model dbt_synthea_dev.visit_occurrence ................ [RUN]
15:27:33 78 of 85 OK created sql table model dbt_synthea_dev.visit_occurrence ........... [OK in 23.03s]
15:27:33 79 of 85 START sql table model dbt_synthea_dev.condition_era ................... [RUN]
15:27:40 79 of 85 OK created sql table model dbt_synthea_dev.condition_era .............. [OK in 7.18s]
15:27:40 80 of 85 START sql table model dbt_synthea_dev.drug_era ........................ [RUN]
15:28:43 80 of 85 OK created sql table model dbt_synthea_dev.drug_era ................... [OK in 62.78s]
15:28:43 81 of 85 SKIP relation dbt_synthea_dev.int__cost_condition ..................... [SKIP]
15:28:43 82 of 85 START sql table model dbt_synthea_dev.int__cost_drug_exposure_1 ....... [RUN]
15:28:48 82 of 85 OK created sql table model dbt_synthea_dev.int__cost_drug_exposure_1 .. [OK in 4.84s]
15:28:48 83 of 85 START sql table model dbt_synthea_dev.int__cost_drug_exposure_2 ....... [RUN]
15:29:21 83 of 85 OK created sql table model dbt_synthea_dev.int__cost_drug_exposure_2 .. [OK in 32.66s]
15:29:21 84 of 85 START sql table model dbt_synthea_dev.int__cost_procedure ............. [RUN]
15:29:29 84 of 85 OK created sql table model dbt_synthea_dev.int__cost_procedure ........ [OK in 7.95s]
15:29:29 85 of 85 SKIP relation dbt_synthea_dev.cost .................................... [SKIP]
15:29:29
15:29:29 Finished running 85 table models in 0 hours 15 minutes and 31.08 seconds (931.08s).
15:29:29
15:29:29 Completed with 2 errors and 0 warnings:
15:29:29
15:29:29 Runtime Error in model stg_synthea__claims_transactions (models/staging/synthea/stg_synthea__claims_transactions.sql)
Out of Memory Error: failed to offload data block of size 256.0 KiB (16383.9 PiB/1.2 TiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'
15:29:29
15:29:29 Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql)
Parser Error: SELECT clause without selection list
15:29:29
15:29:29 Done. PASS=80 WARN=0 ERROR=2 SKIP=3 TOTAL=85
For reference, here are the sizes of the files I used:
|
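(For anyone hitting the same out-of-memory error above: the limits it names are ordinary DuckDB settings that can be raised before running the models. A minimal sketch with illustrative values -- how much headroom helps depends on available RAM and scratch disk:

PRAGMA memory_limit='16GB';
PRAGMA temp_directory='/big/scratch/duckdb_tmp';
PRAGMA max_temp_directory_size='200GiB';
)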
It does! Looking at my above comment, I now realize the problem is not so much the loading of the data but the ETL step -- I was unclear where the error was coming from, but I have narrowed it down above, where you can see the ETL fails.
So, what I would like is to load multiple patient files, chunked however I want (say, 100K patients per data generation), and then ETL those patients per generation into a final duckdb file that keeps growing as I add new patients. As it stands right now, my understanding is that it is one shot only: I cannot add new patients to an old duckdb database made with
Is that helping to frame my problem? |
Yeah this makes sense - I think a problem is that in this case the duckdb query planner fails and cannot figure out what is going on anymore:
Looks like the query planner has gone berserk there - unsure what is going on... Could you share what platform you're on and how much RAM you have? A fun problem 😆 I'm going to try tomorrow - let me know if you have any breakthroughs! |
Fedora 39 (Desktop)
Out of curiosity, is my intuition good here -- that chunking up how many patients I ETL each time would help? Or is this problem unrelated to the solution I sketched out above? |
I'd have thought that would be enough! Interesting... 🤔 Yes that's definitely a way of doing it! Two ways of doing this spring to my mind:
Not sure if either is super ideal but it is possible! |
One thing I hadn't appreciated is that every model gets materialised as a table - in reality, especially when using huge source sizes like we are now, it might be best to set the staging models (which do nothing but project) to materialize as views (see lines 20 to 22 at 8ff8911).
Otherwise we will have 4 copies of the data (loaded / staged / intermediate / omop)! |
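(A minimal sketch of that change, as it might look in dbt_project.yml -- assuming the staging models live under a staging/ folder; the synthea_omop_etl project name is taken from the logs above:

models:
  synthea_omop_etl:
    staging:
      +materialized: view

With views, the staging layer would take no extra storage; DuckDB re-reads the underlying sources on demand instead.)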
@TheCedarPrince I got a bit further and then realised I had forgotten to seed in the vocabulary 🙃 It did manage to get
dbt run --fail-fast
17:53:10 Running with dbt=1.8.7
17:53:11 Registered adapter: duckdb=1.8.0
17:53:11 Found 86 models, 29 seeds, 425 data tests, 29 sources, 538 macros
17:53:11
17:53:11 Concurrency: 1 threads (target='dev')
17:53:11
17:53:11 1 of 86 START sql table model dbt_synthea_dev.dose_era ......................... [RUN]
17:53:13 1 of 86 OK created sql table model dbt_synthea_dev.dose_era .................... [OK in 1.73s]
17:53:13 2 of 86 START sql table model dbt_synthea_dev.episode .......................... [RUN]
17:53:15 2 of 86 OK created sql table model dbt_synthea_dev.episode ..................... [OK in 1.80s]
17:53:15 3 of 86 START sql table model dbt_synthea_dev.episode_event .................... [RUN]
17:53:16 3 of 86 OK created sql table model dbt_synthea_dev.episode_event ............... [OK in 1.63s]
17:53:16 4 of 86 START sql table model dbt_synthea_dev.fact_relationship ................ [RUN]
17:53:18 4 of 86 OK created sql table model dbt_synthea_dev.fact_relationship ........... [OK in 1.61s]
17:53:18 5 of 86 START sql table model dbt_synthea_dev.metadata ......................... [RUN]
17:53:20 5 of 86 OK created sql table model dbt_synthea_dev.metadata .................... [OK in 1.81s]
17:53:20 6 of 86 START sql table model dbt_synthea_dev.note ............................. [RUN]
17:53:21 6 of 86 OK created sql table model dbt_synthea_dev.note ........................ [OK in 1.63s]
17:53:21 7 of 86 START sql table model dbt_synthea_dev.note_nlp ......................... [RUN]
17:53:23 7 of 86 OK created sql table model dbt_synthea_dev.note_nlp .................... [OK in 1.64s]
17:53:23 8 of 86 START sql table model dbt_synthea_dev.specimen ......................... [RUN]
17:53:25 8 of 86 OK created sql table model dbt_synthea_dev.specimen .................... [OK in 1.78s]
17:53:25 9 of 86 START sql table model dbt_synthea_dev.stg_map__states .................. [RUN]
17:53:26 9 of 86 OK created sql table model dbt_synthea_dev.stg_map__states ............. [OK in 1.64s]
17:53:26 10 of 86 START sql table model dbt_synthea_dev.stg_synthea__allergies .......... [RUN]
17:53:30 10 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__allergies ..... [OK in 3.08s]
17:53:30 11 of 86 START sql table model dbt_synthea_dev.stg_synthea__careplans .......... [RUN]
17:53:33 11 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__careplans ..... [OK in 3.35s]
17:53:33 12 of 86 START sql table model dbt_synthea_dev.stg_synthea__claims ............. [RUN]
17:54:25 12 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__claims ........ [OK in 52.08s]
17:54:25 13 of 86 START sql table model dbt_synthea_dev.stg_synthea__claims_transactions [RUN]
17:58:35 13 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__claims_transactions [OK in 249.68s]
17:58:35 14 of 86 START sql table model dbt_synthea_dev.stg_synthea__conditions ......... [RUN]
17:58:40 14 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__conditions .... [OK in 5.35s]
17:58:40 15 of 86 START sql table model dbt_synthea_dev.stg_synthea__devices ............ [RUN]
17:58:43 15 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__devices ....... [OK in 2.92s]
17:58:43 16 of 86 START sql table model dbt_synthea_dev.stg_synthea__encounters ......... [RUN]
17:58:52 16 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__encounters .... [OK in 8.92s]
17:58:52 17 of 86 START sql table model dbt_synthea_dev.stg_synthea__imaging_studies .... [RUN]
17:59:12 17 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__imaging_studies [OK in 19.78s]
17:59:12 18 of 86 START sql table model dbt_synthea_dev.stg_synthea__immunizations ...... [RUN]
17:59:15 18 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__immunizations . [OK in 3.68s]
17:59:15 19 of 86 START sql table model dbt_synthea_dev.stg_synthea__medications ........ [RUN]
17:59:23 19 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__medications ... [OK in 7.71s]
17:59:23 20 of 86 START sql table model dbt_synthea_dev.stg_synthea__observations ....... [RUN]
18:00:08 20 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__observations .. [OK in 44.97s]
18:00:08 21 of 86 START sql table model dbt_synthea_dev.stg_synthea__organizations ...... [RUN]
18:00:11 21 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__organizations . [OK in 3.00s]
18:00:11 22 of 86 START sql table model dbt_synthea_dev.stg_synthea__patients ........... [RUN]
18:00:15 22 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__patients ...... [OK in 3.84s]
18:00:15 23 of 86 START sql table model dbt_synthea_dev.stg_synthea__payer_transitions .. [RUN]
18:00:23 23 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__payer_transitions [OK in 7.10s]
18:00:23 24 of 86 START sql table model dbt_synthea_dev.stg_synthea__payers ............. [RUN]
18:00:26 24 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__payers ........ [OK in 2.98s]
18:00:26 25 of 86 START sql table model dbt_synthea_dev.stg_synthea__procedures ......... [RUN]
18:00:34 25 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__procedures .... [OK in 8.31s]
18:00:34 26 of 86 START sql table model dbt_synthea_dev.stg_synthea__providers .......... [RUN]
18:00:37 26 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__providers ..... [OK in 3.17s]
18:00:37 27 of 86 START sql table model dbt_synthea_dev.stg_synthea__supplies ........... [RUN]
18:00:40 27 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__supplies ...... [OK in 3.35s]
18:00:40 28 of 86 START sql table model dbt_synthea_dev.stg_vocabulary__concept ......... [RUN]
18:00:40 28 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept [ERROR in 0.06s]
18:00:40 29 of 86 START sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [RUN]
18:00:41 CANCEL query model.synthea_omop_etl.stg_vocabulary__concept_ancestor ........... [CANCEL]
18:00:41 29 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [ERROR in 0.04s]
18:00:41
18:00:41 Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
Parser Error: SELECT clause without selection list
18:00:41
18:00:41 Finished running 86 table models in 0 hours 7 minutes and 29.65 seconds (449.65s).
18:00:41
18:00:41 Completed with 59 errors and 0 warnings:
18:00:41
18:00:41 Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
Parser Error: SELECT clause without selection list
18:00:41 |
This actually is really quite reasonable for a naive and quick approach. I'll try that at some point probably tomorrow. As you know, it would have drawbacks of duplicating lots of information -- I immediately think of the vocab tables.
Hunh! Not sure what is involved in that approach but seems interesting!
AH! That's why my persistent DuckDB database file keeps getting so large! Makes sense.
Yea, I am unsure what this means as well... |
@TheCedarPrince can you check the compiled SQL for this? Regarding running dbt in batches, we need to be careful because certain tables are "shared" across patients like care_site, location, and provider. I'm guessing dbt's incremental mode would handle this sort of thing? But I haven't looked into it much. My overall take is I'd prefer to avoid these sorts of approaches if at all possible and rather understand:
As I understand it, incremental mode was designed for huge datasets with the need for very frequent updates, which is not the typical use case for OMOP CDMs. Finally, regarding the duplication of data across stage/int/marts - that's what I hope to address in #38 . Keeping all those tables around is only useful for debugging purposes while developing the ETL. |
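(For context on the mechanism mentioned above, dbt's incremental materialization is configured per model file. A hedged sketch of what that could look like here -- not something dbt-synthea implements today, and the unique key is illustrative:

{{ config(materialized='incremental', unique_key='observation_id') }}

select *
from {{ ref('stg_synthea__observations') }}
{% if is_incremental() %}
  -- on incremental runs, only pick up rows not already in the target table
  where observation_id not in (select observation_id from {{ this }})
{% endif %}
)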
Re: incremental, you're right: it's really made for updates every n hours/days etc. If duckdb is unable to cope then we'd probably need a different database engine in this instance, but this does seem like an extreme stress test/case. I'd test on Postgres but it'd almost definitely run out of space |
I have (with some tuning) eventually managed to get every table built... except observations, which sadly runs out of memory for me.
Here is the query plan for it for reference (condensed from DuckDB's physical plan output, keeping operators, join conditions, and row estimates):

PROJECTION observation_id, person_id, observation_concept_id, ... (~319014617 rows)
└─ WINDOW: ROW_NUMBER() OVER (ORDER BY person_id ASC NULLS LAST)
   └─ HASH_JOIN INNER: patient_id = person_source_value (~319014617 rows)
      ├─ HASH_JOIN LEFT: encounter_id = encounter_id, patient_id = patient_id (~253445155 rows)
      │  ├─ HASH_JOIN LEFT: encounter_id = encounter_id (~253445155 rows)
      │  │  ├─ UNION
      │  │  │  ├─ UNION (~29567613 rows)
      │  │  │  │  ├─ SEQ_SCAN int__observation_allergies (~719717 rows)
      │  │  │  │  └─ SEQ_SCAN int__observation_conditions (~28847896 rows)
      │  │  │  └─ SEQ_SCAN int__observation_observations (~223877542 rows)
      │  │  └─ SEQ_SCAN int__final_visit_ids (encounter_id, visit_occurrence_id_new) (~56156413 rows)
      │  └─ SEQ_SCAN int__encounter_provider (encounter_id, patient_id, provider_id) (~58246069 rows)
      └─ SEQ_SCAN person (person_source_value, person_id) (~1144346 rows)

It seems like DuckDB will try to spill to disk where it can, including for operations such as joins/orders/windowing - but the docs note that with many of these it may fail to use the disk - which is what I imagine is happening here! |
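(Plans like the one above come straight out of DuckDB: EXPLAIN prints the physical plan without executing, and EXPLAIN ANALYZE executes the query and reports per-operator timings. A hypothetical session against the dev schema:

EXPLAIN SELECT * FROM dbt_synthea_dev.observation;
-- execute and time each operator
EXPLAIN ANALYZE SELECT count(*) FROM dbt_synthea_dev.observation;
)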
I had some success running in Snowflake (XS Warehouse) - the same queries causing it to fail locally are causing a large spill to local warehouse storage. In total it took about 20 minutes to run the whole project (without running tests) (~0.3 credits ~ $0.80 pre-tax 💸). These are the queries with their metrics if of interest! Metrics (CSV) I will try to dig in and see what the worst offenders are, but at a glance it is measurements/observations; both of which have large joins with few predicates as far as I can see! |
I realized what this is. The vocab download doesn't come with a source_to_concept_map (duh). I'll file a ticket to handle this table in BYO vocab mode. Also, I'm looking at the observation SQL. I'm surprised to see there's nothing crazy going on in there. This
@lawrenceadams , what were the runtimes for the other models that succeeded / which were the slowest? |
Apologies @katy-sadowski , I ran out of time to dig into the results over the weekend! These are the 5 slowest models, and generally the ones that needed the most out-of-memory processing (arguably a moot point, as we could reduce the number of threads [threads=8] / use a bigger warehouse - but still interesting).
Looking at the most expensive nodes in observation: this join is the most expensive, followed by the one you linked to, followed by the window function. I agree @katy-sadowski that second join predicate looks redundant - although interestingly, running that model by itself with and without the join on the patient ID returns 79 fewer rows with the extra predicate (253447517 [join on encounter_id] - 253447438 [join on both encounter and patient ID] = 79), which is interesting... I'll need to inspect why! When running the models by themselves they're much faster (only done for observation, but it took 2 minutes instead of 12) - at some point I'll run on Snowflake with one thread to see what happens; it will be more useful! |
Interesting thanks @lawrenceadams !
In this case I wonder if it would be faster to move the joins down into the individual intermediate tables.
weird - agree that should be checked! are there encounters missing a patient ID in your source data? (there aren't any in the seed dataset). if so we probably want to handle that scenario explicitly in the ETL.
you mean |
Quite possibly, from memory there is no filtering that happens at any point so we'd have to pay the price at some point. Maybe worth trying! When I've done this on real EHR data I tend to have these joins consolidated before the end, and use the final models to do things like global transformations (e.g. join vocabulary where needed) - but we only do that once. I wonder how others in this space handle it!
Great shout - I did this and there are 79 cases which violate this assumption! No idea how this happens as I can't imagine it's a feature of Synthea? It might be worth re-building the dataset and seeing if it happens for others. Can others replicate?
In effect, yep - I re-ran it with |
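(A sketch of the check being discussed, for anyone who wants to replicate -- column names are assumed from the staging models referenced earlier in the thread: list observation rows whose patient disagrees with the patient recorded on the encounter:

SELECT o.encounter_id,
       o.patient_id AS observation_patient,
       e.patient_id AS encounter_patient
FROM stg_synthea__observations AS o
JOIN stg_synthea__encounters AS e ON o.encounter_id = e.encounter_id
WHERE o.patient_id <> e.patient_id;
)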
Maybe a good discussion topic for our next meeting.
I will try. I'm guessing it's a bug in Synthea (or maybe a weird feature like this one). Once I've got my mega dataset I can start experimenting with some of these optimization ideas too :) |
I can try replicating! What do you want me to try doing @lawrenceadams ? |
Great idea! |
Amazing!! Did you manage to make a huge synthea dataset? Is it possible to check if encounters have a different patient ID attached to them, as above? |
So you want me to generate a huge synthea dataset? Would 1 million patients with 3 or 5 year retrospective be enough to be "huge" for you @lawrenceadams? I can generate that and then check out the different patient ID. :) |
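(For scale reference, the population size is a Synthea command-line flag, while the lookback window comes from the properties file -- a hypothetical invocation using the stock run_synthea wrapper:

# -p sets the living population size; -c points at a custom properties file
./run_synthea -p 1000000 -c synthea.properties
)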
Hi guys! I finally had some time to run dbt on my million-patient Synthea dataset (duckdb, single-threaded run). Some initial findings to report:
|
Hey @katy-sadowski,
I ended up opening an issue as Discussions is not enabled for this repo.
Wanted to share some notes on my experiences using this solution for 1 million patients, each with a 5 year retrospective look back:
- `cffi` being needed -- it might be missing from the dependencies somehow ([BUG] `cffi` Python Package Missing but Required #88).
- Where the `dbt` file should go was a little vague (`dbt` Folder #89).
- `dbt run` fails on large tables in the OMOP Schema -- especially those tables which have a large lineage.
I am going to reproduce these errors tomorrow and actually generate 1 million patients from scratch using Synthea and re-run the pipeline to find all the tables that fail to be built.
Hope this helps folks!
~ tcp 🌳