
[2024 Symposium] Synthetic Data #345

Closed · Tracked by #356
kzollove opened this issue Aug 19, 2024 · 9 comments

kzollove commented Aug 19, 2024

Items to be completed in support of Synthetic Data generation in advance of the OHDSI Symposium

  • Consider using synthetic data with generated addresses in 3 locations (domestic & international)
    • Consider Boston, São Paulo, China, and a European city

Only using locations at the city level; referencing global PM2.5 data (1998-2016); synthetic patients take on city-level locations

In the workshop: a brief description of the synthetic dataset, with minimal detail on location generation (keyword: minimal)

TODO:

kzollove commented:

Jared loaded all Belgian place data into Nominatim and has a workflow in place; he will shift to one of the target cities (likely starting with Boston)

Geocoding needs to take place in a pre-step; it will likely need to be its own containerized process

TODO: Brief written description on how the Tufts Syntegra database will be divided into parts to represent separate EHRs

kzollove moved this from 🏷TODO to 🏃‍♀ In Progress in GIS Project Management, Sep 6, 2024

kzollove commented Sep 6, 2024

Update from Jared: geocoded a subset of Boston addresses using Nominatim; the process is scripted and references Docker images that download from OpenStreetMap or make direct calls to the Nominatim API (for downloading the OSM files, not for geocoding the addresses)

Geocoding will be added to the containerization, with the option to use DeGAUSS in the US or Nominatim outside of the US

Hitting issues with the load_variable step; we will chat outside of meeting to resolve


kzollove commented Sep 6, 2024

TODO: a high-level explainer on how to produce fake geocoded locations to link to real PM2.5 data

kzollove commented:

load_exposure no longer working @kzollove

kzollove commented:

load_exposure no longer working @kzollove

Resolved

kzollove commented:

Need PM2.5 dataset in gaia-db @kzollove

kzollove commented:

Send Jared a list of city names and a single coordinate for each city represented in the SEDACS dataset @tibbben


kzollove commented Oct 4, 2024

Jared created locations for the cities that Tim sent. He is using the Tufts Syntegra dataset to create mock EHRs

Jared will add synthetic LOCATION / LOCATION_HISTORY data, with links back to the Syntegra data, to the GIS repo

github-project-automation bot moved this from 🔒Blocked to ✔Done in GIS Project Management, Oct 10, 2024
jshoughtaling reopened this Oct 16, 2024

jshoughtaling commented Oct 16, 2024

GIS-Specific Synthetic Dataset

Description

The purpose of the GIS-specific synthetic data is to provide the foundation for an end-to-end demonstration that combines electronic health record (EHR) data in OMOP-CDM format with (1) datasets containing geospatial variables (e.g. average levels of pollutants in a region), and (2) terminology to capture those variables in an OMOP-CDM format.

This particular dataset is an augmented subset (~7000 individuals) of the full (~500k individuals) Tufts Synthetic Dataset, with a particular focus on individuals who have Chronic Obstructive Pulmonary Disease (COPD). We conducted the augmentation using the gaiaCore toolchain developed by the GIS workgroup, combined with a location-assignment approach that distributed a specified ratio of COPD versus non-COPD synthetic individuals across a subset of global cities with wide-ranging PM2.5 values. Once locations were assigned, we derived an EXPOSURE_OCCURRENCE table using pre-assigned 2B+ concept_id values that we could reference later in the downstream analytics workflow. The details of this augmentation process, including code and references, are described at length in the sections below.

Tufts Synthetic Data

The Tufts Synthetic Dataset is a set of completely synthetic electronic health record (EHR) data for approximately 500,000 fake patients. It was produced in 2021 through a collaboration between Syntegra, Inc. and Tufts Medical Center. A novel deep learning transformer model (the kind of model used by modern LLMs) developed by Syntegra was used to generate synthetic clinical data, including data on visits, conditions, drugs, measurements, procedures, observations, and device exposures. The model was trained on a version of the Tufts Research Data Warehouse (TRDW) that contained longitudinal EHR data on patients who received care at Tufts Medical Center. Both the TRDW training data and the synthetic data conform to version 5.3 of the OMOP common data model. Note that for the purposes of this GIS demonstration, we converted the data to OMOP version 5.4.

An expert determination of the HIPAA compliance of the dataset was conducted by Mirador Analytics Ltd. in 2022 (Report No SYN222P1a). It confirmed that the data are safe to share and to use without posing a risk to patient privacy. Analyses by Syntegra and Tufts Medical Center researchers confirmed the statistically realistic properties of the data through comparisons of descriptive statistics, treatment pathways, and prediction models on the synthetic and real data. The data also contain realistic data quality errors. This realism makes the dataset a useful asset in training researchers to work with OMOP-shaped data in all phases of observational research, from data quality assessment through analysis, using the tools and practices of the Observational Health Data Sciences and Informatics (OHDSI) community and other OMOP-using communities. Its realism and format also make it useful for testing software designed to work with OMOP-shaped data, and for preparing study packages that use OHDSI tools to define and conduct full observational research studies.

PM2.5 Dataset

The Annual PM2.5 Concentrations for Countries and Urban Areas, 1998-2016, dataset consists of mean concentrations of particulate matter (PM2.5) for countries and urban areas (see the dataset manual for more details). The PM2.5 data are from the Global Annual PM2.5 Grids from MODIS, MISR and SeaWiFS Aerosol Optical Depth (AOD) with GWR, 1998-2016. The urban areas are from the Global Rural-Urban Mapping Project, Version 1 (GRUMPv1): Urban Extent Polygons, Revision 02, and their time series runs from 1998 to 2016. The country averages are population-weighted, such that concentrations in populated areas count more toward the country average than concentrations in less populated areas; the country-level time series runs from 2008 to 2015.

Analytic Use Case

While the analytic approach and associated results are described in more detail elsewhere, the general motivation for creating this dataset was to support a patient-level prediction (PLP) model that predicts the risk of COPD for a particular individual given pollutant levels in their city of residence together with their EHR data.

Data Processing

We first converted the Tufts Synthetic Data to OMOP version 5.4 using a set of SQL scripts against a Databricks instance. In the same schema, we inserted the GIS terminology into the standard OMOP vocabulary tables, and then filled the LOCATION, LOCATION_HISTORY, and EXPOSURE_OCCURRENCE tables as described in the subsections below.
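As a concrete illustration (not the exact rows we loaded), inserting one of the 2-billion-range GIS concepts into the standard CONCEPT table could look like the sketch below. The concept_id 2052499839 is the annual-average PM2.5 concept referenced later in the EXPOSURE_OCCURRENCE query; the name, domain, vocabulary, class, and code shown here are placeholder assumptions.

-- Illustrative sketch only: load one 2B+ GIS concept into the OMOP CONCEPT table.
-- 2052499839 is the annual-average PM2.5 concept used later in EXPOSURE_OCCURRENCE;
-- the name, domain, vocabulary, class, and code below are placeholders, not the actual values.
INSERT INTO concept
  (concept_id, concept_name, domain_id, vocabulary_id, concept_class_id,
   standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason)
VALUES
  (2052499839, 'Average annual PM2.5 concentration', 'Observation', 'GIS',
   'Variable', NULL, 'pm25_annual_avg', DATE '1998-01-01', DATE '2099-12-31', NULL);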

COPD Cohort Definition

In order to capture those patients fitting our desired COPD phenotype, we executed a simple cohort creation query that referenced a broad COPD concept (255573) and all of its descendants:

-- Build the COPD cohort: all persons with a condition that is the broad COPD
-- concept (255573) or any of its descendants.
CREATE TABLE copd_cohort AS (
  WITH copd_desc AS (
    -- All descendant concepts of the broad COPD concept
    SELECT descendant_concept_id AS cid
    FROM concept_ancestor
    WHERE ancestor_concept_id = 255573),
  copd_concepts AS (
    -- Descendant concepts that actually appear in CONDITION_OCCURRENCE
    SELECT cn.concept_id, cn.concept_name
    FROM condition_occurrence co
    INNER JOIN copd_desc cd
      ON co.condition_concept_id = cd.cid
    INNER JOIN concept cn
      ON cd.cid = cn.concept_id
    GROUP BY cn.concept_id, cn.concept_name),
  copd_patients AS (
    -- Distinct persons with at least one qualifying condition record
    SELECT DISTINCT co.person_id
    FROM condition_occurrence co
    INNER JOIN copd_concepts cc
      ON cc.concept_id = co.condition_concept_id
  )
  SELECT p.*
  FROM copd_patients cp
  INNER JOIN person p
    ON p.person_id = cp.person_id
);

Note that we also integrated the GIS-specific synthetic data into an Atlas instance, and created and applied the cohort definition there as well.

Location Assignment

We referred to two recent works that describe (1) the general global prevalence of COPD as of 2023, and (2) the relationship between the prevalence of COPD and local concentrations of PM2.5:

  1. Boers, Elroy, et al. "Global burden of chronic obstructive pulmonary disease through 2050." JAMA Network Open 6.12 (2023): e2346598-e2346598.
  2. Liu, Sha, et al. "Association between exposure to ambient particulate matter and chronic obstructive pulmonary disease: results from a cross-sectional study in China." Thorax 72.9 (2017): 788-795.

From these two papers, we derived a very crude relationship between Odds Ratios (OR) of COPD versus concentration of PM2.5:

[Figure: estimated odds ratio (OR) of COPD versus PM2.5 concentration]

We then selected 20 cities evenly distributed along this concentration range, using the medians of their 18-year annual concentration data in the PM2.5 dataset. With these 20 cities, we used the estimated OR relationship above to calculate a crude distribution of cases versus non-cases, such that all of the individuals with COPD in the Tufts Synthetic Dataset were included (a sketch of this calculation follows the table below). Note that we pulled these cities directly from the PM2.5 data, and in that dataset they had already been assigned latitude and longitude point values; we carried those values through the rest of the process instead of needing to geocode based on the city name.

CITY COUNTRY LATITUDE LONGITUDE MEDIAN PM2.5 CASE NOT CASE OR
BULANDSHAHR INDIA 28.40449524 77.85832214 95.94 121 233 4.67
WANGDU CHINA 38.71282768 115.1666565 85.59 115 239 4.330
JINGHAI CHINA 38.93782806 116.9374886 74.41 109 244 4.02
SARSAWAN INDIA 30.00217158 77.34992132 63.93 100 253 3.56
DOKKHAMTAI THAILAND 19.17445183 99.96205373 55.26 88 265 2.99
LAHORE PAKISTAN 31.49387074 74.35156631 47.86 77 277 2.50
KINSHASA CONGO DR -4.397176266 15.33447051 35.82 53 300 1.59
JOHANNESBURG SOUTH AFRICA -26.17050266 28.0999918 33.24 50 303 1.49
KATOWICE POLAND 50.22116089 18.97915935 28.02 44 309 1.28
PAVIA ITALY 45.20035744 9.183955636 25.69 41 312 1.18
HUARMEY PERU -10.07883644 -78.14238828 20.18 38 315 1.09
SCHOUWENDUIVELAND NETHERLANDS 51.64616013 3.924992681 17.24 38 315 1.09
FRESNO USA 36.69616127 -119.6958351 15.47 38 315 1.09
BUFFALO USA 42.89597574 -78.67471096 10.51 35 318 0.99
LAPLAYOSA ARGENTINA -32.09550285 -63.03333855 4.98 35 318 0.99
PERTH AUSTRALIA -32.03704071 115.975156 2.14 32 321 0.90
FORKS USA 47.94616127 -124.3791656 2.1 32 321 0.90
SITKA USA 57.07949448 -135.3333359 0.7 29 324 0.81
ALEXANDRA NEW ZEALAND -45.24550247 169.4166565 0.6 29 324 0.81
SUVA FIJI -18.07050228 178.4999847 0.18 27 327 0.74
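One way to realize this split (a sketch, not the script we actually used): treat the odds of COPD in each city as proportional to its estimated OR, scale those odds by a constant chosen so that the summed case counts match the ~1100 COPD individuals, and divide a roughly equal per-city total (about 353 individuals) accordingly. The city_or table, its columns, and the 0.11 scaling constant below are illustrative assumptions.

-- Illustrative sketch (not the script actually used): derive per-city case / non-case counts
-- from the estimated ORs, assuming the odds of COPD in a city are proportional to its OR.
-- city_or(city, estimated_or) and the 0.11 scaling constant are assumptions; the constant
-- would be tuned so that SUM(n_case) across cities matches the ~1100 COPD individuals.
WITH scaled AS (
    SELECT city,
           estimated_or,
           (0.11 * estimated_or) / (1 + 0.11 * estimated_or) AS p_case   -- odds -> probability
    FROM city_or
)
SELECT city,
       CAST(ROUND(353 * p_case) AS INT)       AS n_case,       -- ~353 synthetic individuals per city
       353 - CAST(ROUND(353 * p_case) AS INT) AS n_not_case
FROM scaled;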

Case Sampling

Once we had defined the case/non-case distribution across the 20 locations, we set out to create a non-COPD subsample from the Tufts Synthetic Dataset with an age and gender distribution aligned with that of the existing COPD individuals.

We calculated a rough age distribution of the entire COPD cohort, as well as a gender split, and used these values within a set of nested subqueries to capture a crudely representative cohort without COPD. We then pooled these ~6500 non-COPD individuals with the ~1100 COPD individuals and randomly assigned them to the different locations based on the ratios derived above. We've included an auto-generated SQL script that randomly selects patients into a LOCATION_ASSIGNMENT table, which serves as a precursor to the LOCATION_HISTORY table below; a simplified sketch of that assignment step follows.
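A minimal sketch of the random assignment step, assuming a cohort_flagged table holding the pooled COPD and non-COPD individuals and a city_quota table holding the per-city case and non-case targets with a running offset; all table and column names here are illustrative, and this is not the auto-generated script itself.

-- Illustrative sketch of the random location assignment (not the auto-generated script).
-- Assumed inputs: cohort_flagged(person_id, is_copd) with the ~1100 COPD and ~6500 non-COPD
-- individuals, and city_quota(location_id, is_copd, quota, quota_start), where quota_start is
-- the cumulative count of same-flag individuals assigned to earlier cities.
CREATE OR REPLACE TABLE location_assignment AS (
    WITH shuffled AS (
        SELECT person_id,
               is_copd,
               ROW_NUMBER() OVER (PARTITION BY is_copd ORDER BY RAND()) AS rn  -- random order within flag
        FROM cohort_flagged
    )
    SELECT s.person_id,
           q.location_id
    FROM shuffled s
    INNER JOIN city_quota q
        ON s.is_copd = q.is_copd
       AND s.rn >  q.quota_start
       AND s.rn <= q.quota_start + q.quota
);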

Location History

For the purposes of this demo, we made some general assumptions to simplify the creation of synthetic data and interpretation of downstream analyses:

  • All individuals in the synthetic dataset lived within one of the 20 urban areas selected from the PM2.5 dataset, either for the entire period of 1998-2016, or from their birthdate to 2016 if they were born after 1998-01-01. We did not represent any movement between locations.
  • Apart from maintaining a ratio of COPD to non-COPD patients according to the estimated OR, the individuals were assigned to locations entirely at random.

Once the LOCATION_ASSIGNMENT table was created, we populated the LOCATION_HISTORY table with the following query:

CREATE OR REPLACE TABLE location_history AS (
    SELECT l.location_id,
           32848                                     AS relationship_type_concept_id,
           1147314                                   AS domain_id,
           la.person_id                              AS entity_id,
           -- Residence starts at 1998-01-01, or at birth for persons born after that date
           CASE
               WHEN year_of_birth < 1998 THEN CAST('1998-01-01' AS DATE)
               ELSE CAST(birth_datetime AS DATE) END AS start_date,
           -- Residence ends at the end of the PM2.5 time series
           CAST('2016-12-31' AS DATE)                AS end_date
    FROM location l
             INNER JOIN location_assignment la
                        ON l.location_id = la.location_id
             INNER JOIN person p
                        ON la.person_id = p.person_id
    -- Exclude persons born after the end of the exposure period
    WHERE year_of_birth < 2017
);

Note that the DDL and description of LOCATION_HISTORY can be found in the GIS Documentation.

Exposure Occurrence

We took two approaches to populate the EXPOSURE_OCCURRENCE table:

  1. An end-to-end workflow based on gaiaCore functionality, enabled by the recent containerization work in GIS.
  2. A simplified query that references the appropriate concept_id value for the annual average of PM2.5, since this demonstration focuses on a single variable.

The gaiaCore containerized workflow is described in detail elsewhere; briefly, we added the PM2.5 dataset as a data source in GaiaDB, and then converted its contents to the geom/attr representations expected by the gaiaCore package. We've also copied the query for populating the EXPOSURE_OCCURRENCE table directly below:

-- One EXPOSURE_OCCURRENCE row per person-location-year, carrying the annual PM2.5 value
INSERT INTO exposure_occurrence (
    SELECT ROW_NUMBER() OVER (ORDER BY l.entity_id)                       AS exposure_occurrence_id
         , l.location_id
         , l.entity_id                                                    AS person_id
         , 2052499839                                                     AS exposure_concept_id    -- annual average PM2.5
         , TO_DATE(CONCAT(p.year, '-01-01'), 'yyyy-MM-dd')                AS exposure_start_date
         , TO_DATE(CONCAT(p.year, '-01-01'), 'yyyy-MM-dd')::timestamp     AS exposure_start_datetime
         , TO_DATE(CONCAT(p.year + 1, '-01-01'), 'yyyy-MM-dd')            AS exposure_end_date
         , TO_DATE(CONCAT(p.year + 1, '-01-01'), 'yyyy-MM-dd')::timestamp AS exposure_end_datetime
         , 2052499878                                                     AS exposure_type_concept_id
         , 2052496943                                                     AS exposure_relationship_concept_id
         , 2052499839                                                     AS exposure_source_concept_id
         , p.PM_VALUE                                                     AS exposure_source_value
         , 'WITHIN'                                                       AS exposure_relationship_source_value
         , 'ug/m3'                                                        AS dose_unit_source_value
         , 1                                                              AS quantity
         , CAST(NULL AS VARCHAR(50))                                      AS modifier_source_value
         , 4172703                                                        AS operator_concept_id
         , p.PM_VALUE                                                     AS value_as_number
         , CAST(NULL AS INTEGER)                                          AS value_as_concept_id
         , 32964                                                          AS unit_concept_id
    FROM pm25_limited p
    INNER JOIN location_history l
        ON p.id = l.location_id
);

Note that we converted the 20-city subset of the PM2.5 data to a single, long-format table that is referenced in the query above.
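For reference, a wide-to-long conversion of that subset could look like the Spark SQL sketch below, assuming a pm25_cities table with one column per year (y1998 through y2016) and an id column that aligns with location_id; these names are placeholder assumptions.

-- Illustrative sketch of the wide-to-long conversion for the 20-city PM2.5 subset.
-- Assumes pm25_cities(id, city, y1998, ..., y2016), where id aligns with location_id;
-- all names are placeholders. Produces the (id, city, year, PM_VALUE) shape used above.
CREATE OR REPLACE TABLE pm25_limited AS (
    SELECT id,
           city,
           stack(19,
                 1998, y1998, 1999, y1999, 2000, y2000, 2001, y2001, 2002, y2002,
                 2003, y2003, 2004, y2004, 2005, y2005, 2006, y2006, 2007, y2007,
                 2008, y2008, 2009, y2009, 2010, y2010, 2011, y2011, 2012, y2012,
                 2013, y2013, 2014, y2014, 2015, y2015, 2016, y2016) AS (year, PM_VALUE)
    FROM pm25_cities
);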

With the addition of the EXPOSURE_OCCURRENCE table, the GIS-specific synthetic dataset in OMOP v5.4 format now combines the core EHR data, the GIS extension tables (LOCATION_HISTORY and EXPOSURE_OCCURRENCE), and a global GIS dataset describing urban pollution levels.

Access to Dataset

If you would like to access and download the GIS-specific synthetic COPD dataset, please contact Jared Houghtaling for more information about the associated data use agreement (DUA)!
