Dev Environment and Google BigQuery Database Access Setup
Primary tech stack includes
- Python (scipy, pandas, scikit-learn, matplotlib)
- R for certain statistical analyses (e.g., propensity score matching)
- Google BigQuery database (mostly replacing local PostgreSQL 9.6 instances)
- HTML/JavaScript for web applications
- Git(Hub) for source control, issue+project tracking, Wiki notes
Basic dev skills recommended
- SQL queries (mostly to pull data out of databases)
https://www.w3schools.com/sql/default.asp
https://technically.dev/posts/sql-for-the-rest-of-us.html
https://brohrer.github.io/sql_resources.html
https://cloud.google.com/bigquery/docs/how-to
- Unix command-line use (digital notebooks and interactive parsers are good for experimenting, but eventually you'll need to produce and execute reusable code modules, and to SSH connect to and remotely control a server)
https://practicalunix.org/
- Git or other version control system. Standard software engineering practice: commit and push any code changes to the repository by the end of any day the code is edited. Don't wait for a server meltdown, hard drive failure, or stolen laptop to learn this lesson the hard way. Git has more complex functionality (branching and pull requests) than is necessary in most cases. At the least, learn how to git clone, add, commit, push, pull, and merge conflicts with other users.
- Python programming (including object-oriented programming and automated unit testing)
https://wiki.python.org/moin/BeginnersGuide
https://brohrer.github.io/python_resources.html
https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html
- R tidyverse / dplyr data manipulation, regression / glmnet models
https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
- The Computational Social Science Course has a nice introductory lab workshop that reviews many basic data manipulation concepts in R-dplyr, the Unix command line, and Python streaming.
- Review the DevWorkshop code directory for various tutorials on core tools and skills, such as practicing SQL queries, unit testing, PivotTables, and more (a minimal SQL + unit-testing sketch follows this list).
- Medical Statistics Review
https://lagunita.stanford.edu/courses/Medicine/MedStats./Summer2015/about
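As a warm-up that combines the SQL and unit-testing skills above, here is a minimal sketch using Python's built-in sqlite3 module (so it runs anywhere without database setup); the table and values are invented for illustration:

```python
import sqlite3
import unittest


class TestBasicSql(unittest.TestCase):
    def setUp(self):
        # In-memory database: nothing to install, nothing to clean up.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE patient (patient_id INTEGER, age INTEGER)")
        self.conn.executemany(
            "INSERT INTO patient VALUES (?, ?)",
            [(1, 34), (2, 67), (3, 51)],
        )

    def tearDown(self):
        self.conn.close()

    def test_count_patients_over_50(self):
        # The kind of aggregate query you will write constantly against STARR.
        cursor = self.conn.execute("SELECT COUNT(*) FROM patient WHERE age > 50")
        self.assertEqual(cursor.fetchone()[0], 2)


if __name__ == "__main__":
    unittest.main()
```

The same pattern (setUp builds a small fixture, each test asserts on a query result) carries over to testing more complex query logic.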
Windows dev environment notes
- Example profile script to set up command-line environment variables for a Windows dev environment: https://github.com/HealthRex/CDSS/blob/master/setup/windows.profile.bat
- See workshop tutorial notes for guidance
- Another workshop tutorial covers practicing clinical data SQL queries
Multiple tiers of access security
- Publicly available Google Cloud Platform / BigQuery that anyone can be granted access to
- Access restricted to those with Stanford.edu account (still not secure enough for PHI / High-Risk data)
- Access restricted to those connected to the Stanford VPN with respective access controls. (The last tier is supported by Nero / Stanford Research Computing Center and is deemed acceptable for high-risk data, i.e., protected health information, whereas the former tiers are not.)
Google Cloud Platform mining-clinical-decisions is a Stanford.edu restricted project for the group
- A project administrator needs to grant project-level or BigQuery access to the GCP project
Google Cloud Platform som-nero-phi-jonc101 is a Stanford Nero-PHI compliant restricted project for the group
- User needs to download, install, and run the Stanford VPN client in Full Traffic (non-split-tunnel) mode (requires two-factor authentication)
- Project administrator needs to add Stanford user to workgroup for Nero-PHI compliant access
- https://workgroup.stanford.edu/main/WGApp
- If extra privileges are still needed, contact the SRCC sysadmins to grant them.
- Navigate to GCP -> Compute Engine -> VM Instances
- Click "Create Instance" and input the necessary specifications (e.g., the compute power needed)
- Click "Start VM"; there is an "SSH" button on the side to copy the CLI command for your local machine (needs Stanford Full Tunnel VPN)
- Can use the instructions below for authentication to GCP, including creating the JSON token for logging into a specific project.
- Install RStudio Server or whichever language/software is necessary.
source: https://cloud.google.com/bigquery/docs/authentication/
Google Cloud SDK (requires Python) https://cloud.google.com/sdk/install
Install, then run:
`gcloud init`
Need to create an authentication key JSON file:
`gcloud auth application-default login`
This should prompt a web login, then create a local JSON key file that can be used to simulate logging in via a stanford.edu account to a specific project. (The key appears tied to one user login and one GCP project.)
- Mac/Linux: `export GOOGLE_APPLICATION_CREDENTIALS="[PathToKeyFile]"`
- Windows (cmd): `set GOOGLE_APPLICATION_CREDENTIALS=[PathToKeyFile]`
- Alternatively, Python scripts in the medinfo code tree will likely import LocalEnv.py, which can set it.
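If you prefer to set the credential path from within Python (similar to what a LocalEnv.py-style module might do), a minimal sketch follows; the key file path is a placeholder, not a real key:

```python
import os

# Placeholder path to the JSON key file created by
# `gcloud auth application-default login`; substitute your own.
KEY_FILE_PATH = "/path/to/your-project-credentials.json"

# Google client libraries read this environment variable at client
# construction time, so set it before creating any BigQuery client.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = KEY_FILE_PATH
```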
BigQuery Python API reference: https://googleapis.github.io/google-cloud-python/latest/index.html
Install the Python module for the connection:
`pip install --upgrade google-cloud-bigquery`
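Once the credentials are in place, querying from Python looks like ordinary SQL. A minimal sketch, assuming the mining-clinical-decisions project above; the dataset and table path is a placeholder, not a real STARR table name:

```python
from google.cloud import bigquery

# The client picks up credentials from GOOGLE_APPLICATION_CREDENTIALS
# automatically; project name as described on this page.
client = bigquery.Client(project="mining-clinical-decisions")

# Placeholder dataset/table path -- substitute real names after
# browsing the project in the BigQuery console.
query = """
    SELECT proc_code, COUNT(*) AS n_orders
    FROM `mining-clinical-decisions.example_dataset.order_proc`
    GROUP BY proc_code
    ORDER BY n_orders DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.proc_code, row.n_orders)
```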
Much of our research is based on the Stanford Clinical Data Warehouse (aka STARR, formerly STRIDE). Our primary data source within STARR is the Epic "Clarity" database, from which STARR has provided us data for all Stanford-affiliated patients from 2008 onward. For more information about the STARR data warehouse, see this overview. You can look online for a data dictionary, though ultimately a lot requires just browsing through the data tables and reconstructing what everything means.
In 2019, we gained access to a version of the database that includes both inpatient and outpatient cases (essentially the "STARR Tahoe" datalake), accessible via a Google BigQuery interface (functionally, it looks like querying a SQL database). See the Data Dictionary.
If you are not familiar with SQL, please refer to the following resources.
- Free online SQL courses: Codecademy, Khan Academy, and w3schools
- SQL syntax reference: w3schools
As with any data set, STRIDE can be misleading. In particular, modeling the complexities of inpatient hospital operations into a set of SQL tables inevitably results in information loss. As a result, simple questions may have surprising (i.e., likely incorrect) answers simply by virtue of quirks in the data collection process.
That being said, we still need to do research. Therefore, our approach is simply to try to understand the limitations in our data set. Here is a (non-exhaustive) list of known limitations.
- Incomplete data at time boundaries: While we describe the database as covering the years "2008 – ", we know that coverage is incomplete at the boundaries. For example, we have data for some patients in 2008 but not all, likely because Epic was rolled out slowly to the various departments in the hospital, so there is a period during which only a fraction of patient encounters yielded EMR data.
- Inconsistent mortality rates: The mortality rate declines precipitously from ~2000 per year to ~1400 between 2011 and 2012 (without a corresponding drop in admissions). We suspect that this is caused by changes in how deaths are reported. In particular, the hospital cannot track deaths occurring soon after discharge, and instead relies on the Social Security Death Index (SSDI) to retroactively update its records. We suspect that a change in the SSDI's recording practices/efficacy affected STRIDE's records downstream.
- Shifting diagnosis frequency: The top 10 admit and problem list diagnoses shift over the years. While in theory it's possible that these shifts are explained by changes in the patient population, we believe it's more likely the result of changes in medical coding patterns (though it's very difficult to disentangle the two). In particular, diagnoses shift from less specific diagnoses (e.g. general symptoms, abdominal pain, chest symptoms) to more specific ones (e.g. shortness of breath, dyspnea, fever). In addition, after 2014 there is a spike in admissions for chemotherapy and immunotherapy. Similarly, kidney failure starts to dominate problem lists after 2014. We believe part of this change in coding patterns may be driven by the shift from ICD9 to ICD10 codes.
- Inconsistent `proc_id`:`proc_code` mapping: `order_proc` models all procedures ordered at the hospital. Each unique procedure type is defined by a `proc_id`:`proc_code` pair. However, this mapping changes in 2014 (e.g., `proc_code` LABA1C maps to `proc_id` 472505 in 2010 and 1036 in 2016).
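One way to see this limitation in your own cohort is to profile how many distinct `proc_id` values a `proc_code` maps to per year. A hedged sketch, assuming a placeholder dataset path and an `order_time` timestamp column (verify both against the actual schema):

```python
from google.cloud import bigquery

client = bigquery.Client(project="mining-clinical-decisions")

# Assumed dataset path and timestamp column -- check the real schema
# in the BigQuery console before relying on this.
query = """
    SELECT
        proc_code,
        EXTRACT(YEAR FROM order_time) AS order_year,
        COUNT(DISTINCT proc_id) AS n_distinct_proc_ids
    FROM `mining-clinical-decisions.example_dataset.order_proc`
    WHERE proc_code = 'LABA1C'
    GROUP BY proc_code, order_year
    ORDER BY order_year
"""

# More than one distinct proc_id within or across years signals the
# 2014 remapping described above.
for row in client.query(query).result():
    print(row.order_year, row.n_distinct_proc_ids)
```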
For examples of how to answer specific clinical research questions, see the SQL Dev Workshop for directed examples of how to query information out of the database.