Containerize and publish gaia-core and gaia-db #340

Open: wants to merge 25 commits into base: main

jshoughtaling
Collaborator

@jshoughtaling jshoughtaling commented Jul 19, 2024

This PR containerizes the main elements of the OHDSI GIS repository, namely gaia-core and gaia-db.

Note

The proposed CI/CD pipeline to build and push images with GitHub actions relies on a user account with elevated permissions. The following lines will need to be updated in the workflow yml files to complete the build and push action:

  • build_gaia_core.yml: 40
  • build_gaia_db.yml: 44

A GitHub Actions secret named GH_TOKEN, containing a scoped personal access token for the user above, needs to be added to the repository so it can be referenced in the action.

Moreover, the process assumes these images will be hosted on the GitHub Container Registry (ghcr.io) rather than on Docker Hub. I have seen OHDSI images hosted in both places and am not sure what the current status or approach is in that regard. GHCR is cleaner with regard to authentication, IMO, but the code can be easily adapted to the docker.io space if needed. In either case, the image build pipeline requires elevated credentials.
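
For anyone who wants to try the push by hand before the workflows are wired up, here is a rough sketch of the equivalent steps. It assumes GH_TOKEN is exported in the shell and that OWNER is replaced with the account or organization that will host the images; the helper names and the "latest" tag are illustrative and do not come from the workflow files.

```shell
# Sketch only: OWNER and the default "latest" tag are placeholders, not values from this PR.

# Build a fully qualified GHCR image reference such as ghcr.io/owner/gaia-db:latest.
# Docker repository names must be lowercase, so the owner is lowercased here.
ghcr_ref() {
  owner=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf 'ghcr.io/%s/%s:%s\n' "$owner" "$2" "${3:-latest}"
}

# Log in with the token from the environment, then build and push one image.
# Arguments: owner, image name, Dockerfile path (run from the top of the repo).
push_image() {
  ref=$(ghcr_ref "$1" "$2")
  printf '%s' "$GH_TOKEN" | docker login ghcr.io -u "$1" --password-stdin &&
    docker build -t "$ref" -f "$3" . &&
    docker push "$ref"
}

# Example (not run here):
#   push_image OWNER gaia-db docker/gaia-db/Dockerfile
#   push_image OWNER gaia-core docker/gaia-core/Dockerfile
```

Passing the token on stdin keeps it out of the process list and shell history.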

GAIA-CORE

This image simply installs the gaia R package into a pinned (4.2.1) ohdsi/broadsea-hades base image. I have not added any sort of entrypoint; more work needs to be done to script processes that can load and subsequently geocode data in a single workflow. I will discuss further with @kzollove.

GAIA-DB

This image is based on the alpine flavored postgis base image. The initialization script for the database combines and modifies existing sql scripts used in both the catalog initialization (via the backbone schema) and the vocabulary integration.

Once deployed and auto-initialized, the containerized Postgres database includes:

  • GIS Catalog (backbone schema)
  • Constrained GIS vocabulary tables (vocabulary schema)
  • postgis tools (native to image, tiger schema)

In order to build the images locally, clone this branch of the repo and run the following commands at the top level:

sudo docker build -t gaia-db-test -f docker/gaia-db/Dockerfile .
sudo docker build -t gaia-core-test -f docker/gaia-core/Dockerfile .

In order to run them locally after building, you can execute the following:

sudo docker run --rm --env POSTGRES_PASSWORD=SuperSecret --name gaia-db gaia-db-test
sudo docker run --rm --entrypoint /bin/bash -it --name gaia-core gaia-core-test
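
Since the database initializes itself on first start, it can help to wait until it accepts connections before pointing anything at it. A minimal sketch follows, assuming the container is named gaia-db as above and that pg_isready is available in the image (it ships with the official Postgres images that the postgis images build on). The retry and db_ready helpers are hypothetical names.

```shell
# Retry a command until it succeeds or the attempts run out.
# Arguments: max attempts, seconds between attempts, then the command to run.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  until "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Ask Postgres inside the container whether it accepts TCP connections.
# Checking over TCP (-h 127.0.0.1) is meant to avoid a false "ready" from the
# temporary socket-only server that the image entrypoint runs during initialization.
db_ready() {
  docker exec gaia-db pg_isready -h 127.0.0.1 -U postgres >/dev/null 2>&1
}

# Example (not run here; prefix docker with sudo if your setup needs it):
#   retry 30 2 db_ready && echo "gaia-db is accepting connections"
```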

Warning

I needed to update the staging vocabulary csv files in order to create a relationally consistent vocabulary schema. This meant adding certain entries and assigning 2B+ concept id values. I will need to discuss in more detail with @p-talapova to make sure these changes align with her efforts.

TODO

  • Offset 2B+ id assignments to avoid collisions with local vocabularies
  • Add docker-compose.yml file with both services and their associated configuration parameters
  • Add robust documentation for the Docker image deployments
  • Create/integrate existing scripts to automate workflows within the gaia-core package
  • Discuss implications of including CONCEPT_RELATIONSHIP entities here that reference codes in proprietary vocabularies
  • Include integrations with existing OMOP vocabulary tables
  • ...
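
As a starting point for the docker-compose.yml item, here is a rough sketch that writes a compose file mirroring the two docker run commands above. Nothing in it comes from an existing compose file in the repository: the write_compose helper is hypothetical, and the image names are the local test tags used in this description.

```shell
# Write a minimal docker-compose.yml for the two services into the given directory.
# The heredoc is quoted so that ${...} is left for Compose to interpolate, not the shell.
write_compose() {
  cat > "$1/docker-compose.yml" <<'EOF'
services:
  gaia-db:
    image: gaia-db-test
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD}
  gaia-core:
    image: gaia-core-test
    depends_on:
      - gaia-db
    # No entrypoint is defined in the image yet, so open an interactive shell.
    entrypoint: /bin/bash
    stdin_open: true
    tty: true
EOF
}

# Example (not run here):
#   write_compose . && POSTGRES_PASSWORD=SuperSecret docker compose up -d
```

Compose puts both services on a shared default network, so gaia-core should be able to reach the database by the service name gaia-db.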

@kzollove kzollove requested review from kzollove and rtmill July 22, 2024 18:23
Collaborator

@kzollove kzollove left a comment

Thanks for these fantastic contributions, Jared!

From what I can tell, both of the images build correctly. Thanks for fixing the gaia-core Dockerfile and updating the README in gaia-db.

Looks good to merge, please feel free to do so when you're ready.

(Ah, I now see a TODO... )

@tibbben
Collaborator

tibbben commented Jul 31, 2024

Hi @jshoughtaling and @kzollove,

I just successfully built both images on a PC (no 'sudo'). This looks really cool and seems that we could integrate it with some of the work in the /inst/gaiadb directory (existing postGIS dockerfile) and inst/repository (some other work I was doing with the catalog). I also have a more updated version of some of these files including a start on networking between gaia-core, gaia-db, and the catalog ... perhaps worth merging at some point, and considering a refactor on where to put the Dockerfiles in the repo ...

I am also thinking about helm charts and rancher for more cloud native kubernetes installs .... this approach would be in addition to the docker compose approach ... more on this later :)

thoughts?

@jshoughtaling
Collaborator Author

@jshoughtaling to message Lee Evans ASAP about Broadsea integrations.

@jshoughtaling jshoughtaling self-assigned this Aug 28, 2024
@jshoughtaling
Collaborator Author

Thoughts on the current (working) process flow:

  • Step 1: Place the LOCATION table in the mounted directory for the pre-loaded Nominatim image (hosted on TuftsCTSI ghcr)
  • Step 2: Bring up Nominatim and launch the geocoding loop script. Output the LOCATION_PLUS table. Bring down Nominatim.
  • Step 3: Mount the LOCATION_PLUS and LOCATION_HISTORY tables into the gaia-db container and load them into the gaia-db postgres instance
  • Step 4: Launch the loadExposure function in the gaia-core container, referencing gaia-db along with a selection of variables. Output EXPOSURE_OCCURRENCE.
  • Step 5: User loads EXPOSURE_OCCURRENCE into the OMOP instance; lat/lon info stays behind in gaia-db. EXPOSURE_OCCURRENCE can be submitted alongside redacted LOCATION and LOCATION_HISTORY tables (preserving only ID values) in deidentified data submissions to research consortia
  • Step 6: Nominatim and gaia-db, in the state following Step 5, can be used for location-based cohort creation
  • Step 6A: Input a geographical ROI into Nominatim and get the relevant lat/lon geocodes
  • Step 6B: Input lat/lon ROI and timespan of interest, and output would be a list of person_id values
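
To make the redaction in Step 5 concrete, here is a minimal sketch that keeps only the ID column of an exported table. It assumes the table has been exported as a headered CSV with the ID in the first column and no commas or quotes inside that field; the redact_to_ids helper and the file names are hypothetical.

```shell
# Keep only the first column (assumed to be the ID) of a headered CSV.
# Everything after the first comma on each line is dropped.
redact_to_ids() {
  cut -d, -f1 "$1"
}

# Example (not run here):
#   redact_to_ids LOCATION.csv > LOCATION_redacted.csv
#   redact_to_ids LOCATION_HISTORY.csv > LOCATION_HISTORY_redacted.csv
```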

@kzollove
Collaborator

kzollove commented Sep 19, 2024

@jshoughtaling I am still ending up with Broadsea, Hades installed in R, but no gaiaCore.

I am following instructions above fairly closely:

docker build -t gaia-db-test -f docker/gaia-db/Dockerfile .;\
docker build -t gaia-core-test -f docker/gaia-core/Dockerfile .;\
docker run -itd --rm --env POSTGRES_PASSWORD=SuperSecret --name gaia-db gaia-db-test;\
docker run --rm -e USER="ohdsi" -e=PASSWORD="mypass" -itd -p 8787:8787 --name gaia-core gaia-core-test

I've tried this on WSL2 (on Tufts laptop) and personal laptop (Intel Mac)

I'm assuming it quietly fails when image is built. When I try to install from in the container, I get ERROR: dependencies ‘rpostgis’, ‘sf’ are not available for package ‘gaiaCore’ among others:

Sorry for terrible formatting

> remotes::install_github("OHDSI/GIS")
Downloading GitHub repo OHDSI/GIS@HEAD
Installing 2 packages: sf, rpostgis
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://packagemanager.rstudio.com/cran/__linux__/focal/latest/src/contrib/sf_1.0-17.tar.gz'
Content type 'binary/octet-stream' length 3863868 bytes (3.7 MB)
==================================================
downloaded 3.7 MB

trying URL 'https://packagemanager.rstudio.com/cran/__linux__/focal/latest/src/contrib/rpostgis_1.5.1.tar.gz'
Content type 'application/x-tar' length 1672998 bytes (1.6 MB)
==================================================
downloaded 1.6 MB

* installing *source* package ‘sf’ ...
** package ‘sf’ successfully unpacked and MD5 sums checked
** using staged installation
configure: CC: gcc
configure: CXX: g++ -std=gnu++14
checking for gdal-config... /usr/bin/gdal-config
checking gdal-config usability... yes
configure: GDAL: 3.0.4
checking GDAL version >= 2.0.1... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C... yes
checking whether gcc accepts -g... yes
checking for gcc option to enable C11 features... none needed
checking for stdio.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for strings.h... yes
checking for sys/stat.h... yes
checking for sys/types.h... yes
checking for unistd.h... yes
checking for gdal.h... yes
checking GDAL: linking with --libs only... yes
checking GDAL: /usr/share/gdal/pcs.csv readable... no
checking GDAL: checking whether PROJ is available for linking:... yes
checking GDAL: checking whether PROJ is available for running:... yes
configure: GDAL: 3.0.4
configure: pkg-config proj exists, will use it
configure: using proj.h.
configure: PROJ: 6.3.1
checking PROJ: checking whether PROJ and sqlite3 are available for linking:... yes
checking for geos-config... /usr/bin/geos-config
checking geos-config usability... yes
configure: GEOS: 3.8.0
checking GEOS version >= 3.4.0... yes
checking for geos_c.h... yes
checking geos: linking with -L/usr/lib/x86_64-linux-gnu -lgeos_c... yes
configure: Package CPP flags: -DHAVE_PROJ_H -I/usr/include/gdal -I/usr/include
configure: Package LIBS: -lproj -L/usr/lib -lgdal -L/usr/lib/x86_64-linux-gnu -lgeos_c
configure: creating ./config.status
config.status: creating src/Makevars
** libs
g++ -std=gnu++14 -I"/usr/local/lib/R/include" -DNDEBUG -DHAVE_PROJ_H -I/usr/include/gdal -I/usr/include -I'/usr/local/lib/R/site-library/Rcpp/include' -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++14 -I"/usr/local/lib/R/include" -DNDEBUG -DHAVE_PROJ_H -I/usr/include/gdal -I/usr/include -I'/usr/local/lib/R/site-library/Rcpp/include' -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c bbox.cpp -o bbox.o
g++ -std=gnu++14 -I"/usr/local/lib/R/include" -DNDEBUG -DHAVE_PROJ_H -I/usr/include/gdal -I/usr/include -I'/usr/local/lib/R/site-library/Rcpp/include' -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c gdal.cpp -o gdal.o
gdal.cpp: In function ‘Rcpp::NumericVector CPL_transform_bounds(Rcpp::NumericVector, Rcpp::List, int)’:
gdal.cpp:713:9: error: ‘ret’ was not declared in this scope
  713 |  return ret;
      |         ^~~
make: *** [/usr/local/lib/R/etc/Makeconf:177: gdal.o] Error 1
ERROR: compilation failed for package ‘sf’
* removing ‘/usr/local/lib/R/site-library/sf’
ERROR: dependency ‘sf’ is not available for package ‘rpostgis’
* removing ‘/usr/local/lib/R/site-library/rpostgis’

The downloaded source packages are in
  ‘/tmp/Rtmpoc3MTG/downloaded_packages’
Running `R CMD build`...
* checking for file ‘/tmp/Rtmpoc3MTG/remotes15154f767f/OHDSI-GIS-71e4b83/DESCRIPTION’ ... OK
* preparing ‘gaiaCore’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Omitted ‘LazyData’ from DESCRIPTION
* building ‘gaiaCore_0.0.0.9000.tar.gz’
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
ERROR: dependencies ‘rpostgis’, ‘sf’ are not available for package ‘gaiaCore’
* removing ‘/usr/local/lib/R/site-library/gaiaCore’
Warning messages:
1: In i.p(...) : installation of package ‘sf’ had non-zero exit status
2: In i.p(...) : installation of package ‘rpostgis’ had non-zero exit status
3: In i.p(...) : installation of package ‘/tmp/Rtmpoc3MTG/file1515702ae0/gaiaCore_0.0.0.9000.tar.gz’ had non-zero exit status
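
If the install really does fail quietly at build time, a smoke test right after the build would surface it. Here is a minimal sketch, assuming Rscript is on the PATH in the image (likely for an R-based image, but not verified here); the image_has_r_package helper is a hypothetical name.

```shell
# Return non-zero if the named R package cannot be loaded inside the image.
# Rscript exits with a non-zero status when library() raises an error.
# Arguments: image name, R package name.
image_has_r_package() {
  docker run --rm --entrypoint Rscript "$1" -e "library($2)"
}

# Example (not run here):
#   image_has_r_package gaia-core-test gaiaCore || echo "gaiaCore is missing from the image"
```

Running the same check as the last step of the Dockerfile would make the build itself fail instead of producing an image without the package.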

* modifications for Broadsea builds from github

* remove repository docker from inst
@tibbben
Collaborator

tibbben commented Sep 28, 2024

Soooo ...

... after playing for a while both from the Tufts/Broadsea_GISland#gaia-core and directly from OHDSI/GIS#containerize, here's a network observation and an error. But first, I can build the images from either repo with no errors, I can run R and import gaiaCore no problem, and (after a network tweak from OHDSI/GIS) I can connect the database from R - whoot!!

As always, the last note at the very bottom is perhaps the most important ...

Network tweak for running containers from OHDSI/GIS.

After the builds (directly from Kyle's note above), but before running the images, we need to create a docker network:

docker build -t gaia-db-test -f docker/gaia-db/Dockerfile .;\
docker build -t gaia-core-test -f docker/gaia-core/Dockerfile .;\
docker network create gaia-default-network

then we need to run the containers referencing that network:

docker run -itd --rm --env POSTGRES_PASSWORD=SuperSecret --network gaia-default-network --name gaia-db gaia-db-test;\
docker run --rm -e USER="ohdsi" -e=PASSWORD="mypass" -itd -p 8787:8787 --network gaia-default-network --name gaia-core gaia-core-test
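
One small refinement, sketched under the assumption that the commands above get rerun often: `docker network create` errors out if the network already exists, so a guard makes the step repeatable. The ensure_network helper is a hypothetical name.

```shell
# Create the named network only if it does not already exist.
# `docker network inspect` returns non-zero for an unknown network.
ensure_network() {
  docker network inspect "$1" >/dev/null 2>&1 || docker network create "$1"
}

# Example (not run here):
#   ensure_network gaia-default-network
```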

Error on trying to load variable

From R, after loading the libraries, we must set up the DatabaseConnector connection details (NOTE: the OHDSI/GIS#containerize build has the database name as "postgres", not "gaiaDB"):

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "postgresql",
  server = "gaia-db/postgres",
  port = 5432,
  user="postgres",
  password = "SuperSecret") 
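
Because the database name differs between builds, a quick check of which databases exist can save a failed connection attempt. A minimal sketch, assuming the container is named gaia-db and the superuser is postgres as in the commands above; the db_exists helper is a hypothetical name.

```shell
# Succeed if a database with the given name exists in the gaia-db container.
# `psql -lqt` lists databases one per row, with the name in the first |-separated field.
db_exists() {
  docker exec gaia-db psql -U postgres -lqt |
    awk -F'|' -v db="$1" '{ gsub(/ /, "", $1) } $1 == db { found = 1 } END { exit !found }'
}

# Example (not run here):
#   db_exists gaiaDB || echo "no gaiaDB database; use postgres in the server string"
```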

Then using loadVariable() seems to have a bug still:

> gaiaCore::loadVariable(connectionDetails,222)
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
trying URL 'https://svi.cdc.gov/Documents/Data/2018/db/states_counties/SVI_2018_US_county.zip'
Content type 'application/x-zip-compressed' length 7450810 bytes (7.1 MB)
==================================================
downloaded 7.1 MB

Reading layer `SVI2018_US_county' from data source 
  `/tmp/RtmpcqZGk9/gaia/SVI2018_US_county.gdb' using driver `OpenFileGDB'
Simple feature collection with 3142 features and 125 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1489 ymin: 18.91036 xmax: 179.7785 ymax: 71.36516
Geodetic CRS:  NAD83
Connecting using PostgreSQL driver
Loading geom table dependency
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
trying URL 'https://www2.census.gov/geo/tiger/TIGER2018/COUNTY/tl_2018_us_county.zip'
downloaded 75.5 MB

Reading layer `tl_2018_us_county' from data source 
  `/tmp/RtmpcqZGk9/gaia/tl_2018_us_county.shp' using driver `ESRI Shapefile'
Simple feature collection with 3233 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
Geodetic CRS:  NAD83
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
  |======================================================================| 100%
Executing SQL took 0.00935 secs
Connecting using PostgreSQL driver
  |======================================================================| 100%
Executing SQL took 0.017 secs
  |======================================================================| 100%
Executing SQL took 0.00754 secs
  |======================================================================| 100%
Executing SQL took 0.0122 secs
  |                                                                      |   0%Error in `.createErrorReport()`:
! Error executing SQL:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "DEFAULT"
  Position: 97
An error report has been created at  /ohdsi-gis/errorReportSql.txt
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: In file(file, "rt", encoding = fileEncoding) :
  file("") only supports open = "w+" and open = "w+b": using the former
2: In file(file, "rt", encoding = fileEncoding) :
  file("") only supports open = "w+" and open = "w+b": using the former

I have tried this on several different variables, always the same error. The SQL error that gets dumped is:

DBMS:
postgresql

Error:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "DEFAULT"
  Position: 97

SQL:
ALTER TABLE ONLY census_tiger_line."geom_us_county_2018"  ALTER COLUMN geom_record_id TYPE  SET DEFAULT nextval('census_tiger_line.geom_us_county_2018_geom_record_id_seq'::regclass)

R version:
R version 4.2.1 (2022-06-23)

Platform:
x86_64-pc-linux-gnu

Attached base packages:
- stats
- graphics
- grDevices
- utils
- datasets
- methods
- base

Other attached packages:
- gaiaCore (0.0.0.9000)
- pingr (2.0.3)
- RCurl (1.98-1.16)
- DatabaseConnector (6.3.2)

and finally, there are two things that look puzzling:

file("") only supports open = "w+" and open = "w+b": using the former
The empty file "" ... is there a possibility with the containerized version, the temp file location is messed up? Or the file name got lost?

in the SQL statement it looks like there might be a blank TYPE:
ALTER TABLE ONLY census_tiger_line."geom_us_county_2018" ALTER COLUMN geom_record_id TYPE SOMETHING MISSING HERE SET DEFAULT nextval('census_tiger_line.geom_us_county_2018_geom_record_id_seq'::regclass)
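
For reference, PostgreSQL accepts `ALTER COLUMN ... SET DEFAULT ...` on its own, so the statement only breaks because a `TYPE` keyword is emitted with no data type after it. Here is a rough sketch for spotting that pattern in an error report; the has_blank_type helper is a hypothetical name, and the path in the example is the one printed in the error above.

```shell
# Succeed if the file contains an ALTER COLUMN clause where TYPE is followed
# directly by SET DEFAULT, i.e. the data type is missing.
has_blank_type() {
  grep -Eq 'ALTER COLUMN [^ ]+ TYPE +SET DEFAULT' "$1"
}

# Example (not run here; copy the report out of the container first):
#   docker cp gaia-core:/ohdsi-gis/errorReportSql.txt .
#   has_blank_type errorReportSql.txt && echo "blank TYPE in generated SQL"
```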

@kzollove
Collaborator

Hi @tibbben thanks for testing this! The loadVariable bug was handled earlier but the changes to gaia-core/Dockerfile (i.e. to install gaiaCore from OHDSI/containerize) did not propagate to OHDSI/containerize until just now, thanks for catching that.

Upon testing that update, I was getting a weird error with Andromeda. Looks like the Broadsea Hades image has not been updated in years. Fixed in the gaia-core/Dockerfile by just updating Andromeda, but I wonder if we shouldn't ask the Broadsea team to update that image?

Anyways, I am going to push the current OHDSI/containerize to main at EOD unless anyone objects. We can continue to develop, but I want to change gaia-core/Dockerfile to install from OHDSI/main sooner rather than later.

Also, see OHDSI/containerize/README.md/Getting Started for most up to date install instructions, and feel free to add there when ready

@tibbben
Collaborator

tibbben commented Oct 1, 2024

it all works!! I loaded a variable from the catalog container (in Broadsea). Let me know when you move to the main branch and I can with the url in OHDSI. I will add some readme study soon (to the containerize branch)
