
Database modeling - experiment with loading GKS variation data into parquet and DuckDB #8

Open
theferrit32 opened this issue Oct 4, 2024 · 1 comment
Labels
ASHG-2024 ASHG GKS Unconference 2024

@theferrit32

Submitter Name

Kyle Ferriter

Submitter Affiliation

Broad Institute

Project Details

We would like to load GKS data into a relational model for rich querying. We currently rely heavily on BigQuery in backend processes, but would like something free and portable that can be distributed to users. AnyVar uses PostgreSQL with most data stored in JSON columns. We would like something easier to configure and run than Postgres, such as an embedded, single-file/directory database like SQLite, RocksDB, or DuckDB. We have looked a little at DuckDB, which can import/export Parquet and NDJSON files and provide a SQL query interface over them with essentially no configuration. One difficulty with NDJSON import is that automatic schema inference fails when the data contains heterogeneous rows, which is the case for GKS data (see the sketch below).
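As a concrete illustration, here is a minimal sketch of one workaround in DuckDB: supply an explicit column list to read_json instead of relying on auto-inference. The file name and the id/type/location columns are placeholders, not a settled GKS mapping.

```sql
-- Auto-inference over heterogeneous NDJSON rows can fail; an explicit schema
-- (or sample_size = -1 to sample every record) is one workaround.
SELECT *
FROM read_json(
    'variation.ndjson',                 -- placeholder file name
    format  = 'newline_delimited',
    columns = {id: 'VARCHAR', type: 'VARCHAR', location: 'JSON'}
);
```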

We are aiming for these deliverables:

  • Convert Variation NDJSON files to Parquet files using DuckDB. Determine whether there need to be multiple sets of Parquet files with different schemas, or whether one large schema of union types is possible (but not by adding a separate set of columns for each type).
  • Load the Parquet datasets into a DuckDB database file. Based on the docs, the internal format is similar to Parquet, but preloading into a native database file may improve query performance compared to scanning Parquet directories. A rough sketch of these first two steps follows this list.
  • Provide notebooks demonstrating queries we anticipate to be common.
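A rough sketch of the first two deliverables, reusing the placeholder schema from above (file, table, and column names are illustrative only):

```sql
-- 1. Convert NDJSON to Parquet via DuckDB
COPY (
    SELECT *
    FROM read_json('variation.ndjson',
                   format  = 'newline_delimited',
                   columns = {id: 'VARCHAR', type: 'VARCHAR', location: 'JSON'})
) TO 'variation.parquet' (FORMAT PARQUET);

-- 2. Preload the Parquet data into a native DuckDB database file
ATTACH 'gks.duckdb' AS gks;
CREATE TABLE gks.variation AS
    SELECT * FROM read_parquet('variation.parquet');

-- 3. The kind of query a notebook might demonstrate
SELECT type, count(*) AS n
FROM gks.variation
GROUP BY type
ORDER BY n DESC;
```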

Stretch:

  • Look at table indexing. How much storage overhead do indexes add, and to what degree do they speed up common queries? (See the sketch after this list.)
  • Write an AnyVar storage backend that uses this.
  • Investigate write performance. If this serves as the backend for a writeable API like AnyVar, how quickly can it write new records? Parquet is not known for write performance, since it is designed for analytics/warehousing. Is it good enough? Is writing rows a bottleneck?
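A possible starting point for the indexing and write-performance questions, assuming the placeholder variation table from the sketch above is opened directly in the DuckDB CLI (e.g. duckdb gks.duckdb):

```sql
-- Storage overhead: compare the reported database size before and after indexing
PRAGMA database_size;
CREATE INDEX idx_variation_id ON variation (id);
PRAGMA database_size;

-- Write performance: time single-row inserts, which is closer to how a
-- writeable API like AnyVar would use the backend than bulk COPY loads.
-- (.timer on is a DuckDB CLI command, not SQL.)
.timer on
INSERT INTO variation VALUES ('ga4gh:VA.example', 'Allele', '{}');
```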

Required Skills

  • Python
  • SQL
  • Experience defining relational data schemas
@theferrit32 (Author)

Semi-related, this was just posted today:

Pg_parquet: An extension to connect Postgres and parquet

https://news.ycombinator.com/item?id=41871068

pg_parquet is a PostgreSQL extension that allows you to read and write Parquet files, located in S3 or on the local file system, from PostgreSQL via COPY TO/FROM commands.

https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet
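
Based on the post, usage looks roughly like the following (table and path names are placeholders; check the pg_parquet docs for exact options and S3 configuration):

```sql
-- Export a Postgres table to a Parquet file, then load it back;
-- the format is reportedly inferred from the .parquet extension,
-- and s3:// URIs can be used in place of local paths.
COPY variation TO '/tmp/variation.parquet';
COPY variation FROM '/tmp/variation.parquet';
```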

@korikuzma added the ASHG-2024 ASHG GKS Unconference 2024 label on Nov 7, 2024