
Database modeling - experiment with loading GKS variation data into parquet and DuckDB #8

Open
theferrit32 opened this issue Oct 4, 2024 · 1 comment
Labels
ASHG-2024 ASHG GKS Unconference 2024

@theferrit32

Submitter Name

Kyle Ferriter

Submitter Affiliation

Broad Institute

Project Details

We would like to load GKS data into a relational model for rich querying. We currently rely heavily on BigQuery in backend processes, but would like something free and portable that can be distributed to users. AnyVar uses PostgreSQL with most data stored in JSON columns. We would like something easier to configure and run than Postgres, such as an embedded, single-file/directory database like SQLite, RocksDB, or DuckDB. We have looked a little at DuckDB, which can import/export Parquet and NDJSON files and provide a SQL query interface over them with essentially no configuration. One difficulty with NDJSON import is that automatic schema inference fails when the data contains heterogeneous rows, which is the case for GKS data (see the sketch below).
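As a concrete illustration, here is a minimal sketch of one workaround in DuckDB: supply an explicit column list to read_json instead of relying on auto-inference. The file name and the id/type/location columns are placeholders, not a settled GKS mapping.

```sql
-- Auto-inference over heterogeneous NDJSON rows can fail; an explicit schema
-- (or sample_size = -1 to sample every record) is one workaround.
SELECT *
FROM read_json(
    'variation.ndjson',                 -- placeholder file name
    format  = 'newline_delimited',
    columns = {id: 'VARCHAR', type: 'VARCHAR', location: 'JSON'}
);
```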

We are aiming for these deliverables:

  • Convert Variation NDJSON files to Parquet files using DuckDB. Determine whether there need to be multiple sets of Parquet files with different schemas, or whether one large schema of union types is possible (but not by adding a separate set of columns for each type).
  • Load the Parquet datasets into a DuckDB database file. Based on the docs, the internal format is similar to Parquet, but preloading into a native database file may improve query performance compared to scanning Parquet directories. A rough sketch of these first two steps follows this list.
  • Provide notebooks demonstrating queries we anticipate to be common.
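A rough sketch of the first two deliverables, reusing the placeholder schema from above (file, table, and column names are illustrative only):

```sql
-- 1. Convert NDJSON to Parquet via DuckDB
COPY (
    SELECT *
    FROM read_json('variation.ndjson',
                   format  = 'newline_delimited',
                   columns = {id: 'VARCHAR', type: 'VARCHAR', location: 'JSON'})
) TO 'variation.parquet' (FORMAT PARQUET);

-- 2. Preload the Parquet data into a native DuckDB database file
ATTACH 'gks.duckdb' AS gks;
CREATE TABLE gks.variation AS
    SELECT * FROM read_parquet('variation.parquet');

-- 3. The kind of query a notebook might demonstrate
SELECT type, count(*) AS n
FROM gks.variation
GROUP BY type
ORDER BY n DESC;
```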

Stretch:

  • Look at table indexing. How much storage overhead do indexes add, and to what degree do they speed up common queries? (See the sketch after this list.)
  • Write an AnyVar storage backend that uses this.
  • Investigate write performance. If this serves as the backend for a writeable API like AnyVar, how quickly can it write new records? Parquet is not known for write performance, since it is designed for analytics/warehousing. Is it good enough? Is writing rows a bottleneck?
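A possible starting point for the indexing and write-performance questions, assuming the placeholder variation table from the sketch above is opened directly in the DuckDB CLI (e.g. duckdb gks.duckdb):

```sql
-- Storage overhead: compare the reported database size before and after indexing
PRAGMA database_size;
CREATE INDEX idx_variation_id ON variation (id);
PRAGMA database_size;

-- Write performance: time single-row inserts, which is closer to how a
-- writeable API like AnyVar would use the backend than bulk COPY loads.
-- (.timer on is a DuckDB CLI command, not SQL.)
.timer on
INSERT INTO variation VALUES ('ga4gh:VA.example', 'Allele', '{}');
```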

Required Skills

  • Python
  • SQL
  • Experience defining relational data schemas
@theferrit32 (Author)

Semi-related, this was just posted today:

Pg_parquet: An extension to connect Postgres and parquet

https://news.ycombinator.com/item?id=41871068

pg_parquet is a PostgreSQL extension that allows you to read and write Parquet files, located in S3 or on the local file system, from PostgreSQL via COPY TO/FROM commands.

https://www.crunchydata.com/blog/pg_parquet-an-extension-to-connect-postgres-and-parquet
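
Based on the post, usage looks roughly like the following (table and path names are placeholders; check the pg_parquet docs for exact options and S3 configuration):

```sql
-- Export a Postgres table to a Parquet file, then load it back;
-- the format is reportedly inferred from the .parquet extension,
-- and s3:// URIs can be used in place of local paths.
COPY variation TO '/tmp/variation.parquet';
COPY variation FROM '/tmp/variation.parquet';
```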

@korikuzma added the ASHG-2024 ASHG GKS Unconference 2024 label on Nov 7, 2024