Submitter Name
Kyle Ferriter

Submitter Affiliation
Broad Institute

Project Details
We would like to load GKS data into a relational model for rich querying. We currently use BigQuery heavily in backend processes, but would like something free and portable that can be distributed to users. AnyVar uses PostgreSQL with most data stored in JSON columns. We would like something easier to configure and run than Postgres, such as an embedded, single-file/directory database like SQLite, RocksDB, or DuckDB. We have looked a bit at DuckDB, which can import/export Parquet and NDJSON files and can provide a SQL query interface over them with essentially no configuration. One difficulty with NDJSON import is that automatic schema inference fails when the data contains heterogeneous rows, which is the case for GKS data.
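To illustrate the inference issue: DuckDB's JSON reader accepts hints such as `sample_size = -1` (infer the schema from every row rather than a sample) and `union_by_name` (merge schemas across multiple files). A minimal, untested sketch, where the `variation*.ndjson` file names are placeholders:

```python
# Sketch: reading heterogeneous GKS NDJSON with DuckDB's JSON reader,
# rather than relying on default schema inference. File names are
# hypothetical placeholders.
import duckdb

con = duckdb.connect()  # in-memory database

# sample_size=-1 makes inference scan whole files instead of the default
# sample (slower initial read); union_by_name merges the schemas of
# multiple files into one relation.
rel = con.sql("""
    SELECT *
    FROM read_ndjson_auto('variation*.ndjson',
                          union_by_name = true,
                          sample_size = -1)
""")
print(rel.limit(5))
```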
We are aiming for these deliverables:
Convert Variation NDJSON files to Parquet files using DuckDB. Determine whether there need to be multiple sets of Parquet files with different schemas, or whether one big schema of union types is possible (not by adding a set of columns for each different type); see the first sketch after this list.
Load the Parquet datasets into a DuckDB database file. Based on the docs, the internal format sounds similar to Parquet, but preloading into a native database file may improve performance compared to querying Parquet directories directly; see the second sketch after this list.
Provide notebooks demonstrating queries we expect to be common; the second sketch below ends with one such query.
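For the conversion deliverable, a first pass might look like the following sketch (file names are placeholders; whether one unioned schema suffices is exactly the open question above):

```python
# Sketch: NDJSON -> Parquet via DuckDB, attempting one unioned schema.
# File names are hypothetical placeholders.
import duckdb

con = duckdb.connect()
con.sql("""
    COPY (
        SELECT *
        FROM read_ndjson_auto('variation*.ndjson',
                              union_by_name = true,
                              sample_size = -1)
    ) TO 'variation.parquet' (FORMAT parquet)
""")
```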
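For the loading and querying deliverables, a sketch along these lines; the table name and the `type` column are assumptions about the unioned GKS schema, not confirmed fields:

```python
# Sketch: persist Parquet data as a native DuckDB table in a single
# database file, then query it. Table name and `type` column are
# assumptions for illustration.
import duckdb

con = duckdb.connect("gks.duckdb")  # creates/opens a single-file database
con.sql("""
    CREATE TABLE IF NOT EXISTS variation AS
    SELECT * FROM read_parquet('variation.parquet')
""")

# The kind of query a notebook might demonstrate: count records per type.
print(con.sql("SELECT type, count(*) FROM variation GROUP BY type"))
```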
Stretch:
Look at table indexing. How much storage overhead does adding indexes introduce? To what degree do they speed up common queries? (See the indexing sketch after this list.)
Write an AnyVar storage backend that uses this (a rough interface sketch follows this list).
Investigate write performance. If this serves as the backend for a writable API like AnyVar, how quickly can it write new records? Parquet is not known for write performance, since it is designed for analytics/warehousing. Is it good enough? Is writing rows a bottleneck? (See the micro-benchmark sketch after this list.)
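On indexing: DuckDB supports `CREATE INDEX` (ART indexes), and storage overhead can be roughly estimated by comparing database file sizes before and after. A sketch, with the table and column names carried over as assumptions from the earlier sketches:

```python
# Sketch: estimate index storage overhead in DuckDB by file-size delta.
# Table/column names are assumptions from the earlier sketches.
import os
import duckdb

con = duckdb.connect("gks.duckdb")
before = os.path.getsize("gks.duckdb")

con.sql("CREATE INDEX IF NOT EXISTS idx_variation_id ON variation (id)")
con.sql("CHECKPOINT")  # flush so the file size reflects the new index

after = os.path.getsize("gks.duckdb")
print(f"approximate index overhead: {after - before} bytes")
```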
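On the AnyVar backend: assuming AnyVar's storage backends expose a dict-like interface keyed by VRS ID (an assumption about AnyVar internals, to be checked against its Postgres backend), a DuckDB-backed store might begin like this:

```python
# Sketch of a DuckDB-backed key/value store for AnyVar. The dict-like
# contract and table layout are assumptions; objects are stored as JSON
# text keyed by VRS ID.
import json
import duckdb

class DuckdbObjectStore:
    def __init__(self, path: str = "anyvar.duckdb"):
        self.con = duckdb.connect(path)
        self.con.sql(
            "CREATE TABLE IF NOT EXISTS vrs_objects "
            "(vrs_id VARCHAR PRIMARY KEY, vrs_object JSON)"
        )

    def __setitem__(self, vrs_id: str, obj: dict) -> None:
        # INSERT OR REPLACE keeps writes idempotent on re-registration
        self.con.execute(
            "INSERT OR REPLACE INTO vrs_objects VALUES (?, ?)",
            [vrs_id, json.dumps(obj)],
        )

    def __getitem__(self, vrs_id: str) -> dict:
        row = self.con.execute(
            "SELECT vrs_object FROM vrs_objects WHERE vrs_id = ?", [vrs_id]
        ).fetchone()
        if row is None:
            raise KeyError(vrs_id)
        return json.loads(row[0])
```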
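And on write performance, a crude micro-benchmark comparing row-at-a-time inserts with batched inserts would give a first answer; a sketch with synthetic records:

```python
# Crude write-throughput sketch: row-at-a-time vs. batched inserts into
# DuckDB. Record contents are synthetic placeholders.
import json
import time
import duckdb

con = duckdb.connect()  # in-memory; point at a file to include disk cost
con.sql("CREATE TABLE vrs_objects (vrs_id VARCHAR, vrs_object JSON)")

rows = [(f"ga4gh:VA.{i}", json.dumps({"type": "Allele", "n": i}))
        for i in range(10_000)]

t0 = time.perf_counter()
for vrs_id, obj in rows:
    con.execute("INSERT INTO vrs_objects VALUES (?, ?)", [vrs_id, obj])
print(f"row-at-a-time: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
con.executemany("INSERT INTO vrs_objects VALUES (?, ?)", rows)
print(f"batched:       {time.perf_counter() - t0:.2f}s")
```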
Required Skills
Python
SQL
Experience defining relational data schemas
pg_parquet is a PostgreSQL extension that allows you to read and write Parquet files, located in S3 or on the file system, from PostgreSQL via COPY TO/FROM commands.
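If the Postgres route stays on the table, usage is roughly as below; a sketch assuming a running Postgres with pg_parquet installed and an existing `vrs_objects` table (connection string, table, and path are placeholders):

```python
# Sketch: exporting a table to Parquet with the pg_parquet extension,
# driven from Python via psycopg. Connection string, table name, and
# output path are placeholder assumptions.
import psycopg

with psycopg.connect("dbname=anyvar") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS pg_parquet")
    conn.execute(
        "COPY vrs_objects TO '/tmp/vrs_objects.parquet' (FORMAT 'parquet')"
    )
```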