feat: automatically converting csv to parquet #181

Merged (10 commits) on May 28, 2024

@wangpatrick57 commented May 1, 2024

Summary: Automatically convert CSV files to Parquet before generating stats on the Parquet files.

Demo:
[Screenshot, 2024-05-01 at 18:58:27]

Details:

  • For robustness, we don't use schema inference. Instead, we build a temporary DataFusion context, create the tables from the DDL statements, and then read the schema back from DataFusion.
  • I forked csv2parquet here. One notable change is that it's now a library instead of a binary. We also turn empty strings in nullable Utf8 columns into nulls in memory, because arrow's CSV reader doesn't seem to do this for Utf8 types. This has a huge effect on q-error on JOB (the Join Order Benchmark).
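
The empty-string handling described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from the csv2parquet fork: `Utf8Column`, `empty_strings_to_nulls`, and `null_count` are hypothetical names, and a `Vec<Option<String>>` stands in for an Arrow Utf8 array, with `None` playing the role of NULL.

```rust
/// Hypothetical stand-in for an in-memory Arrow Utf8 column.
/// `None` represents SQL NULL.
#[derive(Debug, Clone, PartialEq)]
pub struct Utf8Column {
    pub name: String,
    pub nullable: bool,
    pub values: Vec<Option<String>>,
}

/// Turn empty strings into NULLs, but only for columns declared nullable.
/// Non-nullable columns are returned unchanged, since they cannot hold NULL.
pub fn empty_strings_to_nulls(col: Utf8Column) -> Utf8Column {
    if !col.nullable {
        return col;
    }
    let Utf8Column { name, nullable, values } = col;
    let values = values
        .into_iter()
        // `Option::filter` keeps `Some(s)` only when `s` is non-empty;
        // existing NULLs (`None`) pass through untouched.
        .map(|v| v.filter(|s| !s.is_empty()))
        .collect();
    Utf8Column { name, nullable, values }
}

/// Count the NULL entries in a column, e.g. for a null-fraction statistic.
pub fn null_count(col: &Utf8Column) -> usize {
    col.values.iter().filter(|v| v.is_none()).count()
}
```

Under these assumptions, a nullable column holding `"a"`, `""`, and NULL ends up with two NULLs after conversion, which is the count a stats pass would then see. A non-nullable column keeps its empty strings as real values.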

@wangpatrick57 marked this pull request as ready for review May 1, 2024 22:42
@wangpatrick57 requested review from Gun9niR and AlSchlo and removed request for Gun9niR May 1, 2024 22:57
@wangpatrick57 changed the title from "Phw2/csv to parquet" to "feat: automatically converting csv to parquet" May 1, 2024
@wangpatrick57 merged commit 5255d8d into main May 28, 2024
1 check passed
@wangpatrick57 deleted the phw2/csv-to-parquet branch May 28, 2024 00:47