Serialisable partitioning spec #291

Open
fjetter opened this issue Jun 2, 2020 · 0 comments

Problem description

The physical layout and indexing of a dataset have a dominant impact on read performance. Datasets are often designed to support a rather specific use case, where many of the partitioning parameters must be set and even minor deviations or omissions can cause severe changes in performance. We offer an increasing number of levers to control the dataset layout, but no concise way to store, share, verify, or reproduce that layout. Many of the performance-critical parameters cannot easily be reconstructed.

Things I have in mind that should be part of this specification:

  • Partition keys
  • Secondary indices
  • Bucket_by
  • Number of buckets
  • Columns the data was sorted by
  • Hash function used to calculate the buckets
  • Parquet chunk sizes used for write (assuming constant over the dataset)
  • Parquet compression algorithm
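
To make the list above concrete, here is a minimal sketch of what a serialisable spec could look like. The class name `PartitioningSpec`, all field names, and the example values are hypothetical and only mirror the bullet points; none of this is an existing API, and JSON is just one possible serialisation format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PartitioningSpec:
    """Hypothetical container for the layout parameters listed above."""

    partition_on: List[str] = field(default_factory=list)
    secondary_indices: List[str] = field(default_factory=list)
    bucket_by: List[str] = field(default_factory=list)
    num_buckets: Optional[int] = None
    sort_by: List[str] = field(default_factory=list)
    hash_function: Optional[str] = None
    # Assumed constant over the whole dataset, as noted in the list.
    parquet_chunk_size: Optional[int] = None
    parquet_compression: Optional[str] = None

    def to_json(self) -> str:
        # sort_keys gives a stable text form that is easy to diff and share.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    @classmethod
    def from_json(cls, payload: str) -> "PartitioningSpec":
        # Unknown keys raise TypeError, which surfaces typos in a spec file.
        return cls(**json.loads(payload))


# Illustrative values only.
spec = PartitioningSpec(
    partition_on=["date"],
    secondary_indices=["customer_id"],
    bucket_by=["customer_id"],
    num_buckets=8,
    sort_by=["timestamp"],
    hash_function="sha256",
    parquet_chunk_size=100_000,
    parquet_compression="snappy",
)
restored = PartitioningSpec.from_json(spec.to_json())
```

A flat structure like this could be stored next to the dataset or kept in a config file; which of the two is preferable is one of the open questions below.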

Benefits

  • Groundwork for more concise sanity checks, e.g. when updating a dataset
  • More efficient communication to consumers. So far we mostly communicate dataset schemas and rely on implicit knowledge about expected performance. With this information, we can support more informed decisions
  • Might offer a more streamlined interface (partition spec via config file?)
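
As a rough illustration of the sanity-check idea, the sketch below compares a stored spec with the spec of an incoming update and rejects the update if they differ. The specs are plain dictionaries here, and the function names `diff_specs` and `check_update` as well as the keys are hypothetical; a real check would probably need to treat some parameters as compatible rather than requiring strict equality.

```python
from typing import Any, Dict, Tuple


def diff_specs(
    stored: Dict[str, Any], incoming: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (stored_value, incoming_value)} for every mismatch."""
    keys = sorted(set(stored) | set(incoming))
    return {
        key: (stored.get(key), incoming.get(key))
        for key in keys
        if stored.get(key) != incoming.get(key)
    }


def check_update(stored: Dict[str, Any], incoming: Dict[str, Any]) -> None:
    """Raise ValueError if an update deviates from the stored spec."""
    mismatches = diff_specs(stored, incoming)
    if mismatches:
        details = ", ".join(
            f"{key}: stored={old!r}, update={new!r}"
            for key, (old, new) in mismatches.items()
        )
        raise ValueError(f"Update deviates from stored partitioning spec ({details})")


# Illustrative values only.
stored_spec = {"bucket_by": ["customer_id"], "num_buckets": 8}
update_spec = {"bucket_by": ["customer_id"], "num_buckets": 16}
mismatches = diff_specs(stored_spec, update_spec)
```

Here `mismatches` contains only the `num_buckets` entry, so `check_update(stored_spec, update_spec)` would raise before any data is written.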

Open questions

  • Do we persist this information with the dataset or merely offer this as an interface?
  • How would we handle inhomogeneous attributes (e.g. Parquet attributes)?

I'm curious to know whether other people consider this useful.
