Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to not partition data with INCREMENTAL_BY_TIME_RANGE #3277

Open
a-noakes opened this issue Oct 22, 2024 · 0 comments
Open

Option to not partition data with INCREMENTAL_BY_TIME_RANGE #3277

a-noakes opened this issue Oct 22, 2024 · 0 comments
Labels
Feature Adds new functionality

Comments

@a-noakes
Copy link

We ingest quite a number of REST and SOAP API's into our platform, many of which doesn't produce large datasets. However, incremental by time range makes it very convenient to ingest data from these sources.

By default, partitioning is applied which can cause many partitions each of which containing a small amount of data. This requires additional overhead to read from. The general recommendation from databricks is to not partition data smaller then 1TB.

I ran some tests for a couple of very small datasets to compare the performance between a daily partitioned table and a non partitioned table. It may give a general indication.

Dataset 1: 33k rows (index 1)
Dataset 2: 240k rows (index 2)

Partitioned:

  • select * : [6s, 2s]
  • select * where range between: [6s, 3s]
  • insert / replace where: [5s, 4s]

Unpartitioned:

  • select * : [2s, 1s]
  • select * where range between: [2s, 1s]
  • insert / replace where: [5s, 5s]

I would be nice to have the option to choose whether to partition a table or not, giving the user more flexibility in performance tuning.

@treysp treysp added the Feature Adds new functionality label Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Adds new functionality
Projects
None yet
Development

No branches or pull requests

2 participants