We ingest quite a number of REST and SOAP APIs into our platform, many of which don't produce large datasets. However, incremental loading by time range makes it very convenient to ingest data from these sources.

By default, partitioning is applied, which can result in many partitions that each contain only a small amount of data. Reading these adds extra overhead. The general recommendation from Databricks is not to partition tables smaller than 1 TB.

I ran some tests on a couple of very small datasets to compare the performance of a daily partitioned table against a non-partitioned table. It may give a general indication.
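To illustrate the small-partition problem, here is a minimal pure-Python sketch (the row count matches dataset 1 below; the three-year date range is a hypothetical assumption) of how daily partitioning fragments a small table into roughly a thousand tiny partitions:

```python
from datetime import date, timedelta
import random

# Simulate dataset 1: ~33k rows spread over three years of daily loads.
# (Hypothetical spread; the real table's date range is unknown.)
random.seed(0)
start = date(2021, 1, 1)
n_days = 3 * 365  # 1095 possible daily partitions

row_dates = [start + timedelta(days=random.randrange(n_days)) for _ in range(33_000)]

# Daily partitioning puts each calendar day's rows into its own partition,
# i.e. a separate directory with at least one small file on disk.
partitions = {}
for d in row_dates:
    partitions[d] = partitions.get(d, 0) + 1

avg_rows = len(row_dates) // len(partitions)
print(f"{len(partitions)} partitions, ~{avg_rows} rows per partition")
```

Each of those partitions has to be listed and opened separately at read time, which is where the overhead comes from for tables this small.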
Dataset 1: 33k rows (index 1)
Dataset 2: 240k rows (index 2)
Partitioned:
Unpartitioned:
It would be nice to have the option to choose whether or not to partition a table, giving the user more flexibility in performance tuning.
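One possible shape for such an option, sketched as a per-table setting (purely illustrative; the `partition` key below is hypothetical and does not exist in the current configuration):

```yaml
# Hypothetical per-table configuration; the "partition" key is not an existing option.
tables:
  small_api_dataset:
    incremental: time_range
    partition: false   # skip daily partitioning for small tables
```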