# Speed up querying of partitioned Parquet data on GCS #8
Labels: `intake` (Intake data catalogs), `parquet` (Apache Parquet is an open columnar data file format), `performance` (make data go faster by using less memory, disk, network, compute, etc.)
## Using `pd.read_parquet()`

When using `pd.read_parquet()`, reading data from a collection of remote Parquet files over the `gcs://` protocol takes twice as long as reading from a single Parquet file, but no similar slowdown occurs locally (timed with `%%time`).
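Roughly, the comparison looked like the following sketch (the bucket name and paths are placeholders, not the real dataset; it assumes `gcsfs` is installed so pandas can resolve `gcs://` URLs):

```python
import time

import pandas as pd

# Placeholder locations -- the real bucket name and file layout aren't shown here.
SINGLE_LOCAL = "data/all.parquet"
PARTITIONED_LOCAL = "data/partitioned/"          # directory of part files
SINGLE_REMOTE = "gcs://my-bucket/data/all.parquet"
PARTITIONED_REMOTE = "gcs://my-bucket/data/partitioned/"


def timed_read(path):
    """Read a Parquet file or partitioned dataset and report wall-clock time."""
    start = time.perf_counter()
    df = pd.read_parquet(path)  # gcsfs handles gcs:// paths transparently
    elapsed = time.perf_counter() - start
    print(f"{path}: {len(df):,} rows in {elapsed:.1f}s")
    return df


for path in (SINGLE_LOCAL, PARTITIONED_LOCAL, SINGLE_REMOTE, PARTITIONED_REMOTE):
    timed_read(path)
```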
The `user` time does double locally for the partitioned data, but the elapsed time doesn't. Is it working with multiple threads locally, but only a single thread remotely?
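One way to test that hypothesis is to force single- and multi-threaded reads explicitly. This sketch uses the `use_threads` flag of `pyarrow.parquet.read_table` together with a `gcsfs` filesystem (the dataset path is again a placeholder):

```python
import time

import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem()

# Placeholder path to the partitioned dataset inside the bucket.
PARTITIONED = "my-bucket/data/partitioned/"

for use_threads in (True, False):
    start = time.perf_counter()
    # read_table accepts a directory of Parquet files plus a use_threads
    # flag, so this isolates whether threading explains the gap.
    table = pq.read_table(PARTITIONED, filesystem=fs, use_threads=use_threads)
    elapsed = time.perf_counter() - start
    print(f"use_threads={use_threads}: {table.num_rows:,} rows in {elapsed:.1f}s")
```

If the remote gap closes when `use_threads=False` locally, that would point at the remote path being effectively single-threaded.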
## Using `intake_parquet`
Even ignoring the close to 12 minutes of apparent network transfer time, the same query took only 25 seconds with `pd.read_parquet()`, while here it took 3 minutes. I really need to be able to toggle caching on and off before I can experiment here.
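For the caching toggle, something like this might work. This is an assumption based on intake's caching docs rather than something I've verified: `cache_disabled` appears to be the config flag for turning source caching off, and the `INTAKE_DISABLE_CACHING` environment variable is supposed to do the same thing.

```python
from intake.config import conf

# Assumption: "cache_disabled" is intake's config flag for turning off
# source caching (the INTAKE_DISABLE_CACHING environment variable should
# have the same effect).
conf["cache_disabled"] = True    # time the query with caching off
# ... run the intake_parquet query here and time it ...
conf["cache_disabled"] = False   # then re-enable caching and compare
```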