This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Speed up querying of partitioned Parquet data on GCS #8

Closed
zaneselvans opened this issue Apr 7, 2022 · 1 comment

@zaneselvans (Member) commented on Apr 7, 2022

Using pd.read_parquet()

With pd.read_parquet(), reading from a collection of remote Parquet files over the gcs:// protocol takes roughly twice as long as reading the same data from a single remote Parquet file, but no comparable slowdown occurs locally:

import pandas as pd

# Select ~1% of the 800M rows in the dataset, from 6 of 1274 row groups:
filters = [
    [('year', '=', 2019), ('state', '=', 'ID')],
    [('year', '=', 2019), ('state', '=', 'CO')],
    [('year', '=', 2019), ('state', '=', 'TX')],
    [('year', '=', 2020), ('state', '=', 'ID')],
    [('year', '=', 2020), ('state', '=', 'CO')],
    [('year', '=', 2020), ('state', '=', 'TX')]
]

single_file_local = pd.read_parquet("../data/hourly_emissions_epacems.parquet", filters=filters)
# CPU times: user 2.58 s, sys: 778 ms, total: 3.35 s
# Wall time: 2.23 s

multi_file_local = pd.read_parquet("../data/hourly_emissions_epacems", filters=filters)
# CPU times: user 4.57 s, sys: 1.01 s, total: 5.58 s
# Wall time: 2.67 s

single_file_remote = pd.read_parquet("gcs://catalyst.coop/intake/test/hourly_emissions_epacems.parquet", filters=filters)
# CPU times: user 5.33 s, sys: 1.22 s, total: 6.56 s
# Wall time: 25 s

multi_file_remote = pd.read_parquet("gcs://catalyst.coop/intake/test/hourly_emissions_epacems", filters=filters)
# CPU times: user 16.2 s, sys: 2.61 s, total: 18.8 s
# Wall time: 51.7 s
  • Is predicate pushdown not being applied remotely, so that only the files / row groups containing the requested data get scanned? (See the metadata-inspection sketch after this list.)
  • Looking at the %%time reports, user time roughly doubles locally for the partitioned data, but elapsed time doesn't. Is it reading with multiple threads locally but only a single thread remotely?
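
To check the pushdown question, one thing I can try is reading just the Parquet footer over GCS and seeing whether per-row-group statistics for the filter columns are present. A rough sketch (same test bucket path as above; if the statistics come back as None, the reader has no choice but to scan everything):

import gcsfs
import pyarrow.parquet as pq

# Read only the footer metadata of the single-file version on GCS.
fs = gcsfs.GCSFileSystem()
with fs.open("catalyst.coop/intake/test/hourly_emissions_epacems.parquet") as f:
    meta = pq.ParquetFile(f).metadata
    print(f"{meta.num_row_groups} row groups")
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        # If col.statistics is None, filters can't prune this row group.
        print(col.path_in_schema, col.statistics)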

Using intake_parquet

Even setting aside the nearly 12 minutes of apparent network transfer time, the same query that took 25 seconds with pd.read_parquet() used 3 minutes of CPU time here. I really need to be able to toggle caching on and off before experimenting further.

# Not sure whether passing empty storage_options actually disabled caching;
# it seems to have re-downloaded the whole dataset and put it... where?
single_file_intake = pudl_cat.hourly_emissions_epacems(
    storage_options={}, filters=filters
).to_dask().compute()
# CPU times: user 2min 17s, sys: 44.2 s, total: 3min 1s
# Wall time: 14min 49s
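
For comparison, here's roughly what explicit cache control looks like with fsspec's chained simplecache:: protocol, outside of intake entirely. The cache directory is hypothetical, and I'm assuming the catalog's storage_options ultimately flow through to fsspec like this:

import fsspec
import pandas as pd

# No cache: read straight from GCS every time.
df = pd.read_parquet(
    "gcs://catalyst.coop/intake/test/hourly_emissions_epacems.parquet",
    filters=filters,
)

# With simplecache, the file is downloaded once into an explicit local
# directory (hypothetical path) and reused on subsequent reads.
with fsspec.open(
    "simplecache::gcs://catalyst.coop/intake/test/hourly_emissions_epacems.parquet",
    simplecache={"cache_storage": "/tmp/pudl-cache"},
) as f:
    df_cached = pd.read_parquet(f, filters=filters)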
zaneselvans added the intake, parquet, and performance labels on Apr 7, 2022
zaneselvans self-assigned this on Apr 21, 2022
@zaneselvans (Member, Author) commented

Consolidating this issue with #4 since I'm pretty sure they're both fixable with external metadata.
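
The sketch below is my rough understanding of the external-metadata approach: collect every partition file's footer into a single _metadata file at the dataset root, so a reader can plan the whole scan (and prune row groups) from one small read. The relative-path handling and the pyarrow dataset/fragment calls are assumptions, not tested code:

import pyarrow.parquet as pq

root = "../data/hourly_emissions_epacems"  # local copy of the partitioned data
dataset = pq.ParquetDataset(root, use_legacy_dataset=False)
frags = list(dataset.fragments)

collected = []
for frag in frags:
    md = frag.metadata
    # Record each file's path relative to the dataset root (rough string
    # handling; assumes frag.path is prefixed by root).
    md.set_file_path(frag.path[len(root) + 1:])
    collected.append(md)

# Write one _metadata file concatenating all the row-group footers, using
# the physical file schema to match the collected (non-partition) columns.
pq.write_metadata(
    frags[0].physical_schema,
    f"{root}/_metadata",
    metadata_collector=collected,
)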
