Allowing public catalog access and efficient file caching #26
There's quite a lot here, so I'll try to answer a piece at a time. Many of my opinions depend on my assumptions about how you think people will actually access the data. Either of us may be mistaken! I am assuming you intend to use dask, since it is mentioned in the catalogue.
The unit of work for parquet is the row-group, so it doesn't much matter whether you have one row group or multiple per file - but that's for local files. As you say, checking out the metadata of the file, and thereby caching the whole thing, defeats the purpose. There are a few things you can do:
There isn't an obvious way to do this. I think I would do

urlpath: '{{cache_opt}}{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems/*.parquet'
...
parameters:
cache_opt:
description: "Whether to apply caching; select empty string to disable"
type: str
default: "simplecache::"
allowed: ["simplecache::", "blockcache::", ""]
So long as the user has …
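For what it's worth, a minimal sketch of how a user would flip that parameter from Python (assuming the catalog file is called catalog.yaml and the source keeps the name `hourly_emissions_epacems`; both names are placeholders):

```python
import intake

# Hypothetical catalog file name; the source and parameter names follow the YAML above.
cat = intake.open_catalog("catalog.yaml")

# Default behaviour: remote parquet is cached locally via "simplecache::".
cached = cat.hourly_emissions_epacems()

# Override the user parameter to read straight from remote storage with no caching.
uncached = cat.hourly_emissions_epacems(cache_opt="")

ddf = uncached.to_dask()
```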
**Anonymous Access**

I added …

**Caching & Consolidating Metadata**

The data only gets updated infrequently, once a month at most, and I think we'd probably just totally replace it (or create a new persistent version) so the …

File ~/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/fastparquet/writer.py:1330, in consolidate_categories(fmd)
1329 def consolidate_categories(fmd):
-> 1330 key_value = [k for k in fmd.key_value_metadata
1331 if k.key == b'pandas'][0]
1332 meta = json.loads(key_value.value)
1333 cats = [c for c in meta['columns']
1334 if 'num_categories' in (c['metadata'] or [])]
IndexError: list index out of range

Which looks like it's maybe related to the lack of pandas metadata in the parquet files, specifying what data types the columns should use when they're read by pandas, even though the data is coming from a dataframe:

pqwriter.write_table(
    pa.Table.from_pandas(df, schema=schema, preserve_index=False)
)

I was trying to stick with …

**The Single vs. Partitioned Files**

The partitioned version of the data that I've created (with one row-group per file) has filenames of the form epacems-YYYY-SS.parquet where … I originally thought that I could use …

When I try to access the partitioned dataset remotely it feels like everything takes waaay longer -- even when caching is disabled. Is that expected? There are 1274 individual parquet files. Is that excessive? Selecting one state-year of data took 30 seconds to complete on the monolithic file but 7 minutes on the partitioned files. Even when the data is cached locally there's a huge difference. On the cached monolithic file the same query takes about 2 seconds, but with the cached partitioned files it took almost 4 minutes. That seems like something must not be right.

import intake
pudl_cat = intake.cat.pudl_cat
filters = [[('year', '=', 2020), ('state', '=', 'CO')]]
# Note: both datasets are already cached locally
# No significant network traffic happened while these were running
epacems_df = (
pudl_cat.hourly_emissions_epacems_partitioned(filters=filters)
.to_dask().compute()
)
# CPU times: user 38.2 s, sys: 3.12 s, total: 41.3 s
# Wall time: 3min 46s
epacems_df = (
pudl_cat.hourly_emissions_epacems(filters=filters)
.to_dask().compute()
)
# CPU times: user 613 ms, sys: 119 ms, total: 732 ms
# Wall time: 1.48 s

Similarly:

pudl_cat.hourly_emissions_epacems_partitioned.discover()
# CPU times: user 4.69 s, sys: 383 ms, total: 5.07 s
# Wall time: 1min 59s
pudl_cat.hourly_emissions_epacems.discover()
# CPU times: user 269 ms, sys: 7.04 ms, total: 276 ms
# Wall time: 769 ms

Unless we can make the partitioned files perform better it seems like the one big file is the best way to go for now, and everyone will just have to wait 10 minutes for it to cache locally the first time they access it.

description: A catalog of open energy system data for use by climate advocates,
policymakers, journalists, researchers, and other members of civil society.
plugins:
source:
- module: intake_parquet
# - module: intake_sqlite
metadata:
parameters:
cache_method:
description: "Whether to cache data locally; empty string to disable caching."
type: str
default: "simplecache::"
allowed: ["simplecache::", ""]
creator:
title: "Catalyst Cooperative"
email: "[email protected]"
path: "https://catalyst.coop"
sources:
hourly_emissions_epacems:
driver: parquet
description: Hourly pollution emissions and plant operational data reported via
Continuous Emissions Monitoring Systems (CEMS) as required by 40 CFR Part 75.
Includes CO2, NOx, and SO2, as well as the heat content of fuel consumed and
gross power output. Hourly values reported by US EIA ORISPL code and emissions
unit (smokestack) ID.
metadata:
title: Continuous Emissions Monitoring System (CEMS) Hourly Data
type: application/parquet
provider: US Environmental Protection Agency Air Markets Program
path: "https://ampd.epa.gov/ampd"
license:
name: "CC-BY-4.0"
title: "Creative Commons Attribution 4.0"
path: "https://creativecommons.org/licenses/by/4.0"
args: # These arguments are for dask.dataframe.read_parquet()
engine: 'pyarrow'
urlpath: '{{ cache_method }}{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems.parquet'
storage_options:
token: 'anon' # Explicitly use anonymous access.
simplecache:
cache_storage: '{{ env(PUDL_INTAKE_CACHE) }}'
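If fastparquet's metadata consolidation keeps tripping over the missing pandas metadata, one alternative is to build the consolidated `_metadata` sidecar with pyarrow instead. A sketch only, with hypothetical local file names following the epacems-YYYY-SS.parquet pattern, not something run against this dataset:

```python
import pyarrow.parquet as pq

# Hypothetical partition file names; in practice this would be the full list of pieces.
paths = ["epacems-2019-CO.parquet", "epacems-2020-CO.parquet"]

metadata_collector = []
for path in paths:
    md = pq.read_metadata(path)
    md.set_file_path(path)  # record where each piece lives relative to the dataset root
    metadata_collector.append(md)

# All pieces must share this schema for the sidecar to be valid.
schema = pq.read_schema(paths[0])

# Write the consolidated sidecar files next to the partitioned data.
pq.write_metadata(schema, "_common_metadata")
pq.write_metadata(schema, "_metadata", metadata_collector=metadata_collector)
```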
With some pointers from @martindurant in [this issue](intake/intake-parquet#26) I got anonymous public access working, and caching can now be turned off when appropriate. Accessing the partitioned data is still very slow in a variety of contexts for reasons I don't understand. I also hit a snag attempting to create a consolidated external `_metadata` file to hopefully speed up access to the partitioned data so... not sure what to do there. The current Tox/pytest setup expects to find data locally, which won't work right now on GitHub. Need to set the tests up better for real world use, and less for exploring different catalog configurations. Closes #5, #6
Note that … does not do the same thing as …

Obviously, it may be worth your while figuring out what is taking all that time; I recommend …

Another possible thing you might include as an option is to follow the pattern in …
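In case it's useful, one way to see where the time goes in the slow partitioned read is to run it under cProfile. The profiler choice here is my own sketch, not necessarily what was recommended above, and it reuses `pudl_cat` and `filters` as defined earlier in the thread:

```python
import cProfile
import pstats

import intake

pudl_cat = intake.cat.pudl_cat
filters = [[("year", "=", 2020), ("state", "=", "CO")]]

# Profile the slow partitioned read to see which calls dominate the wall time.
with cProfile.Profile() as prof:
    pudl_cat.hourly_emissions_epacems_partitioned(filters=filters).to_dask().compute()

pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```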
Another confusing hiccup. We've just turned on "requester pays" for the storage bucket containing these parquet files, and I added …

hourly_emissions_epacems:
driver: parquet
args:
engine: 'pyarrow'
urlpath: '{{ cache_method }}{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems.parquet'
storage_options:
requester_pays: true
simplecache:
cache_storage: '{{ env(PUDL_INTAKE_CACHE) }}'

However, the …

So this works fine:

epacems_df = (
pudl_cat.hourly_emissions_epacems(
filters=filters,
cache_method="",
index=False,
split_row_groups=True,
)
.to_dask().compute()
)

But this:

epacems_df = (
pudl_cat.hourly_emissions_epacems(
filters=filters,
# cache_method="", # Uses the default value of "simplecache::"
index=False,
split_row_groups=True,
)
.to_dask().compute()
)

Results in:

ValueError: Bucket is requester pays. Set `requester_pays=True` when creating the GCSFileSystem.

Does that make sense to you? Why would they behave differently?
Yes: because the URL has multiple components, you need to specify which of them is to get the extra argument:

hourly_emissions_epacems:
driver: parquet
args:
engine: 'pyarrow'
urlpath: '{{ cache_method }}{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems.parquet'
storage_options:
s3:
requester_pays: true
simplecache:
cache_storage: '{{ env(PUDL_INTAKE_CACHE) }}'
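To make the routing concrete, this is roughly the fsspec call that the nesting corresponds to. A sketch with placeholder paths, and assuming the data sits behind a gs:// URL as the GCSFileSystem error suggests; the nesting key has to match the protocol that appears in the URL:

```python
import fsspec

# Each top-level keyword names the layer its options belong to: the "gs" dict goes
# to GCSFileSystem, the "simplecache" dict to the local caching layer.
of = fsspec.open(
    "simplecache::gs://some-bucket/hourly_emissions_epacems.parquet",  # placeholder path
    gs={"requester_pays": True},
    simplecache={"cache_storage": "/tmp/pudl-cache"},  # placeholder cache dir
)
```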
(hm, I'll have to think about whether this still works when caching is off - perhaps just test?)
Hmm, yes, this flips the problem. Now it works when simplecache is enabled and fails when caching is turned off. If I put …
hah, ok! This could probably use a little design on our end, but good that it works.
Hi @martindurant! I'm working with @zaneselvans on the PUDL Intake Catalog. I've moved our data to an s3 bucket and I'm running into issues with the conditional caching workaround discussed above. This is what our …

sources:
hourly_emissions_epacems:
description:
Hourly pollution emissions and plant operational data reported via
Continuous Emissions Monitoring Systems (CEMS) as required by 40 CFR Part 75.
Includes CO2, NOx, and SO2, as well as the heat content of fuel consumed and
gross power output. Hourly values reported by US EIA ORISPL code and emissions
unit (smokestack) ID.
driver: parquet
parameters:
cache_method:
description: "Whether to cache data locally; empty string to disable caching."
type: str
default: "simplecache::"
allowed: ["simplecache::", ""]
metadata:
title: Continuous Emissions Monitoring System (CEMS) Hourly Data
type: application/parquet
provider: US Environmental Protection Agency Air Markets Program
path: "https://ampd.epa.gov/ampd"
license:
name: "CC-BY-4.0"
title: "Creative Commons Attribution 4.0"
path: "https://creativecommons.org/licenses/by/4.0"
args: # These arguments are for dask.dataframe.read_parquet()
engine: "pyarrow"
split_row_groups: true
index: false
urlpath: "{{ cache_method }}{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems.parquet"
storage_options:
s3:
anon: true
simplecache:
cache_storage: "{{ env(PUDL_INTAKE_CACHE) }}"

Using …
I've tried structuring the storage options this way but then I get a keyword argument for …

Am I missing something, or can I not specify …
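To sketch the asymmetry being described (bucket and key below are placeholders, and this is my reading of the thread rather than something reproduced against the real bucket): with the chained `simplecache::` prefix, fsspec splits the nested dicts out per layer, but with a plain s3:// URL there is no chain to unpack, so the options have to be passed flat. That would explain why disabling caching turns the nested `s3:` block into a stray keyword argument.

```python
import fsspec

# Chained URL: "anon" reaches S3FileSystem, "cache_storage" reaches simplecache.
with fsspec.open(
    "simplecache::s3://bucket/hourly_emissions_epacems.parquet",  # placeholder
    s3={"anon": True},
    simplecache={"cache_storage": "/tmp/pudl-cache"},
) as f:
    f.read(4)

# Plain URL: no chaining, so the filesystem options must be passed flat.
with fsspec.open(
    "s3://bucket/hourly_emissions_epacems.parquet",  # placeholder
    anon=True,
) as f:
    f.read(4)
```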
I removed the ability to disable caching because fsspec started throwing unexpected keyword argument errors when making requests to s3 with caching disabled. See intake/intake-parquet#26 for the full explanation.
It seems like there are two different ways of passing options into different parts of the filesystem / fsspec / intake system. One is through the URL that's used as the target being read, and another is through options passed when creating the fsspec filesystem. Looking at the fsspec docs on caching, these two calls are using similar options, but provided differently:

fs = fsspec.filesystem(
"filecache",
target_protocol='s3',
target_options={'anon': True},
cache_storage='/tmp/files/')

of = fsspec.open(
"filecache::s3://bucket/key",
s3={'anon': True},
filecache={'cache_storage':'/tmp/files'}
)

Somehow the information in the Intake catalog YAML is being used to construct... one of these. I think the conditional caching was being done by constructing a URL that could start either with …

But it's confusing because parameters for lots of different functions / classes are all mingled together in the YAML file and it's not clear which ones are going to end up where. Do they go to …
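For what it's worth, the catalog's own comment says the `args` block is handed to `dask.dataframe.read_parquet()`, so the YAML above corresponds roughly to a call like the one below. This is a sketch with the templated urlpath replaced by a placeholder, not the exact intake internals:

```python
import dask.dataframe as dd

# Roughly what the catalog's args block describes; storage_options is passed on to
# fsspec, which splits the nested dicts by protocol for the chained URL.
ddf = dd.read_parquet(
    "simplecache::s3://bucket/hourly_emissions_epacems.parquet",  # placeholder urlpath
    engine="pyarrow",
    index=False,
    split_row_groups=True,
    storage_options={
        "s3": {"anon": True},
        "simplecache": {"cache_storage": "/tmp/pudl-cache"},  # placeholder cache dir
    },
)
```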
Yes, you are quite right. To answer the other question, I would be happy to consider a PR which either …
|
I'm trying to provide convenient public access to data stored in Parquet files in Google cloud storage through an Intake catalog (see this repo) but have run into a few snags. I suspect that some of them may be easy to fix, but I don't have enough context on how the system works and trial-and-error seems slow. I haven't found many public examples of how catalogs are typically set up to learn from.
One big file, or many smaller files?
Conditionally disable caching
Allowing anonymous public access
- `https://storage.googleapis.com` seems much less performant than using the `gs://` protocol.
- `https://` doesn't seem to work at all -- it downloads everything no matter what.
- … `https://` URLs (I get a 403 forbidden error) and neither does providing the URL of the "folder" that all the parquet files are in.
- … `gs://`? It seems like we just get access denied even when all the objects are public. (See the sketch below.)

**Current catalog**

…
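Regarding the anonymous-access bullets above (and the `token: 'anon'` line in the catalog's storage_options): a minimal sketch of anonymous `gs://` access straight through gcsfs, with a placeholder bucket path, just to show the moving parts outside of intake:

```python
import gcsfs

# Anonymous (unauthenticated) access to a public bucket; the path is a placeholder.
fs = gcsfs.GCSFileSystem(token="anon")
print(fs.ls("some-public-bucket/hourly_emissions_epacems/"))
```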