This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Switch to using requester pays on storage buckets #23

Merged
22 commits merged on May 24, 2022
Commits
9d9c1dc
[pre-commit.ci] pre-commit autoupdate
pre-commit-ci[bot] May 16, 2022
82eff5f
Add google auth to tox workflow to access requester-pays bucket
zaneselvans May 17, 2022
31cfa25
Pass GCP environment vars through to tox
zaneselvans May 17, 2022
1f86011
Add requester_pays to storage_options in catalog and read_parquet() t…
zaneselvans May 17, 2022
4306302
Remove GCP environment variables to see if they were necessary.
zaneselvans May 17, 2022
5730e58
Add export of environment variables back into workflow
zaneselvans May 17, 2022
74d6b06
Try passing GCP_PROJECT in to tox environment
zaneselvans May 17, 2022
fc20b0b
Try passing all GCP project envs in to tox environment
zaneselvans May 17, 2022
6dab958
Pass in only GOOGLE_APPLICATION_CREDENTIALS to tox
zaneselvans May 17, 2022
feba6e7
Pass in only GOOGLE_APPLICATION_CREDENTIALS to tox
zaneselvans May 17, 2022
a384620
Modify rstcheck configuration to work with v6.0.0
zaneselvans May 17, 2022
0a6e260
Ignore unused hyperlink reference warning
zaneselvans May 17, 2022
0ac3b43
Add requester_pays to the gs:// specific storage_options.
zaneselvans May 17, 2022
7fc42da
Move required read_parquet() options into catalog definition.
zaneselvans May 17, 2022
86d0de8
Ensure that example notebook works with requester_pays.
zaneselvans May 17, 2022
3e5b42d
Add requester pays instructions to README
zaneselvans May 19, 2022
0043917
Update pre-commit hooks to rstcheck 6.0.0a2
zaneselvans May 20, 2022
aeeae18
Merge branch 'dev' into pre-commit-ci-update-config
zaneselvans May 20, 2022
9cfbf88
Merge pull request #21 from catalyst-cooperative/pre-commit-ci-update…
zaneselvans May 20, 2022
4b3d1c5
Add upper bounds on dependencies so dependabot catches and tests chan…
zaneselvans May 23, 2022
efcdae2
Bring pre-commit-ci & dependabot auto-merge over from cheshire.
zaneselvans May 24, 2022
77f9db5
Run ci-notify regardless of whether ci-test passed.
zaneselvans May 24, 2022
7 changes: 7 additions & 0 deletions .github/workflows/tox-pytest.yml
@@ -15,6 +15,13 @@ jobs:
with:
fetch-depth: 2

- id: 'auth'
name: 'Authenticate to Google Cloud'
uses: 'google-github-actions/auth@v0'
with:
credentials_json: '${{ secrets.GOOGLE_CREDENTIALS }}'
export_environment_variables: true

- name: Set up conda environment for testing
uses: conda-incubator/[email protected]
with:
131 changes: 96 additions & 35 deletions README.rst
@@ -50,55 +50,89 @@ Currently available datasets
Future datasets
~~~~~~~~~~~~~~~

* Raw FERC Form 1 DB (SQL) – `browse DB online <https://data.catalyst.coop/ferc1>`__
* PUDL DB (SQL) – `browse DB online <https://data.catalyst.coop/pudl>`__
* Census Demographic Profile 1 (SQL)
* Raw FERC Form 1 DB (SQLite) -- `browse DB online <https://data.catalyst.coop/ferc1>`__
* PUDL DB (SQLite) -- `browse DB online <https://data.catalyst.coop/pudl>`__
* Census Demographic Profile 1 (SQLite)

Ongoing Development
-------------------

Development is currently being organized under these epics in the main
PUDL repo:

* `Intake SQLite Driver <https://github.com/catalyst-cooperative/pudl/issues/1156>`__
* `EPA CEMS Intake Catalog <https://github.com/catalyst-cooperative/pudl/issues/1564>`__
* `Prototype SQLite Intake Catalog <https://github.com/catalyst-cooperative/pudl/issues/1156>`__
* `PUDL Intake Catalog <https://github.com/catalyst-cooperative/pudl/issues/1179>`__

See the `issues in this repository
See the `issues in the pudl-catalog repository
<https://github.com/catalyst-cooperative/pudl-catalog/issues>`__ for more
detailed tasks.

Planned data distribution system
Usage
-----

Public data and "requester pays"
Member

can we add this stuff into the data access page instead of the readme??

Member Author

I'll break it out into a separate page and link to it. I agree this is too much detail for the README.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We’re in the process of implementing automated nightly builds of all of our data
products for each development branch with new commits in the main PUDL
repository. This will allow us to do exhaustive integration testing and data
validation on a daily basis. If all of the tests and data validation pass, then
a new version of the data products (SQLite databases and Parquet files) will be
produced, and placed into cloud storage.
The data we're publishing in the PUDL Catalog is publicly accessible and distributed
under the permissive `CC-BY-4.0 <https://creativecommons.org/licenses/by/4.0>`__
license. Catalyst covers the cost of storing the data in Google Cloud Storage buckets.
However, egress fees are incurred whenever data leaves Google's cloud infrastructure.
Depending on where you're downloading from, they run $0.10-0.20 (USD) per GB.

These outputs will be made available via a data catalog on a corresponding
branch in this ``pudl-catalog`` repository. In general only the catalogs and data
resources corresponding to the ``HEAD`` of development and feature branches will
be available. Releases that are tagged on the ``main`` branch will be retained
long term.
To share large amounts of public data without risking a large, unexpected bill from
Google if someone maliciously or accidentally downloads a huge volume of data
programmatically, we've configured the cloud storage to use `requester pays
<https://cloud.google.com/storage/docs/requester-pays>`__. This means the person
downloading the data is responsible for those (modest) costs instead. Downloading all of
the EPA CEMS, FERC 1, PUDL, and US Census data we're publishing from within North America
costs around $0.75, and the data is cached locally so it isn't downloaded again until
a new version is released.

The idea is that for any released version of PUDL, you should also be able to
install a corresponding data catalog, and know that the software and the data
are compatible. You can also install just the data catalog with minimal
dependencies, and not need to worry about the PUDL software that produced it at
all, if you simply want to access the DBs or Parquet files directly.
To set up a GCP billing project and use it for authentication when accessing the
catalog:

In development, this arrangement will mean that every morning you should have
access to a fully processed set of data products that reflect the branch of code
that you’re working on, rather than the data and code getting progressively
further out of sync as you do development, until you take the time to re-run the
full ETL locally yourself.
* `Create a project on GCP <https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project>`__;
if this is your first time using GCP, a prompt should appear asking you to choose which
Google account to use for your GCP-related activities. (You should also receive $300
in initial cloud credits.)
* `Create a Cloud Billing account <https://cloud.google.com/billing/docs/how-to/manage-billing-account#create_a_new_billing_account>`__
associated with the project and `enable billing for the project
<https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project>`__
through this account.
* `Using Google Cloud IAM <https://cloud.google.com/iam/docs/granting-changing-revoking-access#granting-console>`__,
add the **Service Usage Consumer** role to your account, which enables it to make
billed requests on behalf of the project.
* Install the `gcloud utilities <https://cloud.google.com/sdk/docs/install>`__ on your
computer. This can be done using ``conda`` (or ``mamba``):

.. code:: bash

conda install -c conda-forge google-cloud-sdk

* Initialize the ``gcloud`` command line interface, logging into the account used to
create the aforementioned project and selecting it as the default project; this will
allow the project to be used for requester pays access through the command line:

.. code:: bash

gcloud auth login
gcloud init
Member

this step asked me some questions:

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration

Member

it also asked:

Pick cloud project to use: 
 [1] northern-hope-350719
 [2] pudl-test
 [3] Enter a project ID
 [4] Create a new project

(I had just made the pudl-test project when i first logged in)

Member

I chose 2

Member Author

Which one did you choose? I think it should be the project you just created, and within which you should have given yourself the permission to spend money.


* Finally, use ``gcloud`` to establish application default credentials; this will allow
the project to be used for requester pays access through applications:

.. code:: bash

gcloud auth application-default login
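
Once application default credentials are in place, Python libraries that talk to Google
Cloud Storage (such as ``gcsfs``, which tools like ``pandas``, ``dask``, and the intake
catalog rely on for ``gs://`` paths) should pick them up automatically. As a quick sanity
check, here is a minimal sketch that assumes ``fsspec`` and ``gcsfs`` are installed in
your environment:

.. code:: python

   import fsspec

   # The "gs" filesystem picks up the application default credentials created above;
   # requester_pays=True bills the (modest) egress costs to your default GCP project.
   fs = fsspec.filesystem("gs", requester_pays=True)

   # Listing the catalog bucket is a cheap way to confirm that access works.
   print(fs.ls("intake.catalyst.coop"))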

Example Usage
-------------
* To test whether your GCP account is set up correctly and authenticated, you can run the
following command to list the contents of the cloud storage bucket containing the
intake catalog data:

See the notebook included in this repository for more details.
.. code:: bash

gsutil ls gs://intake.catalyst.coop
Member

okay i got:

(pudl-dev) %0 ~/code% gsutil ls gs://intake.catalyst.coop                                                            christinagosnell@New-Shiny-Thing
BadRequestException: 400 Bucket is a requester pays bucket but no user project provided.
[1]    18496 exit 1     gsutil ls gs://intake.catalyst.coop

Member Author

What do you get when you do

gcloud config configurations list

Member Author

(but good to know that the gsutil ls is a sufficient check!)

Member

(pudl-dev) %0 ~/code% gcloud config configurations list                                                              christinagosnell@New-Shiny-Thing
NAME     IS_ACTIVE  ACCOUNT               PROJECT    COMPUTE_DEFAULT_ZONE  COMPUTE_DEFAULT_REGION
default  True       [email protected]  pudl-test


Import Intake Catalogs
~~~~~~~~~~~~~~~~~~~~~~
@@ -222,11 +256,7 @@ on that dataframe to actually read the data and return a pandas dataframe:
states=["ID", "CO", "TX"],
)
epacems_df = (
pudl_cat.hourly_emissions_epacems(
filters=filters
index=False,
split_row_groups=True,
)
pudl_cat.hourly_emissions_epacems(filters=filters)
.to_dask()
.compute()
)
@@ -253,6 +283,37 @@ on that dataframe to actually read the data and return a pandas dataframe:
469,4,2019-01-01 10:00:00+00:00,2019,CO,79,298,1.0,204.0,2129.2,126.2
469,4,2019-01-01 11:00:00+00:00,2019,CO,79,298,1.0,204.0,2160.6,128.1

See the Jupyter notebook included in this repository for more details.


Planned data distribution system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We’re in the process of implementing automated nightly builds of all of our data
products for each development branch with new commits in the main PUDL
repository. This will allow us to do exhaustive integration testing and data
validation on a daily basis. If all of the tests and data validation pass, then
a new version of the data products (SQLite databases and Parquet files) will be
produced, and placed into cloud storage.

These outputs will be made available via a data catalog on a corresponding
branch in this ``pudl-catalog`` repository. In general only the catalogs and data
resources corresponding to the ``HEAD`` of development and feature branches will
be available. Releases that are tagged on the ``main`` branch will be retained
long term.

The idea is that for any released version of PUDL, you should also be able to
install a corresponding data catalog, and know that the software and the data
are compatible. You can also install just the data catalog with minimal
dependencies, and not need to worry about the PUDL software that produced it at
all, if you simply want to access the DBs or Parquet files directly.

In development, this arrangement will mean that every morning you should have
access to a fully processed set of data products that reflect the branch of code
that you’re working on, rather than the data and code getting progressively
further out of sync as you do development, until you take the time to re-run the
full ETL locally yourself.

Benefits of Intake Catalogs
---------------------------

29 changes: 15 additions & 14 deletions notebooks/pudl-catalog.ipynb
@@ -51,7 +51,8 @@
"from pudl_catalog.helpers import year_state_filter\n",
"\n",
"TEST_YEARS = [2019, 2020]\n",
"TEST_STATES = [\"ID\", \"CO\", \"TX\"]"
"TEST_STATES = [\"ID\", \"ME\"]\n",
"TEST_FILTERS = year_state_filter(years=TEST_YEARS, states=TEST_STATES)"
]
},
{
@@ -117,7 +118,7 @@
"source": [
"%%time\n",
"# This takes forever and downloads the whole dataset\n",
"pudl_cat.hourly_emissions_epacems.discover()"
"pudl_cat.hourly_emissions_epacems().discover()"
]
},
{
@@ -136,16 +137,11 @@
"%%time\n",
"print(f\"Reading data from {os.getenv('PUDL_INTAKE_PATH')}\")\n",
"print(f\"Caching data to {os.getenv('PUDL_INTAKE_CACHE')}\")\n",
"filters = year_state_filter(\n",
" years=TEST_YEARS,\n",
" states=TEST_STATES,\n",
")\n",
"display(filters)\n",
"display(TEST_FILTERS)\n",
"epacems_df = (\n",
" pudl_cat.hourly_emissions_epacems(\n",
" filters=filters,\n",
" )\n",
" .to_dask().compute()\n",
" pudl_cat.hourly_emissions_epacems(filters=TEST_FILTERS)\n",
" .to_dask()\n",
" .compute()\n",
")"
]
},
@@ -181,7 +177,10 @@
"outputs": [],
"source": [
"%%time\n",
"df1 = pd.read_parquet(f\"{os.environ['PUDL_INTAKE_PATH']}/hourly_emissions_epacems/epacems-2020-ID.parquet\")"
"df1 = pd.read_parquet(\n",
" f\"{os.getenv('PUDL_INTAKE_PATH')}/hourly_emissions_epacems/epacems-2020-ID.parquet\",\n",
" storage_options={\"requester_pays\": True},\n",
")"
]
},
{
@@ -191,7 +190,9 @@
"outputs": [],
"source": [
"%%time\n",
"df2 = pudl_cat.hourly_emissions_epacems(filters=year_state_filter(years=[2020], states=[\"ID\"])).to_dask().compute()"
"df2 = pudl_cat.hourly_emissions_epacems(\n",
" filters=year_state_filter(years=[2020], states=[\"ID\"])\n",
").to_dask().compute()"
]
},
{
@@ -221,7 +222,7 @@
"import fsspec\n",
"epacems_pq = pq.read_table(\n",
" f\"{os.environ['PUDL_INTAKE_PATH']}/hourly_emissions_epacems/epacems-2020-ID.parquet\",\n",
" filesystem=fsspec.filesystem(\"gs\"),\n",
" filesystem=fsspec.filesystem(\"gs\", requester_pays=True),\n",
")\n",
"dtype_dict = {name: dtype for name, dtype in zip(epacems_pq.schema.names, epacems_pq.schema.types)}\n",
"pprint(dtype_dict, indent=4, sort_dicts=False)"
12 changes: 10 additions & 2 deletions src/pudl_catalog/pudl_catalog.yaml
@@ -38,9 +38,13 @@ sources:
path: "https://creativecommons.org/licenses/by/4.0"
args: # These arguments are for dask.dataframe.read_parquet()
engine: 'pyarrow'
split_row_groups: true
index: false
urlpath: '{{ cache_method }}{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems.parquet'
storage_options:
token: 'anon' # Explicitly use anonymous access.
requester_pays: true
gs:
requester_pays: true
simplecache:
cache_storage: '{{ env(PUDL_INTAKE_CACHE) }}'

@@ -62,9 +66,13 @@ sources:
path: "https://creativecommons.org/licenses/by/4.0"
args: # These arguments are for dask.dataframe.read_parquet()
engine: 'pyarrow'
split_row_groups: true
index: false
urlpath: '{{ cache_method }}{{ env(PUDL_INTAKE_PATH) }}/hourly_emissions_epacems/*.parquet'
storage_options:
token: 'anon' # Explicitly use anonymous access.
requester_pays: true
gs:
requester_pays: true
simplecache:
cache_storage: '{{ env(PUDL_INTAKE_CACHE) }}'
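
For orientation, the ``args`` above are ultimately handed to ``dask.dataframe.read_parquet()``
by the intake driver. A rough, hand-written equivalent of that call looks like the sketch
below; the bucket path and cache directory are illustrative placeholders, not the actual
values of ``PUDL_INTAKE_PATH`` and ``PUDL_INTAKE_CACHE``:

.. code:: python

   import dask.dataframe as dd

   # Sketch only: substitute the real bucket prefix and a local cache directory.
   epacems = dd.read_parquet(
       "simplecache::gs://intake.catalyst.coop/<version>/hourly_emissions_epacems/*.parquet",
       engine="pyarrow",
       storage_options={
           # Options for the gs:// layer: bill egress to your default GCP project.
           "gs": {"requester_pays": True},
           # Options for the simplecache:: layer: where to keep the local copy.
           "simplecache": {"cache_storage": "/path/to/local/cache"},
       },
   )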

14 changes: 10 additions & 4 deletions tests/integration/hourly_emissions_epacems_test.py
@@ -48,7 +48,11 @@ def expected_df() -> pd.DataFrame:
partition=False,
table_name="hourly_emissions_epacems",
)
return pd.read_parquet(epacems_url, filters=TEST_FILTERS)
return pd.read_parquet(
epacems_url,
filters=TEST_FILTERS,
storage_options={"requester_pays": True},
)


@pytest.mark.parametrize(
@@ -71,7 +75,11 @@ def test_read_parquet(
protocol=protocol, partition=partition, table_name="hourly_emissions_epacems"
)
start_time = time.time()
df = pd.read_parquet(epacems_url, filters=TEST_FILTERS)
df = pd.read_parquet(
epacems_url,
filters=TEST_FILTERS,
storage_options={"requester_pays": True},
)
elapsed_time = time.time() - start_time
logger.info(f" elapsed time: {elapsed_time:.2f}s")
assert_frame_equal(df, expected_df)
@@ -103,8 +111,6 @@ def test_intake_catalog(
pudl_cat[src](
filters=TEST_FILTERS,
cache_method="",
index=False,
split_row_groups=True,
)
.to_dask()
.compute()
2 changes: 2 additions & 0 deletions tox.ini
@@ -14,6 +14,8 @@ passenv =
CONDA_PREFIX
GITHUB_*
HOME
GOOGLE_APPLICATION_CREDENTIALS

covargs = --cov={envsitepackagesdir}/pudl_catalog --cov-append --cov-report=xml
covreport = coverage report --sort=cover
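
Passing only ``GOOGLE_APPLICATION_CREDENTIALS`` through to the tox environment appears to
be enough because the Google auth libraries that ``gcsfs`` relies on discover credentials
from that environment variable. A minimal sketch of that discovery step, assuming
``google-auth`` is installed:

.. code:: python

   import google.auth

   # google-auth checks GOOGLE_APPLICATION_CREDENTIALS (exported by the GitHub Actions
   # auth step and passed through by tox) before falling back to other credential sources.
   credentials, project_id = google.auth.default()
   print(project_id)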

Expand Down