Merge remote-tracking branch 'origin/dev' into run-on-batch
rousik committed Aug 30, 2023
2 parents c57b036 + 4a3c4ad commit e211012
Showing 91 changed files with 2,344 additions and 2,996 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -52,7 +52,7 @@ repos:
- id: rm-unneeded-f-str

- repo: https://github.com/pre-commit/mirrors-prettier
rev: v3.0.1
rev: v3.0.2
hooks:
- id: prettier
types_or: [yaml]
14 changes: 7 additions & 7 deletions README.rst
@@ -64,18 +64,18 @@ What data is available?

PUDL currently integrates data from:

* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001-2021
* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2022-06
* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001-2021
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001-2021
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995-2021
* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001-2022
* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2023-06
* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001-2022
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001-2022
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995-2022
* `FERC Form 1 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__: 1994-2021
* `FERC Form 714 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__: 2006-2020
* `US Census Demographic Profile 1 Geodatabase <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__: 2010

Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment
Program <https://sloan.org/programs/research/energy-and-environment>`__, from
2021 to 2023 we will be integrating the following data as well:
2021 to 2024 we will be integrating the following data as well:

* `EIA Form 176 <https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s176>`__
(The Annual Report of Natural Gas Supply and Disposition)
@@ -124,7 +124,7 @@ Want access to all the published data in bulk? If you're familiar with Python
and `Jupyter Notebooks <https://jupyter.org/>`__ and are willing to install Docker you
can:

* `Download a PUDL data release <https://sandbox.zenodo.org/record/764696>`__ from
* `Download a PUDL data release <https://zenodo.org/record/3653158>`__ from
CERN's `Zenodo <https://zenodo.org>`__ archiving service.
* `Install Docker <https://docs.docker.com/get-docker/>`__
* Run the archived image using ``docker-compose up``
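
Once the ``pudl.sqlite`` database from a data release has been downloaded and
extracted, you can explore it with nothing beyond the Python standard library.
This is a minimal sketch (not part of the README being diffed); the path below
is a placeholder for wherever you unpacked the release::

    import sqlite3

    # Path to the pudl.sqlite file from the downloaded data release (placeholder).
    PUDL_DB_PATH = "pudl_data_release/pudl.sqlite"

    conn = sqlite3.connect(PUDL_DB_PATH)
    # List the tables available in the PUDL SQLite database.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
    ).fetchall()
    for (name,) in tables:
        print(name)
    conn.close()
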
@@ -42,6 +42,7 @@
"\n",
"# Local libraries\n",
"import pudl\n",
"from pudl.workspace.setup import PudlPaths\n",
"from pudl.analysis.ferc1_eia_train import *"
]
},
@@ -54,8 +55,7 @@
},
"outputs": [],
"source": [
"pudl_settings = pudl.workspace.setup.get_defaults()\n",
"pudl_engine = sa.create_engine(pudl_settings['pudl_db'])\n",
"pudl_engine = sa.create_engine(PudlPaths().pudl_db)\n",
"pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS', fill_net_gen=True)"
]
},
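
The notebook cells above replace the old ``pudl.workspace.setup.get_defaults()``
settings dict with the new ``PudlPaths`` class. As a standalone sketch of the new
pattern (assuming a working PUDL environment with the input and output locations
configured, e.g. via ``$PUDL_INPUT``), the equivalent setup looks like::

    import sqlalchemy as sa

    import pudl
    from pudl.workspace.setup import PudlPaths

    # PudlPaths() replaces the old get_defaults() settings dict; its pudl_db
    # property points at the PUDL SQLite database.
    pudl_engine = sa.create_engine(PudlPaths().pudl_db)
    pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq="AS", fill_net_gen=True)
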
@@ -43,6 +43,7 @@
"\n",
"# Local libraries\n",
"import pudl\n",
"from pudl.workspace.setup import PudlPaths\n",
"from pudl.analysis.ferc1_eia_train import *"
]
},
@@ -55,8 +56,7 @@
},
"outputs": [],
"source": [
"pudl_settings = pudl.workspace.setup.get_defaults()\n",
"pudl_engine = sa.create_engine(pudl_settings['pudl_db'])\n",
"pudl_engine = sa.create_engine(PudlPaths().pudl_db)\n",
"pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS', fill_net_gen=True)"
]
},
2 changes: 1 addition & 1 deletion docs/data_access.rst
@@ -83,7 +83,7 @@ AWS CLI, or programmatically via the S3 API. They can also be downloaded directly over
HTTPS using the following links:

* `PUDL SQLite DB <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/pudl.sqlite>`__
* `EPA CEMS Hourly Emissions Parquet (1995-2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/hourly_emissions_epacems.parquet>`__
* `EPA CEMS Hourly Emissions Parquet (1995-2022) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/hourly_emissions_epacems.parquet>`__
* `Census DP1 SQLite DB (2010) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/censusdp1tract.sqlite>`__

* Raw FERC Form 1:
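
As a quick illustration (not part of the documentation being diffed), the Parquet
file linked above can be read straight from its HTTPS URL with pandas, assuming
``pandas`` plus a Parquet engine (``pyarrow``) and an HTTP-capable filesystem layer
(``fsspec`` with ``aiohttp``) are installed::

    import pandas as pd

    CEMS_URL = (
        "https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/"
        "hourly_emissions_epacems.parquet"
    )

    # The file holds hourly data for 1995-2022, so this is a large download;
    # in practice you may prefer to fetch it once and read it locally.
    epacems = pd.read_parquet(CEMS_URL)
    print(epacems.shape)
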
39 changes: 13 additions & 26 deletions docs/dev/datastore.rst
@@ -38,15 +38,17 @@ For more detailed usage information, see:
$ pudl_datastore --help
The downloaded data will be used by the script to populate a datastore under
the ``data`` directory in your workspace, organized by data source, form, and
date::
your ``$PUDL_INPUT`` directory, organized by data source, form, and DOI::

data/censusdp1tract/
data/eia860/
data/eia860m/
data/eia861/
data/eia923/
data/epacems/
data/ferc1/
data/ferc2/
data/ferc60/
data/ferc714/
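
A quick way to confirm what the script actually downloaded is to walk the
``$PUDL_INPUT`` directory referenced above; this small sketch uses only the
standard library::

    import os
    from pathlib import Path

    # $PUDL_INPUT is the same environment variable described above.
    pudl_input = Path(os.environ["PUDL_INPUT"])
    for source_dir in sorted(p for p in pudl_input.iterdir() if p.is_dir()):
        n_files = sum(1 for f in source_dir.rglob("*") if f.is_file())
        print(f"{source_dir.name}: {n_files} files")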

If the download fails to complete successfully, the script can be run repeatedly until
@@ -64,28 +66,13 @@ archival and versioning of datasets. See the `documentation
for information on adding datasets to the datastore.


Prepare the Datastore
^^^^^^^^^^^^^^^^^^^^^
Tell PUDL about the archive
^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have used pudl-archiver to prepare a Zenodo archive as above, you
can add support for your archive to the datastore by adding the DOI to
pudl.workspace.datastore.DOI, under "sandbox" or "production" as appropriate.

If you want to prepare an archive for the datastore separately, the following
are required.

#. The root path must contain a ``datapackage.json`` file that conforms to the
`frictionless datapackage spec <https://specs.frictionlessdata.io/data-package/>`__
#. Each listed resource among the ``datapackage.json`` resources must include:

* ``path`` containing the zenodo download url for the specific file.
* ``remote_url`` with the same url as the ``path``
* ``name`` of the file
* ``hash`` with the md5 hash of the file
* ``parts`` a set of key / value pairs defining additional attributes that
can be used to select a subset of the whole datapackage. For example, the
``epacems`` dataset is partitioned by year and state, and
``"parts": {"year": 2010, "state": "ca"}`` would indicate that the
resource contains data for the state of California in the year 2010.
Unpartitioned datasets like the ``ferc714`` which includes all years in
a single file, would have an empty ``"parts": {}``
Once you have used pudl-archiver to prepare a Zenodo archive as above, you
can make the PUDL Datastore aware of it by updating the appropriate DOI in
:class:`pudl.workspace.datastore.ZenodoFetcher`. DOIs can refer to resources from the
`Zenodo sandbox server <https://sandbox.zenodo.org>`__ for archives that are still in
testing or development (sandbox DOIs have a prefix of ``10.5072``), or the
`Zenodo production server <https://zenodo.org>`__ if the archive is ready for
public use (production DOIs have a prefix of ``10.5281``).
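
Since the sandbox and production prefixes differ, a trivial check can tell the two
apart. This helper is purely illustrative (it is not an existing PUDL function, and
the example DOIs are placeholders)::

    def is_sandbox_doi(doi: str) -> bool:
        """Return True for Zenodo sandbox DOIs, which use the 10.5072 prefix."""
        return doi.startswith("10.5072/")

    # Production DOIs use the 10.5281 prefix instead.
    assert is_sandbox_doi("10.5072/zenodo.123456")
    assert not is_sandbox_doi("10.5281/zenodo.3653158")
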
1 change: 0 additions & 1 deletion docs/dev/testing.rst
@@ -304,7 +304,6 @@ You can always check to see what custom flags exist by running
Path to a non-standard ETL settings file to use.
--gcs-cache-path=GCS_CACHE_PATH
If set, use this GCS path as a datastore cache layer.
--sandbox Use raw inputs from the Zenodo sandbox server.
The main flexibility that these custom options provide is in selecting where
the raw input data comes from and what data the tests should be run
2 changes: 2 additions & 0 deletions docs/release_notes.rst
@@ -71,6 +71,8 @@ Data Coverage

* Updated :doc:`data_sources/eia860` to include early release data from 2022.
* Updated :doc:`data_sources/eia923` to include early release data from 2022.
* Updated :doc:`data_sources/epacems` to switch from the old FTP server to the new
CAMPD API, and to include 2022 data.
* New :ref:`epacamd_eia` crosswalk version v0.3, see issue :issue:`2317` and PR
:pr:`2316`. EPA's updates add manual matches and exclusions focusing on operating
units with a generator ID as of 2018.
4 changes: 2 additions & 2 deletions migrations/env.py
@@ -5,7 +5,7 @@
from sqlalchemy import engine_from_config, pool

from pudl.metadata.classes import Package
from pudl.workspace.setup import get_defaults
from pudl.workspace.setup import PudlPaths

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
@@ -28,7 +28,7 @@
# my_important_option = config.get_main_option("my_important_option")
# ... etc.

db_location = get_defaults()["pudl_db"]
db_location = PudlPaths().pudl_db
logger.info(f"alembic config.sqlalchemy.url: {db_location}")
config.set_main_option("sqlalchemy.url", db_location)
