Merge remote-tracking branch 'origin/dev' into run-on-batch
rousik committed Aug 30, 2023
2 parents c57b036 + 4a3c4ad commit e211012
Showing 91 changed files with 2,344 additions and 2,996 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -52,7 +52,7 @@ repos:
- id: rm-unneeded-f-str

- repo: https://github.com/pre-commit/mirrors-prettier
rev: v3.0.1
rev: v3.0.2
hooks:
- id: prettier
types_or: [yaml]
14 changes: 7 additions & 7 deletions README.rst
@@ -64,18 +64,18 @@ What data is available?

PUDL currently integrates data from:

* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001-2021
* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2022-06
* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001-2021
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001-2021
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995-2021
* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001-2022
* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2023-06
* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001-2022
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001-2022
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995-2022
* `FERC Form 1 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__: 1994-2021
* `FERC Form 714 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__: 2006-2020
* `US Census Demographic Profile 1 Geodatabase <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__: 2010

Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment
Program <https://sloan.org/programs/research/energy-and-environment>`__, from
2021 to 2023 we will be integrating the following data as well:
2021 to 2024 we will be integrating the following data as well:

* `EIA Form 176 <https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s176>`__
(The Annual Report of Natural Gas Supply and Disposition)
@@ -124,7 +124,7 @@ Want access to all the published data in bulk? If you're familiar with Python
and `Jupyter Notebooks <https://jupyter.org/>`__ and are willing to install Docker you
can:

* `Download a PUDL data release <https://sandbox.zenodo.org/record/764696>`__ from
* `Download a PUDL data release <https://zenodo.org/record/3653158>`__ from
CERN's `Zenodo <https://zenodo.org>`__ archiving service.
* `Install Docker <https://docs.docker.com/get-docker/>`__
* Run the archived image using ``docker-compose up``
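
Once the ``pudl.sqlite`` database from a data release has been downloaded and
extracted, you can explore it with nothing beyond the Python standard library.
This is a minimal sketch (not part of the README being diffed); the path below
is a placeholder for wherever you unpacked the release::

    import sqlite3

    # Path to the pudl.sqlite file from the downloaded data release (placeholder).
    PUDL_DB_PATH = "pudl_data_release/pudl.sqlite"

    conn = sqlite3.connect(PUDL_DB_PATH)
    # List the tables available in the PUDL SQLite database.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
    ).fetchall()
    for (name,) in tables:
        print(name)
    conn.close()
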
@@ -42,6 +42,7 @@
"\n",
"# Local libraries\n",
"import pudl\n",
"from pudl.workspace.setup import PudlPaths\n",
"from pudl.analysis.ferc1_eia_train import *"
]
},
@@ -54,8 +55,7 @@
},
"outputs": [],
"source": [
"pudl_settings = pudl.workspace.setup.get_defaults()\n",
"pudl_engine = sa.create_engine(pudl_settings['pudl_db'])\n",
"pudl_engine = sa.create_engine(PudlPaths().pudl_db)\n",
"pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS', fill_net_gen=True)"
]
},
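
The notebook cells above replace the old ``pudl.workspace.setup.get_defaults()``
settings dict with the new ``PudlPaths`` class. As a standalone sketch of the new
pattern (assuming a working PUDL environment with the input and output locations
configured, e.g. via ``$PUDL_INPUT``), the equivalent setup looks like::

    import sqlalchemy as sa

    import pudl
    from pudl.workspace.setup import PudlPaths

    # PudlPaths() replaces the old get_defaults() settings dict; its pudl_db
    # property points at the PUDL SQLite database.
    pudl_engine = sa.create_engine(PudlPaths().pudl_db)
    pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq="AS", fill_net_gen=True)
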
@@ -43,6 +43,7 @@
"\n",
"# Local libraries\n",
"import pudl\n",
"from pudl.workspace.setup import PudlPaths\n",
"from pudl.analysis.ferc1_eia_train import *"
]
},
@@ -55,8 +56,7 @@
},
"outputs": [],
"source": [
"pudl_settings = pudl.workspace.setup.get_defaults()\n",
"pudl_engine = sa.create_engine(pudl_settings['pudl_db'])\n",
"pudl_engine = sa.create_engine(PudlPaths().pudl_db)\n",
"pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS', fill_net_gen=True)"
]
},
2 changes: 1 addition & 1 deletion docs/data_access.rst
@@ -83,7 +83,7 @@ AWS CLI, or programmatically via the S3 API. They can also be downloaded directly over
HTTPS using the following links:

* `PUDL SQLite DB <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/pudl.sqlite>`__
* `EPA CEMS Hourly Emissions Parquet (1995-2021) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/hourly_emissions_epacems.parquet>`__
* `EPA CEMS Hourly Emissions Parquet (1995-2022) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/hourly_emissions_epacems.parquet>`__
* `Census DP1 SQLite DB (2010) <https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/censusdp1tract.sqlite>`__

* Raw FERC Form 1:
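
As a quick illustration (not part of the documentation being diffed), the Parquet
file linked above can be read straight from its HTTPS URL with pandas, assuming
``pandas`` plus a Parquet engine (``pyarrow``) and an HTTP-capable filesystem layer
(``fsspec`` with ``aiohttp``) are installed::

    import pandas as pd

    CEMS_URL = (
        "https://s3.us-west-2.amazonaws.com/intake.catalyst.coop/dev/"
        "hourly_emissions_epacems.parquet"
    )

    # The file holds hourly data for 1995-2022, so this is a large download;
    # in practice you may prefer to fetch it once and read it locally.
    epacems = pd.read_parquet(CEMS_URL)
    print(epacems.shape)
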
39 changes: 13 additions & 26 deletions docs/dev/datastore.rst
@@ -38,15 +38,17 @@ For more detailed usage information, see:
$ pudl_datastore --help
The downloaded data will be used by the script to populate a datastore under
the ``data`` directory in your workspace, organized by data source, form, and
date::
your ``$PUDL_INPUT`` directory, organized by data source, form, and DOI::

data/censusdp1tract/
data/eia860/
data/eia860m/
data/eia861/
data/eia923/
data/epacems/
data/ferc1/
data/ferc2/
data/ferc60/
data/ferc714/
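
A quick way to confirm what the script actually downloaded is to walk the
``$PUDL_INPUT`` directory referenced above; this small sketch uses only the
standard library::

    import os
    from pathlib import Path

    # $PUDL_INPUT is the same environment variable described above.
    pudl_input = Path(os.environ["PUDL_INPUT"])
    for source_dir in sorted(p for p in pudl_input.iterdir() if p.is_dir()):
        n_files = sum(1 for f in source_dir.rglob("*") if f.is_file())
        print(f"{source_dir.name}: {n_files} files")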

If the download fails to complete successfully, the script can be run repeatedly until
@@ -64,28 +66,13 @@ archival and versioning of datasets. See the `documentation
for information on adding datasets to the datastore.


Prepare the Datastore
^^^^^^^^^^^^^^^^^^^^^
Tell PUDL about the archive
^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have used pudl-archiver to prepare a Zenodo archive as above, you
can add support for your archive to the datastore by adding the DOI to
pudl.workspace.datastore.DOI, under "sandbox" or "production" as appropriate.

If you want to prepare an archive for the datastore separately, the following
are required.

#. The root path must contain a ``datapackage.json`` file that conforms to the
`frictionless datapackage spec <https://specs.frictionlessdata.io/data-package/>`__
#. Each listed resource among the ``datapackage.json`` resources must include:

* ``path`` containing the zenodo download url for the specific file.
* ``remote_url`` with the same url as the ``path``
* ``name`` of the file
* ``hash`` with the md5 hash of the file
* ``parts`` a set of key / value pairs defining additional attributes that
can be used to select a subset of the whole datapackage. For example, the
``epacems`` dataset is partitioned by year and state, and
``"parts": {"year": 2010, "state": "ca"}`` would indicate that the
resource contains data for the state of California in the year 2010.
Unpartitioned datasets like the ``ferc714`` which includes all years in
a single file, would have an empty ``"parts": {}``
Once you have used pudl-archiver to prepare a Zenodo archive as above, you
can make the PUDL Datastore aware of it by updating the appropriate DOI in
:class:`pudl.workspace.datastore.ZenodoFetcher`. DOIs can refer to resources from the
`Zenodo sandbox server <https://sandbox.zenodo.org>`__ for archives that are still in
testing or development (sandbox DOIs have a prefix of ``10.5072``), or the
`Zenodo production server <https://zenodo.org>`__ if the archive is ready for
public use (production DOIs have a prefix of ``10.5281``).
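
Since the sandbox and production prefixes differ, a trivial check can tell the two
apart. This helper is purely illustrative (it is not an existing PUDL function, and
the example DOIs are placeholders)::

    def is_sandbox_doi(doi: str) -> bool:
        """Return True for Zenodo sandbox DOIs, which use the 10.5072 prefix."""
        return doi.startswith("10.5072/")

    # Production DOIs use the 10.5281 prefix instead.
    assert is_sandbox_doi("10.5072/zenodo.123456")
    assert not is_sandbox_doi("10.5281/zenodo.3653158")
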
1 change: 0 additions & 1 deletion docs/dev/testing.rst
@@ -304,7 +304,6 @@ You can always check to see what custom flags exist by running
Path to a non-standard ETL settings file to use.
--gcs-cache-path=GCS_CACHE_PATH
If set, use this GCS path as a datastore cache layer.
--sandbox Use raw inputs from the Zenodo sandbox server.
The main flexibility that these custom options provide is in selecting where
the raw input data comes from and what data the tests should be run
2 changes: 2 additions & 0 deletions docs/release_notes.rst
@@ -71,6 +71,8 @@ Data Coverage

* Updated :doc:`data_sources/eia860` to include early release data from 2022.
* Updated :doc:`data_sources/eia923` to include early release data from 2022.
* Updated :doc:`data_sources/epacems` to switch from the old FTP server to the new
CAMPD API, and to include 2022 data.
* New :ref:`epacamd_eia` crosswalk version v0.3, see issue :issue:`2317` and PR
:pr:`2316`. EPA's updates add manual matches and exclusions focusing on operating
units with a generator ID as of 2018.
4 changes: 2 additions & 2 deletions migrations/env.py
@@ -5,7 +5,7 @@
from sqlalchemy import engine_from_config, pool

from pudl.metadata.classes import Package
from pudl.workspace.setup import get_defaults
from pudl.workspace.setup import PudlPaths

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
@@ -28,7 +28,7 @@
# my_important_option = config.get_main_option("my_important_option")
# ... etc.

db_location = get_defaults()["pudl_db"]
db_location = PudlPaths().pudl_db
logger.info(f"alembic config.sqlalchemy.url: {db_location}")
config.set_main_option("sqlalchemy.url", db_location)
