diff --git a/README.rst b/README.rst
index df5edcda3e..45e08b00e3 100644
--- a/README.rst
+++ b/README.rst
@@ -59,23 +59,81 @@ it's often difficult to work with. PUDL takes the original spreadsheets, CSV fil
and databases and turns them into a unified resource. This allows users to spend more
time on novel analysis and less time on data preparation.
+The project is focused on serving researchers, activists, journalists, policy makers,
+and small businesses that might not otherwise be able to afford access to this data
+from commercial sources and who may not have the time or expertise to do all the
+data processing themselves from scratch.
+
+We want to make this data accessible and easy to work with for as wide an audience as
+possible: anyone from grassroots youth climate organizers working with Google
+Sheets to university researchers with access to scalable cloud computing
+resources, and everyone in between!
+
+PUDL comprises three core components:
+
+- **Raw Data Archives**
+
+ - PUDL `archives `__
+ all the raw data inputs on `Zenodo `__
+    to ensure permanent, versioned access to the data. In the event that an agency
+ changes how they publish data or deletes old files, the ETL will still have access
+ to the original inputs. Each of the data inputs may have several different versions
+ archived, and all are assigned a unique DOI and made available through the REST API.
+ You can read more about the Raw Data Archives in the
+ `docs `__.
+- **ETL Pipeline**
+
+ - The ETL pipeline (this repo) ingests the raw archives, cleans them,
+ integrates them, and outputs them to a series of tables stored in SQLite Databases,
+ Parquet files, and pickle files (the Data Warehouse). Each release of the PUDL
+    Python package is embedded with a set of DOIs to indicate which version of the
+    raw inputs it is meant to process. This process helps ensure that the ETL and its
+ outputs are replicable. You can read more about the ETL in the
+ `docs `__.
+- **Data Warehouse**
+
+ - The outputs from the ETL, sometimes called "PUDL outputs",
+ are stored in a data warehouse as a collection of SQLite and Parquet files so that
+ users can access the data without having to run any code. Learn more about how to
+ access the data `here `__.
+
What data is available?
-----------------------
PUDL currently integrates data from:
-* `EIA Form 860 `__: 2001-2022
-* `EIA Form 860m `__: 2023-06
-* `EIA Form 861 `__: 2001-2022
-* `EIA Form 923 `__: 2001-2022
-* `EPA Continuous Emissions Monitoring System (CEMS) `__: 1995-2022
-* `FERC Form 1 `__: 1994-2021
-* `FERC Form 714 `__: 2006-2020
-* `US Census Demographic Profile 1 Geodatabase `__: 2010
+* **EIA Form 860**: 2001-2022
+
+  - `Source Docs `__
+  - `PUDL Docs `__
+
+* **EIA Form 860m**: 2023-06
+
+  - `Source Docs `__
+
+* **EIA Form 861**: 2001-2022
+
+  - `Source Docs `__
+  - `PUDL Docs `__
+
+* **EIA Form 923**: 2001-2022
+
+  - `Source Docs `__
+  - `PUDL Docs `__
+
+* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022
+
+  - `Source Docs `__
+  - `PUDL Docs `__
+
+* **FERC Form 1**: 1994-2021
+
+  - `Source Docs `__
+  - `PUDL Docs `__
+
+* **FERC Form 714**: 2006-2020
+
+  - `Source Docs `__
+  - `PUDL Docs `__
+
+* **FERC Form 2**: 2021 (raw only)
+
+  - `Source Docs `__
+
+* **FERC Form 6**: 2021 (raw only)
+
+  - `Source Docs `__
+
+* **FERC Form 60**: 2021 (raw only)
+
+  - `Source Docs `__
+
+* **US Census Demographic Profile 1 Geodatabase**: 2010
+
+  - `Source Docs `__
+
Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment
Program `__, from
-2021 to 2024 we will be integrating the following data as well:
+2021 to 2024 we will be cleaning and integrating the following data as well:
* `EIA Form 176 `__
(The Annual Report of Natural Gas Supply and Disposition)
@@ -83,26 +141,12 @@ Program `__, from
* `FERC Form 2 `__
(Annual Report of Major Natural Gas Companies)
* `PHMSA Natural Gas Annual Report `__
-* Machine Readable Specifications of State Clean Energy Standards
-
-Who is PUDL for?
-----------------
-
-The project is focused on serving researchers, activists, journalists, policy makers,
-and small businesses that might not otherwise be able to afford access to this data
-from commercial sources and who may not have the time or expertise to do all the
-data processing themselves from scratch.
-
-We want to make this data accessible and easy to work with for as wide an audience as
-possible: anyone from a grassroots youth climate organizers working with Google
-sheets to university researchers with access to scalable cloud computing
-resources and everyone in between!
How do I access the data?
-------------------------
-There are several ways to access PUDL outputs. For more details you'll want
-to check out `the complete documentation
+There are several ways to access the information in the PUDL Data Warehouse.
+For more details you'll want to check out `the complete documentation
`__, but here's a quick overview:
Datasette
@@ -118,6 +162,19 @@ This access mode is good for casual data explorers or anyone who just wants to g
small subset of the data. It also lets you share links to a particular subset of the
data and provides a REST API for querying the data from other applications.
+Nightly Data Builds
+^^^^^^^^^^^^^^^^^^^
+If you are less concerned with reproducibility and want the freshest possible data,
+we automatically upload the outputs of our nightly builds to public S3 storage buckets
+as part of the `AWS Open Data Registry
+`__. This data is based on
+the `dev branch `__ of PUDL, and
+is updated most weekday mornings. It is also the data used to populate Datasette.
+
+The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded
+directly via the web. See `Accessing Nightly Builds `__
+for links to the individual SQLite, JSON, and Apache Parquet outputs.
+
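+For example, here is a minimal sketch of reading one of the nightly Parquet outputs
+straight from S3 with pandas (the bucket path and object key are illustrative; see
+the nightly builds documentation for the actual locations):
+
+.. code-block:: python
+
+    import pandas as pd
+
+    # Hypothetical object key; check the nightly builds docs for real paths.
+    # storage_options={"anon": True} tells s3fs to read the public bucket
+    # without AWS credentials.
+    epacems = pd.read_parquet(
+        "s3://pudl.catalyst.coop/nightly/hourly_emissions_epacems.parquet",
+        storage_options={"anon": True},
+    )
+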
Docker + Jupyter
^^^^^^^^^^^^^^^^
Want access to all the published data in bulk? If you're familiar with Python
@@ -151,19 +208,6 @@ most users. You should check out the `Development section `__ for more
details.
-Nightly Data Builds
-^^^^^^^^^^^^^^^^^^^
-If you are less concerned with reproducibility and want the freshest possible data
-we automatically upload the outputs of our nightly builds to public S3 storage buckets
-as part of the `AWS Open Data Registry
-`__. This data is based on
-the `dev branch `__, of PUDL, and
-is updated most weekday mornings. It is also the data used to populate Datasette.
-
-The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded
-directly via the web. See `Accessing Nightly Builds `__
-for links to the individual SQLite, JSON, and Apache Parquet outputs.
-
Contributing to PUDL
--------------------
Find PUDL useful? Want to help make it better? There are lots of ways to help!
diff --git a/docs/data_access.rst b/docs/data_access.rst
index ab79f0ab51..282cdb9bff 100644
--- a/docs/data_access.rst
+++ b/docs/data_access.rst
@@ -2,12 +2,17 @@
Data Access
=======================================================================================
-We publish the :doc:`PUDL pipeline ` outputs in several ways to serve
+We publish the PUDL pipeline outputs in several ways to serve
different users and use cases. We're always trying to increase accessibility of the
PUDL data, so if you have a suggestion please `open a GitHub issue
`__. If you have a question you
can `create a GitHub discussion `__.
+PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working with
+the tables prefixed with ``out_``, as they contain the most complete and most readily
+usable data. For more information about the different types of tables, read through
+:ref:`PUDL's naming conventions `.
+
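+For example, once you have a copy of ``pudl.sqlite``, an ``out_`` table can be pulled
+into pandas with a few lines (the table name below is illustrative; browse the data
+dictionary for the full list):
+
+.. code-block:: python
+
+    import sqlite3
+
+    import pandas as pd
+
+    # Any out_ table can be read the same way; this name is illustrative.
+    conn = sqlite3.connect("pudl.sqlite")
+    gens = pd.read_sql("SELECT * FROM out_eia__yearly_generators", conn)
+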
.. _access-modes:
---------------------------------------------------------------------------------------
diff --git a/docs/dev/data_guidelines.rst b/docs/dev/data_guidelines.rst
index 53bcb1c18a..dfbd40c0db 100644
--- a/docs/dev/data_guidelines.rst
+++ b/docs/dev/data_guidelines.rst
@@ -64,6 +64,8 @@ Examples of Unacceptable Changes
fuel heat content and net electricity generation. The heat rate would
be a derived value and not part of the original data.
+.. _tidy-data:
+
-------------------------------------------------------------------------------
Make Tidy Data
-------------------------------------------------------------------------------
@@ -117,24 +119,6 @@ that M/Mega is a million in SI. And a `BTU
energy required to raise the temperature of one *avoirdupois pound* of water
by 1 degree *Fahrenheit*?! What century even is this?).
--------------------------------------------------------------------------------
-Silo the ETL Process
--------------------------------------------------------------------------------
-It should be possible to run the ETL process on each data source independently
-and with any combination of data sources included. This allows users to include
-only the data need. In some cases, like the :doc:`EIA 860
-<../data_sources/eia860>` and :doc:`EIA 923 <../data_sources/eia923>` data, two
-data sources may be so intertwined that keeping them separate doesn't really
-make sense. This should be the exception, however, not the rule.
-
--------------------------------------------------------------------------------
-Separate Data from Glue
--------------------------------------------------------------------------------
-The glue that relates different data sources to each other should be applied
-after or alongside the ETL process and not as a mandatory part of ETL. This
-makes it easy to pull individual data sources in and work with them even when
-the glue isn't working or doesn't yet exist.
-
-------------------------------------------------------------------------------
Partition Big Data
-------------------------------------------------------------------------------
@@ -146,35 +130,6 @@ them to pull in only certain years, certain states, or other sensible partitions
data so that they don’t run out of memory or disk space or have to wait hours while data
they don't need is being processed.
--------------------------------------------------------------------------------
-Naming Conventions
--------------------------------------------------------------------------------
- *There are only two hard problems in computer science: caching,
- naming things, and off-by-one errors.*
-
-Use Consistent Names
-^^^^^^^^^^^^^^^^^^^^
-If two columns in different tables record the same quantity in the same units,
-give them the same name. That way if they end up in the same dataframe for
-comparison it's easy to automatically rename them with suffixes indicating
-where they came from. For example, net electricity generation is reported to
-both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
-<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
-each of those data sources. Similarly, give non-comparable quantities reported
-in different data sources **different** column names. This helps make it clear
-that the quantities are actually different.
-
-Follow Existing Conventions
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-We are trying to use consistent naming conventions for the data tables,
-columns, data sources, and functions. Generally speaking PUDL is a collection
-of subpackages organized by purpose (extract, transform, load, analysis,
-output, datastore…), containing a module for each data source. Each data source
-has a short name that is used everywhere throughout the project and is composed of
-the reporting agency and the form number or another identifying abbreviation:
-``ferc1``, ``epacems``, ``eia923``, ``eia861``, etc. See the :doc:`naming
-conventions ` document for more details.
-
-------------------------------------------------------------------------------
Complete, Continuous Time Series
-------------------------------------------------------------------------------
diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 8648e573b4..0a39ecd64b 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -1,6 +1,189 @@
===============================================================================
Naming Conventions
===============================================================================
+ *There are only two hard problems in computer science: caching,
+ naming things, and off-by-one errors.*
+
+We try to use consistent naming conventions for the data tables, data assets,
+columns, data sources, and functions.
+
+.. _asset-naming:
+
+Asset Naming Conventions
+---------------------------------------------------
+
+PUDL's data processing is divided into three layers of Dagster assets: Raw, Core,
+and Output. Dagster assets are the core unit of computation in PUDL. The outputs
+of assets can be persisted to any type of storage, though PUDL outputs are typically
+tables in a SQLite database, Parquet files, or pickle files (read more about this in
+:doc:`../index`). The asset name is used as the table or Parquet file name. Asset
+names should generally follow this convention:
+
+.. code-block::
+
+ {layer}_{source}__{asset_type}_{asset_name}
+
+* ``layer`` is the processing layer of the asset. Acceptable values are:
+ ``raw``, ``core`` and ``out``. ``layer`` is required for all assets in all layers.
+* ``source`` is an abbreviation of the original source of the data. For example,
+ ``eia860``, ``ferc1`` and ``epacems``.
+* ``asset_type`` describes how the asset is modeled.
+* ``asset_name`` should describe the entity, categorical code type, or measurement of
+ the asset. Note: FERC Form 1 assets typically include the schedule number in the
+ ``asset_name`` so users and contributors know which schedule the cleaned asset
+ refers to.
+
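+For instance, here is a minimal sketch of a Dagster asset whose name follows this
+convention (the cleaning logic is purely illustrative):
+
+.. code-block:: python
+
+    import pandas as pd
+    from dagster import asset
+
+    # core layer + eia860 source + scd asset type + plants asset name.
+    # Dagster wires up the upstream raw asset via the parameter name.
+    @asset
+    def core_eia860__scd_plants(raw_eia860__plant: pd.DataFrame) -> pd.DataFrame:
+        """Clean the raw EIA 860 plants table (illustrative stand-in logic)."""
+        return raw_eia860__plant.drop_duplicates()
+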
+Raw layer
+^^^^^^^^^
+This layer contains assets that extract data from spreadsheets and databases
+and are persisted as pickle files.
+
+Naming convention: ``raw_{source}__{asset_name}``
+
+* ``asset_name`` is typically copied from the source data.
+* ``asset_type`` is not included in this layer because the data modeling does not
+ yet conform to PUDL standards. Raw assets are typically just copies of the
+ source data.
+
+Core layer
+^^^^^^^^^^
+This layer contains assets that typically break denormalized raw assets into
+well-modeled tables that serve as building blocks for downstream wide tables
+and analyses. Well-modeled means the tables have logical primary keys, foreign
+keys, and datatypes, and generally follow
+:ref:`Tidy Data standards `. Assets in this layer create
+consistent categorical variables, deduplicate records, and impute missing data.
+These assets are typically stored as Parquet files or tables in a database.
+
+Naming convention: ``core_{source}__{asset_type}_{asset_name}``
+
+* ``source`` is sometimes ``pudl``. This means the asset is a derived connection
+  that PUDL contributors created to link multiple datasets, via manual or
+  machine-learning methods.
+
+* ``asset_type`` describes how the asset is modeled and its role in PUDL’s
+ collection of core assets. There are a handful of table types in this layer:
+
+ * ``assn``: Association tables provide connections between entities. This data
+ can be manually compiled or extracted from data sources. If the asset associates
+ data from two sources, the source names should be included in the ``asset_name``
+ in alphabetical order. Examples:
+
+ * ``core_pudl__assn_plants_eia`` associates EIA Plant IDs and manually assigned
+ PUDL Plant IDs.
+ * ``core_epa__assn_epacamd_eia`` associates EPA units with EIA plants, boilers,
+ and generators.
+ * ``codes``: Code tables contain more verbose descriptions of categorical codes
+ typically manually compiled from source data dictionaries. Examples:
+
+ * ``core_eia__codes_averaging_periods``
+ * ``core_eia__codes_balancing_authorities``
+ * ``entity``: Entity tables contain static information about entities. For example,
+ the state a plant is located in or the plant a boiler is a part of. Examples:
+
+ * ``core_eia__entity_boilers``
+ * ``core_eia923__entity_coalmine``.
+ * ``scd``: Slowly changing dimension tables describe attributes of entities that
+ rarely change. For example, the ownership or the capacity of a plant. Examples:
+
+ * ``core_eia860__scd_generators``
+ * ``core_eia860__scd_plants``.
+ * ``yearly/monthly/hourly``: Time series tables contain attributes about entities
+ that are expected to change for each reported timestamp. Time series tables
+    typically contain measurements of processes like net generation or CO2 emissions.
+ Examples:
+
+    * ``core_ferc714__hourly_demand_pa``
+    * ``core_ferc1__yearly_plant_in_service``
+
+Output layer
+^^^^^^^^^^^^
+This layer uses assets in the Core layer to construct wide and complete tables
+suitable for users to perform analysis on. This layer can contain intermediate
+tables that bridge the core and user-facing tables.
+
+Naming convention: ``out_{source}__{asset_type}_{asset_name}``
+
+* ``source`` is optional in this layer because there can be assets that join data from
+ multiple sources.
+* ``asset_type`` is also optional. It will likely describe the frequency at which
+ the data is reported (annual/monthly/hourly).
+
+Intermediate Assets
+^^^^^^^^^^^^^^^^^^^
+Intermediate assets are logical steps towards a final well-modeled core or
+user-facing output asset. These assets are not intended to be persisted in the
+database or accessible to the user. These assets are denoted by a preceding
+underscore, like a private Python method. For example, the intermediate asset
+``_core_eia860__plants`` is a logical step towards the
+``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.
+``_core_eia860__plants`` does some basic cleaning of the ``raw_eia860__plant``
+asset but still contains duplicate plant entities. The computation-intensive
+harvesting process deduplicates ``_core_eia860__plants`` and outputs the
+``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets, which
+follow Tidy Data standards.
+
+Limit the number of intermediate assets to avoid an extremely
+cluttered DAG. It is appropriate to create an intermediate asset when:
+
+  * a process has both a short-running and a long-running portion. Separating them
+    into distinct assets means debugging the short-running portion doesn't take
+    forever.
+  * a logical step in a process is frequently inspected for debugging. For example,
+    the pre-harvest assets in the ``_core_eia860`` and ``_core_eia923`` groups
+    are frequently inspected when new years of data are added.
+
+
+Columns and Field Names
+-----------------------
+If two columns in different tables record the same quantity in the same units,
+give them the same name. That way if they end up in the same dataframe for
+comparison it's easy to automatically rename them with suffixes indicating
+where they came from. For example, net electricity generation is reported to
+both :doc:`FERC Form 1 <../data_sources/ferc1>` and
+:doc:`EIA 923<../data_sources/eia923>`, so we've named columns ``net_generation_mwh``
+in each of those data sources. Similarly, give non-comparable quantities reported in
+different data sources **different** column names. This helps make it clear that the
+quantities are actually different.
+
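+As a small illustration with made-up numbers, matching names let
+:func:`pandas.merge` disambiguate the two sources automatically:
+
+.. code-block:: python
+
+    import pandas as pd
+
+    # Toy stand-ins for FERC 1 and EIA 923 outputs; both report net
+    # generation under the shared column name net_generation_mwh.
+    ferc1 = pd.DataFrame(
+        {"plant_id_pudl": [1, 2], "net_generation_mwh": [100.0, 250.0]}
+    )
+    eia923 = pd.DataFrame(
+        {"plant_id_pudl": [1, 2], "net_generation_mwh": [98.0, 251.0]}
+    )
+    compare = ferc1.merge(
+        eia923, on="plant_id_pudl", suffixes=("_ferc1", "_eia923")
+    )
+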
+* ``total`` should come at the beginning of the name (e.g.
+ ``total_expns_production``)
+* Identifiers should be structured ``type`` + ``_id_`` + ``source`` where
+ ``source`` is the agency or organization that has assigned the ID. (e.g.
+ ``plant_id_eia``)
+* The data source or label (e.g. ``plant_id_pudl``) should follow the thing it
+ is describing
+* Append units to field names where applicable (e.g.
+ ``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct``
+ for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when
+ the type of unit varies, as in columns containing a heterogeneous collection
+ of fuels)
+* Financial values are assumed to be in nominal US dollars (i.e., the suffix
+  ``_usd`` is implied). If they are not reported in USD, convert them to USD. If
+  they must be kept in their original units for some reason, append a suffix
+  that lets the user know they are not USD.
+* ``_id`` indicates the field contains a usually numerical reference to
+ another table, which will not be intelligible without looking up the value in
+ that other table.
+* The suffix ``_code`` indicates the field contains a short abbreviation from
+ a well defined list of values, that probably needs to be looked up if you
+ want to understand what it means.
+* The suffix ``_type`` (e.g. ``fuel_type``) indicates a human readable category
+ from a well defined list of values. Whenever possible we try to use these
+ longer descriptive names rather than codes.
+* ``_name`` indicates a longer human readable name, that is likely not well
+ categorized into a small set of acceptable values.
+* ``_date`` indicates the field contains a :class:`Date` object.
+* ``_datetime`` indicates the field contains a full :class:`Datetime` object.
+* ``_year`` indicates the field contains an :class:`integer` 4-digit year.
+* ``capacity`` refers to nameplate capacity (e.g. ``capacity_mw``); other
+ specific types of capacity are annotated.
+* Regardless of what label utilities are given in the original data source
+ (e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as
+ ``utilities`` in PUDL.
+
+Naming Conventions in Code
+--------------------------
In the PUDL codebase, we aspire to follow the naming and other conventions
detailed in :pep:`8`.
@@ -13,19 +196,14 @@ as we come across them again in maintaining the code.
(e.g. connect_db), unless the function returns a simple value (e.g. datadir).
* No duplication of information (e.g. form names).
* lowercase, underscores separate words (i.e. ``snake_case``).
-* Semi-private helper functions (functions used within a single module only
- and not exposed via the public API) should be preceded by an underscore.
+* Add a preceding underscore to semi-private helper functions (functions used
+ within a single module only and not exposed via the public API).
* When the object is a table, use the full table name (e.g. ingest_fuel_ferc1).
* When dataframe outputs are built from multiple tables, identify the type of
information being pulled (e.g. "plants") and the source of the tables (e.g.
``eia`` or ``ferc1``). When outputs are built from a single table, simply use
the table name (e.g. ``core_eia923__monthly_boiler_fuel``).
-.. _glossary:
-
-Glossary of Abbreviations
--------------------------
-
General Abbreviations
^^^^^^^^^^^^^^^^^^^^^
@@ -76,61 +254,9 @@ Abbreviation Definition
Data Extraction Functions
--------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^
The lower level namespace uses an imperative verb to identify the action the
function performs followed by the object of extraction (e.g.
``get_eia860_file``). The upper level namespace identifies the dataset where
extraction is occurring.
-
-Output Functions
------------------
-
-When dataframe outputs are built from multiple tables, identify the type of
-information being pulled (e.g. ``plants``) and the source of the tables (e.g.
-``eia`` or ``ferc1``). When outputs are built from a single table, simply use
-the table name (e.g. ``core_eia923__monthly_boiler_fuel``).
-
-Table Names
------------
-
-See `this article `__ on database naming conventions.
-
-* Table names in snake_case
-* The data source should follow the thing it applies to e.g. ``plant_id_ferc1``
-
-Columns and Field Names
------------------------
-
-* ``total`` should come at the beginning of the name (e.g.
- ``total_expns_production``)
-* Identifiers should be structured ``type`` + ``_id_`` + ``source`` where
- ``source`` is the agency or organization that has assigned the ID. (e.g.
- ``plant_id_eia``)
-* The data source or label (e.g. ``plant_id_pudl``) should follow the thing it
- is describing
-* Units should be appended to field names where applicable (e.g.
- ``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct``
- for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when
- the type of unit varies, as in columns containing a heterogeneous collection
- of fuels)
-* Financial values are assumed to be in nominal US dollars.
-* ``_id`` indicates the field contains a usually numerical reference to
- another table, which will not be intelligible without looking up the value in
- that other table.
-* The suffix ``_code`` indicates the field contains a short abbreviation from
- a well defined list of values, that probably needs to be looked up if you
- want to understand what it means.
-* The suffix ``_type`` (e.g. ``fuel_type``) indicates a human readable category
- from a well defined list of values. Whenever possible we try to use these
- longer descriptive names rather than codes.
-* ``_name`` indicates a longer human readable name, that is likely not well
- categorized into a small set of acceptable values.
-* ``_date`` indicates the field contains a :class:`Date` object.
-* ``_datetime`` indicates the field contains a full :class:`Datetime` object.
-* ``_year`` indicates the field contains an :class:`integer` 4-digit year.
-* ``capacity`` refers to nameplate capacity (e.g. ``capacity_mw``)-- other
- specific types of capacity are annotated.
-* Regardless of what label utilities are given in the original data source
- (e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as
- ``utilities`` in PUDL.
diff --git a/docs/index.rst b/docs/index.rst
index f7904d4a97..ff75d5f13f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -2,14 +2,218 @@
The Public Utility Data Liberation Project
===============================================================================
-.. include:: ../README.rst
- :start-after: readme-intro
+PUDL is a data processing pipeline created by `Catalyst Cooperative
+`__ that cleans, integrates, and standardizes some of the most
+widely used public energy datasets in the US. The data serve researchers, activists,
+journalists, and policy makers that might not have the technical expertise to access it
+in its raw form, the time to clean and prepare the data for bulk analysis, or the means
+to purchase it from existing commercial providers.
+
+---------------------------------------------------------------------------------------
+Available Data
+---------------------------------------------------------------------------------------
+
+We focus primarily on poorly curated data published by the US government in
+semi-structured but machine readable formats. For details on exactly what data is
+available from these data sources and what state it is in, see the individual
+pages for each source:
+
+* :doc:`data_sources/eia860`
+* :doc:`data_sources/eia861`
+* :doc:`data_sources/eia923`
+* :doc:`data_sources/epacems`
+* :doc:`data_sources/ferc1`
+* :doc:`data_sources/ferc714`
+
+PUDL's clean and complete versions of these data sources are stored in the
+``pudl.sqlite`` database. Larger datasets like EPA CEMS are stored in parquet files.
+To get started using PUDL data, visit our :doc:`data_access` page, or continue reading
+to learn more about the PUDL data processing pipeline.
+
+We also publish SQLite databases containing relatively pristine versions of our more
+difficult to parse inputs, especially the old Visual FoxPro (DBF, pre-2021) and new XBRL
+data (2021+) published by FERC:
+
+* `FERC Form 1 (DBF) `__
+* `FERC Form 1 (XBRL) `__
+* `FERC Form 2 (XBRL) `__
+* `FERC Form 6 (XBRL) `__
+* `FERC Form 60 (XBRL) `__
+* `FERC Form 714 (XBRL) `__
+
+.. _raw-data-archive:
+
+---------------------------------------------------------------------------------------
+Raw Data Archives
+---------------------------------------------------------------------------------------
+
+PUDL depends on "raw" data inputs from sources that are known to occasionally update
+their data or alter the published format. These changes may be incompatible with the way
+the data are read and interpreted by PUDL, so, to ensure the integrity of our data
+processing, we periodically create archives of `the raw inputs on Zenodo
+`__. Each of the data inputs may
+have several different versions archived, and all are assigned a unique DOI and made
+available through the REST API. Each release of the PUDL Python package is embedded
+with a set of DOIs to indicate which version of the raw inputs it is meant to
+process. This process helps ensure that our outputs are replicable.
+
+To enable programmatic access to individual partitions of the data (by year, state,
+etc.), we archive the raw inputs as `Frictionless Data Packages
+`__. The data packages contain both the
+raw data in their originally published format (CSVs, Excel spreadsheets, and Visual
+FoxPro database (DBF) files) and metadata that describes how each
+dataset is partitioned.
+
+The PUDL software will download a copy of the appropriate raw inputs automatically as
+needed and organize them in a local :doc:`datastore `.
+
+.. seealso::
+
+ The software that creates and archives the raw inputs can be found in our
+ `PUDL Archiver `__
+ repository on GitHub.
+
+.. _etl-process:
+
+---------------------------------------------------------------------------------------
+The ETL Process
+---------------------------------------------------------------------------------------
+
+PUDL's ETL produces a data warehouse that can be used for analytics.
+The processing happens within Dagster assets that are persisted to storage,
+typically pickle, Parquet, or SQLite files. The raw data moves through three
+layers of processing.
+
+Raw Layer
+^^^^^^^^^
+
+Assets in the Raw layer read the raw data from the original heterogeneous formats into
+a collection of :class:`pandas.DataFrame` with uniform column names across all years so
+that it can be easily processed in bulk. Data distributed as binary database files, such
+as the DBF files from FERC Form 1, may be converted into a unified SQLite database
+before individual dataframes are created. Raw data assets are not written to
+``pudl.sqlite``. Instead they are persisted to pickle files and not distributed
+to users.
+
+.. seealso::
+
+ Module documentation within the :mod:`pudl.extract` subpackage.
+
+Core Layer
+^^^^^^^^^^
+
+The Core layer contains well-modeled assets that serve as building blocks for
+downstream wide tables and analyses. Well-modeled means tables in the database
+have logical primary keys, foreign keys, and datatypes, and generally follow
+:ref:`Tidy Data standards `. The assets are loaded to a SQLite
+database or Parquet file.
+
+These outputs can be accessed via Python, R, and many other tools. See the
+:doc:`data_dictionaries/pudl_db` page for a list of the normalized database tables and
+their contents.
+
+Data processing in the Core layer is generally broken down into two phases. Phase one
+focuses on cleaning and organizing data within individual tables while phase two focuses
+on the integration and deduplication of data between tables. These tasks can be tedious
+`data wrangling toil `__ that impose a
+huge amount of overhead on anyone trying to do analysis based on the publicly
+available data. PUDL implements common data cleaning operations in the hopes that we
+can all work on more interesting problems most of the time. These operations include:
+
+* Standardization of units (e.g. dollars not thousands of dollars)
+* Standardization of N/A values
+* Standardization of freeform names and IDs
+* Use of controlled vocabularies for categorical values like fuel type
+* Use of more readable codes and column names
+* Imposition of well defined, rich data types for each column
+* Converting local timestamps to UTC
+* Reshaping of data into well normalized tables which minimize data duplication
+* Inferring Plant IDs which link records across many years of FERC Form 1 data
+* Inferring linkages between FERC and EIA Plants and Utilities.
+* Inferring more complete associations between EIA boilers and generators
+
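+As one concrete example of the operations above, converting local timestamps to UTC
+looks roughly like this (the timezone and timestamps are made up):
+
+.. code-block:: python
+
+    import pandas as pd
+
+    ts = pd.Series(pd.to_datetime(["2022-07-01 13:00", "2022-07-01 14:00"]))
+    # Localize the naive timestamps to the plant's reporting timezone,
+    # then convert them to UTC.
+    ts_utc = ts.dt.tz_localize("America/New_York").dt.tz_convert("UTC")
+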
+.. seealso::
+
+ The module and per-table transform functions in the :mod:`pudl.transform`
+ sub-package have more details on the specific transformations applied to each
+ table.
+
+Many of the original datasets contain large amounts of duplicated data. For instance,
+the EIA reports the name of each power plant in every table that refers to otherwise
+unique plant-related data. Similarly, many attributes like plant latitude and
+longitude are reported separately every year. Often, these reported values are not
+self-consistent. There may be several different spellings of a plant's name, or an
+incorrectly reported latitude in one year.
+
+Assets in the Core layer attempt to eliminate this kind of inconsistent and duplicate
+information when normalizing the tables by choosing only the most consistently reported
+value for inclusion in the final database. If a value which should be static is not
+consistently reported, it may also be set to N/A.
+
+Output Layer
+^^^^^^^^^^^^
+
+Assets in the Core layer normalize the data to make storage more efficient and avoid
+data integrity issues, but you may want to combine information from more than one of
+the tables to make the data more readable and readily interpretable. For example, PUDL
+stores the name that EIA uses to refer to a power plant in the
+:ref:`core_eia__entity_plants` table in association with the plant's unique numeric ID.
+If you are working with data from the :ref:`core_eia923__monthly_fuel_receipts_costs`
+table, which records monthly per-plant fuel deliveries, you may want to have the name
+of the plant alongside the fuel delivery information since it's more recognizable than
+the plant ID.
+
+Rather than requiring everyone to write their own SQL ``SELECT`` and ``JOIN`` statements
+or do a bunch of :func:`pandas.merge` operations to bring together data, PUDL provides a
+variety of output tables that contain all of the useful information in one place. In
+some cases, like with EIA, the output tables are composed to closely resemble the raw
+spreadsheet tables you're familiar with.
+
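+For the plant-name example above, the join you would otherwise write yourself looks
+roughly like this (a sketch; treat the column names as illustrative and check the
+data dictionary for the authoritative schema):
+
+.. code-block:: python
+
+    import sqlite3
+
+    import pandas as pd
+
+    conn = sqlite3.connect("pudl.sqlite")
+    frc = pd.read_sql(
+        "SELECT * FROM core_eia923__monthly_fuel_receipts_costs", conn
+    )
+    plants = pd.read_sql(
+        "SELECT plant_id_eia, plant_name_eia FROM core_eia__entity_plants", conn
+    )
+    # Attach the recognizable plant name to each fuel delivery record.
+    frc_named = frc.merge(plants, on="plant_id_eia", how="left")
+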
+The Output layer also contains tables produced by analytical routines for
+calculating derived values like the heat rate by generation unit (:meth:`hr_by_unit
+`) or the capacity factor by generator
+(:meth:`capacity_factor `). We intend to
+integrate more analytical outputs into the library over time.
+
+.. seealso::
+
+ * `The PUDL Examples GitHub repo `__
+ to see how to access the PUDL Database directly, use the output functions, or
+ work with the EPA CEMS data using Dask.
+ * `How to Learn Dask in 2021 `__
+ is a great collection of self-guided resources if you are already familiar with
+ Python, Pandas, and NumPy.
+
+.. _test-and-validate:
+
+---------------------------------------------------------------------------------------
+Data Validation
+---------------------------------------------------------------------------------------
+We have a growing collection of data validation test cases that we run before
+publishing a data release to avoid shipping data with known issues. Most of
+these validations are described in the :mod:`pudl.validate` module. They check things
+like:
+
+* The heat content of various fuel types are within expected bounds.
+* Coal ash, moisture, mercury, sulfur, etc. content are within expected bounds.
+* Generator heat rates and capacity factors are realistic for the type of prime mover
+ being reported.
+
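+A bounds check in that spirit might look like the following sketch (the column
+names and bounds are illustrative rather than the ones used in
+:mod:`pudl.validate`):
+
+.. code-block:: python
+
+    import pandas as pd
+
+    def check_coal_heat_content(frc: pd.DataFrame) -> None:
+        """Assert that coal heat content falls within a plausible range."""
+        coal = frc.loc[
+            frc["fuel_type_code_pudl"] == "coal", "fuel_mmbtu_per_unit"
+        ]
+        assert coal.between(6.5, 30.0).all(), "coal heat content out of bounds"
+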
+Some data validations are currently only specified within our test suite, including:
+
+* The expected number of records within each table
+* The fact that there are no entirely N/A columns
+
+A variety of database integrity checks are also run either during the data processing
+or when the data is loaded into SQLite.
+
+See our :doc:`dev/testing` documentation for more information.
+
.. toctree::
:hidden:
:maxdepth: 2
- intro
data_access
data_sources/index
data_dictionaries/index
diff --git a/docs/intro.rst b/docs/intro.rst
deleted file mode 100644
index 94f75a592b..0000000000
--- a/docs/intro.rst
+++ /dev/null
@@ -1,234 +0,0 @@
-=======================================================================================
-Introduction
-=======================================================================================
-
-PUDL is a data processing pipeline created by `Catalyst Cooperative
-`__ that cleans, integrates, and standardizes some of the most
-widely used public energy datasets in the US. The data serve researchers, activists,
-journalists, and policy makers that might not have the technical expertise to access it
-in its raw form, the time to clean and prepare the data for bulk analysis, or the means
-to purchase it from existing commercial providers.
-
----------------------------------------------------------------------------------------
-Available Data
----------------------------------------------------------------------------------------
-
-We focus primarily on poorly curated data published by the US government in
-semi-structured but machine readable formats. For details on exactly what data is
-available from these data sources and what state it is in, see the the individual
-pages for each source:
-
-* :doc:`data_sources/eia860`
-* :doc:`data_sources/eia861`
-* :doc:`data_sources/eia923`
-* :doc:`data_sources/epacems`
-* :doc:`data_sources/ferc1`
-* :doc:`data_sources/ferc714`
-
-We also publish SQLite databases containing relatively pristine versions of our more
-difficult to parse inputs, especially the old Visual FoxPro (DBF, pre-2021) and new XBRL
-data (2021+) published by FERC:
-* `FERC Form 1 (DBF) `__
-* `FERC Form 1 (XBRL) `__
-* `FERC Form 2 (XBRL) `__
-* `FERC Form 6 (XBRL) `__
-* `FERC Form 60 (XBRL) `__
-* `FERC Form 714 (XBRL) `__
-
-To get started using PUDL data, visit our :doc:`data_access` page, or continue reading
-to learn more about the PUDL data processing pipeline.
-
-.. _raw-data-archive:
-
----------------------------------------------------------------------------------------
-Raw Data Archives
----------------------------------------------------------------------------------------
-
-PUDL depends on "raw" data inputs from sources that are known to occasionally update
-their data or alter the published format. These changes may be incompatible with the way
-the data are read and interpreted by PUDL, so, to ensure the integrity of our data
-processing, we periodically create archives of `the raw inputs on Zenodo
-`__. Each of the data inputs may
-have several different versions archived, and all are assigned a unique DOI and made
-available through the REST API. Each release of the PUDL Python package is embedded
-with a set of of DOIs to indicate which version of the raw inputs it is meant to
-process. This process helps ensure that our outputs are replicable.
-
-To enable programmatic access to individual partitions of the data (by year, state,
-etc.), we archive the raw inputs as `Frictionless Data Packages
-`__. The data packages contain both the
-raw data in their originally published format (CSVs, Excel spreadsheets, and Visual
-FoxPro database (DBF) files) and metadata that describes how each the
-dataset is partitioned.
-
-The PUDL software will download a copy of the appropriate raw inputs automatically as
-needed and organize them in a local :doc:`datastore `.
-
-.. seealso::
-
- The software that creates and archives the raw inputs can be found in our
- `PUDL Archiver `__
- repository on GitHub.
-
-.. _etl-process:
-
----------------------------------------------------------------------------------------
-The ETL Process
----------------------------------------------------------------------------------------
-
-The core of PUDL's work takes place in the ETL (Extract, Transform, and Load)
-process.
-
-Extract
-^^^^^^^
-
-The Extract step reads the raw data from the original heterogeneous formats into a
-collection of :class:`pandas.DataFrame` with uniform column names across all years so
-that it can be easily processed in bulk. Data distributed as binary database files, such
-as the DBF files from FERC Form 1, may be converted into a unified SQLite database
-before individual dataframes are created.
-
-.. seealso::
-
- Module documentation within the :mod:`pudl.extract` subpackage.
-
-Transform
-^^^^^^^^^
-
-The Transform step is generally broken down into two phases. Phase one focuses on
-cleaning and organizing data within individual tables while phase two focuses on the
-integration and deduplication of data between tables. These tasks can be tedious
-`data wrangling toil `__ that impose a
-huge amount of overhead on anyone trying to do analysis based on the publicly
-available data. PUDL implements common data cleaning operations in the hopes that we
-can all work on more interesting problems most of the time. These operations include:
-
-* Standardization of units (e.g. dollars not thousands of dollars)
-* Standardization of N/A values
-* Standardization of freeform names and IDs
-* Use of controlled vocabularies for categorical values like fuel type
-* Use of more readable codes and column names
-* Imposition of well defined, rich data types for each column
-* Converting local timestamps to UTC
-* Reshaping of data into well normalized tables which minimize data duplication
-* Inferring Plant IDs which link records across many years of FERC Form 1 data
-* Inferring linkages between FERC and EIA Plants and Utilities.
-* Inferring more complete associations between EIA boilers and generators
-
-.. seealso::
-
- The module and per-table transform functions in the :mod:`pudl.transform`
- sub-package have more details on the specific transformations applied to each
- table.
-
-Many of the original datasets contain large amounts of duplicated data. For instance,
-the EIA reports the name of each power plant in every table that refers to otherwise
-unique plant-related data. Similarly, many attributes like plant latitude and
-longitude are reported separately every year. Often, these reported values are not
-self-consistent. There may be several different spellings of a plant's name, or an
-incorrectly reported latitude in one year.
-
-The transform step attempts to eliminate this kind of inconsistent and duplicate
-information when normalizing the tables by choosing only the most consistently reported
-value for inclusion in the final database. If a value which should be static is not
-consistently reported, it may also be set to N/A.
-
-.. seealso::
-
- * `Tidy Data `__ by Hadley
- Wickham, Journal of Statistical Software (2014).
- * `A Simple Guide to the Five Normal Forms in Relational Database Theory `__
- by William Kent, Communications of the ACM (1983).
-
-Load
-^^^^
-
-At the end of the Transform step, we have collections of :class:`pandas.DataFrame` that
-correspond to database tables. These are loaded into a SQLite database.
-To handle the ~1 billion row :doc:`data_sources/epacems`, we load the dataframes into
-an Apache Parquet dataset that is partitioned by state and year.
-
-These outputs can be accessed via Python, R, and many other tools. See the
-:doc:`data_dictionaries/pudl_db` page for a list of the normalized database tables and
-their contents.
-
-.. seealso::
-
- Module documentation within the :mod:`pudl.load` sub-package.
-
-.. _db-and-outputs:
-
----------------------------------------------------------------------------------------
-Output Tables
----------------------------------------------------------------------------------------
-
-Denormalized Outputs
-^^^^^^^^^^^^^^^^^^^^
-
-We normalize the data to make storage more efficient and avoid data integrity issues,
-but you may want to combine information from more than one of the tables to make the
-data more readable and readily interpretable. For example, PUDL stores the name that EIA
-uses to refer to a power plant in the :ref:`core_eia__entity_plants` table in
-association with the plant's unique numeric ID. If you are working with data from the
-:ref:`core_eia923__monthly_fuel_receipts_costs` table, which records monthly per-plant
-fuel deliveries, you may want to have the name of the plant alongside the fuel delivery
-information since it's more recognizable than the plant ID.
-
-Rather than requiring everyone to write their own SQL ``SELECT`` and ``JOIN`` statements
-or do a bunch of :func:`pandas.merge` operations to bring together data, PUDL provides a
-variety of predefined queries as methods of the :class:`pudl.output.pudltabl.PudlTabl`
-class. These methods perform common joins to return output tables (pandas DataFrames)
-that contain all of the useful information in one place. In some cases, like with EIA,
-the output tables are composed to closely resemble the raw spreadsheet tables you're
-familiar with.
-
-.. note::
-
- In the future, we intend to replace the simple denormalized output tables with
- database views that are integrated into the distributed SQLite database directly.
- This will provide the same convenience without requiring use of the Python software
- layer.
-
-Analysis Outputs
-^^^^^^^^^^^^^^^^
-
-There are several analytical routines built into the
-:mod:`pudl.output.pudltabl.PudlTabl` output objects for calculating derived values
-like the heat rate by generation unit (:meth:`hr_by_unit
-`) or the capacity factor by generator
-(:meth:`capacity_factor `). We intend to
-integrate more analytical outputs into the library over time.
-
-.. seealso::
-
- * `The PUDL Examples GitHub repo `__
- to see how to access the PUDL Database directly, use the output functions, or
- work with the EPA CEMS data using Dask.
- * `How to Learn Dask in 2021 `__
- is a great collection of self-guided resources if you are already familiar with
- Python, Pandas, and NumPy.
-
-.. _test-and-validate:
-
----------------------------------------------------------------------------------------
-Data Validation
----------------------------------------------------------------------------------------
-We have a growing collection of data validation test cases that we run before
-publishing a data release to try and avoid publishing data with known issues. Most of
-these validations are described in the :mod:`pudl.validate` module. They check things
-like:
-
-* The heat content of various fuel types are within expected bounds.
-* Coal ash, moisture, mercury, sulfur etc. content are within expected bounds
-* Generator heat rates and capacity factors are realistic for the type of prime mover
- being reported.
-
-Some data validations are currently only specified within our test suite, including:
-
-* The expected number of records within each table
-* The fact that there are no entirely N/A columns
-
-A variety of database integrity checks are also run either during the ETL process or
-when the data is loaded into SQLite.
-
-See our :doc:`dev/testing` documentation for more information.
diff --git a/docs/release_notes.rst b/docs/release_notes.rst
index 9b477f8fa6..73c2fd4d92 100644
--- a/docs/release_notes.rst
+++ b/docs/release_notes.rst
@@ -67,6 +67,22 @@ Dagster Adoption
* :mod:`pudl.convert.censusdp1tract_to_sqlite` and :mod:`pudl.output.censusdp1tract`
are now integrated into dagster. See :issue:`1973` and :pr:`2621`.
+New Asset Naming Convention
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+There are hundreds of new tables in ``pudl.sqlite`` now that the methods in ``PudlTabl``
+have been converted to Dagster assets. This significant increase in tables and diversity
+of table types prompted us to create a new naming convention to make the table names
+more descriptive and organized. You can read about the new naming convention in the
+:ref:`docs `.
+
+To help users migrate away from using ``PudlTabl`` and our temporary table names,
+we've created a `Google Sheet `__
+that maps the old table names and ``PudlTabl`` methods to the new table names.
+
+We've added deprecation warnings to the ``PudlTabl`` class. We plan to remove
+``PudlTabl`` from the ``pudl`` package once our known users have
+successfully migrated to pulling data directly from ``pudl.sqlite``.
+
Data Coverage
^^^^^^^^^^^^^
@@ -296,7 +312,7 @@ Deprecations
* Replace references to deprecated ``pudl-scrapers`` and
``pudl-zenodo-datastore`` repositories with references to `pudl-archiver
`__ repository in
- :doc:`intro`, :doc:`dev/datastore`, and :doc:`dev/annual_updates`. See :pr:`2190`.
+ ``intro``, :doc:`dev/datastore`, and :doc:`dev/annual_updates`. See :pr:`2190`.
* :mod:`pudl.etl` is now a subpackage that collects all pudl assets into a dagster
`Definition `__. All
``pudl.etl._etl_{datasource}`` functions have been deprecated. The coordination
diff --git a/src/pudl/metadata/templates/datasette-metadata.yml.jinja b/src/pudl/metadata/templates/datasette-metadata.yml.jinja
index 8872e6ff7b..2dd8f3c649 100644
--- a/src/pudl/metadata/templates/datasette-metadata.yml.jinja
+++ b/src/pudl/metadata/templates/datasette-metadata.yml.jinja
@@ -42,12 +42,15 @@ databases:
Catalyst Cooperative as part of the
Public Utility
Data Liberation Project.
- Caution:
+ Note:
- - Please note that tables beginning with "denorm_" are temporary tables whose
- names and metadata will shortly change, as we migrate new tables into our database.
- - The structure of the data and the API are not necessarily stable, so don't
- build any critical infrastructure on top of this just yet.
+ - We recommend working
+ with the tables prefixed with ``out_``, as these contain the most complete
+ data.
+
+ - For more information about the different types of tables, read through
+ PUDL's naming conventions
+
- If you find something wrong, please
make an issue
on GitHub to let us know.
diff --git a/src/pudl/output/pudltabl.py b/src/pudl/output/pudltabl.py
index 08f03a9137..ede31c3f00 100644
--- a/src/pudl/output/pudltabl.py
+++ b/src/pudl/output/pudltabl.py
@@ -89,6 +89,11 @@ def __init__(
unit_ids: If True, use several heuristics to assign
individual generators to functional units. EXPERIMENTAL.
"""
+ logger.warning(
+ "PudlTabl is deprecated and will be removed from the pudl package "
+ "once known users have migrated to accessing the data directly from "
+            "pudl.sqlite."
+ )
if not isinstance(pudl_engine, sa.engine.base.Engine):
raise TypeError(
"PudlTabl needs pudl_engine to be a SQLAlchemy Engine, but we "
@@ -296,6 +301,12 @@ def _get_table_from_db(
"It is retained for backwards compatibility only."
)
table_name = self._agg_table_name(table_name)
+ logger.warning(
+ "PudlTabl is deprecated and will be removed from the pudl package "
+ "once known users have migrated to accessing the data directly from "
+ "pudl.sqlite. To access the data returned by this method, "
+ f"use the {table_name} table in the pudl.sqlite database."
+ )
resource = Resource.from_id(table_name)
return pd.concat(
[