From 77a16f54fb37b05d8e67cea489766a469c2e3a95 Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Wed, 20 Sep 2023 11:03:21 -0400
Subject: [PATCH 01/10] Update contributor facing documentation with new asset
 naming conventions

---
 docs/dev/data_guidelines.rst    |  49 +-------
 docs/dev/naming_conventions.rst | 200 +++++++++++++++++++++++---------
 2 files changed, 144 insertions(+), 105 deletions(-)

diff --git a/docs/dev/data_guidelines.rst b/docs/dev/data_guidelines.rst
index 53bcb1c18a..dfbd40c0db 100644
--- a/docs/dev/data_guidelines.rst
+++ b/docs/dev/data_guidelines.rst
@@ -64,6 +64,8 @@ Examples of Unacceptable Changes
 fuel heat content and net electricity generation. The heat rate would be a
 derived value and not part of the original data.

+.. _tidy-data:
+
 -------------------------------------------------------------------------------
 Make Tidy Data
 -------------------------------------------------------------------------------
@@ -117,24 +119,6 @@ that M/Mega is a million in SI.
 and a `BTU `__ is the energy required to raise the temperature of one
 *avoirdupois pound* of water by 1 degree *Fahrenheit*?! What century even is
 this?).

--------------------------------------------------------------------------------
-Silo the ETL Process
--------------------------------------------------------------------------------
-It should be possible to run the ETL process on each data source independently
-and with any combination of data sources included. This allows users to include
-only the data they need. In some cases, like the :doc:`EIA 860
-<../data_sources/eia860>` and :doc:`EIA 923 <../data_sources/eia923>` data, two
-data sources may be so intertwined that keeping them separate doesn't really
-make sense. This should be the exception, however, not the rule.
-
--------------------------------------------------------------------------------
-Separate Data from Glue
--------------------------------------------------------------------------------
-The glue that relates different data sources to each other should be applied
-after or alongside the ETL process and not as a mandatory part of ETL. This
-makes it easy to pull individual data sources in and work with them even when
-the glue isn't working or doesn't yet exist.
-
 -------------------------------------------------------------------------------
 Partition Big Data
 -------------------------------------------------------------------------------
@@ -146,35 +130,6 @@ them to pull in only certain years, certain states, or other sensible
 partitions of data so that they don’t run out of memory or disk space or have
 to wait hours while data they don't need is being processed.

--------------------------------------------------------------------------------
-Naming Conventions
--------------------------------------------------------------------------------
-   *There are only two hard problems in computer science: caching,
-   naming things, and off-by-one errors.*
-
-Use Consistent Names
-^^^^^^^^^^^^^^^^^^^^
-If two columns in different tables record the same quantity in the same units,
-give them the same name. That way if they end up in the same dataframe for
-comparison it's easy to automatically rename them with suffixes indicating
-where they came from. For example, net electricity generation is reported to
-both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
-<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
-each of those data sources.
Similarly, give non-comparable quantities reported
-in different data sources **different** column names. This helps make it clear
-that the quantities are actually different.
-
-Follow Existing Conventions
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-We are trying to use consistent naming conventions for the data tables,
-columns, data sources, and functions. Generally speaking PUDL is a collection
-of subpackages organized by purpose (extract, transform, load, analysis,
-output, datastore…), containing a module for each data source. Each data source
-has a short name that is used everywhere throughout the project and is composed of
-the reporting agency and the form number or another identifying abbreviation:
-``ferc1``, ``epacems``, ``eia923``, ``eia861``, etc. See the :doc:`naming
-conventions <naming_conventions>` document for more details.
-
 -------------------------------------------------------------------------------
 Complete, Continuous Time Series
 -------------------------------------------------------------------------------
diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 8648e573b4..68526e31b7 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -1,6 +1,147 @@
 ===============================================================================
 Naming Conventions
 ===============================================================================
+   *There are only two hard problems in computer science: caching,
+   naming things, and off-by-one errors.*
+
+We try to use consistent naming conventions for the data tables, data assets,
+columns, data sources, and functions.
+
+
+Asset Naming Conventions
+---------------------------------------------------
+
+PUDL's data processing is divided into three layers of dagster assets: Raw, Core
+and Output. Asset names should generally follow this naming convention:
+
+.. code-block::
+
+    {layer}_{source}__{asset_type}_{asset_name}
+
+* ``layer`` is the processing layer of the asset. Acceptable values are:
+  ``raw``, ``core`` and ``out``. ``layer`` is required for all assets in all layers.
+* ``source`` is an abbreviation of the original source of the data. For example,
+  ``eia860``, ``ferc1`` and ``epacems``.
+* ``asset_type`` describes how the asset is modeled.
+* ``asset_name`` should describe the entity, categorical code type, or measurement of
+  the asset.
+
+Raw layer
+^^^^^^^^^
+* This layer contains assets that extract data from spreadsheets and databases
+  and are persisted as pickle files.
+* Naming convention: ``raw_{source}__{asset_name}``
+* ``asset_name`` is typically copied from the source data.
+* ``asset_type`` is not included in this layer because the data modeling does not
+  yet conform to PUDL standards. Raw assets are typically just copies of the
+  source data.
+
+Core layer
+^^^^^^^^^^
+* This layer contains well-modeled assets that serve as building blocks for downstream
+  wide tables and analyses. Well-modeled means tables in the database have logical
+  primary keys, foreign keys, datatypes and generally follow
+  :ref:`Tidy Data standards <tidy-data>`.
+  These assets are typically stored in parquet files or tables in a database.
+* Naming convention: ``core_{source}__{asset_type}_{asset_name}``
+* ``asset_type`` describes how the asset is modeled and its role in PUDL’s
+  collection of core assets. There are a handful of table types in this layer:
+
+  * ``assn``: Association tables provide connections between entities. This data
+    can be manually compiled or extracted from data sources.
Examples: + ``core_pudl__assn_plants_eia``, ``core_eia861__assn_utility``. + * ``codes``: Code tables contain more verbose descriptions of categorical codes + typically manually compiled from source data dictionaries. Examples: + ``core_eia__codes_averaging_periods``, ``core_eia__codes_balancing_authorities`` + * ``entity``: Entity tables contain static information about entities. For example, + the state a plant is located in, or the plant a boiler is a part of. Examples: + ``core_eia__entity_boilers``, ``core_eia923__entity_coalmine``. + * ``scd``: Slowly changing dimension tables describe attributes of entities that + rarely change. For example, the ownership or the capacity of a plant. Examples: + ``core_eia860__scd_generators``, ``core_eia860__scd_plants``. + * ``yearly/monthly/hourly``: Time series tables contain attributes about entities + that are expected to change for each reported timestamp. Time series tables + typically contain measurements of processes like net generation or co2 emissions. + Examples: ``core_ferc714__hourly_demand_pa``, + ``core_ferc1__yearly_plant_in_service``. + +Output layer +^^^^^^^^^^^^ +* This layer uses assets in the Core layer to construct wide and complete tables + suitable for users to perform analysis on. This layer can contain intermediate + tables that bridge the core and user-facing tables. +* Naming convention: ``out_{source}__{asset_type}_{asset_name}`` +* ``source`` is optional in this layer because there can be assets that join data from + multiple sources. +* ``asset_type`` is also optional. It will likely describe the frequency at which + the data is reported (annual/monthly/hourly). + +Intermediate Assets +^^^^^^^^^^^^^^^^^^^ +* Intermediate assets are logical steps towards a final well-modeled core asset or + user-facing output asset. These assets are not intended to be persisted in the + database or accessible to the user. These assets are denoted by a preceding + underscore, like a private python method. For example, the intermediate asset + ``_core_eia860__plants`` is a logical step towards the + ``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets. +* The number of intermediate assets should be limited to avoid an extremely + cluttered DAG. It is appropriate to create an intermediate asset when: + + * there is a short and long running portion of a process. It is convenient to separate + the long and short-running processing portions into separate assets so debugging the + short-running process doesn’t take forever. + * there is a logical step in a process that is frequently inspected for debugging. For + example, the pre harvest assets in the ``_core_eia860`` and ``_core_eia923`` groups + are frequently inspected when new years of data are added. + + +Columns and Field Names +^^^^^^^^^^^^^^^^^^^^^^^ +If two columns in different tables record the same quantity in the same units, +give them the same name. That way if they end up in the same dataframe for +comparison it's easy to automatically rename them with suffixes indicating +where they came from. For example, net electricity generation is reported to +both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923 +<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in +each of those data sources. Similarly, give non-comparable quantities reported +in different data sources **different** column names. This helps make it clear +that the quantities are actually different. + +* ``total`` should come at the beginning of the name (e.g. 
+ ``total_expns_production``) +* Identifiers should be structured ``type`` + ``_id_`` + ``source`` where + ``source`` is the agency or organization that has assigned the ID. (e.g. + ``plant_id_eia``) +* The data source or label (e.g. ``plant_id_pudl``) should follow the thing it + is describing +* Units should be appended to field names where applicable (e.g. + ``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct`` + for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when + the type of unit varies, as in columns containing a heterogeneous collection + of fuels) +* Financial values are assumed to be in nominal US dollars. +* ``_id`` indicates the field contains a usually numerical reference to + another table, which will not be intelligible without looking up the value in + that other table. +* The suffix ``_code`` indicates the field contains a short abbreviation from + a well defined list of values, that probably needs to be looked up if you + want to understand what it means. +* The suffix ``_type`` (e.g. ``fuel_type``) indicates a human readable category + from a well defined list of values. Whenever possible we try to use these + longer descriptive names rather than codes. +* ``_name`` indicates a longer human readable name, that is likely not well + categorized into a small set of acceptable values. +* ``_date`` indicates the field contains a :class:`Date` object. +* ``_datetime`` indicates the field contains a full :class:`Datetime` object. +* ``_year`` indicates the field contains an :class:`integer` 4-digit year. +* ``capacity`` refers to nameplate capacity (e.g. ``capacity_mw``)-- other + specific types of capacity are annotated. +* Regardless of what label utilities are given in the original data source + (e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as + ``utilities`` in PUDL. + +Naming Conventions in Code +-------------------------- In the PUDL codebase, we aspire to follow the naming and other conventions detailed in :pep:`8`. @@ -21,11 +162,6 @@ as we come across them again in maintaining the code. ``eia`` or ``ferc1``). When outputs are built from a single table, simply use the table name (e.g. ``core_eia923__monthly_boiler_fuel``). -.. _glossary: - -Glossary of Abbreviations -------------------------- - General Abbreviations ^^^^^^^^^^^^^^^^^^^^^ @@ -76,61 +212,9 @@ Abbreviation Definition Data Extraction Functions -------------------------- +^^^^^^^^^^^^^^^^^^^^^^^^^ The lower level namespace uses an imperative verb to identify the action the function performs followed by the object of extraction (e.g. ``get_eia860_file``). The upper level namespace identifies the dataset where extraction is occurring. - -Output Functions ------------------ - -When dataframe outputs are built from multiple tables, identify the type of -information being pulled (e.g. ``plants``) and the source of the tables (e.g. -``eia`` or ``ferc1``). When outputs are built from a single table, simply use -the table name (e.g. ``core_eia923__monthly_boiler_fuel``). - -Table Names ------------ - -See `this article `__ on database naming conventions. - -* Table names in snake_case -* The data source should follow the thing it applies to e.g. ``plant_id_ferc1`` - -Columns and Field Names ------------------------ - -* ``total`` should come at the beginning of the name (e.g. - ``total_expns_production``) -* Identifiers should be structured ``type`` + ``_id_`` + ``source`` where - ``source`` is the agency or organization that has assigned the ID. 
(e.g.
-  ``plant_id_eia``)
-* The data source or label (e.g. ``plant_id_pudl``) should follow the thing it
-  is describing
-* Units should be appended to field names where applicable (e.g.
-  ``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct``
-  for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when
-  the type of unit varies, as in columns containing a heterogeneous collection
-  of fuels)
-* Financial values are assumed to be in nominal US dollars.
-* ``_id`` indicates the field contains a usually numerical reference to
-  another table, which will not be intelligible without looking up the value in
-  that other table.
-* The suffix ``_code`` indicates the field contains a short abbreviation from
-  a well defined list of values, that probably needs to be looked up if you
-  want to understand what it means.
-* The suffix ``_type`` (e.g. ``fuel_type``) indicates a human readable category
-  from a well defined list of values. Whenever possible we try to use these
-  longer descriptive names rather than codes.
-* ``_name`` indicates a longer human readable name, that is likely not well
-  categorized into a small set of acceptable values.
-* ``_date`` indicates the field contains a :class:`Date` object.
-* ``_datetime`` indicates the field contains a full :class:`Datetime` object.
-* ``_year`` indicates the field contains an :class:`integer` 4-digit year.
-* ``capacity`` refers to nameplate capacity (e.g. ``capacity_mw``)-- other
-  specific types of capacity are annotated.
-* Regardless of what label utilities are given in the original data source
-  (e.g. ``operator`` in EIA or ``respondent`` in FERC) we refer to them as
-  ``utilities`` in PUDL.

From 4d5b57d662346e4b4f7d652f59350491e7250a95 Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Wed, 20 Sep 2023 15:30:58 -0400
Subject: [PATCH 02/10] Add new naming convention to user facing documentation

---
 docs/data_access.rst                       |   6 +
 docs/dev/naming_conventions.rst            |   1 +
 docs/intro.rst                             | 112 +++++++-----------
 .../templates/datasette-metadata.yml.jinja |  11 +-
 4 files changed, 55 insertions(+), 75 deletions(-)

diff --git a/docs/data_access.rst b/docs/data_access.rst
index 869f84ceef..ca169afa23 100644
--- a/docs/data_access.rst
+++ b/docs/data_access.rst
@@ -8,6 +8,12 @@ PUDL data, so if you have a suggestion please `open a GitHub issue
 `__. If you have a question you can `create a GitHub discussion
 `__.

+PUDL's primary data output is the ``pudl.sqlite`` database. It contains a collection
+of tables that follow :ref:`PUDL's asset naming convention <asset-naming>`. Tables
+with the ``core_`` prefix are normalized tables that serve as building blocks for the
+more denormalized and easy to work with ``out_`` tables. **We recommend only working
+with ``out_`` tables.**
+
 .. _access-modes:

 ---------------------------------------------------------------------------------------
diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 68526e31b7..9d1e521b77 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -7,6 +7,7 @@ Naming Conventions
 We try to use consistent naming conventions for the data tables, data assets,
 columns, data sources, and functions.

+.. _asset-naming:

 Asset Naming Conventions
 ---------------------------------------------------
diff --git a/docs/intro.rst b/docs/intro.rst
index 94f75a592b..c65642f43c 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -73,31 +73,44 @@ needed and organize them in a local :doc:`datastore `.

..
_etl-process: --------------------------------------------------------------------------------------- -The ETL Process +The Data Warehouse Design --------------------------------------------------------------------------------------- -The core of PUDL's work takes place in the ETL (Extract, Transform, and Load) -process. +PUDL's data processing produces a data warehouse that can be used for analytics. +The processing happens within Dagster assets that are persisted to storage, +typically pickle, parquet or SQLite files. The raw data moves through three +layers of the data warehouse. -Extract -^^^^^^^ +Raw Layer +^^^^^^^^^ -The Extract step reads the raw data from the original heterogeneous formats into a -collection of :class:`pandas.DataFrame` with uniform column names across all years so +Assets in the Raw layer read the raw data from the original heterogeneous formats into +a collection of :class:`pandas.DataFrame` with uniform column names across all years so that it can be easily processed in bulk. Data distributed as binary database files, such as the DBF files from FERC Form 1, may be converted into a unified SQLite database -before individual dataframes are created. +before individual dataframes are created. Raw data assets are typically persisted to +pickle files and are not distributed to users. .. seealso:: Module documentation within the :mod:`pudl.extract` subpackage. -Transform -^^^^^^^^^ +Core Layer +^^^^^^^^^^ + +The Core layer contains well-modeled assets that serve as building blocks for +downstream wide tables and analyses. Well-modeled means tables in the database +have logical primary keys, foreign keys, datatypes and generally follow +:ref:`Tidy Data standards `. The assets are loaded to a SQLite +database or Parquet file. + +These outputs can be accessed via Python, R, and many other tools. See the +:doc:`data_dictionaries/pudl_db` page for a list of the normalized database tables and +their contents. -The Transform step is generally broken down into two phases. Phase one focuses on -cleaning and organizing data within individual tables while phase two focuses on the -integration and deduplication of data between tables. These tasks can be tedious +Data processing in the Core layer is generally broken down into two phases. Phase one +focuses on cleaning and organizing data within individual tables while phase two focuses +on the integration and deduplication of data between tables. These tasks can be tedious `data wrangling toil `__ that impose a huge amount of overhead on anyone trying to do analysis based on the publicly available data. PUDL implements common data cleaning operations in the hopes that we @@ -128,73 +141,32 @@ longitude are reported separately every year. Often, these reported values are n self-consistent. There may be several different spellings of a plant's name, or an incorrectly reported latitude in one year. -The transform step attempts to eliminate this kind of inconsistent and duplicate +Assets in the Core layer attempt to eliminate this kind of inconsistent and duplicate information when normalizing the tables by choosing only the most consistently reported value for inclusion in the final database. If a value which should be static is not consistently reported, it may also be set to N/A. -.. seealso:: - - * `Tidy Data `__ by Hadley - Wickham, Journal of Statistical Software (2014). - * `A Simple Guide to the Five Normal Forms in Relational Database Theory `__ - by William Kent, Communications of the ACM (1983). 
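The cleaning steps listed above are all ordinary dataframe transformations. As a
minimal sketch of what a few of them look like in practice (the column names,
category mapping, and unit conversion below are illustrative stand-ins, not PUDL's
actual transform code):

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical raw input: inconsistent N/A markers, freeform fuel labels,
    # and costs reported in thousands of dollars.
    raw = pd.DataFrame(
        {
            "fuel_type": ["Bit Coal", "NG", "gas", "N.A."],
            "fuel_cost_thousands": ["1,500", "200", "350", ""],
        }
    )

    # Standardize N/A values so every missing value looks the same.
    clean = raw.replace(["N.A.", ""], np.nan)

    # Map freeform labels onto a controlled vocabulary of fuel types.
    fuel_map = {"Bit Coal": "coal", "NG": "gas", "gas": "gas"}
    clean["fuel_type"] = clean["fuel_type"].map(fuel_map).astype("category")

    # Standardize units: report dollars, not thousands of dollars.
    clean["fuel_cost_usd"] = (
        clean["fuel_cost_thousands"].str.replace(",", "").astype("float64") * 1000
    )
    clean = clean.drop(columns="fuel_cost_thousands")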
- -Load -^^^^ - -At the end of the Transform step, we have collections of :class:`pandas.DataFrame` that -correspond to database tables. These are loaded into a SQLite database. -To handle the ~1 billion row :doc:`data_sources/epacems`, we load the dataframes into -an Apache Parquet dataset that is partitioned by state and year. - -These outputs can be accessed via Python, R, and many other tools. See the -:doc:`data_dictionaries/pudl_db` page for a list of the normalized database tables and -their contents. - -.. seealso:: - - Module documentation within the :mod:`pudl.load` sub-package. - -.. _db-and-outputs: - ---------------------------------------------------------------------------------------- -Output Tables ---------------------------------------------------------------------------------------- - -Denormalized Outputs +Output Layer ^^^^^^^^^^^^^^^^^^^^ -We normalize the data to make storage more efficient and avoid data integrity issues, -but you may want to combine information from more than one of the tables to make the -data more readable and readily interpretable. For example, PUDL stores the name that EIA -uses to refer to a power plant in the :ref:`core_eia__entity_plants` table in -association with the plant's unique numeric ID. If you are working with data from the -:ref:`core_eia923__monthly_fuel_receipts_costs` table, which records monthly per-plant -fuel deliveries, you may want to have the name of the plant alongside the fuel delivery -information since it's more recognizable than the plant ID. +Assets in the Core layer normalize the data to make storage more efficient and avoid +data integrity issues, but you may want to combine information from more than one of +the tables to make the data more readable and readily interpretable. For example, PUDL +stores the name that EIA uses to refer to a power plant in the +:ref:`core_eia__entity_plants` table in association with the plant's unique numeric ID. +If you are working with data from the :ref:`core_eia923__monthly_fuel_receipts_costs` +table, which records monthly per-plant fuel deliveries, you may want to have the name +of the plant alongside the fuel delivery information since it's more recognizable than +the plant ID. Rather than requiring everyone to write their own SQL ``SELECT`` and ``JOIN`` statements or do a bunch of :func:`pandas.merge` operations to bring together data, PUDL provides a -variety of predefined queries as methods of the :class:`pudl.output.pudltabl.PudlTabl` -class. These methods perform common joins to return output tables (pandas DataFrames) -that contain all of the useful information in one place. In some cases, like with EIA, -the output tables are composed to closely resemble the raw spreadsheet tables you're -familiar with. - -.. note:: - - In the future, we intend to replace the simple denormalized output tables with - database views that are integrated into the distributed SQLite database directly. - This will provide the same convenience without requiring use of the Python software - layer. - -Analysis Outputs -^^^^^^^^^^^^^^^^ +variety of output tables that contain all of the useful information in one place. In +some cases, like with EIA, the output tables are composed to closely resemble the raw +spreadsheet tables you're familiar with. 
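As an illustration, the join that such an output table performs can be written by
hand in a few lines. This is a sketch that assumes a local copy of ``pudl.sqlite``
and a ``plant_name_eia`` column; check the data dictionary for the exact column
names in a given release:

.. code-block:: python

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("pudl.sqlite")

    # Static plant attributes, including the human-readable plant name.
    plants = pd.read_sql(
        "SELECT plant_id_eia, plant_name_eia FROM core_eia__entity_plants", conn
    )
    # Monthly per-plant fuel deliveries.
    frc = pd.read_sql("SELECT * FROM core_eia923__monthly_fuel_receipts_costs", conn)

    # Attach the recognizable plant name to each fuel delivery record; the
    # corresponding output table saves you from doing this merge yourself.
    frc_named = frc.merge(plants, on="plant_id_eia", how="left")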
-There are several analytical routines built into the -:mod:`pudl.output.pudltabl.PudlTabl` output objects for calculating derived values -like the heat rate by generation unit (:meth:`hr_by_unit +The Output layer also contains tables produced by analytical routines for +calculating derived values like the heat rate by generation unit (:meth:`hr_by_unit `) or the capacity factor by generator (:meth:`capacity_factor `). We intend to integrate more analytical outputs into the library over time. diff --git a/src/pudl/metadata/templates/datasette-metadata.yml.jinja b/src/pudl/metadata/templates/datasette-metadata.yml.jinja index 8872e6ff7b..4fc8e3aa42 100644 --- a/src/pudl/metadata/templates/datasette-metadata.yml.jinja +++ b/src/pudl/metadata/templates/datasette-metadata.yml.jinja @@ -42,12 +42,13 @@ databases: Catalyst Cooperative as part of the Public Utility Data Liberation Project.
-          Caution:
+          Note:
-          Please note that tables beginning with "denorm_" are temporary tables whose
-          names and metadata will shortly change, as we migrate new tables into our database.
-          The structure of the data and the API are not necessarily stable, so don't
-          build any critical infrastructure on top of this just yet.
+          Tables with the "core_" prefix are normalized tables that serve as building blocks for the
+          more denormalized and easy to work with "out_" tables.
+          We recommend only working with "out_" tables.
+          To learn more about how the database is organized check out
+          PUDL's naming conventions.
           If you find something wrong, please make an issue on GitHub to let us know.

From 4d256ecd88c49167c78ae9061e93bd73b1c5affc Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Tue, 26 Sep 2023 15:22:14 +0200
Subject: [PATCH 03/10] Respond to first round of Austen's comments

---
 docs/data_access.rst                       |  9 ++-
 docs/dev/naming_conventions.rst            | 66 +++++++++++++------
 docs/intro.rst                             |  1 +
 .../templates/datasette-metadata.yml.jinja | 12 ++--
 4 files changed, 57 insertions(+), 31 deletions(-)

diff --git a/docs/data_access.rst b/docs/data_access.rst
index ca169afa23..bfacdb6501 100644
--- a/docs/data_access.rst
+++ b/docs/data_access.rst
@@ -8,11 +8,10 @@ PUDL data, so if you have a suggestion please `open a GitHub issue
 `__. If you have a question you can `create a GitHub discussion
 `__.

-PUDL's primary data output is the ``pudl.sqlite`` database. It contains a collection
-of tables that follow :ref:`PUDL's asset naming convention <asset-naming>`. Tables
-with the ``core_`` prefix are normalized tables that serve as building blocks for the
-more denormalized and easy to work with ``out_`` tables. **We recommend only working
-with ``out_`` tables.**
+PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working
+with tables with the ``out_`` prefix as these tables contain the most complete
+data. For more information about the different types of tables, read through
+:ref:`PUDL's naming conventions <asset-naming>`.

 .. _access-modes:

diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 9d1e521b77..5becde4b9b 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -12,8 +12,12 @@ columns, data sources, and functions.
 Asset Naming Conventions
 ---------------------------------------------------

-PUDL's data processing is divided into three layers of dagster assets: Raw, Core
-and Output. Asset names should generally follow this naming convention:
+PUDL's data processing is divided into three layers of Dagster assets: Raw, Core
+and Output. Dagster assets are the core unit of computation in PUDL. The outputs
+of assets can be persisted to any type of storage though PUDL outputs are typically
+tables in a SQLite database, parquet files or pickle files. The asset name is used
+for the table or parquet file name. Asset names should generally follow this naming
+convention:

 .. code-block::

@@ -39,10 +43,12 @@ Raw layer
 Core layer
 ^^^^^^^^^^
-* This layer contains well-modeled assets that serve as building blocks for downstream
-  wide tables and analyses. Well-modeled means tables in the database have logical
-  primary keys, foreign keys, datatypes and generally follow
-  :ref:`Tidy Data standards <tidy-data>`.
+* This layer contains assets that typically break denormalized raw assets into
+  well-modeled tables that serve as building blocks for downstream wide tables
+  and analyses. Well-modeled means tables in the database have logical
+  primary keys, foreign keys, datatypes and generally follow
+  :ref:`Tidy Data standards <tidy-data>`. Assets in this layer create
+  consistent categorical variables, deduplicate and impute data.
   These assets are typically stored in parquet files or tables in a database.
 * Naming convention: ``core_{source}__{asset_type}_{asset_name}``
 * ``asset_type`` describes how the asset is modeled and its role in PUDL’s
   collection of core assets. There are a handful of table types in this layer:

   * ``assn``: Association tables provide connections between entities. This data
     can be manually compiled or extracted from data sources. Examples:
+
+    * ``core_pudl__assn_plants_eia`` associates EIA Plant IDs and manually assigned
+      PUDL Plant IDs.
 * ``codes``: Code tables contain more verbose descriptions of categorical codes
    typically manually compiled from source data dictionaries. Examples:
+
+    * ``core_eia__codes_averaging_periods``
+    * ``core_eia__codes_balancing_authorities``
  * ``entity``: Entity tables contain static information about entities. For example,
-    the state a plant is located in, or the plant a boiler is a part of. Examples:
-    ``core_eia__entity_boilers``, ``core_eia923__entity_coalmine``.
+    the state a plant is located in or the plant a boiler is a part of. Examples:
+
+    * ``core_eia__entity_boilers``
+    * ``core_eia923__entity_coalmine``.
  * ``scd``: Slowly changing dimension tables describe attributes of entities that
    rarely change. For example, the ownership or the capacity of a plant. Examples:
+
+    * ``core_eia860__scd_generators``
+    * ``core_eia860__scd_plants``.
  * ``yearly/monthly/hourly``: Time series tables contain attributes about entities
    that are expected to change for each reported timestamp. Time series tables
    typically contain measurements of processes like net generation or co2 emissions.
    Examples:
+
+    * ``core_ferc714__hourly_demand_pa``
+    * ``core_ferc1__yearly_plant_in_service``.

 Output layer
 ^^^^^^^^^^^^
-* This layer uses assets in the Core layer to construct wide and complete tables
-  suitable for users to perform analysis on. This layer can contain intermediate
-  tables that bridge the core and user-facing tables.
+* Assets in this layer use the well-modeled tables from the Core layer to construct
+  wide and complete tables suitable for users to perform analysis on. This layer
+  contains intermediate tables that bridge the core and user-facing tables.
 * Naming convention: ``out_{source}__{asset_type}_{asset_name}``
 * ``source`` is optional in this layer because there can be assets that join data from
   multiple sources.
 * ``asset_type`` is also optional. It will likely describe the frequency at which
   the data is reported (annual/monthly/hourly).

 Intermediate Assets
 ^^^^^^^^^^^^^^^^^^^
-* Intermediate assets are logical steps towards a final well-modeled core asset or
+* Intermediate assets are logical steps towards a final well-modeled core or
   user-facing output asset. These assets are not intended to be persisted in the
   database or accessible to the user. These assets are denoted by a preceding
   underscore, like a private python method. For example, the intermediate asset
   ``_core_eia860__plants`` is a logical step towards the
   ``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.
+  ``_core_eia860__plants`` does some basic cleaning of the ``raw_eia860__plant``
+  asset but still contains duplicate plant entities. The computation intensive
+  harvesting process deduplicates ``_core_eia860__plants`` and outputs the
+  ``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets which
+  follow Tidy Data standards.
-* The number of intermediate assets should be limited to avoid an extremely
+* Limit the number of intermediate assets to avoid an extremely
   cluttered DAG. It is appropriate to create an intermediate asset when:

   * there is a short and long running portion of a process. It is convenient to separate
     the long and short-running processing portions into separate assets so debugging the
     short-running process doesn’t take forever.
   * there is a logical step in a process that is frequently inspected for debugging. For
     example, the pre harvest assets in the ``_core_eia860`` and ``_core_eia923`` groups
     are frequently inspected when new years of data are added.


 Columns and Field Names
 ^^^^^^^^^^^^^^^^^^^^^^^
 If two columns in different tables record the same quantity in the same units,
 give them the same name. That way if they end up in the same dataframe for
 comparison it's easy to automatically rename them with suffixes indicating
 where they came from. For example, net electricity generation is reported to
 both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
 <../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
 each of those data sources. Similarly, give non-comparable quantities reported
 in different data sources **different** column names. This helps make it clear
 that the quantities are actually different.

 * ``total`` should come at the beginning of the name (e.g.
   ``total_expns_production``)
 * Identifiers should be structured ``type`` + ``_id_`` + ``source`` where
   ``source`` is the agency or organization that has assigned the ID. (e.g.
   ``plant_id_eia``)
 * The data source or label (e.g.
``plant_id_pudl``) should follow the thing it
   is describing
-* Units should be appended to field names where applicable (e.g.
+* Append units to field names where applicable (e.g.
   ``net_generation_mwh``). This includes "per unit" signifiers (e.g. ``_pct``
   for percent, ``_ppm`` for parts per million, or a generic ``_per_unit`` when
   the type of unit varies, as in columns containing a heterogeneous collection
   of fuels)
-* Financial values are assumed to be in nominal US dollars.
+* Financial values are assumed to be in nominal US dollars (i.e., the suffix
+  ``_usd`` is implied). If they are not reported in USD, convert them to USD. If
+  they must be kept in their original form for some reason, append a suffix
+  that lets the user know they are not USD.
 * ``_id`` indicates the field contains a usually numerical reference to
   another table, which will not be intelligible without looking up the value in
   that other table.

   (e.g. connect_db), unless the function returns a simple value (e.g. datadir).
 * No duplication of information (e.g. form names).
 * lowercase, underscores separate words (i.e. ``snake_case``).
-* Semi-private helper functions (functions used within a single module only
-  and not exposed via the public API) should be preceded by an underscore.
+* Add a preceding underscore to semi-private helper functions (functions used
+  within a single module only and not exposed via the public API).
 * When the object is a table, use the full table name (e.g. ingest_fuel_ferc1).
 * When dataframe outputs are built from multiple tables, identify the type of
   information being pulled (e.g. "plants") and the source of the tables (e.g.
diff --git a/docs/intro.rst b/docs/intro.rst
index c65642f43c..e338d35306 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -28,6 +28,7 @@ We also publish SQLite databases containing relatively pristine versions of our
 more difficult to parse inputs, especially the old Visual FoxPro (DBF, pre-2021) and
 new XBRL data (2021+) published by FERC:
+
 * `FERC Form 1 (DBF) `__
 * `FERC Form 1 (XBRL) `__
 * `FERC Form 2 (XBRL) `__
diff --git a/src/pudl/metadata/templates/datasette-metadata.yml.jinja b/src/pudl/metadata/templates/datasette-metadata.yml.jinja
index 4fc8e3aa42..2dd8f3c649 100644
--- a/src/pudl/metadata/templates/datasette-metadata.yml.jinja
+++ b/src/pudl/metadata/templates/datasette-metadata.yml.jinja
@@ -44,11 +44,13 @@ databases:
           Data Liberation Project.
           Note:
-          Tables with the "core_" prefix are normalized tables that serve as building blocks for the
-          more denormalized and easy to work with "out_" tables.
-          We recommend only working with "out_" tables.
-          To learn more about how the database is organized check out
-          PUDL's naming conventions.
+          We recommend working
+          with tables with the ``out_`` prefix as these tables contain the most complete
+          data.
+          For more information about the different types of tables, read through
+          PUDL's naming conventions.
           If you find something wrong, please make an issue on GitHub to let us know.

From 1a9028df22215704acced445d953cd77fe959f5e Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Tue, 26 Sep 2023 16:21:48 +0200
Subject: [PATCH 04/10] Update rename-core-assets and clarify raw asset
 sentence

---
 docs/intro.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/intro.rst b/docs/intro.rst
index e338d35306..7bf02258a8 100644
--- a/docs/intro.rst
+++ b/docs/intro.rst
@@ -89,8 +89,8 @@ Assets in the Raw layer read the raw data from the original heterogeneous format
 a collection of :class:`pandas.DataFrame` with uniform column names across all years so
 that it can be easily processed in bulk. Data distributed as binary database files, such
 as the DBF files from FERC Form 1, may be converted into a unified SQLite database
-before individual dataframes are created. Raw data assets are typically persisted to
-pickle files and are not distributed to users.
+before individual dataframes are created. Raw data assets are not written to
+``pudl.sqlite``. Instead they are persisted to pickle files and not distributed to users.

From 32dc9ac7c76b8e83673e4d963f133367d04a8af3 Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Tue, 26 Sep 2023 17:17:51 +0200
Subject: [PATCH 05/10] Restrict astroid version to avoid random autoapi error

---
 pyproject.toml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pyproject.toml b/pyproject.toml
index 55e1fb2f6f..7c470b0679 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -116,6 +116,7 @@ dev = [
     "twine>=3.3,<4.1",
 ]
 doc = [
+    "astroid<3.0.0",
     "doc8>=1.1,<1.2",
     "furo>=2022.4.7",
     "sphinx-autoapi>=1.8,<2.2",

From 33fab91ef2b3b2fdd07a6403049d7c8c49292c82 Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Wed, 1 Nov 2023 12:05:33 -0800
Subject: [PATCH 06/10] Incorporate some docs changes from #2912

---
 README.rst                      | 97 ++++++++++++++++++++++---------
 docs/dev/naming_conventions.rst | 81 ++++++++++++-------------
 docs/intro.rst                  | 10 ++--
 pyproject.toml                  |  1 -
 4 files changed, 121 insertions(+), 68 deletions(-)

diff --git a/README.rst b/README.rst
index df5edcda3e..3e993f95dc 100644
--- a/README.rst
+++ b/README.rst
@@ -59,23 +59,71 @@ it's often difficult to work with. PUDL takes the original spreadsheets, CSV files,
 and databases and turns them into a unified resource. This allows users to spend more
 time on novel analysis and less time on data preparation.

+PUDL is comprised of three core components:
+
+- **Raw Data Archives**
+
+  - PUDL `archives `__
+    all the raw data inputs on `Zenodo `__
+    to ensure permanent, versioned access to the data. In the event that an agency
+    changes how they publish data or deletes old files, the ETL will still have access
+    to the original inputs. Each of the data inputs may have several different versions
+    archived, and all are assigned a unique DOI and made available through the REST API.
+    You can read more about the Raw Data Archives in the
+    `docs `__.
+- **ETL Pipeline**
+
+  - The ETL pipeline (this repo) ingests the raw archives, cleans them,
+    integrates them, and outputs them to a series of tables stored in SQLite Databases,
+    Parquet files, and pickle files (the Data Warehouse). Each release of the PUDL
+    Python package is embedded with a set of DOIs to indicate which version of the
+    raw inputs it is meant to process. This process helps ensure that the ETL and its
+    outputs are replicable. You can read more about the ETL in the
+    `docs `__.
+- **Data Warehouse** + + - The outputs from the ETL, sometimes called "PUDL outputs", + are stored in a data warehouse as a collection of SQLite and Parquet files so that + users can access the data without having to run any code. Learn more about how to + access the data `here `__. + What data is available? ----------------------- PUDL currently integrates data from: -* `EIA Form 860 `__: 2001-2022 -* `EIA Form 860m `__: 2023-06 -* `EIA Form 861 `__: 2001-2022 -* `EIA Form 923 `__: 2001-2022 -* `EPA Continuous Emissions Monitoring System (CEMS) `__: 1995-2022 -* `FERC Form 1 `__: 1994-2021 -* `FERC Form 714 `__: 2006-2020 -* `US Census Demographic Profile 1 Geodatabase `__: 2010 +* **EIA Form 860**: 2001-2022 + - `Source Docs `__ + - `PUDL Docs `__ +* **EIA Form 860m**: 2023-06 + - `Source Docs `__ +* **EIA Form 861**: 2001-2022 + - `Source Docs `__ + - `PUDL Docs `__ +* **EIA Form 923**: 2001-2022 + - `Source Docs `__ + - `PUDL Docs `__ +* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022 + - `Source Docs `__ + - `PUDL Docs `__ +* **FERC Form 1**: 1994-2021 + - `Source Docs `__ + - `PUDL Docs `__ +* **FERC Form 714**: 2006-2020 + - `Source Docs `__ + - `PUDL Docs `__ +* **FERC Form 2**: 2021 (raw only) + - `Source Docs `__ +* **FERC Form 6**: 2021 (raw only) + - `Source Docs `__ +* **FERC Form 60**: 2021 (raw only) + - `Source Docs `__ +* **US Census Demographic Profile 1 Geodatabase**: 2010 + - `Source Docs `__ Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment Program `__, from -2021 to 2024 we will be integrating the following data as well: +2021 to 2024 we will be cleaning and integrating the following data as well: * `EIA Form 176 `__ (The Annual Report of Natural Gas Supply and Disposition) @@ -83,7 +131,6 @@ Program `__, from * `FERC Form 2 `__ (Annual Report of Major Natural Gas Companies) * `PHMSA Natural Gas Annual Report `__ -* Machine Readable Specifications of State Clean Energy Standards Who is PUDL for? ---------------- @@ -101,8 +148,8 @@ resources and everyone in between! How do I access the data? ------------------------- -There are several ways to access PUDL outputs. For more details you'll want -to check out `the complete documentation +There are several ways to access the information in the PUDL Data Warehouse. +For more details you'll want to check out `the complete documentation `__, but here's a quick overview: Datasette @@ -118,6 +165,19 @@ This access mode is good for casual data explorers or anyone who just wants to g small subset of the data. It also lets you share links to a particular subset of the data and provides a REST API for querying the data from other applications. +Nightly Data Builds +^^^^^^^^^^^^^^^^^^^ +If you are less concerned with reproducibility and want the freshest possible data +we automatically upload the outputs of our nightly builds to public S3 storage buckets +as part of the `AWS Open Data Registry +`__. This data is based on +the `dev branch `__, of PUDL, and +is updated most weekday mornings. It is also the data used to populate Datasette. + +The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded +directly via the web. See `Accessing Nightly Builds `__ +for links to the individual SQLite, JSON, and Apache Parquet outputs. + Docker + Jupyter ^^^^^^^^^^^^^^^^ Want access to all the published data in bulk? If you're familiar with Python @@ -151,19 +211,6 @@ most users. You should check out the `Development section `__ for more details. 
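Whichever access mode you choose, once you have a local copy of ``pudl.sqlite`` you
can pull a table into a dataframe with just the standard library and pandas. A
sketch (the table name below is illustrative; list the ``out_`` tables in your copy
to see what is actually available):

.. code-block:: python

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("pudl.sqlite")

    # Discover the user-facing output tables present in this copy of the database.
    out_tables = pd.read_sql(
        r"SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'out\_%' ESCAPE '\'",
        conn,
    )

    # Read one of them into a dataframe (illustrative table name).
    gens = pd.read_sql("SELECT * FROM out_eia__yearly_generators", conn)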
-Nightly Data Builds
-^^^^^^^^^^^^^^^^^^^
-If you are less concerned with reproducibility and want the freshest possible data
-we automatically upload the outputs of our nightly builds to public S3 storage buckets
-as part of the `AWS Open Data Registry
-`__. This data is based on
-the `dev branch `__, of PUDL, and
-is updated most weekday mornings. It is also the data used to populate Datasette.
-
-The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded
-directly via the web. See `Accessing Nightly Builds `__
-for links to the individual SQLite, JSON, and Apache Parquet outputs.
-
 Contributing to PUDL
 --------------------
 Find PUDL useful? Want to help make it better? There are lots of ways to help!
diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 5becde4b9b..178f22934c 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -15,9 +15,9 @@ Asset Naming Conventions
 PUDL's data processing is divided into three layers of Dagster assets: Raw, Core
 and Output. Dagster assets are the core unit of computation in PUDL. The outputs
 of assets can be persisted to any type of storage though PUDL outputs are typically
-tables in a SQLite database, parquet files or pickle files. The asset name is used
-for the table or parquet file name. Asset names should generally follow this naming
-convention:
+tables in a SQLite database, parquet files or pickle files (read more about this here:
+:doc:`../intro`). The asset name is used for the table or parquet file name. Asset
+names should generally follow this naming convention:

 .. code-block::

@@ -33,9 +33,11 @@ convention:
 Raw layer
 ^^^^^^^^^
-* This layer contains assets that extract data from spreadsheets and databases
-  and are persisted as pickle files.
-* Naming convention: ``raw_{source}__{asset_name}``
+This layer contains assets that extract data from spreadsheets and databases
+and are persisted as pickle files.
+
+Naming convention: ``raw_{source}__{asset_name}``
+
 * ``asset_name`` is typically copied from the source data.
 * ``asset_type`` is not included in this layer because the data modeling does not
   yet conform to PUDL standards. Raw assets are typically just copies of the
   source data.

 Core layer
 ^^^^^^^^^^
-* This layer contains assets that typically break denormalized raw assets into
-  well-modeled tables that serve as building blocks for downstream wide tables
-  and analyses. Well-modeled means tables in the database have logical
-  primary keys, foreign keys, datatypes and generally follow
-  :ref:`Tidy Data standards <tidy-data>`. Assets in this layer create
-  consistent categorical variables, deduplicate and impute data.
-  These assets are typically stored in parquet files or tables in a database.
-* Naming convention: ``core_{source}__{asset_type}_{asset_name}``
+This layer contains assets that typically break denormalized raw assets into
+well-modeled tables that serve as building blocks for downstream wide tables
+and analyses. Well-modeled means tables in the database have logical
+primary keys, foreign keys, datatypes and generally follow
+:ref:`Tidy Data standards <tidy-data>`. Assets in this layer create
+consistent categorical variables, deduplicate and impute data.
+These assets are typically stored in parquet files or tables in a database.
+
+Naming convention: ``core_{source}__{asset_type}_{asset_name}``
+
 * ``asset_type`` describes how the asset is modeled and its role in PUDL’s
   collection of core assets.
There are a handful of table types in this layer:

@@ -84,10 +88,12 @@ Core layer
 Output layer
 ^^^^^^^^^^^^
-* Assets in this layer use the well modeled tables from the Core layer to construct
-  wide and complete tables suitable for users to perform analysis on. This layer
-  contains intermediate tables that bridge the core and user-facing tables.
-* Naming convention: ``out_{source}__{asset_type}_{asset_name}``
+This layer uses assets in the Core layer to construct wide and complete tables
+suitable for users to perform analysis on. This layer can contain intermediate
+tables that bridge the core and user-facing tables.
+
+Naming convention: ``out_{source}__{asset_type}_{asset_name}``
+
 * ``source`` is optional in this layer because there can be assets that join data from
   multiple sources.
 * ``asset_type`` is also optional. It will likely describe the frequency at which
   the data is reported (annual/monthly/hourly).

 Intermediate Assets
 ^^^^^^^^^^^^^^^^^^^
-* Intermediate assets are logical steps towards a final well-modeled core or
-  user-facing output asset. These assets are not intended to be persisted in the
-  database or accessible to the user. These assets are denoted by a preceding
-  underscore, like a private python method. For example, the intermediate asset
-  ``_core_eia860__plants`` is a logical step towards the
-  ``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.
-  ``_core_eia860__plants`` does some basic cleaning of the ``raw_eia860__plant``
-  asset but still contains duplicate plant entities. The computation intensive
-  harvesting process deduplicates ``_core_eia860__plants`` and outputs the
-  ``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets which
-  follow Tidy Data standards.
-* Limit the number of intermediate assets to avoid an extremely
-  cluttered DAG. It is appropriate to create an intermediate asset when:
+Intermediate assets are logical steps towards a final well-modeled core or
+user-facing output asset. These assets are not intended to be persisted in the
+database or accessible to the user. These assets are denoted by a preceding
+underscore, like a private python method. For example, the intermediate asset
+``_core_eia860__plants`` is a logical step towards the
+``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.
+``_core_eia860__plants`` does some basic cleaning of the ``raw_eia860__plant``
+asset but still contains duplicate plant entities. The computation intensive
+harvesting process deduplicates ``_core_eia860__plants`` and outputs the
+``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets which
+follow Tidy Data standards.
+
+Limit the number of intermediate assets to avoid an extremely
+cluttered DAG. It is appropriate to create an intermediate asset when:

   * there is a short and long running portion of a process. It is convenient to separate
     the long and short-running processing portions into separate assets so debugging the
     short-running process doesn’t take forever.
   * there is a logical step in a process that is frequently inspected for debugging. For
     example, the pre harvest assets in the ``_core_eia860`` and ``_core_eia923`` groups
     are frequently inspected when new years of data are added.


 Columns and Field Names
-^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------
 If two columns in different tables record the same quantity in the same units,
 give them the same name. That way if they end up in the same dataframe for
 comparison it's easy to automatically rename them with suffixes indicating
 where they came from.
For example, net electricity generation is reported to -both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923 -<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in -each of those data sources. Similarly, give non-comparable quantities reported -in different data sources **different** column names. This helps make it clear -that the quantities are actually different. +both :doc:`FERC Form 1 <../data_sources/ferc1>` and +:doc:`EIA 923<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` +in each of those data sources. Similarly, give non-comparable quantities reported in +different data sources **different** column names. This helps make it clear that the +quantities are actually different. * ``total`` should come at the beginning of the name (e.g. ``total_expns_production``) diff --git a/docs/intro.rst b/docs/intro.rst index 7bf02258a8..cbb0b78cc4 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -74,13 +74,13 @@ needed and organize them in a local :doc:`datastore `. .. _etl-process: --------------------------------------------------------------------------------------- -The Data Warehouse Design +The ETL Process --------------------------------------------------------------------------------------- -PUDL's data processing produces a data warehouse that can be used for analytics. +PUDL's ETL produces a data warehouse that can be used for analytics. The processing happens within Dagster assets that are persisted to storage, typically pickle, parquet or SQLite files. The raw data moves through three -layers of the data warehouse. +layers of processing. Raw Layer ^^^^^^^^^ @@ -201,7 +201,7 @@ Some data validations are currently only specified within our test suite, includ * The expected number of records within each table * The fact that there are no entirely N/A columns -A variety of database integrity checks are also run either during the ETL process or -when the data is loaded into SQLite. +A variety of database integrity checks are also run either during the data processing +or when the data is loaded into SQLite. See our :doc:`dev/testing` documentation for more information. diff --git a/pyproject.toml b/pyproject.toml index 6f10b2ff8b..feabc5f463 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -130,7 +130,6 @@ dev = [ "twine>=4,<4.1", ] doc = [ - "astroid<3.0.0", "doc8>=1.1,<1.2", "furo>=2022.4.7", "sphinx-autoapi>=3,<4", From 10111e4ae61c976c0ec9c9e86b2e185dc87f2731 Mon Sep 17 00:00:00 2001 From: bendnorman Date: Mon, 6 Nov 2023 17:01:49 -0900 Subject: [PATCH 07/10] Remove README.rst from index.rst and move intro content to index --- README.rst | 23 ++-- docs/data_access.rst | 10 +- docs/dev/naming_conventions.rst | 2 +- docs/index.rst | 210 +++++++++++++++++++++++++++++++- docs/intro.rst | 207 ------------------------------- docs/release_notes.rst | 2 +- 6 files changed, 224 insertions(+), 230 deletions(-) delete mode 100644 docs/intro.rst diff --git a/README.rst b/README.rst index 3e993f95dc..45e08b00e3 100644 --- a/README.rst +++ b/README.rst @@ -59,6 +59,16 @@ it's often difficult to work with. PUDL takes the original spreadsheets, CSV fil and databases and turns them into a unified resource. This allows users to spend more time on novel analysis and less time on data preparation. 
+The project is focused on serving researchers, activists, journalists, policy makers,
+and small businesses that might not otherwise be able to afford access to this data
+from commercial sources and who may not have the time or expertise to do all the
+data processing themselves from scratch.
+
+We want to make this data accessible and easy to work with for as wide an audience as
+possible: anyone from grassroots youth climate organizers working with Google
+sheets to university researchers with access to scalable cloud computing
+resources and everyone in between!
+
 PUDL is comprised of three core components:

 - **Raw Data Archives**

@@ -132,19 +142,6 @@ Program `__, from
 * `FERC Form 2 `__
   (Annual Report of Major Natural Gas Companies)
 * `PHMSA Natural Gas Annual Report `__

-Who is PUDL for?
-----------------
-
-The project is focused on serving researchers, activists, journalists, policy makers,
-and small businesses that might not otherwise be able to afford access to this data
-from commercial sources and who may not have the time or expertise to do all the
-data processing themselves from scratch.
-
-We want to make this data accessible and easy to work with for as wide an audience as
-possible: anyone from a grassroots youth climate organizers working with Google
-sheets to university researchers with access to scalable cloud computing
-resources and everyone in between!
-
 How do I access the data?
 -------------------------
diff --git a/docs/data_access.rst b/docs/data_access.rst
index bfacdb6501..282cdb9bff 100644
--- a/docs/data_access.rst
+++ b/docs/data_access.rst
@@ -2,16 +2,16 @@
 Data Access
 =======================================================================================

-We publish the :doc:`PUDL pipeline ` outputs in several ways to serve
+We publish the PUDL pipeline outputs in several ways to serve
 different users and use cases. We're always trying to increase accessibility of the
 PUDL data, so if you have a suggestion please `open a GitHub issue
 `__. If you have a question you can `create a GitHub discussion
 `__.

-PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working
-with tables with the ``out_`` prefix as these tables contain the most complete
-data. For more information about the different types of tables, read through
-:ref:`PUDL's naming conventions <asset-naming>`.
+PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working with
+tables with the ``out_`` prefix, as these tables contain the most complete and easiest
+to work with data. For more information about the different types
+of tables, read through :ref:`PUDL's naming conventions <asset-naming>`.

diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 178f22934c..2cbe6145de 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -16,7 +16,7 @@ PUDL's data processing is divided into three layers of Dagster assets: Raw, Core
 and Output. Dagster assets are the core unit of computation in PUDL. The outputs
 of assets can be persisted to any type of storage though PUDL outputs are typically
 tables in a SQLite database, parquet files or pickle files (read more about this here:
-:doc:`../intro`). The asset name is used for the table or parquet file name. Asset
+:doc:`../index`). The asset name is used for the table or parquet file name. Asset
 names should generally follow this naming convention:
.. code-block::

    {layer}_{source}__{asset_type}_{asset_name}

diff --git a/docs/index.rst b/docs/index.rst
index f7904d4a97..d21bfc3044 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -2,14 +2,218 @@
 The Public Utility Data Liberation Project
 ===============================================================================

-.. include:: ../README.rst
-   :start-after: readme-intro
+PUDL is a data processing pipeline created by `Catalyst Cooperative
+`__ that cleans, integrates, and standardizes some of the most
+widely used public energy datasets in the US. The data serve researchers, activists,
+journalists, and policy makers that might not have the technical expertise to access it
+in its raw form, the time to clean and prepare the data for bulk analysis, or the means
+to purchase it from existing commercial providers.
+
+---------------------------------------------------------------------------------------
+Available Data
+---------------------------------------------------------------------------------------
+
+We focus primarily on poorly curated data published by the US government in
+semi-structured but machine readable formats. For details on exactly what data is
+available from these data sources and what state it is in, see the individual
+pages for each source:
+
+* :doc:`data_sources/eia860`
+* :doc:`data_sources/eia861`
+* :doc:`data_sources/eia923`
+* :doc:`data_sources/epacems`
+* :doc:`data_sources/ferc1`
+* :doc:`data_sources/ferc714`
+
+PUDL's clean and complete versions of these data sources are stored in the
+``pudl.sqlite`` database and ``core_epacems__hourly_emissions.parquet`` files.
+To get started using PUDL data, visit our :doc:`data_access` page, or continue reading
+to learn more about the PUDL data processing pipeline.
+
+We also publish SQLite databases containing relatively pristine versions of our more
+difficult to parse inputs, especially the old Visual FoxPro (DBF, pre-2021) and new XBRL
+data (2021+) published by FERC:
+
+* `FERC Form 1 (DBF) `__
+* `FERC Form 1 (XBRL) `__
+* `FERC Form 2 (XBRL) `__
+* `FERC Form 6 (XBRL) `__
+* `FERC Form 60 (XBRL) `__
+* `FERC Form 714 (XBRL) `__
+
+.. _raw-data-archive:
+
+---------------------------------------------------------------------------------------
+Raw Data Archives
+---------------------------------------------------------------------------------------
+
+PUDL depends on "raw" data inputs from sources that are known to occasionally update
+their data or alter the published format. These changes may be incompatible with the way
+the data are read and interpreted by PUDL, so, to ensure the integrity of our data
+processing, we periodically create archives of `the raw inputs on Zenodo
+`__. Each of the data inputs may
+have several different versions archived, and all are assigned a unique DOI and made
+available through the REST API. Each release of the PUDL Python package is embedded
+with a set of DOIs to indicate which version of the raw inputs it is meant to
+process. This process helps ensure that our outputs are replicable.
+
+To enable programmatic access to individual partitions of the data (by year, state,
+etc.), we archive the raw inputs as `Frictionless Data Packages
+`__. The data packages contain both the
+raw data in their originally published format (CSVs, Excel spreadsheets, and Visual
+FoxPro database (DBF) files) and metadata that describes how each
+dataset is partitioned.
+
+The PUDL software will download a copy of the appropriate raw inputs automatically as
+needed and organize them in a local :doc:`datastore `.

..
seealso::
+
+    The software that creates and archives the raw inputs can be found in our
+    `PUDL Archiver `__
+    repository on GitHub.
+
+.. _etl-process:
+
+---------------------------------------------------------------------------------------
+The ETL Process
+---------------------------------------------------------------------------------------
+
+PUDL's ETL produces a data warehouse that can be used for analytics.
+The processing happens within Dagster assets that are persisted to storage,
+typically pickle, parquet or SQLite files. The raw data moves through three
+layers of processing.
+
+Raw Layer
+^^^^^^^^^
+
+Assets in the Raw layer read the raw data from the original heterogeneous formats into
+a collection of :class:`pandas.DataFrame` objects with uniform column names across all
+years so that they can be easily processed in bulk. Data distributed as binary database
+files, such as the DBF files from FERC Form 1, may be converted into a unified SQLite
+database before individual dataframes are created. Raw data assets are not written to
+``pudl.sqlite``. Instead, they are persisted to pickle files and not distributed
+to users.
+
+.. seealso::
+
+    Module documentation within the :mod:`pudl.extract` subpackage.
+
+Core Layer
+^^^^^^^^^^
+
+The Core layer contains well-modeled assets that serve as building blocks for
+downstream wide tables and analyses. Well-modeled means tables in the database
+have logical primary keys, foreign keys, datatypes and generally follow
+:ref:`Tidy Data standards `. The assets are loaded to a SQLite
+database or Parquet file.
+
+These outputs can be accessed via Python, R, and many other tools. See the
+:doc:`data_dictionaries/pudl_db` page for a list of the normalized database tables and
+their contents.
+
+Data processing in the Core layer is generally broken down into two phases. Phase one
+focuses on cleaning and organizing data within individual tables, while phase two focuses
+on the integration and deduplication of data between tables. These tasks can be tedious
+`data wrangling toil `__ that imposes a
+huge amount of overhead on anyone trying to do analysis based on the publicly
+available data. PUDL implements common data cleaning operations in the hopes that we
+can all work on more interesting problems most of the time. These operations include:
+
+* Standardization of units (e.g. dollars, not thousands of dollars)
+* Standardization of N/A values
+* Standardization of freeform names and IDs
+* Use of controlled vocabularies for categorical values like fuel type
+* Use of more readable codes and column names
+* Imposition of well-defined, rich data types for each column
+* Converting local timestamps to UTC
+* Reshaping of data into well-normalized tables which minimize data duplication
+* Inferring Plant IDs which link records across many years of FERC Form 1 data
+* Inferring linkages between FERC and EIA Plants and Utilities
+* Inferring more complete associations between EIA boilers and generators
+
+.. seealso::

+
+    The module and per-table transform functions in the :mod:`pudl.transform`
+    sub-package have more details on the specific transformations applied to each
+    table.
+
+Many of the original datasets contain large amounts of duplicated data. For instance,
+the EIA reports the name of each power plant in every table that refers to otherwise
+unique plant-related data. Similarly, many attributes like plant latitude and
+longitude are reported separately every year. Often, these reported values are not
+self-consistent. 
There may be several different spellings of a plant's name, or an
+incorrectly reported latitude in one year.
+
+Assets in the Core layer attempt to eliminate this kind of inconsistent and duplicate
+information when normalizing the tables by choosing only the most consistently reported
+value for inclusion in the final database. If a value that should be static is not
+consistently reported, it may also be set to N/A.
+
+Output Layer
+^^^^^^^^^^^^
+
+Assets in the Core layer normalize the data to make storage more efficient and avoid
+data integrity issues, but you may want to combine information from more than one of
+the tables to make the data more readable and readily interpretable. For example, PUDL
+stores the name that EIA uses to refer to a power plant in the
+:ref:`core_eia__entity_plants` table in association with the plant's unique numeric ID.
+If you are working with data from the :ref:`core_eia923__monthly_fuel_receipts_costs`
+table, which records monthly per-plant fuel deliveries, you may want to have the name
+of the plant alongside the fuel delivery information since it's more recognizable than
+the plant ID.
+
+Rather than requiring everyone to write their own SQL ``SELECT`` and ``JOIN`` statements
+or do a bunch of :func:`pandas.merge` operations to bring together data, PUDL provides a
+variety of output tables that contain all of the useful information in one place. In
+some cases, like with EIA, the output tables are composed to closely resemble the raw
+spreadsheet tables you're familiar with.
+
+The Output layer also contains tables produced by analytical routines for
+calculating derived values like the heat rate by generation unit (:meth:`hr_by_unit
+`) or the capacity factor by generator
+(:meth:`capacity_factor `). We intend to
+integrate more analytical outputs into the library over time.
+
+.. seealso::
+
+    * `The PUDL Examples GitHub repo `__
+      to see how to access the PUDL Database directly, use the output functions, or
+      work with the EPA CEMS data using Dask.
+    * `How to Learn Dask in 2021 `__
+      is a great collection of self-guided resources if you are already familiar with
+      Python, Pandas, and NumPy.
+
+.. _test-and-validate:
+
+---------------------------------------------------------------------------------------
+Data Validation
+---------------------------------------------------------------------------------------
+We have a growing collection of data validation test cases that we run before
+publishing a data release to try to avoid publishing data with known issues. Most of
+these validations are described in the :mod:`pudl.validate` module. They check things
+like:
+
+* The heat content of various fuel types is within expected bounds.
+* Coal ash, moisture, mercury, sulfur, etc. content is within expected bounds.
+* Generator heat rates and capacity factors are realistic for the type of prime mover
+  being reported.
+
+Some data validations are currently only specified within our test suite, including:
+
+* The expected number of records within each table
+* The fact that there are no entirely N/A columns
+
+A variety of database integrity checks are also run either during the data processing
+or when the data is loaded into SQLite.
+
+See our :doc:`dev/testing` documentation for more information.
+
 .. 
toctree:: :hidden: :maxdepth: 2 - intro data_access data_sources/index data_dictionaries/index diff --git a/docs/intro.rst b/docs/intro.rst deleted file mode 100644 index cbb0b78cc4..0000000000 --- a/docs/intro.rst +++ /dev/null @@ -1,207 +0,0 @@ -======================================================================================= -Introduction -======================================================================================= - -PUDL is a data processing pipeline created by `Catalyst Cooperative -`__ that cleans, integrates, and standardizes some of the most -widely used public energy datasets in the US. The data serve researchers, activists, -journalists, and policy makers that might not have the technical expertise to access it -in its raw form, the time to clean and prepare the data for bulk analysis, or the means -to purchase it from existing commercial providers. - ---------------------------------------------------------------------------------------- -Available Data ---------------------------------------------------------------------------------------- - -We focus primarily on poorly curated data published by the US government in -semi-structured but machine readable formats. For details on exactly what data is -available from these data sources and what state it is in, see the the individual -pages for each source: - -* :doc:`data_sources/eia860` -* :doc:`data_sources/eia861` -* :doc:`data_sources/eia923` -* :doc:`data_sources/epacems` -* :doc:`data_sources/ferc1` -* :doc:`data_sources/ferc714` - -We also publish SQLite databases containing relatively pristine versions of our more -difficult to parse inputs, especially the old Visual FoxPro (DBF, pre-2021) and new XBRL -data (2021+) published by FERC: - -* `FERC Form 1 (DBF) `__ -* `FERC Form 1 (XBRL) `__ -* `FERC Form 2 (XBRL) `__ -* `FERC Form 6 (XBRL) `__ -* `FERC Form 60 (XBRL) `__ -* `FERC Form 714 (XBRL) `__ - -To get started using PUDL data, visit our :doc:`data_access` page, or continue reading -to learn more about the PUDL data processing pipeline. - -.. _raw-data-archive: - ---------------------------------------------------------------------------------------- -Raw Data Archives ---------------------------------------------------------------------------------------- - -PUDL depends on "raw" data inputs from sources that are known to occasionally update -their data or alter the published format. These changes may be incompatible with the way -the data are read and interpreted by PUDL, so, to ensure the integrity of our data -processing, we periodically create archives of `the raw inputs on Zenodo -`__. Each of the data inputs may -have several different versions archived, and all are assigned a unique DOI and made -available through the REST API. Each release of the PUDL Python package is embedded -with a set of of DOIs to indicate which version of the raw inputs it is meant to -process. This process helps ensure that our outputs are replicable. - -To enable programmatic access to individual partitions of the data (by year, state, -etc.), we archive the raw inputs as `Frictionless Data Packages -`__. The data packages contain both the -raw data in their originally published format (CSVs, Excel spreadsheets, and Visual -FoxPro database (DBF) files) and metadata that describes how each the -dataset is partitioned. - -The PUDL software will download a copy of the appropriate raw inputs automatically as -needed and organize them in a local :doc:`datastore `. - -.. 
seealso:: - - The software that creates and archives the raw inputs can be found in our - `PUDL Archiver `__ - repository on GitHub. - -.. _etl-process: - ---------------------------------------------------------------------------------------- -The ETL Process ---------------------------------------------------------------------------------------- - -PUDL's ETL produces a data warehouse that can be used for analytics. -The processing happens within Dagster assets that are persisted to storage, -typically pickle, parquet or SQLite files. The raw data moves through three -layers of processing. - -Raw Layer -^^^^^^^^^ - -Assets in the Raw layer read the raw data from the original heterogeneous formats into -a collection of :class:`pandas.DataFrame` with uniform column names across all years so -that it can be easily processed in bulk. Data distributed as binary database files, such -as the DBF files from FERC Form 1, may be converted into a unified SQLite database -before individual dataframes are created. Raw data assets are not written to -``pudl.sqlite``, persisted to pickle files and not distributed to users. - -.. seealso:: - - Module documentation within the :mod:`pudl.extract` subpackage. - -Core Layer -^^^^^^^^^^ - -The Core layer contains well-modeled assets that serve as building blocks for -downstream wide tables and analyses. Well-modeled means tables in the database -have logical primary keys, foreign keys, datatypes and generally follow -:ref:`Tidy Data standards `. The assets are loaded to a SQLite -database or Parquet file. - -These outputs can be accessed via Python, R, and many other tools. See the -:doc:`data_dictionaries/pudl_db` page for a list of the normalized database tables and -their contents. - -Data processing in the Core layer is generally broken down into two phases. Phase one -focuses on cleaning and organizing data within individual tables while phase two focuses -on the integration and deduplication of data between tables. These tasks can be tedious -`data wrangling toil `__ that impose a -huge amount of overhead on anyone trying to do analysis based on the publicly -available data. PUDL implements common data cleaning operations in the hopes that we -can all work on more interesting problems most of the time. These operations include: - -* Standardization of units (e.g. dollars not thousands of dollars) -* Standardization of N/A values -* Standardization of freeform names and IDs -* Use of controlled vocabularies for categorical values like fuel type -* Use of more readable codes and column names -* Imposition of well defined, rich data types for each column -* Converting local timestamps to UTC -* Reshaping of data into well normalized tables which minimize data duplication -* Inferring Plant IDs which link records across many years of FERC Form 1 data -* Inferring linkages between FERC and EIA Plants and Utilities. -* Inferring more complete associations between EIA boilers and generators - -.. seealso:: - - The module and per-table transform functions in the :mod:`pudl.transform` - sub-package have more details on the specific transformations applied to each - table. - -Many of the original datasets contain large amounts of duplicated data. For instance, -the EIA reports the name of each power plant in every table that refers to otherwise -unique plant-related data. Similarly, many attributes like plant latitude and -longitude are reported separately every year. Often, these reported values are not -self-consistent. 
There may be several different spellings of a plant's name, or an -incorrectly reported latitude in one year. - -Assets in the Core layer attempt to eliminate this kind of inconsistent and duplicate -information when normalizing the tables by choosing only the most consistently reported -value for inclusion in the final database. If a value which should be static is not -consistently reported, it may also be set to N/A. - -Output Layer -^^^^^^^^^^^^^^^^^^^^ - -Assets in the Core layer normalize the data to make storage more efficient and avoid -data integrity issues, but you may want to combine information from more than one of -the tables to make the data more readable and readily interpretable. For example, PUDL -stores the name that EIA uses to refer to a power plant in the -:ref:`core_eia__entity_plants` table in association with the plant's unique numeric ID. -If you are working with data from the :ref:`core_eia923__monthly_fuel_receipts_costs` -table, which records monthly per-plant fuel deliveries, you may want to have the name -of the plant alongside the fuel delivery information since it's more recognizable than -the plant ID. - -Rather than requiring everyone to write their own SQL ``SELECT`` and ``JOIN`` statements -or do a bunch of :func:`pandas.merge` operations to bring together data, PUDL provides a -variety of output tables that contain all of the useful information in one place. In -some cases, like with EIA, the output tables are composed to closely resemble the raw -spreadsheet tables you're familiar with. - -The Output layer also contains tables produced by analytical routines for -calculating derived values like the heat rate by generation unit (:meth:`hr_by_unit -`) or the capacity factor by generator -(:meth:`capacity_factor `). We intend to -integrate more analytical outputs into the library over time. - -.. seealso:: - - * `The PUDL Examples GitHub repo `__ - to see how to access the PUDL Database directly, use the output functions, or - work with the EPA CEMS data using Dask. - * `How to Learn Dask in 2021 `__ - is a great collection of self-guided resources if you are already familiar with - Python, Pandas, and NumPy. - -.. _test-and-validate: - ---------------------------------------------------------------------------------------- -Data Validation ---------------------------------------------------------------------------------------- -We have a growing collection of data validation test cases that we run before -publishing a data release to try and avoid publishing data with known issues. Most of -these validations are described in the :mod:`pudl.validate` module. They check things -like: - -* The heat content of various fuel types are within expected bounds. -* Coal ash, moisture, mercury, sulfur etc. content are within expected bounds -* Generator heat rates and capacity factors are realistic for the type of prime mover - being reported. - -Some data validations are currently only specified within our test suite, including: - -* The expected number of records within each table -* The fact that there are no entirely N/A columns - -A variety of database integrity checks are also run either during the data processing -or when the data is loaded into SQLite. - -See our :doc:`dev/testing` documentation for more information. 
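As a concrete companion to the access guidance in ``docs/data_access.rst`` above, here
is a minimal sketch of pulling an ``out_``-prefixed table straight out of
``pudl.sqlite``. It is not part of this patch: the database path is a placeholder, and
the table name is only illustrative of the ``out_`` naming convention.

.. code-block:: python

    import pandas as pd
    import sqlalchemy as sa

    # Placeholder path to a locally downloaded copy of the PUDL database.
    engine = sa.create_engine("sqlite:////path/to/pudl.sqlite")

    # "out_" tables are the recommended starting point: human-readable names
    # and attributes from related entity tables are already merged in.
    gens = pd.read_sql("out_eia__yearly_generators", engine)
    print(gens.head())

Because the denormalization has already happened in the Output layer, no ``JOIN``
statements or :func:`pandas.merge` calls are needed here.
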
diff --git a/docs/release_notes.rst b/docs/release_notes.rst
index 9b477f8fa6..fc37f23a0c 100644
--- a/docs/release_notes.rst
+++ b/docs/release_notes.rst
@@ -296,7 +296,7 @@ Deprecations
 * Replace references to deprecated ``pudl-scrapers`` and
   ``pudl-zenodo-datastore`` repositories with references to `pudl-archiver
   `__ repository in
-  :doc:`intro`, :doc:`dev/datastore`, and :doc:`dev/annual_updates`. See :pr:`2190`.
+  ``intro``, :doc:`dev/datastore`, and :doc:`dev/annual_updates`. See :pr:`2190`.
 * :mod:`pudl.etl` is now a subpackage that collects all pudl assets into a dagster
   `Definition `__. All
   ``pudl.etl._etl_{datasource}`` functions have been deprecated. The coordination

From 85c6fe35612cd23755b4b4fab279323ebb70b5f9 Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Tue, 7 Nov 2023 16:27:30 -0900
Subject: [PATCH 08/10] Add deprecation warnings to PudlTabl and add minor naming docs updates

---
 docs/dev/naming_conventions.rst | 15 +++++++++++++--
 docs/release_notes.rst          | 17 +++++++++++++++++
 src/pudl/output/pudltabl.py     | 11 +++++++++++
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 178f22934c..3c2d0908b0 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -29,7 +29,9 @@ names should generally follow this naming convention:
    ``eia860``, ``ferc1`` and ``epacems``.
 * ``asset_type`` describes how the asset in modeled.
 * ``asset_name`` should describe the entity, categorical code type, or measurement of
-  the asset.
+  the asset. Note: FERC Form 1 assets typically include the schedule number in the
+  ``asset_name`` so users and contributors know which schedule the cleaned asset
+  refers to.

 Raw layer
 ^^^^^^^^^
@@ -55,14 +57,23 @@ These assets are typically stored in parquet files or tables in a database.

 Naming convention: ``core_{source}__{asset_type}_{asset_name}``

+* ``source`` is sometimes ``pudl``. This means the asset
+  is a derived connection that the PUDL contributors created to link multiple
+  datasets via manual or machine-learning methods.
+
 * ``asset_type`` describes how the asset is modeled and its role in PUDL’s
   collection of core assets. There are a handful of table types in this layer:

   * ``assn``: Association tables provide connections between entities. This data
-    can be manually compiled or extracted from data sources. Examples:
+    can be manually compiled or extracted from data sources. If the asset associates
+    data from two sources, the source names should be included in the ``asset_name``.
+    The source names should appear in the same order for all assets that associate
+    the two sources. Examples:

     * ``core_pudl__assn_plants_eia`` associates EIA Plant IDs and manually assigned
       PUDL Plant IDs.
+    * ``core_epa__assn_epacamd_eia`` associates EPA units with EIA plants, boilers,
+      and generators.

   * ``codes``: Code tables contain more verbose descriptions of categorical codes
     typically manually compiled from source data dictionaries. Examples:

diff --git a/docs/release_notes.rst b/docs/release_notes.rst
index 9b477f8fa6..5c2536dbd9 100644
--- a/docs/release_notes.rst
+++ b/docs/release_notes.rst
@@ -67,6 +67,23 @@ Dagster Adoption

 * :mod:`pudl.convert.censusdp1tract_to_sqlite` and :mod:`pudl.output.censusdp1tract`
   are now integrated into dagster. See :issue:`1973` and :pr:`2621`. 
+New Asset Naming Convention
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+There are hundreds of new tables in ``pudl.sqlite`` now that the methods in ``PudlTabl``
+have been converted to Dagster assets. This significant increase in tables and diversity
+of table types prompted us to create a new naming convention to make the table names
+more descriptive and organized. You can read about the new naming convention in the
+:ref:`docs `.
+
+To help users migrate away from using ``PudlTabl`` and our temporary table names,
+we've created a `Google sheet `__
+that maps the old table names and ``PudlTabl`` methods to the new table names.
+
+We plan to remove ``PudlTabl`` from the pudl package once our known users have
+successfully migrated to pulling data directly from ``pudl.sqlite``. We've added
+deprecation warnings to the ``PudlTabl`` class. We expect to remove ``PudlTabl``
+at the end of February 2024.
+
 Data Coverage
 ^^^^^^^^^^^^^

diff --git a/src/pudl/output/pudltabl.py b/src/pudl/output/pudltabl.py
index 08f03a9137..eefc4bb63b 100644
--- a/src/pudl/output/pudltabl.py
+++ b/src/pudl/output/pudltabl.py
@@ -89,6 +89,12 @@ def __init__(
             unit_ids: If True, use several heuristics to assign
                 individual generators to functional units. EXPERIMENTAL.
         """
+        logger.warning(
+            "PudlTabl is deprecated and will be removed from the pudl package "
+            "at the end of February 2024. To access the data returned by "
+            "this class, pull the desired table directly from the pudl.sqlite "
+            "database."
+        )
         if not isinstance(pudl_engine, sa.engine.base.Engine):
             raise TypeError(
                 "PudlTabl needs pudl_engine to be a SQLAlchemy Engine, but we "
@@ -296,6 +302,11 @@ def _get_table_from_db(
             "It is retained for backwards compatibility only."
         )
         table_name = self._agg_table_name(table_name)
+        logger.warning(
+            "PudlTabl is deprecated and will be removed from the pudl package "
+            "at the end of February 2024. To access the data returned by this method, "
+            f"use the {table_name} table in the pudl.sqlite database."
+        )
         resource = Resource.from_id(table_name)
         return pd.concat(
             [

From 479ec7f921999bfa79af0188220f01f70a1b62fc Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Wed, 8 Nov 2023 10:58:32 -0900
Subject: [PATCH 09/10] Remove PudlTabl removal date and make assn table name sources alphabetical

---
 docs/dev/naming_conventions.rst |  5 ++---
 docs/release_notes.rst          |  7 +++----
 src/pudl/output/pudltabl.py     | 12 ++++++------
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/docs/dev/naming_conventions.rst b/docs/dev/naming_conventions.rst
index 3c2d0908b0..5ccf005030 100644
--- a/docs/dev/naming_conventions.rst
+++ b/docs/dev/naming_conventions.rst
@@ -66,9 +66,8 @@ Naming convention: ``core_{source}__{asset_type}_{asset_name}``

   * ``assn``: Association tables provide connections between entities. This data
     can be manually compiled or extracted from data sources. If the asset associates
-    data from two sources, the source names should be included in the ``asset_name``.
-    The source names should appear in the same order for all assets that associate
-    the two sources. Examples:
+    data from two sources, the source names should be included in the ``asset_name``
+    in alphabetical order. Examples:

     * ``core_pudl__assn_plants_eia`` associates EIA Plant IDs and manually assigned
       PUDL Plant IDs. 
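The alphabetical-ordering rule above is mechanical enough to express in a few lines of
Python. A toy sketch, purely illustrative and not a function in the ``pudl`` package;
the source and dataset names are hypothetical:

.. code-block:: python

    def assn_table_name(source: str, dataset_a: str, dataset_b: str) -> str:
        """Compose a core-layer association table name per the convention above."""
        first, second = sorted((dataset_a, dataset_b))  # alphabetical ordering rule
        return f"core_{source}__assn_{first}_{second}"

    # A hypothetical association between EIA and FERC Form 1 entities:
    assert assn_table_name("pudl", "ferc1", "eia") == "core_pudl__assn_eia_ferc1"
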
diff --git a/docs/release_notes.rst b/docs/release_notes.rst
index 5c2536dbd9..5bdc338d95 100644
--- a/docs/release_notes.rst
+++ b/docs/release_notes.rst
@@ -79,10 +79,9 @@ To help users migrate away from using ``PudlTabl`` and our temporary table names,
 we've created a `Google sheet `__
 that maps the old table names and ``PudlTabl`` methods to the new table names.

-We plan to remove ``PudlTabl`` from the pudl package once our known users have
-successfully migrated to pulling data directly from ``pudl.sqlite``. We've added
-deprecation warnings to the ``PudlTabl`` class. We expect to remove ``PudlTabl``
-at the end of February 2024.
+We've added deprecation warnings to the ``PudlTabl`` class. We plan to remove
+``PudlTabl`` from the ``pudl`` package once our known users have
+successfully migrated to pulling data directly from ``pudl.sqlite``.

 Data Coverage
 ^^^^^^^^^^^^^

diff --git a/src/pudl/output/pudltabl.py b/src/pudl/output/pudltabl.py
index eefc4bb63b..ede31c3f00 100644
--- a/src/pudl/output/pudltabl.py
+++ b/src/pudl/output/pudltabl.py
@@ -90,10 +90,9 @@ def __init__(
             individual generators to functional units. EXPERIMENTAL.
         """
         logger.warning(
-            "PudlTabl is deprecated and will be removed from the pudl package "
-            "at the end of February 2024. To access the data returned by "
-            "this class, pull the desired table directly from the pudl.sqlite "
-            "database."
+            "PudlTabl is deprecated and will be removed from the pudl package "
+            "once known users have migrated to accessing the data directly from "
+            "pudl.sqlite. "
         )
         if not isinstance(pudl_engine, sa.engine.base.Engine):
             raise TypeError(
@@ -303,8 +302,9 @@ def _get_table_from_db(
         )
         table_name = self._agg_table_name(table_name)
         logger.warning(
-            "PudlTabl is deprecated and will be removed from the pudl package "
-            "at the end of February 2024. To access the data returned by this method, "
+            "PudlTabl is deprecated and will be removed from the pudl package "
+            "once known users have migrated to accessing the data directly from "
+            "pudl.sqlite. To access the data returned by this method, "
             f"use the {table_name} table in the pudl.sqlite database."
         )
         resource = Resource.from_id(table_name)

From c3298047a93020fac6bb97e57d6cce2cbdf3467b Mon Sep 17 00:00:00 2001
From: bendnorman
Date: Wed, 8 Nov 2023 11:35:05 -0900
Subject: [PATCH 10/10] Explain why CEMS is stored as parquet

---
 docs/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.rst b/docs/index.rst
index d21bfc3044..ff75d5f13f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -26,7 +26,7 @@ pages for each source:
 * :doc:`data_sources/ferc714`

 PUDL's clean and complete versions of these data sources are stored in the
-``pudl.sqlite`` database and ``core_epacems__hourly_emissions.parquet`` files.
+``pudl.sqlite`` database. Larger datasets like EPA CEMS are stored in parquet files.
 To get started using PUDL data, visit our :doc:`data_access` page, or continue reading
 to learn more about the PUDL data processing pipeline.
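Since the hourly CEMS data lives in parquet rather than in ``pudl.sqlite``, reading it
looks slightly different from the SQLite example earlier. A minimal sketch, assuming a
local copy of the file and assuming ``year`` and ``state`` are available as filterable
columns; both the path and the filter columns are illustrative:

.. code-block:: python

    import pandas as pd

    # Parquet predicate pushdown lets you read one state-year at a time
    # instead of loading the full multi-gigabyte dataset into memory.
    cems = pd.read_parquet(
        "/path/to/core_epacems__hourly_emissions.parquet",  # placeholder path
        filters=[("year", "=", 2022), ("state", "=", "CO")],
    )
    print(len(cems))

Filtering at read time is the main reason to prefer parquet for a dataset of this size:
only the row groups matching the predicate are ever pulled off disk.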