Merge pull request #2874 from catalyst-cooperative/create-naming-conv…

…ention-docs Add naming new naming convention to docs
catalyst-cooperative · Nov 10, 2023 · cb9b188
2 parents 0c3b9ae + 53d5618
commit cb9b188

Showing 9 changed files with 519 additions and 389 deletions.
diff --git a/README.rst b/README.rst
@@ -59,50 +59,94 @@ it's often difficult to work with. PUDL takes the original spreadsheets, CSV fil
 and databases and turns them into a unified resource. This allows users to spend more
 time on novel analysis and less time on data preparation.
 
+The project is focused on serving researchers, activists, journalists, policy makers,
+and small businesses that might not otherwise be able to afford access to this data
+from commercial sources and who may not have the time or expertise to do all the
+data processing themselves from scratch.
+
+We want to make this data accessible and easy to work with for as wide an audience as
+possible: anyone from grassroots youth climate organizers working with Google
+Sheets to university researchers with access to scalable cloud computing
+resources and everyone in between!
+
+PUDL is comprised of three core components:
+
+- **Raw Data Archives**
+
+  - PUDL `archives <https://github.com/catalyst-cooperative/pudl-archiver>`__
+    all the raw data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__
+    to ensure permanent, versioned access to the data. In the event that an agency
+    changes how they publish data or deletes old files, the ETL will still have access
+    to the original inputs. Each of the data inputs may have several different versions
+    archived, and all are assigned a unique DOI and made available through the REST API.
+    You can read more about the Raw Data Archives in the
+    `docs <https://catalystcoop-pudl.readthedocs.io/en/dev/intro.html#raw-data-archives>`__.
+- **ETL Pipeline**
+
+  - The ETL pipeline (this repo) ingests the raw archives, cleans them,
+    integrates them, and outputs them to a series of tables stored in SQLite Databases,
+    Parquet files, and pickle files (the Data Warehouse). Each release of the PUDL
+    Python package is embedded with a set of DOIs to indicate which version of the
+    raw inputs it is meant to process. This process helps ensure that the ETL and its
+    outputs are replicable. You can read more about the ETL in the
+    `docs <https://catalystcoop-pudl.readthedocs.io/en/dev/intro.html#the-etl-process>`__.
+- **Data Warehouse**
+
+  - The outputs from the ETL, sometimes called "PUDL outputs",
+    are stored in a data warehouse as a collection of SQLite and Parquet files so that
+    users can access the data without having to run any code. Learn more about how to
+    access the data `here <https://catalystcoop-pudl.readthedocs.io/en/dev/data_access.html>`__.
+
 What data is available?
 -----------------------
 
 PUDL currently integrates data from:
 
-* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001-2022
-* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2023-06
-* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001-2022
-* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001-2022
-* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995-2022
-* `FERC Form 1 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__: 1994-2021
-* `FERC Form 714 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__: 2006-2020
-* `US Census Demographic Profile 1 Geodatabase <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__: 2010
+* **EIA Form 860**: 2001-2022
+  - `Source Docs <https://www.eia.gov/electricity/data/eia860/>`__
+  - `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia860.html>`__
+* **EIA Form 860m**: 2023-06
+  - `Source Docs <https://www.eia.gov/electricity/data/eia860m/>`__
+* **EIA Form 861**: 2001-2022
+  - `Source Docs <https://www.eia.gov/electricity/data/eia861/>`__
+  - `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia861.html>`__
+* **EIA Form 923**: 2001-2022
+  - `Source Docs <https://www.eia.gov/electricity/data/eia923/>`__
+  - `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia923.html>`__
+* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022
+  - `Source Docs <https://campd.epa.gov/>`__
+  - `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/epacems.html>`__
+* **FERC Form 1**: 1994-2021
+  - `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__
+  - `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc1.html>`__
+* **FERC Form 714**: 2006-2020
+  - `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__
+  - `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc714.html>`__
+* **FERC Form 2**: 2021 (raw only)
+  - `Source Docs <https://www.ferc.gov/industries-data/natural-gas/industry-forms/form-2-2a-3-q-gas-historical-vfp-data>`__
+* **FERC Form 6**: 2021 (raw only)
+  - `Source Docs <https://www.ferc.gov/general-information-1/oil-industry-forms/form-6-6q-historical-vfp-data>`__
+* **FERC Form 60**: 2021 (raw only)
+  - `Source Docs <https://www.ferc.gov/form-60-annual-report-centralized-service-companies>`__
+* **US Census Demographic Profile 1 Geodatabase**: 2010
+  - `Source Docs <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__
 
 Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment
 Program <https://sloan.org/programs/research/energy-and-environment>`__, from
-2021 to 2024 we will be integrating the following data as well:
+2021 to 2024 we will be cleaning and integrating the following data as well:
 
 * `EIA Form 176 <https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s176>`__
   (The Annual Report of Natural Gas Supply and Disposition)
 * `FERC Electric Quarterly Reports (EQR) <https://www.ferc.gov/industries-data/electric/power-sales-and-markets/electric-quarterly-reports-eqr>`__
 * `FERC Form 2 <https://www.ferc.gov/industries-data/natural-gas/overview/general-information/natural-gas-industry-forms/form-22a-data>`__
   (Annual Report of Major Natural Gas Companies)
 * `PHMSA Natural Gas Annual Report <https://www.phmsa.dot.gov/data-and-statistics/pipeline/gas-distribution-gas-gathering-gas-transmission-hazardous-liquids>`__
-* Machine Readable Specifications of State Clean Energy Standards
-
-Who is PUDL for?
-----------------
-
-The project is focused on serving researchers, activists, journalists, policy makers,
-and small businesses that might not otherwise be able to afford access to this data
-from commercial sources and who may not have the time or expertise to do all the
-data processing themselves from scratch.
-
-We want to make this data accessible and easy to work with for as wide an audience as
-possible: anyone from a grassroots youth climate organizers working with Google
-sheets to university researchers with access to scalable cloud computing
-resources and everyone in between!
 
 How do I access the data?
 -------------------------
 
-There are several ways to access PUDL outputs. For more details you'll want
-to check out `the complete documentation
+There are several ways to access the information in the PUDL Data Warehouse.
+For more details you'll want to check out `the complete documentation
 <https://catalystcoop-pudl.readthedocs.io>`__, but here's a quick overview:
 
 Datasette
@@ -118,6 +162,19 @@ This access mode is good for casual data explorers or anyone who just wants to g
 small subset of the data. It also lets you share links to a particular subset of the
 data and provides a REST API for querying the data from other applications.
 
+Nightly Data Builds
+^^^^^^^^^^^^^^^^^^^
+If you are less concerned with reproducibility and want the freshest possible data,
+we automatically upload the outputs of our nightly builds to public S3 storage buckets
+as part of the `AWS Open Data Registry
+<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__.  This data is based on
+the `dev branch <https://github.com/catalyst-cooperative/pudl/tree/dev>`__ of PUDL, and
+is updated most weekday mornings. It is also the data used to populate Datasette.
+
+The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded
+directly via the web. See `Accessing Nightly Builds <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-nightly-builds>`__
+for links to the individual SQLite, JSON, and Apache Parquet outputs.
+
 Docker + Jupyter
 ^^^^^^^^^^^^^^^^
 Want access to all the published data in bulk? If you're familiar with Python
@@ -151,19 +208,6 @@ most users. You should check out the `Development section <https://catalystcoop-
 of the main `PUDL documentation <https://catalystcoop-pudl.readthedocs.io>`__ for more
 details.
 
-Nightly Data Builds
-^^^^^^^^^^^^^^^^^^^
-If you are less concerned with reproducibility and want the freshest possible data
-we automatically upload the outputs of our nightly builds to public S3 storage buckets
-as part of the `AWS Open Data Registry
-<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__.  This data is based on
-the `dev branch <https://github.com/catalyst-cooperative/pudl/tree/dev>`__, of PUDL, and
-is updated most weekday mornings. It is also the data used to populate Datasette.
-
-The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded
-directly via the web. See `Accessing Nightly Builds <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-nightly-builds>`__
-for links to the individual SQLite, JSON, and Apache Parquet outputs.
-
 Contributing to PUDL
 --------------------
 Find PUDL useful? Want to help make it better? There are lots of ways to help!

diff --git a/docs/data_access.rst b/docs/data_access.rst
@@ -2,12 +2,17 @@
 Data Access
 =======================================================================================
 
-We publish the :doc:`PUDL pipeline <intro>` outputs in several ways to serve
+We publish the PUDL pipeline outputs in several ways to serve
 different users and use cases. We're always trying to increase accessibility of the
 PUDL data, so if you have a suggestion please `open a GitHub issue
 <https://github.com/catalyst-cooperative/pudl/issues>`__. If you have a question you
 can `create a GitHub discussion <https://github.com/orgs/catalyst-cooperative/discussions/new?category=help-me>`__.
 
+PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working with
+tables with the ``out_`` prefix, as these tables contain the most complete data and
+are the easiest to work with. For more information about the different types
+of tables, read through :ref:`PUDL's naming conventions <asset-naming>`.
+
 .. _access-modes:
 
 ---------------------------------------------------------------------------------------
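
The recommendation added to ``docs/data_access.rst`` above (prefer tables with the ``out_`` prefix in ``pudl.sqlite``) can be illustrated with a minimal sketch using only the Python standard library. The helper names and the demo table names below are hypothetical and are not part of PUDL; with a real download you would presumably connect to your local copy of ``pudl.sqlite`` instead of the in-memory demo database.

```python
import sqlite3


def list_output_tables(conn):
    """Return the names of all tables that start with the ``out_`` prefix."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    # Filter in Python so the underscore is matched literally
    # (in a SQL LIKE pattern, "_" is a single-character wildcard).
    return [name for (name,) in rows if name.startswith("out_")]


def make_demo_db():
    """Build a tiny in-memory stand-in for pudl.sqlite (hypothetical tables)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE core_demo__plants (plant_id INTEGER)")
    conn.execute("CREATE TABLE out_demo__plants (plant_id INTEGER, plant_name TEXT)")
    conn.execute("CREATE TABLE out_demo__generators (generator_id TEXT)")
    return conn


if __name__ == "__main__":
    # Assumption: swap make_demo_db() for sqlite3.connect("pudl.sqlite")
    # if a local copy of the database sits in the working directory.
    demo = make_demo_db()
    print(list_output_tables(demo))
```

Filtering on the prefix this way only narrows the list of candidate tables; the naming conventions document referenced in the diff remains the place to learn what each table type contains.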

diff --git a/docs/dev/data_guidelines.rst b/docs/dev/data_guidelines.rst
@@ -64,6 +64,8 @@ Examples of Unacceptable Changes
   fuel heat content and net electricity generation. The heat rate would
   be a derived value and not part of the original data.
 
+.. _tidy-data:
+
 -------------------------------------------------------------------------------
 Make Tidy Data
 -------------------------------------------------------------------------------
@@ -117,24 +119,6 @@ that M/Mega is a million in SI. And a `BTU
 energy required to raise the temperature of one *avoirdupois pound* of water
 by 1 degree *Fahrenheit*?! What century even is this?).
 
--------------------------------------------------------------------------------
-Silo the ETL Process
--------------------------------------------------------------------------------
-It should be possible to run the ETL process on each data source independently
-and with any combination of data sources included. This allows users to include
-only the data need. In some cases, like the :doc:`EIA 860
-<../data_sources/eia860>` and :doc:`EIA 923 <../data_sources/eia923>` data, two
-data sources may be so intertwined that keeping them separate doesn't really
-make sense. This should be the exception, however, not the rule.
-
--------------------------------------------------------------------------------
-Separate Data from Glue
--------------------------------------------------------------------------------
-The glue that relates different data sources to each other should be applied
-after or alongside the ETL process and not as a mandatory part of ETL. This
-makes it easy to pull individual data sources in and work with them even when
-the glue isn't working or doesn't yet exist.
-
 -------------------------------------------------------------------------------
 Partition Big Data
 -------------------------------------------------------------------------------
@@ -146,35 +130,6 @@ them to pull in only certain years, certain states, or other sensible partitions
 data so that they don’t run out of memory or disk space or have to wait hours while data
 they don't need is being processed.
 
--------------------------------------------------------------------------------
-Naming Conventions
--------------------------------------------------------------------------------
-    *There are only two hard problems in computer science: caching,
-    naming things, and off-by-one errors.*
-
-Use Consistent Names
-^^^^^^^^^^^^^^^^^^^^
-If two columns in different tables record the same quantity in the same units,
-give them the same name. That way if they end up in the same dataframe for
-comparison it's easy to automatically rename them with suffixes indicating
-where they came from. For example, net electricity generation is reported to
-both :doc:`FERC Form 1 <../data_sources/ferc1>` and :doc:`EIA 923
-<../data_sources/eia923>`, so we've named columns ``net_generation_mwh`` in
-each of those data sources. Similarly, give non-comparable quantities reported
-in different data sources **different** column names. This helps make it clear
-that the quantities are actually different.
-
-Follow Existing Conventions
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-We are trying to use consistent naming conventions for the data tables,
-columns, data sources, and functions. Generally speaking PUDL is a collection
-of subpackages organized by purpose (extract, transform, load, analysis,
-output, datastore…), containing a module for each data source. Each data source
-has a short name that is used everywhere throughout the project and is composed of
-the reporting agency and the form number or another identifying abbreviation:
-``ferc1``, ``epacems``, ``eia923``, ``eia861``, etc. See the :doc:`naming
-conventions <naming_conventions>` document for more details.
-
 -------------------------------------------------------------------------------
 Complete, Continuous Time Series
 -------------------------------------------------------------------------------