FERC 714: transform of hourly demand table (dbf +xbrl) (#3842)
* first very wip draft of transforming the hourly 714 table

* early processing of datetimes and initial cleaning of timezone codes

* lil function suffix cleanup

* group the table-specific transforms into staticmethods of a table transform class

* yay add the hour into the csv report_date early so i'm not oopsies losing all the report_dates plus lots of documentation

* lil extra doc clean

* Map FERC 714 XBRL and CSV IDs (#3849)

* Add respondent ID csv

* Add notes columns to CSV

* Preliminary fixes to the 714 data source page

* integrate the respondent_id_ferc714 map into transforms

* Add notes on CSV-XBRL ID linkage to docs

* Write preliminary transform class and function for XBRL and CSV core_ferc714__yearly_planning_area_demand_forecast table

* wip first round of respondent table transforming

* Combine XBRL and CSV tables

* Add forecast to forecast column names

* Add migration file for new forecast cols

* finish eia_code mapping and wrap up transforms

* update docs

* update docs again lol spaces

* fix forcast to forecast type and add to run() docstring

* convert :meth: to :func:

* lower expected forecast year range

* fix docs typo

* Use split/apply/combine for deduping and update assertion

* responding to pr comments mostly doc updates

* Add new years to Ferc714CheckSpec

* update docs

* first pass of adding respondent id tables

* add alembic migration for the glue tables

* remove the lil post process step

* Light edits

* release notes and metadata updates

* Add table description for annual forecast table and fix indentation errors

* update docs and metadata, plus stop trying to impute midnight jan 1st 2024

* update the validation test expectations for the analysis downstream stuff

* update the settinggggsss omigosh plus restrict the imputations based on the years processed

* add module-level design notes

* add move color to the fast test 12 assertion

* remove the lil context thing that is no longer necessary

---------

Co-authored-by: E. Belfer <[email protected]>
Co-authored-by: Austen Sharpe <[email protected]>
Co-authored-by: e-belfer <[email protected]>
Co-authored-by: Austen Sharpe <[email protected]>
5 people authored Sep 25, 2024
1 parent ee74680 commit b291160
Showing 22 changed files with 1,769 additions and 560 deletions.
10 changes: 10 additions & 0 deletions docs/release_notes.rst
@@ -6,6 +6,16 @@ PUDL Release Notes
v2024.X.x (2024-XX-XX)
---------------------------------------------------------------------------------------

New Data Coverage
^^^^^^^^^^^^^^^^^

FERC Form 714
~~~~~~~~~~~~~
* Integrate the 2021-2023 FERC Form 714 data. Starting with the 2021 reporting
year, FERC changed its reporting format from CSV files to XBRL files. This update
integrates the two raw data sources and extends the data coverage through 2023.
See :issue:`3809` and :pr:`3842`.

Schema Changes
^^^^^^^^^^^^^^
* Added :ref:`out_eia__yearly_assn_plant_parts_plant_gen` table. This table associates
76 changes: 57 additions & 19 deletions docs/templates/ferc714_child.rst.jinja
@@ -1,6 +1,9 @@
{% extends "data_source_parent.rst.jinja" %}

{% block background %}
FERC Form 714, otherwise known as the Annual Electric Balancing Authority Area and
Planning Area Report, collects data and provides insights about balancing authority
area and planning area operations.

{% endblock %}

@@ -13,28 +16,21 @@
{% block availability %}
The data we've integrated from FERC Form 714 includes:

* hourly electricity demand by utility or balancing authority from 2006-2020
* a table identifying the form respondents including their EIA utility or balancing
* Hourly electricity demand by utility or balancing authority.
* Annual demand forecast.
* A table identifying the form respondents including their EIA utility or balancing
authority ID, which allows us to link the FERC-714 data to other information
reported in :doc:`eia860` and :doc:`eia861`.

We have not yet had the opportunity to work with the most recent FERC-714 data (2021 and
later), which is now being published using the new XBRL format.

The hourly demand data for 2006-2020 is about 15 million records. There are about 200
respondents that show up in the respondents table.

WIth the EIA IDs, we link the hourly electricity demand to a particular georgraphic
region at the county level, because utilities and balancing authorities report their
service territories in :ref:`core_eia861__yearly_service_territory`, and from that
information we can estimate historical hourly electricity demand by state.
With the EIA IDs we can link the hourly electricity demand to a particular geographic
region at the county level because utilities and balancing authorities report their
service territories in :ref:`core_eia861__yearly_service_territory`. From that
information we estimate historical hourly electricity demand by state.

Plant operators reported in :ref:`core_eia860__scd_plants` and generator ownership
information reported in :ref:`core_eia860__scd_ownership` are linked to
:ref:`core_eia860__scd_utilities` and :ref:`core_eia861__yearly_balancing_authority` and
so can also be linked to the :ref:`core_ferc714__respondent_id` table, as well as the
:ref:`core_epacems__hourly_emissions` unit-level emissions and generation data reported
in :doc:`epacems`.
can therefore be linked to the :ref:`core_ferc714__respondent_id` table.

{% endblock %}

@@ -56,32 +52,44 @@ formats:
* **2021-present**: Standardized electronic filing using the XBRL (eXtensible Business
Reporting Language) dialect of XML.

We only have plans to integrate the data from the standardized electronic reporting era
since the format of the earlier data varies for each reporting balancing authority and
utility, and would be very labor intensive to parse and reconcile.
We only plan to integrate the data from the standardized electronic reporting era
(2006+) since the format of the earlier data varies for each reporting balancing authority
and utility, and would be very labor intensive to parse and reconcile.

{% endblock %}

{% block notable_irregularities %}

Timezone errors
---------------

The original hourly electricity demand time series is plagued with timezone and daylight
savings vs. standard time irregularities, which we have done our best to clean up. The
timestamps in the clean data are all in UTC, with a timezone code stored in a separate
column, so that the times can be easily localized or converted. It's certainly not
perfect, but it's much better than the original data and it's easy to work with!
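
As a rough illustration of what "easily localized" can look like in practice, here is a minimal pandas sketch. The column names (``datetime_utc``, ``timezone``) and the sample values are assumptions made for this example; check the table's actual schema before relying on them.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions for illustration.
demand = pd.DataFrame(
    {
        "datetime_utc": pd.to_datetime(
            ["2020-01-01 05:00", "2020-07-01 04:00"], utc=True
        ),
        "timezone": ["America/New_York", "America/New_York"],
        "demand_mwh": [1200.0, 1500.0],
    }
)


def to_local(df: pd.DataFrame) -> pd.Series:
    """Return naive local wall-clock timestamps, one timezone group at a time."""
    parts = []
    for tz, group in df.groupby("timezone"):
        # Convert from UTC to the group's timezone, then drop the tz info so
        # that groups with different timezones can share one column.
        local = group["datetime_utc"].dt.tz_convert(tz).dt.tz_localize(None)
        parts.append(local)
    return pd.concat(parts).sort_index()


demand["datetime_local"] = to_local(demand)
```

Converting per timezone group avoids mixing several timezones in one timezone-aware column, which pandas does not support.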

Sign errors
-----------

Not all respondents use the same sign convention for reporting "demand." The vast
majority consider demand / load that they serve to be a positive number, and so we've
standardized the data to use that convention.
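
One way such a standardization could be done is sketched below. This is only an assumed heuristic (flip respondents whose median reported demand is negative) on made-up data, not a description of the actual PUDL transform.

```python
import pandas as pd

# Made-up data: respondent 2 reports demand as negative numbers.
demand = pd.DataFrame(
    {
        "respondent_id_ferc714": [1, 1, 2, 2],
        "demand_mwh": [100.0, 110.0, -50.0, -55.0],
    }
)

# Assumed heuristic: a respondent whose median demand is negative is taken to
# be using the opposite sign convention, so all of its values are flipped.
median = demand.groupby("respondent_id_ferc714")["demand_mwh"].transform("median")
demand.loc[median < 0, "demand_mwh"] *= -1
```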

Reporting gaps
--------------

There are a lot of reporting gaps, especially for smaller respondents. Sometimes these
are brief, and sometimes they are entire years. There are also a number of outliers and
suspicious values (e.g. a long series of identical consecutive values). We have some
tools that we've built to clean up these outliers in
:mod:`pudl.analysis.timeseries_cleaning`.

Respondent-to-balancing-authority inconsistencies
-------------------------------------------------

Because utilities and balancing authorities occasionally change their service
territories or merge, the demand reproted by any individual "respondent" may correspond
territories or merge, the demand reported by any individual "respondent" may correspond
to wildly different consumers in different years. To make it at least somewhat possible
to compare the reported data across time, we've also compiled historical service
territory maps for the respondents based on data reported in :doc:`eia861`. However,
@@ -93,4 +101,34 @@ be found in :mod:`pudl.analysis.service_territory` and :mod:`pudl.analysis.spati
The :mod:`pudl.analysis.state_demand` script brings together all of the above to
estimate historical hourly electricity demand by state for 2006-2020.

Combining XBRL and CSV data
---------------------------

The format of the company identifiers (CIDs) used in the CSV data (2006-2020) and the
XBRL data (2021+) differs. To link respondents between both data formats, we manually
map the IDs from both datasets and create a ``respondent_id_ferc714`` in
:mod:`pudl.package_data.glue.respondent_id_ferc714.csv`.

This CSV builds on the `migrated data
<https://www.ferc.gov/filing-forms/eforms-refresh/migrated-data-downloads>`__ provided
by FERC during the transition from CSV to XBRL data, which notes that:

Companies that did not have a CID prior to the migration have been assigned a CID that
begins with R, i.e., a temporary RID. These RIDs will be replaced in future with the
accurate CIDs and new datasets will be published.

The file names of the migrated data (which correspond to CSV IDs) and the respondent
CIDs in the migrated files provide the basis for ID mapping. Though CIDs are intended to
be static, some of the CIDs in the migrated data weren't found in the actual XBRL data,
and the same respondents were reporting data using different CIDs. To ensure accurate
record matching, we manually reviewed the CIDs for each respondent, matching based on
name and location. Some quirks to note:

* Respondents that can be matched are matched 1:1 between the CSV and XBRL data.
Unmatched respondents mostly result from mergers, splits, acquisitions, and
companies that no longer exist.
* Some CIDs assigned during the migration process do not appear in the data. Given the
intention by FERC to make these CIDs permanent, they are still included in the mapping
CSV in case these respondents re-appear. All temporary IDs (beginning with R) were
removed.
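
The sketch below shows how the mapping can be used to put CSV-era and XBRL-era records under a single ``respondent_id_ferc714``. The ID values, the ``report_year`` column, and the demand numbers are invented for illustration; only the three ID column names come from the tables added in this change.

```python
import pandas as pd

# Invented association rows, mirroring the CSV and XBRL glue tables.
csv_ids = pd.DataFrame(
    {"respondent_id_ferc714": [1, 2], "respondent_id_ferc714_csv": [101, 102]}
)
xbrl_ids = pd.DataFrame(
    {
        "respondent_id_ferc714": [1, 3],
        "respondent_id_ferc714_xbrl": ["C000001", "C000003"],
    }
)

# Invented raw records keyed by each era's native ID.
csv_demand = pd.DataFrame(
    {
        "respondent_id_ferc714_csv": [101, 102],
        "report_year": [2020, 2020],
        "demand_mwh": [10.0, 20.0],
    }
)
xbrl_demand = pd.DataFrame(
    {
        "respondent_id_ferc714_xbrl": ["C000001", "C000003"],
        "report_year": [2021, 2021],
        "demand_mwh": [11.0, 31.0],
    }
)

# Attach the PUDL-assigned ID to each era, then stack the two eras.
# validate="m:1" raises if a native ID maps to more than one PUDL ID.
combined = pd.concat(
    [
        csv_demand.merge(csv_ids, on="respondent_id_ferc714_csv", validate="m:1"),
        xbrl_demand.merge(xbrl_ids, on="respondent_id_ferc714_xbrl", validate="m:1"),
    ],
    ignore_index=True,
)[["respondent_id_ferc714", "report_year", "demand_mwh"]]
```

In this toy data respondent 1 appears in both eras, while respondents 2 and 3 each appear in only one, which is the unmatched case described above.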

{% endblock %}
@@ -0,0 +1,91 @@
"""Add my cool lil respondent id glue tables and other 714 xbrl updates
Revision ID: 8fffc1d0399a
Revises: a93bdb8d4fbd
Create Date: 2024-09-24 09:28:45.862748
"""
from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = '8fffc1d0399a'
down_revision = 'a93bdb8d4fbd'
branch_labels = None
depends_on = None


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.create_table('core_pudl__assn_ferc714_pudl_respondents',
    sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
    sa.PrimaryKeyConstraint('respondent_id_ferc714', name=op.f('pk_core_pudl__assn_ferc714_pudl_respondents'))
    )
    op.create_table('core_pudl__assn_ferc714_csv_pudl_respondents',
    sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
    sa.Column('respondent_id_ferc714_csv', sa.Integer(), nullable=False, comment='FERC Form 714 respondent ID from CSV reported data - published from years: 2006-2020. This ID is linked to the newer years of reported XBRL data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as respondent_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'),
    sa.ForeignKeyConstraint(['respondent_id_ferc714'], ['core_pudl__assn_ferc714_pudl_respondents.respondent_id_ferc714'], name=op.f('fk_core_pudl__assn_ferc714_csv_pudl_respondents_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents')),
    sa.PrimaryKeyConstraint('respondent_id_ferc714', 'respondent_id_ferc714_csv', name=op.f('pk_core_pudl__assn_ferc714_csv_pudl_respondents'))
    )
    op.create_table('core_pudl__assn_ferc714_xbrl_pudl_respondents',
    sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
    sa.Column('respondent_id_ferc714_xbrl', sa.Text(), nullable=False, comment='FERC Form 714 respondent ID from XBRL reported data - published from years: 2021-present. This ID is linked to the older years of reported CSV data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as entity_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'),
    sa.ForeignKeyConstraint(['respondent_id_ferc714'], ['core_pudl__assn_ferc714_pudl_respondents.respondent_id_ferc714'], name=op.f('fk_core_pudl__assn_ferc714_xbrl_pudl_respondents_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents')),
    sa.PrimaryKeyConstraint('respondent_id_ferc714', 'respondent_id_ferc714_xbrl', name=op.f('pk_core_pudl__assn_ferc714_xbrl_pudl_respondents'))
    )
    with op.batch_alter_table('core_ferc714__respondent_id', schema=None) as batch_op:
        batch_op.add_column(sa.Column('respondent_id_ferc714_csv', sa.Integer(), nullable=True, comment='FERC Form 714 respondent ID from CSV reported data - published from years: 2006-2020. This ID is linked to the newer years of reported XBRL data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as respondent_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'))
        batch_op.add_column(sa.Column('respondent_id_ferc714_xbrl', sa.Text(), nullable=True, comment='FERC Form 714 respondent ID from XBRL reported data - published from years: 2021-present. This ID is linked to the older years of reported CSV data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as entity_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'))
        batch_op.create_foreign_key(batch_op.f('fk_core_ferc714__respondent_id_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    with op.batch_alter_table('core_ferc714__yearly_planning_area_demand_forecast', schema=None) as batch_op:
        batch_op.add_column(sa.Column('summer_peak_demand_forecast_mw', sa.Float(), nullable=True, comment='The maximum forecasted hourly summer load (for the months of June through September).'))
        batch_op.add_column(sa.Column('winter_peak_demand_forecast_mw', sa.Float(), nullable=True, comment='The maximum forecasted hourly winter load (for the months of January through March).'))
        batch_op.add_column(sa.Column('net_demand_forecast_mwh', sa.Float(), nullable=True, comment='Net forecasted electricity demand for the specific period in megawatt-hours (MWh).'))
        batch_op.drop_constraint('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
        batch_op.create_foreign_key(batch_op.f('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])
        batch_op.drop_column('summer_peak_demand_mw')
        batch_op.drop_column('net_demand_mwh')
        batch_op.drop_column('winter_peak_demand_mw')

    with op.batch_alter_table('out_ferc714__respondents_with_fips', schema=None) as batch_op:
        batch_op.drop_constraint('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
        batch_op.create_foreign_key(batch_op.f('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    with op.batch_alter_table('out_ferc714__summarized_demand', schema=None) as batch_op:
        batch_op.drop_constraint('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
        batch_op.create_foreign_key(batch_op.f('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    with op.batch_alter_table('out_ferc714__summarized_demand', schema=None) as batch_op:
        batch_op.drop_constraint(batch_op.f('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
        batch_op.create_foreign_key('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    with op.batch_alter_table('out_ferc714__respondents_with_fips', schema=None) as batch_op:
        batch_op.drop_constraint(batch_op.f('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
        batch_op.create_foreign_key('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    with op.batch_alter_table('core_ferc714__yearly_planning_area_demand_forecast', schema=None) as batch_op:
        batch_op.add_column(sa.Column('winter_peak_demand_mw', sa.FLOAT(), nullable=True))
        batch_op.add_column(sa.Column('net_demand_mwh', sa.FLOAT(), nullable=True))
        batch_op.add_column(sa.Column('summer_peak_demand_mw', sa.FLOAT(), nullable=True))
        batch_op.drop_constraint(batch_op.f('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
        batch_op.create_foreign_key('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])
        batch_op.drop_column('net_demand_forecast_mwh')
        batch_op.drop_column('winter_peak_demand_forecast_mw')
        batch_op.drop_column('summer_peak_demand_forecast_mw')

    with op.batch_alter_table('core_ferc714__respondent_id', schema=None) as batch_op:
        batch_op.drop_constraint(batch_op.f('fk_core_ferc714__respondent_id_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
        batch_op.drop_column('respondent_id_ferc714_xbrl')
        batch_op.drop_column('respondent_id_ferc714_csv')

    op.drop_table('core_pudl__assn_ferc714_xbrl_pudl_respondents')
    op.drop_table('core_pudl__assn_ferc714_csv_pudl_respondents')
    op.drop_table('core_pudl__assn_ferc714_pudl_respondents')
    # ### end Alembic commands ###
40 changes: 29 additions & 11 deletions src/pudl/analysis/state_demand.py
@@ -293,9 +293,9 @@ def load_hourly_demand_matrix_ferc714(
matrix = out_ferc714__hourly_planning_area_demand.pivot(
index="datetime", columns="respondent_id_ferc714", values="demand_mwh"
)
# List timezone by year for each respondent
# List timezone by year for each respondent by the datetime
out_ferc714__hourly_planning_area_demand["year"] = (
out_ferc714__hourly_planning_area_demand["report_date"].dt.year
out_ferc714__hourly_planning_area_demand["datetime"].dt.year
)
utc_offset = out_ferc714__hourly_planning_area_demand.groupby(
["respondent_id_ferc714", "year"], as_index=False
Expand Down Expand Up @@ -378,7 +378,9 @@ def filter_ferc714_hourly_demand_matrix(
return df


def impute_ferc714_hourly_demand_matrix(df: pd.DataFrame) -> pd.DataFrame:
def impute_ferc714_hourly_demand_matrix(
df: pd.DataFrame, years: list[int]
) -> pd.DataFrame:
"""Impute null values in FERC 714 hourly demand matrix.
Imputation is performed separately for each year,
@@ -390,17 +392,28 @@ def impute_ferc714_hourly_demand_matrix(df: pd.DataFrame) -> pd.DataFrame:
Args:
df: FERC 714 hourly demand matrix,
as described in :func:`load_ferc714_hourly_demand_matrix`.
years: list of years to impute
Returns:
Copy of `df` with imputed values.
"""
results = []
for year, gdf in df.groupby(df.index.year):
logger.info(f"Imputing year {year}")
keep = df.columns[~gdf.isnull().all()]
tsi = pudl.analysis.timeseries_cleaning.Timeseries(gdf[keep])
result = tsi.to_dataframe(tsi.impute(method="tnn"), copy=False)
results.append(result)
# sort here and then don't sort in the groupby so we can process
# the newer years of data first. This is so we can see early if
# new data causes any failures.
df = df.sort_index(ascending=False)
for year, gdf in df.groupby(df.index.year, sort=False):
# drop records outside of the working years because some
# respondents report one record at midnight on January 1st
# of the next year (report_date.dt.year + 1). This function
# chunks over one year at a time, so that lone record would
# otherwise be processed as a year of its own.
if year in years:
logger.info(f"Imputing year {year}")
keep = df.columns[~gdf.isnull().all()]
tsi = pudl.analysis.timeseries_cleaning.Timeseries(gdf[keep])
result = tsi.to_dataframe(tsi.impute(method="tnn"), copy=False)
results.append(result)
return pd.concat(results)


@@ -474,8 +487,12 @@ def _out_ferc714__hourly_demand_matrix(
return df


@asset(compute_kind="NumPy")
@asset(
compute_kind="NumPy",
required_resource_keys={"dataset_settings"},
)
def _out_ferc714__hourly_imputed_demand(
context,
_out_ferc714__hourly_demand_matrix: pd.DataFrame,
_out_ferc714__utc_offset: pd.DataFrame,
) -> pd.DataFrame:
@@ -492,7 +509,8 @@ def _out_ferc714__hourly_imputed_demand(
Returns:
df: DataFrame with imputed FERC714 hourly demand.
"""
df = impute_ferc714_hourly_demand_matrix(_out_ferc714__hourly_demand_matrix)
years = context.resources.dataset_settings.ferc714.years
df = impute_ferc714_hourly_demand_matrix(_out_ferc714__hourly_demand_matrix, years)
df = melt_ferc714_hourly_demand_matrix(df, _out_ferc714__utc_offset)
return df
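
The year-restricted loop in `impute_ferc714_hourly_demand_matrix` can be tried out in isolation with the sketch below. It keeps the same control flow (sort newest first, group by year without re-sorting, skip years that were not requested) but swaps in plain linear interpolation as a stand-in for the `tnn` imputation, so the filled values are illustrative only. The function name `impute_by_year` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_by_year(df: pd.DataFrame, years: list[int]) -> pd.DataFrame:
    """Fill nulls one year at a time, newest first, skipping unrequested years."""
    results = []
    # Sort descending and group with sort=False so newer years come first.
    df = df.sort_index(ascending=False)
    for year, gdf in df.groupby(df.index.year, sort=False):
        if year not in years:
            # e.g. a lone midnight January 1st record from the following year
            continue
        # Stand-in for the real imputation: linear interpolation in time order.
        results.append(gdf.sort_index().interpolate(limit_direction="both"))
    return pd.concat(results)


# Three hours at the end of 2023 plus one stray record at midnight in 2024.
index = pd.date_range("2023-12-31 21:00", "2024-01-01 00:00", freq="h")
matrix = pd.DataFrame({"r1": [1.0, np.nan, 3.0, 4.0]}, index=index)
imputed = impute_by_year(matrix, years=[2023])
```

With `years=[2023]` the stray 2024 record is dropped rather than imputed as a one-row year.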
