Integrate Eia923 Q2 2024 Data #3768

aesharpe · 2024-08-06T20:25:00Z

Overview

Merge before #3767

Closes #3760

What problem does this address?
Add Q2 2024 923 data

What did you change?

Update 923 DOI
Update 923 package_data
Update release_notes
Update minmax row validation tests
Map unmapped plant / utility IDs
Update Data Source documentation page to describe duplicate primary key quirks
No need to update working partitions or ETL settings because they already specify 2024.

Testing

How did you make sure this worked? How can a reviewer verify this?

Materialize all EIA860 and 923 raw and assn assets
Materialize all EIA860 and 923 core and out assets (including moce)
Run make pytest-minmax-rows

To-do list

If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. test_minmax_rows())
Update the release notes: reference the PR and related issues.
Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
Review the PR yourself and call out any questions or issues you have
For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
Alternatively, run the build-deploy-pudl GitHub Action manually.

cmgosnell

i'm suspicious of the gf and bf row count updates. otherwise this is obviously pretty chill and straightforward

cmgosnell · 2024-08-07T14:11:19Z

test/validate/eia_test.py

-        ("bf_eia923", 1_642_829, 1_642_829, 135_980),
+        ("bf_eia923", 1_642_806, 1_642_806, 135_980),
        ("bga_eia860", 153_487, 153_487, 153_487),
-        ("boil_eia860", 89_051, 89_051, 89_051),
+        ("boil_eia860", 89_050, 89_050, 89_050),


do you have a sense of why these went down? the bf in particular here seems wrong because there should be a few more months of data in there for the raw and monthly tables

The new months of 923 data show up as data inputs into existing columns. Once we hit January there will be rows (transformed from the raw table columns) for every month, regardless of whether there is data in them. Row fluctuations are more likely due to retroactive data changes which can lead to positive and negative changes.

I dug a little deeper into the bf table and uncovered some unexpected cruft that explains some of the row count changes. See my comment below!

cmgosnell · 2024-08-07T14:12:08Z

test/validate/eia_test.py

-        ("gens_eia860", 590_881, 590_881, 590_881),
-        ("gf_eia923", 3_064_042, 3_064_042, 260_842),
+        ("gens_eia860", 591_256, 591_256, 591_256),
+        ("gf_eia923", 3_064_045, 3_064_045, 260_842),


i would also assume that these gf rows would go up by a few months of data

aesharpe · 2024-08-08T03:15:30Z

I dug into why the out_eia923__monthly_boiler_fuel table row count changed. This table contains rows for all months in a given report year, even if they have not occurred and contain blank data (because of how the raw data has columns for each month that get filled in with each monthly release). This means that row changes within a calendar year should be minimal and pertain to little data tweaks vs. the addition of new months of data.
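For anyone following along, here is a minimal sketch of why the future months already have rows. It is not the actual PUDL transform: `stack_months` is a hypothetical helper, the values are made up, and the column names only mimic the raw `fuel_mmbtu_per_unit_<month>` pattern. Melting one wide row per year always yields twelve long rows, whether or not a month has been reported yet.

```python
import pandas as pd

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]


def stack_months(raw: pd.DataFrame, prefix: str = "fuel_mmbtu_per_unit") -> pd.DataFrame:
    """Hypothetical helper: turn one wide row per year into one long row per month."""
    id_cols = [c for c in raw.columns if not c.startswith(prefix)]
    long = raw.melt(id_vars=id_cols, var_name="month", value_name=prefix)
    # Strip the shared prefix so only the month name is left, then map it to 1..12.
    long["month"] = long["month"].str.removeprefix(f"{prefix}_")
    month_num = long["month"].map({m: i + 1 for i, m in enumerate(MONTHS)})
    long["report_date"] = pd.to_datetime(
        pd.DataFrame({"year": long["report_year"], "month": month_num, "day": 1})
    )
    return long.drop(columns=["month", "report_year"])


# Made-up Q2 release: January through June are filled in, July onward is still blank.
nan = float("nan")
raw = pd.DataFrame([{
    "plant_id_eia": 1,
    "boiler_id": "A",
    "report_year": 2024,
    **{f"fuel_mmbtu_per_unit_{m}": (1.0 if i < 6 else nan) for i, m in enumerate(MONTHS)},
}])
stacked = stack_months(raw)
print(len(stacked), int(stacked["fuel_mmbtu_per_unit"].isna().sum()))
```

A new quarterly release fills in values for rows that already exist, so the row count only moves when EIA revises which plant / boiler / fuel combinations are reported.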

SEPARATE BUT RELEVANT: It's probably a good idea for us to drop the rows from the months that don't exist yet and contain no new data!

While investigating what might be the source of the row changes, I found some sneaky primary keys that we weren't previously accounting for and that seem to be causing some of the weirdness.

Our current primary keys are:

pks = [
    "plant_id_eia",
    "boiler_id",
    "energy_source_code",
    "prime_mover_code",
    "report_date",
]

All of the rows that change between the old version of this table and the new one have one of two things in common. The raw tables have either:

A secret associated_combined_heat_power primary key
A secret operator_name primary key

Example (1):

	plant_id_eia	boiler_id	energy_source_code	prime_mover_code	report_year	associated_combined_heat_power	fuel_mmbtu_per_unit_january	fuel_mmbtu_per_unit_february	fuel_mmbtu_per_unit_march	fuel_mmbtu_per_unit_april
139643	1167	9	DFO	ST	2024	N	.	.	0	5.817
139644	1167	9	DFO	ST	2024	Y	5.817	5.817	.	.

The latter is more perplexing because the rows with different operator_name values have the same operator_id.

Example (2):

	plant_id_eia	boiler_id	energy_source_code	prime_mover_code	report_year	operator_name	operator_id	fuel_mmbtu_per_unit_january	fuel_mmbtu_per_unit_february	fuel_mmbtu_per_unit_march	fuel_mmbtu_per_unit_april
141646	10398	A	NG	ST	2024	ArcelorMittal Cleveland Inc	9454	1.058	1.058	.	.
141647	10398	A	NG	ST	2024	Cleveland Cliffs	9454	.	.	1.058	1.061

As you can see, in both of these examples it's clearly the same plant getting reported differently from month to month, either with a different operator name or with a different combined heat and power association. My guess is that there is something similar going on with the other 923 tables like generation_fuel.
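A rough sketch of how collisions like these can be surfaced, using only the two rows from Example (2). This is an illustration and not the code I actually ran; it uses report_year in place of report_date because the raw table has not been stacked into months yet.

```python
import pandas as pd

# Raw-table version of the primary key (report_year stands in for report_date).
pks = ["plant_id_eia", "boiler_id", "energy_source_code", "prime_mover_code", "report_year"]

raw = pd.DataFrame({
    "plant_id_eia": [10398, 10398],
    "boiler_id": ["A", "A"],
    "energy_source_code": ["NG", "NG"],
    "prime_mover_code": ["ST", "ST"],
    "report_year": [2024, 2024],
    "operator_name": ["ArcelorMittal Cleveland Inc", "Cleveland Cliffs"],
    "operator_id": [9454, 9454],
})

# keep=False flags every row in a colliding group, not just the repeats.
dupes = raw[raw.duplicated(subset=pks, keep=False)]

# Non-key columns that take more than one value within a colliding group
# are the "secret" keys that tell the rows apart.
varying = dupes.groupby(pks).nunique().gt(1).any()
secret_keys = varying[varying].index.tolist()
print(secret_keys)
```

Here operator_name varies within the group while operator_id does not, which is the discrepancy described above.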

I emailed EIA about these discrepancies because I'm not sure:

A) if it's possible for a plant to be associated with a combined heat and power system one month and not another.

It looks like the unit associated with the combined heat and power system was retired in March thus switching the boolean from Y to N.

B) Whether the operator_name discrepancy is a meaningful distinction that should be reflected in the data.

It seems like operator_name is a mostly meaningless string and we should rely on operator_id (we might have to address the link between operator_name and operator_id in an association table, however).

Depending on how people feel about things, my suggestion would be to combine the rows with different operator_name values and add associated_combined_heat_power to the list of primary keys (unless it's deemed impossible/a reporting error).

UPDATE: We decided to keep both of them out of the primary keys because they affect so few rows.

cmgosnell · 2024-08-08T15:19:16Z

I dug into why the out_eia923__monthly_boiler_fuel table row count changed. This table contains rows for all months in a given report year, even if they have not occurred and contain blank data (because of how the raw data has columns for each month that get filled in with each monthly release). This means that row changes within a calendar year should be minimal and pertain to little data tweaks vs. the addition of new months of data.

Okay this makes sense. we stack the monthly columns from the raw data into one report_date column so it makes sense to me that we'd have empty rows for ytd releases.

SEPARATE BUT RELEVANT: It's probably a good idea for us to drop the rows from the months that don't exist yet and contain no new data!

I agree we could use the working_partitions to restrict the report_date. This doesn't feel like a top priority, but it seems simple/quick enough

cmgosnell · 2024-08-08T20:40:18Z

on the sneaky pks, austen and i just chatted about this and we came to the conclusion that the number of total records was going down with this new quarter of data because of the combination of the all-of-the-months-are-there-even-in-the-ytd-tables thing and because we've been dropping records like the ones that austen has displayed above via pudl.transform.eia923.remove_duplicate_pks_boiler_fuel_eia923, which removes the records (after the months have been stacked so they are monthly records) with duplicate pks and completely null or 0's in the data.

So in example 1 March got dropped. and in example 2 all of the months past april got dropped.
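a minimal sketch of that dropping logic, applied to the four months shown in example 1. this is not the real pudl.transform.eia923.remove_duplicate_pks_boiler_fuel_eia923: `drop_empty_duplicate_pk_rows` is a hypothetical stand-in, and it assumes fuel_mmbtu_per_unit is the only value column, which the real table is not limited to.

```python
import pandas as pd


def drop_empty_duplicate_pk_rows(df, pks, value_cols):
    """Hypothetical stand-in: among rows that share a primary key,
    drop the ones whose value columns are all null or 0."""
    is_dupe = df.duplicated(subset=pks, keep=False)
    is_empty = df[value_cols].fillna(0).eq(0).all(axis=1)
    return df[~(is_dupe & is_empty)]


pks = ["plant_id_eia", "boiler_id", "energy_source_code", "prime_mover_code", "report_date"]
nan = float("nan")
months = pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"])

# Example 1 after stacking: one row per month for each of the two raw records.
stacked = pd.DataFrame({
    "plant_id_eia": 1167,
    "boiler_id": "9",
    "energy_source_code": "DFO",
    "prime_mover_code": "ST",
    "report_date": list(months) * 2,
    "associated_combined_heat_power": ["N"] * 4 + ["Y"] * 4,
    "fuel_mmbtu_per_unit": [nan, nan, 0, 5.817, 5.817, 5.817, nan, nan],
})

kept = drop_empty_duplicate_pk_rows(stacked, pks, ["fuel_mmbtu_per_unit"])
kept_months = sorted(kept["report_date"].dt.month.tolist())
print(kept_months)
```

both March rows are empty (one 0, one null), so the whole month disappears, while January, February and April each keep the one row that has data.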

tl;dr we think this is okay and actually expected overall. but some documentation would be helpful (maybe a note in the data updates doc page?).

aesharpe · 2024-08-09T19:02:24Z

I agree we could use the working_partitions to restrict the report_date. This doesn't feel like a top priority, but it seems simple/quick enough

I would do this, but working_partitions for 923 are single years rather than year-month combinations. I'm not sure how I would restrict the data with just the year. Seems like it would be ideal to have the month in there, but that's a bigger undertaking.

One option is to drop all rows with NA values in the value columns (because there will be no data in months that haven't been integrated yet) instead of just rows with duplicate PKs. But we might lose rows of past months that happen to have all NA values (which I think we should keep!). Though, we already lose some of these when we drop NA or 0 values for duplicate rows.
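To make the tradeoff concrete, here is a small sketch with made-up data. The value column names are only illustrative, and `last_reported` is a hypothetical month-level cutoff that working_partitions does not currently provide.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1],
    "report_date": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-07-01", "2024-08-01"]),
    "fuel_consumed_units": [10.0, nan, nan, nan],
    "fuel_mmbtu_per_unit": [1.0, nan, nan, nan],
})
value_cols = ["fuel_consumed_units", "fuel_mmbtu_per_unit"]

# Option A: drop every row whose value columns are all NA.
# This removes the not-yet-reported months, but also the all-NA March row.
option_a = df.dropna(subset=value_cols, how="all")

# Option B: restrict by a month-level cutoff (hypothetical; we only have the year today).
# This keeps the all-NA March row and still removes July and August.
last_reported = pd.Timestamp("2024-06-01")
option_b = df[df["report_date"] <= last_reported]

print(option_a["report_date"].dt.month.tolist())
print(option_b["report_date"].dt.month.tolist())
```

Option A is doable now but loses past months that happen to be empty; Option B keeps them but needs the month to be tracked somewhere.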

Christina is out this week and we need to get this PR merged in.

Add DOI for new 923

b4d92b7

aesharpe self-assigned this Aug 6, 2024

aesharpe added the eia923 and data-update labels Aug 6, 2024

aesharpe added 4 commits August 6, 2024 14:28

Update release notes

fd6c53f

Update 923 package data

d900ac5

Map unmapped plants and utilities

a0d739c

update minmax row validation test

c8e4fde

aesharpe marked this pull request as ready for review August 7, 2024 01:16

aesharpe requested a review from cmgosnell August 7, 2024 01:16

cmgosnell previously requested changes Aug 7, 2024

Add description of duplicate rows to docs and docstrings

522a94b

e-belfer self-requested a review August 12, 2024 16:54

Merge branch 'main' into eia923-q2-24

5cbae58

zaneselvans requested review from zaneselvans and removed request for e-belfer August 12, 2024 18:48

zaneselvans approved these changes Aug 12, 2024

aesharpe added this pull request to the merge queue Aug 12, 2024

Merged via the queue into main with commit 61ec1a7 Aug 12, 2024
14 checks passed

aesharpe deleted the eia923-q2-24 branch August 12, 2024 20:22
