Integrate Eia923 Q2 2024 Data #3768

Merged
merged 7 commits into from
Aug 12, 2024

Conversation

Member

@aesharpe aesharpe commented Aug 6, 2024

Overview

Merge before #3767

Closes #3760

What problem does this address?
Add Q2 2024 923 data

What did you change?

  • Update 923 DOI
  • Update 923 package_data
  • Update release_notes
  • Update minmax row validation tests
  • Map unmapped plant / utility IDs
  • Update Data Source documentation page to describe duplicate primary key quirks
  • No need to update working partitions or ETL settings because they already specify 2024.

Testing

How did you make sure this worked? How can a reviewer verify this?

  • Materialize all EIA860 and 923 raw and assn assets
  • Materialize all EIA860 and 923 core and out assets (including mcoe)
  • Run make pytest-minmax-rows

To-do list

@aesharpe aesharpe self-assigned this Aug 6, 2024
@aesharpe aesharpe added eia923 Anything having to do with EIA Form 923 data-update When fresh data is integrated into PUDL from quarterly or annual updates labels Aug 6, 2024
@aesharpe aesharpe marked this pull request as ready for review August 7, 2024 01:16
@aesharpe aesharpe requested a review from cmgosnell August 7, 2024 01:16
Member

@cmgosnell cmgosnell left a comment

i'm suspicious of the gf and bf row count updates. otherwise this is obviously pretty chill and straightforward

Comment on lines -49 to +51
- ("bf_eia923", 1_642_829, 1_642_829, 135_980),
+ ("bf_eia923", 1_642_806, 1_642_806, 135_980),
  ("bga_eia860", 153_487, 153_487, 153_487),
- ("boil_eia860", 89_051, 89_051, 89_051),
+ ("boil_eia860", 89_050, 89_050, 89_050),
Member

do you have a sense of why these went down? the bf in particular here seems wrong because there should be a few more months of data in there for the raw and monthly tables

Member Author

The new months of 923 data show up as values filled into existing columns, not as new rows. Once the January data for a year is in, there are rows (transformed from the raw table columns) for every month of that year, regardless of whether there is data in them. Row fluctuations are more likely due to retroactive data changes, which can push the count up or down.

I dug a little deeper into the bf table and uncovered some unexpected cruft that explains some of the row count changes. See my comment below!

Comment on lines -54 to +55
- ("gens_eia860", 590_881, 590_881, 590_881),
- ("gf_eia923", 3_064_042, 3_064_042, 260_842),
+ ("gens_eia860", 591_256, 591_256, 591_256),
+ ("gf_eia923", 3_064_045, 3_064_045, 260_842),
Member

i would also assume that these gf rows would go up by a few months of data

@aesharpe
Member Author

aesharpe commented Aug 8, 2024

I dug into why the out_eia923__monthly_boiler_fuel table row count changed. This table contains rows for all months in a given report year, even if they have not occurred and contain blank data (because of how the raw data has columns for each month that get filled in with each monthly release). This means that row changes within a calendar year should be minimal and pertain to little data tweaks vs. the addition of new months of data.
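To make the row-count behavior concrete, here is a minimal sketch of stacking one-column-per-month raw data into one row per month. It uses a toy frame with made-up values and only three months; the column prefix follows the examples in this thread, and nothing here is the actual PUDL transform.

```python
import pandas as pd

NA = float("nan")
PREFIX = "fuel_mmbtu_per_unit_"
# Only three months, to keep the toy small.
MONTH_NUMBER = {"january": 1, "february": 2, "march": 3}

# Toy wide-format frame shaped like the raw data (values are made up).
raw = pd.DataFrame(
    {
        "plant_id_eia": [1, 2],
        "report_year": [2024, 2024],
        PREFIX + "january": [5.8, 1.0],
        PREFIX + "february": [5.8, NA],
        PREFIX + "march": [NA, NA],  # month not yet reported
    }
)

# Stack the monthly columns: one row per plant per month column.
stacked = raw.melt(
    id_vars=["plant_id_eia", "report_year"],
    value_vars=[PREFIX + m for m in MONTH_NUMBER],
    var_name="month",
    value_name="fuel_mmbtu_per_unit",
)
stacked["month"] = stacked["month"].str[len(PREFIX):].map(MONTH_NUMBER)
stacked["report_date"] = pd.to_datetime(
    pd.DataFrame(
        {"year": stacked["report_year"], "month": stacked["month"], "day": 1}
    )
)
stacked = stacked.drop(columns=["report_year", "month"])

n_rows = len(stacked)  # 2 plants x 3 month columns
n_empty = int(stacked["fuel_mmbtu_per_unit"].isna().sum())
```

The row count depends only on how many month columns exist, not on how many of them hold data, so filling in a new month changes values but adds no rows.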

SEPARATE BUT RELEVANT: It's probably a good idea for us to drop the rows from the months that don't exist yet and contain no new data!

While investigating the source of the row changes, I found some sneaky primary keys that we weren't previously accounting for and that seem to be causing some of the weirdness.

Our current primary keys are:

pks = [
    "plant_id_eia",
    "boiler_id",
    "energy_source_code",
    "prime_mover_code",
    "report_date",
]

All of the rows that change between the old version of this table and the new one have one of two things in common. The raw tables have either:

  1. A secret associated_combined_heat_power primary key
  2. A secret operator_name primary key

Example (1):

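| (row index) | plant_id_eia | boiler_id | energy_source_code | prime_mover_code | report_year | associated_combined_heat_power | fuel_mmbtu_per_unit_january | fuel_mmbtu_per_unit_february | fuel_mmbtu_per_unit_march | fuel_mmbtu_per_unit_april |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 139643 | 1167 | 9 | DFO | ST | 2024 | N | . | . | 0 | 5.817 |
| 139644 | 1167 | 9 | DFO | ST | 2024 | Y | 5.817 | 5.817 | . | . |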

The second case is more perplexing because the rows with different operator_name values share the same operator_id.

Example (2):

| (row index) | plant_id_eia | boiler_id | energy_source_code | prime_mover_code | report_year | operator_name | operator_id | fuel_mmbtu_per_unit_january | fuel_mmbtu_per_unit_february | fuel_mmbtu_per_unit_march | fuel_mmbtu_per_unit_april |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 141646 | 10398 | A | NG | ST | 2024 | ArcelorMittal Cleveland Inc | 9454 | 1.058 | 1.058 | . | . |
| 141647 | 10398 | A | NG | ST | 2024 | Cleveland Cliffs | 9454 | . | . | 1.058 | 1.061 |

As you can see, in both examples it's clearly the same plant being reported differently from month to month, either with a different operator name or with a different combined heat and power association. My guess is that something similar is going on with the other 923 tables like generation_fuel.
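One way to surface this kind of hidden key is to look at rows that share the known key and ask which other columns differ between them. The sketch below does that for the two examples, using report_year in place of report_date because the raw rows are annual. The helper name varying_columns is hypothetical, and only the columns shown in the examples are included.

```python
import pandas as pd

# Raw-level key, assumed for illustration.
KEYS = [
    "plant_id_eia",
    "boiler_id",
    "energy_source_code",
    "prime_mover_code",
    "report_year",
]


def varying_columns(df, keys, candidates):
    """Return the candidate columns that differ within duplicate-key groups."""
    dupes = df[df.duplicated(subset=keys, keep=False)]
    n_unique = dupes.groupby(keys)[candidates].nunique(dropna=False)
    return [col for col in candidates if (n_unique[col] > 1).any()]


# Example (1), trimmed to the key plus the suspect column.
example_1 = pd.DataFrame(
    {
        "plant_id_eia": [1167, 1167],
        "boiler_id": ["9", "9"],
        "energy_source_code": ["DFO", "DFO"],
        "prime_mover_code": ["ST", "ST"],
        "report_year": [2024, 2024],
        "associated_combined_heat_power": ["N", "Y"],
    }
)

# Example (2), trimmed the same way.
example_2 = pd.DataFrame(
    {
        "plant_id_eia": [10398, 10398],
        "boiler_id": ["A", "A"],
        "energy_source_code": ["NG", "NG"],
        "prime_mover_code": ["ST", "ST"],
        "report_year": [2024, 2024],
        "operator_name": ["ArcelorMittal Cleveland Inc", "Cleveland Cliffs"],
        "operator_id": [9454, 9454],
    }
)

hidden_1 = varying_columns(example_1, KEYS, ["associated_combined_heat_power"])
hidden_2 = varying_columns(example_2, KEYS, ["operator_name", "operator_id"])
```

For the second example only operator_name is flagged, since operator_id is identical across the two rows.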

I emailed EIA about these discrepancies because I'm not sure:

A) Whether it's possible for a plant to be associated with a combined heat and power system one month and not another.

It looks like the unit associated with the combined heat and power system was retired in March thus switching the boolean from Y to N.

B) Whether the operator_name discrepancy is a meaningful distinction that should be reflected in the data.

It seems like operator_name is a mostly meaningless string and we should rely on operator_id (we might have to address the link between operator_name and operator_id in an association table, however).

Depending on how people feel about things, my suggestion would be to combine the rows with different operator_name values and add associated_combined_heat_power to the list of primary keys (unless it's deemed impossible or a reporting error).

UPDATE: We decided to keep both of them out of the primary keys because they affect so few rows.

@cmgosnell
Member

cmgosnell commented Aug 8, 2024

> I dug into why the out_eia923__monthly_boiler_fuel table row count changed. This table contains rows for all months in a given report year, even if they have not occurred and contain blank data (because of how the raw data has columns for each month that get filled in with each monthly release). This means that row changes within a calendar year should be minimal and pertain to little data tweaks vs. the addition of new months of data.

Okay, this makes sense. We stack the monthly columns from the raw data into one report_date column, so it makes sense to me that we'd have empty rows for YTD releases.

> SEPARATE BUT RELEVANT: It's probably a good idea for us to drop the rows from the months that don't exist yet and contain no new data!

I agree we could use the working_partitions to restrict the report_date. This doesn't feel like a top priority, but it seems simple/quick enough

@cmgosnell
Member

on the sneaky pks: austen and i just chatted about this and concluded that the total number of records went down with this new quarter of data for two reasons. first, all of the months are already there, even in the ytd tables. second, we've been dropping records like the ones austen displayed above via pudl.transform.eia923.remove_duplicate_pks_boiler_fuel_eia923, which (after the months have been stacked into monthly records) removes records that have duplicate pks and data that is completely null or 0.

So in example 1, March got dropped, and in example 2, all of the months past April got dropped.
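As a rough illustration of why March disappears in example 1, here is a minimal sketch of that kind of filter applied to the stacked example 1 rows. The helper drop_empty_duplicate_pks and the shortened key are assumptions made for this toy; this is not the code in pudl.transform.eia923.

```python
import pandas as pd

NA = float("nan")
# Shortened key for the toy; the real key has more columns.
PKS = ["plant_id_eia", "boiler_id", "report_date"]


def drop_empty_duplicate_pks(df, pks, data_cols):
    """Drop rows that share a key with another row and hold only null/0 data."""
    is_dupe = df.duplicated(subset=pks, keep=False)
    is_empty = df[data_cols].fillna(0).eq(0).all(axis=1)
    return df[~(is_dupe & is_empty)]


months = pd.date_range("2024-01-01", periods=4, freq="MS")

# Example (1) after stacking: four monthly rows for each of the two raw rows.
stacked = pd.DataFrame(
    {
        "plant_id_eia": [1167] * 8,
        "boiler_id": ["9"] * 8,
        "report_date": list(months) * 2,
        "associated_combined_heat_power": ["N"] * 4 + ["Y"] * 4,
        "fuel_mmbtu_per_unit": [NA, NA, 0.0, 5.817, 5.817, 5.817, NA, NA],
    }
)

cleaned = drop_empty_duplicate_pks(stacked, PKS, ["fuel_mmbtu_per_unit"])
remaining_months = sorted(cleaned["report_date"].dt.month)
```

In March one row holds 0 and the other is null, so both count as empty and both are removed, leaving no March record at all.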

tl;dr we think this is okay and actually expected overall. but some documentation would be helpful (maybe a note in the data updates doc page?).

@aesharpe
Member Author

aesharpe commented Aug 9, 2024

> I agree we could use the working_partitions to restrict the report_date. This doesn't feel like a top priority, but it seems simple/quick enough

I would do this, but working_partitions for 923 are single years rather than year-month combinations. I'm not sure how I would restrict the data with just the year. Seems like it would be ideal to have the month in there, but that's a bigger undertaking.

One option is to drop all rows with NA values in the value columns (because there will be no data in months that haven't been integrated yet) instead of just rows with duplicate PKs. But we might lose rows of past months that happen to have all NA values (which I think we should keep!). Though, we already lose some of these when we drop NA or 0 values for duplicate rows.
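The trade-off between the two approaches can be sketched on a toy table. The latest_month cutoff below is hypothetical (nothing in working_partitions currently provides it), and the values are made up.

```python
import pandas as pd

NA = float("nan")
dates = pd.date_range("2024-01-01", periods=12, freq="MS")

# One plant, twelve stacked months. February is a past month with no data;
# July through December have not been released yet.
values = [1.0, NA, 1.2, 1.1, 1.3, 1.2] + [NA] * 6
monthly = pd.DataFrame({"report_date": dates, "fuel_mmbtu_per_unit": values})

# Option A: restrict to months covered by the release (hypothetical cutoff).
latest_month = pd.Timestamp("2024-06-01")
by_cutoff = monthly[monthly["report_date"] <= latest_month]

# Option B: drop every row whose value columns are all NA.
by_dropna = monthly.dropna(subset=["fuel_mmbtu_per_unit"], how="all")

february = pd.Timestamp("2024-02-01")
kept_by_cutoff = february in set(by_cutoff["report_date"])
kept_by_dropna = february in set(by_dropna["report_date"])
```

The cutoff keeps the empty February row, while dropping all-NA rows removes it along with the future months, which is the loss described above.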

@e-belfer e-belfer self-requested a review August 12, 2024 16:54
@zaneselvans zaneselvans requested review from zaneselvans and removed request for e-belfer August 12, 2024 18:48
@zaneselvans zaneselvans dismissed cmgosnell’s stale review August 12, 2024 19:05

Christina is out this week and we need to get this PR merged in.

@aesharpe aesharpe added this pull request to the merge queue Aug 12, 2024
Merged via the queue into main with commit 61ec1a7 Aug 12, 2024
14 checks passed
@aesharpe aesharpe deleted the eia923-q2-24 branch August 12, 2024 20:22
Labels
data-update When fresh data is integrated into PUDL from quarterly or annual updates eia923 Anything having to do with EIA Form 923
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

EIA 923 Q2 2024 Update
3 participants