Integrate Eia923 Q2 2024 Data #3768
i'm suspicious of the gf and bf row count updates. otherwise this is obviously pretty chill and straightforward
("bf_eia923", 1_642_829, 1_642_829, 135_980),
("bf_eia923", 1_642_806, 1_642_806, 135_980),
("bga_eia860", 153_487, 153_487, 153_487),
("boil_eia860", 89_051, 89_051, 89_051),
("boil_eia860", 89_050, 89_050, 89_050),
do you have a sense of why these went down? the bf in particular here seems wrong because there should be a few more months of data in there for the raw and monthly tables
The new months of 923 data show up as data inputs into existing columns. Once we hit January there will be rows (transformed from the raw table columns) for every month, regardless of whether there is data in them. Row fluctuations are more likely due to retroactive data changes which can lead to positive and negative changes.
I dug a little deeper into the bf table and uncovered some unexpected cruft that explains some of the row count changes. See my comment below!
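To make the point about monthly columns concrete, here is a minimal, library-free sketch of the stacking step. The column names (`plant_id`, `fuel_consumed_<month>`) and the `stack_months` helper are made up for illustration and are not the actual transform code. It only shows that a year-to-date record yields twelve rows whether or not the later months hold data.

```python
from datetime import date

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def stack_months(record, year):
    """Turn one wide raw record into twelve long rows, one per month."""
    return [
        {
            "plant_id": record["plant_id"],
            "report_date": date(year, month_number, 1),
            # Months missing from the raw record come through as None.
            "fuel_consumed": record.get(f"fuel_consumed_{month}"),
        }
        for month_number, month in enumerate(MONTHS, start=1)
    ]


# Hypothetical year-to-date record that only has data through June (Q2).
raw_record = {"plant_id": 1}
for month in MONTHS[:6]:
    raw_record[f"fuel_consumed_{month}"] = 100.0

rows = stack_months(raw_record, 2024)
rows_with_data = [row for row in rows if row["fuel_consumed"] is not None]
print(len(rows), len(rows_with_data))
```

Because the row count is fixed at twelve per raw record, a new quarter fills in values on existing rows instead of adding rows.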
("gens_eia860", 590_881, 590_881, 590_881),
("gf_eia923", 3_064_042, 3_064_042, 260_842),
("gens_eia860", 591_256, 591_256, 591_256),
("gf_eia923", 3_064_045, 3_064_045, 260_842),
i would also assume that these gf rows would go up by a few months of data
I dug into why the
SEPARATE BUT RELEVANT: It's probably a good idea for us to drop the rows from the months that don't exist yet and contain no new data!
While investigating what might be the source of the row changes, I found some sneaky primary keys that we weren't previously accounting for and that seem to be causing some of the weirdness. Our current primary keys are:
All of the rows that change between the old version of this table and the new one have one of two things in common. The raw tables have either:
Example (1):
The latter is more perplexing because the values with different
Example (2):
As you can see, in both of these examples it's clearly the same plant that is getting reported differently each month, either with a different operator name or a different combined heat and power association. My guess is that there is something similar going on with the other 923 tables like
I emailed EIA about these discrepancies because I'm not sure:
A) Whether it's possible for a plant to be associated with a combined heat and power system one month and not another. It looks like the unit associated with the combined heat and power system was retired in March, thus switching the boolean from Y to N.
B) Whether the
It seems like
Depending on how people feel about things, my suggestion would be to combine the rows with different
UPDATE: We decided to keep both of them out of the primary keys because they affect so few rows.
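For anyone who wants to look for similar cases in other tables, a rough sketch of the check follows. It groups rows by the intended key and flags keys where a non-key column takes more than one value. The column names and the `find_conflicting_keys` helper are hypothetical, not taken from the real table schema.

```python
from collections import defaultdict


def find_conflicting_keys(rows, key_columns, check_column):
    """Return keys whose rows disagree on check_column."""
    values_by_key = defaultdict(set)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        values_by_key[key].add(row[check_column])
    # Keep only keys that map to more than one distinct value.
    return {
        key: values
        for key, values in values_by_key.items()
        if len(values) > 1
    }


# Hypothetical raw rows: plant 1 is reported under two operator names.
raw_rows = [
    {"plant_id": 1, "report_year": 2024, "operator_name": "Operator A"},
    {"plant_id": 1, "report_year": 2024, "operator_name": "Operator B"},
    {"plant_id": 2, "report_year": 2024, "operator_name": "Operator C"},
]

conflicts = find_conflicting_keys(
    raw_rows, ("plant_id", "report_year"), "operator_name"
)
print(conflicts)
```

Any key that shows up in `conflicts` will produce duplicate primary keys once the monthly columns are stacked, which is the situation in the two examples above.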
Okay this makes sense. we stack the monthly columns from the raw data into one report_date column so it makes sense to me that we'd have empty rows for ytd releases.
I agree we could use the working_partitions to restrict the report_date. This doesn't feel like a top priority, but it seems simple/quick enough
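A rough sketch of what restricting `report_date` could look like. How the cutoff would actually be derived from the working partitions is not shown here, so `last_reported_month` is an assumed input and `drop_future_months` is a made-up helper.

```python
from datetime import date


def drop_future_months(rows, last_reported_month):
    """Keep only rows whose report_date is within the released months."""
    return [row for row in rows if row["report_date"] <= last_reported_month]


# Hypothetical stacked rows for one plant: data only through June.
stacked_rows = [
    {
        "plant_id": 1,
        "report_date": date(2024, month_number, 1),
        "fuel_consumed": 100.0 if month_number <= 6 else None,
    }
    for month_number in range(1, 13)
]

# Assumed cutoff for a Q2 release.
kept_rows = drop_future_months(stacked_rows, date(2024, 6, 1))
print(len(stacked_rows), len(kept_rows))
```

Since this filters on the date and not on the values, rows for past months that happen to be entirely empty would still be kept.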
on the sneaky pks, austen and i just chatted about this and we came to the conclusion that the number of total records was going down with this new quarter of data because of the combination of the all-of-the-months-are-there-even-in-the-ytd-tables thing and because we've been dropping records like the ones that austen has displayed above via So in example 1 March got dropped. and in example 2 all of the months past april got dropped. tl;dr we think this is okay and actually expected overall. but some documentation would be helpful (maybe a note in the data updates doc page?).
I would do this, but but One option is to drop all rows with NA values in the value columns (because there will be no data in months that haven't been integrated yet) instead of just rows with duplicate PKs. But we might lose rows of past months that happen to have all NA values (which I think we should keep!). Though, we already lose some of these when we drop NA or 0 values for duplicate rows.
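To compare the two options side by side, here is a small sketch under simplified assumptions: one value column, hypothetical column names, and "empty" meaning None only (the existing logic also treats 0 as droppable for duplicate rows, which is left out here). Neither function is the real implementation.

```python
from collections import Counter
from datetime import date


def is_empty(row, value_columns):
    return all(row[column] is None for column in value_columns)


def drop_all_empty(rows, value_columns):
    """Option 1: drop every row whose value columns are all empty."""
    return [row for row in rows if not is_empty(row, value_columns)]


def drop_empty_duplicates(rows, key_columns, value_columns):
    """Option 2: drop empty rows only when their key is duplicated."""
    def key_of(row):
        return tuple(row[column] for column in key_columns)

    key_counts = Counter(key_of(row) for row in rows)
    return [
        row
        for row in rows
        if key_counts[key_of(row)] == 1 or not is_empty(row, value_columns)
    ]


KEYS = ("plant_id", "report_date")
VALUES = ("fuel_consumed",)
example_rows = [
    # Same plant and month reported twice, one copy empty.
    {"plant_id": 1, "report_date": date(2024, 1, 1), "fuel_consumed": 5.0},
    {"plant_id": 1, "report_date": date(2024, 1, 1), "fuel_consumed": None},
    # A past month that is legitimately empty.
    {"plant_id": 2, "report_date": date(2024, 1, 1), "fuel_consumed": None},
    # A month that has not been integrated yet.
    {"plant_id": 1, "report_date": date(2024, 12, 1), "fuel_consumed": None},
]

print(len(drop_all_empty(example_rows, VALUES)))
print(len(drop_empty_duplicates(example_rows, KEYS, VALUES)))
```

In this toy data, option 1 removes the not-yet-integrated December row but also the legitimately empty January row for plant 2, while option 2 keeps both.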
Christina is out this week and we need to get this PR merged in.
Overview
Merge before #3767
Closes #3760
What problem does this address?
Add Q2 2024 923 data
What did you change?
package_data
release_notes
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list