zaneselvans opened this issue on Oct 26, 2021 · 2 comments
Labels
bug: Things that are just plain broken.
glue: PUDL specific structures & metadata. Stuff that connects datasets together.
testing: Writing tests, creating test data, automating testing, etc.
Near the end of pudl.glue.ferc1_eia.glue() we have a defensive assertion that checks for NA values in the dataframes containing unmapped EIA and FERC Plants and Utilities. However, with the 2020 ID mapping spreadsheet this assertion fails during the glue step of the ETL:
# At this point there should be at most one row in each of these data
# frames with NaN values after we drop_duplicates in each. This is because
# there will be some plants and utilities that only exist in FERC, or only
# exist in EIA, and while they will have PUDL IDs, they may not have
# FERC/EIA info (and it'll get pulled in as NaN)
for df, df_n in zip(
    [plants_eia, plants_ferc1, utilities_eia, utilities_ferc1],
    ['plants_eia', 'plants_ferc1', 'utilities_eia', 'utilities_ferc1']
):
    if df[pd.isnull(df).any(axis=1)].shape[0] > 1:
        raise AssertionError(
            f"FERC to EIA glue breaking in {df_n}. There are too many null "
            "fields. Check the mapping spreadhseet.")
# AssertionError: FERC to EIA glue breaking in plants_eia. There are too many null fields. Check the mapping spreadhseet.
I suspect that this is because there are a few pages of EIA plants that have plant_id_eia values but no plant_name_eia. The same problem comes up for utilities. For now this has been changed to a logger warning, and the ETL passes, but we should get an appropriate error check in here.
I also suspect that the lack of plant & utility names is probably due to the way we're dropping lots of harvestable fields before sending the data into the harvesting step (See #509).
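One way to test the suspicion that rows have IDs but no names is to count nulls per column and pull out the rows where the ID is present and the name is missing. This is a minimal sketch, not code from PUDL: the dataframe below is a made-up stand-in for `plants_eia`, and `null_summary` is a hypothetical helper.

```python
import pandas as pd

# Made-up stand-in for the plants_eia dataframe built inside glue().
plants_eia = pd.DataFrame({
    "plant_id_eia": [1, 2, 3, 4],
    "plant_name_eia": ["Plant A", None, None, "Plant D"],
    "plant_id_pudl": [10, 11, 12, 13],
})


def null_summary(df: pd.DataFrame) -> pd.Series:
    """Count null values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Rows that have an EIA plant ID but no plant name.
id_without_name = plants_eia[
    plants_eia["plant_id_eia"].notna() & plants_eia["plant_name_eia"].isna()
]

print(null_summary(plants_eia))
print(f"{len(id_without_name)} rows have an ID but no name")
```

If most of the null-containing rows show up in `id_without_name`, that would support the missing-names explanation over a human error in the mapping spreadsheet.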
Possible solutions
* Fill in dummy names like we do for information-poor FERC utilities
* Choose some expected number of Null values that's specific to each of the plant/utility tables and check against that.
* Feed all of the plant/utility names (and other data) into the harvesting step and see if we can get rid of all of these disconcerting null values.
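The first two options could be sketched roughly as follows. Everything here is an assumption for illustration: `fill_dummy_names`, `check_null_rows`, the `"unknown plant"` prefix, and the per-table limits are hypothetical, and the dummy-name format is not necessarily the one PUDL uses for FERC utilities.

```python
import pandas as pd


def fill_dummy_names(df, id_col, name_col, prefix):
    """Return a copy with missing names replaced by '<prefix> <id>'."""
    out = df.copy()
    missing = out[name_col].isna()
    out.loc[missing, name_col] = prefix + " " + out.loc[missing, id_col].astype(str)
    return out


def check_null_rows(dfs, limits):
    """Raise AssertionError if a table has more null-containing rows than allowed."""
    for name, df in dfs.items():
        n_null = int(df.isna().any(axis=1).sum())
        if n_null > limits[name]:
            raise AssertionError(
                f"FERC to EIA glue breaking in {name}: {n_null} rows contain "
                f"nulls, expected at most {limits[name]}."
            )


# Made-up stand-in data: two plants have an ID but no name.
plants_eia = pd.DataFrame({
    "plant_id_eia": [1, 2, 3],
    "plant_name_eia": ["Plant A", None, None],
})

# Hypothetical limit; a real value would come from inspecting each table.
limits = {"plants_eia": 1}

# Option 1: fill in dummy names so the existing-style check passes.
filled = fill_dummy_names(plants_eia, "plant_id_eia", "plant_name_eia", "unknown plant")
check_null_rows({"plants_eia": filled}, limits)

# Option 2: keep the nulls, but compare against a table-specific limit.
try:
    check_null_rows({"plants_eia": plants_eia}, limits)
except AssertionError as err:
    print(err)
```

A table-specific limit keeps the check sensitive to new mapping mistakes, but the limits would need to be revisited whenever a new year of data is added.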
zaneselvans added the bug, glue, and testing labels on Oct 26, 2021
The full ETL with all FERC1 and EIA 860/923 data will run without
obvious errors. There are still tests and validations that fail, but at
least you can load the DB.
This does *not* include eia860m or EPA CEMS data yet. FERC-714 and
EIA-861 also remain to be updated for 2020.
Issues that remain:
* Something screwy is going on with FERC respondent 542 -- it shows up
only in the `f1_respondent_id` table, and has all Null data there...
and our unmapped utility finder script failed to identify it. See
#1304
* A defensive assertion aimed at identifying human errors in the ID
mapping sheet is failing because (probably?) we have a fair number of
plants and utilities with IDs but no names in there now. See #1305 and
also #1232