Update defensive assertion for EIA Plant/Util ID mapping #1305

Open
zaneselvans opened this issue Oct 26, 2021 · 2 comments

Labels
bug Things that are just plain broken. glue PUDL specific structures & metadata. Stuff that connects datasets together. testing Writing tests, creating test data, automating testing, etc.

Comments

@zaneselvans
Member

zaneselvans commented Oct 26, 2021

Near the end of pudl.glue.ferc1_eia.glue() we have a defensive assertion that checks for NA values in the dataframes containing unmapped EIA and FERC Plants and Utilities. However, with the 2020 ID mapping spreadsheet this assertion fails during the glue step of the ETL:

        # At this point there should be at most one row in each of these data
        # frames with NaN values after we drop_duplicates in each. This is because
        # there will be some plants and utilities that only exist in FERC, or only
        # exist in EIA, and while they will have PUDL IDs, they may not have
        # FERC/EIA info (and it'll get pulled in as NaN)

        for df, df_n in zip(
            [plants_eia, plants_ferc1, utilities_eia, utilities_ferc1],
            ['plants_eia', 'plants_ferc1', 'utilities_eia', 'utilities_ferc1']
        ):
            if df[pd.isnull(df).any(axis=1)].shape[0] > 1:
                raise AssertionError(
                    f"FERC to EIA glue breaking in {df_n}. There are too many null "
                    "fields. Check the mapping spreadhseet.")

# AssertionError: FERC to EIA glue breaking in plants_eia. There are too many null fields. Check the mapping spreadhseet.

I suspect that this is because there are a few pages of EIA plants that have plant_id_eia values but no plant_name_eia. The same problem comes up for utilities. For now this has been changed to a logger warning, and the ETL passes, but we should get an appropriate error check in here.
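One way to confirm this suspicion would be to count, per table, the rows that have an ID but no name. The sketch below runs on a made-up dataframe: only the plant_id_eia and plant_name_eia column names come from this issue, and `count_unnamed` is a hypothetical helper, not something that exists in PUDL.

```python
import pandas as pd


def count_unnamed(df: pd.DataFrame, id_col: str, name_col: str) -> int:
    """Count rows that have an ID but are missing a name."""
    has_id = df[id_col].notna()
    no_name = df[name_col].isna()
    return int((has_id & no_name).sum())


# Toy data standing in for the real plants_eia dataframe (values are invented).
plants_eia = pd.DataFrame({
    "plant_id_eia": [1, 2, 3, 4],
    "plant_name_eia": ["Alpha", None, None, "Delta"],
})

n_unnamed = count_unnamed(plants_eia, "plant_id_eia", "plant_name_eia")
print(f"plants_eia: {n_unnamed} rows with an ID but no name")
```

Running the same count on the utilities tables (with their own ID and name columns) would show whether the missing names account for all of the null rows the assertion is tripping on.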

I also suspect that the lack of plant & utility names is probably due to the way we're dropping lots of harvestable fields before sending the data into the harvesting step (See #509).

Possible solutions

  • Fill in dummy names like we do for information-poor FERC utilities
  • Choose some expected number of Null values that's specific to each of the plant/utility tables and check against that.
  • Feed all of the plant/utility names (and other data) into the harvesting step and see if we can get rid of all of these disconcerting null values.
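
The second option could look roughly like the sketch below. The `check_null_rows` helper and the numbers in `EXPECTED_NULL_ROWS` are placeholders invented for illustration; real limits would have to come from inspecting the 2020 mapping spreadsheet.

```python
from typing import Mapping

import pandas as pd

# Hypothetical per-table limits on the number of rows containing any null.
# These values are placeholders, not measured from the real data.
EXPECTED_NULL_ROWS = {
    "plants_eia": 2,
    "plants_ferc1": 1,
    "utilities_eia": 2,
    "utilities_ferc1": 1,
}


def check_null_rows(
    dfs: Mapping[str, pd.DataFrame], limits: Mapping[str, int]
) -> None:
    """Raise if any table has more rows with nulls than its limit allows."""
    for name, df in dfs.items():
        n_null = int(df.isna().any(axis=1).sum())
        if n_null > limits[name]:
            raise AssertionError(
                f"FERC to EIA glue breaking in {name}: {n_null} rows with "
                f"null fields, expected at most {limits[name]}. "
                "Check the mapping spreadsheet."
            )


# Toy table with two rows that contain a null (values are invented).
toy_plants_eia = pd.DataFrame({
    "plant_id_eia": [1, 2, 3],
    "plant_name_eia": ["Alpha", None, None],
})

# Passes: two null rows is within the placeholder limit of 2.
check_null_rows({"plants_eia": toy_plants_eia}, EXPECTED_NULL_ROWS)
```

Reporting the observed count in the message makes it easier to tell a small drift in the data from a genuine mistake in the mapping sheet.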

zaneselvans added the bug, glue, and testing labels on Oct 26, 2021

@MichaelTiemannOSC

See #509?

@zaneselvans
Member Author

Ah yeah, you're right, #1232 is a closed duplicate. They just changed the "closed" color to purple instead of red so I didn't realize.

zaneselvans added a commit that referenced this issue Oct 26, 2021
The full ETL with all FERC1 and EIA 860/923 data will run without
obvious errors. There are still tests and validations that fail, but at
least you can load the DB.

This does *not* include eia860m or EPA CEMS data yet. FERC-714 and
EIA-861 also remain to be updated for 2020.

Issues that remain:
* Something screwy is going on with FERC respondent 542 -- it shows up
  only in the `f1_respondent_id` table, and has all Null data there...
  and our unmapped utility finder script failed to identify it.  See
  #1304
* A defensive assertion aimed at identifying human errors in the ID
  mapping sheet is failing because (probably?) we have a fair number of
  plants and utilities with IDs but no names in there now. See #1305 and
  also #1232