Retain all harvestable fields during EIA transforms #509

Open
50 tasks
Tracked by #639
zaneselvans opened this issue Jan 18, 2020 · 7 comments · May be fixed by #2333
Assignees
Labels
community data-cleaning Tasks related to cleaning & regularizing data during ETL. eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies inframundo ready requires-debug Things that have been worked on but hit an issue that requires debugging.
Milestone

Comments

@zaneselvans
Member

zaneselvans commented Jan 18, 2020

In many of our older EIA transformation functions, we preemptively drop columns from the tables that are being processed, in order to produce normalized tables. However, many of these columns contain information about the entities (plants, generators, utilities) that should be integrated into the entity harvesting and resolution process, which happens after the transform step.

Discarded Columns

  • Check whether the column name is defined in pudl.metadata.fields.
  • If the name is not defined but the column does correspond to an existing field under a different name, change the name in the appropriate column_map.csv under src/pudl/package_data/{data source}/ so that it matches the DB schema.
  • If the column does not correspond to any defined field, it may be appropriate to discard it. E.g. total_fuel_consumption_mmbtu is an annual total of monthly values that are retained, so we don't need it.
  • If the column corresponds to a defined field (either before or after the name has been fixed), retain it and debug any issues that keeping it causes later in the transform process. It should make it to the harvesting step and help inform plant/generator/boiler/utility attributes.
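
The triage steps above could be sketched roughly as follows. This is a minimal illustration, not PUDL's actual code: `DEFINED_FIELDS`, `RENAMES`, and `triage_column` are hypothetical stand-ins for the field definitions in pudl.metadata.fields and the mappings in column_map.csv, and the renames shown are only the guesses noted in the lists below.

```python
# Hypothetical stand-in for the field names defined in pudl.metadata.fields.
DEFINED_FIELDS = {"plant_name_eia", "utility_id_eia", "utility_name_eia", "nerc_region"}

# Hypothetical renames that would be made in the relevant column_map.csv.
# These mappings are guesses ("probably ...") and would need to be verified.
RENAMES = {"operator_id": "utility_id_eia", "operator_name": "utility_name_eia"}


def triage_column(name: str) -> tuple[str, str]:
    """Return (action, final_name) for a column that is currently dropped."""
    if name in DEFINED_FIELDS:
        # Already a defined field: keep it so it reaches the harvesting step.
        return ("retain", name)
    renamed = RENAMES.get(name)
    if renamed in DEFINED_FIELDS:
        # Matches a defined field under another name: fix the name, then keep it.
        return ("rename_and_retain", renamed)
    # No matching field: a human decides whether discarding is appropriate.
    return ("review_for_discard", name)


for col in ["plant_name_eia", "operator_id", "total_fuel_consumption_mmbtu"]:
    print(col, triage_column(col))
```

The last branch deliberately returns "review" rather than "discard", since whether a column is redundant (like an annual total of retained monthly values) is a judgment call.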

EIA-860

pudl.transform.eia860.ownership()

  • None

pudl.transform.eia860.generators()

  • None

pudl.transform.eia860.plants()

  • None

pudl.transform.eia860.utilities()

  • None

EIA-923

pudl.transform.eia923.plants()

  • None

pudl.transform.eia923.generation_fuel()

  • combined_heat_power
  • plant_name_eia
  • operator_name (probably utility_name_eia)
  • operator_id (probably utility_id_eia)
  • plant_state
  • census_region
  • nerc_region
  • naics_code
  • fuel_unit (should probably be dropped, since unit is implied by fuel type)
  • total_fuel_consumption_quantity (annual total?)
  • electric_fuel_consumption_quantity (annual total?)
  • total_fuel_consumption_mmbtu (annual total?)
  • elec_fuel_consumption_mmbtu (annual total?)
  • net_generation_megawatthours (annual total?)
  • early_release

pudl.transform.eia923.boiler_fuel()

This one may give you trouble. See #1847 and #1836.

  • combined_heat_power
  • plant_name_eia
  • operator_name (probably utility_name_eia)
  • operator_id (probably utility_id_eia)
  • plant_state
  • census_region
  • nerc_region
  • naics_code
  • fuel_unit (should probably be dropped, since unit is implied by fuel type)
  • total_fuel_consumption_quantity (annual total?)
  • balancing_authority_code_eia
  • early_release
  • reporting_frequency_code
  • data_maturity (we add this field during extraction, but it is getting dropped because of aggregations. See "enable non-data columns in aggregated boiler_fuel_eia923 table", #1847)

pudl.transform.eia923.generation()

  • combined_heat_power
  • plant_name_eia
  • operator_name (probably utility_name_eia)
  • operator_id (probably utility_id_eia)
  • plant_state
  • census_region
  • nerc_region
  • naics_code
  • early_release

pudl.transform.eia923.coalmine()

  • None -- we really do just want the very small set of columns retained here, as we're stripping them out to create a new table, normalizing the Fuel Receipts & Costs table.

pudl.transform.eia923.fuel_receipts_costs()

  • plant_name_eia
  • plant_state
  • operator_name (probably utility_name_eia)
  • operator_id (probably utility_id_eia)
  • mine_id_msha (should be dropped)
  • mine_type_code (should be dropped)
  • state (of the mine?)
  • county_id_fips (of the mine?)
  • state_id_fips (of the mine?)
  • mine_name (should be dropped)
  • regulated (mine or plant?)
  • early_release
@zaneselvans zaneselvans added eia923 Anything having to do with EIA Form 923 eia860 Anything having to do with EIA Form 860 data-cleaning Tasks related to cleaning & regularizing data during ETL. labels Jan 18, 2020
@cmgosnell cmgosnell added the harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies label Sep 30, 2020
@cmgosnell cmgosnell added the ready label Sep 2, 2021
@zaneselvans
Member Author

zaneselvans commented Sep 2, 2022

@cmgosnell and I are going to help get @knordback working on this issue as a way to become more familiar with the harvesting process, working with our code, Jupyter, etc.

knordback added a commit to knordback/pudl that referenced this issue Dec 31, 2022
… field may or may not actually want to change
@zaneselvans zaneselvans linked a pull request Jan 5, 2023 that will close this issue
8 tasks
@zaneselvans zaneselvans changed the title Do not remove harvested fields during transform Retain all harvestable fields during EIA transforms Feb 10, 2023
@zaneselvans zaneselvans moved this from 🆕 New to 🚧 In progress in Catalyst Megaproject Feb 14, 2023
@zaneselvans zaneselvans removed a link to a pull request Feb 14, 2023
8 tasks
@zaneselvans
Member Author

@cmgosnell while talking over some of these fields with @knordback yesterday, I noticed that the associated_combined_heat_power field is part of the generators_entity_eia table, but there's another combined_heat_power field being reported in e.g. the generation_fuel_eia923 table, and looking at the spreadsheets, it seems like that field pertains to the plant (which makes some sense given that generation_fuel_eia923 is reported on a date, plant, prime-mover, fuel basis).

Are these different attributes? Should there be a CHP field at both the generator and the plant level? Should this really be a permanent attribute, or is it another one that changes slowly? Does the generator field really just indicate that the generator is part of a plant that does CHP? Or that it's part of a generation unit that does CHP? Could the plant or plant-prime-fuel level CHP status be inferred from the generator-level CHP attributes?

Right now we're discarding the CHP column reported in generation_fuel_eia923.

@grgmiller or @gschivley do either of you have more context on the relationship between these two different CHP fields?

@cmgosnell
Member

I don't know exactly. associated_combined_heat_power originates in the generator table. I would not be surprised if there were plants that had some units contributing to CHP and some that just generated power. I don't think it's generally a good idea to base any logic about the workings of a plant on the reporting structure of the generation_fuel_eia923 table. I personally would check whether this value is actually consistent across all generators within a plant before thinking about moving it. But I could also definitely imagine this changing over time (albeit very rarely!).

@zaneselvans
Member Author

It seems like we should probably do an exhaustive check of all the currently "permanent" generator attributes on the pre-harvested dataframes... and see how permanent they actually are.
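
One way such a check might look, sketched in plain Python on made-up records rather than the real pre-harvest dataframes: group rows by an entity key and collect the distinct non-null values each attribute takes. `inconsistent_entities` is a hypothetical helper and the sample rows are invented for illustration; only the column names come from this thread.

```python
from collections import defaultdict


def inconsistent_entities(records, key_cols, attr):
    """Map each entity key to its set of values for attr, keeping only
    entities that report more than one distinct non-null value."""
    seen = defaultdict(set)
    for row in records:
        value = row.get(attr)
        if value is not None:
            seen[tuple(row[k] for k in key_cols)].add(value)
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


CHP = "associated_combined_heat_power"

# Invented sample rows, one per generator per report year.
rows = [
    {"plant_id_eia": 1, "generator_id": "A", "report_year": 2019, CHP: True},
    {"plant_id_eia": 1, "generator_id": "A", "report_year": 2020, CHP: True},
    {"plant_id_eia": 1, "generator_id": "B", "report_year": 2020, CHP: False},
    {"plant_id_eia": 2, "generator_id": "A", "report_year": 2019, CHP: False},
    {"plant_id_eia": 2, "generator_id": "A", "report_year": 2020, CHP: True},
]

# Does the attribute change over time for a given generator?
by_generator = inconsistent_entities(rows, ["plant_id_eia", "generator_id"], CHP)
# Do generators within one plant disagree with each other?
by_plant = inconsistent_entities(rows, ["plant_id_eia"], CHP)

print("changes over time:", sorted(by_generator))
print("differs within plant:", sorted(by_plant))
```

On the real dataframes the same idea would be a pandas `groupby` on the key columns followed by `nunique()` on the attribute, then a filter for counts greater than one.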

@grgmiller
Collaborator

I do not have any context on these two fields.

@knordback
Collaborator

I'll hold off on this one for now.

@zaneselvans zaneselvans added the good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. label Apr 7, 2023
@knordback
Collaborator

I think this is mostly done. Based on notes above I left in code dropping some of the fields in clean_generation_fuel_eia923() and clean_fuel_receipts_costs_eia923(), but I'm not certain I'm interpreting the notes correctly. There's also implicit dropping in plants_eia923(), and I don't know if that's as desired or not.

@zaneselvans zaneselvans linked a pull request May 24, 2023 that will close this issue
@zaneselvans zaneselvans moved this from In progress to In review in Catalyst Megaproject Jun 5, 2023
@zaneselvans zaneselvans added this to the 2023 Spring milestone Jun 5, 2023
@bendnorman bendnorman added the requires-debug Things that have been worked on but hit an issue that requires debugging. label Jul 24, 2023
@bendnorman bendnorman moved this from In review to Backlog in Catalyst Megaproject Jul 24, 2023
Projects
Status: Icebox

6 participants