Bug 509 #2333

Open

knordback wants to merge 40 commits into main from bug-509

Changes from 10 commits

Commits (40)
c48584e
Remove plant_name_eia in pudl.transform.eia923.fuel_receipts_costs()
knordback Feb 14, 2023
13a0dc1
Don't discard plant_name_eia from pudl.transform.eia923.generation_fu…
knordback Feb 17, 2023
b4a3c28
Don't drop operator_name/utility_name_eia in pudl.transform.eia923.ge…
knordback Feb 25, 2023
35b6c60
Move newly added code to what seems like a more sensible location
knordback Feb 27, 2023
d6021ac
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 27, 2023
adc4503
Extract the core of AbstractTableTransformer.enforce_schema() into Re…
knordback Feb 28, 2023
2bcd803
Merge branch 'bug-509' of https://github.com/catalyst-cooperative/pud…
knordback Feb 28, 2023
c44949b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 28, 2023
731baef
Use Resource.enforce_schema() to clean up non-matching fields in Data…
knordback Mar 1, 2023
eef6cbf
Merge branch 'bug-509' of https://github.com/catalyst-cooperative/pud…
knordback Mar 1, 2023
fbe7b21
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2023
e2768f9
Move DataFrame cleaning to the end of transform()
knordback Mar 1, 2023
46d7e66
Merge branch 'bug-509' of https://github.com/catalyst-cooperative/pud…
knordback Mar 1, 2023
508ba29
Call enforce_schema() on entities DataFrames as well
knordback Mar 2, 2023
126102e
Merge branch 'dev' into bug-509
knordback Mar 2, 2023
e489ed4
Merge branch 'dev' into bug-509
zaneselvans Mar 4, 2023
dc5068e
Don't drop combined_heat_power in boiler_fuel()
knordback Mar 17, 2023
fdada61
One more incremental change in boiler_fuel()
knordback Mar 20, 2023
40ef5f4
Several more fields not dropped
knordback Mar 20, 2023
dd9982a
Merge in substantial changes from dev.
zaneselvans Mar 21, 2023
bb60591
More un-dropping of fields
knordback Mar 21, 2023
fc099cc
Merge branch 'dev' into bug-509
knordback Apr 12, 2023
c91edf1
Merge branch 'dev' into bug-509
knordback May 15, 2023
2efd152
Remove code made obsolete in merge from dev
knordback May 16, 2023
d464394
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 16, 2023
3675eba
Don't drop total_fuel_consumption_quantity in clean_boiler_fuel_eia923()
knordback May 18, 2023
cb74321
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 18, 2023
54dd4cc
More field non-dropping
knordback May 19, 2023
761ba7b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 19, 2023
67105a1
Don't drop fields in clean_generation_eia923()
knordback May 22, 2023
ee0ae79
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 22, 2023
d962d4f
Bug 509 work in clean_fuel_receipts_costs_eia923()
knordback May 23, 2023
1d466f1
Don't drop fields in clean_generation_fuel_eia923()
knordback May 24, 2023
15cb0c6
Remove a useless function call
knordback May 24, 2023
97ff719
Merge branch 'dev' into bug-509
knordback May 24, 2023
a4f6234
Merge branch 'dev' into bug-509
knordback Jun 1, 2023
968267c
Merge branch 'dev' into bug-509
knordback Jun 8, 2023
a6f7f9b
Merge branch 'dev' into bug-509
zaneselvans Jun 10, 2023
5d3a0e8
Merge branch 'dev' into bug-509
knordback Jun 14, 2023
eaa7615
Merge branch 'dev' into bug-509
knordback Jun 30, 2023
17 changes: 17 additions & 0 deletions src/pudl/metadata/classes.py
@@ -1676,6 +1676,23 @@ def encode(self, df: pd.DataFrame) -> pd.DataFrame:
        )
        return df

    def enforce_schema(self, df: pd.DataFrame) -> pd.DataFrame:
        """Drop columns not in the DB schema and enforce specified types."""
        expected_cols = pd.Index(self.get_field_names())
        missing_cols = list(expected_cols.difference(df.columns))
        if missing_cols:
            raise ValueError(
                f"{self.name}: Missing columns found when enforcing table "
                f"schema: {missing_cols}"
            )
        df = self.format_df(df)
        pk = self.schema.primary_key
        if pk and not df[df.duplicated(subset=pk)].empty:
            raise ValueError(
                f"{self.name} Duplicate primary keys when enforcing schema."
            )
        return df


# ---- Package ---- #

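As a reference for reviewers, the intended behavior of the new Resource.enforce_schema() can be sketched independently of pudl with plain pandas. The toy schema, table name, and column names below are invented for illustration and do not come from pudl's metadata.

```python
import pandas as pd

# Toy stand-in for a Resource's field list and primary key (illustrative only).
FIELDS = {"plant_id_eia": "Int64", "report_year": "Int64", "fuel_consumed_mmbtu": "float64"}
PRIMARY_KEY = ["plant_id_eia", "report_year"]


def enforce_schema(df: pd.DataFrame, name: str = "toy_table") -> pd.DataFrame:
    """Drop columns not in the schema, enforce dtypes, and check the primary key."""
    expected_cols = pd.Index(FIELDS)
    missing_cols = list(expected_cols.difference(df.columns))
    if missing_cols:
        raise ValueError(f"{name}: Missing columns found when enforcing table schema: {missing_cols}")
    # Keep only the schema's columns, in schema order, and coerce their dtypes.
    df = df[list(expected_cols)].astype(FIELDS)
    if PRIMARY_KEY and df.duplicated(subset=PRIMARY_KEY).any():
        raise ValueError(f"{name}: Duplicate primary keys when enforcing schema.")
    return df


raw = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "report_year": [2020, 2021, 2020],
        "fuel_consumed_mmbtu": [10.0, 12.5, 7.3],
        "operator_name": ["A", "A", "B"],  # not in the schema; silently dropped
    }
)
clean = enforce_schema(raw)
assert "operator_name" not in clean.columns
```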
@@ -3,7 +3,7 @@ plant_id_eia,plant_id,plant_id,plant_id,plant_id,plant_id,plant_id,plant_id,plan
combined_heat_power,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_power_plant,combined_heat_and_power_plant,combined_heat_power_plant,combined_heat_and_power_plant,combined_heat_and_power_plant,combined_heat_and_power_plant,combined_heat_and_power_plant,combined_heat_and_power_plant,combined_heat_and_power_plant,combined_heat_and_power_plant,combined_heat_and_power_plant
nuclear_unit_id,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_i_d,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id,nuclear_unit_id
plant_name_eia,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name,plant_name
operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name
utility_name_eia,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name,operator_name
operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id,operator_id
plant_state,state,state,state,state,state,state,state,state,state,state,state,plant_state,state,plant_state,plant_state,plant_state,plant_state,plant_state,plant_state,plant_state,plant_state
census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region,census_region
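For context, the column-map CSV above associates each report year's raw EIA-923 header with a standardized PUDL column name; the change in this hunk maps the raw operator_name header to utility_name_eia rather than keeping it as operator_name. Conceptually such a map is applied as a column rename. The subset mapping below is hand-picked from the row values above for illustration and is not how pudl actually loads these files.

```python
import pandas as pd

# Hypothetical single-year mapping pulled from the CSV rows above:
# {raw EIA-923 header: standardized PUDL column name}.
column_map_recent_year = {
    "combined_heat_and_power_plant": "combined_heat_power",
    "plant_name": "plant_name_eia",
    "operator_name": "utility_name_eia",
    "operator_id": "operator_id",
    "plant_state": "plant_state",
}

raw = pd.DataFrame(
    {
        "plant_name": ["Example Plant"],
        "operator_name": ["Example Utility"],
        "combined_heat_and_power_plant": ["Y"],
    }
)
renamed = raw.rename(columns=column_map_recent_year)
assert "utility_name_eia" in renamed.columns
```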
14 changes: 1 addition & 13 deletions src/pudl/transform/classes.py
@@ -1263,17 +1263,5 @@ def enforce_schema(self, df: pd.DataFrame) -> pd.DataFrame:
        """Drop columns not in the DB schema and enforce specified types."""
        logger.info(f"{self.table_id.value}: Enforcing database schema on dataframe.")
        resource = Package.from_resource_ids().get_resource(self.table_id.value)
        expected_cols = pd.Index(resource.get_field_names())
        missing_cols = list(expected_cols.difference(df.columns))
        if missing_cols:
            raise ValueError(
                f"{self.table_id.value}: Missing columns found when enforcing table "
                f"schema: {missing_cols}"
            )
        df = resource.format_df(df)
        pk = resource.schema.primary_key
        if pk and not df[df.duplicated(subset=pk)].empty:
            raise ValueError(
                f"{self.table_id.value} Duplicate primary keys when enforcing schema."
            )
        df = resource.enforce_schema(df)
        return df
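The hunk above is essentially a thin-wrapper refactor: the transformer keeps its logging and resource lookup but delegates the column, dtype, and primary-key checks to the Resource. A rough outline of that pattern, using toy classes rather than pudl's real ones, might look like this:

```python
import pandas as pd


class ToyResource:
    """Stand-in for pudl.metadata.classes.Resource (illustrative only)."""

    def __init__(self, name: str, fields: dict[str, str], primary_key: list[str]):
        self.name = name
        self.fields = fields
        self.primary_key = primary_key

    def enforce_schema(self, df: pd.DataFrame) -> pd.DataFrame:
        # Same idea as the new Resource.enforce_schema(): subset, cast, check PK.
        df = df[list(self.fields)].astype(self.fields)
        if self.primary_key and df.duplicated(subset=self.primary_key).any():
            raise ValueError(f"{self.name}: duplicate primary keys")
        return df


class ToyTransformer:
    """Stand-in for AbstractTableTransformer after the refactor."""

    def __init__(self, resource: ToyResource):
        self.resource = resource

    def enforce_schema(self, df: pd.DataFrame) -> pd.DataFrame:
        # All schema logic now lives on the resource; the transformer just delegates.
        return self.resource.enforce_schema(df)
```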
11 changes: 11 additions & 0 deletions src/pudl/transform/eia.py
@@ -1170,6 +1170,17 @@ def transform(
        "boilers_annual_eia",
    )

    # Remove fields that came from input data but aren't in the
    # corresponding SQLite tables. The data may still exist but has been
    # moved elsewhere.
    for cat in eia_transformed_dfs:
        resource = (
            pudl.metadata.classes.Package.from_resource_ids().
            get_resource(cat)
        )
        eia_transformed_dfs[cat] = resource.enforce_schema(
            eia_transformed_dfs[cat])
Member:

Is there any reason not to wait to do this until the very end (after the balancing authority fix below)?

Is there any reason not to call enforce_schema() on the entity tables too?

Collaborator Author:

Yes, I had already moved the code to the very end of the function.

And I added code to do the same thing for the entity tables. That doesn't seem to cause any distress.

Member:

Wow, that's great! I'm slightly surprised. Do you want to try running tox -e nuke overnight and see if it stays happy?

Collaborator Author:

Will do

Collaborator Author:

It doesn't seem to have been successful:
nuke: exit -9 (641.56 seconds) /home/knordback/Kurt/pudl> bash -c 'coverage run --append src/pudl/cli.py --logfile tox-nuke.log --clobber src/pudl/package_data/settings/etl_full.yml' pid=793582
.pkg: _exit> python /home/knordback/.conda/envs/pudl-dev/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta
nuke: FAIL code -9 (7080.40=setup[98.70]+cmd[0.12,0.54,0.48,0.39,0.41,0.42,0.39,0.53,0.24,6.33,6.21,1.24,0.03,1.72,1.23,161.77,33.89,25.29,3980.21,0.01,2118.71,641.56] seconds)
evaluation failed :( (7081.04 seconds)

Collaborator Author:

All of those tests run successfully. So does tox with no arguments (which I thought was the commit criterion). So it seems to be something specific to the "nuke".

I am indeed getting copious output, but it's not (to my eye) indicating what the problem is. As mentioned above, I get some warnings but nothing obviously connected to this.

I'll be away next week. I'll look into this more when I get back.

Collaborator Author:

Hi Zane. I'm back, for a little while, and have some time to work on this again. I can continue un-dropping fields, but I'm not sure how worried to be about not being able to run tox -e nuke successfully. As noted, the other tests run fine for me. Output is attached in case you want to take a look.
tox-out.txt

Member:

Have you tried running the full ETL with the pudl_etl command line tool? The place where it's failing in your logs is the most memory-intensive part of the ETL, so my guess is that it's running out of memory and crashing. I thought you'd been able to do the full ETL previously? How much memory + swap do you have? I think it may take up to 25GB as it is now.

tox with no arguments is what gets run in CI on GitHub. It only processes one year of data. For changes like the one you're making, which will affect all years of data, we try to run the full ETL locally before it gets merged into dev.

tox -e nuke is a blunt instrument that not only runs the full ETL but also all the unit and integration tests on all the data, plus the data validation tests. You probably want to run just the full ETL, and only for the EIA data, to check whether everything is working. The devtools/eia-etl-debug.ipynb notebook is the easiest / most efficient way to do that right now, since you don't have to run the extract step over and over again. It will also avoid this (FERC 1) related memory-intensive step.
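As an aside, a quick way to gauge how much memory the transformed tables themselves are holding (for example from inside the debug notebook) is a sketch like the one below; eia_transformed_dfs is assumed to be the dict of DataFrames produced by the transform step.

```python
import pandas as pd


def report_memory(dfs: dict[str, pd.DataFrame]) -> pd.Series:
    """Return per-table in-memory size in MB, largest first."""
    sizes = {
        name: df.memory_usage(deep=True).sum() / 1e6 for name, df in dfs.items()
    }
    return pd.Series(sizes).sort_values(ascending=False)


# Example: print(report_memory(eia_transformed_dfs))
```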

Collaborator Author:

I have 16 GB RAM + 15 GB swap. I thought I could run the full ETL, but apparently not that either. I was able to run devtools/eia-etl-debug.ipynb (first time using Jupyter -- it's kinda slick).

Member:

Jupyter is great for working with data. Lots more human-friendly tools for seeing what's going on in there! Definitely worth getting familiar with.


    eia_transformed_dfs["plants_eia860"] = fillna_balancing_authority_codes_via_names(
        df=eia_transformed_dfs["plants_eia860"]
    ).pipe(
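Later commits on this branch (not included in the 10-commit view shown here) apply the same cleanup to the harvested entity tables, as discussed in the thread above. A rough sketch of that addition, with entities_dfs as an assumed name for the dict of entity DataFrames, might look like:

```python
import pandas as pd
from pudl.metadata.classes import Package

# Placeholder: in practice this dict comes from the harvesting step and is
# keyed by entity table name; "entities_dfs" is an assumed variable name.
entities_dfs: dict[str, pd.DataFrame] = {}

for name, df in entities_dfs.items():
    # Mirror the eia_transformed_dfs loop above for each entity table.
    resource = Package.from_resource_ids().get_resource(name)
    entities_dfs[name] = resource.enforce_schema(df)
```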
3 changes: 0 additions & 3 deletions src/pudl/transform/eia923.py
@@ -617,8 +617,6 @@ def generation_fuel(eia923_dfs, eia923_transformed_dfs):
    # Drop fields we're not inserting into the generation_fuel_eia923 table.
    cols_to_drop = [
        "combined_heat_power",
        "plant_name_eia",
        "operator_name",
        "operator_id",
        "plant_state",
        "census_region",
@@ -1072,7 +1070,6 @@ def fuel_receipts_costs(eia923_dfs, eia923_transformed_dfs):
    # Drop fields we're not inserting into the fuel_receipts_costs_eia923
    # table.
    cols_to_drop = [
        "plant_name_eia",
        "plant_state",
        "operator_name",
        "operator_id",
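The net effect of un-dropping fields here, combined with the enforce_schema() call at the end of pudl.transform.eia.transform(), is that columns like plant_name_eia and operator_name stay available to the intermediate harvesting steps but still never land in the final table unless the schema includes them. A toy illustration of that equivalence (column names taken from the diff, data invented):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "plant_id_eia": [1, 2],
        "plant_name_eia": ["Plant A", "Plant B"],
        "operator_name": ["Utility A", "Utility B"],
    }
)

# Before this PR: the table-specific transform dropped the columns early.
early_drop = df.drop(columns=["plant_name_eia", "operator_name"])

# After this PR: the columns ride along (so harvesting can use them) and the
# final schema enforcement keeps only the table's own fields.
schema_cols = ["plant_id_eia"]  # stand-in for the table's real field list
late_drop = df[schema_cols]

# The final table is the same either way; only the intermediate visibility changes.
assert early_drop.equals(late_drop)
```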