Force tables to have all columns that are defined in schema #147

Merged (3 commits) on Sep 27, 2023

Conversation

@jdangerx jdangerx commented Sep 27, 2023

In catalyst-cooperative/pudl#2897 I found that we were missing some columns because the .unstack() in construct_dataframe doesn't create columns for values that never appear in the data, even if they're defined in the metadata. Applying a reindex against the columns defined in the schema makes sure we get everything.

This was also causing some integration test failures. When running the ETL in-process, we would:
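
A minimal sketch of the behavior described above, using made-up data. The column names, the `facts` frame, and `schema_columns` are hypothetical and are not taken from construct_dataframe; the point is only that `.unstack()` builds columns from the values it sees, while `reindex(columns=...)` adds the ones that are missing.

```python
import pandas as pd

# Hypothetical schema: the columns the table metadata says should exist.
schema_columns = ["plant_value", "fuel_cost", "other_cost"]

# Long-format facts. Nothing reports "other_cost" in this filing set.
facts = pd.DataFrame(
    {
        "entity_id": ["e1", "e1", "e2"],
        "name": ["plant_value", "fuel_cost", "plant_value"],
        "value": [10.0, 2.5, 7.0],
    }
)

# unstack() only creates columns for names that actually occur in the data,
# so "other_cost" is silently absent from the wide table.
wide = facts.set_index(["entity_id", "name"])["value"].unstack("name")

# Reindexing against the schema adds the missing column (filled with NaN)
# and puts the columns in schema order.
enforced = wide.reindex(columns=schema_columns)

print(list(wide.columns))
print(list(enforced.columns))
```

The reindexed frame has the same column set no matter which values happened to be reported, which is what keeps the output stable from one year of data to the next.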

  1. write 2021 data for a table
  2. construct dataframe for 2022 data, which has a slightly different column set because of different reported values
  3. fail when trying to write 2022 data to SQLite
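
The failure sequence above can be reproduced in miniature. This is a sketch under assumptions: it uses `DataFrame.to_sql` with an in-memory SQLite database, and the table and column names are hypothetical. The extractor's actual write path may differ.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical frames: the 2022 data reports a value that 2021 never did.
df_2021 = pd.DataFrame({"entity_id": ["e1"], "plant_value": [10.0]})
df_2022 = pd.DataFrame(
    {"entity_id": ["e1"], "plant_value": [11.0], "fuel_cost": [2.5]}
)

# Step 1: writing 2021 data creates the table with only the 2021 columns.
df_2021.to_sql("example_table", conn, index=False)

# Steps 2-3: appending 2022 data fails because the table has no "fuel_cost".
try:
    df_2022.to_sql("example_table", conn, index=False, if_exists="append")
    append_error = None
except Exception as err:  # the exact exception type depends on the pandas version
    append_error = err

# With every frame reindexed to the same (hypothetical) schema, both writes succeed.
schema_columns = ["entity_id", "plant_value", "fuel_cost"]
for df in (df_2021, df_2022):
    df.reindex(columns=schema_columns).to_sql(
        "enforced_table", conn, index=False, if_exists="append"
    )

row_count = conn.execute("SELECT COUNT(*) FROM enforced_table").fetchone()[0]
```

Because the first write fixes the table's column set, any later frame with a different column set breaks the append; enforcing the schema before writing removes that dependence on write order.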

Lastly, I wonder if there's a way we could keep our extracted tables tidy - our transforms in PUDL promptly re-stack these wide tables in wide_to_tidy, so maybe we can skip that completely. But that's definitely out of scope of this PR.
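
For illustration, here is roughly what re-stacking a wide table into tidy form looks like. This is not the wide_to_tidy implementation from PUDL, only a sketch with hypothetical column names using `DataFrame.melt`.

```python
import pandas as pd

# Hypothetical wide table, one column per reported value name.
wide = pd.DataFrame(
    {
        "entity_id": ["e1", "e2"],
        "plant_value": [10.0, 7.0],
        "fuel_cost": [2.5, None],
    }
)

# One row per (entity, name) pair, dropping values that were never reported.
tidy = wide.melt(
    id_vars="entity_id", var_name="name", value_name="value"
).dropna(subset=["value"])

print(tidy)
```

If the extractor emitted the tidy shape directly, the set of reported values would live in rows rather than columns, so a value that never appears would not change the table's schema.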

@jdangerx jdangerx requested a review from zschira September 27, 2023 13:46
codecov bot commented Sep 27, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (a29bee2) 93.09% compared to head (c987bd6) 93.09%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #147   +/-   ##
=======================================
  Coverage   93.09%   93.09%           
=======================================
  Files           8        8           
  Lines         594      594           
=======================================
  Hits          553      553           
  Misses         41       41           
Files Coverage Δ
src/ferc_xbrl_extractor/datapackage.py 98.70% <ø> (ø)

@zschira zschira left a comment

Looks good! I definitely think that moving towards producing tidy tables is a good idea. As for when we'll actually get around to doing it, I'm not sure. Maybe when we get the first taxonomy update that substantially changes the structure of some tables?

@jdangerx jdangerx merged commit 61525e5 into main Sep 27, 2023
12 checks passed
@jdangerx jdangerx deleted the enforce-schema branch September 27, 2023 21:05
@jdangerx

Yeah, I think we plant our mental seeds now and then reap them when we have to integrate 2023 data 🌱
