Migrate where/when we filter for the freshest XBRL data #3861
Conversation
test/validate/ferc1_test.py
Outdated
NOTE: None of this actually calls any of the code over in pudl.transform.ferc1 rn.
That's maybe okay because we are mostly testing the data updates here.
does this feel okay? we could pull some of the bits from the transform filter out into smaller functions to actually run here... but i don't think it is worth it looking at the bits and the setup.
It would be good to test the actual code we're running in the transform - otherwise it's too easy for the test to drift from the implementation. One thing we could do is:
- make `filter_for_freshest` take an optional `deduper` param that defaults to `__apply_diffs` - but in test, you could pass in various dedupe methodologies.
- call `filter_for_freshest(table_name)` and `filter_for_freshest(table_name, deduper=__best_snapshot)` in test, then `__compare_dedupe_methodologies()` the two.

How does that sound to you?
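A minimal sketch of the optional-deduper idea. The names `apply_diffs` and `best_snapshot` here are simplified, public stand-ins for the private PUDL helpers, and the column-wise logic is only illustrative:

```python
import pandas as pd

# Simplified stand-ins for the two dedupe methodologies discussed above:
# "apply diffs" forward-fills partial re-filings before taking the last value,
# while "best snapshot" keeps whatever the most recent filing reported.
def apply_diffs(s: pd.Series):
    return s.ffill().iloc[-1]

def best_snapshot(s: pd.Series):
    return s.iloc[-1]

def filter_for_freshest(df: pd.DataFrame, pks: list[str], deduper=apply_diffs):
    """Collapse re-filed rows to one row per primary key, column by column."""
    return df.groupby(pks).agg(deduper).reset_index()

# Two filings for "a": the second one only reports a null value.
filings = pd.DataFrame({"entity": ["a", "a"], "value": [1.0, None]})
diffs = filter_for_freshest(filings, ["entity"])
snap = filter_for_freshest(filings, ["entity"], deduper=best_snapshot)
print(diffs["value"].iloc[0])  # 1.0 -- the earlier value survives
print(snap["value"].iloc[0])   # nan -- the snapshot drops it
```

A test could then call both variants on the same table and compare the results, which is the point of the injectable `deduper` param.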
hm after trying to do this, the main snag here is that `filter_for_freshest` does the prep to get the `never_duped` and `duped_groups`, does the filtering, then concats them back together. So instead of passing in an optional `deduper` method I think it'll be simpler to pass in a `compare_methods` bool, which defaults to `False`.
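Roughly what that shape could look like. This is a sketch, not the actual PUDL code: the dedupe lambdas, column names, and comparison logic are placeholders:

```python
import pandas as pd

def filter_for_freshest_data(
    df: pd.DataFrame, pks: list[str], compare_methods: bool = False
) -> pd.DataFrame:
    """Sketch: split out never-duplicated rows, dedupe the rest, concat back."""
    dupe_mask = df.duplicated(pks, keep=False)
    never_duped = df[~dupe_mask]
    duped_groups = df[dupe_mask].groupby(pks)
    # "apply diffs": forward-fill partial re-filings, keep the combined last row
    apply_diffs = duped_groups.agg(lambda s: s.ffill().iloc[-1]).reset_index()
    if compare_methods:
        # "best snapshot": keep the most recent filing wholesale
        best_snapshot = duped_groups.agg(lambda s: s.iloc[-1]).reset_index()
        n_diff = (apply_diffs.fillna("") != best_snapshot.fillna("")).to_numpy().sum()
        print(f"cells where the two methods disagree: {n_diff}")
    return pd.concat([never_duped, apply_diffs], ignore_index=True)

# "a" is filed twice (the re-filing only adds "other"); "b" is filed once.
data = pd.DataFrame({
    "entity": ["a", "a", "b"],
    "value": [1.0, None, 3.0],
    "other": [None, 5.0, 6.0],
})
fresh = filter_for_freshest_data(data, ["entity"], compare_methods=True)
```

With the flag defaulting to `False`, production runs skip the slow comparison while the validation test can opt in.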
Whee! A few changes requested in testing (mostly around actually testing the code we're running in production) but otherwise looking good!
test/validate/ferc1_test.py
Outdated
Instead of stacking the two datasets, merging by context, and then
looking for left_only or right_only values, we just count non-null
values. This is because we would want to use the report_year as a
merge key, but that isn't available until after we pipe the
dataframe through `refine_report_year`.
We will have `refine_report_year` at this point, right? Can we do this more useful comparison now?
ah okay so now in `__compare_dedupe_methodologies` we could merge with `indicator=True`? i can try that out
or do you mean merge the two filtered dfs and check column by column for nulls in one filter method but not in the other?
Sorry, I wasn't clear! I had meant that we should check each individual XBRL fact to see if it had changed:
- from null in apply-diffs to not-null in snapshot
- vice versa
- from one value in apply-diffs to a different value in snapshot

Originally / in this comment, I had envisioned this Pandas workflow:
- stack datasets so there's only one data column
- merge by XBRL context columns + variable name column
- look at the prevalence of `left_only` / `right_only` values & any rows where the `value_left` and `value_right` were different

But! I think your plan of merging the dfs and checking column-by-column also works! In my head, the merge indicator operates on a row-by-row basis, so I figured stacking to one-xbrl-fact-per-row would be easiest. But you are the one who is writing this and also has more pandas experience so I'll defer to your judgment.
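The stacked workflow could look something like this. The frames and column names are toy examples, not the real tables, and `melt` stands in for the stacking step since it keeps nulls:

```python
import pandas as pd

# Toy results of the two dedupe methods: one row per context ("entity"),
# one column per XBRL fact.
apply_diffs = pd.DataFrame(
    {"entity": ["a", "b"], "revenue": [1.0, 2.0], "cost": [None, 5.0]}
)
best_snapshot = pd.DataFrame(
    {"entity": ["a", "b"], "revenue": [1.0, 9.0], "cost": [4.0, 5.0]}
)

def one_fact_per_row(df: pd.DataFrame) -> pd.DataFrame:
    # stack the non-context columns so there's a single value column
    return df.melt(id_vars="entity", var_name="fact", value_name="value")

merged = one_fact_per_row(apply_diffs).merge(
    one_fact_per_row(best_snapshot),
    on=["entity", "fact"],
    how="outer",
    suffixes=("_diffs", "_snapshot"),
    indicator=True,
)
# flag facts present in only one method, plus null<->value and value<->value changes
mismatched = merged[
    (merged["_merge"] != "both")
    | (merged["value_diffs"].fillna("") != merged["value_snapshot"].fillna(""))
]
print(len(mismatched))  # 2: (a, cost) null vs 4.0, and (b, revenue) 2.0 vs 9.0
```

Because every XBRL fact is its own row after melting, the `indicator` column and a single value comparison cover all three change cases listed above.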
ah yes stacking the non pk/context columns is a step of this that i hadn't considered and would make this more zippy! ty for this extra context i will try it out
```diff
-df_out.stack(params.stacked_column_name, dropna=False)
+df_out.stack(params.stacked_column_name, future_stack=True)
```
sorry this is fully unrelated... but this was to remove an annoying future warning. i tested the update by re-running all the core and out ferc assets then running the ferc validation tests.
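For context, the `future_stack=True` path (added in pandas 2.1) keeps null entries without the deprecated `dropna` argument. A small sketch, with a fallback for older pandas versions:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [3.0, 4.0]}, index=["r1", "r2"])

try:
    # pandas >= 2.1: the new implementation keeps NaN cells and avoids the
    # FutureWarning that the legacy stack emits.
    stacked = df.stack(future_stack=True)
except TypeError:
    # older pandas without the future_stack keyword
    stacked = df.stack(dropna=False)

print(len(stacked))  # 4 -- the NaN cell in column "a" is retained
```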
```python
"""Create PUDL input and output directories if they don't already exist."""
self.input_dir.mkdir(parents=True, exist_ok=True)
self.output_dir.mkdir(parents=True, exist_ok=True)
return self
```
this was to remove another more different warning about `model_validator`'s needing `self` to be returned. i honestly don't know how this didn't junk everything up before
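The pattern in question, sketched with a hypothetical model (the `Workspace` class here is made up, not the actual PUDL settings class): pydantic v2's `mode="after"` validators receive the constructed model instance and must return it.

```python
import tempfile
from pathlib import Path

from pydantic import BaseModel, model_validator

class Workspace(BaseModel):
    """Hypothetical settings model, not the actual PUDL class."""

    input_dir: Path
    output_dir: Path

    @model_validator(mode="after")
    def create_dirs(self):
        """Create input and output directories if they don't already exist."""
        self.input_dir.mkdir(parents=True, exist_ok=True)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        return self  # omitting this return is what pydantic complains about

tmp = tempfile.mkdtemp()
ws = Workspace(input_dir=Path(tmp) / "in", output_dir=Path(tmp) / "out")
print(ws.input_dir.exists())  # True -- the validator ran on construction
```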
```python
# some sample guys found to have higher filtering diffs
"core_ferc1__yearly_utility_plant_summary_sched200",
"core_ferc1__yearly_plant_in_service_sched204",
"core_ferc1__yearly_operating_expenses_sched320",
"core_ferc1__yearly_income_statements_sched114",
],
```
i added some more guys
🚢
```python
primary_keys = get_primary_key_raw_xbrl(
    raw_table_name.removeprefix("raw_ferc1_xbrl__"), "ferc1"
)
filter_for_freshest_data_xbrl(
```
It's a little confusing that the assertions don't live in the test function, but I guess this works! I think someone confused about what this is actually testing would just go into `filter_for_freshest` and immediately see the assertion errors.
yea i absolutely agree with this. it just would have been a lot of contrived passing things around and making little functions to get the right inputs in the right place. this feels simpler for now, but i can def imagine wanting to come back and break this up so the `__compare_dedupe_methodologies` step happens entirely in the test.
* migrate the comparing of the freshest data methodologies into tests and the filtering into the transform step
* omigosh more moving of stuff
* why did it let me commit that when the unit test failed? bc its marked slow?
* rejigger the filtering so we can run it in the tests
* re-work how we compare the filtering methodologies
* transition fully to new compare diffs
ah i re-added this to the merge queue w/o pushing my changes.... oopsies.
Overview
This was done in conjunction with @jdangerx !
What problem does this address?
It came to our attention in #3857 that `filter_for_freshest_data` being inside of the XBRL io_manager was... a lot! It included a pretty slow step that was mostly a validation step.

What did you change?
- moved the `filter_for_freshest_data` processing out of the `io_manager` into `transform.ferc1`. this took the longest to figure out where/when in the process this should happen.... we ended up putting it in `ferc1_transform_asset_factory` because that is where the raw tables get concatenated when there is more than one input raw table. So that is the last place where we have individual raw tables.
- moved `__compare_dedupe_methodologies` into a validation test
- added a check in `ferc1_transform_asset_factory` to make sure all of the raw tables either have an instant or duration table... because we had a typo in one of the input tables 😬

Implications??
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list