Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make leafy balance sheet assets & liabilities data #2805

Merged
merged 60 commits into from
Sep 20, 2023

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented Aug 24, 2023

PR Overview

Some final fixes that are specific to the balance sheet asset calculation tree, allowing it to be used for annotating and filtering the exploded balance sheet asset & liability data tables.

Tasks

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans linked an issue Aug 24, 2023 that may be closed by this pull request
@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. xbrl Related to the FERC XBRL transition labels Aug 24, 2023
@zaneselvans
Copy link
Member

zaneselvans commented Aug 25, 2023

For reasons I don't yet understand, I am getting non-deterministic behavior in the graph pruning process. For both balance_sheet_assets_ferc1 and income_statement_ferc1 whether or not we get a tree at the end depends on something random, and both of them sometimes come out as trees with no multi-parent nodes.

Running the balance_sheet_asset_ferc1 explosion process 10 times, I got the following sets of stepchildren:

[
    [
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_acquisition_adjustment', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='construction_work_in_progress', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='construction_work_in_progress', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_acquisition_adjustment', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_held_for_future_use', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_held_for_future_use', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_held_for_future_use', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='construction_work_in_progress', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_acquisition_adjustment', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA)
    ],
    [
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='other3', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='other2', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_plant_acquisition_adjustment', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA)
    ],
    [],
    [
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_plant_purchased_or_sold', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_plant_purchased_or_sold', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_completed_construction_not_classified', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_property_under_capital_leases', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_property_under_capital_leases', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_plant_acquisition_adjustment', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_completed_construction_not_classified', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA)
    ],
    [
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_plant_purchased_or_sold', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_property_under_capital_leases', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_plant_purchased_or_sold', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_completed_construction_not_classified', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_completed_construction_not_classified', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_property_under_capital_leases', utility_type='other1', plant_status=pd.NA, plant_function=pd.NA)
    ],
    [],
    [
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_plant_acquisition_adjustment', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA)
    ],
    [
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_plant_acquisition_adjustment', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA)
    ],
    [
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_held_for_future_use', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='construction_work_in_progress', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='construction_work_in_progress', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_acquisition_adjustment', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_acquisition_adjustment', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_held_for_future_use', utility_type='common', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_acquisition_adjustment', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_held_for_future_use', utility_type='gas', plant_status=pd.NA, plant_function=pd.NA),
        NodeId(table_name='utility_plant_summary_ferc1', xbrl_factoid='construction_work_in_progress', utility_type='electric', plant_status=pd.NA, plant_function=pd.NA)
    ],
    [],
]

Interestingly, the result remains consistent so long as I don't restart the notebook kernel -- I can re-run all the cells and get the same result over and over again, as long as I don't restart the kernel.

My intuition is that it has something to do with the way we're iteratively removing nodes that have only stepchildren in the forest() method:

        for node in self.stepparents(forest):
            children = set(forest.successors(node))
            stepchildren = set(self.stepchildren(forest)).intersection(children)
            if (children == stepchildren) & (len(children) > 0):
                forest.remove_node(node)

The problem here is that we're iterating over something that's being mutated within the loop... which could yield some non-deterministic behavior if the ordering of what's returned is random.

We could address this either by storing a constant list of stepparents before looping, or by accummulating a list of all the nodes we want to remove and then removing them outside the loop, but I think these are subtly different.

Freezing the initial list of stepparents before iterating:

        stepparents = self.stepparents(forest)
        for node in stepparents:
            children = set(forest.successors(node))
            stepchildren = set(self.stepchildren(forest)).intersection(children)
            if (children == stepchildren) & (len(children) > 0):
                forest.remove_node(node)

Results in what seems to be identical behavior (but is still safer!)

If we also wait to remove the set of nodes that meet our criteria until after the loop exits:

        stepparents = self.stepparents(forest)
        nodes_to_remove = []
        for node in stepparents:
            children = set(forest.successors(node))
            stepchildren = set(self.stepchildren(forest)).intersection(children)
            if children and (children == stepchildren):
                nodes_to_remove.append(node)
        forest.remove_nodes_from(nodes_to_remove)

Then we always get the same output and there are no stepchildren!!!!

However...

image
image

Going back to removing nodes that only have stepchildren ONLY if they also have utility_type == "total":

        stepparents = self.stepparents(forest)
        nodes_to_remove = []
        for node in stepparents:
            children = set(forest.successors(node))
            stepchildren = set(self.stepchildren(forest)).intersection(children)
            if (
                len(children) != 0
                and (children == stepchildren)
                and (node.utility_type == "total")
            ):
                nodes_to_remove.append(node)
        forest.remove_nodes_from(nodes_to_remove)

We get an output that looks more like we expect, but is it actually correct?

image
image

@zaneselvans zaneselvans changed the title the place to actually make leafy data omigosh Make leafy balance sheet assets & liabilities data Aug 25, 2023
@cmgosnell
Copy link
Member Author

I think the former way to do this it removed one stepparent at a time. So every time during the for node in self.stepparents(forest): loop the forest, it was checking against a new forest. which is good. because as we've discussed there are multiple valid paths and we want one of them. if you remove all of these stepparents then your removing all paths.

I think a better way to get a deterministic outcome would be to sort the stepparents going into the loop so it happens in the same order every time. I'm gonna check doing something like that.

@zaneselvans
Copy link
Member

Sorting the list of nodes to iterate over and allowing the list to mutate while we iterate may make the process deterministic, but will it necessarily be correct? What's our criteria for choosing which path to keep?

This is why originally I only removed the stepparents that also had utility_type=="total" as a way to choose which path to keep and which to remove.

@cmgosnell
Copy link
Member Author

well turns out it didn't make the process deterministic! sooo there is that.

but in terms of what is "right", i don't think it matters which path we choose if there are two paths to the same set of outcomes. your right that choosing the totals as the ones to remove will help narrow the choice down, but it looks like there are still a lot of leftover stepparents with that method.

@cmgosnell
Copy link
Member Author

we employed sorted lists and it locked in the orderrrrr 🎉

Comment on lines 1948 to 1958
# There are a few rare instances where a particular node is specified with more
# than one weight (-1 vs. 1) and in those cases, we always want to keep the
# weight of -1, since it affects the overall root->leaf calculation outcome.
# Maybe this should actually happen in set_forest_attributes?
multi_valued_weights = (
self.exploded_calcs.groupby(self.calc_cols, dropna=False)["weight"]
.transform("nunique")
.gt(1)
)
calcs_to_drop = multi_valued_weights & (self.exploded_calcs.weight == 1)
deduplicated_calcs = self.exploded_calcs.drop(calcs_to_drop.index)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works fine and i think is a good solution for now, but i think the more holistic way to do this would be to get all of the nodes direct predecessor nodes in order to generate both the parent_cols and calc_cols in order to select the individual records to select on within exploded_calcs during the merge in set_forest_attributes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should talk about this. I'm not sure I understand what you're saying. I agree this probably belongs over in set_forest_attributes() or maybe it its own deduplicate_calcs() method that's called by set_forest_attributes()

@zaneselvans zaneselvans marked this pull request as ready for review August 25, 2023 16:07
@zaneselvans
Copy link
Member

zaneselvans commented Aug 25, 2023

I think we need to check whether the nodes we're dropping when enforcing the tree structure:

  • are NOT tagged
  • do NOT appear in any calculations with a weight of -1

If either of those statements are untrue, then by removing them we will be altering the outcome of the final root -> leaf calculations and what tags end up being applied to the leaf nodes.

The tag check is a little weird because right now we're applying tags only on the basis of matching the table & fact

@codecov
Copy link

codecov bot commented Aug 25, 2023

Codecov Report

Patch coverage: 88.8% and project coverage change: +0.1% 🎉

Comparison is base (671a969) 88.4% compared to head (421bada) 88.6%.
Report is 497 commits behind head on explode_ferc1.

Additional details and impacted files
@@               Coverage Diff               @@
##           explode_ferc1   #2805     +/-   ##
===============================================
+ Coverage           88.4%   88.6%   +0.1%     
===============================================
  Files                 89      90      +1     
  Lines              10711   10832    +121     
===============================================
+ Hits                9478    9599    +121     
  Misses              1233    1233             
Files Changed Coverage Δ
src/pudl/convert/epacems_to_parquet.py 64.7% <ø> (ø)
src/pudl/extract/eia860.py 100.0% <ø> (ø)
src/pudl/extract/eia861.py 94.7% <ø> (ø)
src/pudl/metadata/codes.py 100.0% <ø> (ø)
src/pudl/metadata/dfs.py 100.0% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/metadata/sources.py 100.0% <ø> (ø)
src/pudl/output/eia.py 59.0% <ø> (ø)
... and 59 more

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 1969 to 1979
# There are a few rare instances where a particular node is specified with more
# than one weight (-1 vs. 1) and in those cases, we always want to keep the
# weight of -1, since it affects the overall root->leaf calculation outcome.
# Maybe this should actually happen in set_forest_attributes?
multi_valued_weights = (
self.exploded_calcs.groupby(self.calc_cols, dropna=False)["weight"]
.transform("nunique")
.gt(1)
)
calcs_to_drop = multi_valued_weights & (self.exploded_calcs.weight == 1)
deduplicated_calcs = self.exploded_calcs.loc[~calcs_to_drop]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how many true multi_valued_weights do we expect? it might be good to add a little assertion with some expected value to make sure we don't end up unexpectedly dropping a ton.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. It should be very small proportion compared to the number of overall calculations.

@zaneselvans
Copy link
Member

@e-belfer I fleshed out the Tasklist based on the comments @cmgosnell left before she went on vacation and notes from our handoff discussion. The tasks that sound significant include the following (which hopefully aren't just gobbledygook...)

Re-evaluate conflicting weights check and passthrough calculations

Previously we were excising passthrough calculations, that had a single parent and a single child (plus a correction node) to redirect calculations that involved duplicate values, but doing this we had to be careful never to remove a node with a weight that would alter the full root-to-leaf calculation weight (i.e. any weight other than 1.0).

Since adding all of the other dimensions (utility_type, plant_status and plant_function) it seems that there are no longer any passthrough calculations... so all of that complication may be unnecessary, and if that's true it would be nice to remove it.

However, another kind of issue with "conflicting" weights has come up, in which there are multiple paths to a few nodes in the forest, because sometimes calculations are decomposed in two different ways in parallel -- one breaking things down by xbrl fact, and another breaking things down by the additional dimensions. We have to choose one or the other in order to avoid having duplicate value getting aggregated by the tree, but at some level it doesn't matter which one we choose (unless there's another metadata dimension like is_in_rate_base that applies to one decomposition and not the other which we need to capture).

Right now we are using a heuristic and some random chance to decide which calculations to drop. Any time that all of a node's children have multiple parents, we know that we can safely remove that node without dropping any of the downstream (leafward) values. This gets rid of almost all of the duplicate paths, but there are a couple of cases where some of the children don't quite cover one of the decompositions completely, so we have to purposefully select the decomposition that does include all of the children and drop the node that doesn't.

I think that Christina is suggesting that as long as removing a node doesn't result in the loss of any leaves, and we do end up with forest structure in the end, then the calculations should be valid.

Debugging tag conflicts & tag propagation

Some similar issues come up with the tagging -- which nodes get manually tagged is a bit random, and based on what facts are most salient to the tagger. We need to propagate those tags all the way down to the leaf nodes, so that when we do the root-to-leaf calculation, we have all of the tagging information available. But this creates the possibility of conflicts between separate tags. If a node and one of its descendants have different tags, then whichever one is propagated last is going to end up having precedence, which may or may not be correct. I think these conflicts are probably user input errors and should be flagged and resolved manually.

We also need to check and make sure that none of the nodes we are pruning contain tags that are then lost from the tree.

It also looks like there's some kind of issue with the way the tags are being propagated to the leaves, since there are at leaset a few cases in which the tags of parent nodes aren't showing up in their children. But it could also be that I'm generating the forest_as_table before I've propagated the tags (need to double-check).

@zaneselvans
Copy link
Member

zaneselvans commented Sep 15, 2023

Some observations after adding all of the dimensions and merging in the intra-dimensional total fixes. Discussing with one or more of @cmgosnell @e-belfer or @jdangerx would probably be helpful.

No more passthroughs

  • There are no longer any nodes with a a single parent and a single child + a _correction.
  • Should there be something analogous but with a different definition because the totals contains subdimensions now so having a "single" child previously means having only within-fact subcomponents?
  • Or should this concept be retired?

We have actual forests

  • All 3 of the calculation forests are free of multi-parent nodes, as required... though this is kind of by construction / brute force on our part.

Some calculations aren't matching exactly

  • balance_sheet_liabilities_ferc1 and income_statement_ferc1 have single-digit percentage calculation errors.
  • More than half of the balance_sheet_assets_ferc1 don't exactly match reported values.
  • Programmatically things seem like they're probably pretty good, but something needs to be fixed in the balance sheet assets table(s).
  • The balance_sheet_assets table is also the only one where we have to manually identify "almost pure stepparents" for removal so maybe we're messing something up there.

Pruned and conflicting tags

  • The balance_sheet_assets_ferc1 table has 2 tags that conflict (is_in_rate_base is partial vs. yes). This seems like it's probably just a manual tagging error that we can resolve.
  • balance_sheet_assets_ferc1 also has a handful of nodes which were explicitly tagged, but that don't appear in the final calculation forest, either because they were orphans from the start or they got pruned. It might be that the tagged nodes which are getting removed are from inferred calculations, and it's not important that they're being removed.
  • It might also be that we are choosing to remove nodes that have been manually identified as important in the process of enforcing a hierarchical calculation tree, in which case we might need to choose other nodes to remove.

Conflicting weights & multiple possible calculation paths

  • The income_statement_ferc1 table has a handful of nodes with weights of -1.0 that get removed due to conflict.
  • This may not actually be an issue. We're not losing any leaf nodes in the process, and those leaves still have a valid calculation path to the root nodes that's being retained.
  • Right now we are changing the structure of the tree and the weights in independent processes.
  • The cases where a node shows up more than once with different weights are probably also cases in which they show up in more than one calculation with different parents.
  • How do we ensure that the calculation path we choose (mostly via the choice of which stepparents to prune) is consistent with which calculation weight we keep when there's more than one option?
  • Maybe in the tree representation the weights should actually be associated with the incoming edge, rather than the child node. 😲

Debugging via calculation consistency

  • There are lots of different reasons that calculation might not match up.
  • Sometimes the formulae that come from FERC are wrong!
  • Sometimes the data entered by utilities are internally inconsistent (e.g. due to a revision)
  • Sometimes utilities use the wrong version of the form, so row numbers + factoids don't align correctly.
  • We may have incorrectly inferred/corrected/constructed the calculation formula.
  • We may have introduced inconsistency between the calculation path and the retained weights.
  • How can we differentiate between these cases and use calculation mismatches to identify problems and fix them?

@zaneselvans zaneselvans marked this pull request as ready for review September 18, 2023 19:00
@zaneselvans
Copy link
Member

@e-belfer here's what I've been doing to generate the outputs.

Imports & pull some tables

import pandas as pd

import pudl

from pudl.workspace.setup import PudlPaths
from pudl.etl import defs
from pudl.output.ferc1 import (
    Exploder,
    MetadataExploder,
    NodeId,
    XbrlCalculationForestFerc1,
    EXPLOSION_CALCULATION_TOLERANCES,
)

clean_xbrl_metadata_json = defs.load_asset_value(AssetKey("clean_xbrl_metadata_json"))
metadata_xbrl_ferc1 = defs.load_asset_value(AssetKey("metadata_xbrl_ferc1"))
calculation_components_xbrl_ferc1 = defs.load_asset_value(
    AssetKey("calculation_components_xbrl_ferc1")
)
tags_df = defs.load_asset_value(AssetKey("_out_ferc1__explosion_tags"))

Define the input arguments:

explosion_args = {
    "income_statement_ferc1": {
        "root_table": "income_statement_ferc1",
        "table_names_to_explode": [
            "income_statement_ferc1",
            "depreciation_amortization_summary_ferc1",
            "electric_operating_expenses_ferc1",
            "electric_operating_revenues_ferc1",
        ],
        "calculation_tolerance": EXPLOSION_CALCULATION_TOLERANCES["income_statement_ferc1"],
        "seeds": [
            NodeId(
                table_name="income_statement_ferc1",
                xbrl_factoid="net_income_loss",
                utility_type="total",
                plant_status=pd.NA,
                plant_function=pd.NA,
            ),
        ],
        "tags": tags_df,
    },
    "balance_sheet_assets_ferc1": {
        "root_table": "balance_sheet_assets_ferc1",
        "table_names_to_explode": [
            "balance_sheet_assets_ferc1",
            "utility_plant_summary_ferc1",
            "plant_in_service_ferc1",
            "electric_plant_depreciation_functional_ferc1",
        ],
        "calculation_tolerance": EXPLOSION_CALCULATION_TOLERANCES["balance_sheet_assets_ferc1"],
        "seeds": [
            NodeId(
                table_name="balance_sheet_assets_ferc1",
                xbrl_factoid="assets_and_other_debits",
                utility_type=pd.NA,
                plant_status=pd.NA,
                plant_function=pd.NA,
            )
        ],
        "tags": tags_df,
    },
    "balance_sheet_liabilities_ferc1": {
        "root_table": "balance_sheet_liabilities_ferc1",
        "table_names_to_explode": [
            "balance_sheet_liabilities_ferc1",
            "retained_earnings_ferc1",
        ],
        "calculation_tolerance": EXPLOSION_CALCULATION_TOLERANCES["balance_sheet_liabilities_ferc1"],
        "seeds": [
            NodeId(
                table_name="balance_sheet_liabilities_ferc1",
                xbrl_factoid="liabilities_and_other_credits",
                utility_type=pd.NA,
                plant_status=pd.NA,
                plant_function=pd.NA,
            )
        ],
        "tags": tags_df,
    },
}

Function to create exploded assets

def exploded_table(
    root_table: str,
    table_names_to_explode: list[str],
    calculation_tolerance: float,
    seeds: list[NodeId],
    tags: pd.DataFrame,
):
    metadata_xbrl_ferc1 = defs.load_asset_value(AssetKey("metadata_xbrl_ferc1"))
    calculation_components_xbrl_ferc1 = defs.load_asset_value(
        AssetKey("calculation_components_xbrl_ferc1")
    )

    dfs_to_explode = {
        table: pd.read_sql(table, pudl_engine) for table in table_names_to_explode
    }

    exploder = Exploder(
        root_table=root_table,
        table_names=table_names_to_explode,
        metadata_xbrl_ferc1=metadata_xbrl_ferc1,
        calculation_components_xbrl_ferc1=calculation_components_xbrl_ferc1,
        seed_nodes=seeds,
        calculation_tolerance=calculation_tolerance,
        tags=tags,
    )
    return {
        "exploder": exploder,
        "exploded_data": exploder.boom(tables_to_explode=dfs_to_explode),
    }

test_explode = {}
for root_table in explosion_args:
    print(f"Exploding: {root_table}")
    test_explode[root_table] = exploded_table(**explosion_args[root_table])

Inspect one of the explosions

root_table = "balance_sheet_assets_ferc1"
exploder = test_explode[root_table]["exploder"]
exploded_data = test_explode[root_table]["exploded_data"]
calc_forest = exploder.calculation_forest
exploded_meta = calc_forest.exploded_meta
forest_as_table = calc_forest.forest_as_table
leafy_meta = calc_forest.leafy_meta

Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple abstraction-level questions that we should probably discuss, though largely I think we can merge this. It would probably also be good to discuss my thoughts on how to clean up the graph more upstream and whether they're totally off base or not.

.pre-commit-config.yaml Show resolved Hide resolved
src/pudl/output/ferc1.py Show resolved Hide resolved
src/pudl/output/ferc1.py Show resolved Hide resolved
src/pudl/output/ferc1.py Show resolved Hide resolved
src/pudl/output/ferc1.py Outdated Show resolved Hide resolved
src/pudl/output/ferc1.py Show resolved Hide resolved
src/pudl/output/ferc1.py Show resolved Hide resolved
exploded_calcs=self.exploded_calcs,
tags=self.tags,
)
# HACK alter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmgosnell made this note and I've also been confused by it.

src/pudl/output/ferc1.py Show resolved Hide resolved
src/pudl/output/ferc1.py Show resolved Hide resolved
@jdangerx
Copy link
Member

Most of my comments were non-blocking and we addressed the blocking one on a synchronous call - see note above.

There are things we might want to change, but those can happen as a different PR onto explode_ferc1 - no need to scope creep this PR.

@zaneselvans zaneselvans merged commit 51c5efa into explode_ferc1 Sep 20, 2023
9 checks passed
@cmgosnell cmgosnell deleted the explode_tree_fixes branch October 31, 2023 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. xbrl Related to the FERC XBRL transition
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Ensure calculation tree is complete and connected
4 participants