Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract 714 xbrl #3822

Merged
merged 20 commits into from
Sep 6, 2024
Merged

Extract 714 xbrl #3822

merged 20 commits into from
Sep 6, 2024

Conversation

aesharpe
Copy link
Member

@aesharpe aesharpe commented Aug 30, 2024

Overview

closes #3813
closes #3814
closes #3815

What problem does this address?

We don't currently extract XBRL tables for the FERC 714 data. This code adds the capability to do so.

What did you change?

pudl/extract/ferc714.py

  • Add _csv suffix to csv-specific extraction functions/variables.
  • Rename csv-specific encoding dictionary to FERC714_CSV_ENCODING
  • Add TABLE_NAME_MAP_FERC714 dictionary
  • Add a function called create_raw_ferc714_xbrl_assets to reference the raw xbrl tables in the FERC 714 SQLite db and call it.
  • Add a function called raw_ferc714_xbrl__metadata_json to make an asset for the metadata table

pudl/io_managers.py

  • Upgrade the FercXBRLSQLiteIOManager to accommodate more than just ferc1
  • Add a ferc714_xbrl_sqlite_io_manager function

pudl/settings.py

  • Add xbrl_years functino to the Ferc714Settings class to be able to access XBRL years.
  • Remove comments about not processing the XBRL data.

pudl/metadata/sources.py

  • Update the FERC714 working partitions.
  • Remove comments about not processing XBRL data.

Testing

How did you make sure this worked? How can a reviewer verify this?

Load each of the raw assets via and make sure they contain the expected years:

defs.load_asset_value("raw_ferc714_xbrl__table_name")

To-do list

src/pudl/extract/ferc714.py Outdated Show resolved Hide resolved
src/pudl/extract/ferc714.py Outdated Show resolved Hide resolved
src/pudl/io_managers.py Outdated Show resolved Hide resolved
@aesharpe aesharpe added the ferc714 Anything having to do with FERC Form 714 label Sep 3, 2024
@aesharpe aesharpe self-assigned this Sep 3, 2024
src/pudl/io_managers.py Outdated Show resolved Hide resolved
src/pudl/settings.py Outdated Show resolved Hide resolved
@aesharpe
Copy link
Member Author

aesharpe commented Sep 4, 2024

Unit test failures have something to do with this snipit I changed in io_managers.py:

ferc_settings = context.resources.dataset_settings.get_datasets()[
    self.db_name.replace("_xbrl", "")
]

The pytest test/unit/io_managers_test.py::test_ferc_xbrl_sqlite_io_manager_dedupes function does not like this.

UPDATE:

I fixed this by changing this snipit to:

ferc_settings = getattr(
    context.resources.dataset_settings, self.db_name.replace("_xbrl", "")
)

@aesharpe aesharpe marked this pull request as ready for review September 5, 2024 05:17
@aesharpe aesharpe added data-update When fresh data is integrated into PUDL from quarterly or annual updates xbrl Related to the FERC XBRL transition labels Sep 5, 2024
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two requests for changes in comments, plus the settings yml files need to updated! but those seem like minor tweaks overall this is looking great imo and I was able to materialize these xbrl raw assets locally

test/unit/io_managers_test.py Outdated Show resolved Hide resolved
src/pudl/settings.py Show resolved Hide resolved
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two blocking needs here:

to change
years = ", ".join(map(str, ferc714_settings.years))
to
years = ", ".join(map(str, ferc714_settings.csv_years))

in _extract_raw_ferc714_csv

and again to add the new years in the settings yml files.

@aesharpe
Copy link
Member Author

aesharpe commented Sep 5, 2024

and again to add the new years in the settings yml files.

The reason I didn't do this is because I think it will break the transforms and I didn't want to merge it into main that way.

@cmgosnell
Copy link
Member

The reason I didn't do this is because I think it will break the transforms and I didn't want to merge it into main that way.

hm this is a good point... but I think in this context because the only existing transforms are based on the csv extracted tables they will be fine. but still i hadn't put it in context that this was only the extraction so i will renege on this suggestion.

@cmgosnell cmgosnell added this pull request to the merge queue Sep 5, 2024
github-merge-queue bot pushed a commit that referenced this pull request Sep 5, 2024
* Add _csv suffix to current FERC714 raw assets and functions

* Add create_raw_ferc714_xbrl_assets and raw_ferc714_xbrl__metadata_json functions to FERC 714 extract module and include ferc714 io_manager.

* Update XBRL table dict to mimic the TABLE_NAME_MAP_FERC1 format and fix functions mimic those of FERC 1

* Add .get_datasets() to ferc_settings definition in FercXBRLSQLiteIOManager

* Update FERC714 working paritions and add xbrl_year function to Ferc714Settings

* Fix < to > in xbrl_years

* Change test output names to ferc1_xbrl from test_db

* Fix ferc_settings definition in FercXBRLSQLiteIOManager load_input function to fix unit tests

* Remove old FERC714_XBRL_FILES dict

* Add and update dictionary descriptions

* Update release notes

* Fix ferc io_managers and test

* Add csv_years property to Ferc714Settings

* add csv_years reference to raw_ferc714_csv_asset_factory function
@aesharpe aesharpe removed this pull request from the merge queue due to a manual request Sep 5, 2024
@zaneselvans
Copy link
Member

I think the integration test failures are probably because the tests don't know they need to run the FERC-714 XBRL to SQLite conversion before the main PUDL ETL can happen. You can follow the pattern that's used for the FERC Form 1 XBRL to SQLite conversion in test/conftest.py. These are the analogous fixtures:

@pytest.fixture(scope="session", name="ferc1_engine_xbrl")
def ferc1_xbrl_sql_engine(ferc1_xbrl_extract, dataset_settings_config) -> sa.Engine:
    """Grab a connection to the FERC Form 1 DB clone."""
    context = build_init_resource_context(
        resources={"dataset_settings": dataset_settings_config}
    )
    return ferc1_xbrl_sqlite_io_manager(context).engine


@pytest.fixture(scope="session", name="ferc1_xbrl_taxonomy_metadata")
def ferc1_xbrl_taxonomy_metadata(ferc1_engine_xbrl: sa.Engine):
    """Read the FERC 1 XBRL taxonomy metadata from JSON."""
    result = materialize_to_memory([raw_ferc1_xbrl__metadata_json])
    assert result.success

    return result.output_for_node("raw_ferc1_xbrl__metadata_json")

And then the PUDL IO Manager has a declared dependency on the FERC1 XBRL engine:

@pytest.fixture(scope="session")
def pudl_io_manager(
    ferc1_engine_dbf: sa.Engine,  # Implicit dependency
    ferc1_engine_xbrl: sa.Engine,  # Implicit dependency
    live_dbs: bool,
    pudl_datastore_config,
    dataset_settings_config,
    request,
) -> PudlMixedFormatIOManager:

@aesharpe
Copy link
Member Author

aesharpe commented Sep 6, 2024

@zaneselvans thank you for that suggestion! I just had to add one more fixture for ferc714_xbrl_extract and it was good to go.

@aesharpe
Copy link
Member Author

aesharpe commented Sep 6, 2024

mmm getting a unit test failure seemingly related to timeseries. Is timeseries cleaning part of the extraction or transformation phase @zaneselvans?

        for method in "tubal", "tnn":
            # Impute null values
            imputed0 = s.impute(mask=mask, method=method, rho0=1, maxiter=1)
            imputed = s.impute(mask=mask, method=method, rho0=1, maxiter=10)
            # Deviations between original and imputed values
            fit0 = s.summarize_imputed(imputed0, mask)
            fit = s.summarize_imputed(imputed, mask)
            # Mean MAPE (mean absolute percent error) is converging
>           assert fit["mape"].mean() < fit0["mape"].mean()
E           assert 0.030191300830775987 < 0.024084802339886784
E            +  where 0.030191300830775987 = mean()
E            +    where mean = 0    0.002769\n1    0.001304\n2    0.011850\n3    0.002106\n4    0.008777\n5    0.007833\n6    0.135690\n7    0.001992\n8    0.118109\n9    0.011482\nName: mape, dtype: float64.mean
E            +  and   0.024084802339886784 = mean()
E            +    where mean = 0    0.010499\n1    0.011460\n2    0.010525\n3    0.005897\n4    0.007852\n5    0.015753\n6    0.054820\n7    0.022598\n8    0.045421\n9    0.056023\nName: mape, dtype: float64.mean

test/unit/analysis/timeseries_cleaning_test.py:109: AssertionError
FAILED test/unit/analysis/timeseries_cleaning_test.py::test_flags_and_imputes_anomalies[5248964137-8991153078] - assert 0.030191300830775987 < 0.024084802339886784
 +  where 0.030191300830775987 = mean()
 +    where mean = 0    0.002769\n1    0.001304\n2    0.011850\n3    0.002106\n4    0.008777\n5    0.007833\n6    0.135690\n7    0.001992\n8    0.118109\n9    0.011482\nName: mape, dtype: float64.mean
 +  and   0.024084802339886784 = mean()
 +    where mean = 0    0.010499\n1    0.011460\n2    0.010525\n3    0.005897\n4    0.007852\n5    0.015753\n6    0.054820\n7    0.022598\n8    0.045421\n9    0.056023\nName: mape, dtype: float64.mean
===== 1 failed, 1645 passed, 1 skipped, 9 xfailed, 328 warnings in 40.74s ======
make: *** [Makefile:118: pytest-unit] Error 1

@zaneselvans
Copy link
Member

Oh, the timeseries issue is probably a rare stochastic failure -- it happens once in a blue moon, probably having something to do with the seed being used for the random number generator. If you re-run the unit tests I suspect they'll pass just fine.

@aesharpe
Copy link
Member Author

aesharpe commented Sep 6, 2024

Oh, the timeseries issue is probably a rare stochastic failure -- it happens once in a blue moon, probably having something to do with the seed being used for the random number generator. If you re-run the unit tests I suspect they'll pass just fine.

That is wild and also strangely comforting. I'm glad I didn't try and preemptively figure this out...haha

@aesharpe aesharpe added this pull request to the merge queue Sep 6, 2024
@zaneselvans
Copy link
Member

I spent a bunch of time at some point hunting down all the RNG initializations and tried to give them fixed seeds in the tests so we always get the same outputs, but I must have missed one somewhere, and it bites us every so often. It just didn't seem like it was worth continuing the hunt because it's so infrequent.

@zaneselvans
Copy link
Member

Especially in the unit tests, whenever there's something like this that seems totally out of left field and unrelated to what's being changed my first instinct is just to re-run the tests to see if it's real. Doesn't take much time or effort, and often it works.

Merged via the queue into main with commit 3435de7 Sep 6, 2024
17 checks passed
@aesharpe aesharpe deleted the extract-714-xbrl branch September 6, 2024 22:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-update When fresh data is integrated into PUDL from quarterly or annual updates ferc714 Anything having to do with FERC Form 714 xbrl Related to the FERC XBRL transition
Projects
Archived in project
3 participants