Extract more data from FERC XBRLs and handle that new data in ETL #2821

jdangerx · 2023-09-01T20:07:50Z

Changes required to get the FERC 1 assets materializing properly, while pointed at the new XBRL extractor (for a more-complete xbrl2sqlite) and the new XBRL archives.

- Change how we find the filings to run xbrl2sqlite on.
- Update how we call ferc-xbrl-extractor to match what's on the api_compat branch.
- Change num_transmission_circuits from int to number. This value contains "0.0", which int can't parse but float can handle. Potentially we can do a pre-conversion from string to float, and then convert that to int when we're applying dtypes. This was breaking transmission_statistics_ferc1.
- Drop end-of-previous-year values for the instant table that feeds utility_plant_summary_ferc1.
- Add total and other as utility type categories across the board. These were being turned into NAs because they were "unrecognizable" categories, which then led to spurious dupes down the line and broke electric_plant_depreciation_changes_ferc1.
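The string-to-float-to-int idea mentioned for num_transmission_circuits can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the sample values are made up, and the use of pandas' nullable `Int64` dtype is an assumption about how missing values would be kept.

```python
import pandas as pd

# Hypothetical raw values, shaped like what might come out of the XBRL-derived table.
raw = pd.Series(["0.0", "2.0", None, "1"], name="num_transmission_circuits")

# Parsing "0.0" directly as an int fails, which is the problem described above.
try:
    int("0.0")
except ValueError as err:
    print(f"direct int() conversion fails: {err}")

# Step 1: parse the strings as floats; missing values become NaN.
as_float = pd.to_numeric(raw)

# Step 2: cast to the nullable integer dtype so missing values survive as <NA>.
as_int = as_float.astype("Int64")

print(as_int)
```

The cast in step 2 only succeeds when every non-missing float is a whole number, so a stray value like "1.5" would raise instead of being silently truncated.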

One big change we made in the extractor itself was to take multiple filings from one entity and merge them, treating later filings as updates to earlier filings.
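As a rough sketch of the "later filings update earlier filings" idea, assuming facts can be keyed by entity and fact name: the column names, the sample data, and the `merge_filings` helper below are all hypothetical and do not reflect how ferc-xbrl-extractor actually implements this. The sketch orders by a publish time, whereas the TODO below notes the code currently relies on the ReportDate fact.

```python
import pandas as pd

# Hypothetical facts from two filings by the same entity.
facts = pd.DataFrame(
    {
        "entity_id": ["C000001"] * 4,
        "filing_name": ["original", "original", "resubmission", "resubmission"],
        "publish_time": pd.to_datetime(
            ["2022-04-01", "2022-04-01", "2022-06-15", "2022-06-15"]
        ),
        "fact_name": [
            "plant_capacity_mw",
            "net_generation_mwh",
            "net_generation_mwh",
            "fuel_cost",
        ],
        "value": [100.0, 5000.0, 5200.0, 42.0],
    }
)


def merge_filings(facts: pd.DataFrame) -> pd.DataFrame:
    """Keep the most recently published value for each (entity, fact) pair."""
    return (
        facts.sort_values("publish_time", kind="stable")
        .drop_duplicates(subset=["entity_id", "fact_name"], keep="last")
        .sort_values(["entity_id", "fact_name"])
        .reset_index(drop=True)
    )


merged = merge_filings(facts)
print(merged[["fact_name", "filing_name", "value"]])
```

Facts that only appear in the earlier filing are retained, facts that appear in both take the later value, and facts that only appear in the later filing are added.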

TODO:

- Remove the "grab last filing only" logic. This should be obviated by the "take multiple filings and merge them" logic mentioned above.
  - We do still have to refactor to pass the report publish time into ferc-xbrl-extractor instead of relying on the ReportDate fact, but I think that can be a follow-up PR.
- Fix unit tests that touch our extract.xbrl logic, since we changed the logic quite a lot.
- Release api-compat as ferc-xbrl-extractor 1.0, and point PUDL at that instead of the git ref we have now.
- See if we can standardize on new-style or old-style archives. This code is currently written to handle "new-style" archives that have the taxonomies archived separately from the XBRL files, but all archives except for 10.5072/zenodo.1234455 are still old-style.

src/pudl/metadata/fields.py

src/pudl/transform/ferc1.py

pyproject.toml

jdangerx

Mostly a TODO list for further review / massage.

src/pudl/output/ferc1.py

src/pudl/output/ferc714.py

src/pudl/transform/classes.py

migrations/versions/11a43f756905_idk.py

migrations/versions/273a78878b74_purchased_storage_mwh.py

src/pudl/analysis/ferc1_eia_train.py

jdangerx · 2023-10-04T19:54:31Z

OK, this is in the commit message, but I went ahead and committed some changes. @aesharpe let me know if these are reasonable:

Totally new:

18012: pjm interconnection, llc / total
18013: new york state electric & gas corporation / see footnote
18014: southwest power pool, inc. / total
18015: public service company of colorado / community solar gardens
18016: the empire district electric company / n/a
18017: wisconsin electric power company / see footnote
18018: upper michigan energy resources company (pudl determined) / total
18019: new york transco, llc / total
18020: wilderness line holdings, llc / total
18021: mt. carmel public utility co / total

Mapped to existing PUDL ID:

8671: pacific gas & electric company, small hydroelectric generating plants
15000: idaho power company / hydro
15001: idaho power company / internal combustion
15068: public service company of colorado / conventional hydro
12926: midamerican energy company / ida grove ii wind farm (8 units at 2.3 mw each & 73 units at 2.52 mw each)
1287: alaska electric light and power company / salmon creek hyrdo

Note the misspelling of the plant name in 1287.

Changed:

15031: mt. carmel public utility co / not applicable -> ameren illinois company / not applicable

This record was labeled Mt. Carmel, but its utility_id_ferc is 222, which corresponds to Ameren, not Mt. Carmel (397).

codecov · 2023-10-04T21:01:12Z

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (67822df) 88.6% compared to head (2abf505) 88.5%.
Report is 1 commits behind head on dev.

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #2821     +/-   ##
=======================================
- Coverage   88.6%   88.5%   -0.1%     
=======================================
  Files         90      90             
  Lines      10809   10795     -14     
=======================================
- Hits        9577    9563     -14     
  Misses      1232    1232

| Files | Coverage Δ | |
|---|---|---|
| src/pudl/analysis/ferc1_eia_train.py | `53.8% <100.0%> (+0.8%)` | ⬆️ |
| src/pudl/extract/xbrl.py | `95.5% <100.0%> (-1.6%)` | ⬇️ |
| src/pudl/metadata/classes.py | `86.4% <ø> (ø)` | |
| src/pudl/output/ferc714.py | `96.2% <100.0%> (ø)` | |
| src/pudl/transform/classes.py | `94.6% <100.0%> (+<0.1%)` | ⬆️ |
| src/pudl/transform/ferc1.py | `96.6% <100.0%> (+<0.1%)` | ⬆️ |
| src/pudl/transform/params/ferc1.py | `100.0% <ø> (ø)` | |
| src/pudl/workspace/datastore.py | `77.1% <100.0%> (ø)` | |


The new extractor added some data to the 2021 XBRL archives, which caused some integration and validation test failures. I added some plants to the pudl_id mapping spreadsheet, all of which are considered totals: they are not real plants, but we map them for the sake of giving them an ID (they are not connected to EIA records), because this is how we treat other total records reported to FERC1. This also updates the way values are assigned to a slice of the ferc1_eia_train output spreadsheets: NA values were causing an issue, so I had to change how the values were converted. Finally, this updates the test_minmax_rows test to reflect the new rows in the 2021 data.

Some data are missing due to messy deduplication (#2822), but we'll do the deduplication better in #2899.

zaneselvans

The integration tests are passing so I assume this isn't an issue, but I was confused by the removal of the ferc_xbrl fixture. Did it get replaced by something previously, but not ripped out?

zaneselvans · 2023-10-06T19:22:58Z

test/conftest.py

-@pytest.fixture(scope="session")
-def ferc_xbrl(
-    live_dbs,
-    ferc_to_sqlite_settings,
-    pudl_datastore_fixture,
-):
-    """Extract XBRL filings and produce raw DB+metadata files.
-
-    Extracts a subset of filings for each form for the year 2021.
-    """
-    if not live_dbs:
-        year = 2021
-
-        # Prep datastore
-        datastore = FercXbrlDatastore(pudl_datastore_fixture)
-
-        # Set step size for subsetting
-        step_size = 5
-
-        for form in XbrlFormNumber:
-            raw_archive, taxonomy_entry_point = datastore.get_taxonomy(year, form)
-
-            sqlite_engine = _get_sqlite_engine(form.value, True)
-
-            form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form)
-
-            # Extract every fifth filing
-            filings_subset = datastore.get_filings(year, form)[::step_size]
-            xbrl.extract(
-                filings_subset,
-                sqlite_engine,
-                raw_archive,
-                form.value,
-                requested_tables=form_settings.tables,
-                batch_size=len(filings_subset) // step_size + 1,
-                workers=step_size,
-                # TODO(janrous): the following should ideally be provided by some
-                # ferc dataset metadata object rather than encoding this in settings.
-                datapackage_path=PudlPaths().output_file(
-                    f"ferc{form.value}_xbrl_datapackage.json"
-                ),
-                metadata_path=PudlPaths().output_file(
-                    f"ferc{form.value}_xbrl_taxonomy_metadata.json"
-                ),
-                archive_file_path=taxonomy_entry_point,
-            )
-
-


Without this fixture, how are the FERC XBRL databases being generated for use in the ETL tests, and how are we doing integration testing to ensure that we're able to extract data from all the forms? Is this just cruft that's been replaced by other fixtures now?

These guys are being generated by the ferc_to_sqlite_xbrl_only fixture, now.
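For reference, the subsetting arithmetic in the removed fixture can be reproduced in isolation. This is only a sketch: `plan_subset` is a hypothetical helper, the filing names are made up, and the batch count assumes batches are simply consecutive chunks of `batch_size` filings, which may not match how `xbrl.extract` splits work internally.

```python
import math


def plan_subset(filings: list[str], step_size: int = 5) -> tuple[list[str], int, int]:
    """Mirror the removed fixture's slicing and batch-size arithmetic."""
    # Take every step_size-th filing, as in datastore.get_filings(...)[::step_size].
    subset = filings[::step_size]
    # Same expression the fixture passed as batch_size.
    batch_size = len(subset) // step_size + 1
    # Assumed: batches are consecutive chunks of at most batch_size filings.
    n_batches = math.ceil(len(subset) / batch_size) if subset else 0
    return subset, batch_size, n_batches


filings = [f"filing_{i:03d}" for i in range(23)]
subset, batch_size, n_batches = plan_subset(filings)
print(subset, batch_size, n_batches)
```

Under that chunking assumption, the number of batches never exceeds `step_size`, which is also the value the fixture passed as `workers`.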

jdangerx changed the title ~~2810 run 2021 ferc 1 data through new more complete extractor~~ Extract more data from FERC XBRLs and handle that new data in ETL Sep 1, 2023

jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch 2 times, most recently from 560b394 to 8762269 Compare September 8, 2023 21:41

zaneselvans reviewed Sep 8, 2023

src/pudl/metadata/fields.py (outdated)

zaneselvans reviewed Sep 8, 2023

src/pudl/transform/ferc1.py

zaneselvans added the ferc1 (Anything having to do with FERC Form 1) and xbrl (Related to the FERC XBRL transition) labels Sep 8, 2023

zaneselvans mentioned this pull request Sep 11, 2023

Bump catalystcoop-ferc-xbrl-extractor from 0.8.3 to 1.0.0 #2845

Closed

zaneselvans linked an issue Sep 11, 2023 that may be closed by this pull request

Run 2021 FERC 1 data through new, more complete extractor #2810

Closed

jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from 529c10d to 1eaef5a Compare September 13, 2023 16:58

jdangerx marked this pull request as ready for review September 13, 2023 19:35

jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch 2 times, most recently from 3d6912b to 8676c93 Compare September 18, 2023 19:03

zaneselvans reviewed Sep 22, 2023

pyproject.toml (outdated)

zaneselvans mentioned this pull request Sep 23, 2023

Fix issues arising from pandas v2.1 & ferc-xbrl-extractor v1.1.1 #2854

Merged

8 tasks

e-belfer assigned jdangerx Oct 2, 2023

jdangerx commented Oct 3, 2023

jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from cf34e93 to 8a77426 Compare October 3, 2023 20:24

zschira and others added 10 commits October 6, 2023 09:53

Update to use new version of ferc-xbrl-extractor

66044a4

Fix issues arising from stricter typing used in pandas 2.1

6f5a0af

Use integer transmission circuits.

8f0ed65

Use pd.to_numeric() to convert numeric strings to numeric dtypes.

103cbd3

Remove obsolete references to ferc1_schema tests.

534729d

Get rid of an unnecessary temporary variable.

6613941

Remove migrations that were already part of dev

ed9bf32

Update validation test expectations.

6f37ca8

jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from f3674c0 to 6f37ca8 Compare October 6, 2023 13:53

Add clarification on a confusing bit of code.

5a06b41

jdangerx requested a review from zaneselvans October 6, 2023 14:10

jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from 4d35e21 to 58cd41b Compare October 6, 2023 14:17

jdangerx requested a review from aesharpe October 6, 2023 15:38

Filtering pudl_id_mapping.xlsx in LibreOffice changes the actual file…

2abf505

… on disk???

jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from 58cd41b to 2abf505 Compare October 6, 2023 16:53

zaneselvans approved these changes Oct 6, 2023

jdangerx merged commit e36cec5 into dev Oct 6, 2023
11 checks passed

zaneselvans deleted the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch October 6, 2023 20:23

zaneselvans mentioned this pull request Oct 12, 2023

Merge dev into main for 2023-10-12 #2937

Merged
