
Restructure intro.rst and other pages for data warehouse #2912

Conversation

@aesharpe (Member) commented Oct 2, 2023

Still WIP.

Need more input on the ETL section of the intro.rst page! I think you can probably just go ahead and work off this branch to add it, @bendnorman. What do you think?

…info. Add three components of PUDL description
aesharpe requested a review from bendnorman on October 2, 2023 14:54
@bendnorman (Member) left a comment

Thank you @aesharpe! I propose we:

What do you think?

Two resolved review threads on README.rst.
Comment on README.rst, lines +64 to +92:
- **Raw Data Archives**

- We `archive <https://github.com/catalyst-cooperative/pudl-archiver>`__ all the raw
data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__
to ensure permanent, versioned access to the data. In the event that an agency
changes how they publish data or deletes old files, the ETL will still have access
to the original inputs. Each of the data inputs may have several different versions
archived, and all are assigned a unique DOI and made available through the REST API.
- **ETL Pipeline**

- The ETL pipeline (this repo) ingests the raw archives, cleans them, integrates
them, and outputs them to a series of tables stored in SQLite Databases, Parquet
files, and pickle files (the Data Warehouse). Each release of the PUDL Python
package is embedded with a set of DOIs to indicate which version of the raw
inputs it is meant to process. This helps ensure that the ETL and its
outputs are replicable.
- **Data Warehouse**

- The outputs from the ETL, sometimes called "PUDL outputs", are stored in a data
warehouse so that users can access the data without having to run any code. The
majority of the outputs are stored in ``pudl.sqlite``; however, CEMS data are stored
in separate Parquet files due to their large size. The warehouse also contains
pickled interim assets from the ETL process, should users want to access the data
at various stages of the cleaning process, and SQLite databases for the raw FERC
inputs.

For more information about each of the components, read our
`documentation <https://catalystcoop-pudl--2874.org.readthedocs.build/en/2874/intro.html>`__.
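
To make the warehouse description above concrete, here is a minimal sketch of reading the outputs locally with pandas. This is not from the PR itself: the table and file names are hypothetical placeholders (this PR is part of a broader table-renaming effort), and it assumes ``pudl.sqlite`` and a CEMS Parquet file have already been downloaded.

```python
# Minimal sketch (illustrative, not canonical): reading PUDL warehouse outputs.
# Assumes pudl.sqlite and an EPA CEMS Parquet file were downloaded locally;
# the table and file names below are hypothetical placeholders.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl.sqlite")

# Most cleaned outputs live in the single SQLite database.
plants = pd.read_sql_table("plants_eia860", engine)  # hypothetical table name

# The hourly CEMS data are too large for SQLite, so they ship as Parquet files.
cems = pd.read_parquet("hourly_emissions_epacems.parquet")  # hypothetical file name

print(plants.shape, cems.shape)
```
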
Member:

I think including this early in the README forces users to scroll through more text to get to the data access section, which I'm assuming is what they care about.

I think this type of architecture information is more important for contributors, who will be reading through the Development section of the docs.

Member Author:

That's a fair point. I don't want to assume users already know what they want, though, and this gives them the opportunity to understand what happens to the data before they use it. Maybe we could run this by some other people to see what they think.

Member Author:

We could also just copy what's in the intro and use that instead:

- **Raw Data Archives** (raw, versioned inputs)
- **ETL Pipeline** (code to process, clean, and organize the raw inputs)
- **Data Warehouse** (location where ETL outputs, both interim and final, are stored)
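
As a companion illustration for the "Raw Data Archives" bullet, here is a small, hedged sketch of listing the archived inputs through Zenodo's public REST API. The community slug and the response fields are assumptions based on the Zenodo link in the README text above, not something specified in this PR.

```python
# Hedged sketch: listing raw PUDL input archives via Zenodo's REST API.
# The community slug and response structure are assumptions; each archived
# input is expected to carry its own versioned DOI, per the README text.
import requests

resp = requests.get(
    "https://zenodo.org/api/records",
    params={"communities": "catalyst-cooperative", "size": 5},
    timeout=30,
)
resp.raise_for_status()

for hit in resp.json().get("hits", {}).get("hits", []):
    print(hit["metadata"]["title"], "->", hit.get("doi", "no DOI"))
```
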

Resolved (outdated) review thread on docs/dev/naming_conventions.rst.
@@ -74,13 +46,43 @@ needed and organize them in a local :doc:`datastore <dev/datastore>`.
.. _etl-process:

---------------------------------------------------------------------------------------
The Data Warehouse Design
The ETL Pipeline
Member:

I'm tempted to move this information to the development section of the docs. Do users actually care about the raw data archives, data warehouse, and data validation?

I'm thinking we could move the "The Data Warehouse Design" and "Data Validation" sections to the Data and ETL Design Guidelines page?

Member:

I'm a little hesitant to have both an ETL and a Data Warehouse section because they cover similar topics. I think it's easier to think about our data processing in terms of the raw, core, and output layers as opposed to ETL steps.

Member Author:

Ok! As long as the concept of the Data Warehouse doesn't get lost in the data processing description, I think that's fine. My concern with pulling out the Data Warehouse section was similar to your comment above about people being primarily concerned with Data Access: being able to jump straight to a Data Warehouse page or section might be nice. But depending on how we structure the rest of the docs, this might not be an issue.

@aesharpe (Member Author) commented Oct 4, 2023

If we do this, I would wonder what the purpose of this introduction page is. Maybe we don't need it? Idk... I do feel like some brief description of what's going on would be nice, as I don't think only developers would want to know this type of information. A lot of users might be curious what is actually happening to the data they are using in between raw and final. The Data and ETL Design Guidelines page feels a little bit hidden. Maybe we could take the mini paragraph descriptions for each section from the README page and put them in the intro instead of having longer descriptions there.

@bendnorman (Member):

I think you're right, I shouldn't assume users don't care about how the data is processed. In that case, what if we just keep the data warehouse / processing language from the create-naming-convention-docs branch on the intro page? Users can jump to the data access page if they like, or continue to read about the data processing steps:

To get started using PUDL data, visit our Data Access page, or continue reading to learn more about the PUDL data processing pipeline.

Or we can move the data warehouse design language to the ETL Guidelines page and just link to it in the intro page.

I think we're starting to bump up against larger unanswered questions about our docs that are out of scope for the docs renaming. To keep things simple, what if we:

- Use the README changes on this branch
- Keep the intro.rst page from the other branch
- Use the Naming Convention section changes from this branch

bendnorman added a commit that referenced this pull request Nov 1, 2023
@bendnorman (Member):

Changes in the branch were incorporated into #2874.

bendnorman closed this on Nov 7, 2023
bendnorman deleted the create-naming-convention-docs-austen branch on November 7, 2023 02:07
bendnorman added a commit that referenced this pull request Dec 16, 2023
…cols (#2818)

* Rename static tables

* Rename Census DP1 assets

* Test doc fix

* Update core table names for EIA 860, 923, harvested tables, FERC1, code

* Fix integration tests

* Fix alembic

* Rename 714, 861, epacems

* update tests and rest of assets

* Fix validation tests

* Rename ferc output assets

* Rename denorm_cash_flow_ferc1 and remove leading underscore from cross refs in pudl_db docs

* Rename a missing ferc output table and add migration

* Rename EIA denorm assets

* Recreate ferc rename migration

* Add docs cross ref fix for intermediate assets

* Resolve small denorm EIA rename issues

* Clean up notebooks

* Apply naming convention to allocate generation fuel assets

* Fix a missing gen fuel asset name in PudlTabl

* Update migrations post ferc1 output rename merge

* Update contributor facing documentation with new asset naming conventions

* Add new naming convention to user facing documentation

* Correct allocate-get-fuel down revision

* Apply new naming convention to ferc714 respondents, hourly demand and eia861 service territories

* Fix refs to renamed tables in release notes

* Rename ferc714 and eia861 output tables in integration tests

* Add missing balance authority fk migration

* Rename out_ferc714__fipsified_respondents to out_ferc714__respondents_with_fips

* Respond to first round of Austen's comments

* Update rename-core-assets and clarify raw asset sentence

* Restrict astroid version to avoid random autoapi error

* Reset migrations and fix old table refs in docs

* Fix names of inputs to exploded tables and xbrl calculation fixes

* Rename mcoe and ppl assets

* Fix small ppl migration issue

* Format and sort intermediate resource name cross refs in data dictionary

* Add upstream mcoe assets back to metadata

* Update stragler PudlTabl method name

* Add frequency to ppl asset name and some clean up

* rename six of the non-contreversial FERC1 tables (core + out)

* initial rename of the FERC1 core and out tables

* add db migration

* rename the ferc1 transformer classes in line with new table names

* Incorporate some docs changes from #2912

* FINAL FINAL rename of ferc assets

* ooooops remove the eia860m extraction edit bc that was not supposed to be in here ooop

* Remove README.rst from index.rst and move intro content to index

* Add deprecation warnings to PudlTabl and add minor naming docs updates

* Rename heat_rate_mmbtu_mwh -> heat_rate_mmbtu_mwh_by_unit

* Rename heat rate mmbtu mwh to follow existing naming convention

* Remove PudlTabl removal data and make assn table name sources alphabetical

* Explain why CEMS is stored as parquet

* Rename heat_rate_mmbtu_mwh_eia/ferc1 columns to unit_heat_rate_mmbtu_per_mwh_eia/ferc1

* Remove unused ppe_cols_to_grab variable

* Make association asset names more consistent

* Add association assset naming convention to docs

* Resolve migration issues with unit heat rate column

* Update conda-lock.yml and rendered conda environment files.

* Recreate heat rate migration revision

* Use pudl_sqlite_io_manager for fuel_cost_by_generator assets

* Update conda-lock.yml and rendered conda environment files.

* Checkout lock files from dev

* Update conda-lock.yml and rendered conda environment files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Remove intro.rst and update ferc s3 urls again

* Update conda-lock.yml and rendered conda environment files.

* Remove some old table names from metaddata

* Update conda-lock.yml and rendered conda environment files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Remove ref to non existant doc page, remove files no longer in dev

---------

Co-authored-by: bendnorman <[email protected]>
Co-authored-by: Bennett Norman <[email protected]>
Co-authored-by: Christina Gosnell <[email protected]>
Co-authored-by: bendnorman <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>