Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add naming new naming convention to docs #2874

Merged
merged 14 commits into from
Nov 10, 2023

Conversation

bendnorman
Copy link
Member

@bendnorman bendnorman commented Sep 20, 2023

PR Overview

This PR adds the new asset/table naming convention laid out in this design doc to the PUDL documents.

Changes in this PR:

  • Adds a note to data.catalyst.coop/pudl/ and data_access.rst that encourages users to only interact with out_ tables.
  • Removes portions docs/dev/data_guidelines.rst that aren't relevant anymore.
  • Describe the asset naming convention in docs/dev/naming_conventions.rst
  • Replaces ETL language in docs/intro.rst with new raw, core, output layers.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@bendnorman bendnorman requested a review from aesharpe September 20, 2023 19:39
@codecov
Copy link

codecov bot commented Sep 20, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c2af359) 88.6% compared to head (53d5618) 88.6%.
Report is 7 commits behind head on rename-core-assets.

Additional details and impacted files
@@                Coverage Diff                 @@
##           rename-core-assets   #2874   +/-   ##
==================================================
  Coverage                88.6%   88.6%           
==================================================
  Files                      91      91           
  Lines                   11019   11021    +2     
==================================================
+ Hits                     9771    9773    +2     
  Misses                   1248    1248           
Files Coverage Δ
src/pudl/output/pudltabl.py 88.7% <100.0%> (+0.2%) ⬆️

... and 7 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@aesharpe aesharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level comments:

  • Don't want to duplicate too much information between the intro page and the data naming conventions page.
  • I made suggestions to the naming conventions page before the introductions page, so any stuff fixed there should also be fixed in the introductions page too (or de-duplicated).
  • Worth thinking about the purpose of each of the docs pages. Docs are starting to get a little bit like a maze! (Not your fault or due to this PR specifically). We will probably address the bulk of this if/when we get the ComDev grant, but worth thinking about a little now. Specifically, the difference between Intro and Data Access.
  • Need to be careful to clarify what is in pudl.sqlite and what is not. Also center pudl.sqlite as the main resource.

Comment on lines 11 to 16
PUDL's primary data output is the ``pudl.sqlite`` database. It contains a collection
of tables that follow :ref:`PUDL's asset naming convention <asset-naming>`. Tables
with the ``core_`` prefix are normalized tables that serve as building blocks for the
more denormalized and easy to work with ``output_`` tables. **We recommend only working
with ``output_`` tables.**

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should consider the fact that many users may not know what normalized and denormalized data means in this context. It might make sense to get rid of the sentence

Tables with the ``core_`` prefix are normalized tables that serve as building blocks for the more denormalized and easy to work with ``output_`` tables.

and just say

We recommend working with tables with the ``output_`` prefix as these tables contain the most complete data. For more information about the different types of tables, read through the naming conventions.

Or something like that?

docs/dev/naming_conventions.rst Show resolved Hide resolved
docs/dev/naming_conventions.rst Outdated Show resolved Hide resolved
docs/dev/naming_conventions.rst Outdated Show resolved Hide resolved
docs/dev/naming_conventions.rst Show resolved Hide resolved
docs/intro.rst Outdated Show resolved Hide resolved
docs/dev/naming_conventions.rst Outdated Show resolved Hide resolved
docs/intro.rst Outdated Show resolved Hide resolved
docs/intro.rst Outdated Show resolved Hide resolved
docs/intro.rst Outdated Show resolved Hide resolved
@bendnorman bendnorman marked this pull request as draft September 26, 2023 14:20
@bendnorman
Copy link
Member Author

@aesharpe I incorporated most of the changes from #2912.

I feel good about the formatting and making a distinction between the data warehouse and ETL. The flow of the docs still feel a little awkward but I don't think it's due to the rename changes. I think an issue for another time.

Copy link
Member

@aesharpe aesharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I think we are getting there. Just a few more comments about slimming stuff down and removing duplicate information.

docs/intro.rst Outdated Show resolved Hide resolved
docs/intro.rst Outdated Show resolved Hide resolved
docs/intro.rst Outdated Show resolved Hide resolved
README.rst Show resolved Hide resolved
docs/data_access.rst Outdated Show resolved Hide resolved
are stored in a data warehouse as a collection of SQLite and Parquet files so that
users can access the data without having to run any code. Learn more about how to
access the data `here <https://catalystcoop-pudl.readthedocs.io/en/dev/data_access.html>`__.

What data is available?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment in intro page regarding data sources

README.rst Outdated Show resolved Hide resolved
docs/intro.rst Outdated Show resolved Hide resolved
README.rst Show resolved Hide resolved
@bendnorman
Copy link
Member Author

I think the root issue of our docs maze is it’s not clear where the starting point is for users and contributors. Is it:

Should the starting point be different for contributors and users?

@aesharpe
Copy link
Member

aesharpe commented Nov 6, 2023

I think the root issue of our docs maze is it’s not clear where the starting point is for users and contributors. Is it:

Should the starting point be different for contributors and users?

Yeah.....this is so true! Is it possible to not have the README be the first page of the docs? That might help with the confusion if they were more seperate.

@bendnorman
Copy link
Member Author

Yes it is possible to not include the README in the docs which might be a good option.

@bendnorman
Copy link
Member Author

Ok! I made a few changes:

  1. combine who is PUDL for What is PUDL in README.rst
  2. No longer include README.rst in the docs and moved the content of intro.rst into index.rst
  3. Move the link to the data access page to the “Available Data” section of the index.rst page

@aesharpe
Copy link
Member

aesharpe commented Nov 8, 2023

Hooray @bendnorman this looks great! My only suggestion is teeny tiny and it's to add a little caveat that CEMS data is stored in parquet files and not the sqlite db because it's so big.

I'm referring to the spot on the into (now index) page where it says:

PUDL’s clean and complete versions of these data sources are stored in the pudl.sqlite database and core_epacems__hourly_emissions.parquet files.

@bendnorman
Copy link
Member Author

Wahoo! Made a change that explains larger datasets live in parquet files.

@bendnorman bendnorman marked this pull request as ready for review November 9, 2023 19:04
…lease-notes

Add naming convention change to release notes
@bendnorman bendnorman merged commit cb9b188 into rename-core-assets Nov 10, 2023
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants