Apply new naming convention to PUDL assets #2765

Closed
12 tasks done
jdangerx opened this issue Jul 31, 2023 · 5 comments
@jdangerx
Member

jdangerx commented Jul 31, 2023

Pending approval of the design doc, we should actually apply the naming convention.

Scope is placeholder, to be fleshed out on design doc completion.

Scope

  1. 7 of 7
  2. 9 of 9
    bendnorman
  3. 5 of 5
  4. 128 of 128
    metadata
    e-belfer
  5. 13 of 13
  6. eia923 new-data rmi
  7. bendnorman
  8. 2 of 2
    datasette output
    bendnorman
jdangerx converted this from a draft issue Jul 31, 2023
@bendnorman
Member

bendnorman commented Aug 9, 2023

Hey @jdangerx! I was thinking about breaking up the work into two PRs off of dev:

  1. Rename the raw and intermediate core assets
  2. Rename the user-facing tables and documentation

I could also rename the user-facing tables in separate PRs to avoid a monster PR, though the table names on datasette and S3 will follow different naming conventions during the conversion. This might be fine given the tables already follow different conventions! I could start with the output tables, because we have a disclaimer that the current names are temporary, and finish with the core tables, which most users likely rely on. What do you think?

Also, at what point do you think we should whittle down the number of output assets persisted to the database? Maybe before we do an official data release, so users don't get attached to a table only to have it disappear down the line.

@jdangerx
Member Author

jdangerx commented Aug 9, 2023

I think splitting this into:

  • raw_, _core
  • core
  • _output, output (I guess you could put the _output as a separate PR earlier in the process, since it's also hidden from users)

Makes sense to me! I'm OK with having discrepancies on datasette vs. S3 since we're pretty clear about the nightly builds being unstable.

I like doing the non-user-facing stuff first as practice for the user-facing stuff - we can see what breaks when we try to publish the data with the new names, before it breaks stuff for more important tables...

I think it's a good idea to move as much stuff from output into _output as possible, before we make output tables accessible to the public. Putting something in output is basically a promise to users that it will stay there - we can always "promote" _output tables to output. Same for core, really, though I think we have less leeway there.
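To make the prefix scheme above concrete, here is a minimal sketch of how an asset name could be mapped to its layer and to whether it is user-facing. The function name, the assumption that every prefix is followed by an underscore, and the example table names are all hypothetical and are not taken from the PUDL codebase.

```python
from typing import NamedTuple


class AssetLayer(NamedTuple):
    layer: str
    user_facing: bool


# ASSUMPTION: each layer prefix is followed by an underscore in asset names.
# Only "core" and "output" are treated as user-facing, per the discussion.
_LAYERS = [
    ("raw_", AssetLayer("raw", False)),
    ("_core_", AssetLayer("_core", False)),
    ("_output_", AssetLayer("_output", False)),
    ("core_", AssetLayer("core", True)),
    ("output_", AssetLayer("output", True)),
]


def classify_asset(name: str) -> AssetLayer:
    """Return the layer for an asset name, or raise if no prefix matches."""
    for prefix, layer in _LAYERS:
        if name.startswith(prefix):
            return layer
    raise ValueError(f"{name!r} does not follow the naming convention")


# Hypothetical asset names, for illustration only.
print(classify_asset("_output_example__table"))
print(classify_asset("core_example__table"))
```

A check like this could run in CI to catch assets that land in a user-facing layer by accident, though whether that is worth doing is a judgment call.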

@bendnorman
Member

bendnorman commented Aug 15, 2023

Sounds good!

Based on feedback in #2503, people would prefer the changes to happen once. Here is a more detailed plan of how we can roll out these changes:

  1. Rename raw and intermediate core assets. Merge into dev. None of these assets are persisted to the database so this will not impact users.
  2. Rename output and core assets in a feature branch. This step will include converting some output assets that don't need to be user-facing to intermediate assets that aren't persisted in the database. We don't want to release a bunch of new tables in a tagged version and later remove them. This branch will also add a deprecation warning to PudlTabl. s3://intake.catalyst.coop should be renamed to s3://pudl.catalyst.coop (Rename AWS bucket #2574) prior to step 3.
  3. Before merging the feature branch into dev, merge dev into main, tag a version pre-naming-change, and run a full build. Users can rely on this tagged version as they migrate to use the new table names and remove PudlTabl.
  4. Merge the feature branch into dev, merge dev into main, and tag a release using our desired release naming convention: vYYYY.MM.DD.
  5. Help users migrate from depending on PudlTabl, old table names, and pinning to dev to relying on the new tables in the tagged pudl.sqlite database in the public S3 bucket.
  6. Once core users are migrated, deprecate PudlTabl.

How does this plan sound to y'all? @zaneselvans @arengel @grgmiller @gschivley?
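For the deprecation warning mentioned in step 2, a minimal sketch of the general pattern might look like the following. `PudlTablSketch` is a hypothetical stand-in, not the real PudlTabl class, and the warning text is an assumption.

```python
import warnings


class PudlTablSketch:
    """Hypothetical stand-in for PudlTabl; not the real class."""

    def __init__(self) -> None:
        # FutureWarning is shown to end users by default, whereas
        # DeprecationWarning is hidden unless triggered from __main__.
        warnings.warn(
            "PudlTabl is deprecated and will be removed. "
            "Read tables directly from the PUDL database instead.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller's line
        )


# Record the warning instead of printing it, to show that it fires.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    PudlTablSketch()

print(caught[0].category.__name__, "-", caught[0].message)
```

Warning at construction time means users see the message once per object rather than on every method call, which keeps notebooks readable while still being hard to miss.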

@bendnorman
Member

Ok! I think we're going to merge this thing in! Here is a todo list for rolling this out:

Release 2022 data with old names

  • Open a PR merging the latest commits that passed the nightly builds on dev into main.
  • The commits are in main. The 2022 data will be in main so we can do a data release!
  • Create and push a tag called v2023.12.04. This will kick off a full build.
  • When the build passes we'll have a directory called s3://pudl.catalyst.coop/v2023.12.04
  • Create a manual Zenodo data release with the tagged data (not entirely sure what this entails)
  • Notify known users that we'll be applying the rename to dev, so if they are pinned to data on datasette or pulling from the s3://pudl.catalyst.coop/dev directory, their code will break. To resolve the breakage, they can point their code towards the data in s3://pudl.catalyst.coop/v2023.12.04 or replace their references to PUDL tables using this sheet. Users who are still using PudlTabl and are pinned to dev will not be affected, because the PudlTabl methods have not been changed. However, we are planning on removing PudlTabl, and users should migrate to pulling the data directly from the database.
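For users who choose to update their table references, a rough sketch of applying an old-to-new name mapping to a SQL string could look like this. The table names in `RENAMES` are hypothetical placeholders, not entries from the actual rename sheet, and `rename_tables` is an illustrative helper rather than anything PUDL provides.

```python
import re

# HYPOTHETICAL old -> new names; the real mapping lives in the rename sheet.
RENAMES = {
    "old_table_a": "core_example__table_a",
    "old_table_b": "output_example__table_b",
}


def rename_tables(sql: str, renames: dict[str, str]) -> str:
    """Replace whole-word occurrences of old table names in one pass."""
    if not renames:
        return sql
    # Longest names first, so a short name never shadows a longer one.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    # A single substitution pass avoids renaming an already-renamed table.
    return pattern.sub(lambda match: renames[match.group(1)], sql)


query = "SELECT * FROM old_table_a JOIN old_table_b USING (id)"
print(rename_tables(query, RENAMES))
```

A regex pass like this only handles plain identifiers, so quoted names or names built dynamically in code would still need a manual check.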

Release PUDL with new names

  • Merge "Feature branch: Rename core + output assets to match new naming protocols" (#2818) into dev.
  • Once the nightly builds have a successful run, open a PR from dev into main.
  • Once the new names are in main, create a tag called v2023.12.{day}, push it, and wait for a build to pass.
  • Create a manual Zenodo data release once the build passes
  • Notify our users about the changes and fix breakages (example notebooks... tbd)

Questions

  • Should we do a code freeze on dev and main while we are working through these release mechanics?
  • How do we want to name these releases? They will likely happen on different days, so the names will differ anyway. If they land on the same day, we could use a suffix, like v2023.12.04.oldnames and v2023.12.04.
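To explore the naming question, here is a small sketch of building and parsing tags in the vYYYY.MM.DD form with an optional suffix. The helper names and the rule that a suffix is lowercase alphanumeric are assumptions made for illustration, not an agreed convention.

```python
import re
from datetime import date
from typing import Optional, Tuple

# ASSUMPTION: an optional suffix is lowercase letters/digits after a dot.
TAG_RE = re.compile(r"^v(\d{4})\.(\d{2})\.(\d{2})(?:\.([a-z0-9]+))?$")


def make_tag(day: date, suffix: Optional[str] = None) -> str:
    """Build a release tag such as v2023.12.04 or v2023.12.04.oldnames."""
    tag = f"v{day:%Y.%m.%d}"
    return f"{tag}.{suffix}" if suffix else tag


def parse_tag(tag: str) -> Tuple[date, Optional[str]]:
    """Split a tag into its date and optional suffix, validating both."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"{tag!r} is not a vYYYY.MM.DD tag")
    year, month, day_of_month, suffix = match.groups()
    # date() raises ValueError for impossible dates such as month 13.
    return date(int(year), int(month), int(day_of_month)), suffix


print(make_tag(date(2023, 12, 4)))
print(make_tag(date(2023, 12, 4), "oldnames"))
print(parse_tag("v2023.12.04.oldnames"))
```

One thing to keep in mind with the suffix approach: in a plain string sort, v2023.12.04.oldnames comes after v2023.12.04, so the old-names release would appear to be the newer of the two.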

bendnorman moved this from In progress to In review in Catalyst Megaproject Dec 4, 2023
e-belfer moved this from In review to Backlog in Catalyst Megaproject Dec 4, 2023
jdangerx moved this from Backlog to In progress in Catalyst Megaproject Dec 4, 2023
@jdangerx
Member Author

jdangerx commented Dec 4, 2023

Sounds like @zaneselvans and I will get cracking on the data release once #3086 is merged.
