Allow a mix of Zenodo sandbox & production DOIs #2798

zaneselvans · 2023-08-19T01:48:57Z

PR Overview

Okay I did this off the clock since it has been driving me a little bit nuts and I wanted to do something technical that felt easy and satisfying as a break from the never-ending saga of #2016.

Historically we've required that all Zenodo DOIs in the datastore come entirely from either the Sandbox or the Production server. That makes testing a single new archive on its own a hassle, and adds complexity across the whole application with switches for sandbox vs. not-sandbox data sources.

This commit removes this requirement, and allows a mix of sandbox and production DOIs to be used in development.

I also removed some very sparse documentation about how to create an archive in the Datastore by hand. I think it was very old, probably no longer supported, and certainly not being tested, so it seemed likely to confuse and frustrate anyone who actually tried to follow it.

There's a unit test which checks that all DOIs are production rather than sandbox, to make it difficult to accidentally check in code that refers to unofficial input data.

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures.
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

docs/dev/datastore.rst

README.rst

zaneselvans · 2023-08-19T02:05:01Z

test/unit/workspace/datastore_test.py

+            doi = ds.get_doi(dataset)
+            self.assertFalse(
+                re.fullmatch(r"10\.5072/zenodo\.[0-9]{5,10}", doi),
+                msg=f"Zenodo sandbox DOI found for {dataset}: {doi}",
+            )


This ensures we don't accidentally leave any sandbox DOIs in the codebase.
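The check in the diff above can be sketched on its own, outside the test class. This is a minimal illustration, not the project's test: `find_sandbox_dois` and the `mixed` mapping are hypothetical names, and only the regex and the two example DOIs come from this PR.

```python
import re

# Sandbox DOIs use the 10.5072 prefix; production DOIs use 10.5281.
# Same pattern as in the unit test diff above.
SANDBOX_DOI = re.compile(r"10\.5072/zenodo\.[0-9]{5,10}")


def find_sandbox_dois(dois: dict[str, str]) -> dict[str, str]:
    """Return the dataset -> DOI entries that point at the Zenodo sandbox."""
    return {
        name: doi for name, doi in dois.items() if SANDBOX_DOI.fullmatch(doi)
    }


# Hypothetical mapping mixing one production and one sandbox DOI.
mixed = {
    "eia860": "10.5281/zenodo.8164776",
    "ferc1": "10.5072/zenodo.1070868",
}
offenders = find_sandbox_dois(mixed)
```

A test built this way can fail with the full list of offending datasets rather than stopping at the first one.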

zaneselvans · 2023-08-19T02:06:34Z

src/pudl/workspace/datastore.py

+        # Sandbox DOIs are provided for reference
+        "censusdp1tract": "10.5281/zenodo.4127049",
+        # "censusdp1tract": "10.5072/zenodo.674992",
+        "eia860": "10.5281/zenodo.8164776",
+        # "eia860": "10.5072/zenodo.1222854",
+        "eia860m": "10.5281/zenodo.8188017",
+        # "eia860m": "10.5072/zenodo.1225517",
+        "eia861": "10.5281/zenodo.8231268",
+        # "eia861": "10.5072/zenodo.1229930",
+        "eia923": "10.5281/zenodo.8172818",
+        # "eia923": "10.5072/zenodo.1217724",
+        "eia_bulk_elec": "10.5281/zenodo.7067367",
+        # "eia_bulk_elec": "10.5072/zenodo.1103572",
+        "epacamd_eia": "10.5281/zenodo.7900974",
+        # "epacamd_eia": "10.5072/zenodo.1199170",
+        "epacems": "10.5281/zenodo.6910058",
+        # "epacems": "10.5072/zenodo.672963",
+        "ferc1": "10.5281/zenodo.7314437",
+        # "ferc1": "10.5072/zenodo.1070868",
+        "ferc2": "10.5281/zenodo.8006881",
+        # "ferc2": "10.5072/zenodo.1188447",
+        "ferc6": "10.5281/zenodo.7130141",
+        # "ferc6": "10.5072/zenodo.1098088",
+        "ferc60": "10.5281/zenodo.7130146",
+        # "ferc60": "10.5072/zenodo.1098089",
+        "ferc714": "10.5281/zenodo.7139875",
+        # "ferc714": "10.5072/zenodo.1098302",


At some point I think we agree the DOIs should come out of the codebase and go into a settings file, but I'm not trying to do that in this PR. I left the sandbox DOIs here and commented out for easy reference if someone wants to test out one of them, or look up which Zenodo archive is referenced in the sandbox.

I wonder if turning this into a pydantic settings object that could read from env variables (e.g. PUDL_FERC1_DOI) could be a good way to pass sandbox values here, with production defaults as... well, defaults :)

I think we need to store the DOIs in a file in the repo (which could be used to populate env vars) so we can look them up for cache invalidation, and easily edit them eventually with PRs when new archives become available.

But for this PR I just want to get to where we can have mixed sandbox/production DOIs to make integrating new archives by hand this fall easy.

Using a BaseSettings model was easy! Now the DOIs all get validated automatically by Pydantic, and they can optionally be set using environment variables too.
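To make the "production defaults, overridable from the environment" idea concrete, here is a standard-library stand-in. It is not the Pydantic BaseSettings class from this PR: `load_dois`, `DEFAULT_DOIS`, and the `PUDL_<DATASET>_DOI` naming (extrapolated from the PUDL_FERC1_DOI example above) are assumptions for illustration.

```python
import os
import re

# Accept either the sandbox (10.5072) or production (10.5281) prefix.
DOI_PATTERN = re.compile(r"10\.(5072|5281)/zenodo\.\d+")

# Production defaults, copied from the diff above (two datasets only).
DEFAULT_DOIS = {
    "eia860": "10.5281/zenodo.8164776",
    "ferc1": "10.5281/zenodo.7314437",
}


def load_dois(environ=None) -> dict[str, str]:
    """Return dataset -> DOI, letting PUDL_<DATASET>_DOI override the default."""
    environ = os.environ if environ is None else environ
    dois = {}
    for dataset, default in DEFAULT_DOIS.items():
        doi = environ.get(f"PUDL_{dataset.upper()}_DOI", default)
        # Validate every value, whether it came from a default or an override.
        if not DOI_PATTERN.fullmatch(doi):
            raise ValueError(f"Invalid Zenodo DOI for {dataset}: {doi}")
        dois[dataset] = doi
    return dois


# Point ferc1 at a sandbox archive while eia860 stays on production.
dois = load_dois({"PUDL_FERC1_DOI": "10.5072/zenodo.1070868"})
```

Passing a mapping instead of reading `os.environ` directly keeps the sketch easy to exercise in a test without touching the real environment.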

zaneselvans · 2023-08-19T02:08:31Z

src/pudl/workspace/datastore.py

+        if doi_prefix == "10.5072":
+            api_root = self.API_ROOT["sandbox"]
+        elif doi_prefix == "10.5281":
+            api_root = self.API_ROOT["production"]
+        else:
+            raise ValueError(f"Invalid Zenodo DOI: {doi}")
+        return f"{api_root}/deposit/depositions/{zenodo_id}"


I'm sure there's a more elegant way of switching between production and sandbox on a per-dataset basis (rather than the whole instance of the class being tied to one or the other) but this seems relatively self-contained and not terrible for the moment.

I messed around with creating a DOI class:

    class ZenodoDoi(BaseModel):
        """A class defining useful validations and methods for working with Zenodo DOIs."""

        doi: constr(regex=r"^10\.(5072|5281)/zenodo\.[\d]+$")  # noqa: F722

        def __str__(self: Self) -> str:
            """String representation of the DOI"""
            return self.doi

        @property
        def is_prod(self: Self) -> bool:
            """Return True if DOI is from Zenodo production server, False otherwise."""
            if self.doi.startswith("10.5281/zenodo"):
                return True
            else:
                assert self.doi.startswith("10.5072/zenodo")
                return False

        @property
        def token(self: Self) -> str:
            """Zenodo read-only personal access token corresponding to this DOI.

            Zenodo tokens recorded here should have read-only access to our archives.
            Including them here is correct in order to allow public use of this tool, so
            long as we stick to read-only keys.
            """
            # Read-only personal access tokens for [email protected]:
            if self.is_prod:
                return "KXcG5s9TqeuPh1Ukt5QYbzhCElp9LxuqAuiwdqHP0WS4qGIQiydHn6FBtdJ5"
            else:
                return "qyPC29wGPaflUUVAv1oGw99ytwBqwEEdwi4NuUrpwc3xUcEwbmuB4emwysco"

        @property
        def zenodo_id(self: Self) -> str:
            """The Zenodo deposition ID, extracted from the DOI."""
            match = re.search(r"(10\.5072|10\.5281)/zenodo.([\d]+)", self.doi)
            return match.groups()[1]

        @property
        def api_root(self: Self) -> HttpUrl:
            """Return appropriate production or sandbox Zenodo API root URL."""
            if self.is_prod:
                return "https://zenodo.org/api"
            else:
                return "https://sandbox.zenodo.org/api"

        @property
        def url(self: Self) -> HttpUrl:
            """Zenodo URL corresponding to this DOI."""
            return f"{self.api_root}/deposit/depositions/{self.zenodo_id}"

zaneselvans · 2023-08-19T02:09:40Z

src/pudl/workspace/datastore.py

-        help="Override pudl_in directory, defaults to setting in ~/.pudl.yml",
+        help="Input directory to use, overridng the $PUDL_INPUT environment variable.",


I think there are some other lingering references to .pudl.yml floating around that we should chase down now that we've switched over to using $PUDL_INPUT and $PUDL_OUTPUT entirely, but that's for another PR.

Do we actually use this command line flag or can we fully rely on env variables here?

I don't know. If anywhere I think it would be in the tests. I grepped a bit and didn't see it anywhere.

I took a look around and also only see "manually run this script to cache datastore locally." So my vote is to use the env vars completely.

Okay, I'll remove this and see if everything keeps working!

But more generally I think @rousik agreed to take on the "scour the repo for remaining references to .pudl.yml and friends" task.

codecov · 2023-08-19T03:31:08Z

Codecov Report

Patch coverage: 93.6% and no project coverage change.

Comparison is base (77aa2f4) 88.5% compared to head (d840c1d) 88.5%.
Report is 5 commits behind head on dev.

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2798   +/-   ##
=====================================
  Coverage   88.5%   88.5%           
=====================================
  Files         90      90           
  Lines      10126   10152   +26     
=====================================
+ Hits        8964    8988   +24     
- Misses      1162    1164    +2

Files Changed	Coverage Δ
src/pudl/cli/etl.py	`56.5% <ø> (-1.0%)`	⬇️
src/pudl/ferc_to_sqlite/cli.py	`71.7% <ø> (-0.8%)`	⬇️
src/pudl/metadata/classes.py	`86.5% <ø> (ø)`
src/pudl/resources.py	`100.0% <ø> (ø)`
src/pudl/workspace/datastore.py	`76.2% <93.6%> (+1.7%)`	⬆️

docs/dev/datastore.rst

rousik · 2023-08-19T02:24:12Z

src/pudl/workspace/datastore.py

+        if "sandbox" in url:
+            token = self.TOKEN["sandbox"]
+        else:
+            token = self.TOKEN["production"]


This is quite trivial code, but I think that for readability it might be better to extract this into a self.get_token(url) method. That would make it more testable and certainly more readable here.

You could even inline self.get_token(url) below.
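A rough sketch of what such a helper might look like as a standalone function. The token strings are placeholders, and checking the hostname instead of testing `"sandbox" in url` is a variation I'm assuming here, not what the diff does.

```python
from urllib.parse import urlparse

# Placeholder values; the real read-only tokens live in ZenodoFetcher.TOKEN.
TOKEN = {
    "sandbox": "sandbox-read-only-token",
    "production": "production-read-only-token",
}


def get_token(url: str) -> str:
    """Pick the access token matching the Zenodo server named in the URL."""
    host = urlparse(url).hostname or ""
    # Looking only at the hostname avoids a false positive if "sandbox"
    # ever shows up elsewhere in the path or query string.
    if host.startswith("sandbox."):
        return TOKEN["sandbox"]
    return TOKEN["production"]


sandbox_token = get_token("https://sandbox.zenodo.org/api/deposit/depositions/1070868")
production_token = get_token("https://zenodo.org/api/deposit/depositions/7314437")
```

With the selection logic in one place, the fetch method can call it inline and a unit test can cover both branches without any network access.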

rousik · 2023-08-19T02:31:41Z

src/pudl/workspace/datastore.py

@@ -240,16 +236,24 @@ def _fetch_from_url(self, url: str) -> requests.Response:

    def _doi_to_url(self, doi: str) -> str:


One possibility here would be to use pydantic constr which brings in some basic validation and make a new type in place of using plain str (see https://docs.pydantic.dev/latest/usage/types/string_types/#arguments-to-constr)

    from pydantic import constr

    ZenodoDOI = constr(regex=r"(10\.5072|10\.5281)/zenodo.([\d]+)")

    def _doi_to_url(self, doi: ZenodoDOI): ...

Hm, looks like you've already attempted this in the other PR that comes my way?

I tried something like this in the other PR but it wasn't very satisfying. I think there's a simpler way to integrate some of those mechanics directly into the ZenodoFetcher class here.

rousik · 2023-08-19T19:16:06Z

src/pudl/workspace/datastore.py

-        help="Override pudl_in directory, defaults to setting in ~/.pudl.yml",
+        help="Input directory to use, overridng the $PUDL_INPUT environment variable.",


Do we actually use this command line flag or can we fully rely on env variables here?

rousik · 2023-08-21T04:44:08Z

src/pudl/workspace/datastore.py

+        # Sandbox DOIs are provided for reference
+        "censusdp1tract": "10.5281/zenodo.4127049",
+        # "censusdp1tract": "10.5072/zenodo.674992",
+        "eia860": "10.5281/zenodo.8164776",
+        # "eia860": "10.5072/zenodo.1222854",
+        "eia860m": "10.5281/zenodo.8188017",
+        # "eia860m": "10.5072/zenodo.1225517",
+        "eia861": "10.5281/zenodo.8231268",
+        # "eia861": "10.5072/zenodo.1229930",
+        "eia923": "10.5281/zenodo.8172818",
+        # "eia923": "10.5072/zenodo.1217724",
+        "eia_bulk_elec": "10.5281/zenodo.7067367",
+        # "eia_bulk_elec": "10.5072/zenodo.1103572",
+        "epacamd_eia": "10.5281/zenodo.7900974",
+        # "epacamd_eia": "10.5072/zenodo.1199170",
+        "epacems": "10.5281/zenodo.6910058",
+        # "epacems": "10.5072/zenodo.672963",
+        "ferc1": "10.5281/zenodo.7314437",
+        # "ferc1": "10.5072/zenodo.1070868",
+        "ferc2": "10.5281/zenodo.8006881",
+        # "ferc2": "10.5072/zenodo.1188447",
+        "ferc6": "10.5281/zenodo.7130141",
+        # "ferc6": "10.5072/zenodo.1098088",
+        "ferc60": "10.5281/zenodo.7130146",
+        # "ferc60": "10.5072/zenodo.1098089",
+        "ferc714": "10.5281/zenodo.7139875",
+        # "ferc714": "10.5072/zenodo.1098302",


I wonder if turning this into a pydantic settings object that could read from env variables (e.g. PUDL_FERC1_DOI) could be a good way to pass sandbox values here, with production defaults as... well, defaults :)

rousik · 2023-08-21T04:45:54Z

test/unit/workspace/datastore_test.py

    def test_get_known_datasets(self):
        """Call to get_known_datasets() produces the expected results."""
        self.assertEqual(
-            sorted(datastore.ZenodoFetcher.DOI["production"]),
+            sorted(datastore.ZenodoFetcher.DOI),
            self.fetcher.get_known_datasets(),
        )


Huh, I'm not sure if the test does what it says it should.

zaneselvans · 2023-08-21T16:35:19Z

src/pudl/workspace/datastore.py

@@ -29,6 +30,7 @@
 # long as we stick to read-only keys.

 PUDL_YML = Path.home() / ".pudl.yml"
+ZenodoDOI = constr(regex=r"(10\.5072|10\.5281)/zenodo.([\d]+)")


This defines the type, but doesn't actually do any validation. I use the ZenodoDOI.validate() method in the ZenodoFetcher.__init__() to check.
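The general point, that an annotation on a plain function is only a hint and something has to run the check explicitly, can be shown without Pydantic. In this sketch `typing.NewType` stands in for the constrained string type, and `describe` and `validate_doi` are hypothetical helpers rather than anything in the codebase.

```python
import re
from typing import NewType

# Stand-in for the constrained type: a named alias with no runtime check.
ZenodoDOI = NewType("ZenodoDOI", str)

DOI_PATTERN = re.compile(r"(10\.5072|10\.5281)/zenodo\.(\d+)")


def describe(doi: ZenodoDOI) -> str:
    """The annotation is only a hint; any string gets through at runtime."""
    return f"DOI: {doi}"


def validate_doi(doi: str) -> ZenodoDOI:
    """Explicit check, playing the role of the validation done in __init__."""
    if DOI_PATTERN.fullmatch(doi) is None:
        raise ValueError(f"Invalid Zenodo DOI: {doi}")
    return ZenodoDOI(doi)


unchecked = describe("not-a-doi")  # no error is raised here
checked = validate_doi("10.5281/zenodo.8164776")
```

Doing the validation once, when the fetcher is constructed, means every later method can assume its DOIs are well formed.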

zaneselvans · 2023-08-21T16:37:25Z

src/pudl/workspace/datastore.py

@@ -154,106 +156,111 @@ def get_json_string(self) -> str:
        return json.dumps(self.datapackage_json, sort_keys=True, indent=4)


-class ZenodoFetcher:
+class ZenodoFetcher(BaseModel):


Not sure it's ideal to turn this into a Pydantic Model, since we still have to validate the DOIs manually (though using the Pydantic machinery). We don't want to create a whole Model just to contain the ZenodoDOI string, since that would mean referencing it as zen_doi.doi rather than treating it like a string. I think in Pydantic 2 it's easy to swap in a non-dictionary root model, but we're not using v2 yet.

zaneselvans · 2023-08-21T16:47:17Z

Not sure what's going on with the unit test failure here, it works fine for me locally...

zaneselvans · 2023-08-22T04:30:46Z

@rousik I am stumped why the datastore unit tests are failing in CI but working fine locally. Is there anything that looks obviously fishy to you in the test? Does it work locally for you? The huge diff that it's reporting between the mocked and expected datapackage descriptor makes me wonder if it's getting the real datapackage descriptor rather than the mocked one somehow. But I don't know why that would happen in CI but not locally.

pytest test/unit/workspace/datastore_test.py

jdangerx

Generally looks good!

I have one major suggestion, which you can take or leave, about trying to encode the Zenodo environment explicitly - that should let us handle them in a more robust (and also more readable) way. It seems like a small-to-medium refactor effort because you've already got a few tests down!
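One way to read "encode the Zenodo environment explicitly" is an enum keyed on the DOI prefix. This is only a guess at the shape of the suggestion: `ZenodoEnv`, `parse_doi`, and `doi_to_url` are hypothetical names, while the prefixes, API roots, and URL layout are taken from the snippets earlier in this PR.

```python
import re
from enum import Enum


class ZenodoEnv(Enum):
    """The two Zenodo servers, keyed by their DOI prefix."""

    PRODUCTION = "10.5281"
    SANDBOX = "10.5072"

    @property
    def api_root(self) -> str:
        if self is ZenodoEnv.PRODUCTION:
            return "https://zenodo.org/api"
        return "https://sandbox.zenodo.org/api"


DOI_PATTERN = re.compile(r"(10\.5072|10\.5281)/zenodo\.(\d+)")


def parse_doi(doi: str) -> tuple[ZenodoEnv, str]:
    """Split a DOI into its environment and Zenodo deposition ID."""
    match = DOI_PATTERN.fullmatch(doi)
    if match is None:
        raise ValueError(f"Invalid Zenodo DOI: {doi}")
    prefix, zenodo_id = match.groups()
    return ZenodoEnv(prefix), zenodo_id


def doi_to_url(doi: str) -> str:
    env, zenodo_id = parse_doi(doi)
    return f"{env.api_root}/deposit/depositions/{zenodo_id}"


sandbox_url = doi_to_url("10.5072/zenodo.1070868")
production_url = doi_to_url("10.5281/zenodo.7314437")
```

Token lookup could hang off the same enum, so the URL and the credentials are always chosen from one parsed value instead of from separate string checks.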

Apart from that, there are a few small things (privatizing one function, changing some tests).

Let me know if you want to hop on a call to discuss!

docs/dev/datastore.rst

src/pudl/workspace/datastore.py

jdangerx · 2023-08-29T16:29:35Z

test/integration/zenodo_datapackage_test.py

-    @pytest.mark.xfail(
-        raises=(MaxRetryError, ConnectionError, RetryError, ResponseError)
+        raises=(
+            MaxRetryError,


We could mock out the Zenodo interaction, though I guess a lot of the funky logic is "are we using the right access token for the URL for the dataset" so maybe we just keep this around. How flaky does this turn out to be?

test/unit/workspace/datastore_test.py

jdangerx · 2023-08-29T16:35:12Z

src/pudl/workspace/datastore.py

-        help="Override pudl_in directory, defaults to setting in ~/.pudl.yml",
+        help="Input directory to use, overridng the $PUDL_INPUT environment variable.",


I took a look around and also only see "manually run this script to cache datastore locally." So my vote is to use the env vars completely.

jdangerx · 2023-08-29T16:41:11Z

test/integration/zenodo_datapackage_test.py

@@ -8,24 +8,19 @@


 class TestZenodoDatapackages:
-    """Ensure production & sandbox Datastores point to valid datapackages."""
+    """Ensure all DOIs in Datastore point to valid datapackages."""


It might be nice to also test that we can download a couple of the resources that are actually pointed at by the datapackage.json.

It would, and we used to do this, but all of the "Actually download something from Zenodo" tests have ended up being so flaky that we've marked them XFAIL since they were routinely breaking the tests for no real reason.

Yeah, I can definitely see how that would happen. Maybe someday we can use a "flaky test re-runner" like https://github.com/box/flaky.

Something like that pytest flaky plugin seems like it would be good! But that one looks a bit stale.

IIRC much of the flakiness wasn't "re-run immediately and things are okay"; it was more like "network is out for an hour so you'll just keep failing until it comes back", which seems weird given that GitHub and CERN should both be online kind of always but... 🤷🏼

zaneselvans added the datastore and zenodo labels Aug 19, 2023

zaneselvans requested a review from rousik August 19, 2023 01:48

zaneselvans added this to the 2023 Summer milestone Aug 19, 2023

zaneselvans linked an issue Aug 19, 2023 that may be closed by this pull request

Allow Datastore to use both sandbox and production DOIs #1863

Closed

zaneselvans commented Aug 19, 2023

rousik reviewed Aug 21, 2023

Integrate some Pydantic validation into ZenodoFetcher

0c3c050

zaneselvans commented Aug 21, 2023

Merge branch 'dev' into goodbye-sandbox

1e725e6

zaneselvans mentioned this pull request Aug 21, 2023

A fancier way of dealing with mixed Zenodo DOIs. #2799

Closed

8 tasks

zaneselvans added 3 commits August 21, 2023 13:37

Merge branch 'dev' into goodbye-sandbox

254ce9b

Update allowed tox versions.

c4b13fc

Merge branch 'dev' into goodbye-sandbox

89cea8c

zaneselvans added 3 commits August 22, 2023 16:10

Merge branch 'dev' into goodbye-sandbox

1971226

Merge branch 'dev' into goodbye-sandbox

2c27a7c

Create a ZenodoDoiSettings Pydantic BaseSettings class.

d840c1d

zaneselvans requested a review from rousik August 26, 2023 02:27

Update Zenodo DOI test to work better with new ZenodoDoiSettings

2d081f0

jdangerx self-requested a review August 28, 2023 20:12

jdangerx requested changes Aug 29, 2023

zaneselvans and others added 3 commits August 29, 2023 12:58

Update docs/dev/datastore.rst

b5889ce

Co-authored-by: Dazhong Xia <[email protected]>

Remove deprecated pudl_datastore --pudl_in option and unused get_reso…

6ac9d9f

…urce_key() method

Merge branch 'dev' into goodbye-sandbox

bb9f8fc

zaneselvans requested a review from jdangerx August 29, 2023 17:56

jdangerx approved these changes Aug 29, 2023

zaneselvans merged commit 4a3c4ad into dev Aug 29, 2023
4 checks passed

zaneselvans deleted the goodbye-sandbox branch August 29, 2023 18:44
