-
-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow a mix of Zenodo sandbox & production DOIs #2798
Conversation
Okay I did this off the clock since it has been driving me a little bit nuts. Historically we've required that all Zenodo DOIs in the datastore come either from the Sandbox or the Production server, which makes testing a single new archive on its own a hassle, and adds complexity across the whole application with switches for sandbox vs. not-sandbox data sources. This commit removes this requirement, and allows a mix of sandbox and production DOIs to be used in development. I also removed some very sparse documentation about how to create an archive in the Datastore by hand, which I think was very old and probably no longer supported and certainly not being tested, since it seemed likely to confuse and frustrate anyone who actually tried to do it. There's a unit test which checks that all DOIs are production, rather than sandbox to make it difficult to accidentally check in code that refers to unofficial input data.
doi = ds.get_doi(dataset) | ||
self.assertFalse( | ||
re.fullmatch(r"10\.5072/zenodo\.[0-9]{5,10}", doi), | ||
msg=f"Zenodo sandbox DOI found for {dataset}: {doi}", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ensures we don't accidentally leave any sandbox DOIs in the codebase.
src/pudl/workspace/datastore.py
Outdated
# Sandbox DOIs are provided for reference | ||
"censusdp1tract": "10.5281/zenodo.4127049", | ||
# "censusdp1tract": "10.5072/zenodo.674992", | ||
"eia860": "10.5281/zenodo.8164776", | ||
# "eia860": "10.5072/zenodo.1222854", | ||
"eia860m": "10.5281/zenodo.8188017", | ||
# "eia860m": "10.5072/zenodo.1225517", | ||
"eia861": "10.5281/zenodo.8231268", | ||
# "eia861": "10.5072/zenodo.1229930", | ||
"eia923": "10.5281/zenodo.8172818", | ||
# "eia923": "10.5072/zenodo.1217724", | ||
"eia_bulk_elec": "10.5281/zenodo.7067367", | ||
# "eia_bulk_elec": "10.5072/zenodo.1103572", | ||
"epacamd_eia": "10.5281/zenodo.7900974", | ||
# "epacamd_eia": "10.5072/zenodo.1199170", | ||
"epacems": "10.5281/zenodo.6910058", | ||
# "epacems": "10.5072/zenodo.672963", | ||
"ferc1": "10.5281/zenodo.7314437", | ||
# "ferc1": "10.5072/zenodo.1070868", | ||
"ferc2": "10.5281/zenodo.8006881", | ||
# "ferc2": "10.5072/zenodo.1188447", | ||
"ferc6": "10.5281/zenodo.7130141", | ||
# "ferc6": "10.5072/zenodo.1098088", | ||
"ferc60": "10.5281/zenodo.7130146", | ||
# "ferc60": "10.5072/zenodo.1098089", | ||
"ferc714": "10.5281/zenodo.7139875", | ||
# "ferc714": "10.5072/zenodo.1098302", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point I think we agree the DOIs should come out of the codebase and go into a settings file, but I'm not trying to do that in this PR. I left the sandbox DOIs here and commented out for easy reference if someone wants to test out one of them, or look up which Zenodo archive is referenced in the sandbox.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if turning this into pydantic settings object that could read from env variables (e.g. PUDL_FERC1_DOI
) could be a good way to pass sandbox values here, with production defaults as... well, defaults :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to store the DOIs in a file in the repo (which could be used to populate env vars) so we can look them up for cache invalidation, and easily edit them eventually with PRs when new archives become available.
But for this PR I just want to get to where we can have mixed sandbox/production DOIs to make integrating new archives by hand this fall easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a BaseSettings
model was easy! Now the DOIs all get validated automatically by Pydanic, and they can optionally be set using environment variables too.
src/pudl/workspace/datastore.py
Outdated
if doi_prefix == "10.5072": | ||
api_root = self.API_ROOT["sandbox"] | ||
elif doi_prefix == "10.5281": | ||
api_root = self.API_ROOT["production"] | ||
else: | ||
raise ValueError(f"Invalid Zenodo DOI: {doi}") | ||
return f"{api_root}/deposit/depositions/{zenodo_id}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure there's a more eloquent way of switching between production and sandbox on a per-dataset basis (rather than the whole instance of the class being tied to one or the other) but this seems relatively self-contained and not terrible for the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I messed around with creating a DOI class:
class ZenodoDoi(BaseModel):
"""A class defining useful validations and methods for working with Zenodo DOIs."""
doi: constr(regex=r"^10\.(5072|5281)/zenodo\.[\d]+$") # noqa: F722
def __str__(self: Self) -> str:
"""String representation of the DOI"""
return self.doi
@property
def is_prod(self: Self) -> bool:
"""Return True if DOI is from Zenodo production server, False otherwise."""
if self.doi.startswith("10.5281/zenodo"):
return True
else:
assert self.doi.startswith("10.5072/zenodo")
return False
@property
def token(self: Self) -> str:
"""Zenodo read-only personal access token corresponding to this DOI.
Zenodo tokens recorded here should have read-only access to our archives.
Including them here is correct in order to allow public use of this tool, so
long as we stick to read-only keys.
"""
# Read-only personal access tokens for [email protected]:
if self.is_prod:
return "KXcG5s9TqeuPh1Ukt5QYbzhCElp9LxuqAuiwdqHP0WS4qGIQiydHn6FBtdJ5"
else:
return "qyPC29wGPaflUUVAv1oGw99ytwBqwEEdwi4NuUrpwc3xUcEwbmuB4emwysco"
@property
def zenodo_id(self: Self) -> str:
"""The Zenodo deposition ID, extracted from the DOI."""
match = re.search(r"(10\.5072|10\.5281)/zenodo.([\d]+)", self.doi)
return match.groups()[1]
@property
def api_root(self: Self) -> HttpUrl:
"""Return appropriate production or sandbox Zenodo API root URL."""
if self.is_prod:
return "https://zenodo.org/api"
else:
return "https://sandbox.zenodo.org/api"
@property
def url(self: Self) -> HttpUrl:
"""Zenodo URL corresponding to this DOI."""
return f"{self.api_root}/deposit/depositions/{self.zenodo_id}"
src/pudl/workspace/datastore.py
Outdated
help="Override pudl_in directory, defaults to setting in ~/.pudl.yml", | ||
help="Input directory to use, overridng the $PUDL_INPUT environment variable.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are some other lingering references to .pudl.yml
floating around that we should chase down now that we've switched over to using $PUDL_INPUT
and $PUDL_OUTPUT
entirely, but that's for another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually use this command line flag or can we fully rely on env variables here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. If anywhere I think it would be in the tests. I grepped a bit and didn't see it anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look around and also only see "manually run this script to cache datastore locally." So my vote is to use the env vars completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll remove this and see if everything keeps working!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But more generally I think @rousik agreed to take on the "scour the repo for remaining references to .pudl.yml
and friends" task.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## dev #2798 +/- ##
=====================================
Coverage 88.5% 88.5%
=====================================
Files 90 90
Lines 10126 10152 +26
=====================================
+ Hits 8964 8988 +24
- Misses 1162 1164 +2
☔ View full report in Codecov by Sentry. |
src/pudl/workspace/datastore.py
Outdated
if "sandbox" in url: | ||
token = self.TOKEN["sandbox"] | ||
else: | ||
token = self.TOKEN["production"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite trivial code, but I think that for readability it might be better to extract this into self.get_token(url)
method that does this. Could be made more testable and for sure more readable here.
You could even inline self.get_token(url)
below.
src/pudl/workspace/datastore.py
Outdated
@@ -240,16 +236,24 @@ def _fetch_from_url(self, url: str) -> requests.Response: | |||
|
|||
def _doi_to_url(self, doi: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possibility here would be to use pydantic constr
which brings in some basic validation and make a new type in place of using plain str
(see https://docs.pydantic.dev/latest/usage/types/string_types/#arguments-to-constr)
from pydantic import constr
ZenodoDOI =constr(regex=r"(10\.5072|10\.5281)/zenodo.([\d]+)")
def _doi_to_url(self, doi: ZenodoDOI):
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, looks like you've already attempted this in the other PR that comes my way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried something like this in the other PR but it wasn't very satisfying. I think there's a simpler way to integrate some of those mechanics directly into the ZenodoFetcher
class here.
src/pudl/workspace/datastore.py
Outdated
help="Override pudl_in directory, defaults to setting in ~/.pudl.yml", | ||
help="Input directory to use, overridng the $PUDL_INPUT environment variable.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually use this command line flag or can we fully rely on env variables here?
src/pudl/workspace/datastore.py
Outdated
# Sandbox DOIs are provided for reference | ||
"censusdp1tract": "10.5281/zenodo.4127049", | ||
# "censusdp1tract": "10.5072/zenodo.674992", | ||
"eia860": "10.5281/zenodo.8164776", | ||
# "eia860": "10.5072/zenodo.1222854", | ||
"eia860m": "10.5281/zenodo.8188017", | ||
# "eia860m": "10.5072/zenodo.1225517", | ||
"eia861": "10.5281/zenodo.8231268", | ||
# "eia861": "10.5072/zenodo.1229930", | ||
"eia923": "10.5281/zenodo.8172818", | ||
# "eia923": "10.5072/zenodo.1217724", | ||
"eia_bulk_elec": "10.5281/zenodo.7067367", | ||
# "eia_bulk_elec": "10.5072/zenodo.1103572", | ||
"epacamd_eia": "10.5281/zenodo.7900974", | ||
# "epacamd_eia": "10.5072/zenodo.1199170", | ||
"epacems": "10.5281/zenodo.6910058", | ||
# "epacems": "10.5072/zenodo.672963", | ||
"ferc1": "10.5281/zenodo.7314437", | ||
# "ferc1": "10.5072/zenodo.1070868", | ||
"ferc2": "10.5281/zenodo.8006881", | ||
# "ferc2": "10.5072/zenodo.1188447", | ||
"ferc6": "10.5281/zenodo.7130141", | ||
# "ferc6": "10.5072/zenodo.1098088", | ||
"ferc60": "10.5281/zenodo.7130146", | ||
# "ferc60": "10.5072/zenodo.1098089", | ||
"ferc714": "10.5281/zenodo.7139875", | ||
# "ferc714": "10.5072/zenodo.1098302", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if turning this into pydantic settings object that could read from env variables (e.g. PUDL_FERC1_DOI
) could be a good way to pass sandbox values here, with production defaults as... well, defaults :)
def test_get_known_datasets(self): | ||
"""Call to get_known_datasets() produces the expected results.""" | ||
self.assertEqual( | ||
sorted(datastore.ZenodoFetcher.DOI["production"]), | ||
sorted(datastore.ZenodoFetcher.DOI), | ||
self.fetcher.get_known_datasets(), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, I'm not sure if the test does what it says it should.
src/pudl/workspace/datastore.py
Outdated
@@ -29,6 +30,7 @@ | |||
# long as we stick to read-only keys. | |||
|
|||
PUDL_YML = Path.home() / ".pudl.yml" | |||
ZenodoDOI = constr(regex=r"(10\.5072|10\.5281)/zenodo.([\d]+)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This defines the type, but doesn't actually do any validation. I use the ZenodoDOI.validate()
method in the ZenodoFetcher.__init__()
to check.
src/pudl/workspace/datastore.py
Outdated
@@ -154,106 +156,111 @@ def get_json_string(self) -> str: | |||
return json.dumps(self.datapackage_json, sort_keys=True, indent=4) | |||
|
|||
|
|||
class ZenodoFetcher: | |||
class ZenodoFetcher(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure it's ideal to turn this into a Pydantic Model, since we still have to do the validation of the DOIs manually (though using the Pydantic machinery) as we don't want to create a whole Model to contain the ZenodoDOI string (doing so would mean needing to reference like zen_doi.doi rather than treating it like a string -- I think in Pydantic 2 it's easy to swap in a non-dictionary root model, but we're not using v2 yet)
Not sure what's going on with the unit test failure here, it works fine for me locally... |
@rousik I am stumped why the datastore unit tests are failing in CI but working fine locally. Is there anything that looks obviously fishy to you in the test? Does it work locally for you? The huge diff that it's reporting between the mocked and expected datapackage descriptor makes me wonder if it's getting the real datapackage descriptor rather than the mocked one somehow. But I don't know why that would happen in CI but not locally. pytest test/unit/workspace/datastore_test.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good!
I have one major suggestion, which you can take or leave, about trying to encode the Zenodo environment explicitly - that should let us handle them in a more robust (and also more readable) way. It seems like a small-to-medium refactor effort because you've already got a few tests down!
Apart from that, there are a few small things (privatizing one function, changing some tests).
Let me know if you want to hop on a call to discuss!
@pytest.mark.xfail( | ||
raises=(MaxRetryError, ConnectionError, RetryError, ResponseError) | ||
raises=( | ||
MaxRetryError, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could mock out the Zenodo interaction, though I guess a lot of the funky logic is "are we using the right access token for the URL for the dataset" so maybe we just keep this around. How flaky does this turn out to be?
src/pudl/workspace/datastore.py
Outdated
help="Override pudl_in directory, defaults to setting in ~/.pudl.yml", | ||
help="Input directory to use, overridng the $PUDL_INPUT environment variable.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look around and also only see "manually run this script to cache datastore locally." So my vote is to use the env vars completely.
@@ -8,24 +8,19 @@ | |||
|
|||
|
|||
class TestZenodoDatapackages: | |||
"""Ensure production & sandbox Datastores point to valid datapackages.""" | |||
"""Ensure all DOIs in Datastore point to valid datapackages.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to also test that we can download a couple of the resources that are actually pointed at by the datapackage.json
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would, and we used to do this, but all of the "Actually download something from Zenodo" tests have ended up being so flaky that we've marked them XFAIL
since they were routinely breaking the tests for no real reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can definitely see how that would happen. Maybe someday we can use a "flaky test re-runner" like https://github.com/box/flaky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like that pytest flaky plugin seems like it would be good! But that one looks a bit stale.
IIRC much of the flakiness wasn't "re-run immediately and things are okay" it was more like "network is out for an hour so you'll just keep failing until it comes back" which seems weird given that GitHub and CERN should both be online kind of always but... 🤷🏼
Co-authored-by: Dazhong Xia <[email protected]>
…urce_key() method
PR Overview
Okay I did this off the clock since it has been driving me a little bit nuts and I wanted to do something technical that felt easy and satisfying as a break from the never-ending saga of #2016.
Historically we've required that all Zenodo DOIs in the datastore come either from the Sandbox or the Production server, which makes testing a single new archive on its own a hassle, and adds complexity across the whole application with switches for sandbox vs. not-sandbox data sources.
This commit removes this requirement, and allows a mix of sandbox and production DOIs to be used in development.
I also removed some very sparse documentation about how to create an archive in the Datastore by hand, which I think was very old and probably no longer supported and certainly not being tested, since it seemed likely to confuse and frustrate anyone who actually tried to do it.
There's a unit test which checks that all DOIs are production, rather than sandbox to make it difficult to accidentally check in code that refers to unofficial input data.
PR Checklist
dev
).