Allow a mix of Zenodo sandbox & production DOIs #2798

Merged: 13 commits, Aug 29, 2023

@zaneselvans (Member) commented Aug 19, 2023

PR Overview

Okay I did this off the clock since it has been driving me a little bit nuts and I wanted to do something technical that felt easy and satisfying as a break from the never-ending saga of #2016.

Historically we've required that all Zenodo DOIs in the datastore come either from the Sandbox or the Production server, which makes testing a single new archive on its own a hassle, and adds complexity across the whole application with switches for sandbox vs. not-sandbox data sources.

This commit removes this requirement, and allows a mix of sandbox and production DOIs to be used in development.

I also removed some very sparse documentation about how to create an archive in the Datastore by hand, which I think was very old and probably no longer supported and certainly not being tested, since it seemed likely to confuse and frustrate anyone who actually tried to do it.

There's a unit test which checks that all DOIs are production rather than sandbox, to make it difficult to accidentally check in code that refers to unofficial input data.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans added the labels "datastore" (Managing the acquisition and organization of external raw data.) and "zenodo" (Issues having to do with Zenodo data archiving and retrieval.) on Aug 19, 2023
@zaneselvans requested a review from rousik on August 19, 2023 01:48
@zaneselvans added this to the 2023 Summer milestone on Aug 19, 2023
@zaneselvans linked an issue on Aug 19, 2023 that may be closed by this pull request
Comment on lines 248 to 252
doi = ds.get_doi(dataset)
self.assertFalse(
    re.fullmatch(r"10\.5072/zenodo\.[0-9]{5,10}", doi),
    msg=f"Zenodo sandbox DOI found for {dataset}: {doi}",
)
Member Author

This ensures we don't accidentally leave any sandbox DOIs in the codebase.
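
A minimal standalone sketch of the same guard, for readers who want to try it outside the test suite. It assumes DOIs are held in a plain dataset-to-DOI mapping; `find_sandbox_dois` and the `example` mapping are hypothetical names, not part of PUDL.

```python
import re

# Sandbox DOIs use the 10.5072 prefix; production DOIs use 10.5281.
SANDBOX_DOI = re.compile(r"10\.5072/zenodo\.[0-9]+")


def find_sandbox_dois(dois: dict[str, str]) -> dict[str, str]:
    """Return the dataset -> DOI entries that point at the Zenodo sandbox."""
    return {name: doi for name, doi in dois.items() if SANDBOX_DOI.fullmatch(doi)}


# Hypothetical mapping for illustration, built from DOIs quoted in this PR.
example = {
    "ferc1": "10.5281/zenodo.7314437",
    "eia860": "10.5072/zenodo.1222854",
}
offenders = find_sandbox_dois(example)
print(offenders)
```

A test could then simply assert that the returned mapping is empty for the DOIs checked into the repository.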

Comment on lines 170 to 196
# Sandbox DOIs are provided for reference
"censusdp1tract": "10.5281/zenodo.4127049",
# "censusdp1tract": "10.5072/zenodo.674992",
"eia860": "10.5281/zenodo.8164776",
# "eia860": "10.5072/zenodo.1222854",
"eia860m": "10.5281/zenodo.8188017",
# "eia860m": "10.5072/zenodo.1225517",
"eia861": "10.5281/zenodo.8231268",
# "eia861": "10.5072/zenodo.1229930",
"eia923": "10.5281/zenodo.8172818",
# "eia923": "10.5072/zenodo.1217724",
"eia_bulk_elec": "10.5281/zenodo.7067367",
# "eia_bulk_elec": "10.5072/zenodo.1103572",
"epacamd_eia": "10.5281/zenodo.7900974",
# "epacamd_eia": "10.5072/zenodo.1199170",
"epacems": "10.5281/zenodo.6910058",
# "epacems": "10.5072/zenodo.672963",
"ferc1": "10.5281/zenodo.7314437",
# "ferc1": "10.5072/zenodo.1070868",
"ferc2": "10.5281/zenodo.8006881",
# "ferc2": "10.5072/zenodo.1188447",
"ferc6": "10.5281/zenodo.7130141",
# "ferc6": "10.5072/zenodo.1098088",
"ferc60": "10.5281/zenodo.7130146",
# "ferc60": "10.5072/zenodo.1098089",
"ferc714": "10.5281/zenodo.7139875",
# "ferc714": "10.5072/zenodo.1098302",
Member Author

At some point I think we agree the DOIs should come out of the codebase and go into a settings file, but I'm not trying to do that in this PR. I left the sandbox DOIs here and commented out for easy reference if someone wants to test out one of them, or look up which Zenodo archive is referenced in the sandbox.

Collaborator

I wonder if turning this into pydantic settings object that could read from env variables (e.g. PUDL_FERC1_DOI) could be a good way to pass sandbox values here, with production defaults as... well, defaults :)

Member Author

I think we need to store the DOIs in a file in the repo (which could be used to populate env vars) so we can look them up for cache invalidation, and easily edit them eventually with PRs when new archives become available.

But for this PR I just want to get to where we can have mixed sandbox/production DOIs to make integrating new archives by hand this fall easy.

Member Author

Using a BaseSettings model was easy! Now the DOIs all get validated automatically by Pydantic, and they can optionally be set using environment variables too.
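
For illustration only, here is a standard-library sketch of the idea discussed in this thread: production DOIs as defaults, with optional per-dataset overrides from environment variables named like `PUDL_FERC1_DOI`. It is not the Pydantic `BaseSettings` model that was merged; `load_dois` and `DEFAULT_DOIS` are hypothetical names.

```python
import os
import re
from typing import Mapping, Optional

DOI_PATTERN = re.compile(r"10\.(5072|5281)/zenodo\.\d+")

# Production defaults, copied from DOIs quoted in this PR (subset only).
DEFAULT_DOIS = {
    "ferc1": "10.5281/zenodo.7314437",
    "eia860": "10.5281/zenodo.8164776",
}


def load_dois(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return dataset DOIs, letting PUDL_<DATASET>_DOI override each default."""
    env = os.environ if env is None else env
    dois = {}
    for dataset, default in DEFAULT_DOIS.items():
        doi = env.get(f"PUDL_{dataset.upper()}_DOI", default)
        # Reject anything that is neither a sandbox nor a production Zenodo DOI.
        if not DOI_PATTERN.fullmatch(doi):
            raise ValueError(f"Invalid Zenodo DOI for {dataset}: {doi}")
        dois[dataset] = doi
    return dois
```

Calling `load_dois({"PUDL_FERC1_DOI": "10.5072/zenodo.1070868"})` would swap in the sandbox archive for `ferc1` while leaving every other dataset on its production default.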

Comment on lines 246 to 252
if doi_prefix == "10.5072":
    api_root = self.API_ROOT["sandbox"]
elif doi_prefix == "10.5281":
    api_root = self.API_ROOT["production"]
else:
    raise ValueError(f"Invalid Zenodo DOI: {doi}")
return f"{api_root}/deposit/depositions/{zenodo_id}"
Member Author

I'm sure there's a more elegant way of switching between production and sandbox on a per-dataset basis (rather than the whole instance of the class being tied to one or the other) but this seems relatively self-contained and not terrible for the moment.
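
One possible alternative to the if/elif chain, sketched here as an assumption rather than as what was merged: key the API roots by DOI prefix and let a single regular expression both validate the DOI and split it. `API_ROOTS` and the module-level `doi_to_url` are hypothetical names; the URLs are the ones that appear in the draft class quoted in this thread.

```python
import re

# API roots keyed by DOI prefix.
API_ROOTS = {
    "10.5072": "https://sandbox.zenodo.org/api",
    "10.5281": "https://zenodo.org/api",
}
DOI_PATTERN = re.compile(r"(10\.5072|10\.5281)/zenodo\.(\d+)")


def doi_to_url(doi: str) -> str:
    """Map a Zenodo DOI to the deposition URL on the matching server."""
    match = DOI_PATTERN.fullmatch(doi)
    if match is None:
        raise ValueError(f"Invalid Zenodo DOI: {doi}")
    prefix, zenodo_id = match.groups()
    return f"{API_ROOTS[prefix]}/deposit/depositions/{zenodo_id}"


print(doi_to_url("10.5281/zenodo.7314437"))
```

Because the pattern only admits the two known prefixes, the dictionary lookup cannot fail once the match succeeds.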

Member Author

I messed around with creating a DOI class:

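# Imports this snippet relies on (pydantic v1 API; typing.Self needs Python 3.11+).
import re
from typing import Self

from pydantic import BaseModel, HttpUrl, constr


class ZenodoDoi(BaseModel):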
    """A class defining useful validations and methods for working with Zenodo DOIs."""

    doi: constr(regex=r"^10\.(5072|5281)/zenodo\.[\d]+$")  # noqa: F722

    def __str__(self: Self) -> str:
        """String representation of the DOI"""
        return self.doi

    @property
    def is_prod(self: Self) -> bool:
        """Return True if DOI is from Zenodo production server, False otherwise."""
        if self.doi.startswith("10.5281/zenodo"):
            return True
        else:
            assert self.doi.startswith("10.5072/zenodo")
            return False

    @property
    def token(self: Self) -> str:
        """Zenodo read-only personal access token corresponding to this DOI.

        Zenodo tokens recorded here should have read-only access to our archives.
        Including them here is correct in order to allow public use of this tool, so
        long as we stick to read-only keys.
        """
        # Read-only personal access tokens for [email protected]:
        if self.is_prod:
            return "KXcG5s9TqeuPh1Ukt5QYbzhCElp9LxuqAuiwdqHP0WS4qGIQiydHn6FBtdJ5"
        else:
            return "qyPC29wGPaflUUVAv1oGw99ytwBqwEEdwi4NuUrpwc3xUcEwbmuB4emwysco"

    @property
    def zenodo_id(self: Self) -> str:
        """The Zenodo deposition ID, extracted from the DOI."""
        # Escape the dot after "zenodo" so it only matches a literal period.
        match = re.search(r"(10\.5072|10\.5281)/zenodo\.([\d]+)", self.doi)
        return match.groups()[1]

    @property
    def api_root(self: Self) -> HttpUrl:
        """Return appropriate production or sandbox Zenodo API root URL."""
        if self.is_prod:
            return "https://zenodo.org/api"
        else:
            return "https://sandbox.zenodo.org/api"

    @property
    def url(self: Self) -> HttpUrl:
        """Zenodo URL corresponding to this DOI."""
        return f"{self.api_root}/deposit/depositions/{self.zenodo_id}"

Comment on lines 471 to 465
- help="Override pudl_in directory, defaults to setting in ~/.pudl.yml",
+ help="Input directory to use, overriding the $PUDL_INPUT environment variable.",
Member Author

I think there are some other lingering references to .pudl.yml floating around that we should chase down now that we've switched over to using $PUDL_INPUT and $PUDL_OUTPUT entirely, but that's for another PR.

Collaborator

Do we actually use this command line flag or can we fully rely on env variables here?

Member Author

I don't know. If anywhere I think it would be in the tests. I grepped a bit and didn't see it anywhere.

Member

I took a look around and also only see "manually run this script to cache datastore locally." So my vote is to use the env vars completely.

Member Author

Okay, I'll remove this and see if everything keeps working!

Member Author

But more generally I think @rousik agreed to take on the "scour the repo for remaining references to .pudl.yml and friends" task.

codecov bot commented Aug 19, 2023

Codecov Report

Patch coverage: 93.6% and no project coverage change.

Comparison is base (77aa2f4) 88.5% compared to head (d840c1d) 88.5%.
Report is 5 commits behind head on dev.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2798   +/-   ##
=====================================
  Coverage   88.5%   88.5%           
=====================================
  Files         90      90           
  Lines      10126   10152   +26     
=====================================
+ Hits        8964    8988   +24     
- Misses      1162    1164    +2     
Files Changed                      Coverage Δ
src/pudl/cli/etl.py                56.5% <ø> (-1.0%) ⬇️
src/pudl/ferc_to_sqlite/cli.py     71.7% <ø> (-0.8%) ⬇️
src/pudl/metadata/classes.py       86.5% <ø> (ø)
src/pudl/resources.py              100.0% <ø> (ø)
src/pudl/workspace/datastore.py    76.2% <93.6%> (+1.7%) ⬆️


Comment on lines 224 to 227
if "sandbox" in url:
    token = self.TOKEN["sandbox"]
else:
    token = self.TOKEN["production"]
Collaborator

This is quite trivial code, but I think that for readability it might be better to extract this into a self.get_token(url) method. That would make it easier to test and certainly more readable here.

You could even inline self.get_token(url) below.
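
A rough sketch of what such a helper might look like, written as a free function so it runs on its own. The token values are placeholders and `TOKENS` is a hypothetical stand-in for the class attribute. Note that this version inspects the URL's hostname instead of doing a substring test on the whole URL, which is a deliberate deviation from the snippet under review.

```python
from urllib.parse import urlparse

# Placeholder values; real read-only tokens would be configured elsewhere.
TOKENS = {
    "sandbox": "SANDBOX-TOKEN-PLACEHOLDER",
    "production": "PRODUCTION-TOKEN-PLACEHOLDER",
}


def get_token(url: str) -> str:
    """Pick the access token that matches the Zenodo server in the URL."""
    host = urlparse(url).hostname or ""
    env = "sandbox" if host.startswith("sandbox.") else "production"
    return TOKENS[env]


print(get_token("https://sandbox.zenodo.org/api/deposit/depositions/1070868"))
```

Checking the hostname avoids misclassifying a production URL that merely contains the word "sandbox" somewhere in its path.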

@@ -240,16 +236,24 @@ def _fetch_from_url(self, url: str) -> requests.Response:

def _doi_to_url(self, doi: str) -> str:
Collaborator

One possibility here would be to use pydantic constr which brings in some basic validation and make a new type in place of using plain str (see https://docs.pydantic.dev/latest/usage/types/string_types/#arguments-to-constr)

from pydantic import constr

# The dot after "zenodo" is escaped so it only matches a literal period.
ZenodoDOI = constr(regex=r"(10\.5072|10\.5281)/zenodo\.([\d]+)")

def _doi_to_url(self, doi: ZenodoDOI):
    ...

Collaborator

Hm, looks like you've already attempted this in the other PR that comes my way?

Member Author

I tried something like this in the other PR but it wasn't very satisfying. I think there's a simpler way to integrate some of those mechanics directly into the ZenodoFetcher class here.

Comment on lines 258 to 263
  def test_get_known_datasets(self):
      """Call to get_known_datasets() produces the expected results."""
      self.assertEqual(
-         sorted(datastore.ZenodoFetcher.DOI["production"]),
+         sorted(datastore.ZenodoFetcher.DOI),
          self.fetcher.get_known_datasets(),
      )
Collaborator

Huh, I'm not sure if the test does what it says it should.

@@ -29,6 +30,7 @@
# long as we stick to read-only keys.

PUDL_YML = Path.home() / ".pudl.yml"
ZenodoDOI = constr(regex=r"(10\.5072|10\.5281)/zenodo.([\d]+)")
Member Author

This defines the type, but doesn't actually do any validation. I use the ZenodoDOI.validate() method in the ZenodoFetcher.__init__() to check.

@@ -154,106 +156,111 @@ def get_json_string(self) -> str:
return json.dumps(self.datapackage_json, sort_keys=True, indent=4)


- class ZenodoFetcher:
+ class ZenodoFetcher(BaseModel):
Member Author

Not sure it's ideal to turn this into a Pydantic Model, since we still have to do the validation of the DOIs manually (though using the Pydantic machinery) as we don't want to create a whole Model to contain the ZenodoDOI string (doing so would mean needing to reference like zen_doi.doi rather than treating it like a string -- I think in Pydantic 2 it's easy to swap in a non-dictionary root model, but we're not using v2 yet)
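
One way to get validation while still treating the DOI as an ordinary string, independent of Pydantic, is a small `str` subclass. This is only a sketch of that idea and not something proposed in the PR; `ZenodoDoiStr` is a hypothetical name.

```python
import re


class ZenodoDoiStr(str):
    """A str that refuses to be constructed from anything but a Zenodo DOI."""

    _PATTERN = re.compile(r"10\.(5072|5281)/zenodo\.(\d+)")

    def __new__(cls, value: str) -> "ZenodoDoiStr":
        if cls._PATTERN.fullmatch(value) is None:
            raise ValueError(f"Invalid Zenodo DOI: {value}")
        return super().__new__(cls, value)

    @property
    def is_sandbox(self) -> bool:
        """True if the DOI carries the sandbox prefix 10.5072."""
        return self.startswith("10.5072/")

    @property
    def zenodo_id(self) -> str:
        """The numeric deposition ID at the end of the DOI."""
        return self._PATTERN.fullmatch(self).group(2)


doi = ZenodoDoiStr("10.5072/zenodo.1070868")
print(doi, doi.is_sandbox, doi.zenodo_id)
```

Instances compare equal to plain strings and work as dictionary keys or in f-strings, so callers would not need an extra `.doi` attribute access.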

@zaneselvans (Member Author)

Not sure what's going on with the unit test failure here, it works fine for me locally...

@zaneselvans (Member Author)

@rousik I am stumped why the datastore unit tests are failing in CI but working fine locally. Is there anything that looks obviously fishy to you in the test? Does it work locally for you? The huge diff that it's reporting between the mocked and expected datapackage descriptor makes me wonder if it's getting the real datapackage descriptor rather than the mocked one somehow. But I don't know why that would happen in CI but not locally.

pytest test/unit/workspace/datastore_test.py

@zaneselvans requested a review from rousik on August 26, 2023 02:27
@jdangerx self-requested a review on August 28, 2023 20:12
@jdangerx (Member) left a comment

Generally looks good!

I have one major suggestion, which you can take or leave, about trying to encode the Zenodo environment explicitly - that should let us handle them in a more robust (and also more readable) way. It seems like a small-to-medium refactor effort because you've already got a few tests down!

Apart from that, there are a few small things (privatizing one function, changing some tests).

Let me know if you want to hop on a call to discuss!
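
To make the "encode the Zenodo environment explicitly" suggestion concrete, here is one hedged sketch using an `enum.Enum` keyed by DOI prefix. The review does not spell out a design, so `ZenodoEnv`, `from_doi`, and `api_root` are all hypothetical names; the URLs are those quoted elsewhere in this PR.

```python
import enum


class ZenodoEnv(enum.Enum):
    """The two Zenodo servers, identified by their DOI prefixes."""

    SANDBOX = "10.5072"
    PRODUCTION = "10.5281"

    @classmethod
    def from_doi(cls, doi: str) -> "ZenodoEnv":
        """Look up the environment from the DOI prefix before the slash."""
        prefix = doi.split("/", 1)[0]
        return cls(prefix)  # raises ValueError for an unknown prefix

    @property
    def api_root(self) -> str:
        if self is ZenodoEnv.PRODUCTION:
            return "https://zenodo.org/api"
        return "https://sandbox.zenodo.org/api"


env = ZenodoEnv.from_doi("10.5072/zenodo.1070868")
print(env, env.api_root)
```

With the environment as a value of its own, token selection and URL construction could both take a `ZenodoEnv` argument instead of re-deriving it from strings.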

@pytest.mark.xfail(
-     raises=(MaxRetryError, ConnectionError, RetryError, ResponseError)
+     raises=(
+         MaxRetryError,
Member

We could mock out the Zenodo interaction, though I guess a lot of the funky logic is "are we using the right access token for the URL for the dataset" so maybe we just keep this around. How flaky does this turn out to be?

@@ -8,24 +8,19 @@


class TestZenodoDatapackages:
-     """Ensure production & sandbox Datastores point to valid datapackages."""
+     """Ensure all DOIs in Datastore point to valid datapackages."""
Member

It might be nice to also test that we can download a couple of the resources that are actually pointed at by the datapackage.json.

Member Author

It would, and we used to do this, but all of the "Actually download something from Zenodo" tests have ended up being so flaky that we've marked them XFAIL since they were routinely breaking the tests for no real reason.

Member

Yeah, I can definitely see how that would happen. Maybe someday we can use a "flaky test re-runner" like https://github.com/box/flaky.

Member Author

Something like that pytest flaky plugin seems like it would be good! But that one looks a bit stale.

IIRC much of the flakiness wasn't "re-run immediately and things are okay" it was more like "network is out for an hour so you'll just keep failing until it comes back" which seems weird given that GitHub and CERN should both be online kind of always but... 🤷🏼
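
For reference, a generic retry-with-backoff helper looks roughly like the sketch below. It is not tied to pytest or to any plugin mentioned above, and `retry` is a hypothetical name. As noted in the comment above, short retries like this would not help with an outage that lasts an hour unless the delays were made correspondingly long.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so that tests can pass a stub and avoid real waiting.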

@zaneselvans requested a review from jdangerx on August 29, 2023 17:56
@zaneselvans merged commit 4a3c4ad into dev on Aug 29, 2023
4 checks passed
@zaneselvans deleted the goodbye-sandbox branch on August 29, 2023 18:44
Successfully merging this pull request may close these issues.

Allow Datastore to use both sandbox and production DOIs
3 participants