Refactor integration tests to remove random collection sampling #749
base: main
"page_num": 1, | ||
"page_size": 100, | ||
"sort_key[]": "-usage_score", | ||
}, |
I just copied this out of my browser dev tools' network tab after doing a similar query in earthdata search client. I'm sure we can run an equivalent query with earthaccess.
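For reference, the parameters in that snippet look like a CMR collection search sorted by descending usage score. Here is a minimal sketch of building such a request with the standard library only. It assumes the public CMR collections endpoint and a `provider` filter; `build_top_collections_url` is a hypothetical helper, not an earthaccess function, and no request is actually sent:

```python
from urllib.parse import urlencode

# Assumption: the public CMR collection search endpoint.
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_top_collections_url(provider: str, page_size: int = 100) -> str:
    """Build a CMR query URL for one provider's collections, most used first.

    Hypothetical helper for illustration; it only builds the URL.
    """
    params = {
        "provider": provider,  # assumed filter, e.g. "POCLOUD"
        "page_num": 1,
        "page_size": page_size,
        "sort_key[]": "-usage_score",  # leading "-" requests descending order
    }
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"


url = build_top_collections_url("POCLOUD")
```

An earthaccess-based equivalent would be preferable in the generator script if the library exposes the same sort key; that is not verified here.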
Worked on this with @itcarroll during hack day. Notes: #755
We considered the usefulness of random sampling tests. We don't think we should be doing this for integration tests, especially when they execute on every PR. We could, for example, run them on a cron job and create reports, but that seems like overkill when we have a community to help us identify datasets and connect with the right support channel if there's an issue with the provider. We may still consider a cron job for other tasks, for example recalculating the most popular datasets on a monthly basis.
We decided we can hardcode a small number and expand the list as we go. Other things, like random tests on a cron or updating the list of popular datasets on a cron, can be addressed separately.
d79f48f
to
194fd29
Compare
@betolink will take on work to update @mfisher87 will continue working on
We will update the .txt files to .csv files and add a boolean field for "does the collection have a EULA?" and then we'll use that field to mark those tests as
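A minimal sketch of what reading such a CSV could look like. The column names `concept_id` and `has_eula`, the placeholder IDs, and the helper `load_collections` are all assumptions for illustration, not the format this PR settled on:

```python
import csv
import io

# Hypothetical file contents; column names and IDs are placeholders.
SAMPLE_CSV = """concept_id,has_eula
C0000000001-EXAMPLE,False
C0000000002-EXAMPLE,True
C0000000003-EXAMPLE,False
"""


def load_collections(text: str) -> list[tuple[str, bool]]:
    """Parse rows into (concept_id, has_eula) pairs, in file order."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (row["concept_id"], row["has_eula"].strip().lower() == "true")
        for row in reader
    ]


collections = load_collections(SAMPLE_CSV)
# Split the list so tests for EULA-gated collections can be treated differently.
eula_ids = [cid for cid, has_eula in collections if has_eula]
open_ids = [cid for cid, has_eula in collections if not has_eula]
```

Keeping the flag in the data file means the test modules only need to branch on one field, rather than maintaining a second list of exceptions.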
Two major milestones:
Thanks to @DeanHenze and @Sherwin-14 for collaborating on this in today's hackathon!
@@ -244,6 +244,9 @@ def _repr_html_(self) -> str:
        granule_html_repr = _repr_granule_html(self)
        return granule_html_repr

    def __hash__(self) -> int:
        return hash(self["meta"]["concept-id"])
@betolink @chuckwondo This seems reasonable to me, but please validate me :)
Thinking about it for like 5 minutes, this is obviously a bad idea. This class is subclassing `dict`. We'd need to implement something like a `frozendict`.
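To illustrate the concern, here is a small sketch using a stand-in class (`HashableGranule` is hypothetical, not the real granule class, and the IDs are placeholders). Because the object stays mutable, its hash can change while it sits inside a set, and the set then fails to find it:

```python
class HashableGranule(dict):
    """Stand-in for a dict subclass that hashes on mutable content."""

    def __hash__(self) -> int:
        return hash(self["meta"]["concept-id"])


granule = HashableGranule({"meta": {"concept-id": "G0000000001-EXAMPLE"}})
seen = {granule}
found_before = granule in seen

# Mutating the dict changes its hash while it is stored in the set.
granule["meta"]["concept-id"] = "G0000000002-EXAMPLE"
found_after = granule in seen

# The object is still physically in the set; only hash-based lookup is broken.
still_stored = any(item is granule for item in seen)
```

An immutable mapping avoids this because its hash cannot drift after insertion.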
Also still TODO: Run generate.py in GHA on a monthly/quarterly cron and auto-open a PR with the changes to top collections?
If we want to determine whether a collection has a EULA, this example was provided:
The metadata |
Looks like part of this issue may be related to work on EULAs in this issue.
This provides a baseline proof of passing tests before a change
I don't know why this error is suddenly being reported in my PR. I don't have time now to figure it out. Could use help!
For some reason, running
is failing, when
is passing. |
Looks like this was a coincidence with intermittent uptime on an external dependency.
@@ -0,0 +1,100 @@
C2799438299-POCLOUD
C1996881146-POCLOUD
# C2204129664-POCLOUD
This collection isn't working so good 🤒
We get 0 granules:
> assert len(granules) > 0, msg
E AssertionError: AssertionError for C2204129664-POCLOUD
E assert 0 > 0
E + where 0 = len([])
But it's still 3rd most-popular? I'm confused :)
@@ -6,6 +6,7 @@
from fsspec.core import strip_protocol

logger = logging.getLogger(__name__)
pytestmark = pytest.mark.skip(reason="Tests are broken.")
These tests are failing on the release. Maybe xfail is a better mark. I preferred not to get into fixing this in this PR.
mkdocs.yml
Outdated
I felt this needed simplification as I added more.
- We have a few "guide" things, so I gave them a naming pattern so they can be mentally grouped
- Removed the word "Our" because it wasn't adding anything
- "Naming conventions" felt out of place, too specific. Like the new integration test doc. So I created a new "Topics" subsection (but not a subdirectory to keep the URL flatter). I don't like "Topics", but it's the best I have thought of so far.
Wow, this is a big one! I can start today but I'm not sure if I can finish today! Great work @mfisher87 !!
Thanks for taking a look, @betolink ! There are some opportunities for refactoring, but I really tried to keep the scope narrow in this PR to avoid growing even bigger :)
Resolves #215
Replaces random collection sampling with hardcoded lists of the 100 top collections per provider, in popularity order, with a script to regenerate the lists as needed. Instead of sampling `n` random collections, we select the `n` most popular.
There's still a clear need to refactor the 4 cloud/onprem download/open test modules. They share a lot of code that can be fixturized. I don't want this PR to grow larger than it already is, so IMO that should be a follow-up activity.
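As a rough sketch of the selection step, assuming the list files hold one concept ID per line in popularity order and that lines starting with `#` mark collections disabled for testing (`top_collections` is a hypothetical helper name, not necessarily what the test modules use):

```python
def top_collections(text: str, n: int) -> list[str]:
    """Return the first n active concept IDs from a popularity-ordered list."""
    ids = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and collections commented out as known-broken.
        if not line or line.startswith("#"):
            continue
        ids.append(line)
    return ids[:n]


# Lines taken from the POCLOUD list in this PR.
POCLOUD_LIST = """C2799438299-POCLOUD
C1996881146-POCLOUD
# C2204129664-POCLOUD
"""

selected = top_collections(POCLOUD_LIST, n=2)
```

Unlike random sampling, this selection is deterministic, so a failure on a given PR points at the same collections on every run.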